CouchDB’s view sort order isn’t just about ascending or descending; it’s a sophisticated dance driven by Unicode collation rules, and understanding them unlocks precise data retrieval.
Let’s watch a view in action. Imagine you have a database of books, and you want to sort them not just by title, but by title and author, with a specific handling of case and accents.
Here’s our sample data in a books database:
[
{"_id": "1", "title": "The Great Gatsby", "author": "F. Scott Fitzgerald"},
{"_id": "2", "title": "the great gatsby", "author": "f. scott fitzgerald"},
{"_id": "3", "title": "Gatsby, The Great", "author": "Fitzgerald, F. Scott"},
{"_id": "4", "title": "Moby Dick", "author": "Herman Melville"},
{"_id": "5", "title": "moby dick", "author": "herman melville"},
{"_id": "6", "title": "War and Peace", "author": "Leo Tolstoy"},
{"_id": "7", "title": "La Peste", "author": "Albert Camus"},
{"_id": "8", "title": "Le Rouge et le Noir", "author": "Stendhal"}
]
And here’s a simple map function:
function (doc) {
  emit([doc.title, doc.author], null);
}
CouchDB can compare keys in two ways. If the view is defined with raw collation ("options": {"collation": "raw"} in the design document), keys are compared byte by byte, and they come back in this order:
[
["Gatsby, The Great", "Fitzgerald, F. Scott"],
["La Peste", "Albert Camus"],
["Le Rouge et le Noir", "Stendhal"],
["Moby Dick", "Herman Melville"],
["The Great Gatsby", "F. Scott Fitzgerald"],
["War and Peace", "Leo Tolstoy"],
["moby dick", "herman melville"],
["the great gatsby", "f. scott fitzgerald"]
]
Notice how every title that starts with an uppercase letter comes before every lowercase one: "War and Peace" precedes "moby dick", and "the great gatsby" ends up far from "The Great Gatsby". This is plain code point ordering, where "Z" (U+005A) sorts before "a" (U+0061).
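Code point ordering is easy to reproduce outside CouchDB. The following is a minimal sketch, not CouchDB's own comparator: it relies on JavaScript's built-in string comparison, which orders by UTF-16 code unit and matches code point order for these ASCII-only strings. The helper name compareRaw is an assumption made for illustration.

```javascript
// Keys as the map function would emit them: [title, author].
const keys = [
  ["The Great Gatsby", "F. Scott Fitzgerald"],
  ["the great gatsby", "f. scott fitzgerald"],
  ["Gatsby, The Great", "Fitzgerald, F. Scott"],
  ["Moby Dick", "Herman Melville"],
  ["moby dick", "herman melville"],
  ["War and Peace", "Leo Tolstoy"],
  ["La Peste", "Albert Camus"],
  ["Le Rouge et le Noir", "Stendhal"]
];

// Hypothetical helper: compare two array keys element by element.
// The < and > operators on strings compare code units, with no
// knowledge of case or accents.
function compareRaw(a, b) {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  // If one key is a prefix of the other, the shorter key sorts first.
  return a.length - b.length;
}

const rawOrder = [...keys].sort(compareRaw);
console.log(rawOrder.map(([title]) => title));
// Uppercase-initial titles first, "moby dick" and "the great gatsby" last.
```

All six titles with a capital first letter sort ahead of the two lowercase ones, because every uppercase ASCII letter has a smaller code point than every lowercase one.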
CouchDB’s view engine uses ICU (International Components for Unicode) for collation, and it does so by default: unless a view opts into raw collation, its keys are ordered by ICU’s implementation of the Unicode Collation Algorithm. Collation is a property of the view, set in the design document. There is no collation query parameter, because the index is stored already sorted.
ICU ordering compares strings in levels: base letters first, then accents, then case. Letters that differ only in accent or case therefore sort next to each other, with the accent or case difference acting as a tie-breaker.
Our map function needs no changes to get this behavior:
function (doc) {
  emit([doc.title, doc.author], null);
}
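Collation is chosen in the design document, not in the map function or the query. Here is a sketch of what such a design document might look like, built as a JavaScript object. The view name by_title_author_raw is hypothetical, added only to show the raw option next to the default; _design/books and by_title_author are taken from the query URL used in this article.

```javascript
// The map function, stored as source text the way a design document expects.
const mapSource = `function (doc) {
  emit([doc.title, doc.author], null);
}`;

const designDoc = {
  _id: "_design/books",
  views: {
    // No options: the view uses the default ICU collation.
    by_title_author: { map: mapSource },
    // Hypothetical second view: same keys, compared byte by byte.
    by_title_author_raw: {
      map: mapSource,
      options: { collation: "raw" }
    }
  }
};

// The JSON body you would PUT to /books/_design/books.
console.log(JSON.stringify(designDoc, null, 2));
```

Because the two views order their indexes differently, each one is built and stored separately even though the map source is identical.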
Now, let’s query the view with its default collation:
GET /books/_design/books/_view/by_title_author
Under ICU collation, 'A' and 'a', or 'é' and 'e', are adjacent rather than dozens of code points apart. They are not treated as equal: lowercase sorts just before uppercase when nothing else distinguishes two strings. The output becomes much more predictable and user-friendly:
[
["Gatsby, The Great", "Fitzgerald, F. Scott"],
["La Peste", "Albert Camus"],
["Le Rouge et le Noir", "Stendhal"],
["moby dick", "herman melville"],
["Moby Dick", "Herman Melville"],
["the great gatsby", "f. scott fitzgerald"],
["The Great Gatsby", "F. Scott Fitzgerald"],
["War and Peace", "Leo Tolstoy"]
]
Notice how "the great gatsby" and "The Great Gatsby" are now adjacent, as are "moby dick" and "Moby Dick", with the lowercase variant first in each pair. The sample titles contain no accented characters, but an accented title such as "Émile" would sort among the other titles starting with "E" rather than after "z".
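The ICU-style ordering can be approximated outside CouchDB with JavaScript's Intl.Collator, which is also backed by ICU in Node.js and most browsers. This is a sketch under that assumption, not CouchDB's actual comparator: the "en" locale and the helper name compareCollated are choices made for illustration.

```javascript
const keys = [
  ["The Great Gatsby", "F. Scott Fitzgerald"],
  ["the great gatsby", "f. scott fitzgerald"],
  ["Gatsby, The Great", "Fitzgerald, F. Scott"],
  ["Moby Dick", "Herman Melville"],
  ["moby dick", "herman melville"],
  ["War and Peace", "Leo Tolstoy"],
  ["La Peste", "Albert Camus"],
  ["Le Rouge et le Noir", "Stendhal"]
];

// Default options: base letters are compared first, then accents, then case.
const collator = new Intl.Collator("en");

// Hypothetical helper: compare array keys element by element,
// delegating each string comparison to the collator.
function compareCollated(a, b) {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    const result = collator.compare(a[i], b[i]);
    if (result !== 0) return result;
  }
  return a.length - b.length;
}

const collatedOrder = [...keys].sort(compareCollated);
console.log(collatedOrder.map(([title]) => title));
// "moby dick" sits directly before "Moby Dick",
// "the great gatsby" directly before "The Great Gatsby".
```

Note that the title comparison alone decides each case pair here: the two titles differ in case, so the comparison never reaches the author element.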
The power of ICU collation lies in its tailorable comparison rules. ICU itself can ignore diacritics, case, or punctuation, follow language-specific alphabets, and sort digit runs numerically so that "file10.txt" comes after "file2.txt". CouchDB, however, does not document a way to pass such settings to a view: the documented choices are the default ICU collation and raw. When you need different behavior, such as case-insensitive grouping or numeric ordering, build it into the emitted key, for example by emitting a lowercased title or zero-padded numbers.
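One way to get case-insensitive or numeric ordering is to normalize the key inside the map function. The sketch below stubs out emit so it runs outside CouchDB. The helper padNumbers and the padding width are assumptions, and you should check that the JavaScript engine of your CouchDB version supports the string methods used (padStart in particular is a newer addition to the language).

```javascript
// Stand-in for CouchDB's emit(), collecting rows so the sketch runs anywhere.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Hypothetical helper: left-pad every run of digits with zeros so that
// string comparison agrees with numeric order ("file2" before "file10").
function padNumbers(text, width = 6) {
  return text.replace(/\d+/g, (digits) => digits.padStart(width, "0"));
}

// Map function emitting a normalized sort key: lowercased, digits padded.
function map(doc) {
  emit([padNumbers(doc.title.toLowerCase()), doc.author.toLowerCase()], null);
}

map({ _id: "a", title: "File10.txt", author: "Someone" });
map({ _id: "b", title: "file2.txt", author: "someone" });
console.log(rows.map((row) => row.key[0]));
```

The original title and author are still available through include_docs or by emitting them as the value, so normalizing the key does not lose information.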
The most surprising thing about CouchDB view sorting is that the default is not a code point sort. Sorting by the raw numerical value of each character is what raw collation does. It is fast but rarely matches human expectations for alphabetical order, especially across languages or with case and accent variations. The default ICU collation bridges that gap by interpreting those code points in a linguistically meaningful way, and it also fixes the order of different JSON types: null, false, true, numbers, strings, arrays, then objects.
The mental model for CouchDB views and sorting is this:
- Emit: Your map function defines what data goes into the index. The order of elements in the emitted array ([doc.title, doc.author]) dictates the primary and secondary sorting keys.
- Index: CouchDB builds an index based on these emitted keys and stores it sorted.
- Query: When you query the view, you specify parameters like descending, limit, skip, startkey, and endkey to choose which slice of the sorted index to read, and in which direction.
- Collation: The view's collation setting tells CouchDB how to compare the emitted keys. It applies to every comparison made while building and reading the index. The default is ICU, and raw switches to byte-wise comparison.
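The query step can be sketched as a small URL builder. The parameters startkey, endkey, limit, and descending are standard view query parameters whose values must be JSON. The helper name viewUrl and the host http://localhost:5984 are assumptions for illustration.

```javascript
// Hypothetical helper: build a view query URL, JSON-encoding every
// parameter value and then percent-encoding it for the query string.
function viewUrl(base, db, ddoc, view, params = {}) {
  const query = Object.entries(params)
    .map(([name, value]) =>
      `${encodeURIComponent(name)}=${encodeURIComponent(JSON.stringify(value))}`)
    .join("&");
  const path = `${base}/${db}/_design/${ddoc}/_view/${view}`;
  return query ? `${path}?${query}` : path;
}

// All rows for "Moby Dick" in either case, under the default ICU collation:
// "moby dick" sorts before "Moby Dick", and {} sorts after any string,
// so ["Moby Dick", {}] closes the range after every author.
const url = viewUrl("http://localhost:5984", "books", "books", "by_title_author", {
  startkey: ["moby dick"],
  endkey: ["Moby Dick", {}],
  limit: 10
});
console.log(url);
```

The range only works because the index is ordered by the view's collation: under raw collation the two "Moby Dick" variants are not adjacent, and the same startkey and endkey would describe a different slice.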
The exact levers you control are the view's collation option (the default ICU or raw) and the structure of your emitted key. Shaping the key is the more powerful of the two for fine-tuning sort order. The ICU documentation describes the default comparison rules in detail.
A common pitfall is assuming the default sort order will be "correct" for all languages or use cases. It’s not. For instance, the default ordering places German umlauts (ä, ö, ü) alongside their base letters, so 'ä' sorts with 'a'. That suits German dictionaries, but Swedish treats 'ä' as a separate letter that comes after 'z'. CouchDB's documentation describes no per-language setting for views, so that linguistic intelligence has to go into the emitted key, for example as a precomputed sort key.
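How much the "right" order depends on language can be seen with two ICU-backed collators in JavaScript. This sketch runs outside CouchDB and assumes a runtime with German ("de") and Swedish ("sv") locale data, which standard Node.js builds include. The word list is made up for illustration.

```javascript
const words = ["zebra", "ärger", "apfel"];

// German dictionary order: 'ä' is treated as a variant of 'a'.
const german = [...words].sort(new Intl.Collator("de").compare);

// Swedish order: 'ä' is its own letter, placed after 'z'.
const swedish = [...words].sort(new Intl.Collator("sv").compare);

console.log(german);  // 'ärger' lands between 'apfel' and 'zebra'
console.log(swedish); // 'ärger' lands after 'zebra'
```

Neither result is wrong. They answer different questions, which is why one fixed ordering cannot serve every audience.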
Once you’ve mastered collation rules for basic sorting, the next step is to explore how to combine them with other view features like group_level for aggregated results that are also sorted intelligently.