Couchbase Search indexes, often called FTS (Full-Text Search) indexes, are a powerful but often misunderstood feature. The most surprising thing about them is that they don’t just store text; they analyze it, breaking it down into meaningful tokens based on language and specific rules, which is crucial for effective searching.
Let’s see this in action. Imagine you have a collection of documents about books, and you want to search for books containing the word "adventure" in their descriptions.
[
{
"title": "The Hobbit",
"author": "J.R.R. Tolkien",
"description": "Bilbo Baggins, a hobbit, is persuaded by Gandalf the wizard and a band of dwarves to steal treasure from Smaug the dragon. A grand adventure ensues."
},
{
"title": "Pride and Prejudice",
"author": "Jane Austen",
"description": "A classic novel of manners, love, and social standing in Georgian England."
},
{
"title": "Treasure Island",
"author": "Robert Louis Stevenson",
"description": "Young Jim Hawkins embarks on a perilous journey to find buried treasure, facing pirates and mutiny on the high seas. A true adventure story."
}
]
To search this effectively, we need an FTS index. Here’s how you’d create one using the Search service REST API, which listens on port 8094 by default:
curl -X PUT -u Administrator:password \
http://localhost:8094/api/index/books_fts_index \
-H 'Content-Type: application/json' \
-d '{"type": "fulltext-index", "name": "books_fts_index", "sourceType": "couchbase", "sourceName": "travel-sample"}'
This request tells Couchbase to create an FTS index named books_fts_index on the travel-sample bucket. The type value fulltext-index marks it as a Search index, while sourceType and sourceName identify the bucket that feeds it. Because no mapping is supplied, the default dynamic mapping indexes every field of every document; to restrict the index to books_collection, you would add a type mapping for that collection in the params.mapping section of the definition.
Once created, Couchbase analyzes the description field. It doesn’t just store "adventure" as a string; it breaks it down into tokens. For English, this typically involves:
- Lowercasing: "Adventure" becomes "adventure".
- Punctuation removal: "adventure." becomes "adventure".
- Stop word removal: Common words like "a", "the", "is" might be removed.
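- Stemming: Words with similar roots might be reduced to a common stem (e.g., "adventures" might become "adventur"). This step belongs to the en analyzer; the default standard analyzer does not stem.
The same analysis is applied to the indexed text and to the query text, which is what allows you to search for "adventure" and find documents containing "adventures" or "adventure".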
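The steps listed above can be sketched in a few lines of Python. This is a toy model, not Couchbase’s implementation: the stop-word list is deliberately tiny, and toy_stem is a hypothetical suffix stripper standing in for a real stemmer such as Porter.

```python
import re

# Tiny illustrative stop-word list; real analyzers ship a much longer one.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def toy_stem(token):
    """Crude suffix stripping that stands in for a real stemmer such as Porter."""
    for suffix in ("es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]


def matches(query, document_text):
    """A document matches when it shares an analyzed token with the query."""
    return bool(set(analyze(query)) & set(analyze(document_text)))


print(analyze("A grand adventure ensues."))
print(matches("adventure", "A true adventure story."))
```

In this sketch, analyze("adventures") and analyze("Adventure.") both produce ["adventur"], which is why the two forms match each other.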
Now, let’s perform a query. You can do this via the Couchbase UI, the SDK, or the REST API. Using the REST API for demonstration:
curl -X POST -u Administrator:password \
http://localhost:8094/api/index/books_fts_index/query \
-H 'Content-Type: application/json' \
-d '{
"query": {
"match": "adventure",
"field": "description"
}
}'
The result would look something like this, showing the "The Hobbit" and "Treasure Island" documents because their descriptions contain the analyzed token "adventure":
{
"status": { ... },
"hits": [
{
"index": "...",
"id": "doc_id_for_the_hobbit",
"score": 0.5
},
{
"index": "...",
"id": "doc_id_for_treasure_island",
"score": 0.4
}
],
"total_hits": 2,
"max_score": 0.5,
"took": 10
}
The score indicates relevance. Higher scores mean a better match.
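As a small illustration of working with scores, the sketch below orders hits from best to worst and returns their document IDs. It relies only on the hits, id, and score keys; the sample dictionary reuses the placeholder IDs and scores from the example response and is not real output.

```python
def ranked_ids(response):
    """Return document IDs from a search response, best match first."""
    hits = response.get("hits") or []
    ordered = sorted(hits, key=lambda hit: hit["score"], reverse=True)
    return [hit["id"] for hit in ordered]


# Trimmed-down response with the placeholder IDs and scores used above.
sample = {
    "hits": [
        {"id": "doc_id_for_treasure_island", "score": 0.4},
        {"id": "doc_id_for_the_hobbit", "score": 0.5},
    ]
}

print(ranked_ids(sample))
```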
Understanding the mapping and analysis is key to tuning your FTS indexes. The mapping defines which fields to index and how. The analysis defines how text is tokenized. By default, Couchbase uses the standard analyzer, which lowercases text and removes English stop words but does not stem; language-specific analyzers (like en for English) are available when you want stemming. You can customize these per field. For example, you might want to index a tags field differently than a description field, perhaps using the keyword analyzer for exact matches on tags or a custom analyzer for specific domain jargon.
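A mapping along those lines might look like the sketch below, which builds an index definition as a Python dictionary and prints it as JSON. The tags field is an assumption (the sample documents above do not have one), and text_field is a hypothetical helper; the layout follows the params.mapping structure of a Search index definition as I understand it, so check it against the definition the Couchbase UI generates for your version.

```python
import json


def text_field(name, analyzer):
    """Child mapping for one indexed text field that uses the given analyzer."""
    return {
        "enabled": True,
        "dynamic": False,
        "fields": [
            {"name": name, "type": "text", "analyzer": analyzer, "index": True}
        ],
    }


index_definition = {
    "type": "fulltext-index",
    "name": "books_fts_index",
    "sourceType": "couchbase",
    "sourceName": "travel-sample",
    "params": {
        "mapping": {
            "default_analyzer": "standard",
            "default_mapping": {
                "enabled": True,
                "dynamic": False,  # index only the fields listed below
                "properties": {
                    # Stemmed English analysis for free text.
                    "description": text_field("description", "en"),
                    # Whole-value, exact matching for tags (assumed field).
                    "tags": text_field("tags", "keyword"),
                },
            },
        }
    },
}

print(json.dumps(index_definition, indent=2))
```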
Consider a scenario where you’re indexing product descriptions. If you want to ensure that searches for "USB-C" match documents containing "USB-C cable" but not documents simply mentioning "USB" and "C" separately, you’d need to adjust the analyzer. A common mistake is assuming the default analyzer will perfectly handle all specific naming conventions or technical terms. You might need to define a custom analyzer that treats "USB-C" as a single token, perhaps by using a different tokenizer or by carefully configuring the character filters and token filters. This is done in the analysis section inside the mapping of the index definition, where you declare custom analyzers and control their tokenizers, token_filters, and char_filters.
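To make the "USB-C" problem concrete, the sketch below compares two simplified tokenization strategies in plain Python and then shows what a custom analyzer entry could look like. The two split functions only approximate real tokenizers, and the analyzer name tech_terms is hypothetical; whitespace and to_lower are, to my knowledge, built-in tokenizer and token filter names.

```python
import re


def split_on_punctuation(text):
    """Rough stand-in for a word-boundary tokenizer plus lowercasing."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_on_whitespace(text):
    """Rough stand-in for a whitespace tokenizer plus lowercasing."""
    return text.lower().split()


# Hypothetical custom analyzer for the analysis section of the mapping.
custom_analysis = {
    "analyzers": {
        "tech_terms": {  # hypothetical analyzer name
            "type": "custom",
            "tokenizer": "whitespace",
            "token_filters": ["to_lower"],
            "char_filters": [],
        }
    }
}

print(split_on_punctuation("USB-C cable"))  # hyphen splits the term in two
print(split_on_whitespace("USB-C cable"))   # "usb-c" survives as one token
```

One trade-off of splitting on whitespace alone is that trailing punctuation stays attached ("cable." remains "cable."), so a character filter or an extra token filter may still be needed.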
Index management is crucial for working with FTS indexes: you can create, update, and delete them through the Couchbase UI or the Search REST API. When updating an index, Couchbase typically needs to rebuild it, which can take time and resources depending on the index size. You can monitor the build progress via the Couchbase UI or by polling the index’s document count through the REST API.
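A minimal sketch of these management operations over the Search REST API follows. It only builds the requests; pass one to urllib.request.urlopen to send it. The address, credentials, and helper names are assumptions. When updating an existing index with PUT, my understanding is that the body also needs the index’s current uuid, which is worth confirming in the REST documentation.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:8094"  # assumed Search service address


def search_request(method, path, username, password, body=None):
    """Build (but do not send) an authenticated Search REST API request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(BASE_URL + path, data=data, method=method)
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def put_index(name, definition, username, password):
    """Create the index (or update it) from a definition dictionary."""
    return search_request("PUT", f"/api/index/{name}", username, password, definition)


def get_index(name, username, password):
    """Fetch the stored index definition."""
    return search_request("GET", f"/api/index/{name}", username, password)


def indexed_doc_count(name, username, password):
    """Ask how many documents are indexed so far; poll to follow a build."""
    return search_request("GET", f"/api/index/{name}/count", username, password)


def delete_index(name, username, password):
    """Delete the index."""
    return search_request("DELETE", f"/api/index/{name}", username, password)


request = indexed_doc_count("books_fts_index", "Administrator", "password")
print(request.get_method(), request.full_url)
```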
When you’re dealing with multiple languages in your documents, you’ll want to specify the correct language for each field being indexed. If you have a product_name field that could be in English, Spanish, or French, you can configure your FTS index mapping to use different analyzers for each language, or even to attempt language detection. This is done within the mapping section of your index definition, specifying default_analyzer or per-field analyzers.
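One way to express per-language analysis is to keep each language’s text in its own field and give every field its own analyzer. The sketch below assumes hypothetical fields product_name_en, product_name_es, and product_name_fr, and assumes the en, es, and fr analyzers are available on your cluster; it produces only the mapping portion of an index definition.

```python
# Language suffix -> analyzer name (assumed to be available on the cluster).
LANGUAGE_ANALYZERS = {"en": "en", "es": "es", "fr": "fr"}


def localized_properties(base_name, analyzers):
    """Map each hypothetical <base_name>_<lang> field to its language analyzer."""
    properties = {}
    for lang, analyzer in analyzers.items():
        field_name = f"{base_name}_{lang}"
        properties[field_name] = {
            "enabled": True,
            "dynamic": False,
            "fields": [
                {
                    "name": field_name,
                    "type": "text",
                    "analyzer": analyzer,  # per-field analyzer
                    "index": True,
                }
            ],
        }
    return properties


mapping = {
    "default_analyzer": "standard",  # fallback for fields without an analyzer
    "default_mapping": {
        "enabled": True,
        "dynamic": False,
        "properties": localized_properties("product_name", LANGUAGE_ANALYZERS),
    },
}

print(sorted(mapping["default_mapping"]["properties"]))
```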
The next hurdle you’ll likely encounter is optimizing query performance, especially as your data and index grow. This often involves understanding how to use different query types like match, match_phrase, and term, along with the fuzziness parameter for fuzzy matching, and how to tune index settings such as the number of index partitions and replicas.
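The query types mentioned here differ mainly in how the search text is treated. The sketch below builds request bodies for each one as plain dictionaries; the helper names are hypothetical, and the bodies follow the Search query JSON format as I understand it, so verify them against your server version before relying on them.

```python
import json


def match_query(field, text, fuzziness=0):
    """Analyzed match; a non-zero fuzziness tolerates that many edits."""
    query = {"match": text, "field": field}
    if fuzziness:
        query["fuzziness"] = fuzziness
    return query


def match_phrase_query(field, phrase):
    """Analyzed match that also requires the terms to appear in order."""
    return {"match_phrase": phrase, "field": field}


def term_query(field, term):
    """Exact, non-analyzed lookup; the term must equal an indexed token."""
    return {"term": term, "field": field}


def search_body(query, size=10):
    """Wrap a query in a request body that returns at most `size` hits."""
    return {"query": query, "size": size}


examples = [
    match_query("description", "adventure"),
    match_query("description", "adventre", fuzziness=1),  # tolerates the typo
    match_phrase_query("description", "grand adventure"),
    term_query("description", "adventur"),  # stemmed token under the en analyzer
]

for query in examples:
    print(json.dumps(search_body(query)))
```

Because a term query skips analysis, it is usually the cheapest option but also the least forgiving: searching for "adventure" with a term query finds nothing if the index holds the stemmed token "adventur".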