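CouchDB’s built-in _all_docs and map/reduce views look documents up by key, not by the words inside their fields, so keyword search means scanning documents yourself, which gets slow beyond a few thousand documents. Adding full-text search to CouchDB doesn’t require a complete re-architecture, though.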
Let’s see couchdb-lucene in action. Imagine we have a CouchDB database named books with documents like this:
{
  "_id": "book1",
  "title": "The Hitchhiker's Guide to the Galaxy",
  "author": "Douglas Adams",
  "genre": "Science Fiction",
  "description": "A comedy science fiction series created by Douglas Adams."
}
{
  "_id": "book2",
  "title": "Pride and Prejudice",
  "author": "Jane Austen",
  "genre": "Romance",
  "description": "A classic novel of manners by Jane Austen."
}
{
  "_id": "book3",
  "title": "The Restaurant at the End of the Universe",
  "author": "Douglas Adams",
  "genre": "Science Fiction",
  "description": "The second book in the Hitchhiker's Guide to the Galaxy series."
}
We want to be able to search these documents by keywords within the title, author, genre, and description fields. couchdb-lucene indexes these fields and allows us to query them efficiently.
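To set this up, we’ll run couchdb-lucene as a separate Java process alongside CouchDB. On a Linux system, this typically involves downloading the .jar file and running it. The critical part is telling couchdb-lucene how to reach CouchDB. The invocation below passes those settings as command-line arguments; treat the flag names as illustrative and check the README for your release, which may read the same settings from a configuration file instead: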
java -jar couchdb-lucene-x.y.z.jar \
  --couchdb-url http://admin:password@localhost:5984 \
  --lucene-index-dir /opt/couchdb/var/lib/couchdb/lucene_indexes \
  --couchdb-database books
Here, --couchdb-url points to your CouchDB instance, --lucene-index-dir is where couchdb-lucene will store its index files, and --couchdb-database books tells it which database to index. couchdb-lucene then acts as a CouchDB "external index," meaning CouchDB itself will proxy search requests to it.
Once running, couchdb-lucene exposes a new search endpoint for each CouchDB database it’s configured to index. For our books database, this would be /_search/books. A typical search query looks like this:
curl -G "http://localhost:5984/_search/books" \
  --data-urlencode 'q=author:"Douglas Adams" AND genre:"Science Fiction"'
This query would return book1 and book3. The q parameter accepts Lucene query syntax, allowing for boolean operators (AND, OR, NOT), phrase matching ("..."), wildcards (*, ?), and fuzzy matching (~).
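The quoting and URL-encoding such a query needs can also be handled in code. The following is a minimal Python sketch, not part of couchdb-lucene: the helper names phrase, all_of, and search_url are hypothetical, and the /_search/books path is simply the endpoint described above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def phrase(field, value):
    """Return a field-scoped phrase clause, e.g. author:"Douglas Adams"."""
    # Inside a quoted Lucene phrase only backslash and double quote need escaping.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def all_of(*clauses):
    """Join clauses with the boolean AND operator."""
    return " AND ".join(clauses)


def search_url(base_url, path, query):
    """Attach the query as a percent-encoded q parameter."""
    return f"{base_url.rstrip('/')}{path}?{urlencode({'q': query})}"


query = all_of(phrase("author", "Douglas Adams"), phrase("genre", "Science Fiction"))
url = search_url("http://localhost:5984", "/_search/books", query)

# Decoding the URL again recovers the original query string unchanged.
round_trip = parse_qs(urlsplit(url).query)["q"][0]
print(url)
print(round_trip == query)
```

Building the query from small clause helpers keeps user-supplied values from breaking out of the quoted phrase, and leaving the encoding to urlencode avoids the shell-quoting problems that raw URLs with spaces and quotes tend to cause.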
The mental model for couchdb-lucene is that it’s a separate process that mirrors the data from a specific CouchDB database into a Lucene index. When you query /_search/books, CouchDB doesn’t do the heavy lifting; it forwards the query to the running couchdb-lucene instance, which performs the search on its optimized index and returns the results. couchdb-lucene listens for changes in CouchDB (via its _changes feed) and updates its index incrementally, so you don’t need to re-index everything manually.
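To make the mirroring idea concrete, here is a toy Python sketch of an index that is kept in sync by applying change rows one at a time. It is an illustration of the pattern only, not couchdb-lucene’s implementation: MirrorIndex and its methods are hypothetical names, the rows imitate the shape of a _changes response requested with include_docs=true, and a simple in-memory word index stands in for Lucene.

```python
import re
from collections import defaultdict


class MirrorIndex:
    """Toy inverted index kept in sync by applying change rows in order."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of documents containing it
        self.doc_tokens = {}              # doc id -> tokens currently indexed for it
        self.last_seq = None              # checkpoint: last change that was applied

    @staticmethod
    def tokenize(doc):
        # Index every non-underscore field; lowercase and split on non-alphanumerics.
        text = " ".join(str(v) for k, v in doc.items() if not k.startswith("_"))
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def _remove(self, doc_id):
        for token in self.doc_tokens.pop(doc_id, set()):
            self.postings[token].discard(doc_id)

    def apply(self, row):
        """Apply one change row: re-index an updated doc or drop a deleted one."""
        doc_id = row["id"]
        self._remove(doc_id)  # clear stale postings before re-indexing
        if not row.get("deleted"):
            tokens = self.tokenize(row["doc"])
            self.doc_tokens[doc_id] = tokens
            for token in tokens:
                self.postings[token].add(doc_id)
        self.last_seq = row["seq"]

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))


# Simulated change rows (real CouchDB sequence values are opaque strings;
# integers are used here only for readability).
changes = [
    {"seq": 1, "id": "book1", "doc": {"_id": "book1", "title": "The Hitchhiker's Guide to the Galaxy", "author": "Douglas Adams"}},
    {"seq": 2, "id": "book2", "doc": {"_id": "book2", "title": "Pride and Prejudice", "author": "Jane Austen"}},
    {"seq": 3, "id": "book2", "deleted": True},
]

index = MirrorIndex()
for row in changes:
    index.apply(row)

print(index.search("adams"), index.search("austen"), index.last_seq)
```

Because the index records the last sequence it applied, a restarted process could ask the feed for changes since that checkpoint instead of rebuilding everything, which is what makes incremental updates cheap.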
The key levers you control are the Lucene query syntax itself and the configuration of the couchdb-lucene process: how often it polls CouchDB for changes (the defaults are usually fine), the directory for its index files, and, crucially, which fields are indexed. By default, couchdb-lucene indexes all fields. To index only specific fields, either set this in couchdb-lucene’s own configuration file (often couchdb-lucene.properties or similar, depending on the version) or store a design document with an indexes section in the database being indexed. For example, to index only title and author:
// PUT to /books/_design/search in your CouchDB
{
  "indexes": {
    "search": {
      "analyzer": "standard",
      "index": "function(doc) { index('default', doc.title); index('default', doc.author); }"
    }
  }
}
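Then, your queries would be directed to /_search/books/search. Because this index function writes both title and author into the single field named default, queries against this index match on bare terms (for example q=adams) rather than on field prefixes such as author:.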
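Storing the design document can be scripted as well. The sketch below builds, but does not send, the HTTP request in Python. PUT to /{db}/_design/{name} is the standard CouchDB way to create a design document; the body layout (indexes, analyzer, index) is copied from the example above, and design_doc_request is a hypothetical helper name.

```python
import json
from urllib.request import Request

# Index function taken from the example above.
INDEX_FN = "function(doc) { index('default', doc.title); index('default', doc.author); }"


def design_doc_request(base_url, database, ddoc_name, index_name, index_fn, analyzer="standard"):
    """Build a PUT request that stores a design document holding one search index."""
    body = {"indexes": {index_name: {"analyzer": analyzer, "index": index_fn}}}
    url = f"{base_url.rstrip('/')}/{database}/_design/{ddoc_name}"
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = design_doc_request("http://localhost:5984", "books", "search", "search", INDEX_FN)
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would send it; a real deployment would also need admin credentials, and updating an existing design document requires including its current _rev in the body.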
The most surprising thing about couchdb-lucene is how seamlessly it integrates without requiring you to manage complex replication or data synchronization. CouchDB never has to push anything: couchdb-lucene consumes the built-in _changes feed and applies each change to its Lucene index as it arrives. Once configured, the search index trails your CouchDB data by only a short delay, and you can query it as if it were part of CouchDB.
The next problem you’ll run into is managing multiple search indexes for different databases or different indexing strategies within the same database.