Saturday, July 31, 2010

Updating a Lucene Index – The “Green” Version

There are plenty of examples available on the internet that are good introductions to the basics of a Lucene.NET index. They explain how to create an index and how to search it.

At some point, though, you will need to update the index, and often you will want to update only certain documents.

One option is to throw away the entire index and recreate it from the sources. For some scenarios this might be the best choice. For example, you may have a lot of changes in your data, and a high latency for updating the index may be acceptable. In that case a full re-index each time might be the cheapest option. Where the trade-off tips depends on your data; for example, when less than 10% of the documents have changed, updating can be more time-efficient. You will probably want to experiment with this a little.

If you go for recreating the entire index, you probably want to build the new index first (in a different directory if it is file-based) and replace the index in use only once the new index is complete.
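The build-then-swap step could be sketched as follows. This is only a minimal sketch: it assumes a file-based index, that both directories are on the same volume, and that all readers and searchers on the live index have been closed before the swap. `IndexSwapper`, `Swap` and the path parameters are hypothetical names, not part of Lucene.NET.

```csharp
using System.IO;

public static class IndexSwapper
{
    // Hypothetical helper: replaces the live index directory with a freshly built one.
    // Assumes nothing holds files open in livePath (close searchers/readers first).
    public static void Swap(string livePath, string newPath)
    {
        string backupPath = livePath + ".old";

        // Clear out a backup left behind by an earlier, interrupted swap.
        if (Directory.Exists(backupPath))
            Directory.Delete(backupPath, true);

        // Move the live index aside, then move the new index into its place.
        if (Directory.Exists(livePath))
            Directory.Move(livePath, backupPath);
        Directory.Move(newPath, livePath);

        // The old index is no longer needed once the new one is in place.
        if (Directory.Exists(backupPath))
            Directory.Delete(backupPath, true);
    }
}
```

Keeping the old index around until the new one is in place means a failed move still leaves a usable copy under the `.old` name.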

Another option is to update only the documents that have changed (the “green” option, as we are re-using the index). This of course requires that you can identify the documents that need to be updated. Depending on your application and your design this might be relatively easy to achieve.
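One way to identify the changed documents, if your data source records a modification time, is to compare it with the time of the last index run. The following is just a sketch under that assumption; `PatientRecord`, its `LastModified` property and `ChangeDetector` are hypothetical and stand in for whatever your application provides.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical source record; the LastModified value is an assumption about your data.
public class PatientRecord
{
    public int Id { get; set; }
    public string Name { get; set; }
    public DateTime LastModified { get; set; }
}

public static class ChangeDetector
{
    // Returns the records that were modified after the last successful index run.
    public static List<PatientRecord> ChangedSince(
        IEnumerable<PatientRecord> records, DateTime lastIndexedUtc)
    {
        return records.Where(r => r.LastModified > lastIndexedUtc).ToList();
    }
}
```

Only the records returned by `ChangedSince` would then need to be converted to documents and written to the index.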

If you opt for updating just the documents that have changed, some blogs suggest removing the existing version of the document first and then adding the new version. For example, here is the removal step from the discussion of the question “How to Update a Lucene.NET Index” on Stack Overflow:

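int patientID = 12;
// Open read-write (readOnly: false) so that deletions are allowed
IndexReader indexReader = IndexReader.Open( indexDirectory, false );
// Term values are strings, so convert the numeric ID
indexReader.DeleteDocuments( new Term( "patient_id", patientID.ToString() ) );
indexReader.Close(); // Commits the deletions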

There is, however, another option. Lucene.NET (I’m using version 2.9.2) can update an existing document. Here is the code:

// Field of the containing class
readonly Lucene.Net.Util.Version LuceneVersion = Lucene.Net.Util.Version.LUCENE_29;

// Inside a method; 'document' is the new version of the Document to store
var IndexLocationPath = "..."; // Set to your location
var directoryInfo = new DirectoryInfo(IndexLocationPath);
var directory = FSDirectory.Open(directoryInfo);
var writer = new IndexWriter(directory, 
            new StandardAnalyzer(LuceneVersion),
            false, // Don't create index
            IndexWriter.MaxFieldLength.LIMITED);
// Deletes any document with the same patient_id, then adds the new one
writer.UpdateDocument(new Term("patient_id", document.Get("patient_id")), document);
writer.Optimize(); // Should be done with low load only ...
writer.Close();

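Be aware that the field you use for identifying the document needs to be unique, because UpdateDocument() deletes every document that contains the given term before it adds the new one. The field also has to be indexed without analysis so that the term matches the stored value exactly. Add it as follows: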

doc.Add(new Field("patient_id", id.ToString(), 
                  Field.Store.YES, 
                  Field.Index.NOT_ANALYZED));

The good thing about this option is that you don’t have to find or remove the old version. IndexWriter.UpdateDocument() takes care of that.
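When several documents have changed, they can all be written in one writer session. The following is a minimal sketch, assuming Lucene.NET 2.9.2, an existing index, and documents that each carry a unique, non-analyzed "patient_id" field; `IndexUpdater` and `UpdateAll` are hypothetical names.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class IndexUpdater
{
    // Hypothetical helper: replaces each changed document in a single writer session.
    public static void UpdateAll(Directory directory, IEnumerable<Document> changedDocuments)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
        var writer = new IndexWriter(directory, analyzer,
            false, // Don't create index
            IndexWriter.MaxFieldLength.LIMITED);
        try
        {
            foreach (var document in changedDocuments)
            {
                // Delete by the unique key, then add the new version.
                var key = new Term("patient_id", document.Get("patient_id"));
                writer.UpdateDocument(key, document);
            }
        }
        finally
        {
            writer.Close(); // Commits all updates and releases the write lock
        }
    }
}
```

The sketch leaves out Optimize() on purpose, since that is better scheduled separately for times of low load.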

Happy coding!

2 comments:

michelle said...

How about if you want to check first whether the doc is already in the index? In my case, I just don't want to have to index it again in the first place. This way, it will only add what is new.

Manfred said...

@michelle: In that case you could search the full-text index first, e.g. using a unique document identifier, to determine whether you have the document in the index already. Or you can employ some other algorithm for determining whether you want to add a document to the index in the first place.
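A lookup of that kind might look roughly like the following. It is only a sketch, assuming Lucene.NET 2.9.2 and a unique, non-analyzed "patient_id" field, and it uses a direct term lookup through the IndexReader rather than a full search; `IndexLookup` and `Contains` are hypothetical names.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class IndexLookup
{
    // Hypothetical helper: true if a live (non-deleted) document carries the given patient_id.
    public static bool Contains(Directory directory, string patientId)
    {
        IndexReader reader = IndexReader.Open(directory, true); // read-only
        try
        {
            TermDocs termDocs = reader.TermDocs(new Term("patient_id", patientId));
            try
            {
                return termDocs.Next(); // Moves to the first match, if there is one
            }
            finally
            {
                termDocs.Close();
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

If `Contains` returns false, the document can be added with AddDocument(); otherwise it can be skipped.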
