I was inspired by the InfoQ article Implementing Google’s “Did you mean” Feature In Java, which shows how to create a “Suggester” service based on Lucene. Here are two more links you should read if you are interested in the topic:
- http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
- http://sujitpal.blogspot.com/2007/12/spelling-checker-with-lucene.html
The articles are related to something I’ve done before: a spellchecker. Yes, “spellchecker” and “suggester” have a lot in common, but they have different areas of usage:
- “Spellchecker” is used when you type or edit some text and want to check that it is spelled correctly. To implement it you need a good, large dictionary as well as a good spellchecking algorithm.
- “Suggester” is a much more commercial tool. It can certainly be used for non-commerce searches too, but its commercial usage is really obvious, which makes it a “good-to-have”.
Just imagine the situation when you are looking for something in an e-commerce shop and mistype one or more letters (I swear, I do it quite often!). As a rule, I will not see any results in that case and will just give up on finding the product on that site. I actually think it would be cool to have suggestions for every text-based search.
Hey, but Google will suggest something even for mistyped searches. Yes, that’s true, Google does that, but don’t you think it would be good to have this on sites other than Google?
In the InfoQ article, PlainTextDictionary is used as the source dictionary (word source) for the SpellChecker. That is good for a start, but I don’t remember any application which has such a word list ready; it has to be generated from the data in the database. For sure, we can write an algorithm that creates it from the data in a DB table, but I don’t think that is a perfect solution for production systems whose data is updated often (lists of products, companies, customers, etc.), and we also have to think about keeping the DB and the spellchecker indexes synchronized.
A tool which integrates the DB, entity mapping and Lucene already exists: it’s another cool tool from JBoss/Hibernate, “Hibernate-Search”. So I decided to use it as the base for this integration.
Actually the first task we need to solve is how to create the IndexReader which will provide the list of words. Originally it was instantiated as
    IndexReader indexReader = IndexReader.open(originalIndexDirectory);
Happily, with Hibernate-Search it is easy to get access to the IndexReader for a particular entity, so the code to do it in my application looks like this:
    // "indexedClass" is the indexed entity (in my case it is the Product entity).
    SearchFactory searchFactory = ((FullTextEntityManager) entityManager).getSearchFactory();
    DirectoryProvider[] directoryProviders = searchFactory.getDirectoryProviders(indexedClass);
    ReaderProvider readerProvider = searchFactory.getReaderProvider();
    // The index reader instance is created by the ReaderProvider and could actually be
    // a compound index reader if we use a sharded index.
    IndexReader reader = readerProvider.openReader(directoryProviders);
After that we create the Dictionary object on top of that reader and let the SpellChecker index it:
    SpellChecker sp = new SpellChecker(getSpellCheckerDirectory(indexedClass, indexedField));
    // The dictionary reads its words from the given field of the Hibernate-Search index.
    Dictionary dictionary = new LuceneDictionary(reader, indexedField);
    sp.indexDictionary(dictionary);

    /**
     * @param indexedClass the indexed entity class
     * @param indexedField the indexed field of that entity
     * @return the Lucene Directory object for indexedClass and indexedField. It is constructed as
     *         "${base-spellchecker-directory}/${indexed-class-name}/${indexedField}", so the indexes of each
     *         field are stored in its own file-directory inside the owning-class directory
     * @throws IOException
     */
    private Directory getSpellCheckerDirectory(Class<?> indexedClass, String indexedField) throws IOException {
        // The original also had a stray "new FSDirectoryProvider().getDirectory();" call here;
        // its result was never used, so it is dropped.
        String path = "./spellchecker/" + indexedClass.getName() + "/" + indexedField;
        return FSDirectory.getDirectory(path);
    }
Generally, with those two small changes we can create the search engine as it is described in the java.net article.
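To make the “did you mean” idea itself a bit more concrete, here is a minimal plain-Java sketch of what a suggester does conceptually: given the words that exist in the data, pick the known word closest to what the user typed. This is only an illustration under my own assumptions. It does not use Lucene and is not how Lucene’s SpellChecker works internally; the class name `DidYouMeanSketch`, the `maxDistance` parameter and the sample product words are all made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Simplified illustration only: a brute-force stand-in, not Lucene's SpellChecker.
class DidYouMeanSketch {

    // Classic dynamic-programming Levenshtein (edit) distance between two strings.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest known word within maxDistance edits, or empty if none is close enough.
    static Optional<String> suggest(String typed, List<String> knownWords, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String word : knownWords) {
            int distance = editDistance(typed.toLowerCase(), word.toLowerCase());
            if (distance < bestDistance) {
                best = word;
                bestDistance = distance;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Hypothetical words, standing in for the terms of an indexed Product field.
        List<String> productWords = Arrays.asList("laptop", "tablet", "monitor");
        System.out.println(suggest("labtop", productWords, 2)); // prints Optional[laptop]
    }
}
```

Scanning every word like this is fine for a handful of terms but would be slow for a real catalogue, which is one reason to let a dedicated spellchecker index do the work instead.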
In the next post I will show how to create a web application with a did-you-mean feature using Seam, Hibernate-Search and Lucene. It will also have full-text search against the Product entity, and the suggestion engine will be used when the full-text search fails to find any results.
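As a rough preview of that fallback flow, here is a small plain-Java sketch: run the full-text search first, and ask the suggester only when nothing was found. It is just a sketch under my own assumptions, not the code of the actual application; `SearchWithSuggestionSketch`, `Outcome` and the two function parameters are hypothetical placeholders for the real Hibernate-Search query and the Lucene spellchecker call.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical wiring: both functions stand in for the real search and spellchecker calls.
class SearchWithSuggestionSketch {

    // The hits that were found and, only when there were none, an optional "did you mean" term.
    static final class Outcome {
        final List<String> hits;
        final Optional<String> didYouMean;

        Outcome(List<String> hits, Optional<String> didYouMean) {
            this.hits = hits;
            this.didYouMean = didYouMean;
        }
    }

    static Outcome search(String query,
                          Function<String, List<String>> fullTextSearch,
                          Function<String, Optional<String>> suggester) {
        List<String> hits = fullTextSearch.apply(query);
        if (!hits.isEmpty()) {
            // The search succeeded, so no suggestion is needed.
            return new Outcome(hits, Optional.empty());
        }
        // Nothing found: fall back to the suggestion engine.
        return new Outcome(hits, suggester.apply(query));
    }
}
```

Passing the two steps in as functions keeps the flow independent of how the search and the suggester are actually implemented.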
