Did-you-mean feature with Hibernate Search, Lucene and Seam. Example application


In the previous post I described the changes needed to build a spellchecker index from the existing Lucene indexes created by Hibernate Search.

In this post I will show a working web application with full-text search and a Lucene-based did-you-mean feature (I call it the suggester service). I hope it gives a good starting point for later extension and shows how the feature can be integrated into an application.

To get started quickly, let’s use the existing Seam example application “dvdstore” (see this article if you are not familiar with Seam).

In short, to start this example you need to:

  1. Install JBoss 5 (I had issues running it on JBoss 4.2.3, I believe because of the Hibernate library version)
  2. Download Seam 2.2.1-CR1
  3. Define jboss.home in build.properties in $seam-home
  4. Go to /examples/dvdstore and run “ant deploy”

That should build and deploy the dvdstore example application on your JBoss (exploded deployment doesn’t work for me by default because of issues with JBoss 5 classloading/hot redeployment, so just deploy it as an EAR file). After that you should be able to access it at http://localhost:${jboss-port}/seam-dvdstore.
Once you open it, go to the “Shopping” tab and use the search box, which executes a full-text search against the Product entity (the list of DVDs).

Now let’s update this application to add a “did-you-mean” feature on top of the existing full-text search.

1. Add the Lucene spellchecker library to the project

  1. Download Lucene 2.4.1 (it is used by Seam 2.2)
  2. Put “contrib/spellchecker/lucene-spellchecker-2.4.1.jar” into the dvdstore/lib directory
  3. Add the following code to “dvdstore/build.xml” (to add the libraries to the EAR)
<!-- the lucene-spellchecker jar was added to the lib directory; include it in the EAR -->
<fileset id="ear.lib.extras" dir=".">
    <include name="/lib/*.jar"/>
</fileset>

<!-- the new library was added to lib; include it in the compilation classpath -->
<path id="build.classpath.extras">
    <fileset refid="ear.lib.extras"/>
</path>

2. Create Lucene indexes for the spellchecker

/**
 * Creates the spellchecker indexes (used by the n-gram based suggestion algorithm) for selected entities/fields
 * @author: Andrey Chorniy
 */

@Name("spellCheckIndexerProcessor")
@AutoCreate
public class SpellCheckIndexerProcessor {

@In
private EntityManager entityManager;

@Logger
private Log logger;

@Asynchronous
public void scheduleIndexing(@Duration long duration) {
    process();
}

private void process() {
    indexSpellchecker(Product.class, "title");
    indexSpellchecker(Product.class, "description");
}

private void indexSpellchecker(Class indexedClass, String indexedField) {
    SearchFactory searchFactory = getSearchFactory();

    DirectoryProvider[] directoryProviders = searchFactory.getDirectoryProviders(indexedClass);

    ReaderProvider readerProvider = searchFactory.getReaderProvider();
    IndexReader reader = readerProvider.openReader(directoryProviders);
    try {
        final SpellChecker sp = new SpellChecker(getSpellCheckerDirectory(indexedClass, indexedField));
        Dictionary dictionary = new LuceneDictionary(reader, indexedField);
        sp.indexDictionary(dictionary);
        logger.info("Create spellchecker index for {0} field {1}", indexedClass.toString(), indexedField);
    } catch (IOException e) {
        logger.error("Failed to create SpellChecker", e);
    } finally {
        readerProvider.closeReader(reader);
    }
}

private SearchFactory getSearchFactory() {
    return ((FullTextEntityManager) entityManager).getSearchFactory();
}

/**
 * @param indexedClass
 * @param indexedField
 * @return the Lucene Directory object for indexedClass and indexedField. It is constructed as
 * "${base-spellchecker-directory}/${indexed-class-name}/${indexedField}", so the index for each field is stored in its
 * own directory inside the owning class's directory
 * @throws IOException
 */
private Directory getSpellCheckerDirectory(Class indexedClass, String indexedField) throws IOException {
    // resolved relative to the JVM working directory (${jboss.home}/bin when running under JBoss)
    String path = "./spellchecker/" + indexedClass.getName() + "/" + indexedField;
    return FSDirectory.getDirectory(path);
}
}

In the process() method we create two indexes, for the “title” and “description” fields of the Product entity. The “description” field isn’t indexed by default, so let’s add the Hibernate Search @Field annotation to it in the Product class:

    @Column(name="DESCRIPTION",length=1024)
    @Field(index=Index.TOKENIZED)
    public String getDescription() {
        return description;
    }

Now let’s launch the spellchecker indexing at application startup. It has to run after Hibernate Search has created the indexes for the Product fields; those are built at application startup by the IndexerAction.index() method (on EJB 3 bean creation). The problem is that Hibernate Search/Lucene creates indexes asynchronously, and there is no guarantee that they will be ready within any particular time after IndexerAction.index() starts. So I added a 60-second delay before spellchecker index creation is launched. Here is the code:

@Name("spellCheckIndexer")
@Scope(ScopeType.APPLICATION)
@Startup(depends = "indexer")
public class SpellCheckIndexer {

@Logger
private Log logger;

@In
private SpellCheckIndexerProcessor spellCheckIndexerProcessor;

//invoked automatically on the Seam postInitialization event
@Observer("org.jboss.seam.postInitialization")
public void scheduleProcess() {
    //delay is needed since initial Lucene-index for entities may not be created at this moment
    int delayInSeconds = 60;
    logger.info("SpellCheckIndexer will start in 60 seconds");
    spellCheckIndexerProcessor.scheduleIndexing(delayInSeconds * 1000L);
}
}

Now we have everything needed to run the spellchecker. After deploying the updated application you should find a new directory ${jboss.home}/bin/spellchecker with subdirectories for the “title” and “description” indexes.
In a real application with a lot of indexed data, index creation should be more careful: either make sure that the indexes used to build the spellchecker index have already been created, or rework spellchecker index creation to index the entity attribute values directly, as is done in the Hibernate Search org.hibernate.search.impl.FullTextSessionImpl.index() method.
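
As a rough illustration of waiting until the entity indexes are ready instead of sleeping for a fixed 60 seconds, here is a minimal sketch that polls a readiness check with a timeout. IndexReadinessWaiter and awaitReady are hypothetical names, not part of Seam or Hibernate Search, and the sketch assumes Java 8. The readiness condition itself (for example, that the entity index reader already reports some documents) is an assumption you would have to supply.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: polls a readiness condition instead of sleeping a fixed time. */
final class IndexReadinessWaiter {

    private IndexReadinessWaiter() {
    }

    /**
     * Polls isReady until it returns true or the timeout elapses.
     *
     * @return true if the condition became true in time, false on timeout or interruption
     */
    static boolean awaitReady(BooleanSupplier isReady, long pollIntervalMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!isReady.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the condition never became true
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

The asynchronous indexing method could call awaitReady first and build the spellchecker index only when it returns true, logging a warning otherwise.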

3. Use the Lucene spellchecker to create the did-you-mean feature

The findSuggestions method of FullTextSuggestionService is the place where the magic happens. For each suggested field (“title” and “description” in our case) it looks up suggestions for the words of the query and then merges the results into a single list of suggestions. The code is not perfect: for each field it returns suggestions only for the first query word that is missing from that field's spellchecker index, and it should not offer suggestions for a word at all when one of the fields contains an exact match for it.

/**
*
* @author: Andrey Chorniy
* Date: 21.04.2010
*/
@Name("fullTextSuggestionService")
@AutoCreate
public class FullTextSuggestionService {

@Logger
private Log logger;

/**
 * @param searchQuery user defined search criteria (used as a list of words)
 * @param indexedClass entity class
 * @param maxSuggestionsPerFieldCount maximum number of suggestions per field (usually 2..3 is enough)
 * @param suggestedFields list of entity fields to look for suggestions
 * @return list of suggestions
 */
public List<String> findSuggestions(String searchQuery, Class<Product> indexedClass, int maxSuggestionsPerFieldCount,
                                    String... suggestedFields) {
    Map<String, List<String>> fieldSuggestionsMap = new LinkedHashMap<String, List<String>>();

    for (String suggestedField : suggestedFields) {
        List<String> fieldSuggestions = findSuggestionsForField(searchQuery, indexedClass, maxSuggestionsPerFieldCount,
                suggestedField);
        fieldSuggestionsMap.put(suggestedField, fieldSuggestions);
    }

    return mergeSuggestions(maxSuggestionsPerFieldCount, fieldSuggestionsMap);
}

public List<String> findSuggestionsForField(String searchQuery, Class<Product> indexedClass,
                                            int maxSuggestionsCount,
                                            String suggestedField) {
    try {
        final SpellChecker sp = new SpellChecker(getSpellCheckerDirectory(indexedClass, suggestedField));

        //get the suggested words
        String[] words = searchQuery.split("\\s+");
        for (String word : words) {
            if (sp.exist(word)) {
                //no need to include suggestions for that word
                //TODO in case of multiple-fields suggestion that word should be excluded from suggestion in other fields
                continue;
            }
            String[] suggestions = sp.suggestSimilar(word, maxSuggestionsCount);
            return Arrays.asList(suggestions);
        }
    } catch (IOException e) {
        //pass the exception as the second argument so that Seam logs its stack trace
        logger.error("Failed to create SpellChecker for {0} field of class {1}", e, suggestedField,
                indexedClass.getName());
    }
    return Collections.emptyList();
}

private List<String> mergeSuggestions(int suggestionNumber, Map<String, List<String>> fieldSuggestionsMap) {
    List<String> suggestionList = new ArrayList<String>();
    for (int suggestionPosition = 0; suggestionPosition <= suggestionNumber; suggestionPosition++) {
        for (Map.Entry<String, List<String>> fieldSuggestionsEntry : fieldSuggestionsMap.entrySet()) {
            String fieldName = fieldSuggestionsEntry.getKey();
            List<String> suggestedTerms = fieldSuggestionsEntry.getValue();
            if (suggestedTerms.size() > suggestionPosition) {
                String suggestion = suggestedTerms.get(suggestionPosition);
                if (!suggestionList.contains(suggestion)){
                    suggestionList.add(suggestion);
                }
            }
        }
    }
    return suggestionList;
}
/**
 *
 * @param indexedClass
 * @param indexedField
 * @return Lucene Directory object in which spellchecker indexes are located for specified entity-class and entity-field
 * @throws IOException
 */
public Directory getSpellCheckerDirectory(Class indexedClass, String indexedField) throws IOException {
    String path = "./spellchecker/" + indexedClass.getName() + "/" + indexedField;
    return FSDirectory.getDirectory(path);
}
}
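
The TODO in findSuggestionsForField (skip a word when any field already contains it exactly) could be handled along the following lines. This is only a sketch under assumptions: CrossFieldSuggester and its FieldIndex interface are hypothetical names introduced here to keep the example free of Lucene dependencies. In the real service each FieldIndex would be backed by one SpellChecker, using its exist and suggestSimilar methods.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: suggest only for words that no field knows exactly. */
final class CrossFieldSuggester {

    /** Hypothetical abstraction over the spellchecker index of one entity field. */
    interface FieldIndex {
        boolean exists(String word);

        List<String> suggest(String word, int max);
    }

    private CrossFieldSuggester() {
    }

    static List<String> suggest(String query, int maxPerField, List<FieldIndex> fields) {
        // LinkedHashSet keeps the first-seen order and drops duplicate suggestions
        Set<String> merged = new LinkedHashSet<String>();
        for (String word : query.trim().split("\\s+")) {
            if (word.isEmpty() || existsInAnyField(word, fields)) {
                continue; // an exact match in any field means the word is spelled correctly
            }
            for (FieldIndex field : fields) {
                merged.addAll(field.suggest(word, maxPerField));
            }
        }
        return new ArrayList<String>(merged);
    }

    private static boolean existsInAnyField(String word, List<FieldIndex> fields) {
        for (FieldIndex field : fields) {
            if (field.exists(word)) {
                return true;
            }
        }
        return false;
    }
}
```

Unlike the service above, this variant checks every word of the query and appends suggestions field by field rather than interleaving them by position.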

As you can see, the code that gets suggestions for a single word is pretty simple:

String[] suggestions = sp.suggestSimilar(word, maxSuggestionsCount);

However, we can increase the relevance of the suggested results by using the alternative method, which also takes the entity index reader (which is easy to get) and the field name, which we know.

//return only the suggested words that are as frequent or more frequent than the searched word
String[] suggestions = sp.suggestSimilar(word, maxSuggestionsCount, entityIndex, fieldName, true);
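
To make the idea behind that flag concrete, here is a small sketch of frequency-based filtering as described in the comment above. It is not Lucene's implementation: PopularityFilter, keepAtLeastAsFrequent and the plain frequency map are hypothetical stand-ins for the document frequencies that would normally be read from the entity index. The sketch assumes Java 8.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: keep only candidates at least as frequent as the searched word. */
final class PopularityFilter {

    private PopularityFilter() {
    }

    static List<String> keepAtLeastAsFrequent(String word, List<String> candidates,
                                              Map<String, Integer> frequencies) {
        int wordFrequency = frequencies.getOrDefault(word, 0);
        List<String> kept = new ArrayList<String>();
        for (String candidate : candidates) {
            int candidateFrequency = frequencies.getOrDefault(candidate, 0);
            // drop candidates that never occur in the field or are rarer than the searched word
            if (candidateFrequency >= 1 && candidateFrequency >= wordFrequency) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

For a word that does not occur in the field at all, every candidate that occurs at least once passes the filter, so the flag mostly matters for words that are rare rather than missing.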

OK, we now have all the code needed to produce suggestions and just need to call it from the existing full-text search when that search returns no results. To do that we can update the FullTextSearchAction class:

//add list of suggestions
private List<String> suggestions;

//inject the fullTextSuggestionService we created before
@In private FullTextSuggestionService fullTextSuggestionService;

//at the end of the updateResults() method:
//look for suggestions if the full-text search returned nothing
suggestions = null;
if (numberOfResults == 0){
    suggestions = fullTextSuggestionService.findSuggestions(searchQuery, Product.class, 2, "title", "description");
}

//add a method to run a search from the page (just for testing purposes)
/**
 * Helper method to run a search with a query (used in browse.xhtml to launch a search for one of the suggestions).
 * To be updated: replace it with RESTful links (low importance for this example project)
 */
@Begin(join = true)
public String searchFor(String query) {
    currentPage = 0;
    searchQuery = query;
    updateResults();
    return "browse";
}

I skipped the getter method for suggestions and the @Local interface method declarations for that EJB 3 bean in the article; see the link to the full code.

Finally, add the code to browse.xhtml to show the suggestions as links:

<f:subview id="searchresults" rendered="#{searchResults.rowCount == 0}">
<h2>
    <h:outputText id="NoResultsMessage" value="#{messages.noSearchResultsHeader}" />
</h2>
<h:panelGroup rendered="#{not empty search.suggestions}">
    <h:outputText value="Did you mean..."/>
    <h:form>
        <h:dataTable value="#{search.suggestions}" var="suggestion">
            <h:column>
                <h:commandLink action="#{search.searchFor(suggestion)}">
                    <h:outputText value="#{suggestion}"/>
                </h:commandLink>
            </h:column>
        </h:dataTable>
    </h:form>
</h:panelGroup>
</f:subview>

That’s all. Now we can redeploy the application and try searches with misspelled titles or words, such as “tarminul”, “fuction” or “flawers”. It works, so what else is needed? The answer is improvements: there are many places to work on and extend to achieve better search results and easier integration. But essentially it works right now, and we already have an engine that shows quite relevant results, at least in my opinion.

Links

Steps for the future

  1. Check whether it is correct to use the Lucene indexes to create the spellchecker indexes. The question is: what set of words is returned by the index created by Hibernate Search, and does it correspond to the set of words that would be produced by iterating over the “title” property of Product? Since the SpellChecker algorithm uses word frequency (with the more-popular option), the word frequency in the index makes a difference, so it should have the same value as in direct iteration over the attribute values. A quick way to check is to create a spellchecker index by iterating over the attribute values and compare the indexes.
  2. Update the spellchecker index creation approach. The options are to write code like that in org.hibernate.search.impl.FullTextSessionImpl.index(), or even to extend FullTextSessionImpl to do it (of course, the latter is only possible if the Hibernate Search team finds the feature worth integrating into the framework). At the moment I think it’s possible to create a few custom annotations and process them.
  3. Enable closer integration with Hibernate Search by looking not just for suggestion text, but for the related object.
  4. Probably related to the previous one: update the spellchecker algorithm to store index fields in the same directory as the entity indexes.
  5. Compound search. The current algorithm is well suited to finding suggestions for single-word queries. Even though we could find a suggestion for each word, that does not look like the best approach: it works well as a spellchecker, but is not great for “Did you mean”.
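
For item 5, one simple (and admittedly naive) direction is to correct the query word by word and offer the whole corrected phrase as a single “Did you mean” suggestion. The sketch below only illustrates that idea under assumptions: PhraseSuggester and the correction function are hypothetical, the function is expected to return a replacement for a misspelled word and an empty Optional otherwise, and Java 8 is assumed.

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: build one corrected phrase from per-word corrections. */
final class PhraseSuggester {

    private PhraseSuggester() {
    }

    /**
     * @param correction returns a replacement for a misspelled word, or an empty Optional
     *                   when the word is known or no suggestion exists
     * @return the corrected phrase, or an empty Optional if no word was changed
     */
    static Optional<String> suggestPhrase(String query, Function<String, Optional<String>> correction) {
        StringBuilder phrase = new StringBuilder();
        boolean changed = false;
        for (String word : query.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Optional<String> replacement = correction.apply(word);
            if (replacement.isPresent()) {
                changed = true;
            }
            if (phrase.length() > 0) {
                phrase.append(' ');
            }
            phrase.append(replacement.orElse(word)); // keep the original word if nothing better exists
        }
        return changed ? Optional.of(phrase.toString()) : Optional.<String>empty();
    }
}
```

This still treats words independently, so it cannot tell whether the corrected words actually occur together in any title. Checking the corrected phrase against the full-text index before showing it would be a natural next step.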