Lucene spellchecker for TinyMCE


I have just finished a Lucene-based spellchecker for the TinyMCE editor. It is built on the same code as the two previous ones, the “Jazzy-based” and “JMySpell-based” spellcheckers.

You can download the code from the jspellchecker repository (http://jspellchecker.svn.sourceforge.net/viewvc/jspellchecker/trunk/).

The TinyMCE configuration stays the same; just use the updated path to the spellchecker servlet:

spellchecker_rpc_url    : "/spellchecker/lucene-spellchecker",

The current implementation is based on org.apache.lucene.search.spell.PlainTextDictionary (just a list of words, one per line) and has an additional servlet parameter for memory configuration, “max_memory_usage” (a value in megabytes that defines the maximum total size of the Lucene indexes that may be kept in memory).

The spellchecker index for a language is created on the first access to that language after web-application startup (or pre-created at servlet startup for the languages listed in “preloadedLanguages”).
To speed up index access (and, as a result, spell checking), the indexes are first created on the file system and then moved into memory.
The servlet uses a two-level cache to get the best performance while keeping memory usage under control.

  • The 1st level is a cache of SpellCheckers that use in-memory (RAMDirectory) Lucene indexes
  • The 2nd level stores file-system SpellCheckers (FSDirectory), which take no memory for the index and just hold a reference to the Directory object
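
The lookup order across the two levels can be sketched roughly as follows. This is only an illustration of the idea, not the servlet's actual code: the class TwoLevelCheckerCache, its methods, and the two factory functions are hypothetical names, and the type parameter C stands in for a spell checker so that the sketch runs without Lucene on the classpath.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical two-level lookup. C stands in for a spell checker;
 * none of these names come from the jspellchecker sources.
 */
class TwoLevelCheckerCache<C> {
    private final Map<String, C> memoryLevel = new HashMap<>(); // 1st level: in-memory checkers
    private final Map<String, C> fileLevel = new HashMap<>();   // 2nd level: file-system checkers
    private final Function<String, C> buildOnDisk;  // indexes a language's dictionary on the file system
    private final Function<C, C> copyToMemory;      // loads a file-based index into memory
    int diskBuilds = 0;                             // counter kept for illustration only

    TwoLevelCheckerCache(Function<String, C> buildOnDisk, Function<C, C> copyToMemory) {
        this.buildOnDisk = buildOnDisk;
        this.copyToMemory = copyToMemory;
    }

    /** Builds the checkers for the given languages eagerly, e.g. at servlet startup. */
    synchronized void preload(List<String> languages) {
        for (String language : languages) {
            get(language);
        }
    }

    /** Returns the in-memory checker, building the index on first access. */
    synchronized C get(String language) {
        C inMemory = memoryLevel.get(language);
        if (inMemory != null) {
            return inMemory;
        }
        C onDisk = fileLevel.get(language);
        if (onDisk == null) {
            // First access to this language: create the index on the file system.
            onDisk = buildOnDisk.apply(language);
            diskBuilds++;
            fileLevel.put(language, onDisk);
        }
        // Promote the file-based checker to the in-memory level.
        C promoted = copyToMemory.apply(onDisk);
        memoryLevel.put(language, promoted);
        return promoted;
    }

    /** Drops the in-memory copy only; the file-based checker stays available. */
    synchronized void evictFromMemory(String language) {
        memoryLevel.remove(language);
    }
}
```

With this shape, a language that falls out of memory does not need to be re-indexed: the next get() finds the file-system checker in the second level and only repeats the copy into memory.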

The 1st-level cache implementation (based on LinkedHashMap) is also responsible for memory management: it guarantees that the total size of all in-memory indexes stays below “maxMemoryUsage” (configured in megabytes in the servlet init parameters).
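
A minimal sketch of such a size-bounded cache on top of an access-ordered LinkedHashMap is shown below. It is an assumption about how the bound could be enforced, not the code from the servlet: the class MemoryBoundedCache and its methods are hypothetical, the caller supplies each index size in bytes, and the sketch allows the total to reach the limit exactly rather than stay strictly below it.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache bounded by total size; not taken from the jspellchecker sources. */
class MemoryBoundedCache<V> {
    /** A cached value together with the memory it is assumed to occupy. */
    private static final class Sized<V> {
        final V value;
        final long sizeBytes;

        Sized(V value, long sizeBytes) {
            this.value = value;
            this.sizeBytes = sizeBytes;
        }
    }

    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, Sized<V>> entries =
            new LinkedHashMap<>(16, 0.75f, true);
    private final long maxBytes;
    private long usedBytes = 0;

    MemoryBoundedCache(long maxMegabytes) {
        this.maxBytes = maxMegabytes * 1024L * 1024L;
    }

    /** Returns the cached value (marking it as recently used) or null. */
    synchronized V get(String key) {
        Sized<V> entry = entries.get(key);
        return entry == null ? null : entry.value;
    }

    /** Stores a value, evicting least recently used entries until it fits. */
    synchronized void put(String key, V value, long sizeBytes) {
        Sized<V> previous = entries.remove(key);
        if (previous != null) {
            usedBytes -= previous.sizeBytes;
        }
        if (sizeBytes > maxBytes) {
            return; // can never fit in memory; the caller keeps using the file-based index
        }
        Iterator<Map.Entry<String, Sized<V>>> it = entries.entrySet().iterator();
        while (usedBytes + sizeBytes > maxBytes && it.hasNext()) {
            usedBytes -= it.next().getValue().sizeBytes;
            it.remove();
        }
        entries.put(key, new Sized<>(value, sizeBytes));
        usedBytes += sizeBytes;
    }

    synchronized boolean contains(String key) {
        return entries.containsKey(key); // does not change the access order
    }

    synchronized long usedBytes() {
        return usedBytes;
    }
}
```

Evicting by accumulated size rather than by entry count matters here because dictionaries for different languages can differ a lot in size, so a fixed number of cached languages would not bound memory.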

So far I have found one issue with the Lucene spellchecker, related to multi-word processing. For example, I have “New-York” in my dictionary, but it is not processed as a single word (the Lucene index reader splits it into two words, of course).

Possible extension points for this spellchecker:

  1. Use an IndexReader to read existing Lucene indexes
    Dictionary dictionary = new LuceneDictionary(reader, indexedField);
  2. Use the extended form of suggestSimilar, which favors the “most popular” terms (it needs the original index reader, so it applies only to a LuceneDictionary based on an IndexReader)
    suggestions = spellChecker.suggestSimilar(word, maxSuggestionsCount, fieldIR, suggestedField, true);
    See the code examples for that in “Did-you-mean feature with Hibernate Search, Lucene and Seam. Example application”.

https://www.facebook.com/achorniy

Posted in Software Development
6 comments on “Lucene spellchecker for TinyMCE”
  1. Spocke says:

    This is really cool stuff. Do you guys generate the lucene dictionaries from open formats such as Open Office or Aspell? Anyway, keep up the good work.

    • Andrey Chorniy says:

      Thanks Spocke,
      actually, the Lucene spellchecker implementation currently supports only the simplest form of source dictionary, based on PlainTextDictionary:
      checker.indexDictionary(new PlainTextDictionary(dictionariesFile));

      If you want to use OpenOffice dictionaries, the easiest way is to use another implementation, based on the JMySpell library (JMySpellCheckerServlet).

      Probably the most interesting use of the Lucene spellchecker here would be integration with your dynamic data stored in a DB to build domain-specific dictionaries (Hibernate Search could simplify the Lucene-DB integration: https://achorniy.wordpress.com/2010/04/23/suggestion-engine-with-hibernate-search-and-lucene-intro ).

      Of course, it is possible to write a Lucene org.tinymce.spellchecker.Dictionary that reads OpenOffice dictionaries (the JMySpell OpenOfficeSpellDictionary class contains code that reads OpenOffice dictionaries and can serve as an example), but an OpenOffice dictionary is more than just a list of words, so it probably doesn’t make much sense to read OpenOffice dictionaries only to build a word list for the Lucene spellchecker.

  2. Aman says:

    Hi,

    I set up the Lucene spell checker after trying Jazzy, since Jazzy had some performance concerns. After following the steps from the site I get the following error:

    12-Mar-2012 14:21:58.738 SEVERE [ajp-bio-9080-exec-2] org.apache.catalina.core.ApplicationContext.log StandardWrapper.Throwable
    java.lang.NoSuchMethodError: org.apache.lucene.store.FSDirectory.getDirectory(Ljava/lang/String;)Lorg/apache/lucene/store/FSDirectory;
    at org.tinymce.spellchecker.LuceneSpellCheckerServlet.getSpellCheckerDirectory(LuceneSpellCheckerServlet.java:257)
    at org.tinymce.spellchecker.LuceneSpellCheckerServlet.reindexSpellchecker(LuceneSpellCheckerServlet.java:182)
    at org.tinymce.spellchecker.LuceneSpellCheckerServlet.loadSpellChecker(LuceneSpellCheckerServlet.java:167)
    at org.tinymce.spellchecker.LuceneSpellCheckerServlet.getChecker(LuceneSpellCheckerServlet.java:140)
    at org.tinymce.spellchecker.LuceneSpellCheckerServlet.preloadLanguageChecker(LuceneSpellCheckerServlet.java:90)
    at org.tinymce.spellchecker.TinyMCESpellCheckerServlet.preloadSpellcheckers(TinyMCESpellCheckerServlet.java:92)
    at org.tinymce.spellchecker.TinyMCESpellCheckerServlet.init(TinyMCESpellCheckerServlet.java:79)
    at org.tinymce.spellchecker.LuceneSpellCheckerServlet.init(LuceneSpellCheckerServlet.java:75)
    at javax.servlet.GenericServlet.init(GenericServlet.java:160)
    at org.apache.catalina.core.StandardWrapper.initServlet(StandardWrapper.java:1266)
    at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1185)
    at org.apache.catalina.core.StandardWrapper.allocate(StandardWrapper.java:857)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:135)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at lithium.apps.main.webserver.Tomcat70Bootstrap$3.invoke(Tomcat70Bootstrap.java:387)
    at lithium.apps.main.webserver.Tomcat70Bootstrap$2.invoke(Tomcat70Bootstrap.java:341)
    at lithium.apps.main.webserver.ApplicationWebserverConfigurationValve.invoke(ApplicationWebserverConfigurationValve.java:69)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:928)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
    at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:200)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:539)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:298)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

    12-Mar-2012 14:21:58.740 SEVERE [ajp-bio-9080-exec-2] org.apache.catalina.core.StandardWrapperValve.invoke Allocate exception for servlet lucene
    java.lang.NoSuchMethodError: org.apache.lucene.store.FSDirectory.getDirectory(Ljava/lang/String;)Lorg/apache/lucene/store/FSDirectory;
    (stack trace identical to the first one above)

    I tried researching the possible causes, and most people said to check whether I had multiple Lucene versions causing a conflict.

    I tried doing a sysout of LucenePackage.get().getImplementationVersion() and it gives me 3.0.1 912433 – 2010-02-21 23:51:22

    Can you guide me on the steps I might be missing?

    Thanks,
    Aman

    • Aman says:

      Turns out getDirectory has been deprecated since version 2.9 and is gone in 3.0, hence the NoSuchMethodError.

      • Andrey Chorniy says:

        Yep, I did it with “Lucene 2.4.1”.
        You can find the version info in “https://achorniy.wordpress.com/2010/04/24/did-you-mean-feature-hibernate-search-lucene-seam-example-application/”.

  3. Ben Roberts says:

    I implemented the Lucene spellchecker for TinyMCE and created a dictionary file, and it thinks all one- and two-letter words are misspelled. Any ideas?
