Lucene spellchecker for TinyMCE

Posted on May 11, 2010 by Andrey Chorniy — 6 Comments

I have just finished a Lucene-based spellchecker for the TinyMCE editor. It is built on the same code as the two previous ones, the “Jazzy-based” and “JMySpell-based” spellcheckers.

You can download the code from the jspellchecker repository (http://jspellchecker.svn.sourceforge.net/viewvc/jspellchecker/trunk/).

The TinyMCE configuration stays the same; just point it at the new spellchecker servlet path:

spellchecker_rpc_url : "/spellchecker/lucene-spellchecker",

The current implementation is based on org.apache.lucene.search.spell.PlainTextDictionary (a plain list of words, one per line) and has an additional servlet parameter for memory configuration, “max_memory_usage”: a value in megabytes that defines the maximum total size of the Lucene indexes that may be held in memory.
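To make the dictionary format and the megabyte parameter concrete, here is a minimal sketch in plain Java. It is not Lucene's PlainTextDictionary and not the servlet's code: the class and method names are hypothetical, and trimming whitespace and skipping blank lines are my assumptions about how a word list would reasonably be read.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating a newline-delimited word list; not part of Lucene. */
class WordListSketch {

    /** Splits the content into one word per line, trimming whitespace and skipping blank lines. */
    static List<String> parseWords(String content) {
        List<String> words = new ArrayList<>();
        for (String line : content.split("\\R")) {   // \R matches any line break
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Converts a megabyte value, as it might arrive from a servlet init parameter, to bytes. */
    static long megabytesToBytes(String initParamValue) {
        return Long.parseLong(initParamValue.trim()) * 1024L * 1024L;
    }

    public static void main(String[] args) {
        System.out.println(parseWords("apple\nbanana\n\ncherry\n"));
        System.out.println(megabytesToBytes("64"));
    }
}
```

A dictionary file for this format is simply a text file with one entry per line.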

The spellchecker index for a language is created on first access to that language after web-application startup (or pre-created at servlet startup for the languages listed in “preloadedLanguages”).
To speed up index access, and spell checking as a result, each index is first created on the file system and then moved into memory.
A two-level cache is used to balance performance against memory use.

The first-level cache holds SpellCheckers that use in-memory (RAMDirectory) Lucene indexes.
The second-level cache holds file-system SpellCheckers (FSDirectory), which take no index memory and only hold a reference to the Directory object.

The first-level cache implementation (based on LinkedHashMap) is also responsible for memory management: it guarantees that the total size of all in-memory indexes stays below “maxMemoryUsage” (configured in megabytes in the servlet init parameters).
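The memory-bounded first-level cache described above could look roughly like the sketch below. This is not the servlet's actual code: the class name, the size function, and the eviction policy (least recently used first, via an access-ordered LinkedHashMap) are assumptions made for illustration, and plain byte arrays stand in for in-memory indexes. Note that this sketch lets the total reach the limit exactly rather than staying strictly below it.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

/** Hypothetical first-level cache: LRU order, bounded by total bytes rather than entry count. */
class MemoryBoundedCache<V> {
    private final long maxBytes;
    private final ToLongFunction<V> sizeOf;
    // accessOrder=true keeps the least recently used entry first in iteration order.
    private final LinkedHashMap<String, V> entries = new LinkedHashMap<>(16, 0.75f, true);
    private long usedBytes;

    MemoryBoundedCache(long maxBytes, ToLongFunction<V> sizeOf) {
        this.maxBytes = maxBytes;
        this.sizeOf = sizeOf;
    }

    /** Returns the cached value and marks it as most recently used, or null if absent. */
    synchronized V get(String language) {
        return entries.get(language);
    }

    /** Adds a value, evicting least recently used entries until it fits. */
    synchronized boolean put(String language, V value) {
        long size = sizeOf.applyAsLong(value);
        if (size > maxBytes) {
            return false; // can never fit; the caller would keep using the file-system checker
        }
        V previous = entries.remove(language);
        if (previous != null) {
            usedBytes -= sizeOf.applyAsLong(previous);
        }
        Iterator<Map.Entry<String, V>> it = entries.entrySet().iterator();
        while (usedBytes + size > maxBytes && it.hasNext()) {
            usedBytes -= sizeOf.applyAsLong(it.next().getValue());
            it.remove();
        }
        entries.put(language, value);
        usedBytes += size;
        return true;
    }

    synchronized boolean contains(String language) {
        return entries.containsKey(language);
    }

    synchronized long usedBytes() {
        return usedBytes;
    }

    public static void main(String[] args) {
        MemoryBoundedCache<byte[]> cache = new MemoryBoundedCache<byte[]>(10L, v -> v.length);
        cache.put("en", new byte[4]);
        cache.put("de", new byte[4]);
        cache.get("en");               // "de" is now the least recently used entry
        cache.put("fr", new byte[4]);  // does not fit next to both, so "de" is evicted
        System.out.println(cache.contains("de") + " " + cache.usedBytes());
    }
}
```

In a two-level setup, a language missing from this cache would fall back to the file-system checker in the second level, and could be promoted into memory again later.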

So far I have found one issue with the Lucene spellchecker, related to multi-word processing. For example, I have “New-York” in my dictionary, but it is not processed as a single word (the Lucene index reader splits it into two words, of course).
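The effect can be reproduced without Lucene. The sketch below uses a deliberately naive tokenizer that splits on every non-letter character; this is an assumption for illustration, not Lucene's real analysis chain, and all names are hypothetical. It shows why a hyphenated dictionary entry is never matched once the text has been split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of how splitting on hyphens hides a hyphenated dictionary entry. */
class HyphenSplitSketch {

    /** Naive tokenizer: every run of non-letter characters is a separator. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Returns the tokens that are not present in the dictionary. */
    static List<String> unknownWords(String text, Set<String> dictionary) {
        List<String> unknown = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!dictionary.contains(token)) {
                unknown.add(token);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("Visit", "today", "New-York"));
        // "New-York" is in the dictionary, but the tokens looked up are "New" and "York".
        System.out.println(unknownWords("Visit New-York today", dictionary));
    }
}
```

One way around this would be to split such entries into their parts before indexing, or to use a tokenizer that keeps hyphenated words together; which of these fits depends on the dictionary.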

Possible extension points for this spellchecker:

Use an IndexReader to read existing Lucene indexes:
Dictionary dictionary = new LuceneDictionary(reader, indexedField);
Use the extended form of suggestSimilar, which boosts the “most popular” terms (it needs the original index reader, so it applies only to a LuceneDictionary based on an IndexReader):
suggestions = spellChecker.suggestSimilar(word, maxSuggestionsCount, fieldIR, suggestedField, true);
See the code examples for both in “Did-you-mean feature with Hibernate Search, Lucene and Seam. Example application”.


Tagged with: java, lucene, spellchecker, web, web-application
Posted in Software Development

6 comments on “Lucene spellchecker for TinyMCE”

Spocke says:

May 11, 2010 at 10:38 pm

This is really cool stuff. Do you guys generate the lucene dictionaries from open formats such as Open Office or Aspell? Anyway, keep up the good work.

- Andrey Chorniy says:
  
  May 12, 2010 at 12:02 am
  
  Thanks, Spocke.
  Actually, the Lucene spellchecker implementation currently supports only the simplest form of source dictionary, based on PlainTextDictionary:
  checker.indexDictionary(new PlainTextDictionary(dictionariesFile));
  
  If you want to use OpenOffice dictionaries, the easiest way is to use the other implementation, based on the JMySpell library (JMySpellCheckerServlet).
  
  Probably the most interesting use of the Lucene spellchecker here would be integrating it with dynamic data stored in your DB to build domain-specific dictionaries (Hibernate Search can simplify the Lucene-DB integration: https://achorniy.wordpress.com/2010/04/23/suggestion-engine-with-hibernate-search-and-lucene-intro ).
  
  Of course, it is possible to write a Lucene org.tinymce.spellchecker.Dictionary that reads OpenOffice dictionaries (the JMySpell OpenOfficeSpellDictionary class contains code that reads OpenOffice dictionaries and could serve as an example). But an OpenOffice dictionary is more than just a list of words, so it probably does not make much sense to read OpenOffice dictionaries only to build a word list for the Lucene spellchecker.
  
Aman says:

March 12, 2012 at 11:24 pm

Hi,

I set up the Lucene spell checker after trying Jazzy, since Jazzy had some performance concerns. After following the steps from the site I get the following error:

12-Mar-2012 14:21:58.738 SEVERE [ajp-bio-9080-exec-2] org.apache.catalina.core.ApplicationContext.log StandardWrapper.Throwable
java.lang.NoSuchMethodError: org.apache.lucene.store.FSDirectory.getDirectory(Ljava/lang/String;)Lorg/apache/lucene/store/FSDirectory;
at org.tinymce.spellchecker.LuceneSpellCheckerServlet.getSpellCheckerDirectory(LuceneSpellCheckerServlet.java:257)
at org.tinymce.spellchecker.LuceneSpellCheckerServlet.reindexSpellchecker(LuceneSpellCheckerServlet.java:182)
at org.tinymce.spellchecker.LuceneSpellCheckerServlet.loadSpellChecker(LuceneSpellCheckerServlet.java:167)
at org.tinymce.spellchecker.LuceneSpellCheckerServlet.getChecker(LuceneSpellCheckerServlet.java:140)
at org.tinymce.spellchecker.LuceneSpellCheckerServlet.preloadLanguageChecker(LuceneSpellCheckerServlet.java:90)
at org.tinymce.spellchecker.TinyMCESpellCheckerServlet.preloadSpellcheckers(TinyMCESpellCheckerServlet.java:92)
at org.tinymce.spellchecker.TinyMCESpellCheckerServlet.init(TinyMCESpellCheckerServlet.java:79)
at org.tinymce.spellchecker.LuceneSpellCheckerServlet.init(LuceneSpellCheckerServlet.java:75)
at javax.servlet.GenericServlet.init(GenericServlet.java:160)
at org.apache.catalina.core.StandardWrapper.initServlet(StandardWrapper.java:1266)
at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1185)
at org.apache.catalina.core.StandardWrapper.allocate(StandardWrapper.java:857)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:135)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
at lithium.apps.main.webserver.Tomcat70Bootstrap$3.invoke(Tomcat70Bootstrap.java:387)
at lithium.apps.main.webserver.Tomcat70Bootstrap$2.invoke(Tomcat70Bootstrap.java:341)
at lithium.apps.main.webserver.ApplicationWebserverConfigurationValve.invoke(ApplicationWebserverConfigurationValve.java:69)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:928)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:200)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:539)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:298)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

12-Mar-2012 14:21:58.740 SEVERE [ajp-bio-9080-exec-2] org.apache.catalina.core.StandardWrapperValve.invoke Allocate exception for servlet lucene
java.lang.NoSuchMethodError: org.apache.lucene.store.FSDirectory.getDirectory(Ljava/lang/String;)Lorg/apache/lucene/store/FSDirectory;

I researched the possible causes, and most people said to check whether I had multiple Lucene versions causing a conflict.

I printed LucenePackage.get().getImplementationVersion() and it gives me 3.0.1 912433 – 2010-02-21 23:51:22

Can you point out the steps I might be missing?

Thanks,
Aman

- Aman says:
  
  March 12, 2012 at 11:37 pm
  
  Turns out getDirectory has been deprecated since version 2.9, and it is no longer available in the 3.0.1 I am running, hence the NoSuchMethodError.
  
  - Andrey Chorniy says:
    
    March 27, 2012 at 1:23 pm
    
    Yep, I built it against Lucene 2.4.1.
    You can find the version info in https://achorniy.wordpress.com/2010/04/24/did-you-mean-feature-hibernate-search-lucene-seam-example-application/
Ben Roberts says:

October 5, 2012 at 9:42 pm

I implemented the lucene spellchecker for TinyMCE and created a dictionary file and it thinks all one and two letter words are misspelled. Any Ideas?
