org.apache.lucene.spelt
Class LuceneIndexToDict

Object
  extended by LuceneIndexToDict

public class LuceneIndexToDict
extends Object

Utility class that converts the stored fields of a Lucene index into a spelling dictionary. Integrating dictionary creation into the original indexing process (e.g. using SpellWritingAnalyzer or SpellWritingFilter) is generally preferable, since that captures non-stored as well as stored fields. If that isn't an option, or if you simply want to try out spelling correction, after-the-fact dictionary creation may still be useful.

Author:
Martin Haye

Constructor Summary
LuceneIndexToDict()
           
 
Method Summary
static void createDict(Directory indexDir, File dictDir)
          Read a Lucene index and make a spelling dictionary from it.
static void createDict(Directory indexDir, File dictDir, ProgressTracker prog)
          Read a Lucene index and make a spelling dictionary from it.
static void createDict(IndexReader indexReader, Analyzer analyzer, SpellWriter spellWriter, ProgressTracker prog)
          Read a Lucene index and make a spelling dictionary from it.
static void main(String[] args)
          Command-line interface for building a dictionary directly from a Lucene index without writing any code.
static void queueWords(IndexReader reader, Analyzer analyzer, SpellWriter writer, ProgressTracker prog)
          Re-tokenize all the words in stored fields within a Lucene index, and queue them to a spelling dictionary.
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LuceneIndexToDict

public LuceneIndexToDict()

Method Detail

createDict

public static void createDict(Directory indexDir,
                              File dictDir)
                       throws IOException
Read a Lucene index and make a spelling dictionary from it. A minimal token analyzer will be used, which is usually just what is needed for the dictionary. The default set of English stop words will be used (see StopAnalyzer.ENGLISH_STOP_WORDS).

Parameters:
indexDir - directory containing the Lucene index
dictDir - directory to receive the spelling dictionary
Throws:
IOException
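
As a rough illustration of what "minimal" tokenization with stop-word removal can look like, the sketch below uses only the Java standard library. It is not the code behind createDict: the class MinimalTokenSketch, its tokenize method, the lowercasing step, and the short sample stop list are all assumptions made for the example (the actual default list is StopAnalyzer.ENGLISH_STOP_WORDS).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration only; not part of org.apache.lucene.spelt. */
final class MinimalTokenSketch {
    // Small sample stop list (an assumption, not the real default list).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    /** Splits on non-letter characters, lowercases, and drops stop words. No stemming. */
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields an empty first element when text starts with a delimiter
            }
            String word = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(word)) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The indexes of the Lucene libraries"));
    }
}
```

Note that plural forms such as "indexes" and "libraries" pass through unchanged; only stop words are removed.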

createDict

public static void createDict(Directory indexDir,
                              File dictDir,
                              ProgressTracker prog)
                       throws IOException
Read a Lucene index and make a spelling dictionary from it. A minimal token analyzer will be used, which is usually just what is needed for the dictionary. The default set of English stop words will be used (see StopAnalyzer.ENGLISH_STOP_WORDS).

Parameters:
indexDir - directory containing the Lucene index
dictDir - directory to receive the spelling dictionary
prog - tracker called periodically to display progress
Throws:
IOException

createDict

public static void createDict(IndexReader indexReader,
                              Analyzer analyzer,
                              SpellWriter spellWriter,
                              ProgressTracker prog)
                       throws IOException
Read a Lucene index and make a spelling dictionary from it, tokenizing fields with the supplied analyzer and sending the resulting words to the supplied spell writer.

Parameters:
indexReader - used to read fields from a Lucene index
analyzer - used to tokenize fields from the index; generally, this should do minimal filtering, taking care to avoid substantive token modification (such as stemming or depluralization). A good choice is MinimalAnalyzer.
spellWriter - receives words to be added to the dictionary
prog - tracker called periodically to display progress
Throws:
IOException
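
One plausible reason for the advice on the analyzer parameter (an inference, not something this page states) is that a token modifier can turn real words into non-words, which would then enter the dictionary. The sketch below shows this with a deliberately naive depluralizer; DepluralizeSketch and naiveDepluralize are hypothetical names, and no real stemmer is claimed to behave exactly this way.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical illustration only; not part of org.apache.lucene.spelt. */
final class DepluralizeSketch {
    /** Deliberately naive: strips one trailing "s" from words longer than three letters. */
    static String naiveDepluralize(String word) {
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("indexes", "analysis", "class", "lucene");
        List<String> modified = words.stream()
            .map(DepluralizeSketch::naiveDepluralize)
            .collect(Collectors.toList());
        System.out.println(modified);
    }
}
```

Here "analysis" becomes "analysi" and "class" becomes "clas". A dictionary built from such tokens would contain forms that no user is likely to want as a suggestion.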

queueWords

public static void queueWords(IndexReader reader,
                              Analyzer analyzer,
                              SpellWriter writer,
                              ProgressTracker prog)
                       throws IOException
Re-tokenize all the words in stored fields within a Lucene index, and queue them to a spelling dictionary. This method does not flush the writer to form the final dictionary, so it can be called repeatedly to queue words from multiple Lucene indexes.

Parameters:
reader - used to read fields from a Lucene index
analyzer - used to tokenize fields from the index; generally, this should do minimal filtering, taking care to avoid substantive token modification (such as stemming or depluralization). A good choice is MinimalAnalyzer.
writer - receives words to be added to the dictionary
prog - tracker called periodically to display progress
Throws:
IOException
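
The queue-then-flush pattern described for queueWords can be modeled with the standard library alone. In the sketch below, QueueThenFlushSketch is a hypothetical stand-in for a spell writer, and the word-to-count map it produces is an assumed dictionary shape, not the real on-disk format. The point is only that words from several sources accumulate until a single flush builds the final result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a spell writer; not part of org.apache.lucene.spelt. */
final class QueueThenFlushSketch {
    private final List<String> queued = new ArrayList<>();

    /** Queues words without building the dictionary; may be called many times. */
    void queueWords(List<String> words) {
        queued.addAll(words);
    }

    /** Builds the final word-to-frequency map from everything queued so far. */
    Map<String, Integer> flush() {
        Map<String, Integer> freq = new TreeMap<>();
        for (String word : queued) {
            freq.merge(word, 1, Integer::sum);
        }
        queued.clear();
        return freq;
    }

    public static void main(String[] args) {
        QueueThenFlushSketch writer = new QueueThenFlushSketch();
        writer.queueWords(Arrays.asList("lucene", "index"));    // words from a first index
        writer.queueWords(Arrays.asList("lucene", "spelling")); // words from a second index
        System.out.println(writer.flush());
    }
}
```

Because the flush happens once, at the end, word frequencies reflect all queued sources combined rather than each index separately.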

main

public static void main(String[] args)
Command-line interface for building a dictionary directly from a Lucene index without writing any code.