Package org.cdlib.xtf.textIndexer

Contains all the classes that make up the textIndexer tool.


Class Summary
AccentFoldingFilter Improves query results by converting accented characters to normal characters by removing diacritics.
CrimsonBugWorkaround There's a very nasty bug in the Apache Crimson XML parser.
CrimsonBugWorkaround.BlockEnum Presents the input stream as a series of blocks of data
FacetTokenizer Performs special tokenization for facet fields.
HTMLIndexSource Transforms an HTML file to a single-record XML file.
HTMLToString This class provides a single static convert() method that converts an HTML file into an XML string that can be pre-filtered and added to a Lucene database by the XMLTextProcessor class.
IdxTreeCleaner This class purges "incomplete" documents from a Lucene index.
IdxTreeCuller This class provides a simple mechanism for removing documents from an index when the source text no longer exists in the document library.
IdxTreeDictMaker This class provides a simple mechanism for generating a spelling correction dictionary after new documents have been added or updated.
IdxTreeOptimizer This class provides a simple mechanism for optimizing Lucene indices after new documents have been added, updated, or removed.
IndexDump This class dumps the contents of user-selected fields from an XTF text index.
IndexerConfig This class records configuration information about the current state of the TextIndexer application.
IndexInfo This class maintains configuration information about the current index that the TextIndexer program is processing.
IndexMerge This class merges the contents of two or more XTF indexes, with certain caveats.
IndexMerge.DirInfo  
IndexRecord A single record within an IndexSource.
IndexSource Represents a single source of data for an XTF index.
IndexStats This class calculates and prints out some useful statistics about an existing index, such as number of documents, size, etc.
MARCIndexSource Supplies MARC data to an XTF index, breaking it up into individual MARCXML records.
MSWordIndexSource Transforms a Microsoft Word file to a single-record XML file.
PDFIndexSource Transforms a PDF file to a single-record XML file.
PDFToString This class provides a single static convert() method that converts the text in a PDF file into an XML string that can be pre-filtered and added to a Lucene database by the XMLTextProcessor class.
PluralFoldingFilter Improves query results by converting plural words to their singular forms.
SectionInfo This class maintains information about the current section in a text document that the TextIndexer program is processing.
SectionInfoStack This class maintains information about the current nesting of sections in a text document that the TextIndexer program is processing.
SpellWritingFilter Adds words from the token stream to a SpellWriter.
SrcTreeProcessor This class is the main processing shell for files in the source text tree.
SrcTreeProcessor.CacheEntry One entry in the docSelector cache
StartEndFilter Ensures that the tokens at the start and end of the stream are indexed both with and without the special start-of-field/end-of-field markers.
StructuredFileProxy Used to put off actually creating a structured store until it is needed.
TagFilter Spots XML elements in a token stream and marks them specially.
TextIndexer This class is the main class for the TextIndexer program.
TextIndexSource Transforms a plain text file to a single-record XML file.
XMLConfigParser This class parses TextIndexer configuration XML files.
XMLIndexSource Supplies a single file containing a single record to the XMLTextProcessor.
XMLTextProcessor This class performs the actual parsing of the XML source text files and generates index information for it.
XtfSpecialTokensFilter The XtfSpecialTokensFilter class is used by the XTFTextAnalyzer class to convert special "bump" count values in text chunks to actual position increments for words prior to adding them to a Lucene index.
XTFTextAnalyzer The XTFTextAnalyzer class performs the task of breaking up a contiguous chunk of text into a list of separate words (tokens, in Lucene parlance).
 

Exception Summary
TextIndexerException This exception is thrown by classes related to the textIndexer tool.
 

Package org.cdlib.xtf.textIndexer Description

The TextIndexer class is the main command-line interface, while XMLTextProcessor does most of the heavy lifting: scanning documents, breaking them into chunks, and passing the chunks to Lucene.
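
As a rough illustration of the "breaking them into chunks" step, the sketch below splits a word list into fixed-size chunks that overlap their neighbors. It is not XMLTextProcessor's actual code: the class name ChunkSketch, the method chunk, and the size and overlap parameters are all hypothetical, chosen only to show the general idea of handing an indexer small overlapping units instead of whole documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkSketch {
    /**
     * Splits words into chunks of at most `size` words. Consecutive chunks
     * share `overlap` words, so a phrase that straddles a chunk boundary
     * still appears whole in at least one chunk (if it is short enough).
     * Hypothetical helper; names and parameters are assumptions.
     */
    public static List<List<String>> chunk(List<String> words, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        List<List<String>> chunks = new ArrayList<>();
        int step = size - overlap; // how far each new chunk advances
        for (int start = 0; start < words.size(); start += step) {
            int end = Math.min(start + size, words.size());
            chunks.add(new ArrayList<>(words.subList(start, end)));
            if (end == words.size()) {
                break; // the last word has been covered; stop here
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
        // In a real indexer, each chunk would be handed to the index writer.
        for (List<String> c : chunk(words, 4, 2)) {
            System.out.println(String.join(" ", c));
        }
    }
}
```

With a chunk size of 4 and an overlap of 2, the seven-word input yields three chunks, each starting two words after the previous one. Real chunk sizes would be far larger, and presumably come from the index configuration rather than from hard-coded values.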