public class XTFTextAnalyzer
extends Analyzer

The XTFTextAnalyzer class performs the task of breaking up a contiguous chunk of text into a list of separate words (tokens in Lucene parlance.) The resulting list of words is what Lucene iterates through and adds to its database for an index.

Tokenizing
The first phase is the conversion of the contiguous text into a list of separate tokens. This step is performed by the FastTokenizer class. This class uses a set of rules to separate words in western text from the spacing and punctuation that normally accompanies them. The FastTokenizer class uses the same basic tokenizing rules as the Lucene StandardAnalyzer class, but has been optimized for speed.
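The basic idea of rule-based tokenization can be sketched in plain Java. This is a toy stand-in, not FastTokenizer's actual rule set (which follows the more nuanced StandardAnalyzer conventions): here, runs of letters and digits form tokens, and everything else is treated as spacing or punctuation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal rule-based tokenizer sketch: letters and digits form tokens,
// spacing and punctuation separate them. FastTokenizer's real rules are
// more elaborate, but the overall shape is the same.
public class SimpleTokenizer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // punctuation/space ends a token
                current.setLength(0);
            }
        }
        if (current.length() > 0)
            tokens.add(current.toString()); // flush the final token
        return tokens;
    }
}
```

For example, tokenizing "Hello, world!" yields the two tokens "Hello" and "world"; the comma, space, and exclamation mark are discarded.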
Special Token Filtering
In the process of creating chunks of text for indexing, the Text Indexer program inserts virtual words and other special tokens that help it relate the chunks of text stored in the Lucene index back to the original XML source text. The XTFTextAnalyzer looks for those special tokens, removes them from the token list, and translates them into position increments for the first non-special tokens they precede. For more about special token filtering, see the XtfSpecialTokensFilter class.
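The remove-and-translate step can be illustrated with a small sketch. The marker convention here (a "xtf:" prefix) and the Token pair type are hypothetical stand-ins; the real XtfSpecialTokensFilter operates on Lucene Token objects and uses XTF's own special-token markers.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of special-token filtering: tokens matching a (hypothetical)
// marker prefix are dropped from the list, and each dropped token adds
// to the position increment of the next ordinary token, preserving the
// original word positions.
public class SpecialTokenFilter {
    public record Token(String text, int posIncrement) {}

    public static List<Token> filter(List<String> words, String markerPrefix) {
        List<Token> out = new ArrayList<>();
        int increment = 1; // default gap between adjacent tokens
        for (String w : words) {
            if (w.startsWith(markerPrefix)) {
                increment++; // removed special token widens the gap
            } else {
                out.add(new Token(w, increment));
                increment = 1;
            }
        }
        return out;
    }
}
```

Filtering the list ["xtf:n5", "cat", "dog"] with marker prefix "xtf:" drops the special token but gives "cat" a position increment of 2, so "cat" and "dog" keep their original positions in the index.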
Lowercase Conversion
The next step performed by the XTFTextAnalyzer is to convert all the remaining tokens in the token list to lowercase. Converting both indexed words and search phrases to lowercase has the effect of making searches case insensitive.
Plural and Accent Folding
Next, the XTFTextAnalyzer converts plural words to singular form using a WordMap, and strips diacritics from words using a CharMap. These conversions can yield more complete search results.
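The two folding steps can be sketched as follows. The plural map here is a toy stand-in for the WordMap loaded from IndexInfo.pluralMapPath, and diacritic stripping is approximated with Unicode decomposition rather than XTF's CharMap:

```java
import java.text.Normalizer;
import java.util.Map;

// Sketch of plural and accent folding. A real WordMap would be loaded from
// a configured word list; the tiny map below is illustrative only.
public class WordFolder {
    private static final Map<String, String> PLURALS =
        Map.of("cats", "cat", "churches", "church");

    public static String fold(String word) {
        // Plural -> singular via map lookup, passing unknown words through.
        String singular = PLURALS.getOrDefault(word, word);
        // Decompose accented characters, then drop the combining marks.
        String decomposed = Normalizer.normalize(singular, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }
}
```

With this sketch, "cats" folds to "cat" and "café" folds to "cafe", so a search for either form finds the same documents.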
Stop-Word Filtering
The next step performed by the XTFTextAnalyzer is to remove certain words called stop-words. Stop-words are words that by themselves are not worth indexing, such as a, the, and, of, etc. These words appear so many times in English text that indexing all their occurrences just slows down searching for other words without providing any real value. Consequently, they are filtered out of the token list.
It should be noted, however, that while stop-words are filtered, they are not simply omitted from the database. This is because stop-words do impart special meaning when they appear in certain phrases or titles. For example, in Man of War the word of doesn't simply act as a preposition, but rather helps form the common name for a type of jellyfish. Similarly, the word and in the phrase black and white doesn't simply join black and white, but forms a phrase meaning a condition where no ambiguity exists. In these cases it is important to preserve the stop-words, because ignoring them would produce undesired matches. For example, in a search for the words "man of war" (meaning the jellyfish), ignoring stop-words would produce "man and war", "man in war", and "man against war" as undesired matches.
To record stop-words in special phrases without slowing searching, the XTFTextAnalyzer performs an operation called bi-gramming as its third phase of filtering. For more details about how bi-grams actually work, see the BigramStopFilter class.
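The essence of bi-gramming can be sketched as follows. Stop-words are never indexed on their own; instead they are glued to their neighbors, so a phrase like "man of war" remains searchable without bloating the index with bare "of" entries. The "~" joining convention is illustrative only; BigramStopFilter's actual output format differs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of bi-gramming: ordinary words are indexed alone as usual, but
// any adjacent pair involving a stop-word is also emitted as a fused
// bi-gram token, preserving phrase searchability.
public class BigramStopWords {
    public static List<String> bigram(List<String> tokens, Set<String> stopSet) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String cur = tokens.get(i);
            boolean curIsStop = stopSet.contains(cur);
            if (!curIsStop)
                out.add(cur); // real words are still indexed on their own
            // Fuse any pair where either member is a stop-word.
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (curIsStop || stopSet.contains(next))
                    out.add(cur + "~" + next);
            }
        }
        return out;
    }
}
```

For ["man", "of", "war"] with "of" as a stop-word, this emits "man", "man~of", "of~war", and "war": a phrase query for "man of war" matches the fused tokens, while a bare search for "of" finds nothing.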
Adding End Tokens
As a final step, the analyzer double-indexes the first and last tokens of fields that contain the special start-of-field and end-of-field characters. Essentially, those tokens are indexed with and without the markers. This enables exact matching at query time, since Lucene offers no other way to determine the end of a field. Note that this processing is only performed on non-text fields (i.e. meta-data fields.)
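The double-indexing idea can be sketched like this. The specific marker characters below are hypothetical placeholders; XTF's actual start-of-field and end-of-field characters differ.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of end-token handling for meta-data fields: a token carrying a
// (hypothetical) start/end marker character is emitted both with and
// without the marker, so queries can require a match at the exact start
// or end of the field while ordinary queries still match the plain form.
public class EndTokenExpander {
    public static final char START_MARK = '\u0001'; // assumed marker
    public static final char END_MARK = '\u0002';   // assumed marker

    public static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t); // marked form (if any) kept for exact matching
            String stripped = t.replace(String.valueOf(START_MARK), "")
                               .replace(String.valueOf(END_MARK), "");
            if (!stripped.equals(t))
                out.add(stripped); // unmarked form for ordinary matching
        }
        return out;
    }
}
```

Only the first and last tokens of a field carry markers, so only they are doubled; interior tokens pass through unchanged.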
Once the XTFTextAnalyzer has completed its work, it returns the final list of tokens back to Lucene to be added to the index database.
Modifier and Type | Field and Description |
---|---|
private CharMap | accentMap: The set of accented chars to remove diacritics from. |
private HashSet | facetFields: List of fields marked as "facets", which thus get special tokenization. |
private HashSet | misspelledFields: List of fields marked as possibly misspelled, which thus don't get added to the spelling correction dictionary. |
private WordMap | pluralMap: The set of words to change from plural to singular. |
private SpellWriter | spellWriter: If building a spelling correction dictionary, this is the writer. |
private String | srcText: A reference to the contiguous source text block to be tokenized and filtered. |
private Set | stopSet: The list of stop-words currently set for this filter. |
Constructor and Description |
---|
XTFTextAnalyzer(Set stopSet, WordMap pluralMap, CharMap accentMap): Constructor. |
Modifier and Type | Method and Description |
---|---|
void | addFacetField(String fieldName): Mark a field as a "facet field" that will receive special tokenization to deal with hierarchy. |
void | addMisspelledField(String fieldName): Mark a field as a "misspelled field" that won't be added to the spelling correction dictionary. |
void | clearFacetFields(): Clears the list of fields marked as facets. |
void | clearMisspelledFields(): Clears the list of fields marked as misspelled. |
void | setSpellWriter(SpellWriter writer): Sets a writer to receive tokenized words just before they are indexed. |
TokenStream | tokenStream(String fieldName, Reader reader): Convert a chunk of contiguous text to a list of tokens, ready for indexing. |
private Set stopSet

The list of stop-words currently set for this filter.

private WordMap pluralMap

The set of words to change from plural to singular.

private CharMap accentMap

The set of accented chars to remove diacritics from.

private String srcText

A reference to the contiguous source text block to be tokenized and filtered. (Used by the tokenStream() method to read the source text for filter operations in random access fashion.)

private HashSet facetFields

List of fields marked as "facets", which thus get special tokenization.

private HashSet misspelledFields

List of fields marked as possibly misspelled, which thus don't get added to the spelling correction dictionary.

private SpellWriter spellWriter

If building a spelling correction dictionary, this is the writer.
public XTFTextAnalyzer(Set stopSet, WordMap pluralMap, CharMap accentMap)

Constructor. Creates a new XTFTextAnalyzer and initializes its member variables.

Parameters:
stopSet - The set of stop-words to be used when filtering text. For more information about stop-words, see the XTFTextAnalyzer class description.
pluralMap - The set of plural words to de-pluralize when filtering text. See IndexInfo.pluralMapPath for more information.
accentMap - The set of accented chars to remove diacritics from when filtering text. See IndexInfo.accentMapPath for more information.

To use this class, create an XTFTextAnalyzer and pass it to a Lucene IndexWriter instance. Lucene will then call the tokenStream() method each time a chunk of text is added to the index.

public void clearFacetFields()

Clears the list of fields marked as facets.
public void addFacetField(String fieldName)

Mark a field as a "facet field" that will receive special tokenization to deal with hierarchy.

Parameters:
fieldName - Name of the field to consider a facet field.

public void clearMisspelledFields()

Clears the list of fields marked as misspelled.

public void addMisspelledField(String fieldName)

Mark a field as a "misspelled field" that won't be added to the spelling correction dictionary.

Parameters:
fieldName - Name of the field to consider a misspelled field.

public void setSpellWriter(SpellWriter writer)

Sets a writer to receive tokenized words just before they are indexed.

Parameters:
writer - The writer to add words to.

public TokenStream tokenStream(String fieldName, Reader reader)

Convert a chunk of contiguous text to a list of tokens, ready for indexing.

Overrides:
tokenStream in class Analyzer

Parameters:
fieldName - The name of the Lucene database field that the resulting tokens will be placed in. Used to decide which filters need to be applied to the text.
reader - A Reader object from which the source text is read.

Returns:
A TokenStream containing the tokens that should be indexed by the Lucene database.