org.cdlib.xtf.textIndexer
Class XtfSpecialTokensFilter

Object
  extended by TokenStream
      extended by TokenFilter
          extended by XtfSpecialTokensFilter

public class XtfSpecialTokensFilter
extends TokenFilter

The XtfSpecialTokensFilter class is used by the XTFTextAnalyzer class to convert special "bump" count values in text chunks to actual position increments for words prior to adding them to a Lucene index.

Lucene adds words to an index database by first converting a contiguous chunk of text into a list of discrete words (tokens, in Lucene parlance). Then, when the Lucene IndexWriter.addDocument() function is called, Lucene traverses the list of tokens and calls an instance of a TokenFilter-derived class to pre-process each token. The resulting output from the filter is what Lucene actually adds to the database.

Each token entry in the list consists of the token (word) itself and its position increment from the previous token (referred to as "word bump" in other text indexer related classes). Since a special bump count value in the original text looks like any other token to Lucene, Lucene simply passes it on to the XtfSpecialTokensFilter to pre-process. The filter recognizes the special token, removes it from the token list, converts it to a number, and sets that number as the position increment for the first non-special token that follows. The output of the XtfSpecialTokensFilter is then a list of actual tokens to be indexed and their associated position increments.
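The conversion described above can be illustrated with a minimal, Lucene-free sketch. It is not the actual XTF implementation: the class BumpFilterSketch, the Tok type, and the "#bump:N" marker format are all hypothetical stand-ins chosen for illustration, and the handling of several consecutive bump tokens (the last one wins) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of org.cdlib.xtf.textIndexer.
final class BumpFilterSketch {

    /** A token and its position increment relative to the previous token. */
    static final class Tok {
        final String text;
        final int posIncr;

        Tok(String text, int posIncr) {
            this.text = text;
            this.posIncr = posIncr;
        }

        @Override
        public String toString() {
            return text + "(+" + posIncr + ")";
        }
    }

    // Assumed marker format for a special bump token. The real XTF
    // encoding is not described here and is likely different.
    static final String BUMP_PREFIX = "#bump:";

    /**
     * Drops special bump tokens and applies each bump value as the
     * position increment of the first ordinary token that follows.
     * Ordinary tokens with no preceding bump get an increment of 1.
     */
    static List<Tok> filter(List<String> rawTokens) {
        List<Tok> out = new ArrayList<>();
        int pendingBump = -1; // -1 means no bump is waiting to be applied

        for (String t : rawTokens) {
            if (t.startsWith(BUMP_PREFIX)) {
                // Special token: remember its numeric value, emit nothing.
                // Assumption: if several bumps occur in a row, the last wins.
                pendingBump = Integer.parseInt(t.substring(BUMP_PREFIX.length()));
                continue;
            }
            int incr = (pendingBump >= 0) ? pendingBump : 1;
            out.add(new Tok(t, incr));
            pendingBump = -1;
        }
        return out;
    }
}
```

For example, calling BumpFilterSketch.filter(java.util.Arrays.asList("the", "cat", "#bump:5", "sat")) returns three tokens, "the", "cat", and "sat", with position increments 1, 1, and 5 respectively.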

For more information on word bump and virtual words, see the XMLTextProcessor class and its member function insertVirtualWords().


Field Summary
private  String srcText
          A reference to the original contiguous text to which the input token list corresponds.
 
Fields inherited from class TokenFilter
input
 
Constructor Summary
XtfSpecialTokensFilter(TokenStream srcTokens, String srcText)
          Constructor for the XtfSpecialTokensFilter.
 
Method Summary
 Token next()
          Return the next output token from this filter.
 
Methods inherited from class TokenFilter
close
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

srcText

private String srcText
A reference to the original contiguous text to which the input token list corresponds. See the constructor for more about how this reference is used.

Constructor Detail

XtfSpecialTokensFilter

public XtfSpecialTokensFilter(TokenStream srcTokens,
                              String srcText)
Constructor for the XtfSpecialTokensFilter.

Parameters:
srcTokens - The source token stream to filter.
srcText - The original source text chunk from which the source token stream was derived.

Notes:
This class stores a reference to the original chunk of text from which the source token stream is derived, so that the filter can perform look-back and look-ahead operations to identify special tokens by their markers. This is necessary because the standard tokenizer that creates the source token stream for this filter treats the markers as punctuation rather than as part of the token, and strips them out.
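As a rough sketch of the look-back/look-ahead idea, the following checks the characters immediately surrounding a token in the original text, using the token's start and end offsets (such as those a Lucene Token carries). The class MarkerLookupSketch and the marker characters '{' and '}' are hypothetical; the markers XTF actually uses are not specified here.

```java
// Hypothetical illustration only; not part of org.cdlib.xtf.textIndexer.
final class MarkerLookupSketch {

    // Assumed marker characters surrounding a special token in the source
    // text. These are placeholders, not XTF's real markers.
    static final char OPEN_MARK = '{';
    static final char CLOSE_MARK = '}';

    /**
     * Returns true if the token occupying srcText[start, end) is wrapped
     * in the assumed markers. The tokenizer has already stripped the
     * markers from the token itself, so the original text is consulted.
     *
     * @param srcText the original text chunk
     * @param start   offset of the token's first character
     * @param end     offset just past the token's last character
     */
    static boolean isSpecial(String srcText, int start, int end) {
        // Look back one character, guarding against the start of the text.
        boolean markedBefore =
            start > 0 && srcText.charAt(start - 1) == OPEN_MARK;

        // Look ahead one character, guarding against the end of the text.
        boolean markedAfter =
            end < srcText.length() && srcText.charAt(end) == CLOSE_MARK;

        return markedBefore && markedAfter;
    }
}
```

With the text "the cat {5} sat", the token "5" spans offsets 9 to 10 and is reported as special, while "cat" (offsets 4 to 7) and "sat" (offsets 12 to 15) are not.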

Method Detail

next

public Token next()
           throws IOException
Return the next output token from this filter.

Called by Lucene to retrieve the next non-special token from this filter.

Specified by:
next in class TokenStream
Returns:
The next non-special token output by this filter.

Throws:
IOException - Any exceptions generated by the look-back/look-ahead character processing performed by this function.

Notes:
For more information about the filtering performed by this function, see the XtfSpecialTokensFilter class description.