org.apache.lucene.bigram
Class BigramStopFilter

Object
  extended by TokenStream
      extended by TokenFilter
          extended by BigramStopFilter

public class BigramStopFilter
extends TokenFilter

Speeds up queries by removing stop words from the index while still allowing them to be queried. Each stop word is joined to its neighboring (non-stop) word to form what Doug Cutting of Lucene fame calls "n-grams", simplified here to "bi-grams" (n-grams of only two words). For example, "man of the year" would be indexed as "man man-of the-year year". A query for the exact phrase "man of the year" would then search for "man-of the-year" and get the correct hit without scanning the very large number of occurrences of "of" and "the".
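The example above can be sketched with plain strings standing in for Lucene tokens. This is a minimal illustration, not the class's actual implementation: BigramSketch, indexTerms and phraseTerms are hypothetical names, and the rule that only stop/non-stop pairs are joined (so no "of-the" term is produced) is an assumption chosen to reproduce the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: plain strings stand in for Lucene tokens.
final class BigramSketch {

    // Index-time form: keep every non-stop word, and add a bigram wherever a
    // stop word touches a non-stop neighbor. Stop words are never emitted alone.
    static List<String> indexTerms(List<String> words, Set<String> stopSet) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String cur = words.get(i);
            boolean curStop = stopSet.contains(cur);
            if (!curStop) {
                out.add(cur);
            }
            if (i + 1 < words.size()) {
                String next = words.get(i + 1);
                boolean nextStop = stopSet.contains(next);
                // Assumption: only stop/non-stop pairs are joined.
                if (curStop != nextStop) {
                    out.add(cur + "-" + next);
                }
            }
        }
        return out;
    }

    // Query-time form for an exact phrase: keep the bigrams, and keep a plain
    // word only when no bigram already covers it.
    static List<String> phraseTerms(List<String> words, Set<String> stopSet) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String cur = words.get(i);
            boolean hasNext = i + 1 < words.size();
            boolean curStop = stopSet.contains(cur);
            boolean prevStop = i > 0 && stopSet.contains(words.get(i - 1));
            boolean nextStop = hasNext && stopSet.contains(words.get(i + 1));
            if (!curStop && !prevStop && !nextStop) {
                out.add(cur);
            }
            if (hasNext && curStop != nextStop) {
                out.add(cur + "-" + words.get(i + 1));
            }
        }
        return out;
    }
}
```

With the stop set {"of", "the"} and the words "man", "of", "the", "year", indexTerms yields "man", "man-of", "the-year", "year" and phraseTerms yields "man-of", "the-year", matching the example.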


Field Summary
private  int accumIncrement
          Accumulates position increment of removed tokens
private  boolean firstTime
          True until next() is called for the first time
private  int inputPos
          Tracks the position of input tokens, for debugging
private  Token nextToken
          The next token to process
private  int outputPos
          Tracks the position of output tokens, for debugging
private  Token outputQueue
          Queue of output tokens, only required in some cases
private  Set stopSet
          Set of stop-words (e.g. "the", "a", "and", etc.) to remove
static Object tester
          Basic regression test
 
Fields inherited from class TokenFilter
input
 
Constructor Summary
BigramStopFilter(TokenStream input, Set stopSet)
          Construct a token stream that filters the words in 'stopSet' out of 'input'.
 
Method Summary
private  Token glomToken(Token token1, Token token2, int increment)
          Constructs a new token, drawing the start position, position increment, and end position from the specified tokens.
protected  boolean isStopWord(String word)
          Tells whether the given word is a stop-word.
static Set makeStopSet(String stopWords)
          Make a stop set given a space, comma, or semicolon delimited list of stop words.
 Token next()
          Retrieve the next token in the stream.
private  Token nextInput()
          Retrieves the next token from the input stream, properly tracking the input position.
 Token nextInternal()
          Retrieve the next token in the stream.
 
Methods inherited from class TokenFilter
close
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

stopSet

private Set stopSet
Set of stop-words (e.g. "the", "a", "and", etc.) to remove


firstTime

private boolean firstTime
True until next() is called for the first time


nextToken

private Token nextToken
The next token to process


outputQueue

private Token outputQueue
Queue of output tokens, only required in some cases


accumIncrement

private int accumIncrement
Accumulates position increment of removed tokens


outputPos

private int outputPos
Tracks the position of output tokens, for debugging


inputPos

private int inputPos
Tracks the position of input tokens, for debugging


tester

public static final Object tester
Basic regression test

Constructor Detail

BigramStopFilter

public BigramStopFilter(TokenStream input,
                        Set stopSet)
Construct a token stream that filters the words in 'stopSet' out of 'input'.

Parameters:
input - Input stream of tokens to process
stopSet - Set of stop words to filter out. The easiest way to build one is to call makeStopSet().
Method Detail

makeStopSet

public static Set makeStopSet(String stopWords)
Make a stop set given a space, comma, or semicolon delimited list of stop words.

Parameters:
stopWords - String of words to make into a set
Returns:
A stop word set suitable for use when constructing a BigramStopFilter.
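The parsing described above might look roughly like the following sketch. StopSetSketch is a hypothetical stand-in, not the real method body; in particular, whether the actual makeStopSet() changes the case of the words is not stated here, so this version leaves them untouched.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch only: a stand-in for the delimiter handling.
final class StopSetSketch {

    // Splits on any run of whitespace, commas, or semicolons.
    static Set<String> makeStopSet(String stopWords) {
        Set<String> set = new LinkedHashSet<>();
        for (String word : stopWords.split("[\\s,;]+")) {
            // A leading delimiter produces one empty piece; skip it.
            if (!word.isEmpty()) {
                set.add(word);
            }
        }
        return set;
    }
}
```

Under these assumptions, the string "a, an;the  of" produces a set of the four words "a", "an", "the" and "of".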

next

public Token next()
           throws IOException
Retrieve the next token in the stream. Adds a layer of checking on top, to make absolutely sure that we don't accidentally introduce extra position increments, or miss some.

Specified by:
next in class TokenStream
Throws:
IOException
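One way to picture the position check is as a pair of running totals, in the spirit of the inputPos and outputPos fields. The sketch below is an assumption about what such a check could look like, not a description of the actual code: PositionCheckSketch and its methods are hypothetical, as is the rule that the output total may lag behind the input total mid-stream but must equal it at the end.

```java
// Illustrative sketch only: tracks summed position increments on both sides.
final class PositionCheckSketch {
    private int inputPos = 0;   // sum of increments read from the input stream
    private int outputPos = 0;  // sum of increments handed back to the caller

    void onInput(int positionIncrement) {
        inputPos += positionIncrement;
    }

    void onOutput(int positionIncrement) {
        outputPos += positionIncrement;
    }

    // Assumed invariant: output never runs ahead of input, and once the
    // stream is exhausted no increment has been lost or invented.
    boolean consistent(boolean endOfStream) {
        return endOfStream ? outputPos == inputPos : outputPos <= inputPos;
    }
}
```

For instance, if the four input words of "man of the year" each carry an increment of 1, and the output terms "man", "man-of", "the-year", "year" were assigned increments 1, 0, 2 and 1 (hypothetical values, with the 2 absorbing the removed "of"), both totals come to 4 and the check passes.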

nextInternal

public Token nextInternal()
                   throws IOException
Retrieve the next token in the stream.

Throws:
IOException

nextInput

private Token nextInput()
                 throws IOException
Retrieves the next token from the input stream, properly tracking the input position.

Throws:
IOException

isStopWord

protected boolean isStopWord(String word)
Tells whether the given word is a stop-word. Can be overridden for special processing.


glomToken

private Token glomToken(Token token1,
                        Token token2,
                        int increment)
Constructs a new token, drawing the start position, position increment, and end position from the specified tokens.
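A rough sketch of this kind of token joining follows. It uses a hypothetical SimpleToken class instead of Lucene's Token, and it assumes that the start comes from token1, the end from token2, the position increment from the increment argument, and that the two texts are joined with a hyphen as in the "man-of" example. None of these details are confirmed by this page.

```java
// Illustrative sketch only: SimpleToken is a hypothetical stand-in for Token.
final class GlomSketch {

    static final class SimpleToken {
        final String text;
        final int start;              // start offset in the source text
        final int end;                // end offset in the source text
        final int positionIncrement;  // distance from the previous token

        SimpleToken(String text, int start, int end, int positionIncrement) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.positionIncrement = positionIncrement;
        }
    }

    // Joins two tokens into one bigram token spanning both of them.
    static SimpleToken glom(SimpleToken token1, SimpleToken token2, int increment) {
        return new SimpleToken(
            token1.text + "-" + token2.text,  // assumed hyphen separator
            token1.start,                     // span starts where token1 starts
            token2.end,                       // span ends where token2 ends
            increment);                       // caller decides the increment
    }
}
```

Joining "man" (offsets 0 to 3) with "of" (offsets 4 to 6) and an increment of 0 gives a token "man-of" covering offsets 0 to 6 at the same position as "man".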