Object
TokenStream
TokenFilter
BigramStopFilter
public class BigramStopFilter extends TokenFilter
Optimizes query speed by removing stop words from the token stream while still allowing them to be queried. It does this by joining each stop word to its neighboring (non-stop) words to form what Doug Cutting of Lucene fame calls "n-grams", simplified here to "bi-grams" (n-grams of only two words). For example, "man of the year" would be indexed as "man man-of the-year year". A query for the exact phrase "man of the year" would then search for "man-of the-year" and get the correct hit without needing to scan the enormous number of occurrences of "of" and "the".
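The joining rule described above can be illustrated with a minimal sketch. This is not the class's actual implementation: it assumes plain strings instead of Lucene tokens, ignores position increments and offsets, and infers the rule from the single example given (emit each non-stop word, plus a bigram for every adjacent pair in which exactly one word is a stop word). The names `BigramSketch` and `bigrams` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not the BigramStopFilter implementation.
class BigramSketch {
    /**
     * Emits each non-stop word unchanged, plus a joined bigram for every
     * adjacent pair in which exactly one word is a stop word. Stop words
     * are never emitted on their own. (Rule assumed from the example
     * "man of the year" in the class description.)
     */
    static List<String> bigrams(List<String> words, Set<String> stopSet) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String cur = words.get(i);
            boolean curStop = stopSet.contains(cur);
            if (!curStop) {
                out.add(cur); // non-stop words pass through
            }
            if (i + 1 < words.size()) {
                String next = words.get(i + 1);
                boolean nextStop = stopSet.contains(next);
                if (curStop != nextStop) {
                    out.add(cur + "-" + next); // stop word joined to its neighbor
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopSet = new HashSet<>(Arrays.asList("of", "the"));
        List<String> words = Arrays.asList("man", "of", "the", "year");
        // prints [man, man-of, the-year, year]
        System.out.println(bigrams(words, stopSet));
    }
}
```

Under these assumptions the output matches the indexed form quoted in the description. How the real filter treats runs of consecutive stop words or stop words at the ends of the stream is not stated in this page, so the sketch should not be read as defining that behavior.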
Field Summary | |
---|---|
private int | accumIncrement: Accumulates position increment of removed tokens |
private boolean | firstTime: true before next() is called for the first time |
private int | inputPos: Tracks the position of input tokens, for debugging |
private Token | nextToken: The next token to process |
private int | outputPos: Tracks the position of output tokens, for debugging |
private Token | outputQueue: Queue of output tokens, only required in some cases |
private Set | stopSet: Set of stop-words (e.g. |
static Object | tester: Basic regression test |
Fields inherited from class TokenFilter |
---|
input |
Constructor Summary | |
---|---|
BigramStopFilter(TokenStream input, Set stopSet): Construct a token stream to filter 'stopWords' out of 'input'. |
Method Summary | |
---|---|
private Token | glomToken(Token token1, Token token2, int increment): Constructs a new token, drawing the start position, position increment, and end position from the specified tokens. |
protected boolean | isStopWord(String word): Tells whether the token is a stop-word. |
static Set | makeStopSet(String stopWords): Make a stop set given a space, comma, or semicolon delimited list of stop words. |
Token | next(): Retrieve the next token in the stream. |
private Token | nextInput(): Retrieves the next token from the input stream, properly tracking the input position. |
Token | nextInternal(): Retrieve the next token in the stream. |
Methods inherited from class TokenFilter |
---|
close |
Methods inherited from class Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private Set stopSet
private boolean firstTime
private Token nextToken
private Token outputQueue
private int accumIncrement
private int outputPos
private int inputPos
public static final Object tester
Constructor Detail |
---|
public BigramStopFilter(TokenStream input, Set stopSet)
Parameters:
input - Input stream of tokens to process
stopSet - Set of stop words to filter out. This can be most easily made by calling makeStopSet().
Method Detail |
---|
public static Set makeStopSet(String stopWords)
Parameters:
stopWords - String of words to make into a set
Returns:
A set of stop words for use with BigramStopFilter.
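As a rough illustration of what building a stop set from a delimited string involves, here is a minimal sketch. It is not the body of makeStopSet(): the class name `StopSetSketch` and method name `parseStopWords` are hypothetical, and the choice to keep words exactly as written (no case folding) is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch only; not the makeStopSet() implementation.
class StopSetSketch {
    /** Splits on runs of spaces, commas, or semicolons and collects the words. */
    static Set<String> parseStopWords(String stopWords) {
        Set<String> set = new LinkedHashSet<>();
        for (String word : stopWords.split("[ ,;]+")) {
            // split() can yield an empty first element when the input is
            // empty or starts with a delimiter, so skip empties.
            if (!word.isEmpty()) {
                set.add(word);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        // Mixed delimiters are all accepted.
        System.out.println(parseStopWords("a, an;the  of"));
    }
}
```

A set built this way holds one entry per distinct word regardless of which delimiter separated them.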
public Token next() throws IOException
next in class TokenStream
Throws: IOException
public Token nextInternal() throws IOException
Throws: IOException
private Token nextInput() throws IOException
Throws: IOException
protected boolean isStopWord(String word)
private Token glomToken(Token token1, Token token2, int increment)
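To make the idea of combining two tokens concrete, here is a small sketch. It does not use the Lucene Token class and is not the actual glomToken() code. The `Tok` type is a hypothetical stand-in, and the sketch assumes the start offset comes from the first token, the end offset from the second, the position increment from the argument, and that the two texts are joined with "-" as in the class description's example.

```java
// Illustrative sketch only; Tok is a hypothetical stand-in for a Lucene Token.
class GlomSketch {
    static final class Tok {
        final String text;
        final int start;      // start character offset
        final int end;        // end character offset
        final int increment;  // position increment relative to previous token

        Tok(String text, int start, int end, int increment) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.increment = increment;
        }
    }

    /**
     * Builds a bigram token spanning both inputs (assumed behavior):
     * start from t1, end from t2, increment as supplied by the caller.
     */
    static Tok glom(Tok t1, Tok t2, int increment) {
        return new Tok(t1.text + "-" + t2.text, t1.start, t2.end, increment);
    }

    public static void main(String[] args) {
        Tok man = new Tok("man", 0, 3, 1);
        Tok of = new Tok("of", 4, 6, 1);
        Tok joined = glom(man, of, 0);
        // prints man-of [0,6] +0
        System.out.println(joined.text + " [" + joined.start + "," + joined.end + "] +" + joined.increment);
    }
}
```

An increment of 0 in the usage example places the bigram at the same position as the preceding token; whether the real filter does this is not stated on this page.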