org.apache.lucene.bigram
Class BigramQueryRewriter

Object
  extended by QueryRewriter
      extended by BigramQueryRewriter
Direct Known Subclasses:
XtfBigramQueryRewriter

public class BigramQueryRewriter
extends QueryRewriter

Rewrites a query to eliminate stop words by combining them with adjacent non-stop-words, forming "bi-grams" (n-grams of two words). The process is involved because bi-gramming across NEAR and OR queries is complex.


Nested Class Summary
 
Nested classes/interfaces inherited from class QueryRewriter
QueryRewriter.SpanClauseJoiner
 
Field Summary
protected  int maxSlop
          Maximum slop to allow in a query, based on the index being queried
protected  HashSet removedTerms
          Keeps track of all stop-words removed from the query
protected  Set stopSet
          Set of stop-words (e.g. "the", "a", "and", etc.) to remove
 
Constructor Summary
BigramQueryRewriter(Set stopSet, int maxSlop)
          Constructs a rewriter using the given stopword set.
 
Method Summary
protected  SpanQuery[] bigramQueries(SpanQuery[] clauses, int slop, QueryRewriter.SpanClauseJoiner joiner)
          Removes stop words from a set of consecutive queries by combining them with adjacent non-stop-words.
protected  SpanQuery bigramTermsExact(Query[] queries, String[] terms, QueryRewriter.SpanClauseJoiner joiner)
          Given a sequence of terms consisting of mixed stop and real words, figure out the bigrammed sequence required to get an exact match with the index.
protected  SpanQuery bigramTermsInexact(Query[] queries, String[] terms, QueryRewriter.SpanClauseJoiner joiner)
          Given a sequence of terms consisting of mixed stop and real words, figure out the bigrammed sequence that will give hits on at least the real words, and give priority to ones that are near the closest stop words.
protected  SpanQuery convertToSpanQuery(Query q)
          Converts non-span queries to span queries, and passes span queries through unchanged.
protected  Term extractTerm(Object obj)
          Given a term query, span term query (or plain term), extract the Term itself.
protected  String extractTermText(Object obj)
          Given a term, term query, span term query (or plain string), extract the term text.
protected  SpanQuery glomInside(SpanChunkedNotQuery nq, SpanTermQuery term, boolean before)
          Gloms the term onto each clause within a NOT query.
protected  SpanQuery glomInside(SpanNotNearQuery nq, SpanTermQuery term, boolean before)
          Gloms the term onto each clause within a NOT query.
protected  SpanQuery glomInside(SpanOrQuery oq, SpanTermQuery term, boolean before)
          Gloms the term onto each clause within an OR query.
protected  Query glomQueries(Query q1, Query q2)
          Joins a stop word to a real word, or vice-versa.
static boolean isBigram(Set stopWords, String str)
          Determines if the given string is a bi-gram of a real word with a stop-word.
static Set makeStopSet(String stopWords)
          Make a stop set given a space, comma, or semicolon delimited list of stop words.
protected  Term newTerm(String field, String text)
          Construct a term given its text and field name.
protected  void reduceBoost(Query query)
          Reduces the boost factor of a query (typically the non-bigram of a pair in an OR) so that the bigram will get scored higher.
protected  Query rewrite(BooleanQuery bq)
          Rewrite a BooleanQuery.
protected  Query rewrite(SpanNearQuery q)
          Rewrite a span NEAR query.
protected  Query rewrite(SpanOrNearQuery q)
          Rewrite a span OR-NEAR query.
protected  Query rewrite(SpanOrQuery q)
          Rewrite a span-based OR query.
protected  Query rewriteClauses(Query oldQuery, SpanQuery[] oldClauses, boolean shuntSingle, boolean bigram, int slop, QueryRewriter.SpanClauseJoiner joiner)
          Utility function that takes care of rewriting a series of span query clauses.
 
Methods inherited from class QueryRewriter
combineBoost, copyBoost, copyBoost, forceRewrite, rewrite, rewrite, rewrite, rewrite, rewrite, rewrite, rewrite, rewrite, rewriteClauses, rewriteQuery
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

stopSet

protected Set stopSet
Set of stop-words (e.g. "the", "a", "and", etc.) to remove


maxSlop

protected int maxSlop
Maximum slop to allow in a query, based on the index being queried


removedTerms

protected HashSet removedTerms
Keeps track of all stop-words removed from the query

Constructor Detail

BigramQueryRewriter

public BigramQueryRewriter(Set stopSet,
                           int maxSlop)
Constructs a rewriter using the given stopword set.

Parameters:
stopSet - Set of stopwords to remove or bi-gram. This can be constructed easily by calling makeStopSet(String).
maxSlop - Maximum slop to allow in a query, based on the index being queried.
Method Detail

makeStopSet

public static Set makeStopSet(String stopWords)
Make a stop set given a space, comma, or semicolon delimited list of stop words.

Parameters:
stopWords - String of words to make into a set
Returns:
A stop word set suitable for use when constructing a BigramQueryRewriter.
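
The delimiter handling described above can be sketched in plain Java. This is a minimal illustration, not the class's actual implementation: StopSetSketch is a hypothetical stand-in, and whether the real method also normalizes case or trims other characters is not stated here.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for illustration; not the class documented on this page. */
final class StopSetSketch {
    /** Splits on runs of whitespace, commas, or semicolons and drops empty tokens. */
    static Set<String> makeStopSet(String stopWords) {
        Set<String> set = new LinkedHashSet<>();
        for (String word : stopWords.split("[\\s,;]+")) {
            if (!word.isEmpty()) {
                set.add(word); // assumption: words are stored exactly as given
            }
        }
        return set;
    }
}
```

With this sketch, the input "a, an;the  of" yields a set holding the four words a, an, the, and of.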

isBigram

public static boolean isBigram(Set stopWords,
                               String str)
Determines if the given string is a bi-gram of a real word with a stop-word.

Parameters:
stopWords - The set of stop-words
str - The string to check
Returns:
true if it's a bi-gram
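
A rough sketch of such a check follows. It assumes the two halves of a bi-gram are joined by a hyphen, as in the "man-of" notation used in the examples on this page; the separator actually stored in the index may differ. BigramCheckSketch and SEPARATOR are hypothetical names.

```java
import java.util.Set;

/** Hypothetical stand-in for illustration; not the class documented on this page. */
final class BigramCheckSketch {
    /** Assumed separator, taken from the "man-of" notation in this page's examples. */
    static final char SEPARATOR = '-';

    static boolean isBigram(Set<String> stopWords, String str) {
        int pos = str.indexOf(SEPARATOR);
        // No separator, or an empty half on either side: not a bi-gram.
        if (pos <= 0 || pos == str.length() - 1) return false;
        // More than one separator: not a two-word gram.
        if (str.indexOf(SEPARATOR, pos + 1) >= 0) return false;
        String first = str.substring(0, pos);
        String second = str.substring(pos + 1);
        // Assumption: one stop-word half is enough, so "of-the" also counts.
        return stopWords.contains(first) || stopWords.contains(second);
    }
}
```

Under these assumptions an ordinary hyphenated word such as "well-known" is rejected, because neither half is in the stop set.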

rewrite

protected Query rewrite(BooleanQuery bq)
Rewrite a BooleanQuery. Prohibited or allowed (not required) clauses that are single stop words will be removed. Required clauses will not have bi-gramming applied.

Overrides:
rewrite in class QueryRewriter
Parameters:
bq - The query to rewrite
Returns:
Rewritten version, or 'bq' unchanged if no changes are needed.
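
The clause filtering described above might look roughly like the following. The sketch models clauses with hypothetical Clause and Occur types rather than Lucene's own, and it assumes that a required clause is kept as-is even when its text is a stop word, which the description does not spell out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for illustration; not the class documented on this page. */
final class BooleanRewriteSketch {
    enum Occur { REQUIRED, ALLOWED, PROHIBITED }

    static final class Clause {
        final Occur occur;
        final String text;
        Clause(Occur occur, String text) { this.occur = occur; this.text = text; }
    }

    /** Drops allowed or prohibited clauses that are single stop words. */
    static List<Clause> dropOptionalStopWords(List<Clause> clauses, Set<String> stopSet) {
        List<Clause> kept = new ArrayList<>();
        for (Clause c : clauses) {
            boolean removable = c.occur != Occur.REQUIRED && stopSet.contains(c.text);
            if (!removable) kept.add(c);
        }
        // Mirror the documented contract: return the input itself when nothing changed.
        return kept.size() == clauses.size() ? clauses : kept;
    }
}
```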

rewrite

protected Query rewrite(SpanNearQuery q)
Rewrite a span NEAR query. Stop words will be bi-grammed into adjacent terms.

Overrides:
rewrite in class QueryRewriter
Parameters:
q - The query to rewrite
Returns:
Rewritten version, or 'q' unchanged if no changes are needed.

rewrite

protected Query rewrite(SpanOrNearQuery q)
Rewrite a span OR-NEAR query. Stop words will be bi-grammed into adjacent terms.

Overrides:
rewrite in class QueryRewriter
Parameters:
q - The query to rewrite
Returns:
Rewritten version, or 'q' unchanged if no changes are needed.

rewrite

protected Query rewrite(SpanOrQuery q)
Rewrite a span-based OR query. The procedure in this case is simple: remove all stop words, with no bi-gramming performed.

Overrides:
rewrite in class QueryRewriter
Parameters:
q - The query to rewrite
Returns:
Rewritten version, or 'q' unchanged if no changes are needed.

rewriteClauses

protected Query rewriteClauses(Query oldQuery,
                               SpanQuery[] oldClauses,
                               boolean shuntSingle,
                               boolean bigram,
                               int slop,
                               QueryRewriter.SpanClauseJoiner joiner)
Utility function that takes care of rewriting a series of span query clauses.

Parameters:
oldQuery - Query being rewritten
oldClauses - Clauses to rewrite
shuntSingle - true to allow single-clause result to be returned, false to force wrapping.
bigram - true to bigram stop-words, false to simply remove them
slop - if bigramming, 0 for phrase, non-zero for near
joiner - Handles joining new clauses into wrapper query
Returns:
New rewritten query, or 'oldQuery' if no changes.

bigramQueries

protected SpanQuery[] bigramQueries(SpanQuery[] clauses,
                                    int slop,
                                    QueryRewriter.SpanClauseJoiner joiner)
Removes stop words from a set of consecutive queries by combining them with adjacent non-stop-words.

Parameters:
clauses - array of queries to work on
slop - zero for exact matching, non-zero for 'near' matching.
joiner - used to join the resulting bi-grammed clauses
Returns:
original list, or a new query containing bi-grams

bigramTermsInexact

protected SpanQuery bigramTermsInexact(Query[] queries,
                                       String[] terms,
                                       QueryRewriter.SpanClauseJoiner joiner)
Given a sequence of terms consisting of mixed stop and real words, figure out the bigrammed sequence that will give hits on at least the real words, and give priority to ones that are near the closest stop words.

Examples:
"man of the world" -> "(man or man-of) near (the-world or world)"
"hello there" -> "hello there"
"it is not a problem" -> "(a-problem or problem)"

Parameters:
queries - Original queries in the sequence
terms - Corresponding term text of each query
joiner - Used to join the resulting bi-grammed clauses
Returns:
A new query possibly containing bi-grams
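
The inexact transformation can be sketched on plain strings. This is only an approximation that reproduces the three documented examples: InexactBigramSketch is a hypothetical stand-in, the hyphen separator follows this page's notation, and the result is returned as a list of OR-alternatives per real word instead of a SpanQuery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for illustration; not the class documented on this page. */
final class InexactBigramSketch {
    /**
     * For each real word, lists its alternatives in positional order:
     * the bi-gram with a preceding stop word, the word itself,
     * then the bi-gram with a following stop word.
     * The outer list corresponds to the clauses that would be joined with NEAR.
     */
    static List<List<String>> bigramInexact(String[] terms, Set<String> stopSet) {
        List<List<String>> clauses = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            if (stopSet.contains(terms[i])) continue; // stop words never stand alone
            List<String> alternatives = new ArrayList<>();
            if (i > 0 && stopSet.contains(terms[i - 1])) {
                alternatives.add(terms[i - 1] + "-" + terms[i]);
            }
            alternatives.add(terms[i]);
            if (i + 1 < terms.length && stopSet.contains(terms[i + 1])) {
                alternatives.add(terms[i] + "-" + terms[i + 1]);
            }
            clauses.add(alternatives);
        }
        return clauses;
    }
}
```

For "man of the world" with "of" and "the" as stop words, the sketch produces two clauses, one holding man and man-of and one holding the-world and world, which matches the first example above.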

convertToSpanQuery

protected SpanQuery convertToSpanQuery(Query q)
Converts non-span queries to span queries, and passes span queries through unchanged.

Parameters:
q - Query to convert (span or non-span)
Returns:
Equivalent SpanQuery.

newTerm

protected Term newTerm(String field,
                       String text)
Construct a term given its text and field name. This function is used instead of Term's constructor to add an extra check that the text is never a stop word.

Parameters:
field - Field being queried
text - Text for the new term
Returns:
A properly constructed Term, never a stop-word.
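
The idea of a guarded factory can be illustrated as follows. SketchTerm is a hypothetical stand-in for Lucene's Term so the example stays self-contained, and throwing IllegalStateException is an assumption: how the real method reacts to a stop word is not stated on this page.

```java
import java.util.Set;

/** Hypothetical stand-in for illustration; not the class documented on this page. */
final class GuardedTermSketch {
    /** Minimal field/text pair standing in for a Lucene Term. */
    static final class SketchTerm {
        final String field;
        final String text;
        SketchTerm(String field, String text) { this.field = field; this.text = text; }
    }

    private final Set<String> stopSet;

    GuardedTermSketch(Set<String> stopSet) { this.stopSet = stopSet; }

    /** Single choke point for term creation, so a bare stop word can never slip into a query. */
    SketchTerm newTerm(String field, String text) {
        if (stopSet.contains(text)) {
            // Assumed failure mode; the real check may be an assertion instead.
            throw new IllegalStateException("stop word used as term: " + text);
        }
        return new SketchTerm(field, text);
    }
}
```

Routing all term construction through one method keeps the stop-word check in a single place instead of repeating it at every call site.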

bigramTermsExact

protected SpanQuery bigramTermsExact(Query[] queries,
                                     String[] terms,
                                     QueryRewriter.SpanClauseJoiner joiner)
Given a sequence of terms consisting of mixed stop and real words, figure out the bigrammed sequence required to get an exact match with the index.

Examples:
"man of the world" -> "man-of of-the the-world"
"hello there" -> "hello there"
"it is not a problem" -> "it-is is-not not-a a-problem"

Parameters:
queries - Original queries in the sequence
terms - Corresponding term text of each query
joiner - Used to join the resulting bi-grammed clauses
Returns:
A new query possibly containing bi-grams
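
One way to reproduce the three documented examples on plain strings is sketched below. ExactBigramSketch is a hypothetical stand-in, the hyphen separator follows this page's notation, and behavior on inputs other than the documented examples is a guess rather than the class's confirmed logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for illustration; not the class documented on this page. */
final class ExactBigramSketch {
    /**
     * Emits a bi-gram for every adjacent pair that contains a stop word,
     * and emits a real word alone only when no bi-gram already covers it.
     */
    static List<String> bigramExact(String[] terms, Set<String> stopSet) {
        List<String> out = new ArrayList<>();
        boolean consumed = false; // true when terms[i] already appears in the previous bi-gram
        for (int i = 0; i < terms.length; i++) {
            boolean hasNext = i + 1 < terms.length;
            boolean pairNeedsBigram = hasNext
                    && (stopSet.contains(terms[i]) || stopSet.contains(terms[i + 1]));
            if (pairNeedsBigram) {
                out.add(terms[i] + "-" + terms[i + 1]);
                consumed = true;
            } else {
                if (!consumed && !stopSet.contains(terms[i])) {
                    out.add(terms[i]);
                }
                consumed = false;
            }
        }
        return out;
    }
}
```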

glomQueries

protected Query glomQueries(Query q1,
                            Query q2)
Joins a stop word to a real word, or vice-versa. Also handles more complex cases, like joining a stop-word to an OR query.

Examples:
the rabbit -> the-rabbit
the (white OR beige) -> the-white OR the-beige

Parameters:
q1 - First query
q2 - Second query
Returns:
A query representing the join.
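
The distribution over OR alternatives shown in the examples can be sketched with lists of strings. GlomSketch is a hypothetical stand-in: a plain term is modeled as a one-element list, an OR query as a list of its alternatives, and the hyphen separator follows this page's notation. NOT queries are not covered by this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for illustration; not the class documented on this page. */
final class GlomSketch {
    /** Joins every alternative of q1 with every alternative of q2, keeping q1 on the left. */
    static List<String> glom(List<String> q1, List<String> q2) {
        List<String> joined = new ArrayList<>();
        for (String left : q1) {
            for (String right : q2) {
                joined.add(left + "-" + right);
            }
        }
        return joined; // read the result as an OR over its elements
    }
}
```

Joining the one-element list for "the" with the alternatives white and beige gives the-white and the-beige, matching the second example above.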

glomInside

protected SpanQuery glomInside(SpanOrQuery oq,
                               SpanTermQuery term,
                               boolean before)
Gloms the term onto each clause within an OR query.

Parameters:
oq - Query to glom into
term - Term to glom on
before - true to prepend the term, false to append.
Returns:
A new glommed query.

glomInside

protected SpanQuery glomInside(SpanChunkedNotQuery nq,
                               SpanTermQuery term,
                               boolean before)
Gloms the term onto each clause within a NOT query.

Parameters:
nq - Query to glom into
term - Term to glom on
before - true to prepend the term, false to append.
Returns:
A new glommed query.

glomInside

protected SpanQuery glomInside(SpanNotNearQuery nq,
                               SpanTermQuery term,
                               boolean before)
Gloms the term onto each clause within a NOT query.

Parameters:
nq - Query to glom into
term - Term to glom on
before - true to prepend the term, false to append.
Returns:
A new glommed query.

extractTermText

protected String extractTermText(Object obj)
Given a term, term query, span term query (or plain string), extract the term text. This method is handy so we don't have to sprinkle if statements everywhere we need to get the text.

Parameters:
obj - String, Term, TermQuery, or SpanTermQuery to check
Returns:
text of the term

extractTerm

protected Term extractTerm(Object obj)
Given a term query, span term query (or plain term), extract the Term itself. This method is handy so we don't have to sprinkle if statements everywhere we need to get the term from a query.

Parameters:
obj - Term, TermQuery, or SpanTermQuery to check
Returns:
the Term

reduceBoost

protected void reduceBoost(Query query)
Reduces the boost factor of a query (typically the non-bigram of a pair in an OR) so that the bigram will get scored higher.