org.cdlib.xtf.textEngine
Class SnippetMaker

Object
  extended by SnippetMaker

public class SnippetMaker
extends Object

Does the heavy lifting of interpreting span hits using the actual document text stored in the index. Marks the hit and any matching terms, and includes a configurable amount of surrounding context.

Author:
Martin Haye

Nested Class Summary
 class SnippetMaker.StartEndStripper
          Strips the special start-of-field/end-of-field markers from tokens.
 
Field Summary
private  CharMap accentMap
          Accented chars to remove diacritics from
private static Pattern ampPattern
           
private  Analyzer analyzer
          Lucene analyzer used for tokenizing text
private  int chunkOverlap
          Amount of overlap between adjacent index chunks
private  int chunkSize
          Max # of words in an index chunk
private  DocNumMap docNumMap
          Keeps track of which chunks belong to which source document in the index.
private static Pattern gtPattern
           
private static Pattern ltPattern
           
private  int maxContext
          Target # of characters to include in the snippet.
private  WordMap pluralMap
          Plural words to convert to singular
 IndexReader reader
          Lucene index reader used to fetch text data
private  Set stopSet
          Set of stop-words removed (e.g. "the", "a", "and", etc.)
private  int termMode
          Where to mark terms (all, only in spans, etc.)
 
Constructor Summary
SnippetMaker(IndexReader reader, DocNumMap docNumMap, Set stopSet, WordMap pluralMap, CharMap accentMap, int maxContext, int termMode)
          Constructs a SnippetMaker, ready to make snippets using the given index reader to load text data.
 
Method Summary
 CharMap accentMap()
          Obtain the set of accented chars to remove diacritics from.
 DocNumMap docNumMap()
          Obtain the document number map used to make snippets
 Snippet[] makeSnippets(FieldSpans fieldSpans, int mainDocNum, String fieldName, boolean getText)
          Full-blown snippet formation process.
(package private)  String mapXMLChars(String s)
          Replaces 'special' characters in the given string with their XML equivalent.
 String markField(Document doc, FieldSpans fieldSpans, String fieldName, String value)
          Marks all the terms within the given text.
 WordMap pluralMap()
          Obtain the set of plural words to convert to singular form.
 Set stopSet()
          Obtain a list of stop-words in the index (e.g. "the", "a", "and", etc.)
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

reader

public IndexReader reader
Lucene index reader used to fetch text data


analyzer

private Analyzer analyzer
Lucene analyzer used for tokenizing text


docNumMap

private DocNumMap docNumMap
Keeps track of which chunks belong to which source document in the index.


chunkSize

private int chunkSize
Max # of words in an index chunk


chunkOverlap

private int chunkOverlap
Amount of overlap between adjacent index chunks
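
To illustrate how chunkSize and chunkOverlap could relate, here is a minimal sketch. It assumes both values are counted in words and that each chunk starts (chunkSize - chunkOverlap) words after the previous one. The class ChunkLayoutSketch and the method chunkStarts are hypothetical names used only for this example; they are not part of SnippetMaker.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkLayoutSketch {
    /**
     * Returns the starting word offset of each chunk (hypothetical helper).
     * Assumes adjacent chunks share exactly chunkOverlap words.
     */
    static List<Integer> chunkStarts(int totalWords, int chunkSize, int chunkOverlap) {
        if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= chunkOverlap < chunkSize");
        }
        int stride = chunkSize - chunkOverlap; // words advanced per chunk
        List<Integer> starts = new ArrayList<>();
        for (int start = 0; start < totalWords; start += stride) {
            starts.add(start);
            // Stop once this chunk reaches the end of the text.
            if (start + chunkSize >= totalWords) {
                break;
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // 10 words, chunks of 4 words, 1 word shared between neighbours.
        System.out.println(chunkStarts(10, 4, 1));
    }
}
```

Under these assumptions, a hit that straddles a chunk boundary may still fall entirely within one chunk, provided it is no longer than the overlap.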


stopSet

private Set stopSet
Set of stop-words removed (e.g. "the", "a", "and", etc.)


pluralMap

private WordMap pluralMap
Plural words to convert to singular


accentMap

private CharMap accentMap
Accented chars to remove diacritics from


maxContext

private int maxContext
Target # of characters to include in the snippet.


termMode

private int termMode
Where to mark terms (all, only in spans, etc.)


ampPattern

private static final Pattern ampPattern

ltPattern

private static final Pattern ltPattern

gtPattern

private static final Pattern gtPattern


Constructor Detail

SnippetMaker

public SnippetMaker(IndexReader reader,
                    DocNumMap docNumMap,
                    Set stopSet,
                    WordMap pluralMap,
                    CharMap accentMap,
                    int maxContext,
                    int termMode)
Constructs a SnippetMaker, ready to make snippets using the given index reader to load text data.

Parameters:
reader - Index reader to fetch text data from
docNumMap - Maps chunk numbers to document numbers
stopSet - Stop words removed (e.g. "the", "a", "and", etc.)
pluralMap - Plural words to convert to singular
accentMap - Accented chars to remove diacritics from
maxContext - Target # chars for hit + context
termMode - Where to mark terms (all, only in spans, etc.)


Method Detail

stopSet

public Set stopSet()
Obtain a list of stop-words in the index (e.g. "the", "a", "and", etc.)


pluralMap

public WordMap pluralMap()
Obtain the set of plural words to convert to singular form.


accentMap

public CharMap accentMap()
Obtain the set of accented chars to remove diacritics from.


docNumMap

public DocNumMap docNumMap()
Obtain the document number map used to make snippets


makeSnippets

public Snippet[] makeSnippets(FieldSpans fieldSpans,
                              int mainDocNum,
                              String fieldName,
                              boolean getText)
Full-blown snippet formation process.

Parameters:
fieldSpans - record of the matching spans, and all search terms
mainDocNum - document ID of the main doc
fieldName - name of the field we're making snippets of
getText - true to get the full text of the snippet, false if we only want the start/end offsets.

markField

public String markField(Document doc,
                        FieldSpans fieldSpans,
                        String fieldName,
                        String value)
Marks all the terms within the given text. Typically used to mark terms within a meta-data field.

Parameters:
doc - document to get matching spans from
fieldName - name of the field to mark.
value - value of the field to mark
Returns:
Marked up text value.

mapXMLChars

String mapXMLChars(String s)
Replaces 'special' characters in the given string with their XML equivalent.
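
The kind of replacement described here can be sketched as follows. This is a minimal sketch, not the class's actual source: it assumes the 'special' characters are &, < and >, as the ampPattern, ltPattern and gtPattern fields suggest. The class name XmlEscapeSketch and the constant names are hypothetical.

```java
import java.util.regex.Pattern;

public class XmlEscapeSketch {
    // Compiled once and reused, in the spirit of the static pattern fields.
    private static final Pattern AMP = Pattern.compile("&");
    private static final Pattern LT = Pattern.compile("<");
    private static final Pattern GT = Pattern.compile(">");

    /** Replaces &, < and > with their XML entity equivalents (assumed set). */
    static String mapXMLChars(String s) {
        // Escape ampersands first so the '&' in "&lt;" and "&gt;"
        // is not escaped a second time.
        s = AMP.matcher(s).replaceAll("&amp;");
        s = LT.matcher(s).replaceAll("&lt;");
        s = GT.matcher(s).replaceAll("&gt;");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(mapXMLChars("a < b && c > d"));
        // prints: a &lt; b &amp;&amp; c &gt; d
    }
}
```

Escaping the ampersand before the angle brackets matters: in the reverse order, the entities produced for < and > would themselves be escaped again.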