org.cdlib.xtf.textIndexer
Class IndexInfo

Object
  extended by IndexInfo

public class IndexInfo
extends Object

This class maintains configuration information about the current index that the TextIndexer program is processing.

Information stored by this class includes:

- The name of the current index being processed.
- The path where the Lucene index database is (to be) stored.
- The path where the source text for this index can be found.
- The path where any XSLT input filters for this index can be found.
- A specification for source text files to ignore.
- The text chunk size and overlap attributes for the current index.
- Specifications for stop word removal.


Field Summary
 String accentMapPath
          Path to a mapping from accented characters to their corresponding characters with the diacritics removed.
 int[] chunkAtt
          Text chunk attribute array.
static int chunkOvlp
          Index into Chunk Attribute Array for the chunk overlap attribute.
static int chunkSize
          Index into Chunk Attribute Array for the chunk size attribute.
 boolean cloneData
          True to make a clone of the data in index/dataClone.
 boolean createSpellcheckDict
          Whether to create a spellcheck dictionary for this index
static int defaultChunkOvlp
          Constant defining the default overlap (in words) of two adjacent text chunks.
static int defaultChunkSize
          Constant defining the default size (in words) of a text chunk.
static String defaultStopWords
          Constant defining the default list of stop words.
 String docSelectorPath
          Path to stylesheet used to determine which documents to index
 String indexName
          Name of the current index being processed (as specified in the index configuration file.)
 String indexPath
          Name of the path to the current index's Lucene database.
static int minChunkSize
          Constant defining the minimum size (in words) of a text chunk.
 String pluralMapPath
          Path to a mapping from plural words to their corresponding singular forms that the textIndexer should fold together.
 boolean rotate
          Whether index rotation is enabled
 boolean scanAllDirs
          True to scan all dirs, false for pruned (e.g. stop at first data).
 String sourcePath
          Path to the source text for the current index.
 String stopWords
          Set of stop words to remove.
 boolean stripWhitespace
          Whether to strip whitespace between elements in lazy tree files.
 ArrayList<String> subDirs
          Names of the sub-directories to index, or null to index everything.
 String validationPath
          Path to a set of validation specifications for this index.
 
Constructor Summary
IndexInfo()
          Default constructor.
IndexInfo(String indexName, String indexPath)
          Alternate constructor.
 
Method Summary
 int getChunkOvlp()
          Return the overlap of two adjacent text chunks for the current index.
 String getChunkOvlpStr()
          Return the overlap (in words) of two adjacent text chunks in the current index as a string.
 int getChunkSize()
          Return the size of a text chunk for the current index.
 String getChunkSizeStr()
          Return the size of a text chunk (in words) for the current index as a string.
 int setChunkOvlp(int newChunkOverlap)
          Sets the adjacent chunk overlap attribute for the current index.
 int setChunkSize(int newChunkSize)
          Sets the text chunk size attribute for the current index.
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

indexName

public String indexName
Name of the current index being processed (as specified in the index configuration file.)


subDirs

public ArrayList<String> subDirs
Names of the sub-directories to index, or null to index everything.


indexPath

public String indexPath
Name of the path to the current index's Lucene database.


rotate

public boolean rotate
Whether index rotation is enabled


sourcePath

public String sourcePath
Path to the source text for the current index.


scanAllDirs

public boolean scanAllDirs
True to scan all dirs, false for pruned (e.g. stop at first data). Defaults to false for backward compatibility.


cloneData

public boolean cloneData
True to make a clone of the data in index/dataClone. Useful so that dynaXML can always get to files that match the index.


docSelectorPath

public String docSelectorPath
Path to stylesheet used to determine which documents to index


stopWords

public String stopWords
Set of stop words to remove. Stop words are common words such as "the" and "and", which are so ubiquitous that they add little value to queries. Rather than remove them entirely, however, we take an approach suggested by Doug Cutting (inventor of Lucene).

Basically, stop words are joined to surrounding normal words. This speeds queries while still producing good results for requests that contain a mixture of stop words and normal words (which is by far the most common case for queries.)

For example, the string "man of war" would be indexed like this: "man man-of of-war war". This way, searching for "man war" will pull up a hit, but a search for "man of war" will score higher, as long as the same stop-word approach is applied to the query.

You might ask what happens in this case: "joke of the year" (two stop words in a row.) We could index it as "joke joke-of of-the the-year", or as the longer but more complete "joke joke-of joke-of-the of-the of-the-year the-year". The second form doesn't offer much improvement in searching and would make the index bigger and logic more complex. So we always combine a stop word with at most one neighboring word.

The words in this list may be separated by spaces, commas, and/or semicolons.
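
The joining scheme described above can be sketched in a few lines. This is a minimal illustration, not the TextIndexer's actual implementation: the class name StopWordJoinSketch and its methods are hypothetical, tokenization is reduced to whitespace splitting, and the choice to emit every normal word on its own (including a trailing one) is an assumption based on the "man of war" example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not part of the XTF API.
class StopWordJoinSketch {

    // Parse a stop word list whose entries are separated by
    // spaces, commas, and/or semicolons.
    static Set<String> parseStopWords(String list) {
        Set<String> result = new HashSet<>();
        for (String word : list.trim().split("[\\s,;]+")) {
            if (!word.isEmpty()) {
                result.add(word.toLowerCase());
            }
        }
        return result;
    }

    // Emit each normal word by itself, and join each stop word to at most
    // one neighboring word at a time (never three words in one term).
    static List<String> join(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            boolean curStop = stopWords.contains(words[i]);
            if (!curStop) {
                out.add(words[i]); // normal words are always indexed alone
            }
            if (i + 1 < words.length) {
                boolean nextStop = stopWords.contains(words[i + 1]);
                if (curStop || nextStop) {
                    out.add(words[i] + "-" + words[i + 1]); // joined pair
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = parseStopWords("a, an; and of the");
        // prints: man man-of of-war war
        System.out.println(String.join(" ", join("man of war", stop)));
    }
}
```

Applying the same join to the query text is what lets an exact query such as "man of war" match the joined terms and score higher than "man war".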


pluralMapPath

public String pluralMapPath
Path to a mapping from plural words to their corresponding singular forms that the textIndexer should fold together. This can yield better search results. For instance, if a user searches for "cat" they probably also would like results for "cats." The file should be a plain text file, with one word pair per line. First is the plural form of a word, followed by a "|" character, followed by the singular form. All should be lowercase, even in the case of acronyms. Optionally, the file may be compressed in GZIP format, in which case it must end in the extension ".gz". Non-ASCII characters should be encoded in UTF-8 format.
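
As a rough illustration of the file format described above, the following sketch reads "plural|singular" lines into a map and folds a word to its singular form. The class PluralMapSketch and its method names are hypothetical, and skipping blank or malformed lines is an assumption; the real textIndexer may handle such lines differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch; not part of the XTF API.
class PluralMapSketch {

    // Read all lines as UTF-8, decompressing when the name ends in ".gz".
    static List<String> readLines(Path path) throws IOException {
        InputStream in = Files.newInputStream(path);
        if (path.toString().endsWith(".gz")) {
            in = new GZIPInputStream(in);
        }
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.toList());
        }
    }

    // Each line holds the plural form, a "|" character, then the singular form.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> map = new HashMap<>();
        for (String line : lines) {
            int bar = line.indexOf('|');
            if (bar < 0) {
                continue; // assumption: ignore blank or malformed lines
            }
            String plural = line.substring(0, bar).trim();
            String singular = line.substring(bar + 1).trim();
            if (!plural.isEmpty() && !singular.isEmpty()) {
                map.put(plural, singular);
            }
        }
        return map;
    }

    // Entries are lowercase, so lowercase the word before looking it up.
    static String fold(String word, Map<String, String> map) {
        String lower = word.toLowerCase();
        return map.getOrDefault(lower, lower);
    }

    public static void main(String[] args) {
        Map<String, String> map = parse(Arrays.asList("cats|cat", "mice|mouse"));
        System.out.println(fold("Cats", map)); // prints: cat
    }
}
```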


accentMapPath

public String accentMapPath
Path to a mapping from accented characters to their corresponding characters with the diacritics removed. These characters will be folded together, which can yield better search results. For instance, a German user on an American keyboard might want to find "Hut" with an umlaut over the "u", but can't type the umlaut. This way, if they type "hut" they'll still get a match. The file should be a plain text file, with one code pair per line. First is the 4-digit hex Unicode point for the accented character, followed by "|", then the 4-digit hex code for the unaccented form.
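
The code-pair format can be illustrated with a short sketch that parses lines such as "00FC|0075" (u with umlaut to plain u) and folds a string accordingly. The class AccentMapSketch is hypothetical and only handles 4-digit code points as the format describes; it skips lines that do not contain exactly one "|" and does not guard against invalid hex digits.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of the XTF API.
class AccentMapSketch {

    // Each line: 4-digit hex code of the accented character, "|",
    // then the 4-digit hex code of the unaccented form.
    static Map<Character, Character> parse(List<String> lines) {
        Map<Character, Character> map = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\|");
            if (parts.length != 2) {
                continue; // assumption: ignore blank or malformed lines
            }
            char accented = (char) Integer.parseInt(parts[0].trim(), 16);
            char plain = (char) Integer.parseInt(parts[1].trim(), 16);
            map.put(accented, plain);
        }
        return map;
    }

    // Replace every mapped character; leave all others untouched.
    static String fold(String text, Map<Character, Character> map) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char folded = map.getOrDefault(c, c);
            sb.append(folded);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Character, Character> map = parse(Arrays.asList("00FC|0075", "00E9|0065"));
        System.out.println(fold("H\u00FCt", map)); // prints: Hut
    }
}
```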


validationPath

public String validationPath
Path to a set of validation specifications for this index. This is essentially a list of URLs, with specifications on how many hits should be returned by each one. Validation is applied at index time to determine if the index is valid (and before rotating), and is also applied by the servlets before rotating in a new index. The file should be XML in the defined format.


createSpellcheckDict

public boolean createSpellcheckDict
Whether to create a spellcheck dictionary for this index


stripWhitespace

public boolean stripWhitespace
Whether to strip whitespace between elements in lazy tree files. Not strictly safe for all XML documents, but it can make lazy trees somewhat smaller and faster.


chunkAtt

public int[] chunkAtt
Text chunk attribute array. Currently this array consists of two entries:

- The size of the text chunk in words.
- The overlap in words of adjacent text chunks.

These array members should be addressed using the chunkSize and chunkOvlp constants defined by this class.

Notes:
For an explanation of the text chunk size and overlap attributes, see chunkSize and chunkOvlp.

chunkSize

public static final int chunkSize
Index into Chunk Attribute Array for the chunk size attribute.

Indexed text stored in a Lucene index is broken up into small chunks so that search result "summary blurbs" can be easily generated without having to load the entire source text. The chunk size attribute reflects the chunk size (in words) used by the current index.

See Also:
Constant Field Values

chunkOvlp

public static final int chunkOvlp
Index into Chunk Attribute Array for the chunk overlap attribute.

Indexed text stored in a Lucene index is broken up into small chunks that overlap with adjacent chunks so that "summary blurbs" for proximity searches can be easily generated without having to load the entire source text. The chunk overlap attribute reflects the overlap (in words) used by the current index.

See Also:
Constant Field Values
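
To make the relationship between chunk size and overlap concrete, the sketch below splits a word list into fixed-size chunks in which each chunk repeats the last few words of the previous one. It is an illustration only: ChunkOverlapSketch is a hypothetical class, and the TextIndexer's real chunking logic may differ, for example in how it treats the final partial chunk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not part of the XTF API.
class ChunkOverlapSketch {

    // Split words into chunks of `size` words, where each chunk shares
    // its first `overlap` words with the end of the previous chunk.
    static List<List<String>> chunk(List<String> words, int size, int overlap) {
        if (size < 1 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need 0 <= overlap < size");
        }
        List<List<String>> chunks = new ArrayList<>();
        int step = size - overlap; // how far each new chunk advances
        for (int start = 0; start < words.size(); start += step) {
            int end = Math.min(start + size, words.size());
            chunks.add(new ArrayList<>(words.subList(start, end)));
            if (end == words.size()) {
                break; // the last word has been covered
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a b c d e f g h i j".split(" "));
        // With size 4 and overlap 2, chunks start at words a, c, e, and g.
        for (List<String> c : chunk(words, 4, 2)) {
            System.out.println(String.join(" ", c));
        }
    }
}
```

Because a phrase that straddles a chunk boundary appears whole in at least one chunk when it is no longer than the overlap, a proximity match can be summarized from a single chunk.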

minChunkSize

public static final int minChunkSize
Constant defining the minimum size (in words) of a text chunk. Value = 2.

See Also:
Constant Field Values
Notes:
For an explanation of the text chunk size and overlap attributes, see chunkSize and chunkOvlp.

defaultChunkSize

public static final int defaultChunkSize
Constant defining the default size (in words) of a text chunk. Value = 100.

See Also:
Constant Field Values
Notes:
For an explanation of the text chunk size and overlap attributes, see chunkSize and chunkOvlp.

defaultChunkOvlp

public static final int defaultChunkOvlp
Constant defining the default overlap (in words) of two adjacent text chunks. Value = 50.

See Also:
Constant Field Values
Notes:
For an explanation of the text chunk size and overlap attributes, see chunkSize and chunkOvlp.

defaultStopWords

public static final String defaultStopWords
Constant defining the default list of stop words. These are common words that are so ubiquitous as to be of little use in queries. Value = "a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was will with".

See Also:
Constant Field Values
Notes:
For an explanation of stop word handling, see stopWords.

Constructor Detail

IndexInfo

public IndexInfo()
Default constructor.

Creates the chunk attribute array, and initializes the chunkSize entry to defaultChunkSize, and the chunkOvlp entry to defaultChunkOvlp.


IndexInfo

public IndexInfo(String indexName,
                 String indexPath)
Alternate constructor.

Initializes the fields needed to use InputStream-based indexing (that is, all fields except subDirs, sourcePath, and docSelectorPath). Uses default values for chunk size/overlap, and for the stop word list. After construction, these may be altered if desired.

Method Detail

getChunkSize

public int getChunkSize()
Return the size of a text chunk for the current index.

Returns:
The value of the chunkSize attribute.

Notes:
For an explanation of the text chunk size and overlap attributes, see chunkSize and chunkOvlp.

getChunkSizeStr

public String getChunkSizeStr()
Return the size of a text chunk (in words) for the current index as a string.

Returns:
The value of the chunkSize attribute converted to a String.

Notes:
This method is intended as a convenience call for code that creates Lucene fields, which are all stored as strings.

For an explanation of the text chunk size and overlap attributes, see chunkSize and chunkOvlp.

getChunkOvlp

public int getChunkOvlp()
Return the overlap of two adjacent text chunks for the current index.

Returns:
The value of the chunkOvlp attribute.

Notes:
For an explanation of the text chunk size and overlap attributes, see chunkSize and chunkOvlp.

getChunkOvlpStr

public String getChunkOvlpStr()
Return the overlap (in words) of two adjacent text chunks in the current index as a string.

Returns:
The value of the chunkOvlp attribute converted to a String.

Notes:
This method is intended as a convenience call for code that creates Lucene fields, which are all stored as strings.

For an explanation of the text chunk size and overlap attributes, see chunkSize and chunkOvlp.

setChunkSize

public int setChunkSize(int newChunkSize)
Sets the text chunk size attribute for the current index.

This method sets the value for the chunkSize attribute, coercing its value to be greater than or equal to the minChunkSize value.

Returns:
The resulting coerced chunkSize value.

Notes:
This function also calls the setChunkOvlp() method to ensure that the overlap value is valid for the chunk size set by this call.

For an explanation of the text chunk size and overlap attributes, see chunkSize and chunkOvlp.

setChunkOvlp

public int setChunkOvlp(int newChunkOverlap)
Sets the adjacent chunk overlap attribute for the current index.

This method sets the value for the chunkOvlp attribute, coercing its value to be less than or equal to half the current chunk size for the current index.

Returns:
The resulting coerced chunkOvlp value.

For an explanation of the text chunk size and overlap attributes, see chunkSize and chunkOvlp.
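
The coercion rules documented for setChunkSize() and setChunkOvlp() can be summarized in a small stand-alone sketch. ChunkAttSketch is a hypothetical stand-in, not the IndexInfo source: the array index values 0 and 1 are assumed from the order in which the entries are listed, the defaults are copied from the constants documented above, and only the two documented coercions are modeled.

```java
// Hypothetical sketch; not part of the XTF API.
class ChunkAttSketch {

    static final int CHUNK_SIZE = 0;           // assumed index of the size entry
    static final int CHUNK_OVLP = 1;           // assumed index of the overlap entry
    static final int MIN_CHUNK_SIZE = 2;       // documented minimum
    static final int DEFAULT_CHUNK_SIZE = 100; // documented default
    static final int DEFAULT_CHUNK_OVLP = 50;  // documented default

    final int[] chunkAtt = new int[2];

    ChunkAttSketch() {
        chunkAtt[CHUNK_SIZE] = DEFAULT_CHUNK_SIZE;
        chunkAtt[CHUNK_OVLP] = DEFAULT_CHUNK_OVLP;
    }

    // Coerce the size up to the minimum, then re-validate the overlap
    // against the new size.
    int setChunkSize(int newChunkSize) {
        chunkAtt[CHUNK_SIZE] = Math.max(newChunkSize, MIN_CHUNK_SIZE);
        setChunkOvlp(chunkAtt[CHUNK_OVLP]);
        return chunkAtt[CHUNK_SIZE];
    }

    // Coerce the overlap down to at most half the current chunk size.
    int setChunkOvlp(int newChunkOverlap) {
        chunkAtt[CHUNK_OVLP] = Math.min(newChunkOverlap, chunkAtt[CHUNK_SIZE] / 2);
        return chunkAtt[CHUNK_OVLP];
    }

    // String form, convenient where values are stored as strings.
    String getChunkSizeStr() {
        return Integer.toString(chunkAtt[CHUNK_SIZE]);
    }

    public static void main(String[] args) {
        ChunkAttSketch info = new ChunkAttSketch();
        System.out.println(info.setChunkSize(40)); // prints: 40
        System.out.println(info.setChunkOvlp(30)); // prints: 20
    }
}
```

Calling setChunkOvlp() from setChunkSize() keeps the pair consistent: shrinking the chunk size can never leave behind an overlap larger than half a chunk.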