org.cdlib.xtf.textIndexer
Class XMLTextProcessor

Object
  extended by DefaultHandler
      extended by XMLTextProcessor
All Implemented Interfaces:
ContentHandler, DTDHandler, EntityResolver, ErrorHandler

public class XMLTextProcessor
extends DefaultHandler

This class performs the actual parsing of the XML source text files and generates index information for them.

The XMLTextProcessor class uses the configuration information recorded in the IndexerConfig instance it is given to add one or more source text documents to the associated Lucene index. Indexing an XML source document consists of breaking the document up into small overlapping "chunks" of text and indexing the individual words encountered in each chunk.

The reason source documents are split into chunks during indexing is to allow the search engine to load only small pieces of a document when displaying summary "blurbs" for matched text. This significantly lowers the memory required to display search results for multiple documents. Chunks are overlapped so that proximity matches spanning adjacent chunks can still be found. At this time, the maximum distance at which a proximity match can be found using this approach is the chunk size used when the text document was indexed, because proximity checks are currently performed only on two adjacent chunks.
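As a dependency-free sketch (not XTF's actual code), the chunk boundaries implied by a chunk size and overlap, both measured in words, could be computed like this:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSketch {
    /**
     * Split words into chunks of chunkWordSize words, each overlapping the
     * next by chunkWordOvlp words. The step between chunk starts corresponds
     * to the chunkWordOvlpStart convenience value in XMLTextProcessor.
     */
    static List<String[]> chunk(String[] words, int chunkWordSize, int chunkWordOvlp) {
        int step = chunkWordSize - chunkWordOvlp;
        List<String[]> chunks = new ArrayList<>();
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkWordSize, words.length);
            String[] c = new String[end - start];
            System.arraycopy(words, start, c, 0, end - start);
            chunks.add(c);
            if (end == words.length) break;
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] words = "a b c d e f g h".split(" ");
        // 8 words, chunk size 4, overlap 2 -> [a b c d], [c d e f], [e f g h]
        for (String[] c : chunk(words, 4, 2))
            System.out.println(String.join(" ", c));
    }
}
```

Because every pair of adjacent chunks shares chunkWordOvlp words, any proximity match no wider than a chunk is guaranteed to fall entirely within at least one pair of adjacent chunks.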

Within a chunk, adjacent words are considered to be one word apart. In Lucene parlance, the word bump for adjacent words is one. Larger word bump values can be set for sub-sections of a document. Doing so makes proximity matches within a sub-section more relevant than ones that span sections.
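The effect of a word bump on word positions can be sketched as follows (an illustration of the arithmetic, not XTF's implementation): adjacent words are one position apart, and a bump pushes the next word further away, so proximity queries spanning the bump must tolerate a larger distance.

```java
public class BumpSketch {
    /**
     * Assign word positions: adjacent words are 1 apart; a word that starts
     * a new sentence is pushed an extra sentenceBump positions away.
     */
    static int[] positions(int wordCount, boolean[] startsSentence, int sentenceBump) {
        int[] pos = new int[wordCount];
        int p = 0;
        for (int i = 0; i < wordCount; i++) {
            if (i > 0) p += 1;                       // adjacent words: bump of one
            if (i > 0 && startsSentence[i]) p += sentenceBump;
            pos[i] = p;
        }
        return pos;
    }

    public static void main(String[] args) {
        // "Hello world. Goodbye" with the default sentence bump of 5:
        int[] p = positions(3, new boolean[] {true, false, true}, 5);
        System.out.println(p[0] + " " + p[1] + " " + p[2]); // 0 1 7
    }
}
```

Here "world" and "Goodbye" end up 6 positions apart instead of 1, so a tight proximity search will prefer matches that stay within one sentence.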

Word bump adjustments are made through the use of attributes added to nodes in the XML source text file. The available word bump attributes are:

xtf:sentencebump="xxx"
Set the additional word distance implied by sentence breaks in the associated node. If not set explicitly, the default sentence bump value is 5.

xtf:sectiontype="xxx"
While this attribute's primary purpose is to assign names to a section of text, it also forces sections with different names to start in new, non-overlapping chunks. The net result is equivalent to placing an "infinite word bump" between differently named sections, causing proximity searches to never find a match that spans multiple sections.

xtf:proximitybreak
Forces its associated node in the source text to start in a new, non-overlapping chunk. As with new sections described above, the net result is equivalent to placing an "infinite word bump" between adjacent sections, causing proximity searches to never find a match that spans the proximity break.

In addition to the word bump modifiers described above, there are two additional non-bump attributes that can be applied to nodes in a source text file:
xtf:boost="xxx"
Boosts the ranking of words found in the associated node by multiplying their base relevance by the number xxx. Normally, a boost value greater than 1.0 is used to emphasize the associated text, but values less than 1.0 can be used as an "inverse" boost to de-emphasize the relevance of text. Also, since Lucene only applies boost values to entire chunks, changing the boost value for a node causes the text to start in a new, non-overlapping chunk.

xtf:noindex
When added to a source text node, this attribute causes the contained text not to be indexed.
Normally, the above-mentioned node attributes aren't actually present in the source text nodes, but are embedded via an XSL pre-filter before the node is indexed. The XSL pre-filter used is the one defined for the current index in the XML configuration file passed to the TextIndexer.

For both bump and non-bump attributes, the namespace URI defined by the xtfUri member must be specified for the XMLTextProcessor to recognize and process them.
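As a concrete illustration, a pre-filtered source fragment might look like the following. The element names and attribute values here are invented for the example, but the xtf: attributes and the namespace URI are the ones described above:

```xml
<!-- Hypothetical output of an XSL pre-filter. -->
<doc xmlns:xtf="http://cdlib.org/xtf">
  <chapter xtf:sectiontype="chapter" xtf:sentencebump="10">
    <title xtf:boost="2.0">The Voyage</title>
    <p>Body text, indexed normally.</p>
    <note xtf:noindex="true">Editorial note, not indexed.</note>
    <p xtf:proximitybreak="true">Starts a new, non-overlapping chunk.</p>
  </chapter>
</doc>
```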


Nested Class Summary
private  class XMLTextProcessor.FileQueueEntry
           
private  class XMLTextProcessor.MetaField
           
 
Field Summary
private  CharMap accentMap
          The set of accented chars to remove diacritics from.
private  StringBuffer accumText
          A buffer used to accumulate actual words from the source text, along with "virtual words" implied by any sectiontype and proximitybreak attributes encountered, as well as various special markers used to locate where in the XML source text the indexed text is stored.
private  StringBuffer blurbedText
          A buffer containing the "blurbified" text to be stored in the index.
private static int bufStartSize
          Initial size for various text accumulation buffers used by this class.
private  char[] charBuf
          Character buffer for accumulating partial text blocks (possibly) passed in to the characters() method from the SAX parser.
private  int charBufPos
          Current end of the charBuf buffer.
private  int chunkCount
          The number of chunks of XML source text that have been processed.
private  int chunkStartNode
          The XML node in which the current chunk starts (may be different from the current node being processed, since chunks may span nodes.)
private  int chunkWordCount
          Number of words accumulated so far for the current chunk.
private  int chunkWordOffset
          The nodeWordCount at which the current chunk begins.
private  int chunkWordOvlp
          The number of words of overlap between two adjacent chunks.
private  int chunkWordOvlpStart
          The start word offset at which the next overlapping chunk begins.
private  int chunkWordSize
          The size in words of a chunk.
private  StringBuffer compactedAccumText
          A version of the accumText member where individual "virtual words" have been compacted down into special offset markers.
private  IndexRecord curIdxRecord
          The current record being indexed within curIdxSrc
private  IndexSource curIdxSrc
          The location of the XML source text file currently being indexed.
private  int curNode
          The current XML node we are currently reading source text from.
private  String curPrettyKey
          Display name of the current file
private  int docWordCount
          Number of words encountered so far for the current document.
private  LinkedList fileQueue
          List of files to process.
private  boolean forcedChunk
          Flag indicating that a new chunk needs to be created.
private  boolean ignoreFileTimes
          Whether to ignore file modification times
private  IndexInfo indexInfo
          A reference to the configuration information for the current index being updated.
private  String indexPath
          The base directory for the current Lucene database.
private  IndexReader indexReader
          A Lucene index reader object, used in conjunction with the indexSearcher to check if the current document needs to be added to, updated in, or removed from the index.
private  IndexSearcher indexSearcher
          A Lucene index searcher object, used in conjunction with the indexReader to check if the current document needs to be added to, updated in, or removed from the index.
private  IndexWriter indexWriter
          A Lucene index writer object, used to add or update documents in the index currently opened for writing.
private  int inMeta
          Counter tracking how deeply nested in a meta-data section the current text/tag being processed is.
private  LazyTreeBuilder lazyBuilder
          Object used to construct a "lazy tree" representation of the XML source file being processed.
private  ReceivingContentHandler lazyHandler
          Wrapper for the lazyReceiver object that translates SAX events to Saxon's internal Receiver API.
private  Receiver lazyReceiver
          SAX Handler object for processing XML nodes into a "lazy tree" representation of the source document.
private  StructuredStore lazyStore
          Storage for the "lazy tree"
private static int MAX_DELETION_BATCH
          Maximum number of document deletions to do in a single batch
private  StringBuffer metaBuf
          A buffer for accumulating meta-text from the source XML file.
private  XMLTextProcessor.MetaField metaField
          The current meta-field data being processed.
private  int nextChunkStartIdx
          The character index in the chunk text accumulation buffer where the next overlapping chunk begins.
private  int nextChunkStartNode
          Start node of next chunk.
private  int nextChunkWordCount
          Number of words accumulated so far for the next overlapping chunk.
private  int nextChunkWordOffset
          The nodeWordCount at which the next chunk begins.
private  int nodeWordCount
          Number of words encountered so far for the current XML node.
private  WordMap pluralMap
          The set of plural words to de-pluralize while indexing.
private  SectionInfoStack section
          Stack containing the nesting level of the current text being processed.
private  SpellWriter spellWriter
          Queues words for spelling dictionary creator
private  Set stopSet
          The set of stop words to remove while indexing.
private  HashSet<String> subDocsWritten
          Sub-documents that have been written out.
private  HashSet tokenizedFields
          Keeps track of fields we already know are tokenized
private  String xtfHomePath
          The base directory from which to resolve relative paths (if any)
private static String xtfUri
          The namespace string used to identify attributes that must be processed by the XMLTextProcessor class.
 
Constructor Summary
XMLTextProcessor()
           
 
Method Summary
private  void addToTokenizedFieldsFile(String field)
          Adds a field to the on-disk list of tokenized fields for an index.
 void batchDelete()
          If the first entry in the file queue requires deletion, we start a batch delete of up to MAX_DELETION_BATCH deletions.
private  void blurbify(StringBuffer text, boolean trim)
          Convert the given source text into a "blurb."
 void characters(char[] ch, int start, int length)
          Accumulate chunks of text encountered between element/node/tags.
 void checkAndQueueText(IndexSource idxSrc)
          Check and conditionally queue a source text file for (re)indexing.
private  int checkFile(IndexSource srcInfo)
          Check to see if the current XML source text file exists in the Lucene database, and if so, whether or not it needs to be updated.
 void close()
          Close the Lucene index.
private  void compactVirtualWords()
          Compacts multiple adjacent virtual words into a special "virtual word count" token.
private  void copyDependentFile(String filePath, String fieldName, Document doc)
           
private  void createIndex(IndexInfo indexInfo)
          Utility function to create a new Lucene index database for reading or searching.
 boolean docExists(String key)
          Checks if a given document exists in the index.
 void endDocument()
          Perform any final document processing when the end of the XML source text has been reached.
 void endElement(String uri, String localName, String qName)
          Process the end of a new XML source text element/node/tag.
 void endPrefixMapping(String prefix)
           
 void flushCharacters()
          Process any accumulated source text, writing indexing completed chunks to the Lucene database as necessary.
private  void forceNewChunk(SectionInfo secInfo)
          Forces subsequent text to start at the beginning of a new chunk.
private  String getIndexPath()
          Returns a normalized version of the base path of the Lucene database for an index.
 int getQueueSize()
          Find out how many texts have been queued up using queueText(IndexSource, boolean) but not yet processed by processQueuedTexts().
private  void incrementNode()
          Increment the node tracking information.
private  void indexText(SectionInfo secInfo)
          Add the current accumulated chunk of text to the Lucene database for the active index.
private  void insertVirtualWords(StringBuffer text)
          Inserts "virtual words" into the specified text as needed.
private  void insertVirtualWords(String vWord, int count, StringBuffer text, int pos)
          Utility function used by the main insertVirtualWords() method to insert a specified number of virtual word symbols.
private static boolean isAllWhitespace(String str, int start, int end)
          Utility function to check if a string or a portion of a string is entirely whitespace.
private  boolean isEndOfSentence(int idx, int len, StringBuffer text)
          Utility function to determine if the current character marks the end of a sentence.
private  boolean isSentencePunctuationChar(char theChar)
          Utility function to detect sentence punctuation characters.
 void open(String homePath, IndexInfo idxInfo, boolean clean)
          Version retained for source-level backward compatibility, since this API is sometimes used externally.
 void open(String homePath, IndexInfo idxInfo, boolean clean, boolean ignoreFileTimes)
          Open a TextIndexer (Lucene) index for reading or writing.
private  void openIdxForReading()
          Open the active Lucene index database for reading (and deleting, an oddity in Lucene).
private  void openIdxForWriting()
          Open the active Lucene index database for writing.
 void optimizeIndex()
          Runs an optimization pass (which can be quite time-consuming) on the currently open index.
private  int parseText()
          Parse the XML source text file specified.
private  void precacheXSLKeys()
          To speed accesses in dynaXML, the lazy tree is capable of storing pre-cached indexes to support each xsl:key declaration.
 void processingInstruction(String target, String data)
           
private  String processMetaAttribs(Attributes atts)
          Build a string representing any non-XTF attributes in the given attribute list.
private  void processNodeAttributes(Attributes atts)
          Process the attributes associated with an XML source text node.
 void processQueuedTexts()
          Process the list of files queued for indexing or reindexing.
private  int processText(IndexSource file, IndexRecord record, int recordNum)
          Add the specified XML source record to the active Lucene index.
 void queueText(IndexSource idxSrc)
          Queue a source text file for indexing.
 void queueText(IndexSource srcInfo, boolean deleteFirst)
          Queue a source text file for (re)indexing.
 boolean removeSingleDoc(File srcFile, String key)
          Remove a single document from the index.
private  void saveDocInfo(SectionInfo secInfo)
          Save document information associated with a collection of chunks.
 void startDocument()
          Process the start of a new XML document.
 void startElement(String uri, String localName, String qName, Attributes atts)
          Process the start of a new XML source text element/node/tag.
 void startPrefixMapping(String prefix, String uri)
           
private  int trimAccumText(boolean oneEndSpace)
          Utility method to trim trailing space characters from the end of the accumulated chunk text buffer.
private  boolean trueOrFalse(String value, boolean defaultResult)
          Utility function to check if a string contains the word true or false or the equivalent values yes or no.
 
Methods inherited from class DefaultHandler
error, fatalError, ignorableWhitespace, notationDecl, resolveEntity, setDocumentLocator, skippedEntity, unparsedEntityDecl, warning
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

bufStartSize

private static final int bufStartSize
Initial size for various text accumulation buffers used by this class.

See Also:
Constant Field Values

chunkCount

private int chunkCount
The number of chunks of XML source text that have been processed. Used to assign a unique chunk number to each chunk for a document.


curNode

private int curNode
The current XML node we are currently reading source text from.


docWordCount

private int docWordCount
Number of words encountered so far for the current document. Used to determine whether any real data has been encountered requiring a sub-document docInfo chunk (we don't want to write them for empty sections.)


subDocsWritten

private HashSet<String> subDocsWritten
Sub-documents that have been written out. Used to know whether a main docInfo chunk needs to be written.


nodeWordCount

private int nodeWordCount
Number of words encountered so far for the current XML node. Used to track the current working position in a node. This value is recorded in chunkWordOffset whenever a new text chunk is started.


chunkStartNode

private int chunkStartNode
The XML node in which the current chunk starts (may be different from the current node being processed, since chunks may span nodes.)


chunkWordCount

private int chunkWordCount
Number of words accumulated so far for the current chunk. Used to track when a chunk is "full", and a new chunk needs to be started.


chunkWordOffset

private int chunkWordOffset
The nodeWordCount at which the current chunk begins. This value is stored with the chunk in the index so that the search engine knows where a chunk appears in the original source text.


nextChunkStartNode

private int nextChunkStartNode
Start node of next chunk. Used to hold the start node for the next overlapping chunk while the current chunk is being processed. Copied to chunkStartNode when processing for the current node is complete.


nextChunkWordCount

private int nextChunkWordCount
Number of words accumulated so far for the next overlapping chunk. Used to track how "full" the next overlapping chunk is while the current chunk is being processed. Copied to chunkWordCount when processing for the current node is complete.


nextChunkWordOffset

private int nextChunkWordOffset
The nodeWordCount at which the next chunk begins. Used to track the offset of the next overlapping chunk while the current chunk is being processed. Copied to chunkWordOffset when processing for the current node is complete.


nextChunkStartIdx

private int nextChunkStartIdx
The character index in the chunk text accumulation buffer where the next overlapping chunk begins. Used to track how many characters in the accumulation buffer need to be saved for the next chunk because of its overlap with the current chunk.


chunkWordSize

private int chunkWordSize
The size in words of a chunk.


chunkWordOvlp

private int chunkWordOvlp
The number of words of overlap between two adjacent chunks.


chunkWordOvlpStart

private int chunkWordOvlpStart
The start word offset at which the next overlapping chunk begins. Simply a precalculated convenience variable based on chunkWordSize and chunkWordOvlp.


stopSet

private Set stopSet
The set of stop words to remove while indexing. See IndexInfo.stopWords for details.


pluralMap

private WordMap pluralMap
The set of plural words to de-pluralize while indexing. See IndexInfo.pluralMapPath for details.


accentMap

private CharMap accentMap
The set of accented chars to remove diacritics from. See IndexInfo.accentMapPath for details.


forcedChunk

private boolean forcedChunk
Flag indicating that a new chunk needs to be created. Set to true when a node's section name changes or a proximitybreak attribute is encountered.


indexInfo

private IndexInfo indexInfo
A reference to the configuration information for the current index being updated. See the IndexInfo class description for more details.


ignoreFileTimes

private boolean ignoreFileTimes
Whether to ignore file modification times


xtfHomePath

private String xtfHomePath
The base directory from which to resolve relative paths (if any)


fileQueue

private LinkedList fileQueue
List of files to process. For an explanation of file queuing, see the processQueuedTexts() method.


curIdxSrc

private IndexSource curIdxSrc
The location of the XML source text file currently being indexed. For more information about this structure, see the IndexSource class.


curIdxRecord

private IndexRecord curIdxRecord
The current record being indexed within curIdxSrc


curPrettyKey

private String curPrettyKey
Display name of the current file


indexPath

private String indexPath
The base directory for the current Lucene database.


lazyBuilder

private LazyTreeBuilder lazyBuilder
Object used to construct a "lazy tree" representation of the XML source file being processed.

A "lazy tree" representation of an XML document is a copy of the source document optimized for quick access and low memory loading. This is accomplished by breaking up the original XML document into separately addressable subsections that can be read in individually as needed. For more detailed information about "lazy tree" organization, see the LazyTreeBuilder class.


lazyStore

private StructuredStore lazyStore
Storage for the "lazy tree"


lazyReceiver

private Receiver lazyReceiver
SAX Handler object for processing XML nodes into a "lazy tree" representation of the source document. For more details, see the lazyBuilder member.


lazyHandler

private ReceivingContentHandler lazyHandler
Wrapper for the lazyReceiver object that translates SAX events to Saxon's internal Receiver API. See the lazyReceiver and lazyBuilder members for more details.


charBuf

private char[] charBuf
Character buffer for accumulating partial text blocks (possibly) passed in to the characters() method from the SAX parser.


charBufPos

private int charBufPos
Current end of the charBuf buffer.


inMeta

private int inMeta
Counter tracking how deeply nested in a meta-data section the current text/tag being processed is.


metaField

private XMLTextProcessor.MetaField metaField
The current meta-field data being processed.


metaBuf

private StringBuffer metaBuf
A buffer for accumulating meta-text from the source XML file.


indexReader

private IndexReader indexReader
A Lucene index reader object, used in conjunction with the indexSearcher to check if the current document needs to be added to, updated in, or removed from the index.


indexSearcher

private IndexSearcher indexSearcher
A Lucene index searcher object, used in conjunction with the indexReader to check if the current document needs to be added to, updated in, or removed from the index.


indexWriter

private IndexWriter indexWriter
A Lucene index writer object, used to add or update documents in the index currently opened for writing.


spellWriter

private SpellWriter spellWriter
Queues words for spelling dictionary creator


tokenizedFields

private HashSet tokenizedFields
Keeps track of fields we already know are tokenized


MAX_DELETION_BATCH

private static final int MAX_DELETION_BATCH
Maximum number of document deletions to do in a single batch

See Also:
Constant Field Values

blurbedText

private StringBuffer blurbedText
A buffer containing the "blurbified" text to be stored in the index. For more about how text is "blurbified", see the blurbify() method.


accumText

private StringBuffer accumText
A buffer used to accumulate actual words from the source text, along with "virtual words" implied by any sectiontype and proximitybreak attributes encountered, as well as various special markers used to locate where in the XML source text the indexed text is stored.


compactedAccumText

private StringBuffer compactedAccumText
A version of the accumText member where individual "virtual words" have been compacted down into special offset markers. To learn more about "virtual words", see the insertVirtualWords() and compactVirtualWords() methods.


section

private SectionInfoStack section
Stack containing the nesting level of the current text being processed.

Since various section types can be nested in an XML source document, a stack needs to be maintained of the current nesting depth and order, so that previous section types can be restored when the end of the active section type is encountered. See the SectionInfoStack class for more about section nesting.


xtfUri

private static final String xtfUri
The namespace string used to identify attributes that must be processed by the XMLTextProcessor class.

Indexer-specific attributes are usually inserted into XML source text elements by way of an XSL pre-filter. For these pre-filter attributes to be recognized by the XMLTextProcessor, this string ("http://cdlib.org/xtf") must be set as the attribute's URI. To learn more about pre-filter attributes, see the XMLTextProcessor class description.

See Also:
Constant Field Values
Constructor Detail

XMLTextProcessor

public XMLTextProcessor()
Method Detail

open

public void open(String homePath,
                 IndexInfo idxInfo,
                 boolean clean)
          throws IOException
Version retained for source-level backward compatibility, since this API is sometimes used externally. Defaults 'ignoreFileTimes' to false.

Throws:
IOException

open

public void open(String homePath,
                 IndexInfo idxInfo,
                 boolean clean,
                 boolean ignoreFileTimes)
          throws IOException
Open a TextIndexer (Lucene) index for reading or writing.

The primary purpose of this method is to open the index identified by idxInfo for reading and searching. Index reading and searching operations are used to clean, cull, or optimize an index. Opening an index for writing is performed by the openIdxForWriting() method only when the index is being updated with new document information.

Parameters:
homePath - Path from which to resolve relative path names.
idxInfo - A config structure containing information about the index to open.
clean - true to truncate any existing index; false to add to it.

ignoreFileTimes - true to ignore file time checks (only applies during incremental indexing).
Throws:
IOException - Any I/O exceptions that occurred during the opening, creation, or truncation of the Lucene index.

Notes:
This method will create the index if it doesn't exist, or truncate an existing index if the clean parameter is set.

This method stores a private internal reference (indexInfo) to the passed configuration structure for use by other methods in this class.
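A typical indexing session opens the index once, queues all files, processes the queue, and closes, as described under processQueuedTexts(). The sketch below is pseudocode in Java syntax: only methods documented on this page are used, and the construction of the IndexInfo and IndexSource instances is elided because it depends on the index configuration.

```java
// Pseudocode sketch of the open / queue / process / close lifecycle.
// xtfHome, idxInfo, and sources are assumed to come from the index
// configuration; their construction is not shown.
XMLTextProcessor proc = new XMLTextProcessor();
proc.open(xtfHome, idxInfo, /*clean:*/ false);   // open (or create) the index
for (IndexSource src : sources)
    proc.checkAndQueueText(src);                 // queue only files needing (re)indexing
if (proc.getQueueSize() > 0)
    proc.processQueuedTexts();                   // index everything in one open/close cycle
proc.optimizeIndex();                            // optional, and can be time-consuming
proc.close();
```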


close

public void close()
           throws IOException
Close the Lucene index.

This method closes the current open Lucene index (if any.)

Throws:
IOException - Any I/O exceptions that occurred during the closing of the Lucene index.

Notes:
This method closes any indexReader, indexWriter or indexSearcher objects open for the current Lucene index.


createIndex

private void createIndex(IndexInfo indexInfo)
                  throws IOException
Utility function to create a new Lucene index database for reading or searching.

This method is used by the open() method to create a new or clean index for reading and searching.

Throws:
IOException - Any I/O exceptions that occurred during the deletion of a previous Lucene database or during the creation of the new index currently specified by the internal indexInfo structure.

Notes:
This method creates the Lucene database for the index, and then adds an "index info chunk" that identifies the chunk size and overlap used. This information is required by the search engine to correctly detect and highlight proximity search results.


copyDependentFile

private void copyDependentFile(String filePath,
                               String fieldName,
                               Document doc)
                        throws IOException
Throws:
IOException

checkAndQueueText

public void checkAndQueueText(IndexSource idxSrc)
                       throws ParserConfigurationException,
                              SAXException,
                              IOException
Check and conditionally queue a source text file for (re)indexing.

This method first checks if the given source file is already in the index. If not, it adds it to a queue of files to be (re)indexed.

Parameters:
idxSrc - The source to add to the queue of sources to be indexed/reindexed.

Throws:
ParserConfigurationException
SAXException
IOException
Notes:
For more about why source text files are queued, see the processQueuedTexts() method.


queueText

public void queueText(IndexSource idxSrc)
Queue a source text file for indexing. Old chunks with the same key will not be deleted first, so this method should only be used for new texts, or to append chunks for an existing text.

Parameters:
idxSrc - The data source to add to the queue of sources to be indexed/reindexed.

Notes:
For more about why source text files are queued, see the processQueuedTexts() method.


queueText

public void queueText(IndexSource srcInfo,
                      boolean deleteFirst)
Queue a source text file for (re)indexing.

Parameters:
srcInfo - The source XML text file to add to the queue of files to be indexed/reindexed.

Notes:
For more about why source text files are queued, see the processQueuedTexts() method.


getQueueSize

public int getQueueSize()
Find out how many texts have been queued up using queueText(IndexSource, boolean) but not yet processed by processQueuedTexts().


removeSingleDoc

public boolean removeSingleDoc(File srcFile,
                               String key)
                        throws ParserConfigurationException,
                               SAXException,
                               IOException
Remove a single document from the index.

Parameters:
srcFile - The original XML source file, used to calculate the location of the corresponding *.lazy file to delete. If null, this step is skipped.
key - The key associated with the document in the index.
Returns:
true if a document was found and removed, false if no match was found.
Throws:
ParserConfigurationException
SAXException
IOException

docExists

public boolean docExists(String key)
                  throws ParserConfigurationException,
                         SAXException,
                         IOException
Checks if a given document exists in the index.

Parameters:
key - The key associated with the document in the index.
Returns:
true if a document was found, false if not.
Throws:
ParserConfigurationException
SAXException
IOException

batchDelete

public void batchDelete()
                 throws IOException
If the first entry in the file queue requires deletion, we start a batch delete of up to MAX_DELETION_BATCH deletions. We batch these up because in Lucene, you can only delete with an IndexReader. It costs time to close our IndexWriter, open an IndexReader for the deletions, and then reopen the IndexWriter.

Throws:
IOException - Any I/O exceptions encountered when reading the source text file or writing to the Lucene index.
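The payoff of batching can be sketched without Lucene at all: each batch pays one fixed writer-close/reader-open/writer-reopen cycle, so the number of expensive cycles is the ceiling of pending deletions over the batch size (the batch limit of 1000 below is illustrative; the actual MAX_DELETION_BATCH value is defined by this class).

```java
public class BatchSketch {
    /**
     * Number of IndexWriter-close / IndexReader-open / IndexWriter-reopen
     * cycles needed to perform the given number of deletions in batches
     * of at most maxBatch. Without batching this would equal `deletions`.
     */
    static int readerCycles(int deletions, int maxBatch) {
        return (deletions + maxBatch - 1) / maxBatch;  // ceiling division
    }

    public static void main(String[] args) {
        // e.g. 2500 pending deletions with a hypothetical batch limit of 1000:
        System.out.println(readerCycles(2500, 1000)); // 3 cycles instead of 2500
    }
}
```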


processQueuedTexts

public void processQueuedTexts()
                        throws IOException
Process the list of files queued for indexing or reindexing.

This method iterates through the list of queued source text files, (re)indexing the files as needed.

Throws:
IOException - Any I/O exceptions encountered when reading the source text file or writing to the Lucene index.

Notes:
Originally, the XMLTextProcessor opened the Lucene database, (re)indexed the source file, and then closed the database for each XML file encountered in the source tree. Unfortunately, opening and closing the Lucene database is a fairly time-consuming operation, and doing so for each file made the time to index an entire source tree much higher than it had to be. So to minimize the open/close overhead, the XMLTextProcessor was changed to traverse the source tree first and collect all the XML filenames it found into a processing queue. Once the files were queued, the Lucene database could be opened, all the files in the queue could be (re)indexed, and the database could be closed. Doing so significantly reduced the time to index the entire source tree.

It should be noted that each file in the queue is identified by a "relocatable" path to the source tree directory where it was found, and that this relocatable path is stored in the Lucene database when the file is indexed. This relocatable path consists of the index name followed by the source tree sub-path at which the file is located. Storing this relocatable file path in the index allows the indexer and the search engine to correctly locate the source text, even if the source tree base directory has been renamed or moved. Correctly locating the original source text for chunks in an index is necessary when displaying search results, or to determine if source text needs to be reindexed due to changes, or removed from an index because it no longer exists. Ultimately, both the indexer and the query engine use the index configuration file to map the index name back into an absolute path when a source text needs to be accessed.
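The relocatable-path idea can be sketched as follows. The ":" separator and the exact key format are assumptions made for this illustration, not XTF's actual on-disk format:

```java
public class PathSketch {
    /**
     * Build a "relocatable" document key from the index name plus the file's
     * sub-path below the source tree base directory. Because the key contains
     * no absolute path, it survives a move or rename of the source tree base;
     * the index configuration file maps the index name back to a base path.
     */
    static String relocatableKey(String indexName, String sourceRoot, String absPath) {
        if (!absPath.startsWith(sourceRoot))
            throw new IllegalArgumentException("file not under source tree");
        return indexName + ":" + absPath.substring(sourceRoot.length());
    }

    public static void main(String[] args) {
        // Moving the source tree changes sourceRoot but not the stored key:
        System.out.println(relocatableKey("default", "/data/xtf/", "/data/xtf/texts/ch1.xml"));
    }
}
```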


processText

private int processText(IndexSource file,
                        IndexRecord record,
                        int recordNum)
                 throws IOException
Add the specified XML source record to the active Lucene index. This method indexes the specified XML source text file, adding it to the Lucene database currently specified by the indexPath member.

Parameters:
file - The XML source text file to process.
record - Record within the XML file to process.
recordNum - Zero-based index of this record in the XML file.
Throws:
IOException - Any I/O errors encountered opening or reading the XML source text file or the Lucene database.

Notes:
To learn more about the actual mechanics of how XML source files are indexed, see the XMLTextProcessor class description.

parseText

private int parseText()
Parse the XML source text file specified.

This method instantiates a SAX XML file parser and passes this class as the token handler. Doing so causes the startDocument(), startElement(), endElement(), endDocument(), and characters() methods in this class to be called. These methods in turn process the actual text in the XML source document, "blurbifying" the text, breaking it up into overlapping chunks, and adding it to the Lucene index.

Returns:
0 - XML source file successfully parsed and indexed.
-1 - One or more errors encountered processing XML source file.
Notes:
For more about "blurbifying" text, see the blurbify() method.

This function enables namespaces for XML tag attributes. Consequently, attributes such as sectiontype and proximitybreak are assumed to be prefixed by the namespace xtf.

If present in the indexInfo member, the XML file will be prefiltered with the specified XSL filter before XML parsing begins. This allows node attributes to be inserted that modify the proximity of various text sections, as well as boost or deemphasize the relevance of sections of text. For a description of attributes handled by this XML parser, see the XMLTextProcessor class description.
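The parser wiring described above can be sketched with the JDK's SAX API. This is a minimal, hypothetical illustration: the handler below only records element names, whereas the real class performs blurbifying, chunking, and indexing in its callbacks. Note the namespace-aware setting, which is what lets xtf:-prefixed attributes resolve to a namespace URI.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch of how parseText() wires up a SAX parser with this class as
// the token handler; the callbacks below would do the real work.
public class SaxSketch extends DefaultHandler {
    final StringBuilder seen = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        seen.append('<').append(localName).append('>');
    }

    public static String parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);  // required so xtf:-prefixed
                                              // attributes resolve to a URI
            SaxSketch handler = new SaxSketch();
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
            return handler.seen.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```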


precacheXSLKeys

private void precacheXSLKeys()
                      throws Exception
To speed accesses in dynaXML, the lazy tree is capable of storing pre-cached indexes to support each xsl:key declaration. These take a while to build, however, so it's desirable to do this at index time rather than on-demand. This method reads a stylesheet that should contain the xsl:key declarations that will be used. It then generates each key and stores it in the lazy file.

Throws:
Exception - If anything goes awry.

startDocument

public void startDocument()
                   throws SAXException
Process the start of a new XML document. Called by the XML file parser at the beginning of a new document.

Specified by:
startDocument in interface ContentHandler
Overrides:
startDocument in class DefaultHandler
Throws:
SAXException - Any exceptions encountered by the lazyBuilder during start of document processing.

Notes:
This method simply calls the start of document handler for the "lazy tree" builder object lazyBuilder.

startElement

public void startElement(String uri,
                         String localName,
                         String qName,
                         Attributes atts)
                  throws SAXException
Process the start of a new XML source text element/node/tag.

Called by the XML file parser each time a new start tag is encountered.

Specified by:
startElement in interface ContentHandler
Overrides:
startElement in class DefaultHandler
Parameters:
uri - Any namespace qualifier that applies to the current XML tag.
localName - The non-qualified name of the current XML tag.
qName - The qualified name of the current XML tag.
atts - Any attributes for the current tag. Note that only attributes that are in the namespace specified by the xtfUri member of this class are actually processed by this method.

Throws:
SAXException - Any exceptions generated by calls to "lazy tree" or Lucene database access methods.

Notes:
This method processes any text accumulated before the current start tag was encountered by calling the flushCharacters() method. It also calls the lazyHandler object to write the accumulated text to the "lazy tree" representation of the XML source file. Finally, it resets the node tracking information to match the new node, including any boost or bump attributes set for the new node.


processMetaAttribs

private String processMetaAttribs(Attributes atts)
Build a string representing any non-XTF attributes in the given attribute list. This will be a series of name="value" pairs, separated by spaces. If there are no non-XTF attributes, an empty string is returned.
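The behavior described can be sketched as below, assuming attributes are filtered by namespace URI. The XTF_URI value here is illustrative, not necessarily the exact xtfUri used by the real class.

```java
import org.xml.sax.Attributes;

// Sketch: serialize every attribute NOT in the XTF namespace as
// space-separated name="value" pairs. The namespace URI is a guess
// used for illustration only.
public class MetaAttribs {
    static final String XTF_URI = "http://cdlib.org/xtf";

    static String nonXtfAttribs(Attributes atts) {
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < atts.getLength(); i++) {
            if (XTF_URI.equals(atts.getURI(i)))
                continue;                       // skip xtf:* attributes
            if (buf.length() > 0)
                buf.append(' ');
            buf.append(atts.getQName(i))
               .append("=\"").append(atts.getValue(i)).append('"');
        }
        return buf.toString();                  // "" if none found
    }
}
```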


incrementNode

private void incrementNode()
Increment the node tracking information.

Notes:
This method is called when a new node in a source XML document has been encountered. It increments the current node count, resets the number of words accumulated for the new node to zero, and if a partial chunk has been accumulated, inserts a node marker into the accumulated text buffer.

endElement

public void endElement(String uri,
                       String localName,
                       String qName)
                throws SAXException
Process the end of a new XML source text element/node/tag.

Called by the XML file parser each time an end-tag is encountered.

Specified by:
endElement in interface ContentHandler
Overrides:
endElement in class DefaultHandler
Parameters:
uri - Any namespace qualifier that applies to the current XML tag.
localName - The non-qualified name of the current XML tag.
qName - The qualified name of the current XML tag.

Throws:
SAXException - Any exceptions generated by calls to "lazy tree" or Lucene database access methods.

Notes:
This method processes any text accumulated before the current end tag was encountered by calling the flushCharacters() method. It also calls the lazyHandler object to write the accumulated text to the "lazy tree" representation of the XML source file. Finally, it returns the node tracking information to a state that matches the parent node, including any boost or bump attributes previously set for that node.


processingInstruction

public void processingInstruction(String target,
                                  String data)
                           throws SAXException
Specified by:
processingInstruction in interface ContentHandler
Overrides:
processingInstruction in class DefaultHandler
Throws:
SAXException

startPrefixMapping

public void startPrefixMapping(String prefix,
                               String uri)
                        throws SAXException
Specified by:
startPrefixMapping in interface ContentHandler
Overrides:
startPrefixMapping in class DefaultHandler
Throws:
SAXException

endPrefixMapping

public void endPrefixMapping(String prefix)
                      throws SAXException
Specified by:
endPrefixMapping in interface ContentHandler
Overrides:
endPrefixMapping in class DefaultHandler
Throws:
SAXException

endDocument

public void endDocument()
                 throws SAXException
Perform any final document processing when the end of the XML source text has been reached.

Called by the XML file parser when the end of the source document is encountered.

Specified by:
endDocument in interface ContentHandler
Overrides:
endDocument in class DefaultHandler
Throws:
SAXException - Any exceptions generated during the final writing of the Lucene database or the "lazy tree" representation of the XML file.

Notes:
This method indexes any remaining accumulated text, adds any remaining text to the "lazy tree" representation of the XML document, and writes out the document summary record (chunk) to the Lucene database.


characters

public void characters(char[] ch,
                       int start,
                       int length)
                throws SAXException
Accumulate chunks of text encountered between element/node/tags.

Specified by:
characters in interface ContentHandler
Overrides:
characters in class DefaultHandler
Parameters:
ch - A block of characters from which to accumulate text.

start - The starting offset in ch of the characters to accumulate.

length - The number of characters to accumulate from ch.

Throws:
SAXException
Notes:
Depending on how the XML parser is implemented, a call to this function may or may not receive all the characters encountered between two tags in an XML file. However, for the XMLTextProcessor to correctly assemble overlapping chunks for the Lucene database, it needs to have all the characters between two tags available as a single chunk of text. Consequently, this method simply accumulates text; calls from the XML parser to startElement() and endElement() trigger the actual processing of the accumulated text.
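The accumulate-then-flush pattern in these notes can be sketched as follows. This is a simplified stand-in: the real flushCharacters() blurbifies and indexes the text rather than just collecting it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: characters() only appends (the parser may split the text
// between two tags across several calls); startElement()/endElement()
// flush, so each run of text is processed as one unit.
public class TextAccumulator {
    private final StringBuilder accum = new StringBuilder();
    final List<String> flushed = new ArrayList<>();

    void characters(char[] ch, int start, int length) {
        accum.append(ch, start, length);   // never process here
    }

    void flushCharacters() {               // called on start/end tags
        if (accum.length() > 0) {
            flushed.add(accum.toString());
            accum.setLength(0);
        }
    }
}
```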

flushCharacters

public void flushCharacters()
                     throws SAXException
Process any accumulated source text, indexing completed chunks and writing them to the Lucene database as necessary.

Throws:
SAXException - Any exceptions encountered during the processing of the accumulated chunks, or writing them to the Lucene index.

Notes:
This method processes any accumulated text as follows:
1. First the accumulated text is "blurbified." See the blurbify() method for more information about what this entails.

2. Next, a chunk is assembled a word at a time from the accumulated text until the required chunk size (in words) is reached. The completed chunk is then added to the Lucene database.

3. Step two is repeated until no more complete chunks can be assembled from the accumulated text. (Any partial chunk text is saved until the next call to this method.)
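Steps 2 and 3 amount to a sliding-window pass over the accumulated words. A minimal sketch, with the chunk size and overlap as explicit parameters (the real values come from the indexer configuration, and a trailing partial chunk is carried over to the next flush):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: emit fixed-size chunks of words, overlapping each chunk
// with the previous one so proximity matches can span a boundary.
// Assumes overlapWords < chunkWords.
public class Chunker {
    static List<String> chunks(String text, int chunkWords, int overlapWords) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        int step = chunkWords - overlapWords;   // words advanced per chunk
        for (int start = 0; start + chunkWords <= words.length; start += step)
            out.add(String.join(" ",
                    Arrays.copyOfRange(words, start, start + chunkWords)));
        return out;   // any trailing partial chunk would be saved for later
    }
}
```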

forceNewChunk

private void forceNewChunk(SectionInfo secInfo)
Forces subsequent text to start at the beginning of a new chunk.

This method is used to ensure that source text marked with proximity breaks or new section types does not overlap with any previously accumulated source text.

Notes:
This method writes out any accumulated text and resets the chunk tracking information to the start of a new chunk.

trimAccumText

private int trimAccumText(boolean oneEndSpace)
Utility method to trim trailing space characters from the end of the accumulated chunk text buffer.

Parameters:
oneEndSpace - Flag indicating whether the accumulated chunk text buffer should be completely stripped of trailing whitespace, or if one ending space should be left.

Returns:
The final length of the accumulated chunk text buffer after the trailing spaces have been stripped.


blurbify

private void blurbify(StringBuffer text,
                      boolean trim)
Convert the given source text into a "blurb."

This method replaces line-feeds, tabs, and other whitespace characters with simple space characters to make the text more readable when presented to the user as the summary of a search query.

Parameters:
text - Upon entry, the text to be converted into a "blurb." Upon return, the resulting "blurbed" text.
trim - A flag indicating whether or not leading and trailing whitespace should be trimmed from the resulting "blurb" text.

Notes:
This function also compresses multiple space characters into a single space character, and removes any internal processing markers (i.e., node tracking or bump tracking markers.)
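The whitespace normalization described here can be sketched in a few lines (removal of internal processing markers is omitted):

```java
// Sketch of "blurbifying": collapse all whitespace runs (newlines,
// tabs, multiple spaces) into single spaces, optionally trimming the
// ends, so the text reads cleanly in a search-result summary.
public class Blurb {
    static String blurbify(String text, boolean trim) {
        String result = text.replaceAll("\\s+", " ");
        return trim ? result.trim() : result;
    }
}
```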

insertVirtualWords

private void insertVirtualWords(StringBuffer text)
Inserts "virtual words" into the specified text as needed.

Parameters:
text - The text into which virtual words should be inserted.

Notes:
Virtual words? What's that all about? Well...

The search engine is capable of performing proximity searches (up to the size of one text chunk.) Normally, the proximity of two words in a chunk is simply determined by the number of words between them. However, it might be nice to consider two words within a single sentence closer together than the same two words appearing in two adjacent sentences. Similarly, it might be nice if two words in a single section are considered closer together than the same two words in adjacent sections.

To accommodate this, the text indexer supports the concept of increasing the distance between words at the end of a section or sentence and the beginning of the next (referred to as "section bump" and "sentence bump" respectively.) The question is, how can we make it seem like the distance between a word at the end of one sentence/section and the first word in the next sentence/section is larger than it really is? The answer is to insert "virtual words."

Virtual words are simply word-like markers inserted in the text that obey the same counting rules as real words, but do not actually appear in the final "blurb" seen by the user. For example, assume we have a sentence bump set to five, and the following text:
Luke Luck likes lakes.
Luke's duck likes lakes.
Luke Luck licks lakes.
Luke's duck licks lakes.
Duck takes licks in lakes Luke Luck likes.
And Luke Luck takes licks in lakes duck likes.
When virtual words are inserted into this text from Dr. Seuss' Fox in Socks, the resulting blurb text that is added to the index looks as follows:
Luke Luck likes lakes. vw vw vw vw vw
Luke's duck likes lakes. vw vw vw vw vw
Luke Luck licks lakes. vw vw vw vw vw
Luke's duck licks lakes. vw vw vw vw vw
Duck takes licks in lakes Luke Luck likes. vw vw vw vw vw
And Luke Luck takes licks in lakes duck likes.
Because of the virtual word insertion, the Luke at the beginning of the first sentence is considered to be two words away from the lakes at the end, and the Luke at the beginning of the second sentence is considered to be five words away from the lakes at the end of the first sentence. The result is that in these sentences, the Lukes are considered closer to the lakes in their respective sentences than the ones in the adjacent sentences.

The actual marker used for virtual words is not really vw, since that combination of letters is likely to occur in regular text that discusses Volkswagens. The marker used is defined by the VIRTUAL_WORD member of the Constants class, and has been chosen to be unlikely to appear in any actual western text.

An added benefit of using virtual words is that counting them like real words results in text chunks with correct word counts, regardless of any bumps introduced. This is important if the code that displays search results is to correctly highlight the matched text.

While the use of virtual words as shown will yield the correct spacing of bumped text, it is not very space efficient, and would yield larger than necessary Lucene databases. To eliminate the unwanted overhead, virtual words are compacted just before the chunk is written to the Lucene database with the method compactVirtualWords().
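The insertion described above can be sketched as follows. The literal "vw" marker and the exact insertion rule are illustrative assumptions (the real marker is Constants.VIRTUAL_WORD, and the real method also handles section bumps and other cases):

```java
// Sketch: after each sentence-ending punctuation mark (except at the
// very end of the text), append 'sentenceBump' copies of a marker so
// the next sentence's first word counts as that many words farther
// away. "vw" is a readable stand-in for Constants.VIRTUAL_WORD.
public class VirtualWords {
    static String insert(String text, int sentenceBump, String marker) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(c);
            boolean endOfSentence = (c == '.' || c == '?' || c == '!')
                    && i + 1 < text.length();  // no bump after final sentence
            if (endOfSentence)
                for (int n = 0; n < sentenceBump; n++)
                    out.append(' ').append(marker);
        }
        return out.toString();
    }
}
```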


isEndOfSentence

private boolean isEndOfSentence(int idx,
                                int len,
                                StringBuffer text)
Utility function to determine if the current character marks the end of a sentence.

Parameters:
idx - The character offset in the accumulated chunk text buffer to check.
len - The total length of the text in the text buffer passed.
text - The text buffer to check.
Returns:
true - The current character marks the end of a sentence.
false - The current character does not mark the end of a sentence.

Notes:
This method handles obvious end of sentence markers like ., ?, and !, but also "artistic" punctuation like ???, !!!, and ?!?!. Currently, it considers ... to represent a long pause (an extended comma), and does not treat it as the end of a sentence. It also tries to avoid mistaking periods in decimals and acronyms (e.g., 61.7 and I.B.M.) for end of sentence markers.
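The heuristics listed in these notes can be sketched as follows. This is an illustrative approximation, not the actual implementation, and it handles only the cases the notes call out:

```java
// Sketch: '.', '?', '!' end a sentence; "..." reads as a pause; a
// digit after the period (61.7) or a single letter bracketed by
// periods (I.B.M.) suppresses the end-of-sentence decision.
public class SentenceEnd {
    static boolean isEndOfSentence(int idx, StringBuffer text) {
        char c = text.charAt(idx);
        if (c == '?' || c == '!')
            return true;
        if (c != '.')
            return false;
        // Part of "..." -> long pause, not an end of sentence.
        if (idx + 1 < text.length() && text.charAt(idx + 1) == '.')
            return false;
        if (idx > 0 && text.charAt(idx - 1) == '.')
            return false;
        // Digit after the period suggests a decimal like 61.7.
        if (idx + 1 < text.length() && Character.isDigit(text.charAt(idx + 1)))
            return false;
        // Single letter bracketed by periods suggests an acronym.
        if (idx >= 2 && Character.isLetter(text.charAt(idx - 1))
                && text.charAt(idx - 2) == '.')
            return false;
        return true;
    }
}
```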


isSentencePunctuationChar

private boolean isSentencePunctuationChar(char theChar)
Utility function to detect sentence punctuation characters.

Parameters:
theChar - The character to check.
Returns:
true - The specified character is a sentence punctuation character.
false - The specified character is not a sentence punctuation character.

Notes:
This function looks for punctuation that marks the end of a sentence (not clause markers, like ; and :). At this time, only ., ?, and ! are considered end of sentence punctuation characters.

insertVirtualWords

private void insertVirtualWords(String vWord,
                                int count,
                                StringBuffer text,
                                int pos)
Utility function used by the main insertVirtualWords() method to insert a specified number of virtual word symbols.

Parameters:
vWord - The virtual word symbol to insert.
count - The number of virtual words to insert.
text - The text to insert the virtual words into.
pos - The character index in the text at which to insert the virtual words.

Notes:
For an in-depth explanation of virtual words, see the main insertVirtualWords() method.

indexText

private void indexText(SectionInfo secInfo)
Add the current accumulated chunk of text to the Lucene database for the active index.

Parameters:
secInfo - Info such as sectionType, wordBoost, etc.
Notes:
This method performs the final step of adding a chunk of assembled text to the Lucene database specified by the indexInfo configuration member. This includes compacting virtual words via the compactVirtualWords() method, and recording the unique document identifier (key) for the chunk, the section type for the chunk, the word boost for the chunk, the XML node in which the chunk begins, the word offset of the chunk, and the "blurbified" text for the chunk.


isAllWhitespace

private static boolean isAllWhitespace(String str,
                                       int start,
                                       int end)
Utility function to check if a string or a portion of a string is entirely whitespace.

Parameters:
str - String to check for all whitespace.
start - First character in string to check.
end - One index past the last character to check.

Returns:
true - The specified range of the string is all whitespace.
false - The specified range of the string is not all whitespace.


compactVirtualWords

private void compactVirtualWords()
                          throws IOException
Compacts multiple adjacent virtual words into a special "virtual word count" token.

Throws:
IOException - Any exceptions generated by low level string operations.

Notes:
For an explanation of "virtual words", see the main insertVirtualWords() method.

A virtual word count consists of a special start marker, followed by the virtual word count, and an ending marker. Currently, the start and end markers are the same, allowing virtual word markers to be detected in the same way regardless of which direction a string is processed.

The actual virtual word count marker character is defined by the BUMP_MARKER member of the Constants class.
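The compaction described can be sketched with a regular expression over runs of markers. Here "vw" and '#' are readable stand-ins for Constants.VIRTUAL_WORD and BUMP_MARKER, and the exact #N# token layout is an assumption for illustration (the doc only specifies that the same marker brackets the count on both sides):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: replace each run of N adjacent virtual-word markers with a
// count token bracketed by the same marker character on both sides,
// e.g. "vw vw vw" -> "#3#".
public class CompactVw {
    static String compact(String text, String vWord, char bumpMarker) {
        Pattern run = Pattern.compile(Pattern.quote(vWord)
                + "(?: " + Pattern.quote(vWord) + ")*");
        Matcher m = run.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Each marker plus its separating space is vWord.length()+1
            // characters, so recover the count from the run length.
            int count = (m.group().length() + 1) / (vWord.length() + 1);
            m.appendReplacement(out,
                    bumpMarker + String.valueOf(count) + bumpMarker);
        }
        m.appendTail(out);
        return out.toString();
    }
}
```
Using the same character for both the start and end marker means the token parses identically whether a string is scanned forward or backward, as the notes point out.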


trueOrFalse

private boolean trueOrFalse(String value,
                            boolean defaultResult)
Utility function to check if a string contains the word true or false or the equivalent values yes or no.

Parameters:
value - The string to check the value of.
defaultResult - The boolean value to default to if the string doesn't contain true, false, yes or no.

Returns:
The equivalent boolean value specified by the string, or the default boolean value if the string doesn't contain true, false, yes or no.

Notes:
This function is primarily used to interpret values of on/off style attributes associated with prefiltered nodes in the XML source text.
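The described mapping is small enough to sketch directly; case-insensitive matching is an assumption, since the doc does not specify it:

```java
// Sketch: map true/yes and false/no to booleans, falling back to the
// supplied default for null or any unrecognized value.
public class BoolAttr {
    static boolean trueOrFalse(String value, boolean defaultResult) {
        if (value == null)
            return defaultResult;
        switch (value.trim().toLowerCase()) {
            case "true": case "yes": return true;
            case "false": case "no": return false;
            default: return defaultResult;
        }
    }
}
```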


processNodeAttributes

private void processNodeAttributes(Attributes atts)
Process the attributes associated with an XML source text node.

Sets internal flags and variables used during text processing based on any special attributes encountered in the given attribute list.

Parameters:
atts - The attribute list to process.

Notes:
This method is called to process a list of attributes associated with a node. These attributes are typically inserted into the XML source text by an XSL prefilter.

Since many of these attributes nest (i.e., values for child nodes temporarily override parent attributes), the current state of the attributes is maintained on a "section info" stack. See the section member for more details.

For a description of node attributes handled by this method, see the XMLTextProcessor class description.


saveDocInfo

private void saveDocInfo(SectionInfo secInfo)
Save document information associated with a collection of chunks.

This method saves a special document summary information chunk to the Lucene database that binds all the indexed text chunks for a document back to the original XML source text.

Notes:
The document summary chunk is the last chunk written to a Lucene database for a given XML source document. Its presence or absence can then be used to identify whether or not a document was completely indexed. The absence of a document summary for any given text chunk implies that indexing was aborted before the document was completely indexed. This property of document summary chunks is used by the IdxTreeCleaner class to strip out any partially indexed documents.

The document summary includes the relative path to the original XML source text, the number of chunks indexed for the document, a unique key that associates this summary with all the indexed text chunks, the date the document was added to the index, and any meta-data associated with the document.


addToTokenizedFieldsFile

private void addToTokenizedFieldsFile(String field)
Adds a field to the on-disk list of tokenized fields for an index. Exceptions are handled internally and thrown as RuntimeException.


getIndexPath

private String getIndexPath()
                     throws IOException
Returns a normalized version of the base path of the Lucene database for an index.

Throws:
IOException - Any exceptions generated retrieving the path for a Lucene database.


checkFile

private int checkFile(IndexSource srcInfo)
               throws IOException
Check to see if the current XML source text file exists in the Lucene database, and if so, whether or not it needs to be updated.

Returns:
0 - Specified XML source document not found in the Lucene database.
1 - Specified XML source document found in the index, and the index is up-to-date.
2 - Specified XML source document is in the index, but the source text has changed since it was last indexed.

Throws:
IOException
Notes:
The XML source document checked by this function is specified by the curIdxSrc member.

An XML source document needs reindexing if its modification date differs from the modification date stored in the summary info chunk the last time it was indexed.


optimizeIndex

public void optimizeIndex()
                   throws IOException
Runs an optimization pass (which can be quite time-consuming) on the currently open index. Optimization speeds future query access to the index.

Throws:
IOException

openIdxForReading

private void openIdxForReading()
                        throws IOException
Open the active Lucene index database for reading (and deleting, an oddity in Lucene).

Throws:
IOException - Any exceptions generated during the creation of the Lucene database reader object.
Notes:
This method attempts to open the Lucene database specified by the indexPath member for reading and/or deleting. It is strange that you delete things from a Lucene index by using an IndexReader, but hey, whatever floats your boat man.


openIdxForWriting

private void openIdxForWriting()
                        throws IOException
Open the active Lucene index database for writing.

Throws:
IOException - Any exceptions generated during the creation of the Lucene database writer object.
Notes:
This method attempts to open the Lucene database specified by the indexPath member for writing.