org.cdlib.xtf.textIndexer
Class IdxTreeCleaner

Object
  extended by IdxTreeCleaner

public class IdxTreeCleaner
extends Object

This class purges "incomplete" documents from a Lucene index.

A "complete" document consists of all the overlapping text chunks for the document, plus a special docInfo chunk that provides summary information about the rest of the chunks in the document. Because the summary chunk is the last chunk written for a document, any early termination of the indexer (due to an error or a user abort) leaves text chunks in the index without a summary chunk. Such a document is called "incomplete".

Since the search engine relies on the summary chunk to correctly search overlapping text chunks, the absence of the summary chunk will cause problems. Consequently, this class is used to purge text chunks from the index that do not have a corresponding summary chunk.
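
The purge rule described above can be illustrated with a minimal in-memory sketch. It is not the XTF implementation and does not use Lucene. The Chunk type, its docKey field, and the findOrphanChunks() method are hypothetical names introduced for illustration, and the sketch assumes that every chunk carries a key identifying the document it belongs to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in for index contents; not XTF's or Lucene's API.
class IncompleteDocSketch {

    /** One stored chunk: the document it belongs to, and whether it is the docInfo summary. */
    static final class Chunk {
        final String docKey;      // assumed: a key shared by all chunks of one document
        final boolean isDocInfo;  // true only for the summary chunk written last

        Chunk(String docKey, boolean isDocInfo) {
            this.docKey = docKey;
            this.isDocInfo = isDocInfo;
        }
    }

    /** Returns the positions of all chunks whose document has no docInfo chunk. */
    static List<Integer> findOrphanChunks(List<Chunk> chunks) {
        // Pass 1: collect the keys of documents that do have a summary chunk.
        Set<String> complete = new HashSet<>();
        for (Chunk c : chunks) {
            if (c.isDocInfo) {
                complete.add(c.docKey);
            }
        }

        // Pass 2: any chunk whose key was not collected belongs to an incomplete document.
        List<Integer> orphans = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            if (!complete.contains(chunks.get(i).docKey)) {
                orphans.add(i);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
            new Chunk("a.xml", false),
            new Chunk("a.xml", false),
            new Chunk("a.xml", true),    // summary written: a.xml is complete
            new Chunk("b.xml", false),
            new Chunk("b.xml", false));  // indexer stopped before the summary for b.xml

        // Prints the positions of the chunks that a purge would remove.
        System.out.println(findOrphanChunks(chunks));
    }
}
```

Two passes are used so that the result does not depend on where the summary chunk sits relative to its text chunks.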

To use this class, create an instance and call the processDir() method on a directory containing an index. The directory passed may also be a root directory containing many index sub-directories.


Constructor Summary
IdxTreeCleaner()
           
 
Method Summary
 void cleanIndex(File idxDirToClean)
          Performs the actual work of removing incomplete documents from an index.
 void processDir(File dir)
          Create an IdxTreeCleaner instance and call this method to remove "incomplete" documents from an index directory or a root directory containing multiple indices.
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

IdxTreeCleaner

public IdxTreeCleaner()
Method Detail

processDir

public void processDir(File dir)
                throws Exception
Create an IdxTreeCleaner instance and call this method to remove "incomplete" documents from an index directory or a root directory containing multiple indices.

Parameters:
dir - The index database directory to clean. May be a directory containing a single index, or the root directory of a tree containing multiple indices.

Throws:
Exception - Passes back any exceptions generated by the cleanIndex() method, which is called for each index sub-directory found.

Notes:
This method also calls itself recursively to process potential index sub-directories below the passed directory.

For an explanation of "complete" and "incomplete" documents, see the IdxTreeCleaner class description.
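
The recursive traversal described in the notes above might look roughly like the following sketch. It only collects candidate index directories and does not clean anything. The class and method names are hypothetical, and the test used to recognize an index directory (the presence of a file whose name starts with "segments") is an assumption for illustration; the check that processDir() actually performs is not documented here.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a recursive index-directory walk; not the XTF source.
class IndexDirWalkSketch {

    /** Assumption: a directory holds an index if it contains a "segments*" file. */
    static boolean looksLikeIndex(File dir) {
        String[] names = dir.list();
        if (names == null) {
            return false; // not a directory, or not readable
        }
        for (String name : names) {
            if (name.startsWith("segments")) {
                return true;
            }
        }
        return false;
    }

    /** Adds dir to found if it looks like an index, then recurses into its sub-directories. */
    static void collectIndexDirs(File dir, List<File> found) {
        if (!dir.isDirectory()) {
            return;
        }
        if (looksLikeIndex(dir)) {
            found.add(dir); // a real cleaner would process this index here
        }
        File[] children = dir.listFiles();
        if (children == null) {
            return;
        }
        for (File child : children) {
            if (child.isDirectory()) {
                collectIndexDirs(child, found);
            }
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        List<File> found = new ArrayList<>();
        collectIndexDirs(root, found);
        for (File idxDir : found) {
            System.out.println("Candidate index directory: " + idxDir.getPath());
        }
    }
}
```

Because the walk checks the starting directory before descending, the same call handles both a single index directory and a root directory containing many indices.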

cleanIndex

public void cleanIndex(File idxDirToClean)
                throws Exception
Performs the actual work of removing incomplete documents from an index.

Parameters:
idxDirToClean - The index database directory to clean. This directory must contain a single Lucene index.

Throws:
Exception - Passes back any exceptions generated by Lucene during the opening of, reading of, or writing to the specified index.

For an explanation of "complete" and "incomplete" documents, see the IdxTreeCleaner class description.