org.cdlib.xtf.textIndexer
Class IdxTreeCleaner
Object
IdxTreeCleaner
public class IdxTreeCleaner
- extends Object
This class purges "incomplete" documents from a Lucene index.
A "complete" document consists of all the overlapping text chunks for the
document plus a special docInfo chunk that provides summary information
about the rest of the chunks in the document. Because the summary chunk is
the last chunk written for a document, any early termination of the indexer
(because of an error or a user abort) leaves text chunks in the index
without a summary chunk. Such a document is called "incomplete".
Since the search engine relies on the summary chunk to correctly search
overlapping text chunks, the absence of the summary chunk will cause
problems. Consequently, this class is used to purge text chunks from the
index that do not have a corresponding summary chunk.
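The purge rule described above can be illustrated with a small in-memory model. This is only a sketch, not the XTF implementation: the `Chunk` type, its `docKey` and `isDocInfo` fields, and the `purgeIncomplete` method are hypothetical names introduced for illustration, and no Lucene calls are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** In-memory model of the purge rule. Hypothetical; not the XTF code. */
class IncompleteDocPurgeSketch {

    /** Hypothetical stand-in for an indexed chunk. */
    static final class Chunk {
        final String docKey;      // identifies the document the chunk belongs to
        final boolean isDocInfo;  // true for the summary (docInfo) chunk

        Chunk(String docKey, boolean isDocInfo) {
            this.docKey = docKey;
            this.isDocInfo = isDocInfo;
        }
    }

    /** Returns only the chunks whose document has a summary chunk. */
    static List<Chunk> purgeIncomplete(List<Chunk> chunks) {
        // Pass 1: collect the keys of documents that have a summary chunk.
        Set<String> completeDocs = new HashSet<>();
        for (Chunk c : chunks) {
            if (c.isDocInfo) {
                completeDocs.add(c.docKey);
            }
        }
        // Pass 2: keep a chunk only if its document is complete.
        List<Chunk> kept = new ArrayList<>();
        for (Chunk c : chunks) {
            if (completeDocs.contains(c.docKey)) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
            new Chunk("docA", false),
            new Chunk("docA", true),    // docA is complete
            new Chunk("docB", false));  // docB has no summary chunk
        System.out.println("chunks kept: " + purgeIncomplete(chunks).size());
    }
}
```

Two passes are used because the summary chunk is written last, so a chunk cannot be judged complete or incomplete until every chunk has been seen.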
To use this class, create an instance and call the processDir()
method on a directory containing an index. The directory passed
may also be a root directory with many index sub-directories.
Method Summary
void cleanIndex(File idxDirToClean)
Performs the actual work of removing incomplete documents from an index.
void processDir(File dir)
Create an IdxTreeCleaner instance and call this method to
remove "incomplete" documents from an index directory or a root
directory containing multiple indices.
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
IdxTreeCleaner
public IdxTreeCleaner()
processDir
public void processDir(File dir)
throws Exception
- Create an IdxTreeCleaner instance and call this method to
remove "incomplete" documents from an index directory or a root
directory containing multiple indices.
- Parameters:
dir - The index database directory to clean. May be a directory
containing a single index, or the root directory of a
tree containing multiple indices.
- Throws:
Exception - Passes back any exceptions generated by the
cleanIndex() method, which is called for
each index sub-directory found.
- Notes:
- This method also calls itself recursively to process
potential index sub-directories below the passed
directory.
For an explanation of "complete" and "incomplete" documents, see the
IdxTreeCleaner class description.
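The recursive descent noted above can be sketched as follows. This is a hedged illustration, not the XTF source: the `IndexTreeWalkSketch` class and its methods are hypothetical, and the test for "is this an index directory" (a file whose name starts with `segments`) is an assumption about how a Lucene index can be recognized, not something this page specifies. The sketch only collects candidate directories; the real class would call cleanIndex() on each one.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of walking a tree of index directories. */
class IndexTreeWalkSketch {

    /** Assumption: an index directory contains a file named "segments...". */
    static boolean looksLikeIndex(File dir) {
        String[] names = dir.list((d, name) -> name.startsWith("segments"));
        return names != null && names.length > 0;
    }

    /** Adds dir, and any directory below it, that looks like an index. */
    static void collectIndexDirs(File dir, List<File> found) {
        if (!dir.isDirectory()) {
            return;
        }
        if (looksLikeIndex(dir)) {
            found.add(dir); // the real class would clean the index here
        }
        File[] children = dir.listFiles(File::isDirectory);
        if (children == null) {
            return; // directory could not be read
        }
        for (File child : children) {
            collectIndexDirs(child, found); // recurse into sub-directories
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        List<File> found = new ArrayList<>();
        collectIndexDirs(root, found);
        for (File f : found) {
            System.out.println(f.getPath());
        }
    }
}
```

With XTF on the classpath, the equivalent real call is simply `new IdxTreeCleaner().processDir(dir)`, which may throw `Exception` as documented above.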
cleanIndex
public void cleanIndex(File idxDirToClean)
throws Exception
- Performs the actual work of removing incomplete documents from an index.
- Parameters:
idxDirToClean - The index database directory to clean. This directory
must contain a single Lucene index.
- Throws:
Exception - Passes back any exceptions generated by Lucene
during the opening of, reading of, or writing to
the specified index.
For an explanation of "complete" and "incomplete" documents, see the
IdxTreeCleaner class description.