org.cdlib.xtf.textIndexer
Class SrcTreeProcessor

Object
  extended by SrcTreeProcessor

public class SrcTreeProcessor
extends Object

This class is the main processing shell for files in the source text tree. It optimizes Lucene database access by opening the index once at the beginning, processing all the source files in the source tree (skipping any non-source XML files along the way), and closing the database at the end.

Internally, this class uses the XMLTextProcessor class to split the source files into chunks and add them to the Lucene index.
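
The open-once, process-everything, close-once pattern described above can be sketched as follows. This is a minimal illustration and not the XTF implementation: IndexSession and LifecycleSketch are hypothetical stand-ins invented for the example, and the real class delegates the index work to XMLTextProcessor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an index handle; not part of XTF.
class IndexSession {
    final List<String> log = new ArrayList<>();
    private boolean open;

    void open() {
        open = true;
        log.add("open");
    }

    void add(String doc) {
        if (!open) {
            throw new IllegalStateException("index is not open");
        }
        log.add("add:" + doc);
    }

    void close() {
        open = false;
        log.add("close");
    }
}

class LifecycleSketch {
    // Opens the index once, adds every document, and closes once at the end,
    // even if adding a document fails part-way through.
    static List<String> indexAll(List<String> docs) {
        IndexSession session = new IndexSession();
        session.open();
        try {
            for (String doc : docs) {
                session.add(doc);
            }
        } finally {
            session.close();
        }
        return session.log;
    }

    public static void main(String[] args) {
        System.out.println(indexAll(Arrays.asList("a.xml", "b.xml")));
    }
}
```

The point of the design is that the cost of opening and closing the index is paid once per run rather than once per source file.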


Nested Class Summary
(package private) static class SrcTreeProcessor.CacheEntry
          One entry in the docSelector cache
 
Field Summary
private  IndexerConfig cfgInfo
           
private  StringBuffer dirBuf
           
private  StringBuffer docBuf
           
private  HashMap docSelCache
           
private  File docSelCacheFile
           
private  String docSelDependencies
           
private  Templates docSelector
           
private  String docSelPath
           
private  int nScanned
           
private  StylesheetCache stylesheetCache
           
private  XMLTextProcessor textProcessor
           
 
Constructor Summary
SrcTreeProcessor()
          Default constructor.
 
Method Summary
 void close()
          Indexing close function.
 void loadCache(IndexerConfig cfgInfo)
          Load the previous docSelector cache.
 void open(IndexerConfig cfgInfo)
          Indexing open function.
 void processDir(File currFile, int level)
          Process a directory containing source XML files.
 boolean processFile(String dir, EasyNode parentEl)
          Process file.
 void saveCache()
          Save the docSelector cache.
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

cfgInfo

private IndexerConfig cfgInfo

textProcessor

private XMLTextProcessor textProcessor

stylesheetCache

private StylesheetCache stylesheetCache

docSelector

private Templates docSelector

nScanned

private int nScanned

docBuf

private StringBuffer docBuf

dirBuf

private StringBuffer dirBuf

docSelPath

private String docSelPath

docSelDependencies

private String docSelDependencies

docSelCacheFile

private File docSelCacheFile

docSelCache

private HashMap docSelCache

Constructor Detail

SrcTreeProcessor

public SrcTreeProcessor()
Default constructor.

Instantiates the XMLTextProcessor used internally to process individual XML source files.

Method Detail

open

public void open(IndexerConfig cfgInfo)
          throws Exception
Indexing open function.

Calls the XMLTextProcessor open() method to create or open the Lucene index.

Parameters:
cfgInfo - The IndexerConfig that identifies the Lucene index, source text tree, and other parameters required to perform indexing.

Throws:
IOException - Any I/O exceptions generated by the XMLTextProcessor open() method.

Exception

close

public void close()
           throws IOException
Indexing close function.

Calls the XMLTextProcessor processQueuedTexts() method to flush all pending Lucene writes to disk, then calls the XMLTextProcessor close() method to close the Lucene index.

Throws:
IOException - Any I/O exceptions generated by the XMLTextProcessor close() method.
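
The flush-then-close ordering described for close() can be illustrated with a small sketch. QueuedWriterSketch is a hypothetical class invented for this example; its flushQueued() method only plays the role that the documentation assigns to processQueuedTexts(), and nothing here reflects how XMLTextProcessor is actually written.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical writer that queues work and flushes it before closing.
class QueuedWriterSketch {
    private final Deque<String> queued = new ArrayDeque<>();
    final List<String> written = new ArrayList<>();
    private boolean closed;

    void queue(String text) {
        if (closed) {
            throw new IllegalStateException("writer is closed");
        }
        queued.add(text);
    }

    // Drains the queue in insertion order into the "written" list.
    void flushQueued() {
        while (!queued.isEmpty()) {
            written.add(queued.poll());
        }
    }

    // Flush first, then mark closed, so no queued text is lost.
    void close() {
        flushQueued();
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}
```

Reversing the two steps in close() would leave queued items unwritten, which is why the flush comes first.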


loadCache

public void loadCache(IndexerConfig cfgInfo)
Load the previous docSelector cache.

Parameters:
cfgInfo - The IndexerConfig that identifies the Lucene index, source text tree, and other parameters required to perform indexing.


saveCache

public void saveCache()
Save the docSelector cache.
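
The documentation does not state how the docSelector cache is stored on disk. As a rough sketch only, assuming a HashMap persisted to a File with standard Java serialization, a load/save pair might look like the following. CacheFileSketch, its String-to-String map, and the fall-back to an empty cache are all assumptions made for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;

// Hypothetical cache persistence; the real on-disk format is not documented here.
class CacheFileSketch {
    // Writes the whole map to the given file.
    static void save(HashMap<String, String> cache, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(cache);
        }
    }

    // Reads the map back; a missing or unreadable file yields an empty cache.
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(File file) {
        if (!file.isFile()) {
            return new HashMap<>();
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            return new HashMap<>(); // assumption: start over rather than fail
        }
    }

    // Saves to a temporary file and loads the result back.
    static HashMap<String, String> roundTrip(HashMap<String, String> cache) {
        try {
            File file = File.createTempFile("docSelCache", ".bin");
            file.deleteOnExit();
            save(cache, file);
            return load(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```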


processDir

public void processDir(File currFile,
                       int level)
                throws Exception
Process a directory containing source XML files.

This method iterates through a source directory's contents, indexing any valid files it finds and processing any sub-directories.

Parameters:
currFile - The current file to be processed. This may be a source XML file, a file to be skipped, or a subdirectory.

level - The directory level currently being processed (0 for the top level, 1 for its children, and so on).

Throws:
Exception - Any exceptions generated internally by the File class or the XMLTextProcessor class.
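
A recursive walk that tracks the directory level, in the spirit of the description above, might look like this. It is a sketch under stated assumptions and not the XTF code: DirWalkSketch is a hypothetical class, the ".xml" suffix test stands in for whatever rules decide which files are valid, and files are tagged with the level of the directory that contains them.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DirWalkSketch {
    // Visits the contents of 'dir'. Sub-directories are descended into with
    // level + 1; files ending in ".xml" are recorded as "level:name".
    static void walk(File dir, int level, List<String> found) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or not readable
        }
        Arrays.sort(children); // deterministic order for the example
        for (File child : children) {
            if (child.isDirectory()) {
                walk(child, level + 1, found);
            } else if (child.getName().endsWith(".xml")) {
                found.add(level + ":" + child.getName());
            }
        }
    }

    // Builds a small temporary tree and walks it from level 0.
    static List<String> demo() {
        try {
            Path root = Files.createTempDirectory("srctree");
            Files.createFile(root.resolve("a.xml"));
            Files.createFile(root.resolve("notes.txt"));
            Path sub = Files.createDirectory(root.resolve("sub"));
            Files.createFile(sub.resolve("b.xml"));
            List<String> found = new ArrayList<>();
            walk(root.toFile(), 0, found);
            return found;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the demo tree, notes.txt is skipped, a.xml is found at level 0, and b.xml is found at level 1.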


processFile

public boolean processFile(String dir,
                           EasyNode parentEl)
                    throws Exception
Process file.

This method processes a source file, including source text XML files, PDF files, etc.

Parameters:
parentEl - DOM element representing the current file to be processed. This may be a source XML file, PDF file, etc.

Returns:
true if the document was processed, false if it was skipped due to skipping rules.

Throws:
Exception - Any exceptions generated internally by the File class or the XMLTextProcessor class.
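
The boolean result lets a caller distinguish indexed documents from skipped ones. The sketch below shows one way a caller might tally the two outcomes. SkipTallySketch is hypothetical, and its extension-based rule is only a placeholder for the real skipping rules, which this page does not specify.

```java
import java.util.List;
import java.util.Locale;

class SkipTallySketch {
    // Placeholder skip rule (an assumption): only ".xml" and ".pdf" names
    // count as processed; everything else is treated as skipped.
    static boolean processFile(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".xml") || lower.endsWith(".pdf");
    }

    // Returns a two-element array: {processed count, skipped count}.
    static int[] tally(List<String> names) {
        int processed = 0;
        int skipped = 0;
        for (String name : names) {
            if (processFile(name)) {
                processed++;
            } else {
                skipped++;
            }
        }
        return new int[] { processed, skipped };
    }
}
```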