textIndexer Programming


Introduction

The textIndexer tool creates or updates a document search index whenever documents are added to, updated in, or removed from the document library. If we isolate and zoom in on the textIndexer portion of the XTF Overview Diagram shown in the Introduction, we see something like this:
[Figure: textIndexer data flow (textIndexerDataFlow.gif)]
The diagram shows that the textIndexer uses a Document Selector stylesheet to select which files in the document library need to be indexed. For non-XML document files, the text to index is extracted and converted to XML. This base XML is then processed by the Document Pre-Filter stylesheet, which adds meta-data and/or sectioning information to the text. The resulting filtered XML is passed on to the actual Text Indexer Engine, which breaks the text up into smaller overlapping chunks and adds them to a Lucene-based word index. The crossQuery servlet can then use the index to quickly locate files in the document library that contain the text requested by the user. Optionally, the dynaXML servlet can also use the index to highlight matches in the context of their original XML documents.
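
To make the "smaller overlapping chunks" step more concrete, here is a minimal sketch in Python. It is not the Text Indexer Engine's actual code (XTF is written in Java on top of Lucene); the function names chunk_words and build_index, the whitespace tokenization, and the chunk size and overlap values are all assumptions chosen for illustration.

```python
def chunk_words(text, chunk_size=8, overlap=3):
    """Split text into word chunks where each chunk shares `overlap`
    words with the chunk before it. Sizes are illustrative only."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()          # naive whitespace tokenization (assumption)
    step = chunk_size - overlap   # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break                 # the last chunk already reaches the end
    return chunks


def build_index(chunks):
    """Build a toy word index: lowercase word -> list of chunk numbers."""
    index = {}
    for number, chunk in enumerate(chunks):
        for word in chunk:
            postings = index.setdefault(word.lower(), [])
            if not postings or postings[-1] != number:
                postings.append(number)
    return index


if __name__ == "__main__":
    chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=2)
    for number, chunk in enumerate(chunks):
        print(number, chunk)
    print(build_index(chunks)["e"])
```

Because neighbouring chunks share words, a short phrase that straddles the end of one chunk still appears whole in the next one, as long as the phrase is no longer than the overlap plus one word. That is the general motivation for overlapping chunks in a sketch like this.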

The textIndexer can handle many documents of various types, each filtered in a different way. The following diagram shows how it decides what to do with each file.
[Figure: textIndexer decision tree (textIndexerDecisionTree.gif)]
The textIndexer.conf file, the Document Selector stylesheet, and the Pre-Filter stylesheet together define how the textIndexer performs the document indexing process. A complete discussion of the textIndexer.conf file appears in the XTF Deployment Guide. The next two subsections discuss the inner workings of the Document Selector and Pre-Filter stylesheets.