Configuring the textIndexer Tool
The textIndexer tool is a command line tool that indexes the XML, PDF, and other documents in an XTF document library. The index it creates is used by the crossQuery servlet to search for documents based on user queries. The organization of the textIndexer tool can be illustrated as follows:
Running the textIndexer
To run the textIndexer from the command line, switch to the bin subdirectory just below the main XTF directory and use the following command. Note: make sure you've set the
XTF_HOME environment variable correctly first (see above).
textIndexer {-config config-file-path}
{-incremental | -clean}
{-optimize | -nooptimize}
{-updatespell | -noupdatespell}
{-buildlazy | -nobuildlazy}
{-trace trace-level}
{-dir subdirectory}
-index index-name
Only the last argument, the
-index index-name argument, is required. The remaining arguments (shown in braces) are optional arguments, and in the case of paired options (e.g.
-optimize vs
-nooptimize) the first is the default. Note that the order of the arguments is important, since each occurrence of
-index index-name causes the specified index to be updated using the previously set argument values.
The arguments passed to the textIndexer have the following meanings:
-config config-file-path |
This argument is an optional argument that identifies an XML configuration file to use when indexing. If this argument is omitted, the textIndexer will use the textIndexer.conf file found in the XTF_HOME/conf directory. If used, this argument must be the first argument passed to the text indexer. For a complete description of the contents of the configuration file, see the following section. |
-incremental | -clean |
This is an optional argument that can be set to either -clean or -incremental. This flag tells the text indexer whether the document indexes should be rebuilt from scratch (-clean) or should be updated incrementally (-incremental). If this argument is not specified, the default behavior is to incrementally update the index. |
-optimize | -nooptimize |
This is an optional argument that specifies whether or not the indexer should optimize the indexes after they are built. Optimization improves query speed, but can take a very long time to complete if an index is large. If this argument is not specified, the default behavior is to optimize indexes after they are updated. |
-updatespell | -noupdatespell |
This is an optional argument that, if not specified, defaults to -updatespell. The -updatespell argument tells the textIndexer to update the spelling correction dictionary once all of the documents have been indexed. Specifying -noupdatespell avoids updating the spelling dictionary; these updates will be made the next time the indexer is run with -updatespell. This option is rarely used, as spelling updates are quite fast. |
-buildlazy | -nobuildlazy |
This is an optional argument that, if not specified, defaults to -buildlazy. The -buildlazy argument tells the textIndexer to pre-build lazy tree files. Lazy tree files speed the retrieval of documents by the dynaXML servlet, and also allow resulting matches within documents to be highlighted in context. Pre-building lazy tree files at index time ensures the fastest possible retrieval time for any document in the library, but at the expense of longer indexing times and additional hard-disk space usage. To avoid pre-building lazy tree files for all documents in a library, pass the -nobuildlazy flag instead. Doing so will cause the dynaXML servlet to build lazy tree files for documents only when the documents are requested. While this approach conserves hard-disk space, it comes at the expense of longer processing times when dynaXML retrieves a document for which it has not yet built lazy tree information. WARNING: If the -nobuildlazy flag is used, the docSelector.xsl stylesheet used by the textIndexer and the docReqParser.xsl stylesheet used by the dynaXML servlet must both specify the same pre-filter, or else chaos may ensue. |
-trace trace-level |
This is an optional argument that sets the level of output displayed by the textIndexer. The possible output levels are defined as follows:
errors: Only error messages are displayed.
warnings: Both error and warning messages are displayed.
info: Error, warning, and informational messages are displayed.
debug: Low level debug output is displayed in addition to error, warning, and informational messages.
If this argument is not specified, the textIndexer defaults to displaying informational (info) level messages. |
-dir subdirectory |
This argument is an optional argument, and identifies a sub-directory of an index tree to re-index. The primary purpose of this argument is to allow only portions of an index to be updated to save time. When specified, the value of this argument should be the path to the sub-directory relative to the base directory set for the index specified in the -index index-name argument that immediately follows. |
-index index-name |
This argument is a required argument that identifies the name of the index to be created/updated. The name must be one of the index names contained in the configuration file specified as the first argument. Note that this argument also starts the (re)indexing process for the given index using the values set by the arguments that precede it. |
A simple example of command line parameters for a text indexer run might look like this:
$ textIndexer -clean -index default
This example uses the default config file,
textIndexer.conf, located in the
XTF_HOME/conf subdirectory. It assumes that the config file contains an entry for an index called
default, and that the user wants the index to be rebuilt from scratch and then optimized (because of the presence of the
-clean argument, and the absence of the
-nooptimize argument.)
As mentioned above, the
-config config-file-path argument (if used) must be specified first. After that, all the arguments may be used as a set one or more times to update multiple indexes. For example:
$ textIndexer -config conf/myIndexes.conf -clean -index default -incremental -nooptimize -index abstracts
In this example, the
myIndexes.conf file defines two indexes called
default and
abstracts, with the
default index created from scratch and optimized at the end, while the
abstracts index is updated incrementally and not optimized.
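The way each -index argument picks up the options that precede it can be modeled in a few lines of Python. This is only an illustrative sketch, not the textIndexer's actual argument parser: the function name plan_runs is hypothetical, only a subset of the options is handled, and the sketch assumes that an option stays in effect for later indexes until it is overridden.

```python
# Illustrative model only; not the textIndexer's real argument parser.
# Assumption: an option stays in effect for later indexes until overridden.
DEFAULTS = {"config": "conf/textIndexer.conf", "mode": "incremental", "optimize": True}


def plan_runs(args):
    """Return one (index_name, settings) pair per -index argument."""
    settings = dict(DEFAULTS)
    runs = []
    it = iter(args)
    for arg in it:
        if arg == "-config":
            settings["config"] = next(it)
        elif arg in ("-clean", "-incremental"):
            settings["mode"] = arg.lstrip("-")
        elif arg in ("-optimize", "-nooptimize"):
            settings["optimize"] = arg == "-optimize"
        elif arg == "-index":
            # -index triggers a run using a snapshot of the settings seen so far.
            runs.append((next(it), dict(settings)))
    return runs


command = ("-config conf/myIndexes.conf -clean -index default "
           "-incremental -nooptimize -index abstracts")
runs = plan_runs(command.split())
for name, settings in runs:
    print(name, settings["mode"], "optimize" if settings["optimize"] else "no optimize")
```

Run against the two-index command line above, the model yields one run for default (clean, optimized) and one for abstracts (incremental, not optimized).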
The textIndexer Configuration File
As noted above, the first command line argument passed to the
textIndexer tool can be the name of a configuration file. This file contains the names of one or more indexes and their associated parameters. The format of the configuration file is as follows:
<?xml version="1.0" encoding="utf-8"?>
<textIndexer-config>
<index name="NameOfIndex">
<src path="LocationOfSourceDocuments"/>
<db path="LocationOfIndexDatabase"/>
<chunk size="TextChunkSize" overlap="TextChunkOverlap"/>
<stopwords path="LocationOfStopWordFile"/>
<docselector path="LocationOfDocSelector"/>
<pluralmap path="LocationOfPluralMapFile"/>
<accentmap path="LocationOfAccentMapFile"/>
<spellcheck createDict="YesOrNo"/>
</index>
…
<index name="NameOfIndexN">
…
</index>
</textIndexer-config>
As the listing shows, the textIndexer configuration file is an XML file containing a configuration type tag (the tag
<textIndexer-config>), followed by one or more index definition blocks. For each index block in the file, the following elements are defined:
<index name="NameOfIndex"> |
This element specifies a name for an index definition block. The name may be any combination of digits and/or letters and the underscore (_) character. Punctuation and other symbols are not permitted, and neither is the use of the space character. Also, the index name may only be used for one index block in any given configuration (if it appears more than once, the first occurrence is used, and the remaining ones are ignored.) This index name is the name passed on the command line to the textIndexer to identify which indexes need to be processed. |
<src path="LocationOfSourceDocuments"/> |
This element specifies the file-system path where the documents to be indexed are located. The path specified for an index must be a valid path for the operating system on which the tool is being run (e.g., Windows, Mac, Linux, etc.) If a relative path is used, it is considered to be relative to the base XTF installation directory (i.e., XTF_HOME.) |
<db path="LocationOfIndexDatabase"/> |
This element specifies the file-system path where the database for the named index should be located. If the path does not exist or there are no database files located there, the textIndexer will automatically create the necessary directories and database files. As with the source path, if a relative path is used, it is considered to be relative to the base XTF installation directory (i.e., XTF_HOME.) |
<chunk size="TextChunkSize" overlap="TextChunkOverlap"/> |
The textIndexer tool splits source documents into smaller chunks of text when adding the document to its index. Doing so makes proximity searches and the display of their resulting summary "blurbs" faster by limiting how much of the source document must be read into memory. See the discussion of chunk sizing below. |
<docselector path="LocationOfDocSelector"/> |
This tag identifies the location of an XSLT stylesheet that describes which source documents should be indexed, what pre-filter (if any) should be used on the selected documents, and what display formatter should be used to display the document when it is retrieved by the user. The path specified for the document selector must be a valid path and file name for the operating system on which the tool is being run (e.g., Windows, Mac, Linux, etc.) If a relative path is used, it is considered to be relative to the base XTF installation directory (i.e., XTF_HOME.) (Note: The document selector must be specified for an index. If it is not, an error is generated, and no document indexing will be performed.) |
<stopwords path="StopWordFileLocation"/> or <stopwords list="StopWordList"/> |
The first variant of this attribute specifies the path to a file containing a list of words that the textIndexer should not add to the index. As with other paths, if a relative path is used, it is considered to be relative to the base XTF installation directory (i.e., XTF_HOME.) The second variant simply accepts the list of stop words as a string, with stop words separated from each other by spaces (i.e., "a an and the".) See the discussion of stop words below for more information. |
<pluralmap path="LocationOfPluralMapFile"/> |
This tag identifies the location of a file containing a list of plural words and their corresponding singular forms that should be considered the same by the textIndexer. This file is primarily used to improve search results by finding plural forms of words when only the singular form has been specified (or vice versa.) See the discussion of plural mapping below. |
<accentmap path="LocationOfAccentMapFile"/> |
This tag identifies the location of a file containing a list of accented characters and their corresponding un-accented forms that should be considered the same by the textIndexer. This file is primarily used to map accented characters to un-accented ones so that searches can still be performed on words with characters not implemented on localized keyboards. See the discussion of accent mapping below. |
<spellcheck createDict="YesOrNo"/> |
This tag tells the indexer whether to create/update the spelling correction dictionary when the main indexing phase is complete. If no dictionary is created, then spelling correction will not be available in crossQuery. For more information on how to produce spelling suggestions, please see the Programming Guide. Note that dictionary creation does take some time, though it is generally a small proportion of the total indexing time. |
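The naming and required-element rules described above can be checked before running the indexer. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name check_config and the sample XML are hypothetical, and the checks cover only the rules stated in this section (valid index names, unique names, and a mandatory document selector).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample configuration used only for this illustration.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<textIndexer-config>
  <index name="default">
    <src path="./data"/>
    <db path="./index"/>
    <chunk size="200" overlap="20"/>
    <docselector path="./style/textIndexer/docSelector.xsl"/>
  </index>
  <index name="bad name">
    <src path="./data2"/>
  </index>
</textIndexer-config>"""

# Index names: letters, digits and underscore only.
NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def check_config(xml_text):
    """Return a list of human-readable problems found in the configuration."""
    problems = []
    seen = set()
    root = ET.fromstring(xml_text.encode("utf-8"))
    for index in root.findall("index"):
        name = index.get("name", "")
        if not NAME_RE.fullmatch(name):
            problems.append(f"{name!r}: invalid index name")
        if name in seen:
            # Per the text above, only the first block with a given name is used.
            problems.append(f"{name!r}: duplicate index name")
        seen.add(name)
        if index.find("docselector") is None:
            problems.append(f"{name!r}: missing <docselector>")
    return problems


problems = check_config(SAMPLE)
for problem in problems:
    print(problem)
```

A check like this is purely advisory; the textIndexer itself remains the authority on what it accepts.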
Chunk sizing
The
<chunk> element's
size attribute defines (as a number of words) how large the chunk size should be. Note that if the selected chunk size is too large, then time will be wasted reading too much text from disk. Inversely, if the chunk size is too small, too much time will be spent assembling the summary "blurb" from its component chunks. As a guideline, a chunk size of about 200 words was found through experimentation to give good overall performance with "blurbs" of 80 characters (about 15 words) in length. (
Note: The chunk size specified for this attribute should be equal to or more than two words. If it is not, the textIndexer will force it to be two.)
The
overlap attribute controls the amount of overlap between chunks. By overlapping adjacent chunks of text in the index, proximity searches can still be performed, even though the source document doesn't appear in the index in one contiguous piece. This attribute defines (as a number of words) how large the chunk overlap should be. It should be mentioned that the selected chunk overlap effectively defines the maximum distance (in words) that can exist between two words in a document and still produce a search match. Consequently if you have a chunk overlap of five words, the maximum distance between two words that will result in a proximity match is five words. As a guideline, a chunk overlap of about 20 words for a chunk size of 200 words gives fairly good results. (
Note: The chunk overlap specified for this attribute should be equal to or less than half the chunk size. If it is not, the textIndexer will force it to be half.)
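To make the interaction between size and overlap concrete, here is a small Python sketch of overlapping word chunks. It is not the textIndexer's actual algorithm: the function names are hypothetical, the sketch assumes that consecutive chunks simply share their last and first overlap words, and it assumes that "half" means integer division for odd chunk sizes.

```python
def clamp(size, overlap):
    """Apply the documented limits: size >= 2, overlap <= half the size."""
    size = max(size, 2)
    overlap = max(0, min(overlap, size // 2))  # assumption: integer half
    return size, overlap


def chunk_words(words, size=200, overlap=20):
    """Split a word list into chunks where neighbors share `overlap` words."""
    if not words:
        return []
    size, overlap = clamp(size, overlap)
    step = size - overlap            # distance between chunk start positions
    chunks = []
    start = 0
    while True:
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break                    # this chunk reached the end of the text
        start += step
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in chunk_words(words, size=4, overlap=2):
    print(chunk)

print(clamp(1, 5))       # too-small size and too-large overlap are both corrected
print(len(chunk_words(list(range(1000)))))  # chunk count for 1000 words at 200/20
```

In this model, any two words that are at most overlap words apart always fall together in at least one chunk, which is why the overlap bounds the distance a proximity search can span.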
Stop words
Eliminating stop-words from an index improves search speed. This is because the search doesn't need to sift through all the occurrences of the stop-words in the document library. Consequently, adding words like
a,
an,
the,
and, etc. to the stop-word list, which occur frequently in documents but are relatively uninteresting to search for, can speed up the search for more interesting words enormously.
The one caveat is that searches for any single stop-word by itself will yield
no matches, so it is important to pick stop-words that people aren't usually interested in finding. Note however that due to an internal process called
bi-gramming, stop words will still be found as part of larger phrases, like
of in
Man of War, or
the in
The Terminator.
Finally, if a stop-word file is used, the file itself may be stored in GZIP compressed format (and correspondingly labeled with a
.gz extension) if desired.
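The effect of bi-gramming can be pictured with a simplified Python model. This is an assumption-laden illustration, not XTF's implementation: the function name index_terms, the "~" separator, and the tiny stop-word list are all chosen for the example. Stop words are never indexed on their own, but each stop word is fused with its neighbors so that phrases containing it remain searchable.

```python
# Simplified illustration of bi-gramming; not XTF's actual implementation.
STOP_WORDS = {"a", "an", "and", "of", "the"}   # tiny list for the example


def index_terms(text, stop_words=STOP_WORDS):
    """Return the terms this toy model would store for a piece of text."""
    words = text.lower().split()
    terms = []
    for i, word in enumerate(words):
        if word not in stop_words:
            terms.append(word)                  # ordinary words are kept as-is
        if i + 1 < len(words):
            nxt = words[i + 1]
            if word in stop_words or nxt in stop_words:
                # Fuse a stop word with its neighbor ("~" is an arbitrary choice).
                terms.append(word + "~" + nxt)
    return terms


print(index_terms("Man of War"))
print(index_terms("The Terminator"))
print(index_terms("the"))   # a stop word alone contributes nothing
```

In this model a search for the single word "of" finds nothing, while the phrase "Man of War" can still be matched through its fused pairs.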
Plural mapping
Since the XTF was designed to be language independent, the textIndexer relies on the Plural Map file to supply all plural information used during indexing, including simple plurals (i.e., words ending in s or es.) Consequently, if this file is not specified or if a plural mapping is not found in the file, the singular and plural forms of a word
will not both be matched in a search.
The contents of the file should consist of the plural form of a word followed by the 'pipe' character ( | ) and then by the singular form of the word. Like this:
cacti|cactus
cactuses|cactus
houses|house
mice|mouse
There should be only one plural to singular mapping per line in the file (although multiple plural mappings may exist for a singular form), and no white-space should be present within a definition line.
The path specified for the plural map file must be a valid path and file name for the operating system on which the tool is being run (e.g., Windows, Mac, Linux, etc.) If a relative path is used, it is considered to be relative to the base XTF installation directory (i.e.,
XTF_HOME).
Finally, the file itself may be stored in GZIP compressed format (and correspondingly labeled with a
.gz extension) if desired.
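As a quick way to sanity-check a plural map before indexing, the file format described above can be read with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and the file is assumed to be UTF-8 encoded, which this guide does not specify.

```python
import gzip


def open_map_file(path):
    """Open a map file as text, transparently handling a .gz extension."""
    # Assumption: the file is UTF-8 encoded.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def parse_plural_map(lines):
    """Build a {plural: singular} dictionary from 'plural|singular' lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        plural, singular = line.split("|", 1)
        mapping[plural] = singular
    return mapping


sample = ["cacti|cactus", "cactuses|cactus", "houses|house", "mice|mouse"]
plural_map = parse_plural_map(sample)
print(plural_map["cacti"], plural_map["cactuses"])   # both map to the same singular
```

For a real file, the two helpers combine as `with open_map_file(path) as f: plural_map = parse_plural_map(f)`.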
Accent mapping
Since the XTF was designed to be language independent, the textIndexer relies on the Accent Map file to supply all accent mapping information used during indexing. Consequently, if this file is not specified or if an accent mapping is not found in the file, the accented and un-accented forms of a letter
will not both be matched in a search.
The contents of the file should consist of the
16-bit Unicode value for an accented form of a letter followed by the 'pipe' character ( | ) and then by the
16-bit Unicode value for the un-accented form of the letter. Optionally, you can also add a comment at the end of the line, where a comment starts with a semicolon. Like so:
00C4|0041 ; Latin Capital Letter A With Diaeresis|Latin Capital Letter A
00C5|0041 ; Latin Capital Letter A With Ring Above|Latin Capital Letter A
00C6|0041 0045 ; Latin Capital Letter AE|Latin Capital Letters A E
00C7|0043 ; Latin Capital Letter C With Cedilla|Latin Capital Letter C
00C8|0045 ; Latin Capital Letter E With Grave|Latin Capital Letter E
There should be only one accented to unaccented mapping per line in the file (although multiple accent mappings may exist for an un-accented form), and no white-space should be present within a definition line. Note that single to multiple letter mappings are supported (such as mapping
ö to
oe).
The path specified for the accent map file must be a valid path and file name for the operating system on which the tool is being run (e.g., Windows, Mac, Linux, etc.) If a relative path is used, it is considered to be relative to the base XTF installation directory (i.e.,
XTF_HOME.) Finally, the file itself may be stored in GZIP compressed format (and correspondingly labeled with a .gz extension) if desired.
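The accent map format can be explored with a short Python sketch that parses the hexadecimal code points and applies the resulting mapping to a string. The function names parse_accent_map and fold are hypothetical, and the sketch only illustrates the file format; it makes no claim about how the textIndexer applies the map internally.

```python
def parse_accent_map(lines):
    """Build {accented_char: replacement_string} from accent map lines."""
    mapping = {}
    for line in lines:
        entry = line.split(";", 1)[0].strip()      # drop the optional comment
        if not entry:
            continue
        accented, plain = entry.split("|", 1)
        key = chr(int(accented.strip(), 16))
        # The right-hand side may list several code points (e.g. AE).
        mapping[key] = "".join(chr(int(code, 16)) for code in plain.split())
    return mapping


def fold(text, mapping):
    """Replace every mapped character in text with its un-accented form."""
    return "".join(mapping.get(ch, ch) for ch in text)


sample = [
    "00C4|0041 ; Latin Capital Letter A With Diaeresis|Latin Capital Letter A",
    "00C6|0041 0045 ; Latin Capital Letter AE|Latin Capital Letters A E",
    "00C7|0043 ; Latin Capital Letter C With Cedilla|Latin Capital Letter C",
]
accent_map = parse_accent_map(sample)
print(fold("\u00c4\u00c6\u00c7", accent_map))
```

Note that the comment is stripped before splitting on the pipe character, because the comments in the sample lines contain a pipe of their own.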
Example configuration file
An example of a working
textIndexer configuration file is the sample textIndexer.conf file found in the conf subdirectory immediately below the base XTF install directory. At the time of this writing it looked like this:
<?xml version="1.0" encoding="utf-8"?>
<textIndexer-config>
<index name="default">
<src path="./data"/>
<db path="./index"/>
<chunk size="200" overlap="20"/>
<docselector path="./style/textIndexer/docSelector.xsl"/>
<stopwords list="a an and are as at be but by for if in into is it
no not of on or s such t that the their then
there these they this to was will with"/>
<pluralmap path="./conf/pluralFolding/pluralMap.txt.gz"/>
<accentmap path="./conf/accentFolding/accentMap.txt"/>
<spellcheck createDict="yes"/>
</index>
</textIndexer-config>
In this example, we see that the index name is "default", that the source and index paths are
./data and
./index respectively (relative to the base XTF installation directory, XTF_HOME), the chunk size and overlap are the recommended sizes of 200 and 20 words, a list of stop-words is active for the index, and the example document selector has been specified.
The Document Selector
As previously mentioned, the textIndexer uses an XSLT-based
document selector stylesheet to identify which documents in the source tree should be indexed. Using an XSLT stylesheet allows arbitrary selection criteria to determine which source files to index, and which ones to ignore.
The
default document selector provided with the default XTF installation will index any document files whose names end in
.xml,
.htm /
.html,
.pdf,
.txt, or
.doc. You can use the provided stylesheet as the basis for defining your own document selection logic if you wish. However, writing and maintaining the XSLT document selector is beyond the scope of this deployment guide, and will not be discussed here. Please refer to the
Document Selector Programming section of the
XTF Programming Guide for more information about writing your own document selector.
The Document Pre-Filter
The textIndexer can make use of an XSLT-based document pre-filter stylesheet to restructure the source document just prior to indexing without changing the stored source document. Normally this feature is used to insert or mark meta-data for a document, or to insert additional sectioning attributes before indexing is performed.
Once a pre-filter has been defined, it is used on any source documents identified by the document selector stylesheet for an index. The default sample pre-filters are a good starting place. Here are some you may find interesting:
teiPreFilter.xsl
Because writing and maintaining the XSLT pre-filter is beyond the scope of this deployment guide, its format will not be discussed here. Please refer to the
XTF Programming Guide for more information about document pre-filters.
Adding/Deleting/Updating Source Documents
When new source documents are added to the document library or when existing documents are updated or deleted, the
textIndexer tool must be used to update the index. Fortunately, nothing special needs to be done to inform the textIndexer about which documents have been updated, added, or deleted. The program automatically detects the changes and takes the appropriate actions, and it avoids re-indexing documents that have not changed since the last time the index was generated.
Updating the index can occur any time after a document has been added, deleted or changed. You might wish to run the textIndexer tool manually or as part of a script whenever a change to the document library is made. Alternatively, on a Unix system you could schedule a daily or weekly
cron job that makes any accumulated changes to the document library and then re-indexes afterwards. Regardless of the method selected, it is best to re-index as soon as possible after the document library is updated so as to minimize the chance of a user performing a search with an index that doesn't match the actual document set.
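A scheduled re-index can be driven by a small wrapper script. The sketch below only assembles the command line from the arguments documented earlier and prints it. The function name build_command, the fallback installation path, and the assumption that the launcher lives at XTF_HOME/bin/textIndexer are illustrative and should be adapted to the actual installation.

```python
import os


def build_command(xtf_home, index, clean=False, optimize=True, config=None):
    """Return the textIndexer command line as a list of arguments."""
    # Assumption: the launcher script is XTF_HOME/bin/textIndexer.
    cmd = [os.path.join(xtf_home, "bin", "textIndexer")]
    if config:
        cmd += ["-config", config]           # must come first when used
    cmd.append("-clean" if clean else "-incremental")
    cmd.append("-optimize" if optimize else "-nooptimize")
    cmd += ["-index", index]                 # triggers the run, so it comes last
    return cmd


# The fallback path below is hypothetical; XTF_HOME should normally be set.
xtf_home = os.environ.get("XTF_HOME", "/opt/xtf")
cmd = build_command(xtf_home, "default")
print(" ".join(cmd))

# To actually run the indexer from a scheduled job, something like:
#   import subprocess
#   subprocess.run(cmd, cwd=os.path.join(xtf_home, "bin"), check=True)
```

Because incremental updating is the default, a nightly job built this way only re-indexes documents that have changed.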
Generating Index Summary Reports
There may be times when you want more information about an index created by the
textIndexer, and the
indexStats tool can be used to do this. The indexStats tool is a command-line tool that is invoked as follows:
$ indexStats {-config config-file-path} -index index-name
The
-config config-file-path argument is an optional argument that identifies an index configuration file to use. If none is specified, the default
XTF_HOME/conf/textIndexer.conf file is used. The
-index index-name argument is required, and identifies the index for which to generate summary information.
For the sample document library provided, the command line to generate a summary report would be:
$ indexStats -index default
The output generated by the indexStats for the sample index would look something like this:
IndexStats v2.x
Index: "default"
Configuration Info...
Chunk Size = 200, Overlap = 20
Index Path = C:/UCOP/Tomcat/webapps/xtf/index/
Data Path = C:/UCOP/Tomcat/webapps/xtf/data/
Stop Words = a an and are as at be but by for if in into is it no not of
on or s such t that the their then there these they this to was will with
Statistics...
Total Documents (Records) = 15
Total Chunks = 16537
Avg Chunks Per Doc/Rec = 1,033.6
Total Number of Src Files = 15
Avg Docs/Recs Per File = 1
Size of Lucene Index = 25.31 Mb
Size of Source Files = 17.81 Mb
Size of Lazy Trees = 14.17 Mb
Total Index Size = 39.48 Mb (Lucene + Lazy)
Done.
In this report, the
Avg Chunks Per Doc/Rec entry reports the average number of chunks that a document was sub-divided into when it was added into the index. The
Size of Lazy Trees entry indicates how large the current Lazy Tree is for the index, and identifies how much space is currently being used to maintain information to speed access to documents retrieved from the index. To learn more about document chunks and lazy tree functionality, see the
XTF Under the Hood Guide.
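If the report needs to be checked by a script, its "name = value" lines are easy to pick apart. The following Python sketch is an assumption-based illustration: the function names are hypothetical, the sample text is copied from the report above, and the exact layout of the real indexStats output may differ from what is shown here.

```python
# Sample lines copied from the report shown above.
REPORT = """Total Documents (Records) = 15
Total Chunks = 16537
Size of Lucene Index = 25.31 Mb
Size of Source Files = 17.81 Mb
Size of Lazy Trees = 14.17 Mb
Total Index Size = 39.48 Mb (Lucene + Lazy)
"""


def parse_report(text):
    """Collect 'name = value' lines of a report into a dictionary of strings."""
    stats = {}
    for line in text.splitlines():
        if " = " in line:
            key, value = line.split(" = ", 1)
            stats[key.strip()] = value.strip()
    return stats


def size_mb(value):
    """Extract the leading number from a value such as '25.31 Mb'."""
    return float(value.split()[0])


stats = parse_report(REPORT)
lucene = size_mb(stats["Size of Lucene Index"])
lazy = size_mb(stats["Size of Lazy Trees"])
total = size_mb(stats["Total Index Size"])
print(f"Lucene {lucene} + Lazy {lazy} = {lucene + lazy:.2f} (reported {total})")
```

The check confirms that the reported total index size is the sum of the Lucene index and lazy tree sizes.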