Document Selector Programming

This section describes how to program an XTF Document Selector stylesheet. If you want to skip the tutorial, you can check out the Reference section or the default docSelector.xsl code.

The primary purpose of the textIndexer Document Selector is to select which files in the document library are to be indexed. Since the Document Selector is an XSLT stylesheet, its input is in fact an XML fragment that identifies a single directory in the document library and the files that it contains. The Document Selector stylesheet is invoked one time for each subdirectory encountered in the document library, and the input it receives looks as follows:
<directory dirPath="DirectoryPath">
 
    <file fileName="FileName1"/>
    <file fileName="FileName2"/>
                    …
    <file fileName="FileNameN"/>
 
</directory>
The <directory> tag identifies a single directory in the document library, and its dirPath attribute (DirectoryPath above) specifies the directory's absolute file system path. Within the <directory> tag, each <file/> entry identifies one of the files found in the directory. Note that FileName1 through FileNameN do not contain any path information, since the path that applies to all of the file tags is already given by DirectoryPath.

It is the responsibility of the Document Selector XSLT code to output an XML fragment that identifies which of the files in the directory should be indexed. This output XML fragment should take the following form:
<indexFiles>
 
    <indexFile  fileName     = "FileName"
               {format       = "FileFormatID"}
               {preFilter    = "PreFilterPath"}
               {displayStyle = "DocumentFormatterPath"}/>
                    …
 
</indexFiles>
Note that the output XML consists of a single <indexFiles> container tag and one <indexFile/> tag for each document file that needs to be indexed. Within each of the <indexFile/> tags, the following attributes are defined:
fileName: This attribute identifies the name of a file to be indexed, and should be one of the file names received in the input XML fragment.
format: This is an optional attribute that defines the format of the file to be indexed. At this time, XML, PDF, HTML, Word, and Plain Text are supported by the textIndexer tool, and this attribute should be set to the strings XML, PDF, HTML, MSWord, or Text respectively, depending on the native format of the file. If this attribute is not specified, the textIndexer will try to infer the file type based on the extension for the file.
preFilter: This is an optional attribute that defines the Pre-Filter stylesheet that the textIndexer should use on this document file. If not specified, the text for this file will not be filtered before indexing. See the textIndexer Pre-Filter Programming section for more details about document pre-filtering.
displayStyle: This is an optional attribute that defines the Document Formatter stylesheet associated with the given file. If specified, the textIndexer will create a special cache that is used by the dynaXML servlet to display selected documents more quickly. If not specified, the cache for the current file is not created. For more details, see the discussion of Lazy Document Handling in the XTF Under the Hood guide.
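As a rough illustration of the output format, a filled-in fragment might look like the sketch below. The file name and the displayStyle path are hypothetical values chosen for this example, not taken from an actual XTF installation. The optional format attribute is omitted, so the textIndexer would fall back to inferring the type from the file extension as described above.

```xml
<indexFiles>
    <!-- "chapter1.xml" and the displayStyle path are hypothetical;
         substitute the names used in your own installation. -->
    <indexFile fileName="chapter1.xml"
               preFilter="style/textIndexer/default/defaultPreFilter.xsl"
               displayStyle="style/dynaXML/docFormatter/myFormatter.xsl"/>
</indexFiles>
```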
Using the XML input and output formats described above, we can build up a Document Selector that handles every type of file to be indexed. We'll start simple and work our way up. A very simple Document Selector might look something like this:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
 
   <xsl:template match="directory">
      <indexFiles>
         <xsl:apply-templates/>
      </indexFiles>
   </xsl:template>
 
   <xsl:template match="file">
      <xsl:choose>
         <xsl:when test="ends-with(@fileName, '.pdf')">
            <indexFile fileName="{@fileName}" type="PDF"/>
         </xsl:when>
      </xsl:choose>
   </xsl:template>
 
</xsl:stylesheet>
In this simple Document Selector example, the first line declares the xsl namespace used in the rest of the stylesheet. Next, the <xsl:template match="directory"> template matches the <directory> element in the input XML and writes a corresponding <indexFiles> element to the output XML. Inside it, <xsl:apply-templates/> processes the children of <directory>, so the <xsl:template match="file"> template is applied to each <file> tag found there.

The <xsl:template match="file"> block is the code that is actually responsible for selecting the files to be indexed. In this example, only files that end in .pdf are passed on for indexing, and are assigned the format PDF. No Pre-Filter or Document Formatter stylesheets are defined, and so the textIndexer will not pre-filter or pre-cache display information for PDF files.
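To make the behavior concrete, consider a hypothetical input fragment. The directory path and file names here are invented for illustration:

```xml
<directory dirPath="/data/library/reports">
    <file fileName="annual-report.pdf"/>
    <file fileName="notes.txt"/>
    <file fileName="summary.PDF"/>
</directory>
```

Given that input, the simple stylesheet above should produce output equivalent to the following, apart from whitespace:

```xml
<indexFiles>
    <indexFile fileName="annual-report.pdf" type="PDF"/>
</indexFiles>
```

Only annual-report.pdf is selected. The notes.txt entry has no matching <xsl:when> clause, and summary.PDF is skipped because ends-with() compares strings case-sensitively by default. If mixed-case extensions occur in your library, one option is to test ends-with(lower-case(@fileName), '.pdf') instead.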

Selecting other file types for indexing is as simple as adding more <xsl:when> clauses to the <xsl:choose> block, like this:
<xsl:template match="file">
   <xsl:choose>
      <!-- XML files -->
      <xsl:when test="ends-with(@fileName, '.xml')">
         <!-- More detailed work here, to determine if it's TEI, EAD, NLM, etc. -->
      </xsl:when>
 
      <!-- PDF files -->
      <xsl:when test="ends-with(@fileName, '.pdf')">
         <indexFile fileName="{@fileName}"
            type="PDF"
            preFilter="style/textIndexer/default/defaultPreFilter.xsl"/>
      </xsl:when>
 
      <!-- Plain text files -->
      <xsl:when test="ends-with(@fileName, '.txt')">
         <indexFile fileName="{@fileName}"
            type="text"
            preFilter="style/textIndexer/default/defaultPreFilter.xsl"/>
      </xsl:when>
   </xsl:choose>
</xsl:template>
This revised <xsl:choose> block looks for XML, PDF, and Text files. Note that the <indexFile> tags for PDF and Text files also define a Pre-Filter stylesheet.

While this simple Document Selector example works, its file selection rules are limited to checking for certain file extensions. The full power of XSLT can be used to construct more complicated selection criteria for files, including ignoring various directories, pulling in meta-data from files or URLs, and so on.
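As one possible sketch of such criteria, the stylesheet below ignores an entire directory and filters files by more than their extension. The directory name "drafts" and the rule about leading dots are assumptions made up for this example, and the type attribute simply mirrors the examples above. It also assumes that an empty <indexFiles/> container is an acceptable way to select no files from a directory.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

   <!-- Hypothetical rule: if any component of the directory path is
        "drafts", select nothing from that directory. Splitting on both
        "/" and "\" covers Unix and Windows style paths. -->
   <xsl:template match="directory[tokenize(@dirPath, '[/\\]') = 'drafts']">
      <indexFiles/>
   </xsl:template>

   <!-- All other directories: consider each file entry in turn. -->
   <xsl:template match="directory">
      <indexFiles>
         <xsl:apply-templates select="file"/>
      </indexFiles>
   </xsl:template>

   <!-- Select PDF files regardless of the case of the extension, but
        skip names that begin with a dot. -->
   <xsl:template match="file">
      <xsl:if test="ends-with(lower-case(@fileName), '.pdf')
                    and not(starts-with(@fileName, '.'))">
         <indexFile fileName="{@fileName}" type="PDF"/>
      </xsl:if>
   </xsl:template>

</xsl:stylesheet>
```

The first template takes precedence over the plain match="directory" template because a pattern with a predicate has a higher default priority in XSLT. Since dirPath is an absolute path, the same rule also excludes any subdirectories nested below a "drafts" directory.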

Now you're equipped to understand the default Document Selector provided by XTF. You can check out the default docSelector.xsl at SourceForge, or you can edit it in your own directory: style/textIndexer/docSelector.xsl.

Next, we'll learn how to program the Pre-Filter stylesheet.