
Document Request Parser Programming

As discussed above, the Document Request Parser is responsible for interpreting a URL-based query. XTF allows great flexibility in how URLs are constructed and interpreted, and this stylesheet is the key to that flexibility.
The main tasks of the Document Request Parser are:
  1. Use the URL parameters to determine which document to display, and figure out exactly where to find it. Typically it will come up with a full path to an XML file in the filesystem, but it can also come from an external HTTP source.
  2. Decide which Document Formatter stylesheet to use. Some systems will have only one Document Formatter, but most will have different formatters for different types of documents, or for different display modes.
  3. Specify what authentication is required to view the document, if any. This can be IP-based filtering, looking up a username/password in an LDAP database, or using an external login web page.
  4. If a text query is specified, the Document Request Parser is responsible for structuring that query in the same fashion as crossQuery's Query Request Parser. In fact, it often makes sense for one parser to call the other, or for them to share a common stylesheet that handles parsing duties.
Suppose a user, through crossQuery, searches for the word "apartheid". The top document hit will be the book The Opening of the Apartheid Mind (assuming you've indexed the sample data available with the XTF distribution). If the user clicks on that title, dynaXML is invoked with a URL like this:
http://yourserver:8080/xtf/view?docId=mm/ft958009mm/ft958009mm.xml;query=apartheid
Notice that the first part of the URL (everything before the ? symbol) invokes the dynaXML servlet, and the second part of the URL (everything after the ? symbol) defines which document to display and a text query to run on it.
  docId=mm/ft958009mm/ft958009mm.xml    The identifier of the document to view.
  query=apartheid                       The word to search for in the document.
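The split described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the servlet's actual code: the function name split_request is hypothetical, and accepting "&" as well as ";" as a separator is an assumption made here for convenience (the sample URL uses ";").

```python
import re
from urllib.parse import urlsplit, unquote_plus


def split_request(url):
    """Return (servlet_path, [(name, value), ...]) for a dynaXML-style URL."""
    parts = urlsplit(url)
    params = []
    # The sample URL separates parameters with ';'. Accepting '&' as well
    # is an assumption of this sketch.
    for pair in re.split(r"[;&]", parts.query):
        if not pair:
            continue  # skip empty pieces such as a trailing separator
        name, _, value = pair.partition("=")
        params.append((unquote_plus(name), unquote_plus(value)))
    return parts.path, params


path, params = split_request(
    "http://yourserver:8080/xtf/view"
    "?docId=mm/ft958009mm/ft958009mm.xml;query=apartheid"
)
print(path)    # /xtf/view
print(params)  # [('docId', 'mm/ft958009mm/ft958009mm.xml'), ('query', 'apartheid')]
```

Parameter order is preserved because the result is a list of pairs rather than a dictionary.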
The dynaXML servlet transforms the URL parameters into an XML document suitable for processing by the Document Request Parser (which is, of course, written in XSLT). The parser included with the XTF distribution is called docReqParser.xsl, and we'll discuss what it does below.

The input document will always have the form:
<parameters>
  <param name="ParamName" value="ParamValue">
      Token | Phrase
      Token | Phrase
           …
  </param>
</parameters>
where Token specifies a single word, and has the form:
<token value="Word" isWord="YesOrNo"/>
where
value="Word" is the actual word or symbol extracted from the URL.
isWord="YesOrNo" identifies whether the token is a word or punctuation symbol.
and Phrase specifies an entire phrase extracted as a single string, with the form:
<phrase value="StringOfWords">
    Token
      …
    Token
</phrase>
where
value="StringOfWords" is the entire phrase extracted from the URL as a single string.
Token...Token is the original phrase broken down into individual token tags for each word or symbol in the phrase.
For the sample URL above, the XML passed to the Document Request Parser looks like this:
<parameters>
 
  <param name="docId" value="mm/ft958009mm/ft958009mm.xml">
     <token value="mm/ft958009mm/ft958009mm.xml" isWord="yes"/>
  </param>
 
  <param name="query" value="apartheid">
     <token value="apartheid" isWord="yes"/>
  </param>
 
</parameters>
Our example doesn't have a phrase in it or a multi-word parameter, but if you're curious how those would look, see the example in the section on Query Parser Programming above.
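To make the token and phrase structure concrete, here is a rough Python model of how a list of URL parameters could be turned into a <parameters> block. It is a sketch under stated assumptions: the function names are hypothetical, and the tokenization rules (whitespace-separated tokens, double-quoted strings treated as phrases, anything containing a letter or digit counted as a word) are simplifications chosen for illustration, not the servlet's documented behavior.

```python
import re
import xml.etree.ElementTree as ET


def add_tokens(parent, text):
    """Append one <token> per whitespace-separated piece of text."""
    for piece in text.split():
        # Assumption: a piece counts as a word if it has any letter or digit.
        is_word = "yes" if any(ch.isalnum() for ch in piece) else "no"
        ET.SubElement(parent, "token", {"value": piece, "isWord": is_word})


def build_parameters(params):
    """Build a <parameters> element from (name, value) pairs."""
    root = ET.Element("parameters")
    for name, value in params:
        param = ET.SubElement(root, "param", {"name": name, "value": value})
        # Each piece is either a double-quoted phrase or a run of non-spaces.
        for match in re.finditer(r'"([^"]*)"|(\S+)', value):
            quoted, bare = match.group(1), match.group(2)
            if quoted is not None:
                phrase = ET.SubElement(param, "phrase", {"value": quoted})
                add_tokens(phrase, quoted)
            else:
                add_tokens(param, bare)
    return root


root = build_parameters([
    ("docId", "mm/ft958009mm/ft958009mm.xml"),
    ("query", '"south africa" apartheid'),
])
print(ET.tostring(root, encoding="unicode"))
```

With these assumptions, the docId parameter yields a single token, as in the sample above, while the quoted query text yields a <phrase> element containing two tokens, followed by a standalone token.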

The sample docReqParser.xsl creates the following output based on the input <parameters> block.
<style path="style/dynaXML/docFormatter/default/docFormatter.xsl"/>
<source path="data/mm/ft958009mm/ft958009mm.xml"/>
<index configPath="conf/textIndexer.conf" name="default"/>
<query indexPath="index" termLimit="1000" workLimit="500000">
   <and field="text" maxSnippets="all" maxContext="80">
      <term>apartheid</term>
   </and>
</query>
<auth access="allow" type="all"/>
Let's analyze this in small pieces.
<style path="style/dynaXML/docFormatter/default/docFormatter.xsl"/>
The <style> tag directs dynaXML to the Document Formatter stylesheet to use for this request. The path is relative to the XTF base directory.
<source path="data/mm/ft958009mm/ft958009mm.xml"/>
Next, the <source> tag identifies the location of the source document that should be displayed. Again the path is relative to the XTF base directory. The sample data is laid out in subdirectories based on deconstructing the document ID, but one could write a parser that used some other strategy for locating documents.
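For the sample data, the mapping from document ID to source path appears to be a simple prefix: the docId from the URL is appended to the data directory. The Python sketch below models that step along with the choice of formatter. The real parser does this in XSLT; the function name and the two constants here are hypothetical, with their values copied from the sample output above.

```python
import posixpath
import xml.etree.ElementTree as ET

# Values taken from the sample output; both paths are relative to the
# XTF base directory.
DATA_DIR = "data"
DEFAULT_FORMATTER = "style/dynaXML/docFormatter/default/docFormatter.xsl"


def style_and_source(doc_id, formatter=DEFAULT_FORMATTER):
    """Return the <style> and <source> elements for a document ID."""
    style = ET.Element("style", {"path": formatter})
    # Assumption: the document lives at DATA_DIR/<docId>, as in the sample.
    source = ET.Element("source", {"path": posixpath.join(DATA_DIR, doc_id)})
    return style, source


style, source = style_and_source("mm/ft958009mm/ft958009mm.xml")
print(style.get("path"))   # style/dynaXML/docFormatter/default/docFormatter.xsl
print(source.get("path"))  # data/mm/ft958009mm/ft958009mm.xml
```

A parser that used a different layout, or that picked a formatter based on document type or display mode, would only need to change the body of this one function.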
<index configPath="conf/textIndexer.conf" name="default"/>
For speed, dynaXML includes a facility called "Lazy Trees" which creates a binary representation of the input document on disk. The binary version is much faster to process, especially if the input document is large but the parts of it needed by the Document Formatter are small. In any case, dynaXML needs to know where to find the Lazy Trees created by the textIndexer, or where to create them if not found. The <index> tag tells it where to find the index configuration file, and the name of the index subset. If you're interested in learning more about Lazy Trees, see XTF Under the Hood.
<query indexPath="index" termLimit="1000" workLimit="500000">
   <and field="text" maxSnippets="all" maxContext="80">
      <term>apartheid</term>
   </and>
</query>
Next comes a query to run against the full text of the document. Of course the query is optional, but if included, its format is exactly the same as the output of crossQuery's Query Parser. In fact, the default docReqParser.xsl simply uses <xsl:import> to incorporate queryParser.xsl, and uses its templates to do the work of parsing and formatting the text query.

One thing worth noting here is maxSnippets="all". In this case, all is a special value that tells the Text Engine to gather all of the snippets/hits for the given document. If you only wanted the ten best-scoring hits, you could specify maxSnippets="10" instead.
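As an illustration of how the snippet limit fits into the query block, the following Python sketch assembles a <query> element like the one shown above. The function name build_query is hypothetical, and the attribute values other than maxSnippets are simply copied from the sample output rather than derived from any configuration.

```python
import xml.etree.ElementTree as ET


def build_query(terms, max_snippets="all"):
    """Build a <query> element that ANDs the given terms over the text field."""
    query = ET.Element(
        "query",
        {"indexPath": "index", "termLimit": "1000", "workLimit": "500000"},
    )
    group = ET.SubElement(
        query,
        "and",
        # "all" gathers every hit; a number keeps only the best-scoring ones.
        {"field": "text", "maxSnippets": str(max_snippets), "maxContext": "80"},
    )
    for term in terms:
        ET.SubElement(group, "term").text = term
    return query


every_hit = build_query(["apartheid"])
top_ten = build_query(["apartheid"], max_snippets=10)
print(ET.tostring(top_ten, encoding="unicode"))
```

Only the maxSnippets attribute differs between the two results; the term list and the other limits are identical.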
<auth access="allow" type="all"/>
The final tag produced by the Document Request Parser is the <auth> tag, which specifies the authentication to perform. The simplest form is shown above; it allows access to all users. Other authentication mechanisms are available; for more information, please consult the section on User Authentication in the XTF Deployment Guide. Multiple <auth> tags will be processed in order until one succeeds or fails.

Customizing the Document Request Parser is beyond the scope of this document, but a good place to start is by incrementally modifying the sample docReqParser.xsl included with the XTF distribution.