
Lazy Trees

Table of Contents

Lazy Trees
The Problem of Large Documents
What is a "Lazy Tree"?
Search Results in Context
Stylesheet Considerations

To accelerate processing of large documents and to enable highlighting of search results within a document, XTF creates and uses lazy trees. This section describes what lazy trees are, how they speed up processing, and how stylesheets should be optimized to take advantage of them.

The Problem of Large Documents

The user usually browses a document page by page: for instance, reading a section, then jumping to an appendix, then to another section, and so on. Without lazy trees, each time the servlet produces one of these “pages”, it must read and process the entire source document, even though only a small portion contributes to the output. This is the key observation underlying lazy trees.

When a document is processed through a set of stylesheets, the XSLT processor spends significant time reading and parsing the XML document, building indexes of parent-child relationships, creating identifier cross-references, and so on. Most of this work is wasted in generating the page for a single section, so what if we could eliminate the data for all the other sections?

Let’s take an example. Consider the source document below (simplified for discussion). If the user wants to see Chapter 1, only the indicated elements are actually needed.
<front>
  <titlePage>The opening of the Apartheid Mind</titlePage>
  blah… blah… blah…
</front>
<div1 id="ack">
  <head>Acknowledgements</head>
  <p>I would like to thank…</p>
  <!-- blah, blah, blah -->
  <!-- blah, blah, blah -->
</div1>
 
<?NOTE: The "ch1" div1 element is needed for this request... ?>
<div1 id="ch1">
  <head>Chapter 1</head>
  <p>It is now conventional wisdom that…</p>
  <p>See Hanlon (1981) for an overview… <ref target="bib12"/></p>
  <!-- blah, blah, blah -->
  <!-- blah, blah, blah -->
</div1>
 
<div1 id="ch2">
  <head>Chapter 2</head>
  <p>One of the more striking aspects of contemporary…</p>
  <!-- blah, blah, blah -->
  <!-- blah, blah, blah -->
</div1>
<div1 id="bib">
  <bibl id="bib1">  <author>Adams, Heribert</author> </bibl>
  <!-- blah, blah, blah -->
  <!-- blah, blah, blah -->
 
  <?NOTE: This "bib12" element is also needed... ?>
  <bibl id="bib12"> <author>Hanlon, Joseph</author> </bibl>
 
  <!-- blah, blah, blah -->
  <!-- blah, blah, blah -->
</div1>
Now that might not look like much of a savings, until you remember that each "blah, blah, blah" stands for a great deal of data.

The <div1 id="ch1"> element contains the actual data for Chapter 1. But the <bibl id="bib12"> element must also be included. Why? Notice that Chapter 1 contains a bibliographic reference (<ref target="bib12"/>); in order to properly generate a hyperlink to it, the referenced <bibl> element must be loaded as well.

In general, the only parts of the document needed are those the stylesheet actually references while generating a given page. This leads to the central idea of lazy trees: only load those parts of the document that are needed.

What is a "Lazy Tree"?

Unfortunately it's not easy to randomly load small pieces of a large XML document. In general, one can't tell where the end tag for an element is without scanning all the text until it's found. So XTF creates a binary version of each XML document, called a "lazy tree". The tree is stored in a file containing all the original contents of the document, plus an index telling XTF where each element starts and ends.

Processing begins by loading only the root element of the document. If the stylesheet references the children of that element (in an XPath expression for instance), then XTF knows right where to find them in the lazy tree's file. It loads only those children, but not their descendants. Processing then proceeds again until the stylesheet needs another node that hasn't been loaded yet. In this way, a typical page generation ends up loading only a small portion of the document. This strategy makes it possible to quickly generate pages for any size document (tested up to at least 60 gigabytes).

Lazy tree files are stored under the index directory, in a folder hierarchy parallel to that of the original data directories. They are typically a bit smaller than the source document, since the text within them is compressed.

Search Results in Context

The other main benefit of lazy trees is that they enable XTF to show search results in context. At first the connection might not seem obvious. Remember that XTF reports, for each element, the total number of hits found in that element and all of its descendants. Without a lazy tree, the system would have to read in the entire XML document to determine the parent-child relationships. With a lazy tree, each search hit can be directly attributed to the correct XML elements by simply looking them up in the cross-index stored in the lazy tree's file.
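
To make the per-element hit totals concrete, the sketch below shows how the sample document might look to a stylesheet after a search that matched twice in Chapter 1 and once in Chapter 2. The <book> root element, the xtf:hitCount attribute name, and the namespace URI are assumptions made for illustration; check the XTF Programming Guide for the markup your version actually produces.

```xml
<!-- Hypothetical annotated view; the root element, attribute name,
     and namespace URI are assumed for illustration only. -->
<book xmlns:xtf="http://cdlib.org/xtf" xtf:hitCount="3">
  <div1 id="ch1" xtf:hitCount="2">
    <head>Chapter 1</head>
    <!-- Both hits fall inside this paragraph -->
    <p xtf:hitCount="2">It is now conventional wisdom that…</p>
  </div1>
  <div1 id="ch2" xtf:hitCount="1">
    <head>Chapter 2</head>
    <p xtf:hitCount="1">One of the more striking aspects of contemporary…</p>
  </div1>
</book>
```

Each element's count is the sum of the counts of its descendants (2 + 1 = 3 at the root), which is the roll-up that the cross-index in the lazy tree file makes possible without reading the whole document.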

For this reason and for speed of processing, dynaXML always uses the lazy tree if one has been created. If it doesn't exist yet, dynaXML will attempt to create it at run-time. This introduces a pause, which can be avoided by having the textIndexer create lazy trees at index time (in fact, this is the default behavior). One caution: if pre-building of lazy trees by the textIndexer is disabled, be sure that docSelector.xsl and docReqParser.xsl both specify the same pre-filter. Otherwise, dynaXML will build a lazy tree that doesn't match the index, and seemingly random errors will occur when highlighting search hits in the document.

Stylesheet Considerations

Though the process of loading elements and other pieces of an XML file is generally invisible to the stylesheet programmer, there are certain best practices to obtain maximum processing speed.

First, avoid using XSL constructs that scan the entire input document. Generally, any XPath expression beginning with "//" will scan every element of the document, defeating any gains from using a lazy tree. Likewise, using "descendant::" or "descendant-or-self::" in an XPath expression will generally cause all of the descendants to be loaded, again counteracting the benefit of lazy trees.
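
As a rough illustration of the difference, the sketch below selects the heading of one section in two ways. It assumes, as in the simplified sample document, that the div1 elements are direct children of the root element; the sectionId parameter and the "slow" mode name are hypothetical.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: id of the section to display -->
  <xsl:param name="sectionId" select="'ch1'"/>

  <!-- Costly: the leading // visits every element in the document,
       so the entire lazy tree ends up being loaded. -->
  <xsl:template match="/" mode="slow">
    <xsl:copy-of select="//div1[@id = $sectionId]/head"/>
  </xsl:template>

  <!-- Narrower: only child steps are used, so only the children of
       the root and of the matching div1 need to be examined. -->
  <xsl:template match="/">
    <xsl:copy-of select="/*/div1[@id = $sectionId]/head"/>
  </xsl:template>

</xsl:stylesheet>
```

Both templates produce the same output for the sample document; the second simply gives the processor no reason to descend into sections it does not need.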

Instead, try to replace these constructs with XSL keys, declared with the <xsl:key> element at the top level of the stylesheet. You might ask, "Doesn't xsl:key need to scan every element of the document to build the key?" The answer is yes, but the result is stored in the lazy document, so that subsequent page views using the same key don't need to scan the document again. In other words, only the first page view causes XSL keys to be built.
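
A minimal sketch of this approach, using the bibliographic reference from the sample document: the key name "bibl-by-id" and the shape of the generated link are illustrative choices, not part of XTF.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Top-level key (hypothetical name): indexes every bibl element
       by the value of its id attribute. -->
  <xsl:key name="bibl-by-id" match="bibl" use="@id"/>

  <xsl:template match="ref">
    <!-- key() looks the target up in the index instead of scanning
         the document with //bibl[@id = current()/@target] -->
    <xsl:variable name="target" select="key('bibl-by-id', @target)"/>
    <a href="#{@target}">
      <xsl:value-of select="$target/author"/>
    </a>
  </xsl:template>

</xsl:stylesheet>
```

For <ref target="bib12"/> in Chapter 1, the lookup returns the <bibl id="bib12"> element, so only that entry of the bibliography has to be loaded to build the link.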

Even the overhead of building XSL keys at runtime can be avoided by having the textIndexer do that work. This is accomplished simply by specifying the displayStyle attribute in the docSelector.xsl stylesheet used by the indexer. If specified, the stylesheet referenced by displayStyle will be scanned for <xsl:key> declarations, and all of the keys will be pre-built at index time. See the Document Selector Programming section in the XTF Programming Guide.
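
The fragment below sketches where such an attribute might appear. Only the displayStyle attribute itself comes from the description above; the indexFile element name, the fileName attribute, and the stylesheet path are assumptions to be checked against the Document Selector Programming section.

```xml
<!-- Illustrative fragment of a docSelector.xsl template body.
     Element name, fileName attribute, and path are assumed. -->
<indexFile fileName="{@fileName}"
           displayStyle="style/dynaXML/docFormatter/default/docFormatter.xsl"/>
```

The stylesheet named here should be the same one dynaXML uses to display the document, so that the keys built at index time are the keys requested at display time.
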

One final note: a good way to locate parts of the stylesheet that need to be optimized is to use dynaXML's stylesheet profiling configuration option. When it is enabled, dynaXML prints a summary of how many XML document nodes were accessed by each line of the stylesheet. Find a particular request that runs unacceptably long, examine the profile, and take aim at the lines that access the most nodes.