Miscellaneous Tips and Tricks
The following tips and tricks don't fit into the other major categories of this tips guide, so they have been collected here.
Working with Large Collections
A frequently asked question about XTF is whether it can scale to handle very large collections. The answer is yes, as was shown by the Melvyl Recommender Project, a research project undertaken by CDL and funded by a generous grant from the Andrew W. Mellon Foundation. In this project, XTF was shown to work with about 10 million metadata records plus over 18,000 full-text objects. The total size of the textual part of the input objects was almost 14 gigabytes.
There are some hurdles to overcome when indexing and searching such large collections, which are covered briefly below.
- Give the indexer plenty of RAM. For the large index above, we found that the default maximum RAM limit of 1 gigabyte was insufficient, and we raised it to 4 gigabytes (although 2 would probably have been sufficient). Make sure that the machine you use has at least double the RAM allocated to the indexer. To handle this amount of RAM, the machine also needs a 64-bit processor, operating system, and Java runtime.
To change the amount of RAM allocated to the textIndexer, edit the bin/textIndexer script, changing the parameter on the last line from -Xmx1000m to something like -Xmx4000m.
- Likewise, you'll need plenty of disk space. The final index may end up twice the size of your data set, and during the indexing process the textIndexer needs plenty of temporary working space. Plan to have about six times the size of your data set in free hard drive space before starting the indexer.
- It took almost two days of CPU time to index the large collection above. We have successfully experimented with indexing subsets of the collection on separate CPUs and then merging the resulting indexes into a final index. If you are interested in trying this, check out the bin/indexMerge script.
- When running crossQuery on the resulting index, make sure that the servlet container also has access to plenty of RAM. You can do this by editing the Tomcat or Resin configuration files or startup script (the details vary by servlet container).
- Be prepared for long initial wait times while crossQuery loads tables from the index. This happens when the query specifies any record sorting other than relevance, and also when faceted browsing is used. Generally, when we restart the servlet container we run through a "warm up" routine that accesses each of these types of pages. Once loaded, the tables are cached in memory and subsequent access is very fast.
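The "warm up" routine in the last item can be scripted. The following is a minimal sketch in Python, not the routine CDL uses. The base URL and the query parameters are hypothetical placeholders; replace them with URLs that exercise the sorted and faceted pages of your own crossQuery installation.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical values: adjust the host, path, and parameters so that each
# entry requests one sorted or faceted page of your own installation.
BASE_URL = "http://localhost:8080/xtf/search"
WARM_UP_QUERIES = [
    {"sort": "title"},
    {"sort": "year"},
    {"browse-all": "yes"},
]


def build_warm_up_urls(base_url, queries):
    """Return one URL per query dictionary."""
    return [f"{base_url}?{urlencode(query)}" for query in queries]


def warm_up(urls, timeout=600, fetch=urlopen):
    """Request each URL once; map it to its HTTP status or to the error raised."""
    results = {}
    for url in urls:
        try:
            # The generous timeout allows for the slow first load of the tables.
            with fetch(url, timeout=timeout) as response:
                response.read()
                results[url] = response.status
        except OSError as exc:  # urllib's URLError and HTTPError derive from OSError
            results[url] = exc
    return results
```

Calling warm_up(build_warm_up_urls(BASE_URL, WARM_UP_QUERIES)) after each servlet container restart would touch every listed page once, so that the first real user does not pay the loading cost.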
Any additional observations you have with large collections, or additional tips or tricks, would be welcome additions to this section.
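As a planning aid, the RAM and disk rules of thumb given in the list above can be turned into a quick calculation. This is only a sketch: the function name and dictionary keys are made up for illustration, and the multipliers are the rough guidelines from this section rather than hard limits.

```python
def plan_resources(data_set_gb, indexer_heap_gb):
    """Apply this section's rules of thumb to a data set size and indexer heap size."""
    return {
        # Value to put in place of -Xmx1000m in bin/textIndexer
        "java_flag": f"-Xmx{int(indexer_heap_gb * 1000)}m",
        # The machine should have at least double the indexer's allocation
        "min_machine_ram_gb": 2 * indexer_heap_gb,
        # The final index may end up twice the size of the data set
        "possible_index_gb": 2 * data_set_gb,
        # Free disk space to have on hand before starting the indexer
        "free_disk_gb": 6 * data_set_gb,
    }


# The large collection described above: almost 14 GB of text, 4 GB indexer heap.
plan = plan_resources(data_set_gb=14, indexer_heap_gb=4)
for name, value in plan.items():
    print(f"{name}: {value}")
```

For the collection described above, this suggests a machine with at least 8 gigabytes of RAM and about 84 gigabytes of free disk space.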
AJAX-style XTF Programming
A new generation of interfaces is taking the web by storm, and the technology behind them is AJAX (Asynchronous JavaScript and XML). There's a good reason that AJAX is so popular: it makes user interfaces much more responsive and therefore more useful. One might ask, can XTF be used in conjunction with AJAX?
Of course it can. In the default stylesheets that come with XTF, we provide one basic example: fetching similar documents. When the user clicks on this link in a set of crossQuery search results, a small piece of JavaScript asynchronously sends a new query to the servlet, fetching documents similar to the one of interest to the user. When the list comes back, it is inserted directly into the original search results page. If you're interested in the details of this interaction, we encourage you to explore the source code, and if you have questions, post them to the XTF Users discussion list.
One can think of many more ways to combine AJAX and XTF, including book bag functionality, tagging, and improvements to faceted browse. These are beyond the scope of this document, but we hope that more XTF implementors will explore this area.