Fundamental Concepts

What does XTF do?

Built on Java and XSLT 2.0, XTF is a flexible indexing, display and query tool that supports searching across collections of heterogeneous data and presents results in a highly configurable manner. XTF is an open-source project of the Publishing Group of the California Digital Library, and is deployed in academic settings worldwide.

The basic organization of XTF can be illustrated as follows:

In this diagram, the flow of information is left to right. Document retrieval begins with a Web-based user search query, i.e., a user enters terms into a text box or clicks on a browse link. The crossQuery servlet checks the query against an index of available documents, and produces a list of matching documents for display in a web browser. Selecting a document from the search results page invokes the dynaXML servlet, which retrieves and formats the actual document for display in a web browser. The textIndexer tool shown at the bottom is used to update the document search index whenever documents in the library are added, removed, or updated.

This guide describes the steps that must be taken to deploy a working XTF system. These steps include installing and configuring the crossQuery and dynaXML servlets, the textIndexer tool, and the run-time environment that they depend on.

What is a “run-time environment”?

The “run-time environment” is the set of components that a given software application needs in order to run.

The necessary components for the XTF run-time environment for Windows and Mac systems are included in the xtfWorkshop.zip package. Those working on Unix systems, or installing just the current release of XTF without the tutorial pieces, will have to ensure that the run-time environment is set up properly.

What are the necessary components for the XTF run-time environment?

Sun Microsystems Java 2 Platform Standard Edition (J2SE)

XTF is a Java application, so it requires that the Java platform be available.

A Java servlet container of your choice

A servlet container runs Java servlets (see “What are servlets?” below) and serves up the pages in your website. The XTF Tutorial/Workshop package includes the open-source Apache Tomcat servlet container.

Why/When do I run Tomcat?

Tomcat is what brings your XTF instance to life. In order to see any of the dynamic pages generated by XTF or any HTML pages associated with an XTF-based web application, Tomcat must be running. When Tomcat is off, the XTF-based application is also turned off.

You must also restart Tomcat whenever the documents in your application require re-indexing. If Tomcat is running, shut it down first, run the re-index, and then start Tomcat again before attempting another search.

Tomcat startup:
In your xtfWorkshop folder:
1. double-click: runTomcat.bat
Tomcat shutdown:
In the Command Prompt where Tomcat was started up:
1. type Ctrl-C (while holding down the Ctrl key, hit the C key)
2. reply to message “Terminate batch job (Y/N)?” (enter: y )
URL for XTF search:
http://localhost:8080/xtf/search

What are servlets?

A servlet is a Java program that runs on the server side of a web application, inside a servlet container.

What servlets are part of XTF?

XTF has two primary servlets, one that handles incoming web queries and one that handles document display:

Web Queries: crossQuery servlet
Checks the query against an index of available documents, and produces a list of matching documents for display in a web browser.

Document Display: dynaXML servlet
Retrieves and formats the actual document for display in a web browser after a search.

Are there any other critical components for XTF?

textIndexer

The textIndexer is a command-line utility that creates an index of the documents associated with an application, and that updates this document search index whenever documents in the library are added, removed, or updated.

Why/When do I run textIndexer?

Create an index

textIndexer is run when generating an index for the first time.

Incremental index

An incremental index is one that only processes files that have changed since the last indexing run. Most changes require only this type of indexing.
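
The idea of "only files that have changed" can be sketched as follows. This is an illustration only, not how textIndexer itself detects changes: the function name, the last_indexed mapping, and the use of file modification times are all assumptions made for the example.

```python
from pathlib import Path


def files_needing_reindex(data_dir, last_indexed):
    """Return relative paths of files that are new or modified.

    last_indexed is a hypothetical mapping from a relative path to the
    modification time (in seconds) recorded when the file was last indexed.
    """
    changed = []
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        key = str(path.relative_to(data_dir))
        mtime = path.stat().st_mtime
        # A file is reprocessed if it was never indexed, or if it has been
        # modified since the recorded time.
        if key not in last_indexed or mtime > last_indexed[key]:
            changed.append(key)
    return changed
```

A clean index, by contrast, would ignore the recorded times and process every file.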

Text indexing in incremental mode:

In your xtfWorkshop folder:
1) double-click: cmdPrompt.bat
In the Command Prompt:
2) enter: textIndexer -index default

Clean index

A clean index removes the old index and starts over. You generally only want to do this if you have made stylesheet changes that affect the indexing of all content.

Text indexing in clean mode:

In your xtfWorkshop folder:
1) double-click: cmdPrompt.bat
In the Command Prompt:
2) enter: textIndexer -clean -index default

Re-index

“Re-index” simply means indexing your library again. The instructions for a given task should tell you whether an incremental or a clean re-index is needed.
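
The two command lines shown above differ only in the -clean flag. The short sketch below assembles either one; the helper name text_indexer_command is hypothetical, and only the flags that appear in this guide (-clean and -index) are used.

```python
def text_indexer_command(clean=False, index_name="default"):
    """Build the textIndexer argument list for an incremental or clean run."""
    args = ["textIndexer"]
    if clean:
        # A clean run removes the old index and starts over.
        args.append("-clean")
    args += ["-index", index_name]
    return args


print(" ".join(text_indexer_command()))            # textIndexer -index default
print(" ".join(text_indexer_command(clean=True)))  # textIndexer -clean -index default
```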

Other Essential Terms

Dublin Core

A widely used, simple metadata format consisting of 15 repeatable elements (Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights). XTF uses it as the core set of index fields because so many projects have already gone to the effort of creating Dublin Core (DC) records, or use a standard that is easily translated to it.
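
As a rough illustration of "15 repeatable elements", a record can be modeled as a mapping from element name to a list of values. The sample record and the helper unknown_elements are made up for this sketch and are not part of XTF.

```python
# The 15 Dublin Core elements listed above, in lowercase.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def unknown_elements(record):
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(key for key in record if key.lower() not in DC_ELEMENTS)


# Each element is repeatable, so every value is a list (hypothetical data).
record = {
    "title": ["An Example Title"],
    "creator": ["First Author", "Second Author"],
    "isbn": ["not-a-dc-element"],
}
print(unknown_elements(record))  # ['isbn']
```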

EAD

Used predominantly for finding aids of archival materials, EAD (Encoded Archival Description) is a “…data structure and data interchange standard that applies to inventories and registers. EAD is compatible with XML. EAD-formatted inventories can be opened and viewed by web page browsers” (www.mnhs.org/preserve/records/recordsguidelines/guidelinesglossary.html).

NLM

NLM is a widely accepted standard for digital journal content, both metadata and full-text. Though developed in the medical field, it is used more broadly. “The National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM) created the Journal Archiving and Interchange Tag Suite with the intent of providing a common format in which publishers and archives can exchange journal content. The Suite provides a set of XML schema modules that define elements and attributes for describing the textual and graphical content of journal articles as well as some non-article material such as letters, editorials, and book and product reviews” (http://dtd.nlm.nih.gov/).

TEI

TEI is a standard for translating print display into machine-actionable markup for online presentation. “The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics” (http://www.tei-c.org/index.xml).

Stylesheets

In the XTF context, the term “stylesheet” is shorthand for the more formal term “Extensible Stylesheet Language Transformations,” also referred to by the acronym XSLT. XSLT is a language used to transform XML documents, and in a web application, such as one built with XTF, it can be used to deliver dynamic web pages. The stylesheets in XTF are where all of the customizations are made to turn a generic XTF instance into an application with a distinct look and feel appropriate to the content it is delivering.

XTF includes stylesheets to handle a variety of tasks, from query parsing (taking apart the pieces of a web query) to result formatting (determining how search results will be presented). The exercises in the tutorial will walk you through editing various stylesheets in order to achieve different effects and functionality.

XTF stylesheets are usually organized in sets, each with a common version, a default version, and one or more content-specific versions. The common stylesheet contains templates and functions that are used by all the others. The default stylesheet is the fallback if the content type can’t be determined or isn’t indicated. XTF ships with a set of default stylesheets, only some of which you will be editing. The relationship between them will be explained within the context of each exercise.
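
The fallback rule described above can be sketched in a few lines. This is only a model of the selection logic, not XTF code: the function name and the stylesheet file names are hypothetical.

```python
def choose_stylesheet(content_type, specific, default="default.xsl"):
    """Pick the content-specific stylesheet, or fall back to the default.

    specific maps a content type to a stylesheet name. In XTF, each of
    these stylesheets would in turn draw on the shared common stylesheet.
    """
    if content_type and content_type in specific:
        return specific[content_type]
    # Content type missing or not recognized: use the default stylesheet.
    return default


# Hypothetical content-specific stylesheets.
specific = {"tei": "teiExample.xsl", "ead": "eadExample.xsl"}
print(choose_stylesheet("tei", specific))  # teiExample.xsl
print(choose_stylesheet(None, specific))   # default.xsl
```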

Indexing

Indexing is the process of breaking down one or more documents into a list of words and their locations, either within a given document or across the whole collection of documents.
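
A minimal sketch of this idea is shown below. It is a toy index for illustration and makes no claim about how textIndexer stores its data; the function name build_index and the sample documents are assumptions.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to the documents and word positions where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        # Lowercase the text and split it into words, keeping each position.
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][doc_id].append(position)
    return index


docs = {"doc1": "the cat sat", "doc2": "the dog sat on the mat"}
index = build_index(docs)
print(dict(index["the"]))  # {'doc1': [0], 'doc2': [0, 4]}
```

Looking up a word in such an index gives the matching documents directly, which is why a search does not need to reread every document.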