Document Processing
The bulk of the XTF system is concerned with searching XML documents for text in various ways and displaying the results in several forms. A brute-force search of each word in each document, every time a user makes a query, would be extremely inefficient, so the textIndexer tool is used to compile all of the documents into what is called an "inverted index". In essence, this index is similar to the one at the back of a book: for each word, it points to all the locations where that word appears in all of the documents that have been indexed.
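To make the idea concrete, here is a minimal sketch (in Java, since XTF and Lucene are Java toolkits) of an inverted index that maps each term to the list of word positions where it occurs. It is illustrative only and is not XTF's actual index format; in particular, the naive word splitting here differs from XTF's real tokenizer, described under Tokenizing below.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch only -- not XTF's actual index format.
    // Maps each term to the list of word positions where it occurs.
    public class TinyInvertedIndex {
        private final Map<String, List<Integer>> postings = new HashMap<>();

        public void addDocument(String text) {
            String[] words = text.toLowerCase().split("\\W+");
            for (int pos = 0; pos < words.length; pos++) {
                if (!words[pos].isEmpty())
                    postings.computeIfAbsent(words[pos], k -> new ArrayList<>()).add(pos);
            }
        }

        public List<Integer> positionsOf(String term) {
            return postings.getOrDefault(term.toLowerCase(), List.of());
        }

        public static void main(String[] args) {
            TinyInvertedIndex index = new TinyInvertedIndex();
            index.addDocument("This is the day of man's greatest peril");
            System.out.println(index.positionsOf("day"));   // [3]
            System.out.println(index.positionsOf("man"));   // [5] (the naive split breaks "man's" at the apostrophe)
        }
    }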
The following sections discuss the details of how the textIndexer dissects and digests documents, and cover a few basic concepts (such as term, proximity, and stop word) necessary to understand the entire system.
But first, some useful definitions are needed. The XTF system views each XML document as containing two major types of data: (1) full text, and (2) meta-data.

XTF Definition: Full text, n. All text within an XML document, except actual element definitions and their attributes.

XTF Definition: Meta-data, n. Short data fields within an XML document which describe the entire document, for example: title, author, subject, publication date, publisher, access rights, etc.
This distinction mainly reflects the way the two types of data are used. Typically, meta-data fields are searched "database style". For instance, if one were looking for any book about Mark Twain published after 1912, then one would search the publication date and subject meta-data fields.
By contrast, the full text is typically searched "shotgun style", when one is looking for any use of a word or phrase in any book. For example, one might be interested in every reference to Mark Twain. Of course, the two types of queries can be combined in useful ways, for instance if one were interested in any mention of Mark Twain in books published after 1960.
Text Processing
The textIndexer extracts the text from each XML document, locating each word and storing its position in the inverted index. These words are called "terms".

XTF Definition: term, n. A single indexable word. Most punctuation marks are not considered terms, but may occasionally appear within terms. For example, consider the string "O'Reilly". XTF considers this to be a single term, not two terms as in "O" and "Reilly". See Tokenizing for details of how terms are parsed.
In the following sample XML fragment, the terms are the words of the element content: This, is, the, day, of, man's, greatest, peril, Bekins, and 1986.

<head id="chapter2"> <p>"This is the day of man's greatest peril."</p> <footnote>Bekins, 1986</footnote> </head>
Because many documents may be book-length, this body of text could be extremely large, containing tens or hundreds of thousands of terms. XTF imposes no limits on the length of the text or the number of terms within it.
Of course, the text exists within the context of XML elements; these elements and their attributes are not considered part of the document text and are not indexed, with one exception. A special attribute (xtf:sectionType) can be used to associate a type with a block of text; the section type is recorded with all the terms in that block, and text queries can later be restricted to certain blocks based on their section types.
Meta-data Processing
As mentioned earlier, meta-data consists of small fields describing an XML document as a whole. Examples might be a book's publication date and publisher, author(s), subject keywords, etc.
XTF provides a very simple model of meta-data: each document may have any number of meta-data fields, each with a name and a textual value. A given field name may be repeated; each associated text value will be considered one unit of meta-data. Note that structured meta-data (i.e. sub-fields within fields) is not supported.
Each meta-data field is scanned for terms, and each term and its position are recorded in the inverted index. Typically these fields contain dozens or (rarely) hundreds of terms. Longer blocks, while supported, are discouraged as they are inefficient to process.
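This flat model can be pictured as a simple list of name/value pairs in which a name may repeat. The sketch below is illustrative only; the field names and values are invented for the example.

    import java.util.List;

    // Illustrative sketch of XTF's flat meta-data model: a document carries
    // a list of (name, value) pairs; the same name may appear more than once,
    // and values are plain text with no nested structure.
    public class MetaDataExample {
        record Field(String name, String value) {}

        public static void main(String[] args) {
            List<Field> bookMetaData = List.of(
                new Field("title",   "Roughing It"),
                new Field("creator", "Mark Twain"),
                new Field("subject", "Travel"),          // repeated field name:
                new Field("subject", "American West"),   // each value is one unit of meta-data
                new Field("date",    "1872"));

            bookMetaData.stream()
                .filter(f -> f.name().equals("subject"))
                .forEach(f -> System.out.println("subject = " + f.value()));
        }
    }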
Tokenizing
As mentioned earlier, all text (whether part of a meta-data field or the full text of a document) is broken into terms.

XTF Definition: tokenize, v. To break a string of text into discrete tokens, or "terms".
Because XTF is based on the Lucene search toolkit, it uses an XTF-specific derivative of Lucene's standard tokenizer. This tokenizer makes a fair effort at identifying terms in the source text regardless of language. For the most part, a single term consists of one or more of the following characters:
- Western alphabetic characters such as A, B, C, d, e, f, Ö, Ç,...
- Arabic numerals such as 1, 2, 3, ...
- Non-breaking symbols, such as &, @, and ' (apostrophe)
- The underscore character ( _ )
By contrast, the XTF system considers all traditional Chinese, Japanese, and Korean characters to be logograms (complete words in and of themselves) and treats each such character as a separate term. Some Western symbols are also treated as logograms and thus as separate terms. These symbols include:
- Fraction symbols, such as ¼, ½, ¾, ...
- Monetary symbols, such as $, £, ¥, ...
- Mathematical symbols, such as +, >, =, ...
- Trademark and copyright symbols, such as ©, ®, and ™
Three other Western symbols, the period ( . ), forward slash ( / ), and dash ( - ), are treated differently depending on the context in which they are used. For example, if these characters appear in an acronym like U.S.A. or in a serial or model number like v1.2-a, they are treated as part of the word rather than as separate punctuation.
The following table gives examples of character strings that the tokenizer recognizes as terms. The exact specification is somewhat complex; for details see XTFTokenizer.jj from the XTF Distribution (or on-line in the CVS repository).
Category                                   | Examples
Basic word: sequence of digits and letters | boat, Washington, 6, 1895, Java2
Words with internal apostrophes            | O'Reilly, you're, O'Reilly's
Acronyms (internal dots are removed)       | U.S.A, I.B.M.
Company names                              | AT&T, Excite@Home
Email addresses                            | wer@all-one.com
Computer host names                        | texts.cdlib.org, main.server
Floating-point numbers                     | 3.14159, .6, 7.23
Dates                                      | 12/3/89, 3-Jan-02
Serial and model numbers                   | 12-6A, 127.0.1.1, 270_ES, FX/7
After tokenizing, all upper-case letters within tokens are converted to lower-case, which allows queries to be case-insensitive. Optional processing on tokens can remove distinctions of plural vs. singular, and can remove diacritic accent characters.
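The sketch below is a very rough approximation of these rules using a single regular expression; the real grammar lives in XTFTokenizer.jj and handles many more cases (acronym dot removal, CJK logograms, plural and accent folding, and so on). It simply keeps letters, digits, underscores, and the non-breaking symbols together, keeps internal . / - when surrounded by term characters, and lower-cases the result.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Very rough approximation of XTF's tokenizing rules -- the real grammar
    // is defined in XTFTokenizer.jj and covers many more cases.
    public class RoughTokenizer {
        // Letters, digits, underscores, and the non-breaking symbols & @ '
        // may appear inside a term; internal . / - are kept only when they
        // are surrounded by term characters (e.g. "v1.2-a", "12/3/89").
        private static final Pattern TERM =
            Pattern.compile("[\\p{L}\\p{N}_&@']+(?:[./-][\\p{L}\\p{N}_&@']+)*");

        public static List<String> tokenize(String text) {
            List<String> terms = new ArrayList<>();
            Matcher m = TERM.matcher(text);
            while (m.find())
                terms.add(m.group().toLowerCase());   // queries are case-insensitive
            return terms;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("O'Reilly wrote about AT&T on 12/3/89."));
            // -> [o'reilly, wrote, about, at&t, on, 12/3/89]
        }
    }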
Proximity and "Slop"
As mentioned earlier, the inverted index not only maintains a list of the documents each term appears in, but also each term's position. This information is then used to support proximity-based queries, for example, searching for a pair of terms within 10 words of each other. In XTF, proximity queries are viewed as a "sloppy" match, and thus are specified in terms of a "slop" value.

XTF Definition: slop, n. The sum, for each term in the query, of the distance between its position in the query and its position (relative to the start of the match) in the source text. Note that slop is similar to "edit distance" in computer science, but slop is easier to compute.
For example, consider the query man NEAR war compared to this sentence in a document: "The man went to war." The potential match will be on the words "man went to war". In this case, "man" is at position 1 in both the query and the match, and thus contributes nothing to the slop. However, "war" is at position 2 in the query but position 4 in the match; thus it contributes 2 to the slop, making the total slop 2. For this to be considered a match, the proximity query would have to specify a slop of 2 or greater. This can be summarized in a table:
Term | Position in Query | Position in Text | Difference
man  | 1                 | 1                | 0
war  | 2                 | 4                | 2
     |                   | total slop       | 2
The definition of slop penalizes terms that are found out of order in the document text. For example, consider a query for dog NEAR house compared to the sentence "Looking at his house, our dog despaired." Even though there is only one word between the terms, the slop is actually 3:
Term  | Position in Query | Position in Text | Difference
dog   | 1                 | 3                | 2
house | 2                 | 1                | 1
      |                   | total slop       | 3
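The slop arithmetic in both tables can be checked with a small sketch (illustrative only; Lucene's actual sloppy-phrase matching is more elaborate): for each query term, take the absolute difference between its 1-based position in the query and its 1-based position relative to the start of the match, then sum the differences.

    import java.util.List;
    import java.util.Map;

    // Sketch of the "slop" arithmetic described above.  Positions are 1-based:
    // query positions count from the first query term, text positions count
    // from the first word of the potential match.
    public class SlopExample {
        static int slop(List<String> queryTerms, Map<String, Integer> matchPositions) {
            int total = 0;
            for (int i = 0; i < queryTerms.size(); i++) {
                int queryPos = i + 1;                                  // position in the query
                int textPos  = matchPositions.get(queryTerms.get(i));  // position in the match
                total += Math.abs(queryPos - textPos);
            }
            return total;
        }

        public static void main(String[] args) {
            // "man NEAR war" against "man went to war": man=1, war=4  ->  slop 2
            System.out.println(slop(List.of("man", "war"), Map.of("man", 1, "war", 4)));

            // "dog NEAR house" against "house, our dog": house=1, dog=3  ->  slop 3
            System.out.println(slop(List.of("dog", "house"), Map.of("dog", 3, "house", 1)));
        }
    }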
It is easy to see that if the slop is zero, then all the words must appear in the source document in the same order with no intervening terms. This is considered an exact match, generally referred to in this document as a "phrase".

XTF Definition: phrase, n. A proximity query with slop equal to zero.
When processing the source document, the textIndexer increments the position for each term it encounters. However, at sentence boundaries the position is incremented by five (this default can be changed), which has the effect of penalizing matches that cross sentence boundaries. Additionally, special index pre-filter tags can increase the position (see the XTF Programming Guide for details).
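The following sketch illustrates that position-assignment scheme; the sentence-boundary detection here (a trailing period) is a simplification of what the textIndexer actually does, and the increment of five is the configurable default mentioned above.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of position assignment with a sentence-boundary penalty
    // (the real textIndexer detects boundaries more carefully than by
    // simply looking for a trailing period).
    public class PositionAssignment {
        public static void main(String[] args) {
            Map<String, Integer> positions = new LinkedHashMap<>();
            int pos = 0;
            int increment = 1;
            for (String word : "The man went away. War came soon.".split("\\s+")) {
                pos += increment;                       // 1 normally...
                increment = word.endsWith(".") ? 5 : 1; // ...5 across a sentence boundary
                positions.put(word.replace(".", ""), pos);
            }
            System.out.println(positions);
            // {The=1, man=2, went=3, away=4, War=9, came=10, soon=11}
        }
    }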
Chunking
The textIndexer tool splits source documents into smaller chunks of text when adding them to its index. Doing so speeds up proximity searches and the calculation of their resulting summary blurbs, or "snippets", by limiting how much of the source document must be read into memory.

XTF Definition: snippet, n. A section of source text surrounding and including a match (or "hit").
The Lucene search engine forms the foundation of XTF's search capabilities. From Lucene's point of view, each of these text chunks is a single searchable unit, independent of all other chunks. You might wonder, then, "How are proximity searches performed that span the boundary between two chunks?" For instance, consider the following two chunks representing the sentence "The quick brown fox jumped over the lazy dog."
Chunk 1: | the | quick | brown | fox | jumped |      |     |      |
Chunk 2: |     |       |       |     |        | over | the | lazy | dog
If one searched for the phrase "fox jumped over", the text engine would return zero results. This is clearly unacceptable.
The answer is that each chunk overlaps the previous one by a certain number of terms, called the "chunk overlap". This overlapping area allows a match in one chunk to extend into the next chunk by the overlapping number of words. Thus the overlap both limits and equals the maximum proximity the system can handle.
Chunk 1: | the | quick | brown | fox | jumped |      |     |      |
Chunk 2: |     |       | brown | fox | jumped | over | the |      |
Chunk 3: |     |       |       |     | jumped | over | the | lazy | dog
The chunk size and chunk overlap are both configurable. If the selected chunk overlap is large relative to the chunk size, space and processing time will be wasted because many more chunks will be created. Conversely, making the overlap very small limits the effective maximum "slop" value for all proximity queries. Selecting these values is a trade-off between performance and maximum proximity.
The default values in the XTF distribution define a chunk size of 200 words and an overlap of 20 words. These seem to give an adequate maximum proximity, while minimizing processing time and disk space. One final note: chunking is not performed on meta-data fields, as they are assumed to be relatively small in size.
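The sketch below shows the general idea of overlapping chunks; the real textIndexer also records positions, section types, and other bookkeeping that is omitted here. With a chunk size of 5 and an overlap of 3 words it reproduces the table above; the XTF defaults are 200 and 20.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Sketch of splitting a token stream into overlapping chunks.
    // Each chunk starts (chunkSize - overlap) words after the previous one,
    // so consecutive chunks share `overlap` words.
    public class Chunker {
        static List<List<String>> chunk(List<String> words, int chunkSize, int overlap) {
            List<List<String>> chunks = new ArrayList<>();
            int step = chunkSize - overlap;
            for (int start = 0; start < words.size(); start += step) {
                chunks.add(words.subList(start, Math.min(start + chunkSize, words.size())));
                if (start + chunkSize >= words.size())
                    break;                       // last chunk reached the end of the text
            }
            return chunks;
        }

        public static void main(String[] args) {
            List<String> words = Arrays.asList(
                "the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog");
            chunk(words, 5, 3).forEach(System.out::println);
            // [the, quick, brown, fox, jumped]
            // [brown, fox, jumped, over, the]
            // [jumped, over, the, lazy, dog]
        }
    }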
Stop Words and Bi-grams
Recall that XTF builds an inverted index using Lucene. For each term found in any document, the index stores a list of each occurrence of that term.
Now consider a term like "the". It occurs so commonly in English-language texts that the list of all occurrences becomes very large and thus takes a long time to process. Common words like "the" are called stop words; other examples are "a", "an", "it", "and", "in", "is".
A key observation is that these stop words, because they're so common, are uninteresting to search for. One's initial tendency might then be to simply ignore them, and this solution indeed speeds up searching.
XTF Definition: stop word, n. A very common word that is generally uninteresting to search for.
For instance, a search for "man in war" would be interpreted as "man war". Unfortunately, this will also turn up occurrences of "man of war" (a kind of jellyfish), which is not what the user intended.
So the second key observation is that stop words are useful in conjunction with non-stop words. While "in" is a very common word, the combination "man-in" is much less common, and is thus much faster to search for. This leads directly to the idea of bi-grams, which XTF implements to get almost the speed of eliminating stop words while still providing good query results.
XTF Definition: bi-gram, n. A single term in the index, composed of a stop word and a non-stop word fused together.
Consider the sentence "A friend in need is a friend indeed." Scanning for stop words and combining them with adjacent words, we get the following sequence of terms (bi-grams are the terms containing a hyphen; the rest are regular terms):
- Index: a-friend friend friend-in in-need need need-is is-a a-friend friend indeed
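The fusing rule can be sketched as follows (illustrative only; the real XTF filter also deals with token positions, case, and the query side): every non-stop word is emitted on its own, and every adjacent pair of words in which at least one member is a stop word is also emitted fused with a hyphen.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    // Sketch of bi-gram generation at index time.
    // Rule: every non-stop word is emitted on its own, and every adjacent pair
    // of words in which at least one is a stop word is emitted fused with "-".
    public class BiGramExample {
        static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "in", "is", "it", "and", "of");

        static List<String> toIndexTerms(List<String> words) {
            List<String> out = new ArrayList<>();
            for (int i = 0; i < words.size(); i++) {
                String word = words.get(i);
                if (!STOP_WORDS.contains(word))
                    out.add(word);                               // non-stop words stand alone
                if (i + 1 < words.size()) {
                    String next = words.get(i + 1);
                    if (STOP_WORDS.contains(word) || STOP_WORDS.contains(next))
                        out.add(word + "-" + next);              // fuse pairs involving a stop word
                }
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(toIndexTerms(
                List.of("a", "friend", "in", "need", "is", "a", "friend", "indeed")));
            // [a-friend, friend, friend-in, in-need, need, need-is, is-a, a-friend, friend, indeed]
        }
    }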
As you can see, the index is quite different when bi-grams are added to it. Consequently, a similar transformation must be performed when a query is made, essentially rewriting the query. For instance, a search for the phrase "friend in need" is rewritten to search for the phrase "friend-in in-need".
More complex transformations are required for NEAR queries. For instance, consider the proximity query "friend NEAR in NEAR need". The engine rewrites the query with and without stop words, so that if any exact matches are found, they will be ranked higher, but if not, any matches containing "friend" near "need" will still be found. The resulting query looks like this:
- Query: (friend OR friend-in) NEAR (in-need OR need)
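One plausible sketch of this rewrite, tuned to reproduce the example above (the real XTF query rewriter is more sophisticated and also handles phrases, as described earlier): each stop word is dropped as a stand-alone clause, and each remaining term becomes an OR of itself and its bi-gram forms with any neighboring stop words.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    // Hedged sketch of the NEAR-query rewrite described above -- one plausible
    // reading of the rule, matching the documented example.
    public class NearQueryRewrite {
        static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "in", "is", "it", "and", "of");

        static String rewrite(List<String> queryTerms) {
            List<String> clauses = new ArrayList<>();
            for (int i = 0; i < queryTerms.size(); i++) {
                String term = queryTerms.get(i);
                if (STOP_WORDS.contains(term))
                    continue;                          // stop words survive only inside bi-grams
                List<String> alternatives = new ArrayList<>();
                if (i > 0 && STOP_WORDS.contains(queryTerms.get(i - 1)))
                    alternatives.add(queryTerms.get(i - 1) + "-" + term);   // preceding stop word
                alternatives.add(term);
                if (i + 1 < queryTerms.size() && STOP_WORDS.contains(queryTerms.get(i + 1)))
                    alternatives.add(term + "-" + queryTerms.get(i + 1));   // following stop word
                clauses.add(alternatives.size() == 1
                    ? alternatives.get(0)
                    : "(" + String.join(" OR ", alternatives) + ")");
            }
            return String.join(" NEAR ", clauses);
        }

        public static void main(String[] args) {
            System.out.println(rewrite(List.of("friend", "in", "need")));
            // (friend OR friend-in) NEAR (in-need OR need)
        }
    }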
In summary, transforming stop words into bi-grams speeds up query processing, while retaining the ability to include stop words in a query.