
Query Operations and What They Do

Table of Contents

Query Operations and What They Do
Interpreting User Queries
Text Query Operations
TERM and Wildcard Queries on Text
AND Query on Text
OR Query on Text
PHRASE Query on Text
NEAR Query on Text
NOT Clause on Text
Meta-data Query Operations
TERM and Wildcard Queries on Meta-data
RANGE Queries on Meta-data
AND Query on Meta-data
OR Query on Meta-data
PHRASE Query on Meta-data
NEAR Query on Meta-data
NOT Clause on Meta-data
Stop Words in Queries
This section describes how queries are interpreted and specifies how the various query operators work. Note that meta-data and text queries are treated somewhat differently, because meta-data fields are assumed to be short, while the full text of a document is assumed to be very large.

Interpreting User Queries

The job of translating queries from an input URL to a form that XTF can understand is handled by the Query Parser stylesheet. The query parser included in the base distribution is relatively simple: by default, it forms an AND query consisting of all the input terms. All non-word terms (such as the "++" in "C++") are ignored. Optionally, the operation can be changed to OR or NEAR. In addition, terms can be excluded if desired.

Here are some sample URLs, and how the default query parser interprets each one:

http://yourserver:8080/xtf/search?title=apartheid+mind
http://yourserver:8080/search?title=apartheid+mind;title-exclude=mandela
http://yourserver:8080/search?title=apartheid+mind;title-join=or
http://yourserver:8080/search?title=apartheid;text=mandela
http://yourserver:8080/search?text=%22Nelson+Mandela%22;subject=africa
http://yourserver:8080/search?text=Mandela+Apartheid;text-join=5
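
As a rough illustration of the behavior described above, the following Python sketch turns one of the sample URLs into a small description of the resulting query. It is not the Query Parser stylesheet itself. The function name `interpret` and the output layout are hypothetical, and the sketch assumes that a numeric `-join` value selects NEAR with that slop and that a double-quoted run of words is treated as a phrase.

```python
import re
from urllib.parse import urlsplit, unquote_plus


def interpret(url):
    """Describe the query a search URL would form (illustrative only)."""
    # The sample URLs separate parameters with ';' rather than '&'.
    params = {}
    for pair in urlsplit(url).query.split(";"):
        name, _, value = pair.partition("=")
        if name:
            params[name] = unquote_plus(value)

    queries = {}
    for name, value in params.items():
        if name.endswith(("-join", "-exclude")):
            continue  # modifiers are attached to their base field below
        # Double-quoted runs become one phrase; other words are bare terms.
        terms = [t.strip('"') for t in re.findall(r'"[^"]*"|\S+', value)]
        join = params.get(name + "-join", "and")  # AND is the default
        query = {"op": "near" if join.isdigit() else join, "terms": terms}
        if join.isdigit():
            query["slop"] = int(join)  # assumption: numeric join means NEAR
        if name + "-exclude" in params:
            query["exclude"] = params[name + "-exclude"].split()
        queries[name] = query
    return queries


print(interpret("http://yourserver:8080/search?text=Mandela+Apartheid;text-join=5"))
# {'text': {'op': 'near', 'terms': ['Mandela', 'Apartheid'], 'slop': 5}}
```

Parameters are split by hand on ";" so that the sketch does not depend on how a particular URL library treats that separator.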

Of course, much more complex query parsing is possible, since the Text Engine can handle arbitrarily complex queries consisting of any combination of boolean query operators. Creating such a system is, however, left to the system designer setting up XTF, as it intermeshes closely with whatever HTML form or other mechanism is used to input the query, and is highly dependent on the skill level and needs of the final users of the system.


Text Query Operations

XTF implements a full complement of "boolean" operators used to form complex queries: AND, OR, NEAR, PHRASE, RANGE, and NOT, and supports wildcard search characters. This section covers the details of how these queries are interpreted within the very large documents XTF can handle.


TERM and Wildcard Queries on Text


A TERM query matches every occurrence of the specified term in the document. Upper-case vs. lower-case distinctions are ignored. Additionally, the term may contain special wildcard characters:

? The question-mark character matches terms with any character at that position. For example, lo?e would match love or lose, but not loe.
* The asterisk character matches terms with any number (including zero) characters at that position. For example, dog* would match any of the following terms: dog, dogs, doggie, doggerel, etc.

Depending on the particular wildcard, hundreds or even thousands of terms might match, so care should be taken when using these. To prevent such queries from occupying the engine for long periods of time, XTF allows queries to specify a limit on the maximum number of terms to match (controlled by the workLimit attribute). Queries that exceed the limit produce an error.
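
The wildcard rules and the term limit can be sketched in a few lines of Python. This is a simplified model, not XTF's implementation: `expand_wildcard`, `TermLimitExceeded`, and the `limit` parameter are hypothetical names, and the index is assumed to be a plain list of terms.

```python
import re


class TermLimitExceeded(Exception):
    """Raised when a wildcard expands to more terms than allowed."""


def expand_wildcard(pattern, index_terms, limit=50):
    """Return the index terms matched by a pattern containing ? and *."""
    # '?' matches exactly one character, '*' matches zero or more;
    # every other character is matched literally. Case is ignored.
    regex = re.compile("".join(
        "." if ch == "?" else ".*" if ch == "*" else re.escape(ch)
        for ch in pattern.lower()
    ))
    matches = []
    for term in index_terms:
        if regex.fullmatch(term.lower()):
            matches.append(term)
            if len(matches) > limit:
                raise TermLimitExceeded(f"{pattern!r} matches more than {limit} terms")
    return matches


print(expand_wildcard("lo?e", ["love", "lose", "loe", "lover"]))
# ['love', 'lose']
```

Raising as soon as the limit is passed, rather than after collecting every match, keeps a runaway pattern from doing the full expansion first.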

AND Query on Text


What does it mean to search for "man AND war" in the full text of all documents? Perhaps the most obvious answer would be to search for any document containing both words. But consider a document where "man" appeared in Chapter 1 and "war" appeared in Chapter 7. Would that be a document the user really wanted to find? Probably not. More likely they'd be interested in a document where "man" and "war" appear close together.

Thus XTF interprets AND queries on the full text as NEAR queries instead, with the slop factor set to the maximum for that index.

More formally, when used with terms, the AND query will match any section of text that contains all of the terms, in any order, as long as they are close together (that is, within the maximum proximity defined for the index, or 20 words in the default configuration).

When used to group sub-queries together, AND will match text where all of the sub-queries match, in any order, as long as the matches are close together (i.e. within the maximum proximity for the index).
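
To make the proximity behavior concrete, here is a minimal Python sketch of an AND over terms in full text. It assumes the document is already tokenized into a list of words and measures closeness as a simple word-position window. XTF's actual measure is the slop computation described elsewhere, so treat `and_matches` and its window rule as hypothetical simplifications.

```python
def and_matches(tokens, terms, max_proximity=20):
    """Return (start, end) word positions where all terms occur close together."""
    wanted = {t.lower() for t in terms}
    words = [t.lower() for t in tokens]
    spans = []
    for start, word in enumerate(words):
        if word not in wanted:
            continue
        seen = set()
        # Look ahead no further than max_proximity words from the start term.
        for end in range(start, min(start + max_proximity + 1, len(words))):
            if words[end] in wanted:
                seen.add(words[end])
                if seen == wanted:  # every term found inside the window
                    spans.append((start, end))
                    break
    return spans


near = "the man went to war".split()
far = ["man"] + ["filler"] * 30 + ["war"]
print(and_matches(near, ["man", "war"]))  # [(1, 4)]
print(and_matches(far, ["man", "war"]))   # []
```

The second call returns nothing even though both words are present, which mirrors the "Chapter 1 and Chapter 7" case above.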

OR Query on Text


When used to group terms, an OR query matches each occurrence of every term contained within it. If used to group sub-queries together, the OR query matches each occurrence of every sub-query.

PHRASE Query on Text


A PHRASE query generally contains two or more terms, and it matches any span of text where all the terms appear together, in order, with no other terms between them.

Less frequently, it can be used to group sub-queries. It matches a span of text where all of the sub-queries match, in order, without any intervening non-matching terms.

Note that a PHRASE query is equivalent to a NEAR query with a slop factor of zero.
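
For the common case of a PHRASE made of plain terms, the matching rule amounts to finding the terms at consecutive word positions. The sketch below assumes a pre-tokenized document and ignores sub-queries and stop words. `phrase_matches` is a hypothetical helper, not part of XTF.

```python
def phrase_matches(tokens, phrase):
    """Return start positions where the phrase terms appear consecutively."""
    words = [t.lower() for t in tokens]
    target = [t.lower() for t in phrase]
    if not target:
        return []
    n = len(target)
    # A match needs every term, in order, with no other terms in between.
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == target]


tokens = "Nelson Mandela met Mandela Nelson".split()
print(phrase_matches(tokens, ["nelson", "mandela"]))  # [0]
```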

NEAR Query on Text


Each NEAR query requires a "slop" factor. In rough terms, this factor can be thought of as limiting the amount of sloppiness when matching. A slop of zero indicates very tight control; in fact, a NEAR query with zero slop is equivalent to a PHRASE query. A large slop value indicates that terms may appear far apart, or out of order, or both. Note, however, that the slop value is silently constrained to the maximum proximity defined by the chunk overlap of an index.

For more details on how slop is computed, see the section on Proximity and Slop. For information on chunk overlap and how it relates to proximity searching, see the section on Chunking.

The NEAR query, when used with terms, matches a span of text containing all of the terms, where the match's slop is less than or equal to the slop factor specified for the query.

When used to group sub-queries, it matches a span of text where all of the sub-queries match, and the complete match's slop is less than or equal to the slop factor specified for the NEAR query.

NOT Clause on Text


A NOT clause may be specified as a sub-query of any boolean query (OR, AND, PHRASE, or NEAR). Any matches in the NOT clause will suppress outer matches within the maximum proximity of the index. This can be thought of as a "kill zone": each match within the NOT clause kills off nearby matches.
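
The "kill zone" idea can be sketched as a filter over word positions. This is a simplified model that treats each match as a single position and counts a distance equal to the maximum proximity as inside the zone. Both of those choices, and the name `suppress_near_not`, are assumptions made for illustration.

```python
def suppress_near_not(match_positions, not_positions, max_proximity=20):
    """Drop outer matches that fall within max_proximity words of a NOT match."""
    return [
        pos for pos in match_positions
        if all(abs(pos - kill) > max_proximity for kill in not_positions)
    ]


# Outer query matched at words 5 and 100; the NOT clause matched at word 10.
print(suppress_near_not([5, 100], [10]))  # [100]
```

Note that the match at word 100 survives: unlike the meta-data case, a NOT clause on text removes nearby matches rather than the whole document.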

Meta-data Query Operations

The following query operations can be applied to any meta-data field. Queries are applied to meta-data and full text in like fashion, with a few exceptions: AND queries are not proximity-based in meta-data fields, NOT clauses on meta-data can eliminate whole documents, and a new operator, RANGE, is available on meta-data fields.

TERM and Wildcard Queries on Meta-data


A TERM query matches every occurrence of the specified term in the meta-data field. Upper-case vs. lower-case distinctions are ignored. Additionally, the term may contain special wildcard characters:

? The question-mark character matches terms with any character at that position. For example, lo?e would match love or lose, but not loe.
* The asterisk character matches terms with any number (including zero) characters at that position. For example, dog* would match any of the following terms: dog, dogs, doggie, doggerel, etc.

Depending on the particular wildcard, hundreds or even thousands of terms might match, so care should be taken when using these. To prevent such queries from occupying the engine for long periods of time, XTF allows queries to specify a limit on the maximum number of terms to match (controlled by the workLimit attribute). Queries that exceed the limit produce an error.

RANGE Queries on Meta-data


A RANGE query is similar to a wildcard term query in that it matches a (possibly large) number of terms. Lower and upper bounding terms are specified, and every term that appears in the index lexicographically between the two bounds is matched.

For example, if the lower bound were "1895" and the upper bound were "1900", a range query would match any of the terms 1895, 1896, 1897, 1898, 1899, and 1900. Optionally, the query can exclude the bounds, in which case it would match neither 1895 nor 1900.

As in the case of wildcard queries, care must be taken to avoid searching a huge number of terms. This can happen easily: in the case of the example above, if dates were encoded in the index in the form YYYY-MM-DD, then all the days from 1895 to 1900 would match... potentially 2,190 of them. And of course a range query from A to Z would match practically every term in the index. Again, each query can specify a limit on the maximum number of terms to match, to avoid bogging down the engine.
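
A lexicographic range over a sorted term list can be sketched with Python's `bisect` module. The function `range_terms`, its `inclusive` flag, and its `limit` parameter are hypothetical stand-ins for the behavior described above, and the index is assumed to be a sorted list of strings.

```python
from bisect import bisect_left, bisect_right


def range_terms(sorted_terms, lower, upper, inclusive=True, limit=1000):
    """Return index terms lexicographically between lower and upper."""
    if inclusive:
        lo = bisect_left(sorted_terms, lower)
        hi = bisect_right(sorted_terms, upper)
    else:
        # Excluding the bounds skips terms equal to lower or upper.
        lo = bisect_right(sorted_terms, lower)
        hi = bisect_left(sorted_terms, upper)
    matched = sorted_terms[lo:hi]
    if len(matched) > limit:
        raise ValueError(f"range matches {len(matched)} terms, limit is {limit}")
    return matched


years = [str(y) for y in range(1890, 1906)]
print(range_terms(years, "1895", "1900"))
# ['1895', '1896', '1897', '1898', '1899', '1900']
print(range_terms(years, "1895", "1900", inclusive=False))
# ['1896', '1897', '1898', '1899']
```

Because the comparison is on strings, a term such as "1895-03-01" also sorts after "1895" and is swept into the range, which is how a date-encoded field can expand to a very large number of terms.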

However, when searching numeric data (for example, file time and date stamps), the term-expansion approach described above is simply not sufficient. To handle this, XTF provides a special numeric range searching capability. This is specified as an attribute to the normal <range> query operator, but it tells XTF that the data is numeric, and in a rigid format (such as YYYY-MM-DD:HH-MM-SS; any rigid format is acceptable). When the first such query is made, XTF loads a table of all the data values and converts them to 64-bit integers. This table is then cached in memory, and range queries on that field are processed very quickly, avoiding any wildcard-like expansion.
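
The idea of a cached numeric table can be sketched as follows. The conversion rule used here (keep only the digits of the rigid-format value) is an assumption made for illustration, not XTF's documented conversion, and `NumericField` and `to_number` are hypothetical names. Bounds must be given in the same rigid format as the stored values.

```python
import re
from bisect import bisect_left, bisect_right


def to_number(value):
    """Convert a rigid-format value such as 1895-03-01 to an integer."""
    number = int(re.sub(r"\D", "", value))  # assumption: digits only
    if number >= 2 ** 63:
        raise ValueError("value does not fit in a signed 64-bit integer")
    return number


class NumericField:
    """Table of numeric values for one field, built once and then reused."""

    def __init__(self, values_by_doc):
        pairs = sorted((to_number(v), doc) for doc, v in values_by_doc.items())
        self.keys = [key for key, _ in pairs]
        self.docs = [doc for _, doc in pairs]

    def range(self, lower, upper):
        """Return documents whose value lies in [lower, upper]."""
        lo = bisect_left(self.keys, to_number(lower))
        hi = bisect_right(self.keys, to_number(upper))
        return self.docs[lo:hi]


dates = NumericField({"a": "1895-03-01", "b": "1899-12-31", "c": "1901-01-01"})
print(dates.range("1895-01-01", "1900-12-31"))  # ['a', 'b']
```

Each range query is then two binary searches over the cached table, with no per-term expansion.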

AND Query on Meta-data


Unlike in full-text queries, an AND query on meta-data implies no proximity restrictions. When used with terms, it matches documents where every term appears somewhere in the field, in any order.

When used to group sub-queries, it matches documents where all of the sub-queries match (note that the sub-queries may be on several different fields).

OR Query on Meta-data


When used to group terms, an OR query matches a document where any of the terms occurs within the meta-data field.

If used to group sub-queries together, the OR query matches documents matched by any of the sub-queries (note that the sub-queries may involve several different fields).
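
In contrast to full-text queries, AND and OR on meta-data terms select whole documents and ignore proximity. A minimal Python sketch, assuming each document is a dictionary of field values split on whitespace (`match_and`, `match_or`, and the sample data are hypothetical):

```python
def field_terms(value):
    """Split a meta-data field value into lower-cased terms."""
    return set(value.lower().split())


def match_and(docs, field, terms):
    """Documents where every term appears somewhere in the field."""
    wanted = {t.lower() for t in terms}
    return {doc_id for doc_id, fields in docs.items()
            if wanted <= field_terms(fields.get(field, ""))}


def match_or(docs, field, terms):
    """Documents where at least one term appears in the field."""
    wanted = {t.lower() for t in terms}
    return {doc_id for doc_id, fields in docs.items()
            if wanted & field_terms(fields.get(field, ""))}


docs = {
    1: {"title": "Apartheid and the Mind"},
    2: {"title": "Mind over matter"},
    3: {"title": "Mandela"},
}
print(match_and(docs, "title", ["apartheid", "mind"]))  # {1}
print(match_or(docs, "title", ["apartheid", "mind"]))   # {1, 2}
```

Grouping sub-queries on different fields would then amount to intersecting (AND) or uniting (OR) the document sets each sub-query returns.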

PHRASE Query on Meta-data


A PHRASE query generally contains two or more terms, and it matches any document where the terms appear together in the field, in order, with no other terms between them.

Less frequently, it can be used to group sub-queries. It matches any document where all of the sub-queries match, in order, without any intervening non-matching terms.

Note that a PHRASE query is equivalent to a NEAR query with a slop factor of zero.

NEAR Query on Meta-data


Each NEAR query requires a "slop" factor. In rough terms, this factor can be thought of as limiting the amount of sloppiness when matching. A slop of zero indicates very tight control; in fact, a NEAR query with zero slop is equivalent to a PHRASE query. A large slop value indicates that terms (or sub-queries) may appear far apart, or out of order, or both. There is no upper bound on the slop factor.

For more details on how slop is computed, see the section on Proximity and Slop.

The NEAR query, when used with terms, matches any document where all of the terms appear in the field and their group slop is less than or equal to the slop factor specified for the query.

When used to group sub-queries, it matches any document where all of the sub-queries match, and the complete match's slop is less than or equal to the slop factor specified for the NEAR query.

NOT Clause on Meta-data


A NOT clause may be specified as a sub-query of any boolean query (OR, AND, PHRASE, or NEAR). Any documents matching the NOT clause will be removed from the outer set of matches.

Stop Words in Queries

If a query contains one or more stop words, the query will be internally rewritten to work properly with the bi-gram system. Recall from the section on Stop Words and Bi-grams that using bi-grams allows XTF to support queries containing stop words while avoiding the usual severe impact on performance that they might have.

Here are some details on how stop words are handled in various query situations: