Query Parser Programming
As previously noted, the
Query Parser is responsible for translating a URL-based query into an XML query that the search engine can actually understand. Consider the following pseudo-query:
- Find all occurrences of "man" and "war" but not "Man of War".
What we're trying to find here is any document containing both "man" and "war", but not "Man of War" (a kind of jellyfish). In theory, the web page into which the user types the search query could take a simplified English-like representation of the query with the form:
- find man and war but not "Man of War"
but writing an XSLT parser to process it would be a complicated endeavor. To simplify things, we'll assume that the web page has one field that accepts all the words or phrases to find, and another field that accepts all the words or phrases to exclude. For our specific example, the user would type
man war
into the
text to find field, and
"Man of War"
into the
text to exclude field. Note that each word or phrase is separated from the others by a space, and that an exact phrase (like
Man of War) is enclosed in double-quotes to differentiate it from a list of individual words. The resulting query URL that would be passed to the
crossQuery servlet would then look something like this:
- http://yourserver:8080/xtf/search?text=man+war;text-exclude=%22Man+of+War%22
Notice that the first part of the URL (everything before the
? symbol) invokes the
crossQuery servlet, and the second part of the URL (everything after the
? symbol) defines the search to be performed. Also notice that the search to be performed is represented by two parameters:
text=man+war |
The list of words to search for. |
text-exclude=%22Man+of+War%22 |
The phrase to exclude from the search. |
These two parameters carry the "find" and "exclude" semantics represented by the two fields of our imagined query web page. As is typical for URLs, the spaces in each parameter have been replaced with plus signs (
+), and the double-quote characters have been replaced with their percent-encoded ASCII hexadecimal values (%22).
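The encoding conventions just described can be illustrated with a short Python sketch. It is only a demonstration of how the example URL decodes, not the code the crossQuery servlet uses. The helper name decode_query is hypothetical, and the sketch assumes that parameters are separated by semicolons, as they are in the example URL.

```python
from urllib.parse import urlsplit, unquote_plus

def decode_query(url):
    """Return (name, value) pairs from the part of the URL after '?'.

    Hypothetical helper for illustration; assumes ';' separates parameters.
    """
    query = urlsplit(url).query
    pairs = []
    for part in query.split(";"):
        if not part:
            continue
        name, _, value = part.partition("=")
        # unquote_plus turns '+' into a space and %22 into a double quote.
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs

url = "http://yourserver:8080/xtf/search?text=man+war;text-exclude=%22Man+of+War%22"
params = decode_query(url)
for name, value in params:
    print(name, "=", value)
```

Decoding by hand rather than with a general-purpose query-string parser keeps the semicolon separator explicit, since many parsers expect an ampersand by default.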
Since the
Query Parser is written in XSLT, it actually expects an XML document as its input, and not a URL like the one presented above. Consequently, the
crossQuery servlet preprocesses the query URL and turns it into an XML input fragment for the
Query Parser to translate. In general, the input XML passed to the
Query Parser looks like this:
<parameters>
<param name="ParamName" value="ParamValue">
Token | Phrase
Token | Phrase
…
</param>
…
</parameters>
where
Token specifies a single word or symbol, and has the form:
- <token value="Word" isWord="YesOrNo"/>
where
value="Word" |
is the actual word or symbol extracted from the URL. |
isWord="YesOrNo" |
identifies whether the token is a word or punctuation symbol. |
and
Phrase specifies an entire phrase extracted as a single string, with the form:
<phrase value="StringOfWords">
Token
…
Token
</phrase>
where
value="StringOfWords" |
is the entire phrase extracted from the URL as a single string. |
Token...Token |
is the original phrase broken down into individual token tags for each word or symbol in the phrase. |
For our particular example URL, the input XML fragment passed to the
Query Parser would be:
<parameters>
<param name="text" value="man war">
<token value="man" isWord="yes"/>
<token value="war" isWord="yes"/>
</param>
<param name="text-exclude" value="&quot;Man of War&quot;">
<phrase value="Man of War">
<token value="Man" isWord="yes"/>
<token value="of" isWord="yes"/>
<token value="War" isWord="yes"/>
</phrase>
</param>
</parameters>
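To make the structure of this fragment concrete, the following Python sketch assembles an equivalent parameters element from decoded parameter values. It is an illustration under stated assumptions, not the servlet's actual preprocessing: the helper names are hypothetical, words are split on whitespace only, and every token is marked isWord="yes" because the sketch does not try to detect punctuation.

```python
import re
import xml.etree.ElementTree as ET

def add_tokens(parent, text):
    """Append one <token> per whitespace-separated word (assumed tokenization)."""
    for word in text.split():
        ET.SubElement(parent, "token", value=word, isWord="yes")

def build_parameters(pairs):
    """Build a <parameters> element from (name, value) pairs (hypothetical helper)."""
    root = ET.Element("parameters")
    for name, value in pairs:
        param = ET.SubElement(root, "param", name=name, value=value)
        # A double-quoted run becomes a <phrase>; any other run becomes a <token>.
        for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', value):
            if quoted:
                phrase = ET.SubElement(param, "phrase", value=quoted)
                add_tokens(phrase, quoted)
            elif bare:
                ET.SubElement(param, "token", value=bare, isWord="yes")
    return root

pairs = [("text", "man war"), ("text-exclude", '"Man of War"')]
parameters = build_parameters(pairs)
xml_text = ET.tostring(parameters, encoding="unicode")
print(xml_text)
```

Note that the serializer escapes the double quotes inside the value attribute of the text-exclude parameter, which is required for the fragment to be well-formed XML.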
As mentioned before, it is the job of the
Query Parser XSLT code to translate the above input into an XML query that the
crossQuery search engine understands. An XML query passed to the search engine has the general form:
<query indexPath="LocationOfIndexDBToUse" style="ResultFormatterLocation">
QueryElements
</query>
The
<query> tag is always the outermost tag in a query, containing all the other tags that define the query to be performed. Through its
indexPath="LocationOfIndexDBToUse" attribute, this tag identifies the Lucene index to use when performing the query. Through its
style="ResultFormatterLocation" attribute, it also defines the path to the
Result Formatter XSLT stylesheet that will format the query results. For both attributes, the path specified is relative to the base install path for the XTF system (i.e.,
XTF_HOME).
Within the
<query> tag, the
QueryElements identify the type of query to perform. The simplest query that can be performed is a query for a single word, or
term. It has the form:
<term field="FieldToSearch">
WordToFind
</term>
This tag indicates that we wish to find a single word in the field identified by
field="FieldToSearch". If we wish to search the main text of a document,
FieldToSearch should be set to text. If we wish to search metadata for a document, we would use a metadata field name instead, like creator or subject. Once the search field has been identified, the single word we actually wish to find should be substituted for
WordToFind.
The next simplest query to perform is a
phrase query. It has the form:
<phrase field="FieldToSearch">
Term
Term
…
</phrase>
This query contains one or more term tags that together identify a phrase to find, rather than a single word. For example, the
"Man of War" phrase in our sample query above would be constructed using the
<phrase> and
<term> tags as follows:
<phrase field="text">
<term> Man </term>
<term> of </term>
<term> War </term>
</phrase>
Note from this example that the
field="FieldToSearch" attribute doesn't need to be specified in each of the
<term> tags, since the enclosing
<phrase> tag has already identified the field to be searched.
The one remaining query element that we would need to construct a complete query for our man and war not "Man of War" example is the
query clause. It has the form:
<ClauseType field="FieldToSearch">
Term | Clause
Term | Clause
…
</ClauseType>
Where valid
ClauseType values are
and,
or,
not,
near,
orNear,
phrase, and
exact. Each of these clause types does pretty much what you would expect:
- The <and> clause requires all its sub-terms/phrases/clauses to be present for a match to occur.
- The <or> clause requires any one of its sub-terms/phrases/clauses to be present for a match to occur.
- The <not> clause requires that none of its sub-terms/phrases/clauses are present for a match to occur.
- The <near> clause requires all its sub-terms/phrases/clauses to be near each other for a match to occur. The definition of near is fairly complicated, and will not be discussed here. See the Query Parser tag reference for an in-depth description of the <near> clause.
- The <orNear> clause is similar to <near> except that it can also match if some of the clauses are missing.
- The <exact> clause operates just like the phrase clause, except that it matches the entire contents of a field only, whereas a phrase clause can match anywhere within the field.
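The clause descriptions above can be restated as a toy matcher in Python. This is a sketch of the semantics as described in the list only; it is not how the search engine evaluates queries. The nested-tuple encoding of clauses and the function name matches are assumptions of the sketch, the words of a field are assumed to be already lowercased, and near and orNear are left out because their definition is not covered here.

```python
def matches(clause, words):
    """Return True if a toy clause matches a list of lowercase words.

    A clause is either a plain string (a term) or a tuple of the form
    (clause_type, sub_clause, ...). This encoding is an assumption of
    the sketch, not an XTF data structure.
    """
    if isinstance(clause, str):
        return clause.lower() in words
    kind, *subs = clause
    if kind == "and":
        return all(matches(s, words) for s in subs)
    if kind == "or":
        return any(matches(s, words) for s in subs)
    if kind == "not":
        return not any(matches(s, words) for s in subs)
    terms = [s.lower() for s in subs]
    if kind == "phrase":
        # The terms must appear consecutively, anywhere in the field.
        n = len(terms)
        return any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    if kind == "exact":
        # The terms must make up the entire contents of the field.
        return words == terms
    raise ValueError(f"clause type not covered by this sketch: {kind}")

query = ("and", "man", "war", ("not", ("phrase", "Man", "of", "War")))
doc_a = "the man went to war".split()
doc_b = "a portuguese man of war stung the man during the war".split()
print(matches(query, doc_a), matches(query, doc_b))
```

Here doc_a contains both words without the phrase and so matches, while doc_b contains the excluded phrase and does not.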
Now, for the sample query we discussed above:
- man and war not "Man of War"
the complete query would look as follows:
<query indexPath="./index" style="./style/crossQuery/resultFormatter.xsl">
<and field="text">
<term> man </term>
<term> war </term>
<not>
<phrase>
<term> Man </term>
<term> of </term>
<term> War </term>
</phrase>
</not>
</and>
</query>
At this point, the trick is to write a
queryParser.xsl stylesheet that converts the given input XML fragment into the output XML query shown above. Unfortunately, writing XSLT is well beyond the scope of this document and will not be discussed here. The good news, however, is that the sample
queryParser.xsl included with the XTF installation performs the necessary query conversion illustrated in this example, and is a good starting point for creating your own custom Query Parser.
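Setting XSLT aside, the mapping from the input fragment to the output query can be sketched in Python for this one example. The sketch is not a substitute for queryParser.xsl and does not reflect how the sample stylesheet is written. The helper names are hypothetical, and the indexPath and style values are simply copied from the example query above.

```python
import xml.etree.ElementTree as ET

PARAMETERS = """
<parameters>
  <param name="text" value="man war">
    <token value="man" isWord="yes"/>
    <token value="war" isWord="yes"/>
  </param>
  <param name="text-exclude" value="&quot;Man of War&quot;">
    <phrase value="Man of War">
      <token value="Man" isWord="yes"/>
      <token value="of" isWord="yes"/>
      <token value="War" isWord="yes"/>
    </phrase>
  </param>
</parameters>
"""

def add_term(parent, token):
    """Append a <term> holding the token's word."""
    ET.SubElement(parent, "term").text = token.get("value")

def add_items(parent, param):
    """Copy each <token> or <phrase> child of a <param> as <term> or <phrase>."""
    for item in param:
        if item.tag == "token":
            add_term(parent, item)
        elif item.tag == "phrase":
            phrase = ET.SubElement(parent, "phrase")
            for token in item:
                add_term(phrase, token)

def build_query(parameters):
    """Build the <query> element for the two-field example (hypothetical helper)."""
    query = ET.Element("query", indexPath="./index",
                       style="./style/crossQuery/resultFormatter.xsl")
    clause = ET.SubElement(query, "and", field="text")
    for param in parameters.findall("param"):
        name = param.get("name")
        if name == "text":
            add_items(clause, param)
        elif name == "text-exclude":
            # Everything to exclude is wrapped in a single <not> clause.
            add_items(ET.SubElement(clause, "not"), param)
    return query

query = build_query(ET.fromstring(PARAMETERS.strip()))
print(ET.tostring(query, encoding="unicode"))
```

The sketch handles only the text and text-exclude parameters from this example; a real Query Parser would also need to deal with metadata fields and any other parameters the search page sends.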
It should also be noted that the various query tags illustrated here have been shown in their simplest form for the sake of clarity. For example, the Query tag has additional attributes that allow query matches to be returned a few at a time. This allows the Result Formatter to display a short page of search results rather than a single page containing every result in the repository. Another thing to note is that Phrase tags are in fact recursive, and can contain sub-phrases or clauses (not just Term tags). For a complete description of query tags and the attributes they support, please refer to the
Query Parser Tag Reference.