Faceted Browsing

Table of Contents

Faceted Browsing
Meta-data Requirements
Adding Facets to the Stylesheets
Group Selection
Hierarchical Facets

crossQuery provides a good general-purpose solution for searching a large collection of documents. However, up to this point we haven't covered any convenient way to browse such a collection. A user who isn't quite sure what subject they're looking for, or who just wants to get an idea of what the collection offers, will have a difficult time with search alone.

There are many possible ways to build a browsing system, but one very promising avenue is called faceted browsing, a way of intuitively exploring a collection that has rich meta-data. If one thinks of the collection as a bag of jewels, each meta-data field is a "facet", and items will have various values for that facet. The user can choose to explore one or many facets simultaneously.

Note: Faceted browsing is an advanced and fairly complex topic. Those new to XTF would be well advised to skip this section for now, and stick with the faceted browsing implemented in the default XTF stylesheets.

A good example of a faceted browse system was developed by Prof. Marti Hearst at UC Berkeley. The system is called Flamenco, and the Flamenco web site is quite informative and includes a good demonstration. For more information the reader is encouraged to play with and read about Flamenco.

XTF's faceted browse feature is built into the crossQuery servlet. One might ask whether it should have been a separate servlet altogether, but there is a good reason to have search and browse in the same servlet: it can be quite useful to combine the two activities. For instance, a user might enter a search for "africa", and then use the browse system to get an idea of the collection's coverage in terms of dates (the interface might include decades and a count of documents for each one), subjects, authors, etc.

Meta-data Requirements

The faceted browse system relies on properly marked meta-data in the documents. Essentially, it relies on meta-data fields that are not tokenized during the indexing process. If you are creating meta-data fields for sorting, you already know how to do this.

There are two ways to create a non-tokenized field. Both involve using the pre-filter stylesheet used by the textIndexer to add an additional attribute to an element (the element should already be marked with xtf:meta="yes").
  1. Add xtf:tokenize="no" to the meta-data element. This keeps the indexer from tokenizing the field (and therefore the user cannot perform queries on it), but the contents will still be available to the result formatter for display.
  2. Alternatively, add xtf:indexOnly="yes" to the meta-data element. Again the indexer won't tokenize the field, but it also won't store the contents of the field. This is more efficient if you don't need to display the contents in the result formatter.

What if you want to be able to search within a field and also use it for browsing? Simply program your pre-filter to make two copies of the meta-data field, one tokenized and one untokenized. Give them distinct names so you can tell them apart later. For example, one might create a "subject" field (tokenized) and a "facet-subject" field (untokenized), both of which contain the same data. In this case, it's wise to make the non-tokenized one indexOnly, to avoid storing the same data twice.

See the pre-filter programming section of this guide for more information.

Adding Facets to the Stylesheets

The first step to implementing a browse system is to add facets to the Query Parser stylesheet. You can add one or more <facet> elements as top-level children of the <query> element. For a fuller description see the Facet Tag in the XTF Tag Reference. Modify your query parser to add a <facet> tag for each meta-data field you want to browse by in the user interface.

Here are a few other things to note when constructing your query parser:
  1. The document hits accumulated for selected groups in a facet tag are independent of those for the main query. If you want both, nothing more is needed. If you only want facet counts, or want document hits accumulated only within the facet, then specify maxDocs="0" on your main <query> element.
  2. For purposes of facets, XTF counts only those documents matched by the main query. Thus, you can use a query to form arbitrary slices of a repository, and the facet system will report information about each slice.
  3. In the case where you want to count all the documents in the repository, you need to make a query that matches all documents. A simple way to do this is to specify an <allDocs> query like this:
<query >
   <facet field="field1" .../>
   <facet field="field2" .../>
   …
   <allDocs/>
</query>
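
The counting rule in points 2 and 3 can be modelled outside XTF. The following is a minimal Python sketch, not XTF code: the documents, the field name and the facet_counts helper are all invented for illustration. It shows that facet counts cover only the documents the main query matched, and that a match-everything query (the role <allDocs/> plays) counts the whole repository.

```python
from collections import Counter

# Hypothetical in-memory documents; titles and field name are invented for illustration.
DOCS = [
    {"title": "Rivers of Africa", "facet-subject": "African Studies"},
    {"title": "Africa in Antiquity", "facet-subject": "Ancient History"},
    {"title": "Colonial Africa", "facet-subject": "African Studies"},
    {"title": "Plato's Republic", "facet-subject": "Philosophy"},
]


def facet_counts(docs, field, matches):
    """Count documents per facet value, looking only at documents the main query matched."""
    return Counter(doc[field] for doc in docs if matches(doc))


# A main query for "africa" yields counts for that slice of the repository only.
africa = facet_counts(DOCS, "facet-subject", lambda d: "africa" in d["title"].lower())

# A predicate that accepts every document plays the role of <allDocs/>.
everything = facet_counts(DOCS, "facet-subject", lambda d: True)
```

Here the "africa" slice reports no Philosophy group at all, while the match-everything query does.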
Now that we've covered how to construct a <facet> element, it's time to look at the results. In the case of a faceted query, the normal query results passed to the Result Formatter stylesheet are supplemented with one or more Facet Result Tags, one per facet in the input query. These appear at the top level of the input to the Result Formatter, that is, as children of the <crossQueryResult> element.

Group Selection

A faceted query may result in a very large number of groups. For instance, a very large collection of documents could have thousands of different Subject headings; if one were browsing by subject it would be silly to look at a page containing a thousand subjects. So some intelligence is needed in picking which subjects to show. In addition, some applications will want to display document hits below the first group, or the first four groups, etc.

XTF provides a fairly sophisticated mechanism for choosing which groups to return, and for controlling which groups will have document hits gathered for them. The group selection mechanism is somewhat loosely modeled on XPath. Since XTF's group selection language is under considerable development, this section simply teaches by example rather than providing a formal specification.

We will refer to the following 10 groups in the discussion to follow:
African Studies
Ancient History
California and the West
Classical Religions
Classics
Environmental Studies
Philosophy
Politics
Social Theory
Sociology
Let's start with the simplest possible selection expression:
select="*"
Essentially, "*" is a wildcard that matches any group, regardless of its name, so this expression selects all ten groups, African Studies through Sociology, and returns them. (In the case of a hierarchical field it selects only the top-level groups; see Hierarchical Facets later on.) This is the default behavior if the select attribute is not specified for a facet.

It's important to note that by default XTF skips over groups that have no hits. If you really want all the groups including empty ones, specify the includeEmptyGroups attribute in the <facet> query tag.
select="*[1-5]"
This selects the first five groups, African Studies through Classics. It can be interpreted this way: "Start with all groups, regardless of name. From that set, select items 1 through 5."

The order of the groups matters for this selection. After all the counting is performed, XTF sorts all the groups, either by total number of documents (the default) or by group value/name. So this selection is the first five groups after all groups are sorted. The examples in this section assume the sorted order is the one listed above.

This sort of selection is generally used for dividing the groups up into pages of a fixed size (five groups per page in this case.) To get the second page (Environmental Studies through Sociology), one could select "*[6-10]", etc.
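
The paging arithmetic can be sketched in a few lines of Python. This is only a model of the behavior described above, assuming the groups have already been sorted; select_range and page_range are hypothetical helper names, not part of XTF.

```python
# The ten sample groups, assumed to be already sorted in this order.
GROUPS = [
    "African Studies", "Ancient History", "California and the West",
    "Classical Religions", "Classics", "Environmental Studies",
    "Philosophy", "Politics", "Social Theory", "Sociology",
]


def select_range(sorted_groups, first, last):
    """Model of *[first-last]: 1-based, inclusive positions in the sorted group list."""
    return sorted_groups[first - 1:last]


def page_range(page, size):
    """Return the (first, last) positions covered by a 1-based page number."""
    first = (page - 1) * size + 1
    return first, first + size - 1


page_one = select_range(GROUPS, *page_range(1, 5))  # models *[1-5]
page_two = select_range(GROUPS, *page_range(2, 5))  # models *[6-10]
```

A user interface could compute the range for the requested page this way and place it in the select attribute.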
select="Ancient History"
This selection chooses a group by name, Ancient History in this case. This could be useful if you wanted the count for only one group. Note that name selection is not case-sensitive (i.e. differences between upper and lower case are ignored.)
select="Ancient History#1-4"
Here we've introduced something new. This still selects a single group, but also tells XTF to gather document hits for the Ancient History group, and return the first four document hits.
select="Ancient History#all"
Just like above, but selects all document hits (instead of just four).
select="*[1-5]|Ancient History#1-4"
This may look complicated, but you've seen everything here before except the "|" separator. All this does is perform a logical union of two selections. In this case, it selects the first five groups (*[1-5]). Then it also selects the Ancient History group, and gathers the first four document hits for it (Ancient History#1-4).

Why would you want to do this? Say for instance you wanted to display the first page of groups, and you knew Ancient History was on that page, and you wanted document hits for that group shown. This selection would accomplish exactly that.

But what if you wanted to display document hits for the first group on the page, and you didn't already know what that group was?
select="*[1-5]|*[1]#1-4"
This selects the first five groups (*[1-5]) -- African Studies through Classics. Also, it selects the first group (*[1]), which is African Studies in this case, and gathers four document hits for it.

What if you want to select a certain group by name, and also select the other groups in the same page, but you don't know in advance which page the group is on? Well...
select="Politics[page(size=5)]"
This expression selects the Politics group, and then expands to select all the other groups on the same page as Politics, performing calculations assuming each page is five groups.

In the case of our sample data above, the first page of five groups is African Studies through Classics. But the first page doesn't contain Politics, so we skip it and select the groups on the second page instead -- Environmental Studies through Sociology -- which does include Politics.
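
The page lookup just described amounts to one integer division. The sketch below is a Python model of that calculation under the same assumption of pre-sorted groups; page_containing is a hypothetical name, and the case-insensitive match follows the name selection rule mentioned earlier.

```python
# The ten sample groups, assumed to be already sorted in this order.
GROUPS = [
    "African Studies", "Ancient History", "California and the West",
    "Classical Religions", "Classics", "Environmental Studies",
    "Philosophy", "Politics", "Social Theory", "Sociology",
]


def page_containing(sorted_groups, name, size):
    """Model of name[page(size=N)]: return the whole page that holds the named group."""
    lowered = [group.lower() for group in sorted_groups]
    index = lowered.index(name.lower())  # raises ValueError if the group is absent
    start = (index // size) * size       # 0-based position of the page's first group
    return sorted_groups[start:start + size]


politics_page = page_containing(GROUPS, "Politics", 5)
```

Politics sits at position 8, which falls on the second page of five, so the result runs from Environmental Studies through Sociology.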

Hierarchical Facets

Above we covered one way to deal with a large number of groups, by paging them. Another way is to apply structure to the groups, forming a hierarchy of parent/child relationships. One obvious application is for geographical information, which breaks down naturally into large groupings by nation, followed by states/provinces within each nation, counties or districts within the states, and thence to cities.

Telling XTF about hierarchical data is simple: place the data items in a meta-data field, listing the groupings from most general to most specific and separating them by double-colons, like this: US::California. In this section we'll refer to the following sample data.
Canada
Canada::Ontario
Canada::Ontario::Toronto
US
US::California
US::California::Berkeley
US::California::San Francisco
US::California::Yreka
US::Nevada
US::Nevada::Las Vegas
US::Nevada::Reno
As you can see, the sample hierarchy has three levels (Nation, State/Province, and City), but XTF imposes no particular limits on the depth of the hierarchy.

The parent groups in the list above, such as US and US::Nevada, are implied by the other groups and need not be present in the document repository. For example, if XTF encounters the group US::Nevada::Las Vegas, it automatically creates the groups US::Nevada and US, even if those values are not specifically present in any document's meta-data.
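
As a rough illustration of how parent groups follow from a single value, here is a small Python sketch. It only models the splitting on double-colons; with_implied_parents is a hypothetical helper and says nothing about how XTF implements this internally.

```python
def with_implied_parents(values):
    """Return every group named in `values` plus all ancestors implied by the '::' paths."""
    groups = set()
    for value in values:
        parts = value.split("::")
        # Each prefix of the path is a group in its own right.
        for depth in range(1, len(parts) + 1):
            groups.add("::".join(parts[:depth]))
    return sorted(groups)


implied = with_implied_parents(["US::Nevada::Las Vegas"])
```

A single three-level value therefore yields three groups: US, US::Nevada and US::Nevada::Las Vegas.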

Now let's consider how the group selection mechanism works in the presence of hierarchical groups.
select="*"
One might think this selects all the groups, but instead it only selects the top-level groups. In this case, it will select US and Canada. You might ask what happens if you request document hits:
select="*#all"
This expression selects all top-level groups, and gathers all the document hits for each group. Note that XTF will automatically count and gather document hits for all the children and grandchildren of US and Canada. That is, the US group will contain document hits for US::California::Berkeley, US::California::San Francisco, US::Nevada::Reno, etc.
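
The way documents roll up into their ancestors can be modelled as follows. This Python sketch assumes, for simplicity, that each document carries exactly one hierarchical value; rolled_up_counts and the sample values are illustrative only.

```python
from collections import Counter


def rolled_up_counts(doc_values):
    """Count each document once for its own group and once for every ancestor group."""
    counts = Counter()
    for value in doc_values:  # one hierarchical value per document (an assumption)
        parts = value.split("::")
        for depth in range(1, len(parts) + 1):
            counts["::".join(parts[:depth])] += 1
    return counts


counts = rolled_up_counts([
    "US::California::Berkeley",
    "US::California::San Francisco",
    "US::Nevada::Reno",
    "Canada::Ontario::Toronto",
])
```

With these four documents, the top-level US group accounts for three of them and Canada for one, even though no document is coded simply as US or Canada.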
select="US::*"
This selects all groups that are children of US, namely California and Nevada. Because no "#" is present, this expression only counts documents but doesn't collect them.

Note that when XTF reports the groups in Group Result Tags, the tags will be nested in a hierarchy which includes all the necessary parents and grandparents even if they weren't specifically selected:
<facet totalGroups="1" >
    <group value="US" totalSubGroups="2" >
        <group value="California" totalSubGroups="0" .../>
        <group value="Nevada" totalSubGroups="0" .../>
    </group>
</facet>
Also note that the group values do not have the colons "::" embedded in them. These are only present in the document meta-data.

But what if you wanted to select absolutely all of the groups, not just the top-level ones? One could use a really clunky expression like "*|*::*|*::*::*", but here's an easier way:
select="**"
This special syntax simply selects all the groups regardless of their level in the hierarchy. This can be useful for small hierarchies, or to bypass XTF's group selection mechanism and do group selection entirely in the Result Formatter stylesheet. Note however that it can be slow when processing large hierarchies.

What if you don't know which level of the hierarchy to select?
select="**[topChoices]"
This instructs XTF to make a good guess as to which level of the hierarchy to return, based on the documents selected by the main query. Essentially, it looks for the topmost level in the hierarchy that has more than one choice.

For instance, if the query produced two documents, one coded with US::California::Berkeley in its meta-data and the other coded with US::Nevada::Reno, then **[topChoices] would select California and Nevada. Starting from the top, there's only one choice: US. After that, there are two choices, so XTF stops there.

If instead the two documents were coded for US::California::Berkeley and US::California::Yreka, then XTF would select Berkeley and Yreka. Again starting from the top, there's one choice: US. Below that, there's again one choice: California. Below that are two choices, so it stops there.

If all the documents were for US::California::Berkeley, XTF would simply select that single group.
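
The three cases above suggest a simple descent: go down one level at a time while there is exactly one choice, and stop at the first level that offers several. The Python sketch below models that reading of the description; top_choices is a hypothetical function, it returns full "::" paths rather than bare group values, and the real XTF behavior may differ in edge cases such as documents coded at different depths.

```python
def top_choices(doc_values):
    """Model of **[topChoices]: descend while a level has one choice, stop when it has several."""
    paths = [value.split("::") for value in doc_values]
    depth = 0
    while True:
        # Distinct groups at this depth among the matched documents.
        choices = sorted({"::".join(p[:depth + 1]) for p in paths if len(p) > depth})
        if len(choices) != 1:
            return choices  # several choices (or none at all): stop here
        if not any(len(p) > depth + 1 for p in paths):
            return choices  # a single group with nothing below it
        depth += 1


two_states = top_choices(["US::California::Berkeley", "US::Nevada::Reno"])
two_cities = top_choices(["US::California::Berkeley", "US::California::Yreka"])
one_city = top_choices(["US::California::Berkeley", "US::California::Berkeley"])
```

The three results correspond to the three cases in the text: the two states, the two cities, and the single Berkeley group.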