Dynamic FRBR
In a catalog with millions of bibliographic records, duplicate or near-duplicate records are a common problem. A recognized approach to dealing with this problem is to group such records together using
Functional Requirements for Bibliographic Records (FRBR).
To implement FRBR in XTF, we chose to adapt the standard FRBR
Work Set Algorithm to our purposes. In particular, we changed the method to determine work groups dynamically, when a query is run, rather than at index time. This allowed us to experiment with and tune the algorithm without having to re-index the entire test collection (over 10 million records).
For speed, XTF's "dynamic FRBRization" is implemented in Java, and caches large tables of data drawn from the underlying Lucene index files. In this section we'll cover very briefly the algorithm used, and then show how to activate it.
FRBR Algorithm in Brief
XTF's dynamic FRBR algorithm takes the entire result set from a query, and checks the resulting documents against each other. The goal is to group similar records together. For instance, two records with the same title and author should end up in the same group.
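The simplest case mentioned above (same title and same author) can be sketched as follows. This is only an illustration, not XTF's code: the class `ExactMatchGrouper`, the `Rec` type, and the normalization rule are all hypothetical, and the sketch uses exact matching on normalized strings in place of the score-based partial matching that XTF actually applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; NOT XTF's FRBRGroupData implementation.
class ExactMatchGrouper {

    /** Minimal stand-in for one bibliographic record in a result set. */
    static final class Rec {
        final String id;
        final String title;
        final String author;

        Rec(String id, String title, String author) {
            this.id = id;
            this.title = title;
            this.author = author;
        }
    }

    /** Lower-cases the value and replaces runs of punctuation/whitespace with one space. */
    static String normalize(String s) {
        if (s == null) return "";
        return s.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
    }

    /** Groups record IDs by normalized title plus author, keeping first-seen order. */
    static Map<String, List<String>> group(List<Rec> results) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Rec r : results) {
            String key = normalize(r.title) + "|" + normalize(r.author);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(r.id);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Rec> results = Arrays.asList(
            new Rec("r1", "Moby Dick", "Melville, Herman"),
            new Rec("r2", "Moby-Dick.", "MELVILLE, HERMAN"),
            new Rec("r3", "Typee", "Melville, Herman"));
        // r1 and r2 share a normalized title and author; r3 stands alone.
        System.out.println(group(results).values());
    }
}
```

Exact matching of this kind misses near-duplicates such as variant spellings or missing authors, which is the gap the score-based approach is meant to cover.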
The actual algorithm relies on a score-based approach that allows partial matches of various sorts. It is too complex to cover here, so for full details the reader is referred to Chapter 5 of the
Melvyl Recommender Project Full Text Extension Report (PDF).
Note: The algorithm loads sizable tables for title, author, date, and ID and caches them in RAM. This has two major ramifications: first, the tables can take some time to load the first time dynamic FRBR is accessed; second, the servlet container needs plenty of RAM, especially if the collection contains a very large number of records. The maximum heap size can be raised by setting the -Xmx flag on the Java command that starts the servlet container.
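One way to confirm that an -Xmx setting has taken effect is to ask the JVM for its maximum heap size. The sketch below uses the standard `Runtime.maxMemory()` call; the class name `HeapCheck` is hypothetical and is not part of XTF.

```java
// Hypothetical helper, not part of XTF: reports the heap ceiling of the running JVM.
class HeapCheck {

    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
    }
}
```

Running the compiled class with the same -Xmx value that is passed to the servlet container shows the resulting limit. The reported figure may differ slightly from the requested value, depending on the JVM.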
Activating Dynamic FRBRization
XTF integrates dynamic FRBRization into the normal crossQuery process by exposing the grouped records as a new facet. If you're not familiar with facets, refer to the
Faceted Browsing section of the
XTF Programming Guide.
Just like a normal facet, FRBRization is activated by adding a
Facet Query Tag to the Query Tag produced by your Query Parser stylesheet. However, it must be of a special form, as follows.
Dynamic FRBR Facet Tag
This tag specifies that the
Text Engine should group result documents into FRBR Work Sets using the built-in "dynamic FRBRization" algorithm described above. This tag should appear directly within a
Query Tag.
<facet field="java:org.cdlib.xtf.textEngine.facet.FRBRGroupData({SortOrder}FieldList)"
<!-- Other attributes as per normal Facet Query Tag... -->
/>
where
SortOrder: an optional field name to sort the resulting groups by. If preceded by a hyphen, the sort is reversed. If not specified, the group order is arbitrary.
FieldList: a required list of the meta-data fields to use to create FRBR work groups. The engine looks for fields containing the strings "title", "author" or "creator", "date" or "year", and "id" to determine how the field contents are to be incorporated into the groupings.
The results are identical to a normal facet query except that the facet groups are computed dynamically based on the result documents of the query, rather than statically based on values in a particular meta-data field.
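The way FieldList entries are mapped to roles can be pictured with a small sketch. It assumes that the match is a substring test on each field name, that it ignores case, and that the checks run in the order listed above; none of these details is confirmed here. The class `FieldRoleGuess`, the `Role` enum, and the sample field names are hypothetical and do not come from XTF.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; NOT the logic inside XTF's FRBRGroupData class.
class FieldRoleGuess {

    enum Role { TITLE, AUTHOR, DATE, ID, UNKNOWN }

    /** Guesses a field's role from substrings of its name (assumed behavior). */
    static Role roleOf(String fieldName) {
        String f = fieldName.toLowerCase(Locale.ROOT);
        if (f.contains("title")) return Role.TITLE;
        if (f.contains("author") || f.contains("creator")) return Role.AUTHOR;
        if (f.contains("date") || f.contains("year")) return Role.DATE;
        if (f.contains("id")) return Role.ID;
        return Role.UNKNOWN;
    }

    /** Maps each field name to its guessed role, preserving the given order. */
    static Map<String, Role> classify(List<String> fieldNames) {
        Map<String, Role> roles = new LinkedHashMap<>();
        for (String name : fieldNames) {
            roles.put(name, roleOf(name));
        }
        return roles;
    }

    public static void main(String[] args) {
        // Sample field names are invented for this example.
        System.out.println(classify(Arrays.asList("title-main", "creator", "year", "identifier")));
    }
}
```

Under substring matching of this kind, a short pattern such as "id" also matches unrelated names (for example, a field called `provider`), so it is worth checking that the names placed in FieldList cannot be misread.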