"More Like This" (Similar Documents Query)

XTF provides a query operator that finds documents "similar to" a given target document. User interfaces typically expose this feature through a "More like this..." link.

A sample implementation is provided in the default stylesheets that come with XTF. Perform a query in crossQuery, then click the Similar Items: Find link. A similarity query is executed asynchronously, and a summary of the resulting documents is inserted directly into your search page.

Similarity Algorithm in Brief

The current algorithm analyzes the content of the bibliographic metadata for the target item, chooses the most important terms in the record, and formulates a new query. Top-ranking items resulting from the new query are presented as recommendations.

While simple in theory, this approach admits a vast number of permutations and complications. There are many methods for choosing and ordering the top terms, and many approaches to formulating the new query. Moreover, bibliographic records are inconsistent: some are catalogued exhaustively, while others are sparse. Particularly in sparse records, the choice of a single subject heading can significantly affect which terms are chosen and how they are weighted. In extreme cases, this can cause unexpected results: two versions of a book, sparsely catalogued and with slight differences in subject headings, can yield very different recommendations. We experimented with various approaches to try to balance these complexities.

In our final iteration, each term of each metadata field in the source document is considered in turn. The number of occurrences of that term in the field, tf, is computed. The total number of documents containing that term in that field, df, is then fetched from the Lucene index. Terms are filtered out if they occur in too few or too many documents (the limits are adjustable). Next, a score is calculated for the term as tf * idf, where idf is the standard log(numDocs / df) + 1 and numDocs is the total number of documents in the index. Finally, the score for each term is totaled across all fields in which it occurs. The resulting term list is ranked by score, and the 25 top-scoring terms are chosen (this number is also adjustable).
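The term-selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not XTF's actual Java code: the function name top_terms, the dictionary shapes, and the default limits min_df and max_df are all hypothetical, and the natural logarithm is assumed for log.

```python
import math
from collections import Counter, defaultdict

def top_terms(doc_fields, doc_freq, num_docs, min_df=2, max_df=500, max_terms=25):
    """Rank the terms of one source document by summed tf * idf.

    doc_fields: {field name: list of terms in that field} (hypothetical shape)
    doc_freq:   {(field, term): number of documents with that term in that field}
    num_docs:   total number of documents in the index
    min_df / max_df: illustrative filter limits, not XTF defaults
    """
    scores = defaultdict(float)
    for field, terms in doc_fields.items():
        # tf: occurrences of each term within this field of the source document
        for term, tf in Counter(terms).items():
            df = doc_freq.get((field, term), 0)
            # Drop terms that occur in too few or too many documents
            if df <= 0 or df < min_df or df > max_df:
                continue
            idf = math.log(num_docs / df) + 1
            # Total the score for the term across all fields it occurs in
            scores[term] += tf * idf
    # Highest score first; ties broken alphabetically for a stable order
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:max_terms]
```

In this sketch a term that appears in both the title and the subject fields contributes two tf * idf products to a single total, which is why such terms tend to rise to the top of the list.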

The chosen terms are turned into what we call an "Or-Near" query. Each term is searched for in each field of each document, and every match increases the score of the document in which it is found. Documents in which more of the terms appear together in a single field receive an extra boost. In this way, a score is calculated for each matching document, and the 5 top-scoring documents are output as the query results.
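The scoring behavior described above might be approximated as follows. This is a rough sketch, not the Or-Near query as XTF implements it: the text does not give the boost formula, so the multiplier used here (field_boost) is purely an assumption, as are the function name or_near_rank and the in-memory index layout.

```python
def or_near_rank(terms, index, max_docs=5, field_boost=0.5):
    """Score documents against weighted terms and return the best matches.

    terms: {term: weight}, e.g. the top-scoring terms of the source document
    index: {doc id: {field name: list of terms}} (hypothetical in-memory index)
    field_boost: assumed extra weight per additional term matched in one field
    """
    results = []
    for doc_id, fields in index.items():
        score = 0.0
        for field_terms in fields.values():
            present = set(field_terms)
            matched = [t for t in terms if t in present]
            if not matched:
                continue
            base = sum(terms[t] for t in matched)
            # Assumed boost: the more query terms share this field, the larger it gets
            score += base * (1 + field_boost * (len(matched) - 1))
        if score > 0:
            results.append((doc_id, score))
    # Highest score first; ties broken by document id
    results.sort(key=lambda r: (-r[1], r[0]))
    return results[:max_docs]
```

With this scheme, a document whose title contains two of the terms outranks a document that has the same two terms split between its title and subject fields, which mirrors the single-field boost described above.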

Activating Similarity Query

To activate the similar documents query, replace your main query output from the Query Parser with a <moreLike> tag. The XTF Tag Reference has more information on this tag.