More Like This Tag

This tag specifies that XTF should locate documents "similar" to a given document, where many of the similarity parameters can be controlled. This tag has the form:

<moreLike  fields        = "FieldList"
          {boosts        = "BoostFactorList"}
          {minWordLen    = "MinWordLength"}
          {maxWordLen    = "MaxWordLength"}
          {minDocFreq    = "MinDocFrequency"}
          {maxDocFreq    = "MaxDocFrequency"}
          {minTermFreq   = "MinTermFrequency"}
          {termBoost     = "ShouldBoostTerms"}
          {maxQueryTerms = "MaxQueryTerms"}>
 
    DocumentQuery
 
</moreLike>
where
fields="FieldList" is a required attribute naming all of the fields that XTF should search for "interesting" terms. The field names may be separated by spaces, commas, semicolons, or pipe symbols "|". For best performance, keep this list relatively small and concentrate on the fields of most interest to users, such as title, author, and subject. Note that XTF currently does not support using the special field name text to search the full document text for interesting terms; behavior is undefined if you specify it as a field name.
boosts="BoostFactorList" is an optional attribute that specifies exactly one boost factor for each field listed in the fields attribute. Each boost factor should be a non-negative decimal number, and is multiplied into the score of every term from the given field. For example, a boost factor of 0.5 will reduce the score for terms by half, while a factor of 2.0 will double the score. The boost factor is useful for adjusting the weight that each field carries when selecting similar documents. For instance, if one decided that the title should be twice as important as author and subject, the fields attribute might be "title,author,subject" and the boosts attribute would be "2.0,1.0,1.0". If not specified, the boost factor for all fields in the fields list is set to 1.0.
minWordLen="MinWordLength" is an optional attribute that limits the length of terms from the source fields that will be considered for similarity matching. Terms shorter than the specified number of characters will be disregarded. This can speed up processing and improve results by getting rid of useless words. If not specified, this attribute defaults to 4.
maxWordLen="MaxWordLength" is an optional attribute that limits the length of terms from the source fields that will be considered for similarity matching. Terms longer than the specified number of characters will be disregarded. This can speed up processing and improve results by getting rid of useless words. If not specified, this attribute defaults to 12.
minDocFreq="MinDocFrequency" is an optional attribute that helps select which terms from source fields will be considered for similarity matching. In particular, terms that appear in fewer than the specified number of documents will be discarded. This can speed processing and improve results by discarding highly unusual terms. If not specified, this attribute defaults to 2.
maxDocFreq="MaxDocFrequency" is an optional attribute that helps select which terms from source fields will be considered for similarity matching. In particular, terms that appear in more than the specified number of documents will be discarded. This can speed processing and improve results by discarding very common terms. If not specified, this attribute defaults to -1, meaning that there is no limit at all.
minTermFreq="MinTermFrequency" is an optional attribute that helps select which terms from source fields will be considered for similarity matching. In particular, if a term occurs in the original field fewer than the specified number of times, it will be discarded. This can help choose more relevant terms by concentrating on those that are repeated in the field. If not specified, this attribute defaults to 1.
termBoost="ShouldBoostTerms" is an optional attribute that controls whether the similarity engine should calculate and attach a boost factor to each term. This factor will be equal to the score that was calculated for that term, and serves to make more important terms select documents more specifically. In general, it's best to leave this at the default value, which is true.
maxQueryTerms="MaxQueryTerms" is an optional attribute that controls how many "interesting" terms are selected from the original document's fields. Generally, this should be chosen to balance speed (more terms take longer to process) against quality (more terms can produce higher-quality results, up to a point). If not specified, this attribute defaults to 10.
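As an illustration of how these attributes combine, the sketch below weights title twice as heavily as author and subject, widens the accepted word lengths, discards very common terms, and selects more query terms than the default. The numeric values are arbitrary choices for illustration, not recommendations, and the identifier field with the value example-doc-id is a hypothetical way of selecting the source document that depends on how your index is built.

```xml
<!-- Sketch only: the "identifier" field and "example-doc-id" value are hypothetical. -->
<moreLike fields        = "title,author,subject"
          boosts        = "2.0,1.0,1.0"
          minWordLen    = "3"
          maxWordLen    = "20"
          minDocFreq    = "2"
          maxDocFreq    = "500"
          minTermFreq   = "1"
          maxQueryTerms = "25">
  <!-- DocumentQuery: assumed to match exactly one document -->
  <term field="identifier">example-doc-id</term>
</moreLike>
```

Here boosts lists three factors because fields lists three names, and termBoost is omitted so that it keeps its default.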
Within the moreLike tag, DocumentQuery is a normal XTF query that results in a single document. That document's fields will be scanned, and each term will be scored for "interestingness" subject to the attributes above. Those terms that rank highest will be combined into a new <orNear> query, and the results will be documents that are similar to the original document selected by DocumentQuery.
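The smallest useful form supplies only the required fields attribute and a DocumentQuery, leaving every other attribute at the default described above. In this sketch the DocumentQuery is a single term query on a field assumed to hold a unique document identifier; both the identifier field name and the example-doc-id value are hypothetical and should be replaced with whatever uniquely selects a document in your index.

```xml
<moreLike fields="title subject">
  <!-- Defaults in effect: boosts of 1.0 for each field, minWordLen=4,
       maxWordLen=12, minDocFreq=2, maxDocFreq=-1 (no limit),
       minTermFreq=1, termBoost=true, maxQueryTerms=10 -->
  <!-- Hypothetical DocumentQuery that should match a single document -->
  <term field="identifier">example-doc-id</term>
</moreLike>
```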