Interface Summary | |
---|---|
SpellTestCmdLine.SuggTester | Generic strategy for testing spelling suggestion algorithms |
WordEquiv | Used to establish whether a potential spelling suggestion is simply "equivalent" to the original word, and should thus be skipped. |
Class Summary | |
---|---|
DoubleMetaphone | Encodes a string into a double metaphone value. |
FreqData | A fast, simple, in-memory data structure for holding frequency data used to produce spelling suggestions. |
LuceneIndexToDict | Utility class to convert the stored fields of a Lucene index into a spelling dictionary. |
MinimalAnalyzer | Performs minimal token processing, without case conversion. |
QuerySpeller | Handles spelling correction for simple queries produced by the Lucene QueryParser.
SimpleQueryRewriter | Traverses and rewrites simple Lucene queries. |
SpellReader | Reads a spelling dictionary created by SpellWriter, and provides fast single- and multi-word spelling suggestions.
SpellReader.WordQueue | Queue of words, ordered by score and then frequency |
SpellTestCmdLine | A command-line driver class to test out the spelling correction engine. |
SpellTestCmdLine.DictBuilder | Common interface for various dictionary-building algorithms |
SpellTestCmdLine.SpeltDictBuilder | Builds a new-style Spelt spelling dictionary |
SpellTestCmdLine.SpeltSuggTester | Get spelling suggestions using the Spelt (new) algorithm |
SpellTestCmdLine.TextRipper | Scans a directory for files, and rips text from all of them. |
SpellWriter | Writes spelling dictionaries, which can later be used by SpellReader to obtain spelling suggestions.
SpellWritingAnalyzer | Drop-in replacement for the Lucene StandardAnalyzer, which performs all the same functions plus queues words to a spelling dictionary.
SpellWritingFilter | A simple Lucene token filter that adds words to a spelling correction dictionary as they're being indexed.
TRStringDistance2 | Calculates the edit distance between two strings, with special modifications to score transpositions and double-letter changes as lower cost than insertion/deletion/replacement. |
This package provides a facility for creating a spelling correction dictionary, and for generating spelling suggestions from it.
To build a spelling correction dictionary, you may use any of the following methods:
Easiest: If you already have a Lucene index, you may simply use LuceneIndexToDict to re-analyze the contents of your index and create a dictionary. Note that only stored fields are added to the dictionary. Overall this method is inefficient since fields have to be tokenized twice, but it's an easy way to start. This class even includes a command-line driver, used like this:
```
java ... org.apache.lucene.spelt.LuceneIndexToDict <luceneIndexDir> <targetDictDir>
```
Once you've built the dictionary, there are a couple of ways to get spelling suggestions from it:
Everybody likes Google's "did you mean" suggestions. Users often misspell words when they're querying a Lucene index, and it would be nice if the system could catch the most obvious errors and automatically suggest an appropriate spelling correction. We did extensive work to create a fast and accurate facility for doing this, involving minimal work for those deploying Lucene.
We call this spelling correction system "Spelt". In the following sections, we'll discuss the guts of Spelt's spelling correction system, detailing some strategies that were considered and the final strategy selected, the methods that Spelt uses to come up with high-quality suggestions, and how the dictionary is created and stored.
Choosing an Index-based Strategy
We considered three strategies for spelling correction, each deriving suggestions from a different kind of source data.
Because of the issues identified with the other strategies, we opted to pursue the index-based approach. We feel it is best for most collections in most situations, as it adapts to the documents most germane to the application and users, and doesn't require a long query history to become effective.
We set ourselves a goal of getting the correct suggestion in the number 1 position 80% of the time, which seemed a good threshold for the system to be truly useful. With several iterations and many tweaks to the algorithm, we were able to achieve this goal for our sample data set and our sample queries (drawn from real query logs). We have a fair degree of confidence that the algorithm is quite general and should be applicable to many other data sets and types of queries.
A typical implementation of Spelt would build a spelling correction dictionary at index-time. If a query results in only a small number of hits (the threshold is configurable), the system calls Spelt to consult the dictionary and make an alternative spelling suggestion. Here is a brief outline of the algorithm for making a suggestion:
Most spelling algorithms, including this one, rely on a sort of "shotgun" approach: for each potentially misspelled word in the query, they make a long list of "candidate" words, that is, words that could be suggested as a correction for the original misspelled word. Then the list of candidates is ranked using some sort of scoring formula, and finally the top-ranked candidate is presented to the user.
One might naively attempt to scan every word in the dictionary as a candidate. Unfortunately, the cost of scoring each one becomes prohibitive when the dictionary grows beyond about ten thousand words. So a strategy is needed to quickly come up with a list of a few hundred pretty good candidates. Spelt uses a novel approach that gives good speed and accuracy.
We began with a base of existing Java spelling correction code that had been contributed to the Lucene project (written by Nicolas Maisonneuve, based on code originally contributed by David Spencer). The base Lucene algorithm first breaks up the word we're looking for into 2, 3, or 4-character "n-grams" (for instance, the word primer might end up as: ~pri prim rime imer mer~ ). Next, it performs a Lucene OR query on the dictionary (also built with n-grams), retaining the top 100 hits (where a hit represents a correctly spelled word that shares some n-grams with the target word). Finally, it ranks the suggestions according to their "edit distance" to the original misspelled word. Those that are closest appear at the top. ("Edit distance" is a standard numerical measure of how far two words are from each other and is defined as the number of insert, replace, and delete operations needed to transform one word into the other.)
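The n-gram split itself is easy to reproduce. Below is a minimal sketch (not the contributed code) that pads the word with "~" boundary markers and emits fixed-size grams; the exact gram sizes and marker characters in the Lucene contribution may differ slightly.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of n-gram splitting, roughly as the base Lucene spell
 *  checker does it; gram sizes and boundary markers may differ in the
 *  actual contributed code. */
public class NGramSketch
{
  /** Pad the word with '~' markers and emit every n-character window. */
  static List<String> grams(String word, int n) {
    String padded = "~" + word + "~";
    List<String> out = new ArrayList<String>();
    for (int i = 0; i + n <= padded.length(); i++)
      out.add(padded.substring(i, i + n));
    return out;
  }

  public static void main(String[] args) {
    // For "primer" with n = 4: [~pri, prim, rime, imer, mer~]
    System.out.println(grams("primer", 4));
  }
}
```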
Unfortunately, the base method was quite slow, and often didn't find the correct word, especially in the case of short words. However, we made one critical observation: In perhaps 85-90% of the cases we examined, the correct word had an edit distance of 1 or 2 from the misspelled query word; another 5-10% had an edit distance of 3, and the extra edit was usually toward the end of the word. This observation intuitively rings true: the suggested word should be relatively "close" to the query word and those that are "far" away needn't even be considered.
Still, one wouldn't want to enumerate all possible one- and two-letter edits, since checking whether each of them actually appears in the dictionary would take too long. Instead, Spelt checks all words in which four of the first six characters match in order. This effectively checks for an edit distance of two or less at the start of the word.
Take for example the word GLOBALISM. Here are the 15 keys that can be created by deleting two of the first six characters (GLOBAL): OBAL, LBAL, LOAL, LOBL, LOBA, GBAL, GOAL, GOBL, GOBA, GLAL, GLBL, GLBA, GLOL, GLOA, GLOB.
So, Spelt checks each of the 15 possible 4-letter keys for a given query word, and makes a merged list of all the words that share those same keys. This combined list is usually only a few hundred words long, and almost always contains within it the golden "correct" word we're looking for.
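The key generation is a small combinatorial exercise: choose which two of the first six characters to delete, giving (6 choose 2) = 15 four-letter keys. Here is a brief sketch of that step (an illustration, not Spelt's actual code); handling of words shorter than six characters is glossed over.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustration of Spelt's candidate keys: delete two of the first six
 *  characters, producing up to (6 choose 2) = 15 four-letter keys.
 *  Words shorter than six characters would need special handling. */
public class EditKeySketch
{
  static Set<String> keys(String word) {
    String prefix = word.length() > 6 ? word.substring(0, 6) : word;
    Set<String> keys = new LinkedHashSet<String>();
    for (int i = 0; i < prefix.length(); i++) {
      for (int j = i + 1; j < prefix.length(); j++) {
        StringBuilder key = new StringBuilder(prefix);
        key.deleteCharAt(j); // delete the later position first so i stays valid
        key.deleteCharAt(i);
        keys.add(key.toString());
      }
    }
    return keys;
  }

  public static void main(String[] args) {
    // Prints the 15 keys listed above for GLOBALISM (first six chars "GLOBAL").
    System.out.println(keys("GLOBALISM"));
  }
}
```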
Given a list of candidate words, how does one find the "right" suggestion? This is the heart of most spelling algorithms and the area that needs the most "tweaking" to achieve good results. Spelt's ranking algorithm is no exception, and makes decisions by assigning a score to each candidate word. The score is a sum of the following factors:
The original query word itself is always considered as one of the candidates. This is to reduce the problem of suggesting replacements for correctly spelled words. However, the score of the query word is reduced slightly in case a more common word is found that is very similar.
In summary, for a single-word query, the list of all candidates is ranked by score, and the one with the highest score wins and is considered the "best" suggestion for the user.
But what about queries with multiple words? Testing showed that considering each word of a phrase independently and concatenating the result got plenty of important cases wrong. For instance, consider the query untied states. Actually, each of these words is spelled correctly, but it's clear the user probably meant to query for united states. Also, consider the word harrypotter... the best single-word replacement might be hairy, but that's not what the user meant. How do we go beyond single-word suggestions?
We need to know more than just the frequency of the words in the index; we need to know how often they occur together. So when Spelt builds the spelling dictionary, it additionally tracks the frequency of each pair of words as they occur in sequence.
Using the pair frequency data, we can take a more sophisticated approach to multi-word correction. Specifically, Spelt tries the following additional strategies to see if they yield better suggestions:
Despite all the above efforts, sometimes the spelling system makes bad suggestions. A couple of methods are used to minimize these.
First, a filter is applied to avoid suggesting "equivalent" words. In Spelt, the indexer performs several mapping functions, such as converting plural words to singular words, and mapping accented characters to their unaccented equivalents. It would be silly for the spelling system to suggest cat if the user entered cats, even if cat would normally be considered a better suggestion because it has higher frequency. The final suggestion is checked, and if it's equivalent (i.e. maps to the same words) to the user's initial query, no suggestion is made.
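A minimal sketch of this kind of equivalence filter appears below. The normalization rules here (lowercasing, accent stripping, a crude plural-to-singular mapping) are simplified assumptions standing in for the mappings Spelt's indexer actually applies; in the library itself this check is expressed through the WordEquiv interface.

```java
import java.text.Normalizer;

/** Simplified equivalence filter: two words are "equivalent" if they map to
 *  the same normalized form. The rules below are illustrative stand-ins for
 *  Spelt's real index mappings. */
public class EquivSketch
{
  static String normalize(String word) {
    // Lowercase, decompose accented characters, and drop the combining marks.
    String s = Normalizer.normalize(word.toLowerCase(), Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    // Crude plural-to-singular mapping, for illustration only.
    if (s.endsWith("s") && s.length() > 3)
      s = s.substring(0, s.length() - 1);
    return s;
  }

  /** True if suggesting 'suggestion' for 'original' would be pointless. */
  static boolean isEquivalent(String original, String suggestion) {
    return normalize(original).equals(normalize(suggestion));
  }

  public static void main(String[] args) {
    System.out.println(isEquivalent("cats", "cat"));      // true: suppress
    System.out.println(isEquivalent("untied", "united")); // false: keep
  }
}
```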
Second, it's quite possible for the spelling correction system to make a suggestion that yields fewer results than the user's original query. While this isn't common, it happens often enough to be annoying. So after Spelt comes up with a suggestion, it's good practice for the system to run the modified query and, if fewer results are obtained, suppress the suggestion. Of course, since Spelt simply supplies the suggestion, re-running the query is the job of the code that calls Spelt.
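The calling code's side of this check might look roughly like the sketch below. It assumes a Lucene 2.x/3.x-era API where TopDocs.totalHits is an int, and it does not show how the suggested query is produced (e.g. via QuerySpeller).

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

/** Sketch of the caller's responsibility described above: only surface a
 *  suggestion if it actually finds more hits than the original query. */
public class SuggestionGate
{
  /** Returns the suggested query if it improves on the original, else null. */
  static Query checkSuggestion(IndexSearcher searcher, Query original,
                               Query suggested) throws IOException
  {
    if (suggested == null)
      return null;
    int origHits = searcher.search(original, 1).totalHits;
    int suggHits = searcher.search(suggested, 1).totalHits;
    // Suppress the suggestion if it would give the user fewer results.
    return (suggHits > origHits) ? suggested : null;
  }
}
```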
Now we turn to the dictionary used by the algorithm above to produce suggestions. How is it created during the Lucene indexing process? Let's find out.
One of the best features of Lucene is incremental indexing, the ability to quickly add a few documents to an existing index. So we needed an incremental spelling dictionary build process to stay true to the indexer's design. To do this we split the work into two phases: a low-overhead collection pass added to the main indexing process, followed by an intensive, heavily optimized dictionary-generation phase.
During the main index run, Spelt simply collects information on which words are present, their frequencies, and the frequency of pairs. Data is collected in a fairly small RAM cache and periodically sorted and written to disk. Two filters ensure that we avoid accumulating counts for rare words and rare word pairs: words that occur only once or twice are disregarded (though this limit is configurable); likewise, pairs that occur only once or twice are not written to disk. Collection adds minimal CPU overhead to indexing.
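A much-simplified sketch of this collection pass is shown below: count words and adjacent word pairs in RAM, then discard anything rarer than a configurable threshold. The real SpellWriter also bounds the cache and periodically sorts and flushes it to disk; those details are omitted here.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Simplified collection pass: count words and adjacent pairs, then prune
 *  rare entries before they are queued to disk. */
public class CollectSketch
{
  final Map<String, Integer> wordCounts = new HashMap<String, Integer>();
  final Map<String, Integer> pairCounts = new HashMap<String, Integer>();
  final int minFreq = 3; // e.g. ignore words/pairs seen only once or twice

  /** Feed one field's worth of tokens, in document order. */
  void addTokens(String[] tokens) {
    for (int i = 0; i < tokens.length; i++) {
      bump(wordCounts, tokens[i]);
      if (i > 0)
        bump(pairCounts, tokens[i - 1] + "|" + tokens[i]);
    }
  }

  static void bump(Map<String, Integer> map, String key) {
    Integer n = map.get(key);
    map.put(key, (n == null) ? 1 : n + 1);
  }

  /** Discard rare words or pairs before writing counts out. */
  void prune(Map<String, Integer> map) {
    Iterator<Integer> it = map.values().iterator();
    while (it.hasNext())
      if (it.next() < minFreq)
        it.remove();
  }
}
```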
Then comes the dictionary creation phase, which processes the queued word and pair counts to form the final dictionary (this can optionally be delayed until another index run). Here are the processing steps:
The data structures used to store the dictionary are motivated by the needs of the spelling correction algorithm. In particular, it needs the following information to make good suggestions:
Edit Map. Since at most 15 keys need to be read for a given input word, this data structure is mainly disk-based (it's never read entirely into RAM.) The disk file consists of one line per 4-letter key, listing all the words that share that key. At the end of the file is an index giving the length (in bytes) for each key, so that the correction engine can quickly and randomly access the entries.
The words in each list are prefix-compressed to conserve disk space. This is a way of compressing a list of words when many of them share prefixes. We always store the first word in its entirety; each word after that is stored as the number of characters it has in common with the previous word, plus the characters it doesn't share. For long lists of similar words, the compression becomes quite significant. The sketch below illustrates the scheme on a small example list.
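For instance, assuming a hypothetical list of three words sharing the prefix "global" (not the document's original example):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of the prefix compression described above: each word after the
 *  first is stored as the count of leading characters shared with the
 *  previous word, followed by the unshared remainder. */
public class PrefixCompressSketch
{
  static List<String> compress(List<String> words) {
    List<String> out = new ArrayList<String>();
    String prev = null;
    for (String word : words) {
      if (prev == null)
        out.add(word); // first word is stored in its entirety
      else {
        int shared = 0;
        int max = Math.min(prev.length(), word.length());
        while (shared < max && prev.charAt(shared) == word.charAt(shared))
          shared++;
        out.add(shared + word.substring(shared));
      }
      prev = word;
    }
    return out;
  }

  public static void main(String[] args) {
    // [global, globalism, globalization] compresses to [global, 6ism, 7zation]
    System.out.println(compress(Arrays.asList(
        "global", "globalism", "globalization")));
  }
}
```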
Here are some lines from a real edit map file, with the keys in bold:
Word Frequency Table. On disk this is stored as a simple text file with one line per word giving its frequency. The lines are sorted in ascending word order. This structure is read completely into RAM by the correction engine, as we need to potentially evaluate tens of thousands of candidate words per second in a high-volume application.
Here are some lines from a real word frequency file:
Pair Frequency Table. The correction engine needs to check the frequency of hundreds of thousands of word pairs per second. This implies a need for extremely fast access, so we need to pull the entire data structure into RAM and search it very quickly.
The table exists only in binary (rather than text) form, in a very simple structure. A "hash code" is computed for each pair of words. The hash code is a 64-bit integer that does a good job of characterizing the pair; for two different pairs, the chance of getting the same hash code is vanishingly small. The structure consists of a large array of all the pairs' hash codes, sorted numerically, plus a frequency count per pair. This sorted data is amenable to a fast in-memory binary search.
Here's the disk layout:
# bytes | Description |
---|---|
8 | Magic number (identifies this as a pair frequency file) |
4 | Total number of pairs in the file |
8 | Hash code of pair 1 |
4 | ... and frequency of pair 1 |
8 | Hash code of pair 2 |
4 | ... and frequency of pair 2 |
8 | Hash code of pair 3 |
... | etc. |
As you can see, we store each pair with exactly 12 bytes: 8 bytes for the 64-bit hash code, and 4 bytes for the 32-bit count. Working with fixed-size chunks makes the code simple, and also keeps the pair data file (and corresponding RAM footprint) relatively small.
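An in-memory lookup over this structure can be sketched as follows. The sorted long[] of hash codes and the parallel int[] of counts mirror the file layout above; the 64-bit hash shown (FNV-1a over the concatenated pair) is an illustrative stand-in, not necessarily the hash Spelt actually uses.

```java
import java.util.Arrays;

/** Sketch of the in-memory pair frequency table: a sorted array of 64-bit
 *  pair hash codes with a parallel array of 32-bit counts, searched by
 *  binary search. */
public class PairFreqSketch
{
  final long[] hashes; // sorted ascending, as read from the pair file
  final int[]  freqs;  // freqs[i] is the count for hashes[i]

  PairFreqSketch(long[] sortedHashes, int[] freqs) {
    this.hashes = sortedHashes;
    this.freqs = freqs;
  }

  /** Stand-in 64-bit hash combining the two words (FNV-1a style). */
  static long hashPair(String first, String second) {
    long h = 0xcbf29ce484222325L;
    String key = first + "|" + second;
    for (int i = 0; i < key.length(); i++) {
      h ^= key.charAt(i);
      h *= 0x100000001b3L;
    }
    return h;
  }

  /** Frequency of the pair, or 0 if it was never recorded. */
  int pairFreq(String first, String second) {
    int pos = Arrays.binarySearch(hashes, hashPair(first, second));
    return (pos >= 0) ? freqs[pos] : 0;
  }
}
```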