Methods and apparatus for performing spelling corrections using one or more variant hash tables

Description

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart illustrating the overall flow of an exemplary distance one spelling correction algorithm;

FIG. 2 is a flow chart illustrating an exemplary process of testing variants of the candidate word against hash tables derived from the dictionary for distance one misspellings in accordance with the present invention;

FIG. 3 is a flow chart illustrating the overall flow of the distance two spelling correction algorithm;

FIG. 4 is a flow chart illustrating the process of testing variants of the candidate word against hash tables derived from the dictionary for distance two misspellings;

FIG. 5 is a flow chart illustrating the overall flow of the “soft” distance two spelling correction algorithm; and

FIG. 6 describes the process of testing variants of the candidate word against hash tables derived from the dictionary for “soft” distance two misspellings.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The present invention provides improved techniques for real-time spelling correction of a term against a dictionary of valid words (including all word forms). The dictionary can be multi-lingual, i.e., it can be composed of multiple single language dictionaries. It can also be comprised of such things as sequences of nucleotides in biology, or any collection of valid “words” consisting of letters from a pre-established “alphabet.” While the dictionary size and alphabet size are presumed to be large, their actual size is unimportant, and average/maximum word length is assumed to be relatively small, i.e., many orders of magnitude smaller than the dictionary size.

For a dictionary size, D, alphabet size, A, and a maximum word length, W, the disclosed algorithm corrects distance one misspellings in O(W) time and distance two misspellings in O(W²) time. The required storage is O(D), or in the case W varies with D, equal to O(D*W), for distance one misspellings and O(D*W²) for distance two misspellings. In general, it is assumed that W is more or less constant and does not grow with D so that O(D) equals O(D*W) or O(D*W²).

According to a further aspect of the invention a soft algorithm is disclosed that uses a “soft” definition of distance two misspellings, where distance two spelling correction can be performed in O(W) time and O(D)=O(D*W) storage. “Soft” distance two means that only the following distance two errors are considered: double transposition, transposition-deletion, transposition-insertion, deletion-transposition, deletion-insertion, insertion-transposition, and insertion-deletion.

FIG. 1 is a flow chart illustrating the overall flow of an exemplary distance one spelling correction algorithm 100. In some instances of the spelling correction problem, it is adequate to detect only distance one spelling errors; furthermore, distance one detection is the first step in the various distance two correction algorithms. As used herein, a wild card is an arbitrary symbol that is assumed not to appear in any dictionary word.

In the following discussion, the verb “to hash” or any of its grammatical variants refers to the act of placing something in a hash table. For example, the phrase “hashing all dictionary words” means placing all dictionary words in a hash table. Uses of hash tables and performance guarantees for simple hash table operations such as insertion and lookup are described in any standard reference on algorithms. See, for example, T. Cormen et al., Introduction to Algorithms, MIT Press (2001).

The method involves hashing all dictionary words, in a known manner, and all “replacements” of dictionary words, in accordance with the present invention. Replacements are hashed, using, for example, an asterisk ‘*’ as a wild card, as follows. If the dictionary word is COAT, then the following variants are hashed: *OAT, C*AT, CO*T and COA*. In general, if a word is of length W, then W such word variants are hashed. The (key, value) pairs are (*OAT, COAT), (C*AT, COAT), (CO*T, COAT), and (COA*, COAT). Separate hash tables are kept for the words (i.e., the dictionary) and for the replacement variants. These hash tables are assumed to be pre-created prior to when the distance one spelling corrector starts up (Step 110).
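The table construction described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function names are hypothetical, and storing a set of words per key (so that COAT and CHAT can share the key C*AT) is an assumption, since the text only states that the value is the dictionary word.

```python
from collections import defaultdict

WILDCARD = "*"  # assumed never to occur in a dictionary word


def replacement_variants(word):
    """Return the len(word) variants with one character replaced by the wild card."""
    return [word[:i] + WILDCARD + word[i + 1:] for i in range(len(word))]


def build_distance_one_tables(words):
    """Build the dictionary hash and the replacement hash.

    The replacement hash maps each wild card key to the set of dictionary
    words that produce it (the set is an assumption of this sketch).
    """
    dictionary = set(words)
    replacement_hash = defaultdict(set)
    for word in dictionary:
        for key in replacement_variants(word):
            replacement_hash[key].add(word)
    return dictionary, replacement_hash


dictionary, replacement_hash = build_distance_one_tables(["COAT", "CHAT", "OATH"])
print(replacement_variants("COAT"))      # ['*OAT', 'C*AT', 'CO*T', 'COA*']
print(sorted(replacement_hash["C*AT"]))  # ['CHAT', 'COAT']
```

Each word of length W contributes W keys, so the replacement hash holds at most D*W entries.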

In response to obtaining the input candidate word (Step 120), say in this case the term is WXYZ, one first checks the word against the direct dictionary hash (Step 130). One then gets to the decision point 140. If a match is found in the dictionary hash, then the word is spelled correctly, and the program terminates indicating the correct spelling, as in Step 150. If, however, no match is found, a misspelling is assumed and one checks all distance one variants against the appropriate distance one hash tables, accumulating suggested spelling corrections using the process 200, discussed further below in conjunction with FIG. 2. Finally, in Step 160 the suggested corrections are output.

FIG. 2 is a flow chart illustrating an exemplary process 200 of testing variants of the candidate word against hash tables derived from the dictionary for distance one misspellings in accordance with the present invention. Upon starting and obtaining the candidate word (Step 210), one first generates all transpositions of adjacent characters, and single character deletions of the candidate word (Step 220). For the candidate word WXYZ, the transpositions would be XWYZ, WYXZ, and WXZY. The deletions would be XYZ, WYZ, WXZ, and WXY. These are each checked against the dictionary hash in Step 230 and any matches are accumulated. The transposition checking will undo a misspelling of the same kind since the inverse of a transposition is the same transposition and the deletion checking will undo a corresponding insertion. The next step is to generate all single character replacements and insertions (Step 240) and test these against the replacement hash (Step 250). Replacements in this case are *XYZ, W*YZ, WX*Z, and WXY*. Insertions are *WXYZ, W*XYZ, WX*YZ, WXY*Z, and WXYZ*. Replacements catch distance one replacement errors, and insertions catch distance one deletion errors. As usual, the final step is to output all hash table matches (Step 260). The total effort expended is 4W hash lookups which is O(W), and the memory used for storage of the hash is O(D)=O(D*W).
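A minimal sketch of process 200, under stated assumptions: the helper names are hypothetical, the sample dictionary is invented for illustration, and each replacement hash key maps to a set of words. The step numbers in the comments refer to FIG. 2 as described above.

```python
from collections import defaultdict

WILDCARD = "*"  # assumed never to occur in a dictionary word


def transpositions(w):
    """Swap each pair of adjacent characters: W - 1 variants."""
    return [w[:i] + w[i + 1] + w[i] + w[i + 2:] for i in range(len(w) - 1)]


def deletions(w):
    """Delete each character in turn: W variants."""
    return [w[:i] + w[i + 1:] for i in range(len(w))]


def wild_replacements(w):
    """Replace each character by the wild card: W variants."""
    return [w[:i] + WILDCARD + w[i + 1:] for i in range(len(w))]


def wild_insertions(w):
    """Insert the wild card at each position: W + 1 variants."""
    return [w[:i] + WILDCARD + w[i:] for i in range(len(w) + 1)]


def build_tables(words):
    dictionary = set(words)
    replacement_hash = defaultdict(set)
    for word in dictionary:
        for key in wild_replacements(word):
            replacement_hash[key].add(word)
    return dictionary, replacement_hash


def distance_one_corrections(candidate, dictionary, replacement_hash):
    matches = set()
    # Steps 220/230: transpositions and deletions against the dictionary hash.
    for variant in transpositions(candidate) + deletions(candidate):
        if variant in dictionary:
            matches.add(variant)
    # Steps 240/250: wild card replacements and insertions against the replacement hash.
    for variant in wild_replacements(candidate) + wild_insertions(candidate):
        matches |= replacement_hash.get(variant, set())
    return matches


dictionary, replacement_hash = build_tables(["COAT", "OATH", "CAT", "COST"])
print(sorted(distance_one_corrections("DOAT", dictionary, replacement_hash)))  # ['COAT']
print(sorted(distance_one_corrections("COT", dictionary, replacement_hash)))   # ['CAT', 'COAT', 'COST']
```

For a candidate of length W the four generators yield (W - 1) + W + W + (W + 1) = 4W variants, one hash lookup each.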

This algorithm affords no false positives. In other words, the algorithm never suggests a spelling correction that is more than distance one from the original word. On the other hand, if one were to just hash the dictionary together with all ordered subsequences of dictionary words of length W−1 as in Greene et al., “Multi-Index Hashing for Information Retrieval,” 35th Annual Symposium on Foundations of Computer Science, 722-731 (1994), and do a corresponding lookup, one would run into false positives. For example, for both the dictionary words COAT and OATH the ordered subsequence OAT would be hashed, and both would be a suggested distance one correction in response to the query “DOAT,” despite the fact that OATH is not distance one from DOAT.
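The false positive in the DOAT example can be reproduced with a small sketch. It implements only the naive scheme the paragraph describes (hash every length W−1 ordered subsequence, then look up the same subsequences of the query); it is not presented as the method of the cited reference, and the function names are hypothetical.

```python
from collections import defaultdict


def deletions(w):
    """All ordered subsequences of length len(w) - 1."""
    return {w[:i] + w[i + 1:] for i in range(len(w))}


def build_subsequence_hash(words):
    table = defaultdict(set)
    for word in words:
        for key in deletions(word):
            table[key].add(word)
    return table


def naive_suggestions(query, table):
    """Look up every length W-1 subsequence of the query."""
    found = set()
    for key in deletions(query):
        found |= table.get(key, set())
    return found


table = build_subsequence_hash(["COAT", "OATH"])
print(sorted(naive_suggestions("DOAT", table)))  # ['COAT', 'OATH']
```

Both words are returned because they share the key OAT, even though OATH is not at distance one from DOAT.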

FIG. 3 is a flow chart illustrating the overall flow of the distance two spelling correction algorithm 300. The method 300 involves the utilization of certain hash tables that are assumed to be pre-created. The following hash tables are needed: a transposition (t) hash, a deletion (d) hash, a transposition-replacement (tr) hash, a deletion-transposition (dt) hash, a double deletion (dd) hash, a deletion-replacement (dr) hash, and an insertion-replacement (ir) hash. Only a special form of the deletion-transposition hash is required, namely, those deletions followed by transpositions that first delete a character, and then transpose the characters initially surrounding the deleted character, as discussed further below. Each of these hash tables contains keys that correspond to certain variant forms of each dictionary word. The contents of each hash are again illustrated by considering the sample dictionary word COAT. Although the contents of the hash table are (key, value) pairs, in all cases for the dictionary word COAT, value = COAT, so only the keys are shown.

Transposition (t) hash: OCAT, CAOT, COTA

Deletion (d) hash: OAT, CAT, COT, COA

Transposition-replacement (tr) hash: *CAT, O*AT, OC*T, OCA*, *AOT, C*OT, CA*T, CAO*, *OTA, C*TA, CO*A, COT*

Special deletion-transposition hash: ACT, CTO

Double deletion hash: CO, CA, CT, OA, OT, AT

Deletion-replacement hash: *AT, O*T, OA*, C*T, CA*, *OT, CO*, *OA, C*A

Insertion-replacement hash: **OAT, *C*AT, *CO*T, *COA*, C**AT, C*O*T, C*OA*, *O*AT, CO**T, CO*A*, *OA*T, C*A*T, COA**, *OAT*, C*AT*, CO*T*

These hash tables require, in total O(D*W²) storage.
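The key lists above can be regenerated with a short sketch. The function names are hypothetical, and one detail is an assumption inferred from the listed insertion-replacement keys: a replacement never overwrites a wild card that an earlier insertion placed.

```python
from itertools import combinations

WILDCARD = "*"  # assumed never to occur in a dictionary word


def transpositions(w):
    return {w[:i] + w[i + 1] + w[i] + w[i + 2:] for i in range(len(w) - 1)}


def deletions(w):
    return {w[:i] + w[i + 1:] for i in range(len(w))}


def double_deletions(w):
    return {"".join(c for k, c in enumerate(w) if k not in (i, j))
            for i, j in combinations(range(len(w)), 2)}


def wild_insertions(w):
    return {w[:i] + WILDCARD + w[i:] for i in range(len(w) + 1)}


def wild_replacements(w):
    # Skip positions that already hold the wild card (assumption, see lead-in).
    return {w[:i] + WILDCARD + w[i + 1:] for i in range(len(w)) if w[i] != WILDCARD}


def special_deletion_transpositions(w):
    # Delete w[i], then swap the two characters that surrounded it.
    return {w[:i - 1] + w[i + 1] + w[i - 1] + w[i + 2:] for i in range(1, len(w) - 1)}


def variant_keys(word):
    """Keys contributed by one dictionary word to each distance two hash table."""
    t, d = transpositions(word), deletions(word)
    return {
        "t": t,
        "d": d,
        "tr": {r for v in t for r in wild_replacements(v)},
        "dt": special_deletion_transpositions(word),
        "dd": double_deletions(word),
        "dr": {r for v in d for r in wild_replacements(v)},
        "ir": {r for v in wild_insertions(word) for r in wild_replacements(v)},
    }


for name, keys in variant_keys("COAT").items():
    print(name, len(keys), sorted(keys))
```

For COAT this yields 3, 4, 12, 2, 6, 9 and 16 keys for the t, d, tr, dt, dd, dr and ir tables respectively, matching the lists above.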

Referring again to FIG. 3, with the above hash tables pre-created, a test is performed for a distance one misspelling (as discussed above in conjunction with FIG. 1). As noted earlier, this takes O(W) time and requires storage that is O(D)=O(D*W). If the distance one misspelling routine indicates that the candidate word is spelled correctly, the algorithm terminates, also indicating a correct spelling (Step 320). Otherwise, it checks to see if enough suggested corrections have been accrued in testing for a distance one correction (Step 330). If enough have been detected it outputs the suggested corrections and terminates (Step 340), otherwise it goes through the process of testing distance two variants of the candidate word against the distance two hash tables (Step 400), and only then outputs suggested corrections and terminates (Step 340).
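The control flow of FIG. 3 can be sketched as below. The function name, the callable parameters, and the numeric threshold `enough` are all hypothetical; the text does not say how many suggestions count as "enough." The two lookup routines are passed in as callables so that the sketch stays independent of how the hash tables are built.

```python
def distance_two_flow(candidate, dictionary, distance_one, distance_two, enough=3):
    """Return (is_correct, suggestions) following the flow of FIG. 3."""
    # Correctly spelled word: terminate immediately (Step 320).
    if candidate in dictionary:
        return True, []
    # Distance one suggestions are always gathered first.
    suggestions = set(distance_one(candidate))
    # Step 330: fall through to distance two only if too few were found.
    if len(suggestions) < enough:
        suggestions |= set(distance_two(candidate))  # Step 400
    # Step 340: output whatever has been accumulated.
    return False, sorted(suggestions)


# Stub lookups standing in for the real hash table tests.
d1 = lambda word: {"COAT"}
d2 = lambda word: {"CHAT", "COST"}

print(distance_two_flow("COAT", {"COAT"}, d1, d2))            # (True, [])
print(distance_two_flow("DOAT", {"COAT"}, d1, d2, enough=1))  # (False, ['COAT'])
print(distance_two_flow("DOAT", {"COAT"}, d1, d2, enough=2))  # (False, ['CHAT', 'COAT', 'COST'])
```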

FIG. 4 is a flow chart illustrating the process 400 of testing variants of the candidate word against hash tables derived from the dictionary for distance two misspellings. If a transposition is denoted by t, a deletion by d, an insertion by i, and a replacement by r, the following misspellings are possible: tt, td, ti, tr, dt, dd, di, dr, it, id, ii, ir, rt, rd, ri, rr.

The following table lists the misspelling type, the action, and the hash table checked for each of the 16 possible distance two misspellings. Note that the possible outcomes of two successive misspellings xy, where x and y are elements of {t,d,i,r}, are the same as those of the successive misspellings yx, except in the single case where td ≠ dt. For example, on the one hand, starting with the word COAT one can reach CTO via a deletion followed by a transposition, but not via a transposition followed by a deletion; on the other hand, starting from COAT one can reach OAT via a transposition followed by a deletion but not vice versa. Note that there is also an asymmetry between it and ti, where, for example (again from the word COAT), the CO*AT variant is not obtainable from ti and the ti variant CA*OT is not obtainable from it. However, the first of these variants is caught in a distance one simple insertion check, so it can be disregarded, and the second variant is caught just like all other it or ti variants by the d Test Action against the t hash table. Only in the two cases of td and dt are two separate actions followed by hash table checks required. The dt hash is a special hash: it does not need to store all deletions followed by transpositions, since most of these will be caught by the t test action against the d hash table. The exceptional cases are those where one first deletes a character and then transposes the characters that were originally around the deleted character. Only these O(W) deletion-transpositions need to be stored in the dt hash table.

TABLE 1

Misspelling Type    Test Action    Hash table checked
tt                  t              t
td                  t              d
                    (none)         d
ti                  d              t
tr                  r              tr
dt                  t              d
                    (none)         dt
dd                  (none)         dd
di                  d              d
dr                  r              dr
it                  d              t
id                  d              d
ii                  dd             dictionary
ir                  r              ir
rt                  r              tr
rd                  r              dr
ri                  r              ir
rr                  dd             dd
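The difference between td and dt can be checked by enumerating both compositions for the word COAT. This is an illustrative sketch with hypothetical helper names; it applies the operations to the dictionary word and compares the two result sets.

```python
def transpositions(w):
    return {w[:i] + w[i + 1] + w[i] + w[i + 2:] for i in range(len(w) - 1)}


def deletions(w):
    return {w[:i] + w[i + 1:] for i in range(len(w))}


def compose(first, second, word):
    """Apply `first`, then `second`, collecting every reachable string."""
    return {out for mid in first(word) for out in second(mid)}


td = compose(transpositions, deletions, "COAT")  # transposition, then deletion
dt = compose(deletions, transpositions, "COAT")  # deletion, then transposition

print("dt only:", sorted(dt - td))  # dt only: ['ACT', 'CTO']
print("td only:", sorted(td - dt))  # td only: ['CAT', 'COA', 'COT', 'OAT']
```

The strings reachable only by dt are exactly the keys of the special deletion-transposition hash for COAT, and those reachable only by td are the plain single deletions, which is consistent with the two rows of Table 1 whose test action is "(none)".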

Returning to FIG. 4, after starting (Step 405) and obtaining the candidate word, one tests the candidate word against the deletion, deletion-transposition and double deletion hash tables, and accumulates matches in Step 410, as can be verified by checking Table 1. Next, one generates all single move transpositions of the candidate word in Step 415 and tests these variants against the transposition and deletion hashes in Step 420. Next, in Step 425 one generates all single character deletion variants and in Step 430 tests these against the transposition and deletion hashes. Note that, for brevity of explanation, the test actions of Table 1 are grouped so that each step applies one kind of variant to all of the hash tables it must be checked against. Next, in Step 435 one generates all single character replacement variants of the candidate word and in Step 440 tests these against the transposition-replacement, deletion-replacement, and insertion-replacement hash tables. Finally, in Step 445 all double deletions of the candidate word are generated and in Step 450 these are tested against the double deletion hash. Having accumulated all hash table matches, the results are output in Step 455.

It is noted that except for distance two misspellings that involve double insertions, double deletions, or replacements, all actions can be done in O(W) time with O(D*W)=O(D) storage. However, replacements are less usual than the other single operations, and may be considered to be a deletion followed by an insertion. Also, double insertions and double deletions are relatively rare types of misspellings. Hence, if distance two misspellings are re-defined to exclude these possibilities (i.e., don't test these cases), a correction algorithm is provided that runs in O(W) time with O(D*W)=O(D) storage. This is referred to as a “soft” distance two correction.

FIG. 5 is a flow chart illustrating the overall flow of the “soft” distance two spelling correction algorithm 500. The diagram of FIG. 5 is identical to FIG. 3, the complete distance two correction flow, except that in lieu of testing all distance two variants against the relevant distance two hash tables, the only distance two variants tested are those that do not include a replacement, a double deletion, or a double insertion. As usual, relevant hash tables are assumed to be pre-created. Distance one correction is initially performed (Step 100). If no misspelling is detected, the process 500 outputs that the word is spelled correctly and terminates (Step 520). Otherwise, a test is performed to see if enough candidates are found (Step 530). If enough candidates are found, the process outputs just the distance one corrections and terminates (Step 540). Otherwise, the process 500 continues with soft distance two variant testing (Step 600), as discussed further below in conjunction with FIG. 6, and only after this testing outputs the results and terminates (Step 540).

FIG. 6 describes the process 600 of testing variants of the candidate word against hash tables derived from the dictionary for “soft” distance two misspellings. The algorithm 600 starts in block 610 and obtains the candidate word. Following that, the unchanged candidate word is tested against the deletion and deletion-transposition hash tables in Step 620. Next all single step transpositions and single step deletions are generated (Step 630), and then tested against the single step transposition and single step deletion hash tables (Step 640). Finally, the accumulated matches are output in Step 650.
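A minimal sketch of process 600 follows, assuming hypothetical helper names and a one-word dictionary chosen for illustration. It builds only the three tables that the soft test consults (t, d and the special dt hash), each mapping a key to a set of dictionary words, and omits the distance one pass that precedes it in FIG. 5.

```python
from collections import defaultdict


def transpositions(w):
    return [w[:i] + w[i + 1] + w[i] + w[i + 2:] for i in range(len(w) - 1)]


def deletions(w):
    return [w[:i] + w[i + 1:] for i in range(len(w))]


def special_deletion_transpositions(w):
    # Delete w[i], then swap the two characters that surrounded it.
    return [w[:i - 1] + w[i + 1] + w[i - 1] + w[i + 2:] for i in range(1, len(w) - 1)]


def build_soft_tables(words):
    tables = {"t": defaultdict(set), "d": defaultdict(set), "dt": defaultdict(set)}
    generators = {"t": transpositions, "d": deletions, "dt": special_deletion_transpositions}
    for word in words:
        for name, generate in generators.items():
            for key in generate(word):
                tables[name][key].add(word)
    return tables


def soft_distance_two(candidate, tables):
    matches = set()
    # Step 620: the unchanged candidate against the d and dt hashes.
    for name in ("d", "dt"):
        matches |= tables[name].get(candidate, set())
    # Steps 630/640: transpositions and deletions against the t and d hashes.
    for variant in transpositions(candidate) + deletions(candidate):
        for name in ("t", "d"):
            matches |= tables[name].get(variant, set())
    return matches  # Step 650


tables = build_soft_tables(["COAT"])
for misspelling in ("OCTA", "COXT", "ACT", "CAXOT"):
    print(misspelling, sorted(soft_distance_two(misspelling, tables)))
```

In this run OCTA is a double transposition of COAT, COXT a deletion followed by an insertion, ACT a special deletion-transposition, and CAXOT a transposition followed by an insertion; each resolves to COAT. A candidate of length W costs 2 + 2(2W - 1) lookups, which is O(W).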

System and Article of Manufacture Details

As is known in the art, the methods and apparatus discussed herein may be distributed as an article of manufacture that itself comprises a computer readable medium having computer readable code means embodied thereon. The computer readable program code means is operable, in conjunction with a computer system, to carry out all or some of the steps to perform the methods or create the apparatuses discussed herein. The computer readable medium may be a recordable medium (e.g., floppy disks, hard drives, compact disks, or memory cards) or may be a transmission medium (e.g., a network comprising fiber-optics, the world-wide web, cables, or a wireless channel using time-division multiple access, code-division multiple access, or other radio-frequency channel). Any medium known or developed that can store information suitable for use with a computer system may be used. The computer-readable code means is any mechanism for allowing a computer to read instructions and data, such as magnetic variations on a magnetic media or height variations on the surface of a compact disk.

The computer systems and servers described herein each contain a memory that will configure associated processors to implement the methods, steps, and functions disclosed herein. The memories could be distributed or local and the processors could be distributed or singular. The memories could be implemented as an electrical, magnetic or optical memory, or any combination of these or other types of storage devices. Moreover, the term “memory” should be construed broadly enough to encompass any information able to be read from or written to an address in the addressable space accessed by an associated processor. With this definition, information on a network is still within a memory because the associated processor can retrieve the information from the network.

It is to be understood that the embodiments and variations shown and described herein are merely illustrative of the principles of this invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention.

Claims

1. A method for correcting spelling of at least one candidate word, said method comprising: obtaining at least one variant dictionary hash table based on variants of a set of known correctly spelled words, wherein said variants are obtained by applying one or more of a deletion, insertion, replacement, and transposition operation on said correctly spelled words; obtaining from the at least one candidate word one or more lookup variants using one or more of said deletion, insertion, replacement, and transposition operations; evaluating one or more of said at least one candidate word and said lookup variants against said at least one variant dictionary hash table; and indicating a candidate correction if there is at least one match in the at least one variant dictionary hash table.
2. The method of claim 1, further comprising the step of obtaining a dictionary hash table having entries in a dictionary of known correctly spelled words and wherein said dictionary hash table and said at least one variant dictionary hash table are based on said dictionary and are comprised of at least one distance one variation for each of said entries, wherein said distance one variation comprises one or more of a deletion, insertion, replacement, and transposition operation performed on said entries; and wherein the step of evaluating one or more of said candidate word and said lookup variants against said at least one variant dictionary hash table further comprises the step of evaluating one or more distance one variants against said at least one variant dictionary hash table.
3. The method of claim 2, wherein said distance one variation comprises a replacement operation to generate a replacement hash table having entries of single character wild card replacements of said entries in said dictionary and said method further comprises the steps of generating single character replacements and insertions of said candidate word and comparing said single character replacements and insertions against said replacement hash table.
4. The method of claim 3, wherein the replacement hash table is obtained by: generating variants of each word in the dictionary, each variant is comprised of replacing any one character in the word with a wild card character and leaving other characters unchanged, thereby generating W variants for each word of length W; and for each generated variant of a word in the dictionary, storing a key-value pair in a hash table, wherein a key is a generated variant having a value that is the word itself.
5. The method of claim 2, further comprising the steps of generating one or more distance one variants of said at least one candidate word and testing said distance one variants against one or more of said dictionary hash table and said at least one variant dictionary hash table, and accumulating matches.
6. The method of claim 5, wherein said one or more distance one variants comprises adjacent character transpositions of said at least one candidate word obtained by generating all variants of the candidate word wherein any one pair of adjacent characters are interchanged, and the remaining characters are left unchanged.
7. The method of claim 5, wherein said one or more distance one variants comprises single character deletions of said at least one candidate word obtained by generating all variants of the candidate word where any single character is deleted and other characters are unchanged.
8. The method of claim 5, wherein said one or more distance one variants comprises single character wild card replacements of said at least one candidate word obtained by generating variants of said at least one candidate word by replacing any one character in said at least one candidate word with a chosen wild card character and leaving other characters unchanged, thereby generating W variants of said at least one candidate word of length W.
9. The method of claim 5, wherein said one or more distance one variants comprises single character wild card insertions of said at least one candidate word obtained by generating variants of said at least one word which comprise inserting a wild card character before or after any character of said at least one candidate word and leaving the other characters unchanged, thereby generating W+1 variants of said at least one candidate word of length W.
10. The method of claim 2, wherein said at least one variant dictionary hash table comprises one or more of: a transposition hash table having entries comprising of single adjacent character transpositions of the words in the dictionary; a deletion hash table having entries comprising of single character deletions of the words in the dictionary; a transposition-replacement hash table having entries comprising of single adjacent character transpositions followed by single character wild card replacements of the words in the dictionary; a deletion-transposition hash table having entries comprising of single character deletions followed by transposition of characters adjacent to the just deleted character; a double-deletion hash table having entries comprising of a single character deletion followed by another single character deletion of the words in the dictionary; a deletion-replacement hash table having entries comprising of single character deletions followed by single character wild card replacements of the words in the dictionary; and an insertion-replacement hash table having entries comprising of a single character insertions followed by single character replacements of the words in the dictionary.
11. The method of claim 10, further comprising the step of testing said at least one candidate word against the deletion hash table, the deletions-transposition hash table, and the double deletion hash table and accumulating matches.
12. The method of claim 10, further comprising the steps of generating all adjacent character transpositions of said at least one candidate word and testing said adjacent character transpositions against transposition and deletion hash tables, and accumulating matches.
13. The method of claim 10, further comprising the steps of generating all single character deletions of said at least one candidate word and testing said single character deletions against transposition and deletion hash tables, and accumulating matches.
14. The method of claim 10, further comprising the steps of generating all single character replacements of said at least one candidate word and testing said single character replacements against the transposition-replacement, deletion-replacement, and insertion-replacement hash tables, and accumulating matches.
15. The method of claim 10, further comprising the steps of generating two character deletions of said at least one candidate word and testing said two character deletions against the double deletion hash table, and accumulating matches.
16. The method of claim 2, wherein said at least one variant dictionary hash table comprises one or more of: a deletion hash table having entries comprising single character deletions of the words in the dictionary; a deletion-transposition hash table having entries comprising single character deletions followed by transposition of characters adjacent to the just deleted character; and a transposition hash table having entries comprising single adjacent character transpositions of the words in the dictionary.
17. The method of claim 16, further comprising the step of testing at least one candidate word against the deletion hash table and the deletion-transposition hash table, and accumulating matches.
18. The method of claim 16, further comprising the step of generating all adjacent character transpositions of said at least one candidate word and testing said adjacent character transpositions against the transposition and deletion hash tables, and accumulating matches.
19. The method of claim 16, further comprising the step of generating all single character deletions of said at least one candidate word and testing said single character deletions against the transposition and deletion hash tables, and accumulating matches.
20. A method as recited in claim 16, further comprising a step of generating arbitrary character transpositions of said at least one candidate word by generating all variants of the candidate word wherein any one pair of not necessarily adjacent characters are interchanged, and the remaining characters are left unchanged.
21. The method of claim 2, wherein said at least one variant dictionary hash table comprises one or more of: a transposition hash table having entries comprising single not necessarily adjacent character transpositions of the words in the dictionary; a transposition-replacement hash table having entries comprising of single not necessarily adjacent character transpositions followed by single character wild card replacements of the words in the dictionary; and a deletion-transposition hash table having entries comprising of single character deletions followed by transposition of characters not necessarily adjacent to the just deleted character.
22. The method of claim 21, further comprising the steps of generating all not necessarily adjacent character transpositions of said at least one candidate word and testing said not necessarily adjacent character transpositions against the transposition hash table and a deletion hash tables, and accumulating matches.
23. A system for correcting spelling of at least one candidate word, said system comprising: a memory; and at least one processor, coupled to the memory, operative to: obtain at least one variant dictionary hash table based on variants of a set of known correctly spelled words, wherein said variants are obtained by applying one or more of a deletion, insertion, replacement, and transposition operation on said correctly spelled words; obtain from the candidate word one or more lookup variants using one or more of said deletion, insertion, replacement, and transposition operations; evaluate one or more of said candidate word and said lookup variants against said at least one variant dictionary hash table; and indicate a candidate correction if there is at least one match in the at least one variant dictionary hash table.
24. An article of manufacture for correcting spelling of at least one candidate word, comprising a machine readable medium containing one or more programs which when executed implement the steps of: obtaining at least one variant dictionary hash table based on variants of a set of known correctly spelled words, wherein said variants are obtained by applying one or more of a deletion, insertion, replacement, and transposition operation on said correctly spelled words; obtaining from the candidate word one or more lookup variants using one or more of said deletion, insertion, replacement, and transposition operations; evaluating one or more of said candidate word and said lookup variants against said at least one variant dictionary hash table; and indicating a candidate correction if there is at least one match in the at least one variant dictionary hash table.
