User web search queries are used to obtain search results from a search engine. However, many user queries contain misspellings. Misspellings can arise for many reasons: the subject matter may be unfamiliar, the user may be entering a name that was heard on radio or television, or the user may inadvertently introduce lexical errors while typing.
Misspellings can be corrected using different methods, such as a dictionary lookup. When a user query term does not appear in a dictionary, the dictionary entry with the lowest edit distance can be used or suggested as an alternative to the misspelled term. The edit distance refers to the number of characters within the misspelled term that need to be added, deleted, or changed in order to achieve a correctly spelled term. For example, “amand” has an edit distance of one when corrected to “amend.” As another example, “Cincinatti” has an edit distance of two when corrected to “Cincinnati,” where one letter was added (n) and another letter was removed (t). However, a static dictionary may not contain colloquial terms or currently popular names that postdate the dictionary. In addition, updating a dictionary typically relies on costly human labor.
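The dictionary lookup described above can be sketched as follows. This is a minimal illustration only, assuming the common Levenshtein definition of edit distance, in which each added, deleted, or changed character costs one. The function names `edit_distance` and `closest_entry` and the sample dictionary are hypothetical and are not taken from any particular system.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions turning source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def closest_entry(term: str, dictionary: list[str]) -> str:
    """Return the dictionary entry with the lowest edit distance to term."""
    return min(dictionary, key=lambda entry: edit_distance(term, entry))


# Hypothetical dictionary, for illustration only.
words = ["amend", "among", "Cincinnati"]
print(edit_distance("amand", "amend"))            # 1
print(edit_distance("Cincinatti", "Cincinnati"))  # 2
print(closest_entry("amand", words))              # amend
```

A lookup of this kind can only suggest entries that the dictionary already contains, which is the limitation of static dictionaries noted above.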
Another spell correction system uses dynamic lookup tables of misspelled/corrected pairs. The misspelled term in the user query is altered to the most common term that has a low edit distance from it. However, the correctly spelled term may have a large edit distance if it was derived from a longer misspelled term. Therefore, a corrected term may be excluded from consideration due to a large edit distance.
A trie is another tool used with some spell correction systems. A trie is an ordered tree data structure that is used to store an associative array, where the keys are usually strings. A trie can be populated with one or more dictionaries, histograms, word bi-grams, or frequently used spellings. However, as with other systems, a corrected term may be excluded from consideration due to a large edit distance.
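For illustration, a trie keyed on strings can be sketched as below. This is a minimal sketch under stated assumptions, not the structure used by any particular spell correction system: the class names `TrieNode` and `Trie`, the per-key `count` field (standing in for a frequency histogram), and the sample words are all hypothetical.

```python
class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}
        self.count = 0  # number of times a key ending at this node was inserted


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, key: str) -> None:
        """Add one occurrence of key, creating nodes along the path as needed."""
        node = self.root
        for char in key:
            node = node.children.setdefault(char, TrieNode())
        node.count += 1

    def frequency(self, key: str) -> int:
        """Return how many times key was inserted (0 if it never was)."""
        node = self.root
        for char in key:
            node = node.children.get(char)
            if node is None:
                return 0
        return node.count

    def keys_with_prefix(self, prefix: str) -> list[str]:
        """Return all stored keys that start with prefix, in sorted order."""
        node = self.root
        for char in prefix:
            node = node.children.get(char)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.count:
                found.append(text)
            for char, child in current.children.items():
                stack.append((child, text + char))
        return sorted(found)


# Hypothetical frequently used spellings, for illustration only.
trie = Trie()
for word in ["amend", "amend", "among", "amount"]:
    trie.insert(word)
print(trie.frequency("amend"))      # 2
print(trie.keys_with_prefix("am"))  # ['amend', 'among', 'amount']
```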
Embodiments of the invention are defined by the claims below. A high-level overview of various embodiments is provided to introduce a summary of the systems, methods, and media that are further described in the detailed description section below. This summary is neither intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in isolation to determine the scope of the claimed subject matter.
Systems, methods, and computer-readable storage media are described for generating spelling candidates. In some embodiments, a method of generating one or more spelling candidates includes receiving a text fragment log. The text fragment log is divided into one or more common context groups. Each term or phrase of the divided text fragment log is ranked according to a frequency of occurrence within each of the one or more common context groups to form one or more respective ranked lists. A chain algorithm is applied to each of the respective ranked lists to identify a base word or phrase and a set of one or more subordinate words or phrases paired with the base word or phrase. The base word or phrase is aggregated with all sets of one or more subordinate words or phrases from all of the respective ranked lists to form one or more resulting chains of spelling candidates for the base word or phrase.
In other embodiments, a spelling candidate generator system contains a context group component, an algorithm component, and an aggregation component. The context group component contains a text fragment log divided into one or more common context groups. The algorithm component contains one or more lists of terms or phrases from the divided text fragment log. The one or more lists of terms or phrases are ranked according to a frequency of occurrence within each respective common context group to obtain individual base words or phrases and one or more associated subordinate words or phrases. The aggregation component contains one or more aggregated pairs of the individual base terms or phrases paired with their associated subordinate terms or phrases.
In yet other embodiments, one or more computer-readable storage media have computer-readable instructions embodied thereon, such that a computing device performs a method of generating one or more spelling candidates upon executing the computer-readable instructions. The method includes receiving a query log, which contains one or more user-input queries. The user-input queries are divided into one or more common context groups. Each term of the user-input queries within a common context group is ranked according to a frequency of occurrence for each of the one or more common context groups to form one or more respective ranked lists. For each respective ranked list, a top-ranked word or phrase is identified as a correctly spelled word or phrase. An edit distance is determined for a next-ranked word or phrase from the top-ranked word or phrase for each respective ranked list. The next-ranked word or phrase is labeled as a misspelling of the top-ranked word or phrase when the edit distance is within a threshold level for each respective ranked list. The top-ranked word or phrase and all sets of one or more next-ranked words or phrases from all of the respective ranked lists are aggregated to form one or more chains of spelling candidates for the top-ranked word or phrase.
Illustrative embodiments of the invention are described in detail below, with reference to the attached drawing figures, which are incorporated by reference herein, and wherein:
FIGS. 3a-3c are tables of spelling candidate generation scoring used in accordance with embodiments of the invention;
FIG. 3d is a screenshot used in accordance with embodiments of the invention;
a is a flowchart of a chain algorithm used in accordance with embodiments of the invention;
b is a table of spelling candidate generation scoring used in accordance with embodiments of the invention; and
Embodiments of the invention provide systems, methods and computer-readable storage media for spelling candidate generation.
The terms “step,” “block,” etc. might be used herein to connote different acts of methods employed, but the terms should not be interpreted as implying any particular order, unless the order of individual steps, blocks, etc. is explicitly described. Likewise, the term “module,” etc. might be used herein to connote different components of systems employed, but the terms should not be interpreted as implying any particular order, unless the order of individual modules, etc. is explicitly described.
Embodiments of the invention include, without limitation, methods, systems, and sets of computer-executable instructions embodied on one or more computer-readable media. Computer-readable media include both volatile and nonvolatile media, removable and non-removable media, and media readable by a database and various other network devices. By way of example and not limitation, computer-readable storage media comprise media implemented in any method or technology for storing information. Examples of stored information include computer-useable instructions, data structures, program modules, and other data representations. Media examples include, but are not limited to information-delivery media, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact-disc read-only memory (CD-ROM), digital versatile discs (DVD), Blu-ray disc, holographic media or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage, and other magnetic storage devices. These examples of media can be configured to store data momentarily, temporarily, or permanently. The computer-readable media include cooperating or interconnected computer-readable media, which exist exclusively on a processing system or distributed among multiple interconnected processing systems that may be local to, or remote from, the processing system.
Embodiments of the invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computing system, or other machine or machines. Generally, program modules including routines, programs, objects, components, data structures, and the like refer to code that performs particular tasks or implements particular data types. Embodiments described herein may be implemented using a variety of system configurations, including handheld devices, consumer electronics, general-purpose computers, more specialty computing devices, etc. Embodiments described herein may also be implemented in distributed computing environments, using remote-processing devices that are linked through a communications network, such as the Internet.
Having briefly described a general overview of the embodiments herein, an exemplary computing system is described below. Referring to
The computing device 100 includes a bus 110 that directly or indirectly couples the following devices: memory 112, one or more processors 114, one or more presentation components 116, input/output (I/O) ports 118, input/output components 120, and an illustrative power supply 122. The bus 110 represents what may be one or more busses (such as an address bus, data bus, or combination thereof). Although the various blocks of
The components described above in relation to the computing device 100 may also be included in a wireless device. A wireless device, as described herein, refers to any type of wireless phone, handheld device, personal digital assistant (PDA), BlackBerry®, smartphone, digital camera, or other mobile device (aside from a laptop) that communicates wirelessly. One skilled in the art will appreciate that wireless devices will also include a processor and computer-storage media, which perform various functions. Embodiments described herein are applicable to both a computing device and a wireless device. In embodiments, computing devices can also refer to devices that run applications for which images are captured by the camera in a wireless device. The computing system described above is configured to be used with the several computer-implemented methods, systems, and media for spelling candidate generation, generally described above and described in more detail hereinafter.
Embodiments of the invention can be implemented as software instructions executed by one or more processors in a computing device, such as a general purpose computer, cell phone, or gaming console. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components which include, but are not limited to, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-a-Chip (SOCs), or Complex Programmable Logic Devices (CPLDs).
Input methods for embodiments of the invention may be implemented by a Natural User Interface (NUI). NUI is defined as any interface technology that enables a user to interact with a device in a “natural” manner, free from artificial constraints imposed by input devices such as mice, keyboards, remote controls, and the like.
Examples of NUI methods include those relying on speech recognition, touch and stylus recognition, gesture recognition both on-screen and adjacent to the screen, air gestures, head and eye tracking, voice and speech, vision, touch, gestures, and machine intelligence. Specific categories of NUI technologies include, but are not limited to, touch sensitive displays, voice and speech recognition, intention and goal understanding, motion gesture detection using depth cameras (such as stereoscopic camera systems, infrared camera systems, RGB camera systems, and combinations of these), motion gesture detection using accelerometers/gyroscopes, facial recognition, 3D displays, head, eye, and gaze tracking, and immersive augmented reality and virtual reality systems, all of which provide a more natural interface. NUI also includes technologies for sensing brain activity using electric field sensing electrodes.
FIG. 3a is a table illustrating several queries 310, where each of the queries resulted in the same URL 320 being clicked upon or selected by the associated user. The table has been truncated, but if the table were expanded, it would illustrate queries that resulted in one or more clicks to the same URL. The number of clicks 330 of each query is also illustrated.
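By way of illustration only, grouping queries by the URL that was clicked can be sketched as follows. The record layout, the function name `group_by_url`, the URLs, and the click counts are all hypothetical; they are not taken from the table described above.

```python
from collections import defaultdict

# Hypothetical click-log records: (query, clicked URL, number of clicks).
click_log = [
    ("arnold schwarzenegger", "example.com/arnold", 120),
    ("arnold schwarznegger", "example.com/arnold", 14),
    ("arnold swarzenegger", "example.com/arnold", 6),
    ("cincinnati weather", "example.com/weather", 80),
    ("cincinatti weather", "example.com/weather", 9),
]


def group_by_url(records):
    """Map each clicked URL to the queries (with click counts) that led to it."""
    groups = defaultdict(dict)
    for query, url, clicks in records:
        groups[url][query] = groups[url].get(query, 0) + clicks
    return dict(groups)


context_groups = group_by_url(click_log)
for url, queries in context_groups.items():
    print(url, queries)
```

In this sketch, each URL plays the role of one common context group, so differently spelled queries that led to the same page end up side by side.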
Referring back to
Λ=Σ[log10(θn)+1]
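As a worked illustration of the expression above, the sum can be computed as follows. This sketch assumes that each θn is a positive count, such as the number of clicks recorded for one query containing the term; that reading is an assumption made here for illustration. The function name `lambda_score` and the sample values are hypothetical.

```python
import math


def lambda_score(theta_values):
    """Compute Λ = Σ [log10(θn) + 1] over a sequence of positive θn values."""
    return sum(math.log10(theta) + 1 for theta in theta_values)


# Hypothetical θ values per term, assumed here to be per-query click counts.
term_clicks = {
    "schwarzenegger": [100, 10],
    "schwarznegger": [10, 1],
    "swarzenegger": [1],
}

# Rank terms from highest to lowest score.
ranked = sorted(
    term_clicks,
    key=lambda term: lambda_score(term_clicks[term]),
    reverse=True,
)
print(ranked)
```

Because the logarithm compresses large counts, a term seen in several queries can score higher than a term seen in a single heavily clicked query; the added 1 ensures that a count of one still contributes to the sum.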
b is a table illustrating ranked results from the commonly grouped URLs illustrated in
The top-ranked term or phrase is identified as the prominent term or prominent phrase. A chain algorithm is then applied in step 240 to determine the edit distance of each term or phrase from the previous term or phrase. The previous term may be the prominent term or a previous subordinate term. An illustration will be given for step 240, using the information from the tables in
The second highest-ranked term or phrase is selected from the set of words or phrases within the same common context group. In
The third highest-ranked term or phrase is selected from the set of words or phrases within the same common context group. In
Embodiments of the chain algorithm of step 240 in
In step 245, a determination is made whether there is another context group. If another context group exists, then the method returns to step 230, where the terms or phrases of the subsequent context group are ranked. If there are no more context groups, then the method continues to step 250. In step 250, results for all common context groups are aggregated. The table in
The extracted pairs of prominent/subordinate words or phrases can be scored according to the following embodiment. The likelihood of a subordinate term or phrase being a misspelling of the prominent term or phrase is given by the ratio of the number of contexts in which the subordinate term or phrase was corrected to the prominent term or phrase to the total number of contexts in which the subordinate term or phrase appeared. A mathematical illustration is given below.
Let: Ψ=the total number of common contexts in which one or more queries or text fragments contained a possibly incorrect spelling of a word/phrase (W/P); Φ=the number of common contexts not corrected (considered correct); Ω=the number of common contexts in which a possibly incorrect spelling of a W/P was found to be a misspelled word or phrase of W/P. A common context could be a single word, a multi-word phrase, or an entire query.
Likelihood of original word or phrase being correct=Φ/(Φ+Ψ)
Likelihood of changing W/P to W′/P′=Ω/(Φ+Ψ)
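The two likelihood expressions above can be evaluated as in the following sketch, which takes the expressions exactly as written. The function names and the sample counts are hypothetical and are chosen only to show the arithmetic.

```python
def likelihood_correct(phi: int, psi: int) -> float:
    """Likelihood of the original word or phrase being correct: Φ / (Φ + Ψ)."""
    if phi + psi == 0:
        raise ValueError("phi + psi must be greater than zero")
    return phi / (phi + psi)


def likelihood_change(omega: int, phi: int, psi: int) -> float:
    """Likelihood of changing W/P to W'/P': Ω / (Φ + Ψ)."""
    if phi + psi == 0:
        raise ValueError("phi + psi must be greater than zero")
    return omega / (phi + psi)


# Hypothetical counts of common contexts, for illustration only.
psi = 8    # Ψ: contexts containing the possibly incorrect spelling
phi = 2    # Φ: contexts not corrected (considered correct)
omega = 6  # Ω: contexts in which the spelling was found to be a misspelling

print(likelihood_correct(phi, psi))        # 0.2
print(likelihood_change(omega, phi, psi))  # 0.6
```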
FIG. 3d is an example of how embodiments of the invention can be used in a user interface. A screenshot 301 illustrates a returned result. In this example, a user input the term, “schwarznegger” 302. The total results included results for “schwarzenegger,” 303 (the correct spelling), and also included a question, asking if results for “schwarznegger” were wanted 304.
a is a flow diagram illustrating the chain algorithm discussed above. The chain algorithm is implemented in step 240 of the flow diagram illustrated in
In the ranked list of
The chain algorithm illustrated in
The spelling candidate generation system 500 also contains an algorithm component 530. The terms or phrases of each common context group 520 in the context group component 510 are ranked within their respective common context group 520 according to frequency of occurrence. Therefore, the first common context group 520 within the context group component 510 will have a corresponding ranked list 540 within the algorithm component 530. The table in
The chain algorithm, discussed above with reference to
The spelling candidate generation system 500 also contains an aggregation component 550. The aggregation component 550 combines the pairs of base words or phrases with associated subordinate words or phrases. An alternative embodiment combines pairs of correctly spelled words or phrases with associated variant or incorrectly spelled words or phrases. Aggregated pairs are formed from all of the individual ranked lists 540 for all of the common context groups 520. The aggregation component 550 forms one or more chains 560 for each base word (BW) and its associated one or more subordinate words (SWn).
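The aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `aggregate_chains`, the input layout (one list of base/subordinate pairs per common context group), and the sample spellings are hypothetical, and duplicate subordinate words are simply kept once in the order first seen.

```python
def aggregate_chains(pairs_per_group):
    """Merge (base word, subordinate words) pairs from every context group
    into one chain of subordinate words per base word."""
    chains = {}
    for pairs in pairs_per_group:
        for base, subordinates in pairs:
            chain = chains.setdefault(base, [])
            for word in subordinates:
                if word not in chain:  # keep each subordinate word once
                    chain.append(word)
    return chains


# Hypothetical output of the ranking and chaining steps for two context groups.
group_one = [("schwarzenegger", ["schwarznegger"])]
group_two = [
    ("schwarzenegger", ["swarzenegger", "schwarznegger"]),
    ("cincinnati", ["cincinatti"]),
]

chains = aggregate_chains([group_one, group_two])
print(chains)
```

Here the base word “schwarzenegger” collects subordinate words from both groups, which corresponds to one chain 560 per base word in the description above.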
In a conventional spelling candidate generator, “swarzeneggar” would probably not be linked to “schwarzenegger” because “swarzeneggar” has an edit distance of three from “schwarzenegger.” However, embodiments of the invention provide one or more intermediate subordinate terms to be chained to the base word, wherein each subordinate term falls within an acceptable threshold edit distance from the preceding term in the chain, which is either the base word or another subordinate word. As a result, each term within a chain can be logged as a linked misspelling of the base word.
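The chaining behavior described above can be illustrated with the following sketch. It is one possible reading rather than a definitive implementation: it assumes a Levenshtein edit distance, a threshold of two, and that a term outside the threshold of the most recently chained term is simply skipped. The function names, the threshold value, and the intermediate spellings in the ranked list are hypothetical.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance between two strings."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def build_chain(ranked_terms, threshold=2):
    """Take the top-ranked term as the base word, then chain each later term
    whose edit distance from the most recently chained term is within the
    threshold (an assumed value). Returns (base word, subordinate words)."""
    base, *rest = ranked_terms
    chain = [base]
    for candidate in rest:
        if edit_distance(candidate, chain[-1]) <= threshold:
            chain.append(candidate)
    return base, chain[1:]


# Hypothetical ranked list for one context group, most frequent term first.
ranked = ["schwarzenegger", "schwarznegger", "swarznegger",
          "swarzeneggar", "stallone"]

base, subordinates = build_chain(ranked)
print(base, subordinates)
print(edit_distance("swarzeneggar", "schwarzenegger"))  # 3
```

In this example, “swarzeneggar” is three edits from the base word and would fail a direct comparison at a threshold of two, yet it is still linked because each step along the chain stays within the threshold.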
Many different arrangements of the various components depicted, as well as embodiments not shown, are possible without departing from the spirit and scope of the invention. Embodiments of the invention have been described with the intent to be illustrative rather than restrictive.
It will be understood that certain features and subcombinations are of utility and may be employed without reference to other features and subcombinations and are contemplated within the scope of the claims. Not all steps listed in the various figures need be carried out in the specific order described.