Over the last few years, people search has emerged as an important online search service. Unlike general search, where users look for information on a wide range of topics including people, products, news and events, people search is about people. Hence, personal names are used predominantly as queries in people search. However, it is known that a significant percentage of queries in people search are misspelled.
Spelling errors in personal names are of a different nature compared to those in general text. Long before people search became widely popular, researchers working on the problem of personal name matching had recognized the human tendency to be inexact in recalling names from memory and specifying them. A study of personal names in hospital databases found that only 39% of the errors in the names were single typographical errors, whereas 80% of misspelled words in general text are due to single typographical errors. Further, multiple typographical errors, phonetic errors, cognitive errors and word substitutions are observed relatively more frequently in personal names than in general text.
In addition to within-word errors, people search queries can be plagued by errors that are not usually seen in general text. For instance, one study found that 36% of the errors were due to the addition or deletion of a word.
Personal name spelling correction suggestion technique embodiments described herein generally provide suggestions for alternate spellings of a personal name. In one general embodiment this involves creating a personal name directory which can be queried to suggest spelling corrections for personal names. A hashing-based scheme is used to characterize the personal names in the directory. More particularly, in one general implementation for creating a personal name directory which can be queried to suggest spelling corrections for a personal name, a hash function is computed that maps any personal name in a particular language and misspellings thereof to similar binary codewords. Once the hash function has been computed, it is used to produce one or more binary codewords for each personal name in the aforementioned language that is found in the personal name directory. The codeword or codewords produced for each personal name are then associated with that name in the directory.
The same hashing-based scheme can also be used to characterize a personal name included in a query prior to it being used to obtain suggested spelling corrections for the name from the directory. More particularly, in one general implementation for providing one or more suggested spelling corrections for a personal name included in a query, a personal name query that includes a personal name in the aforementioned particular language is input. The hash function is then used to produce one or more binary codewords from the query personal name. Next, the previously constructed personal name directory is employed to identify up to a prescribed number of personal names, each of which has one or more associated binary codewords that are similar to one or more of the binary codewords produced from the personal name query. The identified personal names are then designated as potential personal name corrections. Then one or more of the potential personal name corrections are suggested as alternate names for the personal name from the personal name query.
It should also be noted that this Summary is provided to introduce a selection of concepts, in a simplified form, that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
In the following description of personal name spelling correction suggestion technique embodiments, reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the technique may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the technique.
Personal name spelling correction suggestion technique embodiments described herein generally involve creating a personal name directory which can be queried to suggest spelling corrections for personal names. A hashing-based scheme is used to characterize the personal names in the directory. The same hashing-based scheme can also be used to characterize a personal name included in a query prior to it being used to obtain suggested spelling corrections for the name from the directory. Generally, given a query, it is desired to return the global best match, or up to a prescribed number of the top matches, from the personal name directory.
Referring to
Referring now to
In order to expedite the query matching process, especially for large name directories, in one embodiment the query search is performed in two stages: a name bucketing stage and a name matching stage. In the name bucketing stage, for each token of the query, an approximate nearest neighbor search of the name tokens of the directory is done to produce a list of candidate matches (i.e., tokens that are approximate matches of the query token). For the purposes of this description, a token is defined as a word in a personal name, namely a continuous string of characters unbroken by a space, whose characters are consistent with those employed in personal names in the language of the name (a simple tokenizer is sketched below). Using the aforementioned list of candidate tokens, a list of candidate names is extracted which contain at least one of the approximately matching tokens. In the name matching stage, a rigorous matching of the query with the candidate names is performed.
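To make the tokenization step concrete, the following is a minimal sketch. The permitted character class (Latin letters plus apostrophes and hyphens) is an illustrative assumption; the technique only requires characters consistent with personal names in the language at hand.

```python
import re

def tokenize_name(name):
    """Split a personal name into tokens: maximal runs of name characters
    unbroken by spaces. The character class below is an assumption suitable
    for English; other languages would use their own class."""
    return re.findall(r"[a-z'\-]+", name.lower())

print(tokenize_name("Jon M. Kleinberg"))  # ['jon', 'm', 'kleinberg']
```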
Success in finding the right personal name suggestion for the query in the name matching stage depends on the success in getting the right name suggestion in the list of candidates produced by the name bucketing stage search. Therefore, employing a name similarity search technique that can ensure very high recall without producing too many candidates would be advantageous. Hashing is believed to be ideally suited for this task of fast and approximate name matching. In operation, the query tokens, as well as the personal name directory tokens, are generally hashed into d-bit binary codewords (e.g., 32-bit codewords). With binary codewords, finding approximate matches for a query token is as easy as finding all the database tokens whose codeword is at a distance (e.g., a Hamming distance) of r or less from the query token codeword. When the binary codewords are compact, this search can be done in a fraction of a second on directories containing millions of names, even on a simple computing device.
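To illustrate the radius-r search just described, the following sketch packs each d-bit codeword into an integer and performs a linear scan; this is one simple realization, not the only possible retrieval structure.

```python
def hamming(a, b):
    """Hamming distance between two codewords packed as integers."""
    return bin(a ^ b).count("1")

def codewords_within_radius(query_code, directory_codes, r=4):
    """Return all directory codewords at Hamming distance r or less from the
    codeword of a query token. A linear scan over packed 32-bit integers is
    fast even for millions of entries."""
    return [c for c in directory_codes if hamming(query_code, c) <= r]
```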
The hashing procedure used to hash the personal name directory tokens and the query tokens can be performed in different ways. In one embodiment, a novel data-driven technique for learning hash functions that map similar names to similar binary codewords is used, based on a set of personal names in a given language (i.e., monolingual data). In another embodiment, the hash functions are learned using name equivalents in multiple languages (i.e., multilingual data). It is noted that an equivalent personal name can also be in a different script from the other equivalent names. For example, in a two-language implementation of the latter approach, name pairs are used as the training data, where one of the names is in the language and script that the personal name queries are anticipated to exhibit, and the other is the equivalent name in a different language and possibly even a different script. In both hashing implementations, however, the idea is the same: learning hash functions that map similar names in the training data to similar binary codewords. The foregoing hashing implementations will be described in more detail in the sections to follow.
In general, the task of learning hash functions using monolingual names is formulated as an optimization problem whose relaxation can be solved as a generalized Eigenvalue problem. Let (s, s′) be a pair of names from a set of name pairs T = {(s, s′)}. It is noted that in one tested implementation, 30,000 single token names in English were employed in learning a monolingual name hash function. Now, let w(s, s′) be a measure of a name pair's similarity. In one implementation, 1 minus the length-normalized edit distance was used; more particularly, w(s, s′) = 1 − ED(s, s′)/max{|s|, |s′|}, where ED(s, s′) is the Damerau-Levenshtein edit distance between s and s′. Featurization of the actual names is typically employed for computational purposes in mapping the names to codewords. In one implementation each name is represented as a feature vector over character bigrams. For instance, the name token Klein has the bigrams {•k, kl, le, ei, in, n•} as features. Thus, let φ(s) ∈ R^m denote the feature vector of name s. A hash function f mapping names to d-bit binary codewords is sought that solves the following optimization problem:
$$\text{minimize:}\quad \sum_{(s,s')\in T} w(s,s')\,\lVert f(s)-f(s')\rVert^2 \qquad (1)$$

s.t.:

$$\sum_{s:(s,s')\in T} f(s) = 0 \qquad (2)$$

$$\sum_{s:(s,s')\in T} f(s)\,f(s)^T = \rho^2 I_d \qquad (3)$$

$$f(s),\; f(s') \in \{-1,1\}^d \qquad (4)$$

where I_d is the identity matrix of size d×d. Note that the foregoing minimization is based on the Hamming distance between codewords y and y′ being (1/4)‖y − y′‖². It is also noted that the second constraint (Eq. (3)) helps avoid the trap of mapping all names to the same codeword, which would make the Hamming error zero while still satisfying the first and last constraints. It can be shown that the above minimization problem is NP-hard even for 1-bit codewords.
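For concreteness, a sketch of the bigram featurization and the similarity weight w(s, s′) described above follows. The fixed lowercase alphabet and dense vector layout are illustrative assumptions, and the edit distance shown is the optimal string alignment variant of Damerau-Levenshtein.

```python
import numpy as np

def bigram_features(token, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Map a token to a count vector over boundary-marked character bigrams,
    e.g. 'klein' -> {•k, kl, le, ei, in, n•}."""
    symbols = "\u2022" + alphabet                 # \u2022 is the boundary mark
    marked = "\u2022" + token.lower() + "\u2022"
    index = {a + b: i * len(symbols) + j
             for i, a in enumerate(symbols) for j, b in enumerate(symbols)}
    phi = np.zeros(len(symbols) ** 2)
    for k in range(len(marked) - 1):
        pos = index.get(marked[k:k + 2])          # skip out-of-alphabet bigrams
        if pos is not None:
            phi[pos] += 1
    return phi

def dl_distance(s, t):
    """Damerau-Levenshtein edit distance (optimal string alignment variant)."""
    d = np.zeros((len(s) + 1, len(t) + 1), dtype=int)
    d[:, 0] = np.arange(len(s) + 1)
    d[0, :] = np.arange(len(t) + 1)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i, j] = min(d[i, j], d[i - 2, j - 2] + 1)  # transposition
    return int(d[len(s), len(t)])

def pair_similarity(s, t):
    """w(s, s') = 1 - length-normalized edit distance, as described above."""
    return 1.0 - dl_distance(s, t) / max(len(s), len(t))
```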
Further, the optimal solution gives codewords only for the names in the training data. As it is ultimately desired that f be defined for all s, the out-of-training-sample extension problem can be handled by relaxing f as follows:
$$f_R(s) = A^T \varphi(s) = \big(a_1^T \varphi(s), \ldots, a_d^T \varphi(s)\big)^T \qquad (5)$$

where A = [a_1, …, a_d] ∈ R^{m×d}, and Φ denotes the matrix whose columns are the feature vectors φ(s) of the training names.
After the linear relaxation of Eq. (5), the first constraint (Eq. (2)) simply means that the data be centered, i.e., have zero mean. Here Φ is centered by subtracting the mean of Φ from every φ(s) in Φ to get $\hat{\Phi}$.
Given the above relaxation, the following optimization problem can be formulated:
$$\text{minimize:}\quad \mathrm{Tr}\big(A^T \hat{\Phi} L \hat{\Phi}^T A\big) \qquad (6)$$

s.t.:

$$A^T \hat{\Phi} \hat{\Phi}^T A = \rho^2 I_d \qquad (7)$$
where L is the graph Laplacian for the similarity matrix W defined by the pairwise similarities w(s, s′).
This minimization task can be transformed into a generalized Eigenvalue problem and solved efficiently using either Cholesky factorization or the QZ algorithm:
$$\hat{\Phi} L \hat{\Phi}^T A = \hat{\Phi} \hat{\Phi}^T A \Lambda \qquad (8)$$

where Λ is a d×d diagonal matrix.
Once A has been estimated from the training data, the codeword of a name s can be produced by binarizing each coordinate of f_R(s):
$$f(s) = \big(\mathrm{sgn}(a_1^T \varphi(s)), \ldots, \mathrm{sgn}(a_d^T \varphi(s))\big)^T \qquad (9)$$

where sgn(u) = 1 if u > 0 and −1 otherwise, for all u ∈ R.
It is noted that in one implementation, the top 32 Eigenvectors found via Eq. (8) were chosen to form the hash function, resulting in the output of a 32-bit codeword (i.e., d=32). It was found that a 32-bit codeword provided an acceptable tradeoff between retrieval accuracy and speed. However, it is not intended that the personal name spelling correction suggestion technique embodiments described herein be limited to 32-bit codewords. Codewords of other bit lengths can be employed as desired.
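The following sketch shows how the projection matrix A of Eq. (5) might be estimated by solving the generalized Eigenvalue problem of Eq. (8) with SciPy, and how Eq. (9) then produces a codeword. The small ridge term, and the reading of the "top" Eigenvectors as those with the smallest generalized eigenvalues (the minimizers of Eq. (6)), are assumptions of this sketch.

```python
import numpy as np
from scipy.linalg import eigh

def learn_monolingual_hash(Phi, W, d=32, eps=1e-6):
    """Phi: m x n matrix whose columns are feature vectors phi(s) of the
    training names; W: n x n similarity matrix with entries w(s, s').
    Solves Phi_hat L Phi_hat^T A = Phi_hat Phi_hat^T A Lambda (Eq. (8))."""
    Phi_hat = Phi - Phi.mean(axis=1, keepdims=True)   # centering (Eq. (2))
    L = np.diag(W.sum(axis=1)) - W                    # graph Laplacian of W
    M = Phi_hat @ L @ Phi_hat.T
    C = Phi_hat @ Phi_hat.T + eps * np.eye(Phi.shape[0])  # ridge (assumption)
    # eigh returns ascending eigenvalues; the d smallest minimize Eq. (6)
    # under the constraint of Eq. (7).
    _, A = eigh(M, C, subset_by_index=[0, d - 1])
    return A                                          # m x d projection matrix

def hash_token(A, phi):
    """Eq. (9): binarize each coordinate of f_R(s) = A^T phi(s)."""
    return np.where(A.T @ phi > 0, 1, -1)
```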
In view of the foregoing, one implementation of computing a hash function that maps any personal name in a particular language and misspellings thereof to similar binary codewords using monolingual data is accomplished as follows. Referring to
As indicated previously, learning hash functions using multilingual equivalent personal names as training data involves two or more different languages. For example, in one tested two-language implementation, the languages chosen were English and Hindi. However, the personal name spelling correction suggestion technique embodiments described herein are not limited to two-language implementations, or to the tested languages. For example, and without limitation, other languages (and scripts) that could be used include Russian, Greek, Hebrew and Arabic, among others. In addition, any combination of two or more languages can be used. For example, a three-language implementation might employ English, Hindi and Russian to learn the aforementioned hash functions.
In general, the task of learning hash functions using multilingual equivalent names is formulated as an optimization problem whose relaxation can be solved as a generalized Eigenvalue problem. For example, consider an implementation using two languages. Let (s, t) be a pair consisting of a name s and its equivalent t in a different language. Given the set T = {(s, t)} as the training data (in one tested two-language implementation, about 15,000 pairs of parallel single token names in English-Hindi were employed), let φ(s) ∈ R^m and ψ(t) ∈ R^{m′} be the feature vectors of s and t, respectively. Hash functions f and g, for the first and second language respectively, are sought that solve the following optimization problem:
$$\text{minimize:}\quad \sum_{(s,t)\in T} \lVert f(s) - g(t)\rVert^2 \qquad (10)$$

s.t.:

$$\sum_{s:(s,t)\in T} f(s) = 0 \qquad (11)$$

$$\sum_{t:(s,t)\in T} g(t) = 0 \qquad (12)$$

$$\sum_{s:(s,t)\in T} f(s)\,f(s)^T = \rho^2 I_d \qquad (13)$$

$$\sum_{t:(s,t)\in T} g(t)\,g(t)^T = \rho^2 I_d \qquad (14)$$

$$f(s),\; g(t) \in \{-1,1\}^d \qquad (15)$$

where I_d is the identity matrix of size d×d.
As it is desired that f (resp. g) be defined for all s (resp. t), f and g are relaxed as follows:
$$f_R(s) = A^T \varphi(s) \qquad (16)$$

$$g_R(t) = B^T \psi(t) \qquad (17)$$

where A = [a_1, …, a_d] ∈ R^{m×d} and B = [b_1, …, b_d] ∈ R^{m′×d}. As before, the first two constraints (Eqs. (11) and (12)) mean that the feature vectors are centered; the centered feature matrices are denoted $\hat{\Phi}$ and $\hat{\Psi}$.
Given the above relaxation, the following optimization problem can be formulated:
$$\text{minimize:}\quad \mathrm{Tr}\, H(A, B; \hat{\Phi}, \hat{\Psi}) \qquad (18)$$

s.t.:

$$A^T \hat{\Phi} \hat{\Phi}^T A = \rho^2 I_d \qquad (19)$$

$$B^T \hat{\Psi} \hat{\Psi}^T B = \rho^2 I_d \qquad (20)$$

where $H(A,B;\hat{\Phi},\hat{\Psi}) = (A^T\hat{\Phi} - B^T\hat{\Psi})(A^T\hat{\Phi} - B^T\hat{\Psi})^T$.
This minimization can be solved as a generalized Eigenvalue problem:
$$\hat{\Phi} \hat{\Psi}^T B = \hat{\Phi} \hat{\Phi}^T A \Lambda \qquad (21)$$

$$\hat{\Psi} \hat{\Phi}^T A = \hat{\Psi} \hat{\Psi}^T B \Lambda \qquad (22)$$

where Λ is a d×d diagonal matrix. Further, Equations (21) and (22) find the canonical coefficients of $\hat{\Phi}$ and $\hat{\Psi}$. Here again, in one implementation, the top 32 Eigenvectors found via Eqs. (21) and (22) were chosen to form the hash functions, resulting in the output of a 32-bit codeword (i.e., d=32).
As with monolingual learning, the codeword of s is obtained by binarizing the coordinates of f_R(s):

$$f(s) = \big(\mathrm{sgn}(a_1^T \varphi(s)), \ldots, \mathrm{sgn}(a_d^T \varphi(s))\big)^T \qquad (23)$$
It is noted that, as a byproduct, it is possible to hash names in the second language using g:

$$g(t) = \big(\mathrm{sgn}(b_1^T \psi(t)), \ldots, \mathrm{sgn}(b_d^T \psi(t))\big)^T \qquad (24)$$
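A sketch of the two-language learning step follows. Rather than iterating Eqs. (21) and (22) directly, it uses the standard canonical correlation reduction obtained by eliminating B, which is an equivalent route under the stated constraints; the ridge term is an added assumption for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def learn_bilingual_hash(Phi, Psi, d=32, eps=1e-6):
    """Phi: m x n features of training names; Psi: m' x n features of their
    equivalents in the second language (column i of each forms one pair)."""
    Phi_hat = Phi - Phi.mean(axis=1, keepdims=True)
    Psi_hat = Psi - Psi.mean(axis=1, keepdims=True)
    Cxx = Phi_hat @ Phi_hat.T + eps * np.eye(Phi.shape[0])
    Cyy = Psi_hat @ Psi_hat.T + eps * np.eye(Psi.shape[0])
    Cxy = Phi_hat @ Psi_hat.T
    # Eliminating B from Eqs. (21)-(22): Cxy Cyy^{-1} Cxy^T A = Cxx A Lambda^2.
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    m = Phi.shape[0]
    _, A = eigh(M, Cxx, subset_by_index=[m - d, m - 1])  # top d directions
    # Recover B from Eq. (22); the omitted positive per-column scale does not
    # affect the signs taken in Eqs. (23) and (24).
    B = np.linalg.solve(Cyy, Cxy.T @ A)
    return A, B
```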
Extension of the foregoing two-language example to one or more additional languages is straightforward. Let O = {o_i}, i = 1, …, n, be a set of multi-view data objects and x_i^{(k)} be the kth view of the object o_i, where x_i^{(k)} ∈ R^{m_k}. In the present setting, each view of an object is a featurized name in one of the K languages, and a hash function f^{(k)} is learned for each view.
As it is desired to enable cross-view similarity search through hashing, the hash functions should map similar objects to similar codewords over all the views. More specifically, if o_i and o_j are two similar data objects, each of the hash functions f^{(k)} should map o_i and o_j to similar codewords. Letting y_i^{(k)} = f^{(k)}(x_i^{(k)}) denote the codeword of the kth view of o_i, the Hamming distance between the codewords of o_i and o_j, summed over all the views and pairs of views, is:
$$d_{ij} = \sum_{k=1}^{K} d\big(y_i^{(k)}, y_j^{(k)}\big) + \sum_{k=1}^{K}\sum_{k'>k}^{K} d\big(y_i^{(k)}, y_j^{(k')}\big) \qquad (25)$$
Hash functions are sought that minimize the similarity-weighted Hamming distance between the codewords of the training data objects. Further, along the lines of the two-language hash learning scheme, two constraints are imposed: first, each bit has an equal chance of being 1 or −1; second, the bits are uncorrelated. This leads to the following problem, which is a generalization of Spectral Hashing to multi-view data objects:
$$\text{minimize:}\quad \sum_{i,j=1}^{n} W_{ij}\, d_{ij} \qquad (26)$$

s.t.:

$$Y^{(k)} e = 0,\quad k = 1,\ldots,K \qquad (27)$$

$$Y^{(k)} Y^{(k)T} = \rho^2 I_d,\quad k = 1,\ldots,K \qquad (28)$$

$$y_i^{(k)} \in \{-1,1\}^d \qquad (29)$$

where Y^{(k)} = [y_1^{(k)}, …, y_n^{(k)}], e is an n×1 vector of all 1s, and I_d is the identity matrix of size d×d. From Equations (25) and (26), it follows easily that:

$$\sum_{i,j} W_{ij}\, d_{ij} = \frac{1}{4}\Big[\sum_{k=1}^{K} \mathrm{Tr}\big(Y^{(k)} L' Y^{(k)T}\big) - \sum_{k=1}^{K}\sum_{k'\neq k}^{K} \mathrm{Tr}\big(Y^{(k)} W Y^{(k')T}\big)\Big] \qquad (30)$$
where L′ = 2L + (K−1)D, D is a diagonal matrix such that $D_{ii} = \sum_{j=1}^{n} W_{ij}$, and L = D − W is the graph Laplacian.
The foregoing optimization problem is NP-hard, as it reduces trivially to the earlier single-view optimization problem when K = 1, and the latter is known to be NP-hard.
Assume that y_i^{(k)} is a low-dimensional linear embedding of x_i^{(k)}, but make no assumption about the distribution of the data objects:
$$y_i^{(k)} = A^{(k)T} x_i^{(k)} \qquad (31)$$

where A^{(k)} = [a_1^{(k)}, …, a_d^{(k)}] ∈ R^{m_k×d}. The hash function for the kth view is then:
$$f^{(k)}(x^{(k)}) = \big(\mathrm{sgn}(\langle a_1^{(k)}, x^{(k)}\rangle), \ldots, \mathrm{sgn}(\langle a_d^{(k)}, x^{(k)}\rangle)\big)^T \qquad (32)$$
Post-relaxation, the problem can be rewritten as follows:

$$\text{minimize:}\quad \frac{1}{4}\Big[\sum_{k=1}^{K} \mathrm{Tr}\big(A^{(k)T} X^{(k)} L' X^{(k)T} A^{(k)}\big) - \sum_{k=1}^{K}\sum_{k'\neq k}^{K} \mathrm{Tr}\big(A^{(k)T} X^{(k)} W X^{(k')T} A^{(k')}\big)\Big] \qquad (33)$$

s.t.:

$$A^{(k)T} X^{(k)} e = 0,\quad k = 1,\ldots,K \qquad (34)$$

$$A^{(k)T} X^{(k)} X^{(k)T} A^{(k)} = \rho^2 I_d,\quad k = 1,\ldots,K \qquad (35)$$

where X^{(k)} = [x_1^{(k)}, …, x_n^{(k)}] is the matrix of kth-view feature vectors. The constraint in Eq. (34) simply means that the data objects should have zero mean in each of the views. This can be easily ensured by centering the data objects.
The relaxed objective function of Eq. (33) is convex in A^{(k)}, k = 1, …, K.
The problem of learning hash functions has now been transformed into a parameter estimation problem. To estimate the parameters A^{(k)}, the partial derivative of the objective with respect to each A^{(k)} is computed, which is proportional to:

$$X^{(k)} L' X^{(k)T} A^{(k)} - \sum_{k'\neq k}^{K} X^{(k)} W X^{(k')T} A^{(k')}$$

Setting these partial derivatives to zero, subject to the constraints of Eq. (35), again leads to a generalized Eigenvalue problem, paralleling the monolingual and two-language cases.
In view of the foregoing, one implementation of computing a hash function that maps any personal name in a particular language and misspellings thereof to similar binary codewords using multilingual data is accomplished as follows. Referring to
It is noted that the personal name tokens can be featurized prior to being used to compute the hash function. In cases where the name tokens are featurized, each token is represented as a vector of features, as are the equivalents of each token in the other languages. As such, the hash function is computed so as to map a featurized version of any token in the particular language, and featurized versions of misspellings of that token, to similar binary codewords using the featurized tokens derived from the training personal names and the one or more featurized equivalents of each of the tokens.
Once the aforementioned hash function is computed, a personal name directory made up of numerous names in the language associated with the hash function is indexed to make it queryable. More particularly, referring to
It is noted that the personal name tokens can be featurized prior to being hashed and indexed. In cases where the name tokens are featurized, each token is represented as a vector of features, and the hash function is applied to each featurized unique token to produce the binary codeword representation thereof.
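Pulling these indexing steps together, a minimal sketch follows. Here hash_fn stands for the learned hash function (featurization composed with Eq. (9), packed into an integer) and tokenize_name is the tokenizer sketched earlier; both names are assumptions of this sketch.

```python
from collections import defaultdict

def build_directory_index(directory_names, hash_fn):
    """Index a personal name directory: map each unique token's codeword to
    the tokens hashing to it, and each token to the names containing it."""
    token_to_names = defaultdict(set)
    for name in directory_names:
        for token in tokenize_name(name):
            token_to_names[token].add(name)
    code_to_tokens = defaultdict(set)
    for token in token_to_names:
        code_to_tokens[hash_fn(token)].add(token)
    return code_to_tokens, token_to_names
```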
Once the personal name directory index is built, it can be queried for personal name spelling correction suggestions. In general, this is accomplished by submitting a personal name for which spelling correction suggestions are sought. The submitted personal name query is then tokenized and hashed in the same manner as described previously in connection with building the directory index. The resulting binary codewords generated from the query name tokens are then compared to the codewords in the directory index to ultimately identify similar personal names. One or more of the discovered similar names are then provided to the querying user.
In one implementation, the tokenizing and hashing of the personal name included in the query is accomplished as follows. Referring to
If the personal name directory index was generated using featurized tokens, then the personal name included in the query would be featurized before hashing as well. To this end, in cases where the name tokens are featurized, each identified unique token from the personal name query is represented as a vector of features, and the hash function is applied to each featurized unique token from the personal name query to produce a binary codeword for each of these featurized tokens.
In operation, the foregoing querying procedure is accomplished in two stages: a name bucketing stage and a name matching stage. These two stages will now be described.
Given a personal name query that has been broken up into its constituent tokens Q = s_1 s_2 … s_I, each token s_i is hashed into a codeword y_i using the appropriate previously learned hash function (i.e., the hash function learned from the monolingual training names, or the hash function learned for the language of the query when multilingual training names were employed). For each of the resulting query codewords y_i, those codewords y′ in the previously built directory index that are at a prescribed distance (e.g., a Hamming distance) of r or less from y_i are identified. For example, in tested embodiments, a Hamming distance of 4 was used. The name tokens corresponding to each of the identified codewords are then retrieved from the index and ranked. In one implementation, this ranking involves the use of a unique token-level similarity scoring procedure.
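A sketch of this bucketing stage follows. Here token_similarity stands for the token-level scoring function described next, and hash_fn and the index structures come from the earlier sketches; all of these names are assumptions rather than elements of the original description.

```python
def name_bucketing(query, hash_fn, code_to_tokens, token_to_names,
                   token_similarity, r=4, top_k=100):
    """Name bucketing: gather directory tokens whose codewords lie within
    Hamming distance r of any query token's codeword, rank them by
    token-level similarity, keep the top_k, and return the candidate names."""
    query_tokens = tokenize_name(query)
    retrieved = set()
    for q in query_tokens:
        q_code = hash_fn(q)
        for code, tokens in code_to_tokens.items():
            if bin(q_code ^ code).count("1") <= r:
                retrieved.update(tokens)
    ranked = sorted(retrieved,
                    key=lambda t: max(token_similarity(q, t)
                                      for q in query_tokens),
                    reverse=True)[:top_k]
    candidates = set()
    for t in ranked:
        candidates |= token_to_names[t]
    return candidates
```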
In one implementation, this token-level similarity scoring entails the use of a logistic function applied over multiple distance measures to compute the similarity score between a token s from the query and a token s′ corresponding to an identified codeword retrieved from the index. For example, this token-level similarity scoring function can take the form of:

$$K(s, s') = \frac{1}{1 + \exp\big(\sum_i \alpha_i\, d_i(s, s')\big)}$$

where K(s, s′) is the token-level similarity score between s and s′, d_i is the ith distance measure, and α_i is a weighting factor for the ith distance measure.
While a variety of distance measures can be employed in the foregoing scoring function, two appropriate choices are the normalized Damerau-Levenshtein edit distance between s and s′ and the Hamming distance between the codewords of s and s′ (i.e., ‖f(s) − f(s′)‖). It is noted that for the latter measure, it has been found that the continuous relaxation ‖f_R(s) − f_R(s′)‖ provided better results than ‖f(s) − f(s′)‖ and hence can be used as a substitute as desired. It is further noted that the weighting factor α_i for each distance measure can be established empirically using a set of known similar and dissimilar examples.
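A sketch of this scoring function with the two distance measures just named follows. The exact logistic form and the weight values are assumptions here; as noted above, the weights are meant to be fitted empirically from known similar and dissimilar pairs.

```python
import math

def token_similarity(s, t, hash_fn, alpha_edit=6.0, alpha_ham=0.5):
    """K(s, s'): logistic function over a weighted sum of distance measures.
    The alpha weights are placeholders standing in for empirically fit values."""
    d_edit = dl_distance(s, t) / max(len(s), len(t))  # normalized DL distance
    d_ham = bin(hash_fn(s) ^ hash_fn(t)).count("1")   # codeword Hamming dist.
    return 1.0 / (1.0 + math.exp(alpha_edit * d_edit + alpha_ham * d_ham))
```

In the bucketing sketch above, hash_fn would be bound in advance, e.g. with functools.partial(token_similarity, hash_fn=hash_fn), so that the scorer can be called with just the two tokens.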
Once the name tokens corresponding to each of the aforementioned identified codewords retrieved from the index are ranked, a prescribed number (e.g., 100) of the highest ranking tokens are retained. The retained tokens are then used to retrieve all the personal names associated therewith in the personal name directory index to form a pool of candidate personal names for use in the name matching stage.
In one exemplary implementation, the foregoing name bucketing procedure is accomplished as follows. Referring to
When all the binary codewords associated with the personal name included in the personal name query have been selected and processed, a previously unselected one of the identified unique tokens is selected (710), and a token-level similarity measure is computed between the selected token and each token associated with the personal name query (712). It is then determined if there are any remaining previously unselected identified unique tokens (714). If so, process actions 710 through 714 are repeated. Otherwise, the identified unique tokens are ranked based on their computed token-level similarity measure (716). A prescribed number of the top ranking unique tokens are retained (718), and the personal names in the personal name directory index that include any of the retained top ranking unique tokens are designated as candidate personal names (720).
In general, the name matching task involves finding the best match, or up to a prescribed number (e.g., 10) of the top scoring matches, between the personal name query and the candidate personal names from the candidate pool. However, it is pointed out that the query and personal names in the candidate pool will typically have multiple name parts (i.e., multiple words or tokens making up the personal name). Thus, a measure of similarity between the full personal name in the query and each of the full candidate names in the candidate pool is computed. This can be done using the individual token-level similarity scores computed for each token associated with both the query and the names in the candidate pool. In one implementation, this multi-token name similarity measure is computed as follows.
Let Q = s_1 s_2 … s_I and D = s′_1 s′_2 … s′_J be two multi-token names, where, as before, Q corresponds to the personal name query, and where D corresponds to one of the candidate personal names from the candidate pool. To compute the similarity between Q and D, a weighted bipartite graph is formed with a node for each s_i and a node for each s′_j, and with the edge weight between each pair of nodes set to the previously computed token-level similarity measure K(s_i, s′_j). The weight (K_max) of the maximum weighted matching in this graph is then computed. The maximum weighted matching is the set of edges, no two of which share a node, having the greatest possible sum of individual edge weights. It is noted that in practice, a maximal matching computed using a greedy approach suffices, since many of the edges in the bipartite graph will typically have a low weight.
Given the foregoing, in one implementation the similarity between Q and D is computed as:
where K(Q, D) is the similarity score between the personal name query Q and a candidate personal name D, I is the number of tokens in the personal name query Q, and J is the number of tokens in the candidate personal name D.
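A sketch of the greedy maximal matching and a resulting name-level score follows. Since the exact formula combining K_max with I and J is not reproduced above, the normalization by max(I, J) used here is purely an illustrative assumption.

```python
def name_similarity(query_tokens, cand_tokens, token_sim):
    """Greedy maximal weighted bipartite matching between the tokens of the
    query Q and those of a candidate name D, per the greedy approximation
    noted above; token_sim is the token-level scoring function K(s, s')."""
    edges = sorted(((token_sim(s, t), i, j)
                    for i, s in enumerate(query_tokens)
                    for j, t in enumerate(cand_tokens)),
                   reverse=True)
    used_q, used_d, k_max = set(), set(), 0.0
    for w, i, j in edges:            # take the heaviest node-disjoint edges
        if i not in used_q and j not in used_d:
            used_q.add(i)
            used_d.add(j)
            k_max += w
    # Normalizing K_max by max(I, J) is an assumption, not the cited formula.
    return k_max / max(len(query_tokens), len(cand_tokens))
```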
In one exemplary implementation, the foregoing name matching procedure is accomplished as follows. Referring to
It is noted that the foregoing check to ensure the top ranking candidate personal names have a score that exceeds the similarity threshold is why it was indicated previously that up to a prescribed number of the top scoring matches are identified in the name matching task. While a prescribed number of candidate personal names are involved, some of them may not pass the similarity threshold test, and so may not make the final list of potential personal name corrections.
A brief, general description of a suitable computing environment in which portions of the personal name spelling correction suggestion technique embodiments described herein may be implemented will now be described. The technique embodiments are operational with numerous general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Device 10 may also contain communications connection(s) 22 that allow the device to communicate with other devices. Device 10 may also have input device(s) 24 such as keyboard, mouse, pen, voice input device, touch input device, camera, etc. Output device(s) 26 such as a display, speakers, printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.
The personal name spelling correction suggestion technique embodiments described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices. Still further, the aforementioned instructions could be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.
In the foregoing description of learning hash functions using multilingual equivalent names as training data, it was mentioned that the names could be in different scripts. In an alternate implementation, when a name token and/or its equivalent in another language is not in Latin script, it can be Romanized first and then featurized. For example, consider the Chinese equivalent of the name token "Michael". This is Romanized to "Maikeer" and then featurized into {•m, ma, ai, ik, ke, ee, er, r•}.
It is noted that any or all of the aforementioned embodiments throughout the description may be used in any combination desired to form additional hybrid embodiments. In addition, although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.