The discussion below is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
An automated directory assistance (DA) system is typically a spoken dialog system that provides a caller with the phone number and/or address of a desired business, government or residential listing. The DA system can be used to complement or replace traditional human-operated 411 services in order to cut costs.
A typical DA system includes two components, a speech recognizer and a search module. The speech recognizer receives a user's utterance and transforms it into text form. The text then serves as the basis for a query that is used by the search module in order to find the desired listing in a database of listings.
Due to errors made by the speech recognizer and the search module, the chance of finding the desired listing is typically well below 30% in most current DA systems if no special measure is taken to improve the search module. Speech recognition is nowhere near perfect. In fact, even using a domain-specific language model (LM) trained with all listings for a given city, the speech recognizer may still have a word error rate (WER) of about 70% under some conditions. Yet the search module often relies on exact matching between the query and the listing, and hence is not robust to speech recognition errors.
Another source of errors is variation in users' references to listings. For example, the listing “Kung Ho Cuisine of China” is often referred to as “Kung Ho restaurant” or “Kung Ho Chinese restaurant.” Users' utterances often contain other information in addition to a business name, such as the spelling of the name or the street it is on.
Yet another source of errors is inconsistent text normalization of listings in the database. For example, the 5 in “Big 5 Sporting Goods” is not normalized, while the 9 in “K nine” is. Whether or not inverse text normalization (ITN) is performed on the ASR output, some queries will therefore fail to match the listings on those terms.
This Summary and the Abstract herein are provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary and the Abstract are not intended to identify key features or essential features of the claimed subject matter, nor are they intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
An aspect herein described is a computer system for searching a database, such as, but not limited to, the listing database of a DA system. The database comprises a plurality of listings, which are not long documents; rather, each listing is a small plurality of terms. A query recognizer receives a query from a user. The query recognizer can take numerous forms. For example, the query recognizer can be a speech recognizer, a handwriting recognizer, or simply a module that accepts text input, such as through a keyboard, and formulates a query.
A search module receives the output from the query recognizer and uses it as the basis for searching the database for listing(s) matching the query. In particular, the search module uses a cosine similarity metric of a vector-space model, with vector elements weighted by term frequency-inverse document frequency (Tf/Idf) scores. (Henceforth, the search module is referred to as the “Tf/Idf algorithm.”) Use of the Tf/Idf algorithm in a DA system, or with other databases having listings or information similar to a listing, is scalable and robust.
Directory assistance system 104 includes a query recognizer 106 configured to process a query 101 by applying a grammar 108 so as to identify the content of the query. In one embodiment, not by limitation, query recognizer 106 is a speech recognition engine configured to apply a speech recognition grammar 108 so as to recognize the content of a query 101, which is an utterance spoken by a user 102. In another embodiment, query recognizer 106 is a handwriting recognition engine configured to apply a collection of handwriting characteristics 108 so as to recognize the content of a query 101, which is handwriting input by a user 102. In yet another embodiment, query recognizer 106 is a text recognition engine configured to apply a collection of text recognition characteristics 108 so as to recognize the content of a query 101, which is illustratively text input by a user 102. While aspects herein described are not limited to any particular method of recognition, solely for the purposes of illustrating an embodiment, it will be assumed that queries 101 are spoken, that query recognizer 106 is a speech recognition engine, and that grammar 108 is a speech recognition grammar.
The recognized content 109 of a query 101 is provided to a search module or component 110. Component 110 is illustratively configured to compare the recognized content against a database of listings 112. In this manner, component 110 identifies, from the database, one or more listings that match (or sufficiently surpass a pre-established threshold of similarity to) the recognized content of the query. The matching listing or listings (i.e., listings 103) are then provided to the user who initiated the corresponding query 101.
At this point it should be noted that aspects herein described are not limited to use only with DA system 104, but rather may be beneficial with numerous other database systems having listings, that is, where each entry or record has a small number of words (for example, in one embodiment, less than approximately 25 words). As the number of words in each entry goes down, the aspects herein described become more beneficial; thus, in yet a further embodiment, where each entry has less than approximately 10 words, enhanced performance can be realized over current techniques. Without limitation, other such databases can include listings or entries related to movies, books, articles, songs, actors, directors, authors, artists, library material, geographical material and financial information, to name just a few.
If the recognized content 109 is robustly normalized to the intended listings in the database 112, then better search results will be obtained. Generally, search capability can be provided by a statistical classifier, in which each listing is treated as a classification destination for users' queries 101.
A vector space model is commonly used in information retrieval, where both documents and queries are represented as vectors. The similarity between a document vector and a query vector, often measured as the cosine of the angle between the two vectors, determines the relevance of the document. Each element $w_i$ in a vector $w$ represents the weight (importance) of a term (e.g., a word or a bigram), which can be represented as:

$$w_i = -tf_i \cdot \log(df_i)$$
where the term frequency $tf_i$ is the frequency with which term $i$ occurs in a document, and the document frequency $df_i$ is the fraction of documents that contain term $i$. The inverse document frequency, $-\log(df_i)$, lowers the importance of a term that occurs frequently across different documents, such as words like “Inc.” and “Restaurant” in business listings. This weighting is commonly referred to as a term frequency-inverse document frequency (“Tf/Idf”) algorithm.
Given a document vector $w_d$ (in this case, a vector for a canonical business listing) and a query vector $w_q$, the relevance of the document to the query can be measured by:

$$\mathrm{sim}(w_d, w_q) = \frac{w_d \cdot w_q}{|w_d|\,|w_q|}$$
The document vectors $w_d$ are constructed from normalized listings. For example, the listing “Big 5 Sporting Goods” can be vectorized as [1*Idf(“Big”), 1*Idf(“Five”), 1*Idf(“Sporting”), 1*Idf(“Goods”)], where the term frequency for each of the four words is 1 and the Idf values for the four words are pre-computed. The normalization of the words (e.g., “5” → “five”) can be implemented according to a lexicon or grammar 108.
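Purely by way of illustration, a minimal Python sketch of the vectorization and cosine-similarity computation described above might look as follows; the function names and the toy listing set are hypothetical, not part of the embodiments:

```python
import math
from collections import Counter

def idf(term, listings):
    """Inverse document frequency: -log of the fraction of listings containing the term."""
    df = sum(term in listing for listing in listings) / len(listings)
    return -math.log(df) if df > 0 else 0.0

def vectorize(terms, listings):
    """Weight each term by its frequency in this listing times its Idf."""
    tf = Counter(terms)
    return {t: tf[t] * idf(t, listings) for t in tf}

def cosine(wd, wq):
    """sim(wd, wq) = wd . wq / (|wd| |wq|)."""
    dot = sum(wd[t] * wq.get(t, 0.0) for t in wd)
    norm = (math.sqrt(sum(v * v for v in wd.values()))
            * math.sqrt(sum(v * v for v in wq.values())))
    return dot / norm if norm else 0.0

# Normalized listings, each a small set of terms.
listings = [{"big", "five", "sporting", "goods"},
            {"kung", "ho", "cuisine", "of", "china"}]
wd = vectorize(["big", "five", "sporting", "goods"], listings)
wq = vectorize(["big", "five", "sports"], listings)
print(cosine(wd, wq))  # partial matches still score > 0
```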
In typical information retrieval, documents are generally long. Accordingly, the term frequency of a document resembles the true statistical distribution underlying the document. Listings, on the other hand, are generally short (typically, one to 25 words), so the term frequency may not represent the distribution well. Hence, a small change in a listing is more likely to yield different search results. This is illustrated in
A first aspect herein described is use of the Tf/Idf search algorithm to retrieve information from a listing database 112, such as in the DA system 104. In one form of prior system, the utterance is directly mapped to the listing(s) by the recognizer. In other words, an acoustic model (AM) probability and a language model (LM) probability are used. In its simplest form, the LM is just a list of the listing names constructed as a probabilistic context free grammar (PCFG), such that the LM path used by the decoder of the speech recognizer indicates the target listing. It has been observed, however, that the automation rate of a DA system with this simple formulation is extremely low, because callers usually don't know, don't remember, or don't say the exact name listed in the directory.
A natural improvement to this DA design is to model the different ways callers actually say a listing. First, however, this usually requires collecting a huge number of real calls. Second, the approach is not scalable, given that there are numerous ways in which different callers may refer to the same listing. If the application is the U.S. yellow pages, which has 18 million listings, the permutations are virtually endless. As a result, this approach is practical only when the listing database is small.
In contrast, use of the Tf/Idf algorithm in a DA system, or with other databases containing listings or information similar to a listing (i.e., where there are a substantial number of entries, each of which has approximately 25 or fewer words, and in a further embodiment, 10 or fewer words), is scalable and robust. Generally, as a further aspect herein described, one or more of the listings is modified to include information in addition to, or in substitution for, terms present in the listing. This additional information is searchable, thereby increasing the dimensions of the vector space. In particular, but optionally, one or more of the techniques described below can be employed, for example, to improve robustness.
In a first technique or method illustrated in
It should be understood that the manner in which duplicate terms are represented can take numerous forms. The example herein provided (i.e. “—2nd”, “—3rd”, etc.) is just one form for purposes of illustration only, and therefore should not be considered limiting.
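As a sketch only, and assuming the suffix convention of the example above (rendered in code with plain hyphens), duplicate terms in a listing might be made unique as follows; the function name and suffix table are hypothetical:

```python
from collections import Counter

ORDINALS = {2: "-2nd", 3: "-3rd"}  # illustrative suffixes only

def mark_duplicates(terms):
    """Append an ordinal marker to the 2nd, 3rd, ... occurrence of a repeated term,
    so each occurrence becomes a distinct indexing unit."""
    seen = Counter()
    out = []
    for t in terms:
        seen[t] += 1
        n = seen[t]
        out.append(t if n == 1 else t + ORDINALS.get(n, f"-{n}th"))
    return out

print(mark_duplicates(["new", "york", "new", "york", "hotel"]))
# ['new', 'york', 'new-2nd', 'york-2nd', 'hotel']
```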
In some embodiments, database 112 is a structured database. In other words, some if not all entries include data in two or more fields. Using the DA system 104 by way of example, each entry in database 112 would include a first field for the name of the person, business, etc., a second field for the telephone number and one or more fields for the address or portions of the address (street number, street name, city, state, zip code).
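By way of illustration only, such a structured entry might be represented as follows; the field names and sample values are assumptions, not fields mandated by the embodiments:

```python
from dataclasses import dataclass

@dataclass
class Listing:
    """One structured directory entry; field names are illustrative only."""
    name: str           # e.g., "Big 5 Sporting Goods"
    phone: str
    street_number: str
    street_name: str
    city: str
    state: str
    zip_code: str

entry = Listing(name="Big 5 Sporting Goods", phone="555-0100",
                street_number="123", street_name="Main St",
                city="Redmond", state="WA", zip_code="98052")
```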
Another method illustrated in
It should be noted that suitable forms of word normalization (e.g., “5” to “five”) and/or word breaking (e.g., “outback” to “out” and “back”) can be used by search module 110 (as well as for recognition by query recognizer 106), if desired, but further elaboration is not necessary for purposes of this description.
Similarly, a portion of the query can be used for disambiguation as illustrated in
The next two techniques are particularly beneficial for making the search robust to speech recognition errors. In a first technique, instead of using words as indexing units (terms) in the Tf/Idf search, character n-grams are used. Referring to
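By way of illustration only, the following sketch (with a hypothetical function name and an assumed n-gram order of 3) shows how a listing or recognized query might be broken into character n-gram indexing units:

```python
def char_ngrams(text, n=3):
    """Index a listing or query by overlapping character n-grams instead of words."""
    s = text.lower().replace(" ", "_")   # keep word boundaries visible as '_'
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# A recognition error such as "cuisine" -> "cousine" still shares most
# trigrams with the intended listing, so the Tf/Idf match degrades
# gracefully instead of failing on an exact word mismatch.
print(char_ngrams("kung ho cuisine"))
```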
In the second technique, instead of using character n-grams as indexing units in the Tf/Idf search, phone n-grams are used. Referring to
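Again solely for illustration, and assuming a pronunciation lexicon of the kind the recognizer already uses (the lexicon contents and function name below are hypothetical), phone n-gram indexing units might be produced as follows:

```python
def phone_ngrams(words, lexicon, n=3):
    """Index by n-grams over the phone sequence of a listing or recognized query.

    `lexicon` maps a word to its phone sequence; a real system would use the
    pronunciation dictionary from the speech recognizer.
    """
    phones = [p for w in words for p in lexicon[w]]
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

# Hypothetical pronunciations for illustration only.
lexicon = {"kung": ["k", "uh", "ng"], "ho": ["hh", "ow"]}
print(phone_ngrams(["kung", "ho"], lexicon))
```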
In the technique of
In the technique of
A yet further technique is illustrated in
Another type of scoring function that can be used in combination with the Tf/Idf search result scores is a form of “edit distance”, for example, character edit distance. Generally, an “edit distance” between two patterns A and B is defined as the minimum number of changes (insertions, substitutions or deletions) that have to be made to the first in order to obtain the second. If the associated insertion and deletion costs are the same, the edit distance is symmetric. For instance, the similarity between each of the strings is computed using a modified version of the dynamic programming algorithm for edit-distance calculation (R. A. Wagner and M. J. Fischer, “The String-to-String Correction Problem,” Journal of the ACM, 1974).
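For reference, a minimal sketch of the classic Wagner-Fischer dynamic program follows. This is the standard algorithm, not the modified version referenced above, and the cost parameters are illustrative:

```python
def edit_distance(a, b, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum-cost sequence of insertions, deletions and substitutions
    turning string a into string b (symmetric when ins_cost == del_cost)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost          # delete all of a's prefix
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost          # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost,
                d[i][j - 1] + ins_cost,
                d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost))
    return d[m][n]

print(edit_distance("kung ho cuisine", "kung ho restaurant"))
```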
Two measures are provided, a Normalized Match Ratio (NMR) and a Normalized Order Ratio (NOR), as:
Both of these measures can be understood intuitively. NMR looks at the number of matching characters between a listing and the query, referenced to the number of characters in the query. Ideally, the NMR measure would be one. Clutter in a particular listing is reflected by NOR: if a particular listing has many characters that do not match the query, the denominator increases, and thus NOR is inversely proportional to the clutter in that listing. In one embodiment, the goal is to rank all the search result listings based on a function of their NMR and NOR scores. A simple ranking function RF can be the product of the score from the Tf/Idf algorithm and the combination of the NMR and NOR scores:
$$RF = (NMR + NOR) \times \mathrm{Cosine}(query, listing)$$
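A sketch of this ranking function follows. Because the NMR and NOR equations are not reproduced above, the definitions used here (in-order character matches normalized by query length and by listing length, respectively) are assumptions inferred from the surrounding description, not the exact patented measures:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence: in-order character matches."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def ranking_function(query, listing, cosine_score):
    """RF = (NMR + NOR) x Cosine(query, listing).

    NMR and NOR are approximated here as assumptions: in-order character
    matches divided by query length (NMR) and by listing length (NOR), so
    extra non-matching characters in a listing pull NOR down.
    """
    matches = lcs_len(query, listing)
    nmr = matches / len(query)
    nor = matches / len(listing)
    return (nmr + nor) * cosine_score

print(ranking_function("kung ho", "kung ho cuisine of china", 0.6))
```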
In addition to the examples herein provided, other well known computing systems, environments, and/or configurations may be suitable for use with concepts herein described. Such systems include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The concepts herein described may be embodied in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Those skilled in the art can implement the description and/or figures herein as computer-executable instructions, which can be embodied on any form of computer readable media discussed below.
The concepts herein described may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 1310 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 1310 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 1310.
The system memory 1330 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 1331 and random access memory (RAM) 1332. A basic input/output system 1333 (BIOS), containing the basic routines that help to transfer information between elements within computer 1310, such as during start-up, is typically stored in ROM 1331. RAM 1332 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 1320.
The computer 1310 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 1310 through input devices such as a keyboard 1362, a microphone 1363, and a pointing device 1361, such as a mouse, trackball or touch pad. These and other input devices are often connected to the processing unit 1320 through a user input interface 1360 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port or a universal serial bus (USB). A monitor 1391 or other type of display device is also connected to the system bus 1321 via an interface, such as a video interface 1390.
The computer 1310 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 1380. The remote computer 1380 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 1310. The logical connections depicted in
When used in a LAN networking environment, the computer 1310 is connected to the LAN 1371 through a network interface or adapter 1370. When used in a WAN networking environment, the computer 1310 typically includes a modem 1372 or other means for establishing communications over the WAN 1373, such as the Internet. The modem 1372, which may be internal or external, may be connected to the system bus 1321 via the user-input interface 1360, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 1310, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
It should be noted that the concepts herein described can be carried out on a computer system such as that described with respect to
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above as has been determined by the courts. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.