This invention generally relates to voice recognition systems, and more particularly relates to voice recognition systems used to retrieve items from long lists of items.
Current voice recognition systems require a user to speak the entire name of an item before it can be identified within a collection or list of items. In the case of music players, a user would need to recite the entire album or song title before it could be properly located. For example, in order to play songs from the album “Hotel California,” the user is required to recite “Hotel California.” If the user recites only a portion of the title (e.g., “Hotel” or “California”), many speech engines will not return “Hotel California” but will typically return a similar-sounding entry instead (e.g., “Go Tell” for “Hotel”). By requiring the user to state the selection title in exact terms, unnecessary rigidity is introduced into the selection process, thereby rendering the voice recognition tools inconvenient for the user to use and master. The problem becomes particularly acute as databases grow larger, inasmuch as the user must be able to recall with specificity a significant number of titles (potentially thousands).
Some attempts have been made to overcome the above-referenced problem by phonetically transcribing each word found within a phrase. In the example above, the words “Hotel” and “California” would be combined in context with one another, and the speech engine would return the result “Hotel” + “California.” This solution at first seems ideal; however, with large lists comprising several thousand songs, the recognition rate drops significantly, because recognizing individual words may require, on average, three to four times more dictionary entries (one entry for each word in a title) than matching against complete phrases or full titles.
It is therefore desirable to provide for a flexible speech recognition system that recognizes either full phrases or partial words from long lists of items in a manner that is easy to use.
The present invention solves or minimizes the problem associated with multi-context voice recognition search of long lists by conducting simultaneous, multiple dictionary searches. To achieve this and other advantages, in accordance with the purpose of the present invention as embodied and described herein, one aspect of the present invention provides for a speech recognition system having an audio input device for accepting a phrase-based search request. The system has a first speech engine coupled to a phrase-based dictionary for searching out matches between the phrase-based search request and one or more entries in the phrase-based dictionary. The system further includes a second speech engine coupled to a keyword-based dictionary for searching out matches between one or more component words of the phrase-based search request and one or more entries in the keyword-based dictionary.
According to another aspect of the present invention, a method of searching lists of items is provided. The method includes the steps of receiving a spoken request from a user, and passing the spoken request to first and second speech engines. The first speech engine is loaded with a phrase-based dictionary, and the second speech engine is loaded with a keyword-based dictionary created by parsing keywords from the phrase-based dictionary. The method also includes the step of comparing the spoken request with entries contained in the phrase-based dictionary. The method further includes a step of comparing the spoken request with entries contained in the keyword-based dictionary. The method further generates one or more speech recognition matches based on the steps of comparing.
In one embodiment, the present invention solves the problem associated with finding an entry from a long list of items. One embodiment of the present system allows a user to articulate a complete name or, in the alternative, to conduct a search on one or more words from a complete name.
These and other features, advantages and objects of the present invention will be further understood and appreciated by those skilled in the art by reference to the following specification, claims and appended drawings.
The present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Now referring to
Once the user 10 has audibly placed a request (e.g., “Dancing Queen”), the audible request 12 is captured by an audio transducer (e.g., a microphone) 14. The audio signal captured by audio transducer 14 is converted into an electrical signal that is then processed by an analog-to-digital (A/D) conversion system (e.g., a codec device) 16, which converts the electrical signal into digital voice data. The voice data is transferred to a dual speech buffer 18, which may be contained within the voice (speech) recognition engines or reside as a separate entity outside the voice recognition engines. Dual speech buffer 18 creates a voice data stream which represents the entire string as dictated by user 10. This voice data is passed to a first speech engine 22 along data path 20. Speech engine 22 searches the entire string passed to it by dual speech buffer 18 by comparing the text of the entire string to the entries found in the entire phrase dictionary 24. Because speech engine 22 searches using the exact title/album string, it looks for an exact match within phrase dictionary 24, and by its nature this process will have greater accuracy than a word-by-word search method. Any matches found by speech engine 22 between the text passed to it along data path 20 and a text entry found in phrase dictionary 24 are sent along to the results manager 26, wherein the results found by speech engine 22 are presented to the user for selection.
Entire phrase dictionary 24 is populated by way of information stored in music metadata database 36, according to one embodiment. Music metadata database 36 stores music titles, artists' names, and other metadata associated with songs. Although it is possible to store the actual song content in the metadata database 36, in most applications it is preferable to store the actual song content in compressed format (e.g., MP3 files) in a separate database. This database can be on the same media used to store the music metadata database 36 or it can be on a separate storage device (e.g., an SD card). Every time that a new song is offered to the user for selection, the metadata for the song is loaded into the metadata database. A “grapheme to phoneme” (G2P) converter 25 accepts the text-based information stored in the music metadata database 36 and converts it to phonemes or symbols that are recognized by the speech engine 22.
In addition to presenting the voice data to speech engine 22 along data path 20, dual speech buffer 18 also presents the voice data to a second speech engine 30 by way of data path 28. Unlike speech engine 22, speech engine 30 attempts to match each word spoken by user 10 to keywords in keyword dictionary 32. In each instance where speech engine 30 successfully matches a word or multiple words sent from dual speech buffer 18 against an entry found in keyword dictionary 32, a request is issued by speech engine 30 along data path 34 to retrieve all entries within the music metadata database 36 that contain the words matched by speech engine 30. The entries retrieved from metadata database 36 are sent to results manager 26, where they are displayed to the user for final selection.
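By way of illustration only, the retrieval of all entries containing the matched words might be sketched as follows. The sketch assumes the matched words are already available as text and uses a simple in-memory index in place of music metadata database 36; the names `build_index` and `retrieve_entries` are hypothetical and do not appear in the described system.

```python
from collections import defaultdict

def build_index(titles):
    """Map each lowercase word to the set of titles that contain it.

    Stand-in (assumption) for a lookup against music metadata database 36.
    """
    index = defaultdict(set)
    for title in titles:
        for word in title.lower().split():
            index[word].add(title)
    return index

def retrieve_entries(index, matched_words):
    """Return every title containing at least one of the matched words."""
    results = set()
    for word in matched_words:
        # .get() avoids creating empty entries for unknown words
        results |= index.get(word.lower(), set())
    return sorted(results)

index = build_index(["Hotel California", "California Dreamin'", "Dancing Queen"])
print(retrieve_entries(index, ["California"]))
```

In this sketch, a single matched keyword such as "California" retrieves every title containing it, which corresponds to the list sent on to results manager 26.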
A parser 27 parses the individual words stored in music metadata database 36 and presents them to the G2P converter 29. In the embodiment shown and described herein, the parser 27 is typically a software program that separates (or “parses”) the individual words from a song title. In some embodiments, not every word from a song title is kept. For example, key words may be retained from a song title but other, nonessential words (such as “the,” “and,” “or,” “but,” etc.) can be eliminated. “Grapheme to phoneme” (G2P) converter 29 performs the same function as that already discussed in conjunction with G2P converter 25. Alternatively, the processing performed by parser 27 and G2P converter 29 (the phonetic transcription process) may be carried out in advance to create the dictionaries.
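By way of illustration only, a parser of the kind described might be sketched as follows. The stop-word list contains only the four example words named above; the tokenizing rule and the names `parse_keywords` and `build_keyword_dictionary` are assumptions made for this sketch, and the G2P conversion step is omitted.

```python
import re

# Nonessential words named in the description; a real list may differ.
STOP_WORDS = {"the", "and", "or", "but"}

def parse_keywords(title, stop_words=STOP_WORDS):
    """Split a title into lowercase words and drop nonessential ones."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return [w for w in words if w not in stop_words]

def build_keyword_dictionary(titles):
    """Collect the distinct keywords found across all titles."""
    keywords = set()
    for title in titles:
        keywords.update(parse_keywords(title))
    return keywords

print(parse_keywords("The Sound of Silence"))
print(sorted(build_keyword_dictionary(["Hotel California", "Dancing Queen"])))
```

Each keyword produced this way would then be passed to a G2P converter before being stored in the keyword dictionary.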
In one embodiment, first speech engine 22 and second speech engine 30 run concurrently on separate software threads at generally similar thread priorities. Under this arrangement, first speech engine 22 will generally complete its search ahead of second speech engine 30 because the context in which speech engine 22 must search is narrower in scope than the context in which speech engine 30 must search (i.e., a word-by-word search will generally take longer than a phrase-based search, provided speech engines 22 and 30 are processed at the same speed). In another embodiment, first speech engine 22 executes first and, after it completes its voice recognition, second speech engine 30 executes (sequential recognition). This embodiment does not take advantage of parallel processing, which may result in a longer overall recognition time; however, it saves memory since only one engine is running at a time.
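By way of illustration only, the concurrent and sequential arrangements might be sketched as follows. The two engine functions are simplified stand-ins that operate on text rather than on voice data, and the names `phrase_engine`, `keyword_engine`, `recognize_concurrently`, and `recognize_sequentially` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

PHRASES = ["Hotel California", "Dancing Queen"]
KEYWORDS = {"hotel", "california", "dancing", "queen"}

def phrase_engine(request):
    """Stand-in for speech engine 22: match the whole request to a full title."""
    return [p for p in PHRASES if p.lower() == request.lower()]

def keyword_engine(request):
    """Stand-in for speech engine 30: match each component word to a keyword."""
    return [w for w in request.lower().split() if w in KEYWORDS]

def recognize_concurrently(request):
    """Run both engines on separate threads and collect both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        phrase_future = pool.submit(phrase_engine, request)
        keyword_future = pool.submit(keyword_engine, request)
        return phrase_future.result(), keyword_future.result()

def recognize_sequentially(request):
    """Run the phrase engine first, then the keyword engine."""
    return phrase_engine(request), keyword_engine(request)

print(recognize_concurrently("Dancing Queen"))
print(recognize_sequentially("Hotel"))
```

Both arrangements produce the same matches in this sketch; they differ only in whether the two searches overlap in time and how many engines are resident at once.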
An HMI (human-machine interface) can be used to dynamically populate the list compiled by the results manager 26 as results arrive: the results of the phrase search by speech engine 22 will generally appear first, and shortly thereafter speech engine 30 will return the results of its word-by-word search. If user 10 recites an entire item (e.g., an entire song title), speech engine 22 will return accurate results quickly. If user 10 states fewer than all of the words used in the entire title, speech engine 30 will return the results after a more extended period of time. Results manager 26 will display to the user 10 both the “exact match” results found by speech engine 22 and the “word-by-word” matches found by speech engine 30. In one embodiment, the search results can be listed and sorted using any number of sorting schemes, including alphabetically or by confidence level as determined by speech engines 22 and 30.
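By way of illustration only, the merging and sorting performed by a results manager might be sketched as follows. The `Match` record, the confidence scale, and the rule of keeping only the highest-confidence copy of a duplicated title are assumptions made for this sketch and are not specified in the description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    title: str
    confidence: float  # hypothetical score between 0 and 1
    source: str        # "phrase" (engine 22) or "keyword" (engine 30)

def merge_results(phrase_matches, keyword_matches, sort_by="confidence"):
    """Combine matches from both engines and sort them for display."""
    best = {}
    for m in list(phrase_matches) + list(keyword_matches):
        # Assumption: when both engines return a title, keep the better score.
        if m.title not in best or m.confidence > best[m.title].confidence:
            best[m.title] = m
    merged = list(best.values())
    if sort_by == "alphabetical":
        return sorted(merged, key=lambda m: m.title.lower())
    return sorted(merged, key=lambda m: m.confidence, reverse=True)

phrase = [Match("Hotel California", 0.9, "phrase")]
keyword = [Match("Hotel California", 0.6, "keyword"),
           Match("California Dreamin'", 0.7, "keyword")]
for m in merge_results(phrase, keyword):
    print(m.title, m.confidence, m.source)
```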
The various components of the speech recognition system may be executed on one or more microprocessors and memory, as should be apparent to those skilled in the art. The speech engines 22 and 30 may be implemented as software stored in memory and processed by one or more microprocessors. The first and second speech engines 22 and 30 may be implemented in software having different objects or different instances of the same object for performing the phrase and keyword searches. The entire phrase dictionary 24, keyword dictionary 32, and music metadata database 36 may be located in memory that is readable by the microprocessor(s). The memory may include random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, and other memory media. The dual speech buffer 18 may likewise be implemented in memory. It should be appreciated that the various components of the speech recognition system may otherwise be implemented with analog and/or digital circuitry, without departing from the teachings of the present invention.
In view of the above description, it has been demonstrated that the speech recognition system and method of the present invention allow a user to audibly state an item and effectively use the information spoken by the user to provide the best matches for the item (e.g., an album or song title) currently found in a reference database.
Although the present invention has been primarily discussed in the context of retrieving album titles, artist names, and song titles from a resident database according to an exemplary embodiment, it is also contemplated that the search methodology set forth herein is equally beneficial for use in any system which requires a user to select one or more items from a long list of stored items (such as road names and the like).
Now referring to
In the “keyword” thread step 58, each component word contained within the voice data is compared in step 66 against entries in a “keyword” dictionary 32. If any matches are found in step 68 between the component words from the “entire string” text and the word entries found within the “keyword” dictionary 32, these matches are then sent to a database to search for items containing any or all of the keywords returned by the speech engine 30. Different types of queries can be sent to the database to search on the keywords returned by the speech engine 30. The items retrieved from the database are then displayed to the user for final selection in step 70. If no matches above a certain confidence are found between the component words from the “parsed string” text and the entries within the “keyword” dictionary 32, control is passed back to the beginning of the method at step 50.
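By way of illustration only, the two query types mentioned above (items containing any of the keywords, or items containing all of them) might be sketched against an SQLite database as follows. The `songs` table, its `title` column, and the function `query_titles` are hypothetical. The sketch uses substring matching with `LIKE`, which is a simplification: it would also match a keyword embedded inside a longer word.

```python
import sqlite3

def query_titles(conn, keywords, mode="any"):
    """Return titles containing any or all of the keywords (substring match)."""
    if not keywords:
        return []
    joiner = " OR " if mode == "any" else " AND "
    # One parameterized LIKE condition per keyword
    where = joiner.join("title LIKE ?" for _ in keywords)
    params = [f"%{k}%" for k in keywords]
    rows = conn.execute(
        f"SELECT title FROM songs WHERE {where} ORDER BY title", params
    )
    return [row[0] for row in rows]

# Hypothetical in-memory stand-in for the metadata database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (title TEXT)")
conn.executemany(
    "INSERT INTO songs VALUES (?)",
    [("Hotel California",), ("California Dreamin'",), ("Dancing Queen",)],
)

print(query_titles(conn, ["hotel", "california"], mode="any"))
print(query_titles(conn, ["hotel", "california"], mode="all"))
```

An "any" query favors recall when the user remembers only part of a title, while an "all" query narrows the list when several keywords are recognized with confidence.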
Accordingly, the speech recognition system and method of the present invention advantageously solve or minimize the problem associated with searching for an item in long item lists by conducting simultaneous, multiple dictionary searches. The invention may allow a user to articulate a complete name or, in the alternative, to conduct a search on one or more words from a complete name.
It will be understood by those who practice the invention and those skilled in the art, that various modifications and improvements may be made to the invention without departing from the spirit of the disclosed concept. The scope of protection afforded is to be determined by the claims and by the breadth of interpretation allowed by law.