The invention relates to a speech recognition system, and more particularly to a system that generates a vocabulary for a speech recognizer.
Speech recognition systems may interface users to machines. Some speech recognition systems may be configured to process a received speech input and control a connected device. When speech is received, some speech recognition systems search through a large number of stored speech patterns to try to match the input. If the speech recognition system has limited processing resources, a user may notice poor system performance. Therefore, a need exists for an improved speech recognition system.
A speech recognition system receives digital data. The system determines whether a memory contains some or all of the digital data. When some or all of the digital data does not exist in the memory, the system generates a transcription of the missing portion and stores the missing portion and a corresponding transcription in the memory.
The speech recognition system includes an interface, a processor, and a memory. The interface receives digital data from an external source. The processor determines whether some or all of the received digital data exists in the memory. Digital data missing from the memory is transcribed and the digital data along with the transcription are stored in the memory.
Other systems, methods, features and advantages will be or will become apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the invention, and be protected by the following claims.
The system may be better understood with reference to the following drawings and description. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. Moreover, in the figures, like reference numerals designate corresponding parts throughout the different views.
In some speech recognition systems 100, the input device 102 may include several microphones coupled together, such as a microphone array. Signals received from the microphone array may be processed by a beamformer which may exploit the lag time from direct and reflected signals arriving from different directions to obtain a combined signal that has a specific directivity. This may be particularly useful if the speech recognition system is used in a noisy environment, such as in a vehicle cabin or other enclosed area.
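Purely by way of illustration, a simple delay-and-sum beamformer can be sketched as follows. The array geometry, sampling rate, and look direction in this sketch are assumed for illustration only; the description above does not specify the beamforming method used.

```python
# Minimal delay-and-sum beamformer sketch (illustrative only; not the
# specific beamforming approach of the described system).
import numpy as np

def delay_and_sum(signals, mic_positions, angle_rad, fs, c=343.0):
    """Steer a linear microphone array toward `angle_rad` by delaying and
    summing the channels.

    signals: array of shape (num_mics, num_samples)
    mic_positions: microphone positions along the array axis in meters
    angle_rad: look direction relative to broadside, in radians
    fs: sampling rate in Hz; c: speed of sound in m/s
    """
    num_mics, num_samples = signals.shape
    out = np.zeros(num_samples)
    for m in range(num_mics):
        # Plane-wave delay for the desired look direction.
        delay = mic_positions[m] * np.sin(angle_rad) / c
        shift = int(round(delay * fs))
        # np.roll performs a circular shift; adequate for a short sketch.
        out += np.roll(signals[m], -shift)
    return out / num_mics
```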
The speech recognition system in
An interface 105 may receive digital data representing information that may be used by the speech recognizer 101 to control a connected device. The interface may be configured to receive the digital data through a network connection. The network connection may use a wireless protocol. In some speech recognition systems 100, the wireless protocol may be the Radio Data System (“RDS”) or Radio Broadcast Data System (“RBDS”), which may transmit data relating to a radio station's name, abbreviation, program type, and/or song information. Other wireless protocols may include Bluetooth®, WiFi, UltraBand, WiMax, Mobil-Fi, Zigbee, or other mobility connections or combinations.
The digital data received by the interface 105 may be used to provide additional vocabulary data to the speech recognizer 101. A processor 110 may be coupled to the interface 105. The processor 110 may determine whether some or all of the received digital data is present in a memory 107. The processor 110 may receive digital data and may separate the data into data fragments according to categories. These categories may include letters, numbers, and/or special characters. A data fragment may include one character or a sequence of several characters. A character may include letters, numbers (digits), and/or special characters, such as a dash, a blank, or a dot/period.
The memory 107 may be configured as a look-up table comprising lists of digital data and corresponding transcriptions of the digital data. The processor 110 may be coupled to the memory 107 and may determine whether some or all of the received data is present in the memory 107 by comparing a data fragment to the list of entries stored in the memory 107.
The processor 110 may also be configured to generate phonetic transcriptions of some or all of the received digital data if it is determined that the digital data is not already stored in the memory 107. The processor 110 may include a text-to-speech module and/or software configured to phonetically transcribe received digital data that is not present in the memory 107. The phonetic transcription may include generating data representing a spelled form, a pronounced form, or a combined spelled and pronounced form of a data fragment. A spelled form may generate data where each character of the data fragment is spelled. In pronounced form, a sequence of characters may be pronounced or enunciated as a whole word. In a combined form, part of the data fragment may be spelled and another part may be pronounced. The form of a phonetic transcription may depend on various criteria. These criteria may include the length of a data fragment (number of characters), the type of neighboring fragments, the presence of consonants and/or vowels, and/or the presence of upper or lower case characters. For exemplary purposes, a data fragment consisting of only consonants may be phonetically transcribed in spelled form.
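As a rough illustration of how such criteria might be applied, the following sketch selects a transcription form for a data fragment. The specific thresholds and rules are assumptions made for illustration and are not taken from the description above.

```python
def choose_transcription_form(fragment, max_spelled_len=4):
    """Heuristic sketch: choose spelled, pronounced, or combined form for a
    data fragment. The thresholds and rules are illustrative only."""
    vowels = set("aeiouAEIOU")
    has_vowel = any(ch in vowels for ch in fragment)
    if fragment.isdigit():
        return "pronounced"            # e.g. "4" is spoken as "four"
    if not has_vowel:
        return "spelled"               # e.g. "SWR" is spelled "S W R"
    if len(fragment) <= max_spelled_len and fragment.isupper():
        return "combined"              # part spelled, part pronounced
    return "pronounced"
```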
Each data fragment and corresponding phonetic transcription may be stored in the memory 107 which is also accessible by the speech recognizer 101. Alternatively, the data fragment and corresponding phonetic transcription could be passed to the speech recognizer 101 and stored in memory 104 or stored in a memory internal to the processor 110.
In an alternate speech recognition system 100, memory 107 may be integrated with or coupled to the processor 110. In other speech recognition systems 100, the phonetic transcription may be performed by a device external to the processor 110.
At act 202, the digital data “SWR 4 HN” may be decomposed (e.g., separated) according to predetermined categories. The predetermined categories may include “letters,” “numbers,” and/or “special characters.” The digital data “SWR 4 HN” may be categorized as “letters” and “numbers.” Analysis of the digital data “SWR 4 HN” may start with the left-most character, which is the “S.” This character could be categorized as a “letter.” The subsequent characters “W” and “R” would also be categorized as “letters.” After these three letters, there is a blank, which may be categorized as a “special character.” The character “4” may be categorized as a “number.” Therefore, the sequence of characters belonging to the same category, namely the category “letters,” is terminated and a first data fragment “SWR” is determined. The following blank constitutes a next fragment.
The number “4” is followed by a blank and, then, by the character “H,” which is categorized as a “letter.” Therefore, another fragment is determined to consist of the number “4.” This fragment is categorized as “numbers.” The letter “H” is followed by the letter “N.” These form a last fragment consisting of the letters “H” and “N.” As a result, the digital data “SWR 4 HN” could be decomposed into the fragments “SWR,” “4,” and “HN,” and two special character fragments consisting of blanks.
Other variants of decomposing the digital data may be used. The data may be decomposed into different parts that are separated from one another by a blank or by a special character such as a dash or a dot. A system may perform the decomposition into letters and numbers as described above. In the “SWR 4 HN” example, decomposition into sequences of characters separated by a blank would already yield the three fragments “SWR,” “4,” and “HN” and the two special character fragments. A further decomposition into letter fragments and number fragments would not change this decomposition. Other variants of decomposing the digital data may begin the operation from the right as opposed to the left.
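A minimal sketch of the category-based decomposition described above, assuming only the three categories named (letters, numbers, special characters), might look as follows; the function names are hypothetical.

```python
def categorize(ch):
    """Map a character to one of the predetermined categories."""
    if ch.isalpha():
        return "letters"
    if ch.isdigit():
        return "numbers"
    return "special"

def decompose(data):
    """Split digital data into fragments of consecutive same-category characters."""
    fragments = []
    for ch in data:
        cat = categorize(ch)
        if fragments and fragments[-1][1] == cat:
            fragments[-1][0] += ch      # extend the current fragment
        else:
            fragments.append([ch, cat])  # start a new fragment
    return [(text, cat) for text, cat in fragments]

# decompose("SWR 4 HN") yields:
# [("SWR", "letters"), (" ", "special"), ("4", "numbers"),
#  (" ", "special"), ("HN", "letters")]
```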
At act 203, a memory (e.g., dictionary) that may retain a reference list may be searched to determine whether there are any entries matching one or a sequence of the decomposed data fragments. Searching the dictionary may include matching each character of a data fragment with the characters of an entry stored in the dictionary. Alternatively, searching the dictionary may include a phonetic comparison of the data fragment with an entry in the dictionary.
The dictionary may include words and/or abbreviations. Where the speech recognition system is used to control a radio, the dictionary may include the names and/or abbreviations of radio stations. For each data fragment or possibly for a sequence of data fragments, the dictionary is searched. The dictionary may also be decomposed into different sub-dictionaries each including entries belonging to a specific category. In this case, one sub-dictionary may include entries consisting of letters and another sub-dictionary may include entries consisting of numbers. Then, only the letter sub-dictionary would be searched with respect to letter data fragments and only the number sub-dictionary would be searched with regard to number data fragments. In this way, the processing time may be considerably reduced.
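One possible organization of such category-specific sub-dictionaries is sketched below. The data layout and the stored transcription strings are assumptions made purely for illustration.

```python
# Hypothetical sub-dictionaries keyed by category; each maps a data fragment
# to its stored phonetic transcription.
sub_dictionaries = {
    "letters": {"SWR": "<phonetic transcription of SWR>",
                "HN":  "<phonetic transcription of HN>"},
    "numbers": {"4": "<phonetic transcription of four>"},
}

def lookup(fragment, category):
    """Search only the sub-dictionary matching the fragment's category,
    reducing the number of entries that must be compared."""
    return sub_dictionaries.get(category, {}).get(fragment)
```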
At act 204, it is determined whether there is any data fragment that does not match an entry in the dictionary. If this is not the case, the process may be terminated at act 207 since the digital data is already present in the dictionary. Since the dictionary includes the phonetic transcriptions, the speech recognizer 101 has all the information necessary for recognizing these fragments.
If there are one or more data fragments for which no matching entry has been found in the dictionary, the process proceeds to act 205. At act 205, each such data fragment is phonetically transcribed. Phonetic transcription may include generating a speech pattern corresponding to the pronunciation of the data fragment. A text-to-speech (“TTS”) synthesizer may be used to generate the phonetic transcription. At act 205, it is also decided, according to a predetermined criterion, which form of phonetic transcription is to be performed. In some speech recognition systems, a criterion may be that for data fragments consisting of fewer than a predetermined number of characters, a phonetic transcription in spelled form is always selected. The criterion may also depend (additionally or alternatively) on the appearance of upper and lower case characters, on the type and/or presence of neighboring (preceding or following) fragments, the length of a data fragment (number of characters), and/or the presence of consonants and/or vowels.
Other phonetic transcription criteria may include spelling letter data fragments that consist of all consonants. In other words, the resulting phonetic pattern corresponds to spelling the letters of the data fragment. This is particularly useful for abbreviations not containing any vowels which would also be spelled by a user. However, in other cases, it might be useful to perform a composed phonetic transcription consisting of phonetic transcriptions in spelled and in pronounced form.
At act 206, the phonetic transcriptions and the corresponding digital data fragments may be provided to the speech recognizer 101. The phonetic transcriptions and corresponding digital data fragments may be stored in the memory of the speech recognizer and/or stored in an external memory accessible by the speech recognizer. Thus, the vocabulary for speech recognition is extended.
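Taken together, acts 203 through 206 can be summarized by the following sketch. The helper names are hypothetical and the text-to-speech call is a placeholder for whatever TTS component a given system provides.

```python
def extend_vocabulary(fragments, dictionary, tts_transcribe):
    """For each (fragment, category) pair, reuse a stored transcription when
    one exists (acts 203-204) or generate and store a new one (acts 205-206).

    dictionary: mapping from data fragment to phonetic transcription
    tts_transcribe: callable standing in for a TTS synthesizer
    """
    for fragment, category in fragments:
        if fragment in dictionary:
            continue                              # act 204: already known
        transcription = tts_transcribe(fragment)  # act 205: transcribe
        dictionary[fragment] = transcription      # act 206: extend vocabulary
    return dictionary
```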
At act 302, a list of receivable stations may be determined. When scanning a frequency band, each time a frequency is encountered at which a radio signal is received, this frequency may be stored with other specific information. The information may include the name and/or abbreviation of the received radio station, programming type, signal frequency, or other information.
At act 303, it is determined whether there is already a list of receivable radio stations present or whether the current list has changed with respect to a previously stored list of radio stations. The latter may happen in the case of a vehicle radio when the driver is moving between different transmitter coverage areas. In this situation, some radio stations may become receivable at a certain time whereas other radio stations may no longer be receivable. Act 303 may determine whether a list of receivable radio stations has changed by comparing a previously stored list to a recently received list. If the list of receivable radio stations has changed, the system may overwrite the previously stored list, or may remove the old stations that are no longer present and add the new stations. At act 304, vocabulary corresponding to the list of updated radio stations may be generated. This may be performed according to the method illustrated in
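A simple way to detect and apply such a change, assuming the lists are plain collections of station identifiers, is sketched below; the function name is hypothetical.

```python
def update_station_list(stored, scanned):
    """Compare a previously stored station list with a freshly scanned one
    (act 303) and return the stations to remove and to add."""
    stored_set, scanned_set = set(stored), set(scanned)
    removed = stored_set - scanned_set   # stations no longer receivable
    added = scanned_set - stored_set     # newly receivable stations
    return removed, added
```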
In the case of radio stations that are identified by their frequency, the dictionary may also comprise entries corresponding to different ways to pronounce or spell this frequency. For exemplary purposes, if a radio station is received at 94.3 MHz, the dictionary could include entries corresponding to “ninety-four dot three,” “ninety-four three,” “nine four three,” and/or “nine four period three.” Therefore, a user may pronounce the “dot” or not; in both cases, a speech recognizer could recognize the frequency.
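Purely as an illustration of enumerating such alternative entries, the sketch below generates a few variants for a frequency such as 94.3. The variants use digit strings rather than fully spelled-out numbers, and the exact phrasing is assumed; a real system would rely on its own TTS rules.

```python
def frequency_variants(freq):
    """Return illustrative spoken variants for a frequency given as a string
    or float containing a decimal point, e.g. "94.3"."""
    whole, frac = str(freq).split(".")
    digits_whole = " ".join(whole)          # "94" -> "9 4"
    return [
        f"{whole} dot {frac}",              # "94 dot 3"
        f"{whole} {frac}",                  # "94 3"
        f"{digits_whole} {frac}",           # "9 4 3"
        f"{digits_whole} period {frac}",    # "9 4 period 3"
    ]
```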
In the foregoing, the method for generating a vocabulary for a speech recognizer was described in the context of a radio, in particular, a vehicle radio. The method may be used in other fields as well, including a speech recognizer for mobile phones. In such a case, a vocabulary may be generated based on an address book stored on the SIM card of the mobile phone or in a mobile phone's memory. The address book database may be uploaded when the mobile phone is switched on, and the method according to
The method shown in
A “computer-readable medium,” “machine-readable medium,” “propagated-signal” medium, and/or “signal-bearing medium” may comprise any means that contains, stores, communicates, propagates, or transports software for use by or in connection with an instruction executable system, apparatus, or device. The machine-readable medium may selectively be, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. A non-exhaustive list of examples of a machine-readable medium would include: an electrical connection “electronic” having one or more wires, a portable magnetic or optical disk, a volatile memory such as a Random Access Memory “RAM” (electronic), a Read-Only Memory “ROM” (electronic), an Erasable Programmable Read-Only Memory (EPROM or Flash memory) (electronic), or an optical fiber (optical). A machine-readable medium may also include a tangible medium upon which software is printed, as the software may be electronically stored as an image or in another format (e.g., through an optical scan), then compiled, and/or interpreted or otherwise processed. The processed medium may then be stored in a computer and/or machine memory.
While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the invention. Accordingly, the invention is not to be restricted except in light of the attached claims and their equivalents.
Number | Date | Country | Kind
---|---|---|---
PCT/EP05/05568 | May 2005 | EP | regional
04012134.5 | May 2004 | EP | regional
This application claims the benefit of priority from International Application No. PCT/EP2005/005568, filed May 23, 2005, which is incorporated by reference.