The present invention relates to communication systems and in particular to low bit rate speech communication systems.
There are approximately 890,000 distinct words in the English language, but in general only 10,000 are in the vocabulary of the common educated person. In addition, some words are used much more frequently than others. For example, the top twenty most frequently used words in spoken English are: the, and, I, to, of, a, you, that, in, it, is, yes, was, this, but, on, well, he, have, and for.
It has been estimated that a typical person speaks at a rate of approximately 4 words per second, and that the average word is made up of 6.66 phonemes. This means that approximately 4 words, or equivalently approximately 27 phonemes, must be transmitted each second to accurately convey the information. Definitions vary, but spoken English can be represented by approximately 50 distinct phonemes. Therefore, each of the phonemes can be represented distinctly as a 6-bit number. If phonemes were transmitted as a representation of the speech, approximately 162 bits/second (27 phonemes/second at 6 bits each) would be required.
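The arithmetic above can be checked with a short sketch. The constant and function names are illustrative assumptions; only the figures (4 words/second, 6.66 phonemes/word, 50 phonemes) come from the text.

```python
import math

# Figures taken from the text above.
WORDS_PER_SECOND = 4
PHONEMES_PER_WORD = 6.66
PHONEME_COUNT = 50


def bits_per_symbol(alphabet_size: int) -> int:
    """Smallest fixed code width (in bits) that can distinguish every symbol."""
    return (alphabet_size - 1).bit_length()


# 4 * 6.66 = 26.64, rounded up to a whole number of phonemes per second.
phonemes_per_second = math.ceil(WORDS_PER_SECOND * PHONEMES_PER_WORD)
phoneme_bits = bits_per_symbol(PHONEME_COUNT)
phoneme_bit_rate = phonemes_per_second * phoneme_bits

print(phonemes_per_second, phoneme_bits, phoneme_bit_rate)
```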
As an alternative to transmitting symbols representing phonemes, symbols representing the actual words can be transmitted. Estimates vary, but an educated person has a vocabulary of 10,000 words. A single 15-bit number can be assigned to each of the commonly used words (and word forms) in the English dictionary. If a person speaks at 4 words/second, then 60 bits/second would be necessary to represent the speech using this approach. As a further enhancement to this technique, shorter bit strings may be used to represent the most commonly used words, and even the most commonly used groups of words (“and the” for example). This technique may reduce the required bit rate to as little as 30 bits/second.
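As a rough illustration of the two word-level schemes described above, the following sketch compares a fixed 15-bit code with a variable-length code in which frequent words receive shorter bit strings. Huffman coding is used here only as one well-known way to build such a code; the text does not name a specific method. The five-word vocabulary and its usage counts are invented for the example.

```python
import heapq
from itertools import count

WORDS_PER_SECOND = 4
FIXED_CODE_BITS = 15
fixed_bit_rate = WORDS_PER_SECOND * FIXED_CODE_BITS  # 60 bits/second, as in the text


def huffman_code_lengths(weights):
    """Return {word: code length in bits} for a Huffman code over the weights."""
    tie = count()  # unique tie-breaker so the heap never compares the word lists
    heap = [(w, next(tie), [word]) for word, w in weights.items()]
    heapq.heapify(heap)
    lengths = {word: 0 for word in weights}
    while len(heap) > 1:
        w1, _, words1 = heapq.heappop(heap)
        w2, _, words2 = heapq.heappop(heap)
        # Merging two subtrees pushes every word in them one level deeper.
        for word in words1 + words2:
            lengths[word] += 1
        heapq.heappush(heap, (w1 + w2, next(tie), words1 + words2))
    return lengths


# Hypothetical usage counts per 100 spoken words (not measured data).
usage = {"the": 50, "and": 25, "of": 15, "sonar": 6, "ballast": 4}
lengths = huffman_code_lengths(usage)
average_bits = sum(usage[w] * lengths[w] for w in usage) / sum(usage.values())

print(lengths)
print(average_bits)
```

With a realistic vocabulary the saving depends entirely on the actual word-frequency distribution, so the 30 bits/second figure in the text should be read as an estimate rather than something this toy example demonstrates.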
The human vocal tract can be represented as a glottal pulse train convolved through a vocal tract convolutional filter (of approximately 10 coefficients). The glottal pulse train represents the pitch of the speech and the filter coefficients determine the other sound characteristics. The pitch and the filter coefficients change as one speaks so each glottal pulse is convolved through a slightly different filter as one speaks to generate the sounds we hear. In an artificial speech generator, changing or updating the coefficients and pitch about 30 times/second is sufficient to generate natural sounding speech. Certain sounds, such as “ssss” or “zzz” do not contain the glottal pulse (are unvoiced), and can be represented as a sound directly from the filter, or with a much higher pitch frequency. Any given person will speak with a certain range of filter coefficients and glottal pulse shapes and frequency, giving them their particular speech sound. As one speaks, this range can be modeled and passed to the speech regenerator to help reconstitute speech that sounds like the original speaker. By passing only the range of pitch and filter coefficients, but not the coefficients themselves, little bandwidth is required to mimic the original speaker.
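The pulse-train-and-filter model described above can be sketched as follows. This is a minimal illustration under several assumptions not found in the text: an 8 kHz sample rate, an impulse as the glottal pulse shape, an arbitrary set of 10 filter coefficients, and a pulse train that restarts at each frame boundary. Only the 10-coefficient filter length and the update rate of 30 times/second come from the text.

```python
SAMPLE_RATE = 8000         # assumed; not specified in the text
FRAMES_PER_SECOND = 30     # update rate given in the text
FRAME_LEN = SAMPLE_RATE // FRAMES_PER_SECOND

# Hypothetical 10-coefficient vocal tract filter (values chosen arbitrarily).
COEFFS = [1.0, 0.5, 0.25, 0.125, 0.0, -0.125, -0.0625, 0.0, 0.03125, 0.0]


def pulse_train(n_samples, period):
    """Unit impulses every `period` samples, standing in for glottal pulses."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def convolve(signal, coeffs):
    """Direct-form convolution of the excitation with the filter coefficients."""
    out = [0.0] * (len(signal) + len(coeffs) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, c in enumerate(coeffs):
            out[i + j] += s * c
    return out


def synthesize(frames):
    """Each frame is (pitch_hz, coeffs); filtered frames are overlap-added."""
    tail = max(len(coeffs) for _, coeffs in frames) - 1
    out = [0.0] * (FRAME_LEN * len(frames) + tail)
    for k, (pitch_hz, coeffs) in enumerate(frames):
        period = round(SAMPLE_RATE / pitch_hz)
        voiced = convolve(pulse_train(FRAME_LEN, period), coeffs)
        for i, v in enumerate(voiced):
            out[k * FRAME_LEN + i] += v
    return out


# Two frames with different pitch, sharing one filter for simplicity.
demo = synthesize([(100.0, COEFFS), (120.0, COEFFS)])
print(len(demo))
```

A practical synthesizer would also keep the pulse phase continuous across frames and handle unvoiced sounds, neither of which this sketch attempts.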
Prior art patents relating to the present invention include the following patents: U.S. Pat. No. 7,124,082, “Phonetic speech-to-text-to-speech system and method”, Freedman, 2006; U.S. Pat. No. 6,035,273, “Speaker-specific speech-to-text/text-to-speech communication system with hypertext-indicated speech parameter changes”, Spies, 1996; U.S. Pat. No. 5,724,410, “Two-way voice messaging terminal having a speech to text converter”, Parvulescu, 1998.
The present invention provides a very low bit rate speech communication system. In preferred embodiments, an off-the-shelf module is adapted to convert a speaker's voice to text. A processor is provided to separate the text into individual words. The processor is programmed with a dictionary which provides a pre-assigned specific 14-bit numeric value for each word (words used more frequently may be assigned shorter codes). The processor creates a numeric stream from the 14-bit numeric values, and this numeric stream is then transmitted to a receiver. Typical speech contains 4 words/second, so bit rates as low as 50 bits/second may be achieved with this technique. At the receiving end, each received 14-bit numeric value, representing one of the speaker's words, is looked up in a dictionary identical to that at the transmitting end, and the text of the words is reconstructed. Text-to-speech techniques common to the industry are then used to regenerate the speech.
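The dictionary step described above can be sketched as a pair of functions that map words to 14-bit values and back. This is an illustrative sketch, not the implementation of the invention: the function names, the five-word vocabulary, the big-endian bit packing, and the zero padding to a byte boundary are all assumptions. Handling of words missing from the dictionary is not addressed here.

```python
CODE_BITS = 14  # code width given in the text; allows up to 16,384 entries


def build_dictionary(words):
    """Assign each word a fixed numeric value; both ends must share this list."""
    if len(words) > 1 << CODE_BITS:
        raise ValueError("vocabulary does not fit in 14-bit codes")
    return {word: code for code, word in enumerate(words)}


def encode(text, word_to_code):
    """Pack one 14-bit code per word into bytes, zero-padded to a byte boundary."""
    bits, nbits = 0, 0
    for word in text.lower().split():
        bits = (bits << CODE_BITS) | word_to_code[word]
        nbits += CODE_BITS
    pad = (-nbits) % 8
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")


def decode(payload, code_to_word):
    """Recover the text; padding is under 8 bits, so it never forms a full code."""
    total_bits = len(payload) * 8
    n_codes = total_bits // CODE_BITS
    value = int.from_bytes(payload, "big") >> (total_bits - n_codes * CODE_BITS)
    mask = (1 << CODE_BITS) - 1
    codes = [(value >> (CODE_BITS * (n_codes - 1 - i))) & mask for i in range(n_codes)]
    return " ".join(code_to_word[c] for c in codes)


# Hypothetical shared vocabulary, identical at transmitter and receiver.
vocabulary = ["the", "and", "dive", "surface", "now"]
word_to_code = build_dictionary(vocabulary)
code_to_word = {code: word for word, code in word_to_code.items()}

payload = encode("Dive now and surface", word_to_code)
print(len(payload) * 8, decode(payload, code_to_word))
```

Four words occupy 56 bits in this packing, which matches the fixed-width rate of 4 words/second at 14 bits each; rates nearer 50 bits/second would rely on the shorter codes for frequent words mentioned in the text.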
Preferred embodiments of the present invention are described by reference to the drawings. In a first preferred embodiment, the speaker's sounds are converted to symbols representing words. These word symbols are then transmitted at the rate of four symbols per second. At the receiving end, the symbols are converted back to words and then to sound recognizable as speech.
Receiver 6 receives the output of transmitter 5 and presents 14-bit digital words to dictionary look-up module 7, which creates a string of textual words corresponding to the 14-bit numbers. The output of dictionary look-up module 7 is presented to text-to-speech module 8 (such as Fonix DecTalk 5), which creates a waveform facsimile of the speaker's voice, based on the text from module 7. The waveform is presented by computer 9 to loudspeaker 10, which creates an acoustic wave that may be heard by a listener.
In a preferred embodiment of the invention, dictionary conversion module 4 and dictionary look-up module 7 are custom software applications developed using Microsoft Speech SDK 5.1 for the personal computer.
In the preferred embodiment of the invention, the audio input is derived from microphone 1, but may alternatively be provided by another sound source such as a computer file, amplifier, telephone, radio, or other source.
In the preferred embodiment of the invention, the audio speech recognition module is a customized version of the Microsoft Speech to Text engine as stated above. However, several other vendors are available with software and hardware to perform this function. In other embodiments of the invention, this module may also analyze the speaker's voice to determine pitch and vocal tract characteristics.
In the preferred embodiment of the invention, dictionary conversion module 4 is custom-written software that converts textual words to 14-bit numbers, using a 15,000 word common dictionary. In other embodiments of the invention, the dictionary may be customized to fit the particular context of speech or operating environment.
In the preferred embodiment of the invention, dictionary look-up module 7 is custom-written software that converts 14-bit numbers to textual words, using a 15,000 word common dictionary. In other embodiments of the invention, the dictionary may be customized to fit the particular context of speech or operating environment.
In the preferred embodiment of the invention, the text-to-speech function is performed using Fonix's DecTalk 5 software as stated above, which allows customization for multiple speakers (it has the ability to generate several different voices). The text-to-speech function is generic and may or may not be based on phoneme recognition. In other embodiments of the invention, the speaker's voice will be parameterized so that the regenerated speech mimics the sound of the speaker's voice. Several vendors provide both software and hardware products that perform the text-to-speech function.
There are many potential applications of the present invention, some of which are outlined below and many of which will be obvious to persons skilled in the communication art:
The underwater environment limits the penetration of both electromagnetic and acoustic signals to only very low frequencies. Acoustic carrier signals of approximately 10 kHz are typically used for sonar and communications, and electromagnetic signals of approximately 200 Hz are used for communications. Lower frequencies penetrate much farther underwater, and the low bit rates of the speech coding technique of the present invention will significantly extend the range of underwater acoustic speech transmission systems, as illustrated in
Wireless communication from the surface of the earth to deep underground has become a safety issue, but communicating wirelessly to depths of several hundred meters is not practical at frequencies above ~2 kHz. By going to lower carrier frequencies, the penetration is greatly enhanced. A frequency of approximately 1 kHz should have a detectable signal at a depth of >100 m underground. The present invention allows speech communications systems to be built that are capable of wirelessly communicating from the surface to depths of >100 m.
Online computer games and virtual worlds have been created in which the players are represented online as ‘avatars’ which are seen by the other players in the game or world. Often these avatars look and act very differently from the ‘real-life’ person. In an application of the present invention, the player's online avatar can speak the words of the player to the other online players, but in a voice of the player's choosing, rather than his own. In this application of the invention, the object is not to mimic the speaker's voice, but to give it a different, more fanciful semblance, or to make all players speak with the same voice or set of voices.
Telephone applications of all sorts can benefit from the present invention, whether wireless, cellular, wired, Internet, or other. Bandwidth for voice communications is becoming more expensive, and more users are being added all the time. The present invention allows substantially more users to be accommodated in the same amount of bandwidth employed by current techniques.
While the present invention has been described in terms of specific embodiments, certain other modifications and improvements will occur to those skilled in the art upon reading the foregoing description. The embodiment described herein is based on a specific architecture, but the present invention is not so limited. Accordingly, the scope of the invention should be determined by the appended claims and their legal equivalents.