This invention relates generally to voice communications, and more particularly to a bandwidth efficient method and system of communication using speech units such as diphones, triphones, or phonemes.
In wireless telecommunication systems, bandwidth (BW) is very expensive. There are many techniques for compressing audio to maximize bandwidth utilization. Often, these techniques provide either low-quality voice at reduced BW or high-quality voice at high BW.
Embodiments in accordance with the present invention can utilize known voice recognition and concatenative text-to-speech (TTS) synthesis techniques in a bandwidth-efficient manner that provides high-quality voice. In most embodiments herein, the systems can improve bandwidth efficiency over time without necessarily degrading voice quality.
In a first embodiment of the present invention, a method for improved bandwidth and enhanced concatenative speech synthesis in a voice communication system can include the steps of receiving a speech input, converting the speech input to text using voice recognition, segmenting the speech input into speech units such as diphones, triphones, or phonemes, comparing the speech units with the text and with stored speech units in a database, combining a speech unit with the text in a data stream if the speech unit is new to the database, and transmitting the data stream. The new speech units can be stored in the database, and if a speech unit already exists in the database, it does not need to be transmitted in the data stream. The method can further include the step of extracting voice parameters such as speech rate or gain for each speech unit, where the gain can be determined by measuring an energy level for each speech unit and the rate can be determined from a voice recognition module. The method can further include the step of determining if a new voice is detected (the speech input is for a new voice) and resetting the database. Note, the speech units can be compressed before being stored in the database and transmitted. This method can be done at a transmitting device. The method can also increase bandwidth efficiency by increasingly using stored speech units as the database becomes populated with speech units.
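Purely as an illustration of this transmit-side flow, the following sketch assembles one outgoing frame. The stub functions vocoder_compress and measure_energy, the dictionary-based frame layout, and the unit fields ('id', 'audio', 'rate') are illustrative assumptions and not part of this specification.

```python
# Minimal, self-contained sketch of the transmit-side flow described above.
# The vocoder and energy measure are trivial stand-ins (assumptions).

sent_units = {}  # transmit-side history database of already-sent speech units

def vocoder_compress(audio: bytes) -> bytes:
    return audio  # stand-in for any real voice compression scheme

def measure_energy(audio: bytes) -> float:
    return sum(audio) / max(len(audio), 1)  # stand-in gain estimate

def build_frame(text: str, units: list) -> dict:
    """units: [{'id': 't-ih', 'audio': b'...', 'rate': 1.0}, ...] as produced
    by the voice recognition and segmentation stages."""
    frame = {"text": text, "params": [], "new_units": {}}
    for u in units:
        # gain from the unit's energy level; rate from the recognition module
        frame["params"].append({"gain": measure_energy(u["audio"]),
                                "rate": u["rate"]})
        if u["id"] not in sent_units:             # new speech unit for the database
            compressed = vocoder_compress(u["audio"])
            sent_units[u["id"]] = compressed      # store it locally
            frame["new_units"][u["id"]] = compressed  # and include it in the stream
        # existing units are not retransmitted; the text identifies them
    return frame
```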
In a second embodiment of the present invention, another method for improved bandwidth and enhanced concatenative speech synthesis in a voice communication system can include the steps of extracting parameters, text, voice, and speech units from received data, forwarding the speech units and parameters to a text-to-speech engine, storing a new speech unit missing from a database into the database, and retrieving a stored speech unit for each text portion missing an associated speech unit from the data. The method can further include the step of comparing a speech unit from the extracted data with speech units stored in the database. From the parameters sent to the text-to-speech engine, the method can further include the step of reconstructing prosody. Note, this method can be done at a receiving device such that the database at a receiver can be synchronized with a database at a transmitter. The method can further include the step of recreating speech using the new speech units and the stored speech units. Further note that the database can be reset if a new voice is detected from the extracted data.
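A complementary receive-side sketch under the same illustrative assumptions follows; the frame layout mirrors the transmit-side sketch above, and the mapping from text portions to speech-unit identifiers (unit_ids_for_text) is assumed to come from the TTS front end.

```python
# Illustrative receive-side counterpart; names and layout are assumptions.

received_units = {}   # receive-side history database, kept in step with the sender
current_voice = None

def process_frame(frame: dict, voice_id: str, unit_ids_for_text: list) -> list:
    """Return the ordered list of (compressed) speech units for the TTS engine."""
    global current_voice
    if voice_id != current_voice:                # new voice detected in the data
        received_units.clear()                   # reset the database
        current_voice = voice_id
    received_units.update(frame["new_units"])    # store units missing from the DB
    # For every text portion, retrieve its stored unit (newly received or earlier);
    # on packet loss, a pre-recorded unit could be substituted here instead.
    return [received_units[uid] for uid in unit_ids_for_text]
```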
In a third embodiment of the present invention, a voice communication system for improved bandwidth and enhanced concatenative speech synthesis can include, at a transmitter, a voice recognition engine that receives a speech input and provides a text output, a voice segmentation module coupled to the voice recognition engine that segments the speech input into a plurality of speech units, a speech unit database coupled to the voice segmentation module for storing the plurality of speech units, a voice parameter extractor coupled to the voice recognition engine for extracting rate, gain, or both, and a data formatter that converts text to speech units and compresses speech units using a vocoder. The data formatter can merge speech units and text into a single data stream. The system can further include, at a receiver, an interpreter for extracting parameters, text, voice, and speech units from the data stream, a parameter reconstruction module coupled to the interpreter for detecting gain and rate, a text-to-speech engine coupled to the interpreter and the parameter reconstruction module, and a second speech unit database that is further populated with speech units from the data stream that are missing in the second speech unit database. The receiver can further include a voice identifier that can reset the database if a new voice is detected from the data stream. Note, the second speech unit database can be synchronized with the speech unit database.
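One possible way the data formatter could merge the text, voice parameters, and any new compressed speech units into a single data stream is sketched below; the JSON-plus-base64 encoding is only an assumption chosen to keep the example self-contained, not a format defined by this specification.

```python
# Purely illustrative serialization of one merged data-stream frame.
import base64
import json

def format_stream(text: str, rate: float, gains: list, new_units: dict) -> bytes:
    """Merge text, voice parameters, and any new (vocoder-compressed) units."""
    return json.dumps({
        "text": text,
        "rate": rate,
        "gains": gains,
        "new_units": {uid: base64.b64encode(blob).decode("ascii")
                      for uid, blob in new_units.items()},
    }).encode("utf-8")
```

Once the second speech unit database is populated, the new_units field stays empty and the stream carries only text and parameters.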
Other embodiments, when configured in accordance with the inventive arrangements disclosed herein, can include a system for performing and a machine readable storage for causing a machine to perform the various processes and methods disclosed herein.
While the specification concludes with claims defining the features of embodiments of the invention that are regarded as novel, it is believed that the invention will be better understood from a consideration of the following description in conjunction with the figures, in which like reference numerals are carried forward.
In wired or wireless IP networks, traffic conditions or congestion can be improved using a bandwidth-efficient communication technique that also provides reasonable voice quality as described herein. Embodiments herein can use voice recognition and concatenative TTS synthesis techniques to efficiently use BW. Methods in accordance with the present invention can use snippets or speech units of pre-recorded voice from a transmitter end, and the snippets of pre-recorded voice can be put together at a receiver end. The snippets or speech units can be diphones, triphones, syllables, or phonemes, for example. Diphones are usually a combination of two sounds. In general American English there are 1444 possible diphones. For example, “tip”, “steep”, “spit”, “butter”, and “button” involve five different pronunciations of “t”. At a transmitter 10 as illustrated in
Referring again to
The transmitter 10 can further include a voice parameter extraction module 28 that obtains information about at least speech rate and gain. The gain of the speech is extracted and sent with the text for later prosodic reconstruction (stress, accentuation, etc.). The energy of each phoneme or diphone can be easily measured to determine the gain per snippet (phoneme or diphone). The chart of
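A minimal sketch of deriving the per-snippet gain from its energy level follows, assuming each snippet is available as floating-point PCM samples; the RMS measure is one reasonable choice, not a requirement of this specification.

```python
import math

def snippet_gain(samples):
    """RMS energy of one phoneme or diphone, used as its gain parameter."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```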
With respect to the voice segmentation module 24, there are many ways of performing voice segmentation using voice recognition. Voice recognition software can perform recognition and link each word to its corresponding phonemes and audio. Once the phonemes are detected, a diphone can be formed. As noted previously, diphones are a combination of two phonemes.
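For example, pairing consecutive phonemes yields the diphone sequence, as in the short sketch below (the ARPAbet-style phoneme labels are illustrative only).

```python
def to_diphones(phonemes):
    """Combine consecutive phonemes into diphones,
    e.g. ['t', 'ih', 'p'] -> ['t-ih', 'ih-p']."""
    return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]

print(to_diphones(["t", "ih", "p"]))  # "tip" -> ['t-ih', 'ih-p']
```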
Referring to
Note, any voice recognition engine (with dictation capabilities) is acceptable for the module 12 of
At a receiver 50 as illustrated in
The efficiency of the method of communication in accordance with several of the embodiments herein will be low at the beginning of a call and will increase as the call continues, until it reaches a steady-state condition where no diphones are sent at all and the transmission consists of text, speech rate, and gain information. Both history databases can be synchronized in case of packet loss. Hence, in a synchronized scenario the receiver has to acknowledge every diphone it receives. If the transmitter does not get the acknowledgement from the receiver, the diphone can be deleted from the local (transmit side) database (18 or 41). If a diphone is not received, a pre-recorded diphone can be used on the receiver side (50). The pre-recorded diphone database 58 can have all the diphones and can be used in combination with received diphones in case of packet loss. Note, embodiments herein can use any method of voice compression to reduce the size of the diphones to be sent.
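The acknowledgement behaviour described above could be realized along the lines of the following sketch; the callback names and the timeout handling are assumptions for illustration only.

```python
pending_acks = {}   # unit_id -> time the compressed unit was sent
local_db = {}       # transmit-side history database (e.g., 18 or 41)

def on_unit_sent(unit_id, compressed, now):
    local_db[unit_id] = compressed               # tentatively keep the unit
    pending_acks[unit_id] = now                  # and wait for the receiver's ack

def on_ack(unit_id):
    pending_acks.pop(unit_id, None)              # receiver confirmed storage

def on_ack_timeout(unit_id):
    # No acknowledgement: assume the packet was lost, so delete the unit from
    # the local database; it will be treated as new (and resent) next time,
    # while the receiver falls back on its pre-recorded diphone database.
    pending_acks.pop(unit_id, None)
    local_db.pop(unit_id, None)
```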
Every TTS system has a pre-recorded database with all the speech units (diphones). In embodiments herein, the database 58 will serve the TTS engine 56, except that the speech units or diphones are not all present at the beginning. The database 58 gets populated during the communication. This can be totally transparent to the TTS engine 56. Every time the TTS engine 56 requests a diphone or other speech unit, it will be available, whether it is obtained from the database 58 or freshly extracted from the data stream. The diphones or other speech units are stored in compressed format in the history database 58 to reduce the memory usage on the receiver 50.
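The transparency to the TTS engine 56 can be pictured as a simple get-or-populate lookup, sketched below with illustrative names; how the engine actually interfaces with its unit inventory is implementation dependent.

```python
class HistoryDatabase:
    """Receive-side history database that is filled in during the call."""

    def __init__(self):
        self._units = {}                         # units kept in compressed form

    def add_from_stream(self, unit_id, compressed):
        self._units[unit_id] = compressed        # populate as new units arrive

    def get(self, unit_id, fresh_units):
        # Serve from history when present, otherwise take the unit that was
        # just extracted from the data stream; the TTS engine never notices
        # the difference.
        return self._units.get(unit_id, fresh_units.get(unit_id))
```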
Note that, using the embodiments herein, the voice prosody (stress, intonation) is degraded, where the amount of degradation will depend on the BW used. To improve the voice quality, the number of voice parameters transmitted (related to the voice prosody, such as pitch) can be increased; hence the quality will improve with some effect on BW. The overall BW is variable and improves with time. Each diphone or speech unit that is repeated (already existing in the database) is not necessarily transferred again. After the most common diphones or speech units have been transferred, the BW is reduced to a minimum level.
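For illustration only, and assuming (purely as an example, not as a figure from this specification) one additional byte of pitch information per diphone at the example rate of about 7 diphones per second, the extra prosody data would add on the order of 7 x 8 = 56 bps.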
To determine a worst case scenario for bandwidth, note the following:
The worst case BW is:
(*) For example: mean diphone duration = 140 ms -> an average of about 7 diphones per second, with an average of 5 bytes per diphone.
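As a rough arithmetic check of the example figures above: a mean diphone duration of 140 ms corresponds to about 1/0.14 ≈ 7 diphones per second, and at an average of 5 bytes per diphone that component of the stream contributes roughly 7 x 5 x 8 = 280 bits per second, before any gain and rate parameters are added.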
At the beginning, the rate is equivalent to today's technology. But after a few seconds the rate can be drastically reduced (as the diphones begin to populate the database). After the database is populated with the most frequent diphones (500 diphones), the rate is lowered to approximately 500 bps. After the most frequent diphones are received, if a diphone that is not yet in the database is received, the rate will have peaks of 1000 bps. Note, a complete conversation can be made using only 1300 diphones from a total of 1600.
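The variable-rate behaviour can be illustrated with the rough estimate below; every payload size in it is an assumption chosen only to make the trend visible (a high rate while the database is empty, falling toward a few hundred bps as the hit rate approaches one), not a figure taken from this specification.

```python
def estimate_bitrate_bps(units_per_second=7.0,      # from the 140 ms example above
                         db_hit_rate=0.0,           # fraction of units already stored
                         text_bytes_per_unit=5,     # assumed identifier/text cost
                         param_bytes_per_unit=2,    # assumed gain + rate cost
                         audio_bytes_per_unit=120): # assumed compressed audio cost
    """Rough per-second bit rate: text and parameters always, audio only on misses."""
    bytes_per_unit = (text_bytes_per_unit + param_bytes_per_unit
                      + (1.0 - db_hit_rate) * audio_bytes_per_unit)
    return units_per_second * bytes_per_unit * 8

for hit_rate in (0.0, 0.5, 1.0):
    print(hit_rate, round(estimate_bitrate_bps(db_hit_rate=hit_rate)))
# 0.0 -> 7112 bps (start of call), 0.5 -> 3752 bps, 1.0 -> 392 bps (steady state)
```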
In light of the foregoing description, it should be recognized that embodiments in accordance with the present invention can be realized in hardware, software, or a combination of hardware and software. A network or system according to the present invention can be realized in a centralized fashion in one computer system or processor, or in a distributed fashion where different elements are spread across several interconnected computer systems or processors (such as a microprocessor and a DSP). Any kind of computer system, or other apparatus adapted for carrying out the functions described herein, is suited. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the functions described herein.
In light of the foregoing description, it should also be recognized that embodiments in accordance with the present invention can be realized in numerous configurations contemplated to be within the scope and spirit of the claims. Additionally, the description above is intended by way of example only and is not intended to limit the present invention in any way, except as set forth in the following claims.