The disclosed embodiments relate generally to methods, systems, and devices for audio communications. More particularly, the disclosed embodiments relate to methods, systems, and devices for speech transduction.
Traditionally, audio devices such as telephones have operated by seeking to faithfully reproduce the sound acquired by one or more microphones. However, phone call quality is often very poor, especially in hands-free applications, and significant improvements are needed. For example, consider the operation of a speakerphone, such as those commonly built into cellular telephone handsets. The handset's microphone operates in far-field mode, with the speaker typically located several feet from the handset. In far-field mode, certain frequencies do not propagate well over distance, while other frequencies, which correspond to resonant geometries present in the room, are accentuated. The result is the so-called tunnel effect: to a listener, the speaker's voice is muffled, and the speaker seems to be talking from within a deep tunnel. This tunnel effect is further compounded by ambient noise present in the speaker's environment.
The differences between near and far field are further accentuated in the case of cellular telephones and voice over IP networks. In cellular telephones and voice over IP networks, codebook-based signal compression codecs are heavily employed to compress voice signals to reduce the communication bandwidth required to transmit a conversation. In these compression schemes, the selection of which codebook entry to use to model the speech is typically heavily influenced by the relative magnitudes of different frequency components in the voice. Acquisition of data in the far field has a tendency to alter the relative magnitudes of these components, leading to a poor codebook entry selection by the codec and further distortion of the compressed voice.
Similar problems occur with the voice quality of speech acquired by far field microphones in other devices besides communications devices (e.g., hearing aids, voice amplification systems, audio recording systems, voice recognition systems, and voice-enabled toys or robots).
Accordingly, there is a need for improved methods, systems, and devices for speech transduction that reduce or eliminate the problems associated with speech acquired by far-field microphones, such as the tunnel effect.
The present invention overcomes the limitations and disadvantages described above by providing new methods, systems, and devices for speech transduction.
In accordance with some embodiments, a computer-implemented method of speech transduction is performed. The computer-implemented method includes receiving far-field acoustic data acquired by one or more microphones. The far-field acoustic data is analyzed. The far-field acoustic data is modified to reduce characteristics of the far-field acoustic data that are incompatible with human speech characteristics of near-field acoustic data.
In accordance with some embodiments, a computer system for speech transduction includes: one or more processors; memory; and one or more programs. The one or more programs are stored in the memory and configured to be executed by the one or more processors. The one or more programs include instructions for: receiving far-field acoustic data acquired by one or more microphones; analyzing the far-field acoustic data; and modifying the far-field acoustic data to reduce characteristics of the far-field acoustic data that are incompatible with human speech characteristics of near-field acoustic data.
In accordance with some embodiments, a computer readable storage medium has stored therein instructions, which when executed by a computing device, cause the device to: receive far-field acoustic data acquired by one or more microphones; analyze the far-field acoustic data; and modify the far-field acoustic data to reduce characteristics of the far-field acoustic data that are incompatible with human speech characteristics of near-field acoustic data.
Thus, the invention provides methods, systems, and devices with improved speech transduction that reduces the characteristics of far-field acoustic data that are incompatible with human speech characteristics of near-field acoustic data.
For a better understanding of the aforementioned aspects of the invention as well as additional aspects and embodiments thereof, reference should be made to the Description of Embodiments below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
Methods, systems, devices, and computer readable storage media for speech transduction are described. Reference will be made to certain embodiments of the invention, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the embodiments, it will be understood that it is not intended to limit the invention to these particular embodiments alone. On the contrary, the invention is intended to cover alternatives, modifications and equivalents that are within the spirit and scope of the invention as defined by the appended claims.
Moreover, in the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that the invention may be practiced without these particular details. In other instances, methods, procedures, components, and networks that are well-known to those of ordinary skill in the art are not described in detail to avoid obscuring aspects of the present invention.
Speech transduction devices 1040 can be any of a number of devices (e.g., hearing aid, speaker phone, telephone handset, cellular telephone handset, microphone, voice amplification system, videoconferencing system, audio-instrumented meeting room, audio recording system, voice recognition system, toy or robot, voice-over-internet-protocol (VOIP) phone, teleconferencing phone, internet kiosk, personal digital assistant, gaming device, desktop computer, or laptop computer) used to enable the activities described below. Speech transduction device 1040 typically includes a microphone 1080 or similar audio inputs, a loudspeaker 1100 or similar audio outputs (e.g., headphones), and a network interface 1120. In some embodiments, speech transduction device 1040 is a client of speech transduction server 1020, as illustrated in
Speech transduction server 1020 is a server computer that may be used to process acoustic data for speech transduction. Speech transduction server 1020 may be located with one or more speech transduction devices 1040, remote from one or more speech transduction devices 1040, or anywhere else (e.g., at the facility of a speech transduction services provider that provides services for speech transduction).
Communication network(s) 1060 may include wired communication networks (for example, networks communicating through phone lines, power lines, cable lines, or any combination thereof), wireless communication networks (for example, networks communicating in accordance with one or more wireless communication protocols, such as IEEE 802.11 protocols, time-division multiple access (TDMA), code-division multiple access (CDMA), global system for mobile communications (GSM) protocols, WiMAX protocols, or any combination thereof), or any combination of such wired and wireless communication networks. Communication network(s) 1060 may be the Internet, other wide area networks, local area networks, metropolitan area networks, and the like.
Network Communication Module 2120 may include Audio Module 2140 that coordinates audio communications (e.g., conversations) between speech transduction devices 1040 or between speech transduction device 1040 and speech transduction server 1020. In some embodiments, the audio communications between speech transduction devices 1040 are performed in a manner that does not require the use of server 1020, such as via peer-to-peer networking.
Acoustic Data Analysis Module 2160 is adapted to analyze acoustic data. The Acoustic Data Analysis Module 2160 is further adapted to determine characteristics of the acoustic data that are incompatible with human speech characteristics of acoustic data.
Acoustic Data Synthesis Module 2180 is adapted to modify the acoustic data to reduce the characteristics of the acoustic data that are incompatible with human speech characteristics of acoustic data. In some embodiments, Acoustic Data Synthesis Module 2180 is further adapted to convert the modified far-field acoustic data to produce an output waveform.
Voice Model Library 2200 contains two or more Voice Models 2220. Voice Model 2220 includes human speech characteristics for segments of sounds, and characteristics that span multiple segments (e.g., the rate of change of formant frequencies). A segment is a short frame of acoustic data, for example of 15-20 milliseconds duration. In some embodiments, multiple frames may partially overlap one another, for example by 25%. Human speech characteristics that may be included in a voice model are listed in Table 1.
In some embodiments, the human speech characteristics include at least one pitch. Pitch can be determined by well-known methods, for example, autocorrelation. In some embodiments, the maximum, minimum, mean, and/or standard deviation of the pitch across multiple segments are calculated.
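As an illustrative, non-limiting sketch (not part of the disclosed embodiments), segmentation and autocorrelation-based pitch estimation might be implemented as follows; the 16 kHz sample rate, 20 ms frame length, 25% overlap, and 60-400 Hz pitch search range are assumptions chosen for the example.

```python
import numpy as np

def segment(signal, sr, frame_ms=20, overlap=0.25):
    """Split acoustic data into short, partially overlapping frames."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def pitch_autocorrelation(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate pitch (Hz) of one frame from the peak of its autocorrelation."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)      # candidate lag range
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def pitch_statistics(signal, sr):
    """Pitch statistics across segments, as might be stored in a voice model."""
    pitches = np.array([pitch_autocorrelation(f, sr) for f in segment(signal, sr)])
    return {"min": pitches.min(), "max": pitches.max(),
            "mean": pitches.mean(), "std": pitches.std()}
```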
In some embodiments, the human speech characteristics include unvoiced consonant attack time and release time. The unvoiced consonant attack time and release time can be determined, for example by scanning over the near-field acoustic data. The unvoiced consonant attack time is the time difference between onset of high frequency sound and onset of voiced speech. The unvoiced consonant release time is the time difference between stopping of voiced speech and stopping of speech overall (in a quiet environment). The unvoiced consonant attack time and release time may be used in a noise reduction process, to distinguish between noise and unvoiced speech.
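One possible (assumed, simplified) way to approximate the unvoiced consonant attack time from near-field data is sketched below, reusing the segment() helper above; the 3 kHz band split, the energy threshold, and the autocorrelation-based voicing cue are illustrative choices, not features required by the embodiments.

```python
import numpy as np
from scipy.signal import butter, lfilter

def unvoiced_attack_time(signal, sr, split_hz=3000.0, thresh=0.05):
    """Approximate attack time: voiced-speech onset minus high-frequency-sound onset."""
    b, a = butter(4, split_hz / (sr / 2), btype="high")
    high = lfilter(b, a, signal)

    frames_hi = segment(high, sr)     # reuse the framing helper above
    frames_all = segment(signal, sr)
    hop_s = 0.015                     # seconds per hop: 20 ms frames, 25% overlap

    hi_energy = np.array([np.sum(f ** 2) for f in frames_hi])
    # crude voicing cue: normalized autocorrelation peak above a threshold
    voiced = np.array([np.max(np.correlate(f, f, "full")[len(f):]) /
                       (np.sum(f ** 2) + 1e-12) > 0.3 for f in frames_all])

    hf_onset = np.argmax(hi_energy > thresh * hi_energy.max())
    voiced_onset = np.argmax(voiced)
    return (voiced_onset - hf_onset) * hop_s
```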
In some embodiments, the human speech characteristics include formant filter coefficients and excitation (also called “excitation waveform”). In analysis and synthesis of speech, it is helpful to characterize acoustic data containing speech by its resonances, known as ‘formants’. Each ‘formant’ corresponds to a resonant peak in the magnitude of the resonant filter transfer function. Formants are characterized primarily by their frequency (of the peak in the resonant filter transfer function) and bandwidth (width of the peak). Formants are commonly referred to by number, in order of increasing frequency, using terms such as F1 for the frequency of formant #1. The collection of formants forms a resonant filter that when excited by white noise (in the case of unvoiced speech) or by a more complex excitation waveform (in the case of voiced speech) will produce an approximation to the speech waveform. Thus a speech waveform may be represented by the ‘excitation waveform’ and the resonant filter formed by the ‘formants’.
In some embodiments, the human speech characteristics include magnitudes of harmonics of the excitation waveform. The magnitude of the first harmonic of the excitation waveform is H1, and the magnitude of the second harmonic of the excitation waveform is H2. H1 and H2 can be determined, for example, by calculating the pitch of the excitation waveform and measuring the magnitude of a power spectrum of the excitation waveform at the first and second harmonic frequencies.
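A minimal sketch of measuring H1 and H2, assuming the pitch has already been estimated (e.g., by the autocorrelation routine above); the FFT length is an illustrative assumption.

```python
import numpy as np

def harmonic_magnitudes(excitation, sr, pitch_hz, nfft=4096):
    """Return (H1, H2): power-spectrum values at the pitch and at twice the pitch."""
    windowed = excitation * np.hanning(len(excitation))
    power = np.abs(np.fft.rfft(windowed, nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, d=1.0 / sr)
    h1 = power[np.argmin(np.abs(freqs - pitch_hz))]
    h2 = power[np.argmin(np.abs(freqs - 2.0 * pitch_hz))]
    return h1, h2
```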
In some embodiments, the human speech characteristics include ta and te, which are parameters in an LF-model (also called a glottal flow model with four independent parameters), as described in Fant et al., “A Four-Parameter Model of Glottal Flow,” STL-QPSR, 26(4): 1-13 (1985).
In some embodiments, Memory 2060 stores one Voice Model 2220 instead of a Voice Model Library 2200. In some embodiments, Voice Model Library 2200 is stored at another server remote from Speech Transduction Server 1020, and Memory 2060 includes a Voice Module Receiving Module that receives a Voice Model 2220 from the server remote from Speech Transduction Server 1020.
Each of the above identified modules and applications corresponds to a set of instructions for performing one or more functions described above. These modules (i.e., sets of instructions) need not be implemented as separate software programs, procedures or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 2060 may store a subset of the modules and data structures identified above. Furthermore, memory 2060 may store additional modules and data structures not described above.
Although
Network Communication Module 3120 may include Audio Module 3140 that coordinates audio communications (e.g., conversations) between speech transduction devices 1040 or between speech transduction device 1040 and speech transduction server 1020.
In some embodiments, Memory 3060 stores one Voice Model 2220 instead of a Voice Model Library 2200. In some embodiments, Voice Model Library 2200 is stored at another server remote from speech transduction device 1040, and Memory 3060 stores a Voice Module Receiving Module that receives a Voice Model 2220 from the server remote from speech transduction device 1040.
As illustrated schematically in
In some embodiments, prior to receiving far-field acoustic data acquired by one or more microphones, a voice model 2220 is created (4010). In some embodiments, the voice model 2220 is produced by a training algorithm that processes near-field acoustic data. In some embodiments, to produce a voice model, near-field acoustic data containing human speech is acquired. In some embodiments, the acquired near-field acoustic data is segmented into multiple segments, each segment consisting, for example, of 15-20 milliseconds of near-field acoustic data. In some embodiments, multiple segments may partially overlap one another, for example by 25%. Human speech characteristics are calculated for the segments. Some characteristics, such as formant frequency, are typically computed for each segment. Other characteristics that require examination of time-based trends, such as the rate of change of formant frequency, are typically computed across multiple segments. In some embodiments, the voice model 2220 includes maximum and minimum values of the human speech characteristics. In some embodiments, the created voice model 2220 is contained (4020) in a voice model library containing two or more voice models.
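A minimal sketch of this training step, reusing the segment() and pitch_autocorrelation() helpers from the earlier sketch; only pitch and frame energy are used here as stand-ins for the fuller characteristic set of Table 1, and the dictionary layout of the resulting model is an assumption, not a required format.

```python
import numpy as np

def extract_characteristics(frame, sr):
    """Per-segment characteristics; a real model would add formants, H1/H2, etc."""
    return {"pitch": pitch_autocorrelation(frame, sr),
            "energy": float(np.sum(frame ** 2))}

def train_voice_model(near_field, sr):
    """Build a voice model holding min/max (and mean/std) of each characteristic."""
    frames = segment(near_field, sr)              # 15-20 ms frames, ~25% overlap
    per_segment = [extract_characteristics(f, sr) for f in frames]
    model = {}
    for name in per_segment[0]:
        values = np.array([c[name] for c in per_segment])
        model[name] = {"min": values.min(), "max": values.max(),
                       "mean": values.mean(), "std": values.std()}
    # Cross-segment characteristic: rate of change of pitch between frames.
    pitches = np.array([c["pitch"] for c in per_segment])
    model["pitch_rate"] = {"min": np.diff(pitches).min(),
                           "max": np.diff(pitches).max()}
    return model
```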
A device (e.g., server 1020 or speech transduction device 1040-2) receives (4030) far-field acoustic data acquired by one or more microphones. For example, server 1020 may receive far-field acoustic data acquired by one or more microphones 1080 in a client speech transduction device (e.g., device 1040-1,
As used in the specification and claims, the one or more microphones 1080 acquire “far-field” acoustic data when the speaker generates speech at least a foot away from the nearest microphone among the one or more microphones. As used in the specification and claims, the one or more microphones acquire “near-field” acoustic data when the speaker generates speech less than a foot away from the nearest microphone among the one or more microphones.
The far-field acoustic data may be received in the form of electrical signals or logical signals. In some embodiments, the far-field acoustic data may be electrical signals generated by one or more microphones in response to an input sound, representing the sound over a period of time, as illustrated in
In some embodiments, the acquired far-field acoustic data is processed to reduce noise in the acquired far-field acoustic data (4040). There are many well-known methods to reduce noise in acoustic data. For example, the noise may be reduced by performing a multi-band spectral subtraction, as described in “Speech Enhancement: Theory and Practice” by Philipos C. Loizou, CRC Press (Boca Raton, Fla.), Jun. 7, 2007.
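A simplified, single-band spectral-subtraction sketch in the spirit of the multi-band method cited above (the cited Loizou text describes the full algorithm); estimating the noise from the first few frames and the over-subtraction floor are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, sr, noise_frames=10, floor=0.02):
    """Subtract an estimated noise magnitude spectrum, keeping the noisy phase."""
    f, t, Z = stft(noisy, fs=sr, nperseg=320)            # ~20 ms frames at 16 kHz
    mag, phase = np.abs(Z), np.angle(Z)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    cleaned = np.maximum(mag - noise_mag, floor * mag)    # spectral floor
    _, out = istft(cleaned * np.exp(1j * phase), fs=sr, nperseg=320)
    return out
```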
The far-field acoustic data (either as-received or after noise reduction) is analyzed (4050). The analysis of the far-field acoustic data includes determining (4060) characteristics of the far-field acoustic data that are incompatible with human speech characteristics of near-field acoustic data.
In some embodiments, a table containing human speech characteristics may be used to determine characteristics of the far-field acoustic data that are incompatible with human speech characteristics of near-field acoustic data. The table typically contains maximum and minimum values of human speech characteristics of near-field acoustic data. In some embodiments, the table receives the maximum and minimum values of human speech characteristics of near-field acoustic data, or other values of human speech characteristics of near-field acoustic data from a voice model 2220, as described below.
In some embodiments, the received far-field acoustic data is segmented into multiple segments, and characteristic values are calculated for each segment. For each segment, characteristic values are compared to the maximum and minimum values for corresponding characteristics in the table, and if at least one characteristic value of the far-field acoustic data does not fall within a range between the minimum and maximum values for that characteristic, the characteristic value of the far-field acoustic data is determined to be incompatible with human speech characteristics of near-field acoustic data. In some embodiments, a predefined number of characteristics that fall outside the range between the minimum and maximum values may be accepted as not incompatible with human speech characteristics of near-field acoustic data. In some other embodiments, the range used to determine whether the far-field acoustic data is incompatible with human speech characteristics of near-field acoustic data may be broader than between the minimum and maximum values. For example, the range may be between 90% of the minimum value and 110% of the maximum value. In some embodiments, the range may be determined based on the mean and standard deviation or variance of the characteristic value, instead of the minimum and maximum values.
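A sketch of this table-based check, assuming per-segment characteristics and a voice model with the layout of the training sketch above; widening the [min, max] range by 10% and the number of tolerated out-of-range characteristics are illustrative parameters.

```python
def incompatible_characteristics(segment_chars, model, tolerance=0.10, allowed=0):
    """Return names of characteristics that fall outside the widened model range."""
    out_of_range = []
    for name, value in segment_chars.items():
        if name not in model:
            continue
        lo = model[name]["min"] * (1.0 - tolerance)
        hi = model[name]["max"] * (1.0 + tolerance)
        if not (lo <= value <= hi):
            out_of_range.append(name)
    # Only flag the segment if more than `allowed` characteristics are out of range.
    return out_of_range if len(out_of_range) > allowed else []
```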
In a related example, the table may contain frequencies generated in human speech. The maximum frequency may be, for example 500 Hz, and the minimum frequency may be, for example 20 Hz. If any segment of the far-field acoustic data contains any sound of frequency 500 Hz or above, such sound is determined to be incompatible with human speech characteristics.
In some embodiments, multivariate methods can be used to determine (4060) characteristics of the far-field acoustic data that are incompatible with human speech characteristics of near-field acoustic data. For example, least squares fits of the characteristic values or their power, Euclidean distance or logarithmic distance among the characteristic values, and so forth can be used to determine characteristics incompatible with human speech characteristics of near-field acoustic data.
The received far-field acoustic data is modified (4070) to reduce the characteristics of the far-field acoustic data that are incompatible with human speech characteristics of near-field acoustic data.
In some embodiments, if the far-field acoustic data contains sound that is not within the frequency range of human speech (e.g., a high frequency metal grinding sound), a band-pass filter or low-pass filter well-known in the field of signal processing may be used to reduce the high frequency metal grinding sound.
In some embodiments, when the pitch of speech in the far-field acoustic data is too high, the far-field acoustic data is stretched in time to lower the pitch. Conversely, when the pitch of speech in the far-field acoustic data is too low, the far-field acoustic data may be compressed in time to raise the pitch.
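A sketch of the two modifications just described, assuming SciPy is available: a low-pass filter to attenuate sound above the expected speech band, and resampling-based stretching/compression to move the pitch toward a target. Resampling changes pitch and duration together; a phase-vocoder approach that preserves duration would be an alternative design choice not specified here.

```python
from scipy.signal import butter, lfilter, resample

def suppress_out_of_band(signal, sr, cutoff_hz=4000.0):
    """Low-pass filter to attenuate sounds above the expected speech band."""
    b, a = butter(6, cutoff_hz / (sr / 2), btype="low")
    return lfilter(b, a, signal)

def shift_pitch_by_resampling(frame, measured_pitch, target_pitch):
    """Stretch (pitch down) or compress (pitch up) a frame toward the target pitch."""
    ratio = measured_pitch / target_pitch        # >1 stretches, <1 compresses
    return resample(frame, int(round(len(frame) * ratio)))
```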
In some embodiments, the far-field acoustic data is modified (4080) in accordance with one or more speaker preferences. For example, a speaker may be speaking in a noisy environment and may want to perform additional noise reduction. In some embodiments, a speaker may provide a type of environment (e.g., via preference control settings on the device 1040) and the additional noise reduction may be tailored for the type of environment. For example, a speaker may be driving, and the speaker may activate a preference control on the device 1040 to reduce noise associated with driving. The noise reduction may use a band-pass filter to reduce low-frequency noise, such as noise from the engine and the road, and high-frequency noise, such as wind noise.
In some embodiments, the far-field acoustic data is modified (4090) in accordance with one or more listener preferences. Such listener preferences may include emphasis/avoidance of certain frequency ranges, and introduction of spatial effects. For example, a listener may have a surround speaker system 1100, and may want to make the sound emitted from the one or more speakers sound like the speaker is speaking from a specific direction. In another example, a listener may want to make a call sound like a whisper so as not to disturb other people in the environment.
In some embodiments, the modified far-field acoustic data is converted (4100) to produce an output waveform. In some embodiments, the modified far-field acoustic data include mathematical equations, an index to an entry in a database (such as a voice model library), or values of human speech characteristics. Therefore, converting (4100) the modified far-field acoustic data includes processing such data to synthesize an output waveform that a listener would recognize as human speech.
For example, when the modified far-field acoustic data includes a vocal tract excitation and a formant, converting the modified far-field acoustic data to produce an output waveform requires mathematically calculating the convolution of the vocal tract excitation and the formant filter. In some other embodiments, the modified far-field acoustic data exists in the form of a waveform, similar to the example shown in
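A sketch of this conversion step for the excitation-plus-formant representation, assuming the formants are held as all-pole (LPC-style) filter coefficients; passing the excitation through that filter is equivalent to convolving it with the filter's impulse response. The white-noise path corresponds to unvoiced speech as described above.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_from_excitation(excitation, formant_coeffs, gain=1.0):
    """Filter the excitation through the all-pole formant filter 1/A(z)."""
    # formant_coeffs = [1, a1, a2, ...] from an LPC-style analysis (assumed layout)
    return gain * lfilter([1.0], formant_coeffs, excitation)

def synthesize_unvoiced(num_samples, formant_coeffs, gain=1.0, seed=0):
    """Unvoiced speech: excite the same formant filter with white noise."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(num_samples)
    return gain * lfilter([1.0], formant_coeffs, noise)
```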
In some embodiments, the output waveform is modified (4110) in accordance with one or more speaker preferences. In some embodiments, this modification is performed in a manner similar to modifying (4080) the far-field acoustic data in accordance with one or more speaker preferences. In some embodiments, the output waveform is modified (4120) in accordance with one or more listener preferences. In some embodiments, this modification is performed in a manner similar to modifying (4090) the far-field acoustic data in accordance with one or more listener preferences.
In some embodiments, when the synthesis is performed at a speech transduction server 1020, the output waveform may be sent to a speech transduction device 1040 for output via a loudspeaker 1100. In some embodiments, when the synthesis is performed at a speech transduction device 1040, the output waveform may be output from a loudspeaker 1100.
In some embodiments, the modified far-field acoustic data is sent to a remote device (4130). For example, the modified far-field acoustic data may be sent from a speech transduction server 1020 to a speech transduction device 1040, where the modified far-field acoustic data may be converted to an output waveform (e.g., by loudspeaker 1100 on device 1040).
In some embodiments, the far-field acoustic data is analyzed (4130) based on a voice model that includes human speech characteristics. In some embodiments, the human speech characteristics include (4220) at least one pitch. A respective pitch represents a frequency of sound generated by a speaker while the speaker pronounces a segment of a predefined word. As described above, the voice model may include maximum and minimum values of human speech characteristics, which may be used to determine characteristics of far-field acoustic data that are incompatible with human speech characteristics of near-field acoustic data.
In some embodiments, the voice model is selected (4140) from two or more voice models contained in a voice model library. In some embodiments, the selected voice model is created (4150) from one identified speaker. For example, Speaker A may create a voice model based on Speaker A's speech, and name the voice model as “Speaker A's voice model.” Speaker A knows that the “Speaker A's voice model” was created from Speaker A, an identified speaker, because Speaker A created the voice model and because the voice model is named as such.
In some embodiments, when Speaker A is speaking, it is preferred that Speaker A's voice model is used. Therefore, in some embodiments, the voice model is selected (4180) at least partially based on an identity of a speaker. For example, if Speaker A's identity can be determined, Speaker A's voice model will be used. In some embodiments, the speaker provides (4190) the identity of the speaker. For example, like a computer log-in screen, a phone may have multiple user login icons, and Speaker A would select an icon associated with Speaker A. In some other embodiments, several factors, such as the time of phone use, location, Internet protocol (IP) address, and a list of potential speakers, may be used to determine the identity of the speaker.
In some embodiments, the voice model is selected (4200) at least partially based on matching the far-field acoustic data to the voice model. For example, if the pitch of a child's voice never goes below 200 Hz, a voice model is selected in which the pitch does not go below 200 Hz. In some embodiments, similar to the method of identifying characteristics of the far-field acoustic data that are incompatible with human speech characteristics of the near-field acoustic data, characteristics of the far-field acoustic data are calculated, and a voice model whose characteristics match the characteristics of the far-field acoustic data is selected. Exemplary methods of matching the characteristics of the far-field acoustic data and the characteristics of voice models include the table-based comparison as described with reference to determining the incompatible characteristics and multivariate methods described above.
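A sketch of model selection by matching, assuming each library entry stores the same characteristic statistics as the training sketch above; Euclidean distance over mean characteristic values is one of the multivariate options mentioned earlier, chosen here for brevity.

```python
import numpy as np

def select_voice_model(far_field_chars, voice_model_library):
    """Pick the library model whose characteristic means are closest (Euclidean)."""
    best_name, best_dist = None, np.inf
    for name, model in voice_model_library.items():
        shared = [k for k in far_field_chars if k in model]
        dist = np.sqrt(sum((far_field_chars[k] - model[k]["mean"]) ** 2
                           for k in shared))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```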
In some embodiments, the selected voice model is created (4160) from a category of human population. In some embodiments, the category of human population includes (4170) male adults, female adults, or children. In some embodiments, the category of human population includes people from a particular geography, such as North America, South America, Europe, Asia, Africa, Australia, or the Middle-East. In some embodiments, the category of human population includes people from a particular region in the United States with a distinctive accent. In some embodiments, the category of human population may be based on race, ethnic background, age, and/or gender.
In some embodiments, the far-field acoustic data is analyzed at a speech transduction device 1040 (e.g., hearing aid, speaker phone, telephone handset, cellular telephone handset, microphone, voice amplification system, videoconferencing system, audio-instrumented meeting room, audio recording system, voice recognition system, toy or robot, voice-over-internet-protocol (VOIP) phone, teleconferencing phone, internet kiosk, personal digital assistant, gaming device, desktop computer, or laptop computer), and the voice model library 2200 is located at a server 1020 remote from the speech transduction device. In some embodiments, the speech transduction device 1040 receives the voice model 2220 from the voice model library 2200 at the server 1020 remote from the speech transduction device 1040 when the speech transduction device 1040 selects the voice model.
Formants of the emphasized far-field acoustic data are estimated (5040), and excitations of the emphasized far-field acoustic data are estimated (5050). Methods for estimating formants and excitations are known in the field. For example, the formants and excitations can be estimated by a linear predictive coding (LPC) method. See Makhoul, “Linear Prediction, A Tutorial Review”, Proceedings of the IEEE, 63(4): 561-580 (1975). Also, a computer program to perform the LPC method is commercially available. See lpc function in Matlab Signal Processing Toolbox (MathWorks, Natick, Mass.).
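A sketch of the LPC step using the autocorrelation method, solved as a Toeplitz system rather than via the cited MATLAB lpc function; the model order of 12 is a common rule-of-thumb assumption for 8-16 kHz speech, and the excitation is recovered as the prediction residual.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coefficients(frame, order=12):
    """Autocorrelation-method LPC: returns [1, a1, ..., a_order]."""
    frame = frame * np.hanning(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), -r[1:])      # solve R a = -r
    return np.concatenate(([1.0], a))

def estimate_excitation(frame, coeffs):
    """Excitation (residual) = frame passed through the inverse filter A(z)."""
    return lfilter(coeffs, [1.0], frame)

def formant_frequencies(coeffs, sr):
    """Formant frequency estimates from the angles of the poles of 1/A(z)."""
    roots = np.roots(coeffs)
    roots = roots[np.imag(roots) > 0]                 # keep one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)
    return np.sort(freqs)
```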
The estimated excitation is modified (5060). In some embodiments, the estimated excitation is compared to excitations stored in a voice model. If a matching excitation is found in the voice model, the matching excitation from the voice model is used in place of the estimated excitation. In some embodiments, matching the estimated excitation to the excitation stored in a voice model depends on the estimated formants. For example, a record is selected within the voice model that contains formants to which the estimated formants are a close match. Then the estimated excitation is updated to more closely match the excitation stored in that voice model record. In some embodiments, the matched excitation stored in the selected voice model record is stretched or compressed so that the pitch of the excitation from the library matches the pitch of the far-field acoustic data.
The estimated formants are modified (5070). In some embodiments, the estimated formants are modified in accordance with a Steiglitz-McBride method. For example, see Steiglitz and McBride, “A Technique for the Identification of Linear Systems,” IEEE Transactions on Automatic Control, pp. 461-464 (October 1965). In some embodiments, a parameterized model, such as the LF-model described in Fant et al., is fit to the low-pass filtered excitation. The LF-model fit is used for modifying the estimated formants. An initial error is calculated as follows:
(Initial error)=[(LF-model fit)×(initially estimated formant)×(initially estimated formant)]−[(emphasized far-field acoustic data)×(initially estimated formant)],
where × indicates convolution.
Having determined the initial error, the formant coefficients are adjusted in a linear solver to minimize the magnitude of the error. Once the formant coefficients are adjusted, the adjusted formant is used to recalculate the error (termed the “iterated error”) as follows:
(Iterated error)=[(LF-model fit)×(initially estimated formant)×(adjusted formant)]−[(emphasized far-field acoustic data)×(adjusted formant)],
where × indicates convolution.
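The error defined above is linear in the formant coefficients, so the "linear solver" step can be expressed as a constrained least-squares problem. The sketch below treats the formant filter as a coefficient vector (leading coefficient fixed at 1) and maps each × term onto np.convolve; it is a simplified stand-in for the cited Steiglitz-McBride procedure, and the least-squares solve is one plausible solver choice, not the only one.

```python
import numpy as np

def convolution_matrix(x, n_cols):
    """Matrix C such that C @ h == np.convolve(x, h) for h of length n_cols."""
    C = np.zeros((len(x) + n_cols - 1, n_cols))
    for j in range(n_cols):
        C[j:j + len(x), j] = x
    return C

def refine_formant(lf_fit, formant_init, emphasized_data):
    """Adjust formant coefficients to minimize the error defined above."""
    order = len(formant_init)
    left = convolution_matrix(np.convolve(lf_fit, formant_init), order)
    right = convolution_matrix(np.asarray(emphasized_data, dtype=float), order)
    n = min(len(left), len(right))
    M = left[:n] - right[:n]                      # error(f) = M @ f
    initial_error = M @ np.asarray(formant_init, dtype=float)
    # Constrain the leading coefficient to 1 and least-squares solve for the rest.
    tail, _, _, _ = np.linalg.lstsq(M[:, 1:], -M[:, 0], rcond=None)
    adjusted = np.concatenate(([1.0], tail))
    iterated_error = M @ adjusted                 # error recomputed with the adjusted formant
    return adjusted, initial_error, iterated_error
```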
The modified formants may be further processed, for example via pole reflection, or additional shaping.
The modified formants and estimated excitation are convolved to synthesize a waveform (5080). The waveform is again emphasized (5090) to produce (5100) an output waveform.
The speech transduction system 600 further includes voice model library 650 configured to store the new voice model 630. In some embodiments, the voice model library 650 contains personalized models of the voice of each speaker as the speaker's voice would sound under ideal conditions. In some embodiments, the voice model library 650 generates personalized speech models through automatic analysis and categorization of a speaker's voice. In some embodiments, the speech transduction system 600 includes tools for modifying the models in the voice model library 650 to suit the preferences of the person speaking, e.g., to smooth a raspy voice, etc.
The voice model library 650 may be stored in various locations. In some embodiments, the voice model library 650 is stored within a telephone network. In some embodiments, it is stored at the listener's phone handset. In some embodiments, the voice model library 650 is stored within the speaker's phone handset. In some embodiments, the voice model library 650 is stored within a computer network that is operated independently of the telephone network, i.e., a third party service provider.
A conversation microphone 660 captures far-field sound waves (in other words, far-field acoustic data) of the current speaker and transmits the far-field acoustic data to a sound device 670. In some embodiments, the sound device 670 may be a hearing aid, a speaker phone or audio-instrumented meeting room, a videoconferencing system, a telephone handset, including a cell phone handset, a voice amplification system, an audio recording system, voice recognition system, or even a children's toy.
A model selection module 640 is coupled to the sound device 670 and the voice model library 650. The model selection module 640 accommodates multiple users of the sound device 670, such as a cellular telephone, by selecting which personalized voice model from the voice model library 650 to use with the current speaker. This model selection module 640 may be as simple as a user selection from a menu/sign-in, or may involve more sophisticated automatic speaker-recognition techniques.
A voice replicator 680 is also coupled to the sound device 670 and the voice model library 650. The voice replicator 680 is configured to produce a resulting sound that is a replica of the speaker's voice in good acoustic conditions 690. As shown in
The parameter estimation module 682 analyzes the acoustic data. The parameter estimation module 682 matches the acoustic data acquired by one or more microphones to the stored model of the speaker's voice. The parameter estimation module 682 outputs an annotated waveform. In some embodiments, the annotated waveform is transmitted to the model selection module 640 for automatic identification of the speaker and selection of the personalized voice model of the speaker.
The synthesis module 684 constructs a rendition of the speaker's voice based on the voice model 630 and on the acquired far-field acoustic data. The resulting sound is a replica of the speaker's voice in good conditions 690 (e.g., the speaker's voice sounds as if the speaker was speaking into a near-field microphone).
In some embodiments, the speech transduction system 600 also includes a modifying function that tailors the synthesized speech to the preferences of the speaker and/or listener.
Each of the methods described herein may be governed by instructions that are stored in a computer readable storage medium and that are executed by one or more processors of one or more servers or clients. Each of the operations shown in
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated.
This application claims priority to U.S. Provisional Application No. 60/959,443, filed on Jul. 13, 2007, which application is incorporated by reference herein in its entirety.