Modern computing applications may capture and play back audio of a user's speech. Such applications include videoconferencing applications, multi-player gaming applications, and audio messaging applications. The audio often suffers from poor quality both at capture and at playback.
Typically, a microphone used to capture speech audio for a computing application is built into a user device, such as a smartphone, tablet, or notebook computer. These microphones capture low-quality audio which exhibits, for example, low signal-to-noise ratios and low sampling rates. Even off-board, consumer-grade microphones provide poor-quality audio when used in a typical audio-unfriendly physical environment.
High-quality speech audio, if captured, may also present problems. High-quality audio consumes more memory and requires more transmission bandwidth than low-quality audio, and therefore may negatively affect system performance or consume an unsuitable amount of resources. On playback, even high-quality audio may fail to integrate suitably with the hardware, software and physical environment in which the audio is played.
Systems are desired to efficiently provide suitable speech audio to computing applications.
The following description is provided to enable any person in the art to make and use the described embodiments. Various modifications, however, will remain apparent to those in the art.
Embodiments described herein provide a technical solution to the technical problem of inefficient and poor-quality audio transmission and playback in a computing environment. According to some embodiments, clear speech audio is generated by a trained network based on input text (or speech audio) and is processed based on the context of its sending and/or receiving environment prior to playback. Some embodiments conserve bandwidth by transmitting text data between remote sending and receiving systems and converting the text data to speech audio at the receiving system.
Embodiments may generate speech audio of a quality which is not limited by the quality of the capturing microphone or environment. Processing of the generated speech audio may reflect speaker placement, room response, playback hardware and/or any other suitable context information.
System 100 includes microphone 105 located within physical environment 110. Microphone 105 may comprise any system for capturing audio signals, and may be separate from or integrated with a computing system (not shown) to any degree as is known. Physical environment 110 represents the acoustic environment in which microphone 105 resides, and which affects the sonic properties of audio acquired by microphone 105. In one example, physical properties of environment 110 may generate echo which affects the speech audio captured by microphone 105.
According to the illustrated example, speech-to-text component 115 converts the speech audio signals captured by microphone 105 into text data.
“Text data” as referred to herein may comprise ASCII data or any other type of data for representing text. The text data may comprise another form of coding, such as a language-independent stream of phoneme descriptions including pitch information, or another binary format that is not intended to be human-readable. The text data may include indications of prosody, inflection, and other vocal characteristics that convey meaning but fall outside of a simple word-based format. Generally, speech-to-text component 115 may be considered to “encode” or “compress” the received audio signals into the desired text data transmission format.
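As one concrete illustration of such a format, the text data could be a compact, language-independent stream of phoneme descriptions. The sketch below is illustrative only; the PhonemeEvent fields and JSON serialization are assumptions made for the example and are not required by any embodiment.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PhonemeEvent:
    """One unit of a hypothetical language-independent text-data stream."""
    phoneme: str        # e.g., an IPA symbol or other phoneme identifier
    duration_ms: float  # intended duration of the phoneme
    pitch_hz: float     # fundamental frequency, carrying prosody and inflection
    stress: int = 0     # coarse emphasis marker (0 = none, 1 = stressed)

def encode_stream(events):
    """Serialize the phoneme stream into a compact payload for transmission."""
    return json.dumps([asdict(e) for e in events]).encode("utf-8")

# Example: the word "hi" spoken with a rising pitch.
payload = encode_stream([
    PhonemeEvent("h", 60.0, 120.0),
    PhonemeEvent("aɪ", 180.0, 150.0, stress=1),
])
```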
Speech-to-text component 115 may comprise any system for converting audio to text that is or becomes known. Component 115 may comprise a trained neural network deployed on a computing system to which microphone 105 is coupled. In another example, component 115 may comprise a Web Service which is called by a computing system to which microphone 105 is coupled.
The text data generated by speech-to-text component 115 is provided to text-to-speech component 120 via network 125. Network 125 may comprise any combination of public and/or private networks implementing any protocols and/or transmission media, including but not limited to the Internet. According to some embodiments, text-to-speech component 120 is remote from speech-to-text component 115 and the components communicate with one another over the Internet, with or without the assistance of an intermediate Web server. The communication may include data in addition to the illustrated text data. More-specific usage examples of systems implementing some embodiments will be provided below.
Text-to-speech component 120 generates speech audio based on the received text data. The particular system used to generate the speech audio depends upon the format of the received text data. Text-to-speech component 120 may be generally considered a decoder counterpart to the encoder of speech-to-text component 115, although the intent of text-to-speech component 120 is not to reproduce the audio signals which were encoded by speech-to-text component 115.
In the illustrated example, text-to-speech component 120 may utilize trained model 130 to generate the speech audio. Trained model 130 may comprise, in some embodiments, a Deep Neural Network (DNN) such as WaveNet which has been trained to generate speech audio from input text as is known in the art.
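By way of a hedged sketch, inference against trained model 130 might resemble the following PyTorch-style code; the model object, its input of text/phoneme IDs, and its waveform output are assumptions for illustration rather than details of any particular embodiment.

```python
import torch

def synthesize(model: torch.nn.Module, text_ids: torch.Tensor, sample_rate: int = 24000):
    """Run a trained text-to-speech network (standing in for trained model 130).

    The model is assumed to map a sequence of text or phoneme IDs to a mono
    waveform tensor; real systems (e.g., WaveNet-style DNNs) differ in detail.
    """
    model.eval()
    with torch.no_grad():
        waveform = model(text_ids.unsqueeze(0))  # shape: (1, num_samples)
    return waveform.squeeze(0), sample_rate
```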
According to some embodiments, the text data may be in a first language and be translated into a second language prior to reception by text-to-speech component 120. Text-to-speech component 120 then outputs speech audio in the second language based on trained model 130, which has preferably been trained based on speech audio and text of the second language.
Playback control component 135 processes the speech audio output by text-to-speech component 120 to reflect any desirable playback context information 140. Playback context information 140 may include reproduction characteristics of headset (i.e., loudspeaker) 145 within playback environment 150, an impulse response of playback environment 150, an impulse response of recording environment 110, spatial information associated with microphone 105 within recording environment 110 or associated with a virtual position of microphone 105 within playback environment 150, signal processing effects intended to increase perception of the particular audio signal output by component 120, and any other context information.
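One way to picture playback context information 140 is as a bundle of optional fields such as the following; the field names are illustrative assumptions, not a prescribed structure.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple
import numpy as np

@dataclass
class PlaybackContext:
    """Illustrative container for playback context information 140."""
    headset_response: Optional[np.ndarray] = None   # reproduction characteristics of headset 145
    playback_room_ir: Optional[np.ndarray] = None   # impulse response of playback environment 150
    recording_room_ir: Optional[np.ndarray] = None  # impulse response of recording environment 110
    mic_position: Optional[Tuple[float, float, float]] = None  # real or virtual position of microphone 105
    effects: list = field(default_factory=list)     # additional signal-processing effects to apply
```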
In some embodiments, the speech audio generated by component 120 is agnostic of acoustic environment and includes substantially no environment-related reverberations. This characteristic allows playback control 135 to apply virtual acoustics to the generated speech audio with more perceptual accuracy than otherwise. Such virtual acoustics include a virtualization of a specific room (i.e., a room model) and virtualized audio equipment such as an equalizer, a compressor, or a reverberator. The aforementioned room model may represent, for example, an “ideal” room for different contexts such as a meeting, solo work requiring concentration, and group work.
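Because the synthesized speech contains essentially no reverberation of its own, a room model can be imposed simply by convolving the dry signal with an impulse response of the desired (real or “ideal”) room. A minimal sketch using SciPy, assuming the impulse response is sampled at the same rate as the speech:

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_room_model(dry_speech: np.ndarray, room_impulse_response: np.ndarray) -> np.ndarray:
    """Impose a virtual room on environment-agnostic synthesized speech.

    The convolution adds the reverberant character of the modeled room; the
    result is renormalized to avoid clipping on playback.
    """
    wet = fftconvolve(dry_speech, room_impulse_response, mode="full")
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet
```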
Playback context information 140 may also include virtual acoustic events to be integrated into the generated speech audio. Interactions between the generated speech audio and these virtual acoustic events can be explicitly crafted, as the generated speech audio can be engineered to interact acoustically with the virtual acoustic events (e.g., with support for acoustical perceptual cues such as frequency masking and the Doppler effect).
Some embodiments may therefore provide “clean” speech audio in real-time based on recorded audio, despite high levels of noise while recording, poor capture characteristics of a recording microphone, etc. Some embodiments also reduce the bandwidth required to transfer speech audio between applications while still providing high-quality audio to the receiving user.
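To make the bandwidth claim concrete, a rough back-of-the-envelope comparison is sketched below; the sampling rate, speaking rate, and word length are assumptions chosen only for illustration.

```python
# Illustrative bandwidth comparison (assumed, not measured, figures).
audio_bits_per_second = 16_000 * 16 * 1       # 16 kHz, 16-bit, mono PCM = 256 kbit/s
words_per_minute = 150                        # typical conversational speaking rate
bytes_per_word = 6                            # roughly five characters plus a space
text_bits_per_second = words_per_minute / 60 * bytes_per_word * 8  # = 120 bit/s

print(audio_bits_per_second / text_bits_per_second)  # on the order of a 2000x reduction
```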
Initially, speech audio signals are received at S210. The speech audio signals may be captured by any system for capturing audio signals, for example microphone 105 described above. As also described above, the speech audio signals may be affected by the acoustic environment in which they are captured as well as the recording characteristics of the audio capture device. The captured speech audio signals may be received at S210 by a computing system intended to execute S220.
At S220, a text string is generated based on the received speech audio signals. S220 may utilize any speech-to-text system that is or becomes known. The generated text string may comprise any data format for representing text, including but not limited to ASCII data.
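A minimal sketch of S220 follows; the `transcribe` callable is a placeholder for whichever speech-to-text system is used and is not part of any particular embodiment.

```python
from typing import Callable
import numpy as np

def generate_text(speech_audio: np.ndarray,
                  transcribe: Callable[[np.ndarray], str]) -> str:
    """S220: generate a text string from the received speech audio signals.

    `transcribe` stands in for the speech-to-text functionality, e.g., a
    locally deployed trained network or a call to a remote Web Service.
    """
    text = transcribe(speech_audio)  # e.g., "see you at three"
    return text                      # ASCII or other text data, ready for transmission
```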
According to some embodiments, S210 and S220 are executed by a computing system operated by a first user intending to communicate with a second user via a communication application. In one example, the communication application is a Voice Over IP (VOIP) application. The communication application may comprise a videoconferencing application, a multi-player gaming application, or any other suitable application.
Next, at S230, speech audio signals are synthesized based on the text string. With respect to the above-described example of S210 and S220, the text string generated at S220 may be transmitted to the second user prior to S230. Accordingly, at S230, a computing system of the second user may operate to synthesize speech audio signals based on the text string. Embodiments are not limited thereto.
The speech audio signals may be synthesized at S230 using any system that is or becomes known. According to some embodiments, S230 utilizes a trained model 130 to synthesize speech audio signals based on the input text string.
Network 310 is trained using training text 320, ground truth speech 330, and loss layer 340. Embodiments are not limited to the architecture of system 300. Training text 320 includes sets of text strings, and ground truth speech 330 includes a speech audio file associated with each set of text strings of training text 320.
Generally, and according to some embodiments, network 310 may comprise a network of neurons which receive input, change internal state according to that input, and produce output depending on the input and internal state. The output of certain neurons is connected to the input of other neurons to form a directed and weighted graph. The weights as well as the functions that compute the internal state can be modified by a training process based on ground truth data. Network 310 may comprise any one or more types of artificial neural network that are or become known, including but not limited to convolutional neural networks, recurrent neural networks, long short-term memory networks, deep reservoir computing and deep echo state networks, deep belief networks, and deep stacking networks.
During training, network 310 receives each set of text strings of training text 320 and, based on its initial configuration and design, outputs a predicted speech audio signal for each set of text strings. Loss layer component 340 determines a loss by comparing each predicted speech audio signal to the ground truth speech audio signal which corresponds to its input text string.
A total loss is determined based on all of the determined losses. The total loss may comprise an L1 loss, an L2 loss, or any other suitable measure of total loss. The total loss is back-propagated from loss layer component 340 to network 310, which changes its internal weights in response thereto as is known in the art. The process repeats until it is determined that the total loss has reached an acceptable level or training otherwise terminates. At this point, the now-trained network implements a function having a text string as input and an audio signal as output.
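For concreteness, one pass of the described training flow might look like the following PyTorch-style sketch; the network, the (text, ground truth audio) pairs, and the choice of an L1 loss are assumptions for illustration, not a prescribed recipe.

```python
import torch
from torch import nn, optim

def train(network: nn.Module, pairs, epochs: int = 10, lr: float = 1e-4):
    """Train a text-to-speech network on (text_ids, ground_truth_waveform) pairs.

    Mirrors the described flow: predict speech audio from text, compare the
    prediction to the ground truth in a loss layer, and back-propagate the loss.
    """
    loss_layer = nn.L1Loss()  # an L2 loss (nn.MSELoss) or another measure could be used instead
    optimizer = optim.Adam(network.parameters(), lr=lr)
    for _ in range(epochs):
        for text_ids, target_waveform in pairs:
            predicted = network(text_ids)                 # predicted speech audio signal
            loss = loss_layer(predicted, target_waveform)
            optimizer.zero_grad()
            loss.backward()                               # back-propagate the total loss
            optimizer.step()                              # adjust internal weights
```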
The synthesized speech audio is processed based on contextual information at S240. As described above with respect to playback context information 140, the contextual information may include reproduction characteristics of the playback hardware, an impulse response of the playback environment, an impulse response of the recording environment, spatial information, and/or any other suitable context information.
The processed speech audio is transmitted to a loudspeaker for playback at S250. The loudspeaker may comprise any one or more types of speaker systems that are or become known, and the processed signal may pass through any number of amplifiers or signal processors as is known in the art prior to arrival at the loudspeaker.
Playback control 470 is executed to process the synthesized speech audio signals based on playback context information 480. Playback context information 480 may include any context information described above, but is not limited thereto. As illustrated by a dotted line, context information for use by playback control 470 may be received from environment 410, perhaps along with the aforementioned text data. This context information may provide acoustic information associated with environment 410, position data associated with sender 420, or other information.
The processed audio may be provided to headset 490 which is worn by a receiving user (not shown). Some embodiments may include a video stream from environment 410 to environment 450 which allows the receiving user to view user 420.
Some embodiments may be used in conjunction with mixed-, augmented-, and/or virtual-reality systems.
Device 500 includes a speaker system for presenting spatialized sound and a display for presenting images to a wearer thereof. The images may completely occupy the wearer's field of view, or may be presented within the wearer's field of view such that the wearer may still view other objects in her vicinity. The images may be holographic.
Device 500 may also include sensors (e.g., cameras and accelerometers) for determining the position and motion of device 500 in three-dimensional space with six degrees of freedom. Data received from the sensors may assist in determining the size, position, orientation and visibility of images displayed to a wearer.
According to some embodiments, device 500 executes S230 through S250 of process 200.
Device 500 includes a wireless networking component to receive text data at S230. The text data may be received via execution of a communication application on device 500 and/or on a computing system to which device 500 is wirelessly coupled. The text data may have been generated based on remotely-recorded speech audio signals as described in the above examples, but embodiments are not limited thereto.
Device 500 also implements a trained network for synthesizing speech audio signals based on the received text data. The trained network may comprise parameters and/or program code loaded onto device 500 prior to S230, where it may reside until the communication application terminates.
As illustrated by a dotted line and as described above with respect to playback control 470, context information may also be received from the sending environment for use in processing the synthesized speech audio signals at S240.
A further example involves a user 720 whose speech is communicated to the wearer of device 500.
According to some embodiments, device 500 may also receive text data generated from speech audio of user 720 as described above. Device 500 may then execute S230 through S250 to synthesize speech audio signals based on the text data, process the synthesized speech audio signals based on contextual information (e.g., the sender context and the receiver context described above), and play back the resulting processed speech audio signals.
Context information of user 920 and of environment 910 may then be used to process speech audio signals synthesized by the trained network associated with user 920. Similarly, context information of user 940 and of environment 910 may be used to process speech audio signals synthesized by the trained network associated with user 940. As shown by speech bubbles 930 and 950, device 500 may play back the processed audio signals within a same user session of environment 910 such that they appear to the wearer to emanate from user 920 and user 940, respectively. It should be noted that devices operated by one or both of users 920 and 940 may similarly receive text data from device 500 and execute S230 through S250 to play back corresponding processed speech audio signals as described herein.
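As an illustrative sketch of per-speaker spatialization, the processed signal for each user could be panned according to that user's bearing from the device wearer; the simple constant-power stereo pan below stands in for the richer HRTF-based rendering a device such as device 500 would typically apply.

```python
import numpy as np

def pan_to_position(mono_speech: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Constant-power stereo pan so speech appears to emanate from a direction.

    azimuth_deg: -90 (hard left) through 0 (front) to +90 (hard right), e.g.,
    the bearing from the device wearer to user 920 or user 940.
    """
    angle = (np.clip(azimuth_deg, -90.0, 90.0) + 90.0) / 180.0 * (np.pi / 2)
    left, right = np.cos(angle), np.sin(angle)
    return np.stack([mono_speech * left, mono_speech * right], axis=0)
```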
Device 1210 may communicate with the application executed by system 1200 to provide recorded speech signals thereto, intended for a user of device 1220. As described above, system 1200 receives the speech audio signals, generates a text string based on the signals, and synthesizes speech audio signals based on the text string. System 1200 may process the signals using context information and provide the processed signals to device 1220 for playback. Device 1220 may further process the received speech signals prior to playback, for example based on context information local to device 1220.
System 1200 may support bi-directional communication between devices 1210 and 1220, and any other one or more computing systems. Each device/system may process and play back received speech signals as desired.
Each functional component described herein may be implemented at least in part in computer hardware, in program code and/or in one or more computing systems executing such program code as is known in the art. Such a computing system may include one or more processing units which execute processor-executable program code stored in a memory system.
The foregoing diagrams represent logical architectures for describing processes according to some embodiments, and actual implementations may include more or different components arranged in other manners. Other topologies may be used in conjunction with other embodiments. Moreover, each component or device described herein may be implemented by any number of devices in communication via any number of other public and/or private networks. Two or more of such computing devices may be located remote from one another and may communicate with one another via any known manner of network(s) and/or a dedicated connection. Each component or device may comprise any number of hardware and/or software elements suitable to provide the functions described herein as well as any other functions. For example, any computing device used in an implementation of a system according to some embodiments may include a processor to execute program code such that the computing device operates as described herein.
All systems and processes discussed herein may be embodied in program code stored on one or more non-transitory computer-readable media. Such media may include, for example, a hard disk, a DVD-ROM, a Flash drive, magnetic tape, and solid state Random Access Memory (RAM) or Read Only Memory (ROM) storage units. Embodiments are therefore not limited to any specific combination of hardware and software.
Those in the art will appreciate that various adaptations and modifications of the above-described embodiments can be configured without departing from the claims. Therefore, it is to be understood that the claims may be practiced other than as specifically described herein.