The disclosure relates to audio signal processing and, more specifically, to applying noise suppression to audio signals.
Wireless communication devices (e.g., mobile phones, smart phones, smart pads, laptops, tablets, etc.) may be used in noisy environments. For example, a mobile phone may be used at a concert, bar, or restaurant where environmental, background, or ambient noise introduced at a transmitter side reduces intelligibility and degrades speech quality at a receiver side. Wireless communication devices, therefore, typically incorporate noise suppression in a transmitter side audio pre-processor in order to reduce noise and clean up speech signals before presenting the speech signals to a vocoder for coding and transmission.
In the case where a user is talking on a transmitter side wireless communication device amidst music, or in the case where the user is attempting to capture the music itself for transmission to a receiver side device, the noise suppression treats the music signals as noise to be eliminated in order to improve intelligibility of any speech signals. The music signals, therefore, are suppressed and distorted by the noise suppression prior to bandwidth compression (e.g., encoding) and transmission such that a listener at the receiver side will hear a low quality recreation of the music signals at the transmitter side.
In general, this disclosure describes techniques for performing adaptive noise suppression to improve handling of both speech signals and music signals at least up to super wideband (SWB) bandwidths. The disclosed techniques include identifying a context or environment in which audio data is captured, and adaptively changing a level of noise suppression applied to the audio data prior to bandwidth compression (e.g., encoding) of the audio data based on the context. In the case that the audio data has a valid speech context (i.e., the user intends to primarily transmit speech signals), an audio pre-processor may set a first level of noise suppression that is relatively aggressive in order to suppress noise (including music) in the speech signals. In the case that the audio data has a valid music context (i.e., the user intends to primarily transmit music signals or both music and speech signals), the audio pre-processor may set a second level of noise suppression that is less aggressive in order to leave the music signals undistorted. In this way, a vocoder at a transmitter side wireless communication device may properly compress or encode both speech and music signals with minimal distortions.
In one example, this disclosure is directed to a device configured to provide voice and data communications, the device comprising one or more processors configured to obtain an audio context of input audio data, prior to application of a variable level of noise suppression to the input audio data, wherein the input audio data includes speech signals, music signals, and noise signals; apply the variable level of noise suppression to the input audio data prior to bandwidth compression of the input audio data with an audio encoder based on the audio context; and bandwidth compress the input audio data to generate at least one audio encoder packet. The device further comprises a memory, electrically coupled to the one or more processors, configured to store the at least one audio encoder packet, and a transmitter configured to transmit the at least one audio encoder packet.
In another example, this disclosure is directed to an apparatus capable of noise suppression comprising means for obtaining an audio context of input audio data, prior to application of a variable level of noise suppression to the input audio data, wherein the input audio data includes speech signals, music signals, and noise signals; means for applying a variable level of noise suppression to the input audio data prior to bandwidth compression of the input audio data with an audio encoder based on the audio context; means for bandwidth compressing the input audio data to generate at least one audio encoder packet; and means for transmitting the at least one audio encoder packet.
In a further example, this disclosure is directed to a method used in voice and data communications comprising obtaining an audio context of input audio data, during a conversation between a user of a source device and a user of a destination device, wherein music is playing in a background of the user of the source device, prior to application of a variable level of noise suppression to the input audio data from the user of the source device, and wherein the input audio data includes a voice of the user of the source device and the music playing in the background of the user of the source device; applying a variable level of noise suppression to the input audio data prior to bandwidth compression of the input audio data with an audio encoder based on the audio context including the audio context being speech or music, or both speech and music; bandwidth compressing the input audio data to generate at least one audio encoder packet, and transmitting the at least one audio encoder packet from the source device to the destination device.
The details of one or more aspects of the techniques are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the techniques will be apparent from the description and drawings, and from the claims.
This disclosure describes techniques for performing adaptive noise suppression to improve handling of both speech signals and music signals at least up to super wideband (SWB) bandwidths. Conventional noise suppression units included in audio pre-processors of wireless communication devices are configured to suppress any non-speech signals as noise in order to improve intelligibility of speech signals to be encoded. This style of noise suppression works well with vocoders configured to operate according to traditional speech codecs, such as adaptive multi-rate (AMR) or adaptive multi-rate wideband (AMRWB). These traditional speech codecs are capable of coding (i.e., encoding or decoding) speech signals at low bandwidths, e.g., using algebraic code-excited linear prediction (ACELP), but are not capable of coding high quality music signals. The recently standardized Enhanced Voice Services (EVS) codec is capable of coding speech signals as well as music signals up to super wideband bandwidths (i.e., 0-16 kHz) or even full band bandwidths (i.e., 0-24 kHz). Conventional noise suppression units, however, continue to suppress and distort music signals prior to encoding.
The techniques described in this disclosure include identifying a context or environment in which audio data (speech, music, or speech and music) is captured, and adaptively changing a level of noise suppression applied to the audio data prior to encoding of audio data based on the context. For example, in accordance with the disclosed techniques, a wireless communication device may include one or more of a speech-music (SPMU) classifier, a proximity sensor, or other detectors within a transmitter side audio pre-processor used to determine whether the audio data is captured in either a valid speech context or a valid music context.
In the case that the audio data has a valid speech context (i.e., the user intends to primarily transmit speech signals to engage in a conversation with a listener), the audio pre-processor may set a first level of noise suppression that is relatively aggressive in order to suppress noise (including music) before passing the speech signals to a vocoder for coding and transmission. In the case that the audio data has a valid music context (i.e., the user intends to primarily transmit music signals or both music and speech signals for a listener to experience), the audio pre-processor may set a second level of noise suppression that is less aggressive to allow undistorted music signals to pass to a vocoder for coding and transmission. In this way, a vocoder configured to operate according to the EVS codec at the transmitter side wireless communication device may properly encode both speech and music signals to enable complete recreation of an audio scene at a receiver side device with minimal distortions to SWB music signals.
Destination device 14 may receive the encoded audio data to be decoded via computer-readable medium 16. Computer-readable medium 16 may comprise any type of medium or device capable of moving the encoded audio data from source device 12 to destination device 14. In one example, computer-readable medium 16 may comprise a communication medium to enable source device 12 to transmit encoded audio data directly to destination device 14 in real-time. The encoded audio data may be modulated according to a communication standard, such as a wireless communication protocol, and transmitted to destination device 14. The communication medium may comprise any wireless or wired communication medium, such as a radio frequency (RF) spectrum or one or more physical transmission lines. The communication medium may form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet. The communication medium may include routers, switches, base stations, or any other equipment that may be useful to facilitate communication from source device 12 to destination device 14.
In some examples, encoded audio data may be output from source device 12 to a storage device (not shown). Similarly, encoded audio data may be accessed from the storage device by destination device 14. The storage device may include any of a variety of distributed or locally accessed data storage media such as a hard drive, Blu-ray discs, DVDs, CD-ROMs, flash memory, volatile or non-volatile memory, or any other suitable digital storage media for storing encoded audio data. In a further example, the storage device may correspond to a file server or another intermediate storage device that may store the encoded audio generated by source device 12. Destination device 14 may access stored audio data from the storage device via streaming or download. The file server may be any type of server capable of storing encoded audio data and transmitting that encoded audio data to the destination device 14. Example file servers include a web server (e.g., for a website), an FTP server, network attached storage (NAS) devices, or a local disk drive. Destination device 14 may access the encoded audio data through any standard data connection, including an Internet connection. This may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., DSL, cable modem, etc.), or a combination of both that is suitable for accessing encoded audio data stored on a file server. The transmission of encoded audio data from the storage device may be a streaming transmission, a download transmission, or a combination thereof.
The illustrated system 10 of
In the example of
In some examples, microphones 18 may be physically coupled to source device 12, or may communicate wirelessly with source device 12. To illustrate the wireless communication with source device 12,
In some examples, microphones 18 of source device 12 may include at least one microphone integrated into source device 12. In one example where source device 12 comprises a mobile phone, microphones 18 may include at least a “front” microphone positioned near a user's mouth to pick up the user's speech. In another example where source device 12 comprises a mobile phone, microphones 18 may include both a “front” microphone positioned near a user's mouth and a “back” microphone positioned at a backside of the mobile phone to pick up environmental, background, or ambient noise. In a further example, microphones 18 may comprise an array of microphones integrated into source device 12. In other examples, source device 12 may receive audio data from one or more external microphones via an audio interface, retrieve audio data from a memory or audio archive containing previously captured audio, or generate audio data itself. The captured, pre-captured, or computer-generated audio may be bandwidth compressed and encoded by audio encoder 20. The encoded audio data in at least one audio encoder packet may then be transmitted by TX 21 of source device 12 onto a computer-readable medium 16.
Computer-readable medium 16 may include transient media, such as a wireless broadcast or wired network transmission, or storage media (that is, non-transitory storage media), such as a hard disk, flash drive, compact disc, digital video disc, Blu-ray disc, or other computer-readable media. In some examples, a network server (not shown) may receive encoded audio data from source device 12 and provide the encoded audio data to destination device 14, e.g., via network transmission. Similarly, a computing device of a medium production facility, such as a disc stamping facility, may receive encoded audio data from source device 12 and produce a disc containing the encoded audio data. Therefore, computer-readable medium 16 may be understood to include one or more computer-readable media of various forms, in various examples.
Destination device 14 may receive, with RX 31, the encoded audio data in the at least one audio encoder packet from computer-readable medium 16 for decoding by audio decoder 30. Speakers 32 playback the decoded audio data to a user. Speakers 32 of destination device 14 may include at least one speaker integrated into destination device 14. In one example where destination device 14 comprises a mobile phone, speakers 32 may include at least a “front” speaker positioned near a user's ear for use as a traditional telephone. In another example where destination device 14 comprises a mobile phone, speakers 32 may include both a “front” speaker positioned near a user's ear and a “side” or “back” speaker positioned elsewhere on the mobile phone to facilitate use as a speaker phone. In a further example, speakers 32 may comprise an array of speakers integrated into destination device 14. In other examples, destination device 14 may send decoded audio data for playback on one or more external speakers via an audio interface. In this way, destination device 14 includes at least one of speakers 32 configured to render an output of audio decoder 30 configured to decode the at least one audio encoder packet received by destination device 14.
Audio encoder 20 and audio decoder 30 each may be implemented as any of a variety of suitable encoder circuitry, such as one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), discrete logic, software, hardware, firmware or any combinations thereof. When the techniques are implemented partially in software, a device may store instructions for the software in a suitable, non-transitory computer-readable medium and execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Each of audio encoder 20 and audio decoder 30 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (codec or vocoder) in a respective device.
In addition, source device 12 includes memory 13 and destination device 14 includes memory 15 configured to store information during operation. The integrated memory may include a computer-readable storage medium or computer-readable storage device. In some examples, the integrated memory may include one or more of a short-term memory or a long-term memory. The integrated memory may include, for example, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), magnetic hard discs, optical discs, floppy discs, flash memory, or forms of electrically programmable memory (EPROM) or electrically erasable and programmable memory (EEPROM). In some examples, the integrated memory may be used to store program instructions for execution by one or more processors. The integrated memory may be used by software or applications running on each of source device 12 and destination device 14 to temporarily store information during program execution.
In this way, source device 12 includes memory 13 electrically coupled to one or more processors and configured to store the at least one audio encoder packet, and transmitter 21 configured to transmit the at least one audio encoder packet over the air. As used herein, “coupled” may include “communicatively coupled,” “electrically coupled,” or “physically coupled,” and combinations thereof. Two devices (or components) may be coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) directly or indirectly via one or more other devices, components, wires, buses, networks (e.g., a wired network, a wireless network, or a combination thereof), etc. Two devices (or components) that are electrically coupled may be included in the same device or in different devices and may be connected via electronics, one or more connectors, or inductive coupling, as illustrative, non-limiting examples. In some implementations, two devices (or components) that are communicatively coupled, such as in electrical communication, may send and receive electrical signals (digital signals or analog signals) directly or indirectly, such as via one or more wires, buses, networks, etc. For example, memory 13 may be in electrical communication with the one or more processors of source device 12, which may include audio encoder 20 and pre-processor 22 executing noise suppression unit 24. As another example, memory 15 may be electrically coupled to one or more processors of destination device 14, which may include audio decoder 30.
In some examples, source device 12 and destination device 14 are mobile phones that may be used in noisy environments. For example, source device 12 may be used at a concert, bar, or restaurant where environmental, background, or ambient noise introduced at source device 12 reduces intelligibility and degrades speech quality at destination device 14. Source device 12, therefore, includes a noise suppression unit 24 within audio pre-processor 22 in order to reduce noise and improve (or, in other words, clean up) speech signals before presenting the speech signals to audio encoder 20 for bandwidth compression, coding, and transmission to destination device 14.
In general, noise suppression is a transmitter side technology that is used to suppress background noise captured by a microphone while a user is speaking in a transmitter side environment. Noise suppression should not be confused with active noise cancellation (ANC), which is a receiver side technology that is used to cancel any noise encountered in the receiver side environment. Noise suppression is performed during pre-processing at the transmitter side in order to prepare captured audio data for encoding. That is, noise suppression may reduce noise to permit more efficient compression to be achieved during encoding that results in smaller (in terms of size) encoded audio data in comparison to encoded audio data that has not been pre-processed using noise suppression. As such, noise suppression is not performed within audio encoder 20, but instead is performed in audio pre-processor 22 and the output of noise suppression in audio pre-processor 22 is the input to audio encoder 20, sometimes with other minor processing in between.
Noise suppression may operate in narrowband (NB) (i.e., 0-4 kHz), wideband (WB) (i.e., 0-7 kHz), super wideband (SWB) (i.e., 0-16 kHz) or full band (FB) (i.e., 0-24 kHz) bandwidths. For example, if the input audio data to noise suppression is SWB content, the noise suppression may process the audio data to suppress noise in all frequencies in the range 0-16 kHz, and the intended output is clean speech signals in the range 0-16 kHz. If the input audio data bandwidth is high, e.g., FB bandwidth, a fast Fourier transform (FFT) of the noise suppression may split the input audio data into more frequency bands and post processing gains may be determined and applied for each of the frequency bands. Later, an inverse FFT (IFFT) of the noise suppression may combine the audio data split among the frequency bands into a single output signal of the noise suppression.
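The per-band processing described above can be illustrated with a short sketch. The Python/NumPy code below is illustrative only; it assumes a single-channel 20 ms frame at a 32 kHz sample rate, and the band edges and per-band gains shown are placeholder values rather than the band layout or gain rule of any particular noise suppressor.

import numpy as np

def apply_band_gains(frame, band_edges_hz, band_gains, sample_rate=32000):
    """Apply one gain per frequency band to a time-domain frame via FFT/IFFT."""
    spectrum = np.fft.rfft(frame)                        # split the frame into frequency bins
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    for (lo, hi), gain in zip(band_edges_hz, band_gains):
        mask = (freqs >= lo) & (freqs < hi)              # bins belonging to this band
        spectrum[mask] *= gain                           # per-band post processing gain
    return np.fft.irfft(spectrum, n=len(frame))          # recombine bands into one output signal

# Example: a 20 ms frame of SWB (0-16 kHz) content sampled at 32 kHz.
frame = np.random.randn(640)
edges = [(0, 4000), (4000, 8000), (8000, 16000)]         # illustrative band split
gains = [0.9, 0.7, 0.5]                                  # illustrative gains per band
clean = apply_band_gains(frame, edges, gains)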
In the case where a user is talking on source device 12 amidst music, or in the case where the user is attempting to capture the music itself for transmission to destination device 14, conventional noise suppression during audio pre-processing treats the music signals as noise to be eliminated in order to improve intelligibility of the speech signals. The music signals, therefore, are suppressed and distorted by the conventional noise suppression prior to encoding and transmission such that a user listening at destination device 14 will hear a low quality recreation of the music signals.
Conventional noise suppression works well with vocoders configured to operate according to traditional speech codecs, such as adaptive multi-rate (AMR) or adaptive multi-rate wideband (AMRWB). These traditional speech codecs are capable of coding (i.e., encoding or decoding) speech signals at low bandwidths, e.g., using algebraic code-excited linear prediction (ACELP), but are not capable of coding high quality music signals. For example, the AMR and AMRWB codecs do not classify incoming audio data as speech content or music content, and encode accordingly. Instead, the AMR and AMRWB codecs treat all non-noise signals as speech content and code the speech content using ACELP. The quality of music coded according to the AMR or AMRWB codecs, therefore, is poor. In addition, the AMR codec is limited to audio data in the narrowband (NB) bandwidth (i.e., 0-4 kHz) and the AMRWB codec is limited to audio signals in the wideband (WB) bandwidth (i.e., 0-7 kHz). Most music signals, however, include significant content above 7 kHz, which is discarded by the AMR and AMRWB codecs.
The recently standardized Enhanced Voice Services (EVS) codec is capable of coding speech signals as well as music signals up to super wideband (SWB) bandwidths (i.e., 0-16 kHz) or even full band (FB) bandwidths (i.e., 0-24 kHz). In general, other codecs exist that are capable of coding music signals, but these codecs are not used or intended to also code conversational speech in a mobile phone domain (e.g., Third Generation Partnership Project (3GPP)), which requires low delay operation. The EVS codec is a low delay conversational codec that can also code in-call music signals at high quality (e.g., SWB or FB bandwidths).
The EVS codec, therefore, offers users the capability of transmitting music signals within a conversation, and recreating a rich audio scene present at a transmitter side device, e.g., source device 12, at a receiver side device, i.e., destination device 14. Conventional noise suppression during audio pre-processing, however, continues to suppress and distort music signals prior to encoding. Even in the case where the captured audio data includes music signals as the primary content at high signal-to-noise ratio (SNR) levels, rather than in the background, the music signals are highly distorted by the conventional noise suppression.
In the example of
At audio encoder 20, VAD 27 may output a 1 when the input audio data includes speech content and output a 0 when the input audio data includes non-speech content (such as music, tones, noise, etc.). SPMU classifier 26 determines whether audio data input to audio encoder 20 includes speech content, music content, or both speech and music content. Based on this determination, audio encoder 20 selects the best LB and HB encoding methods for the input audio data. Within LB encoding unit 28A, one encoding method is selected when the audio data includes speech content, and another encoding method is selected when the audio data includes music content. The same is true within HB encoding unit 28B. SPMU classifier 26 provides control input to LB encoding unit 28A and HB encoding unit 28B indicating which coding method should be selected within each of LB encoding unit 28A and HB encoding unit 28B. Audio encoder 20 may also communicate the selected encoding method to audio decoder 30 such that audio decoder 30 may select the corresponding LB and HB decoding methods to decode the encoded audio data.
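As an illustration of how the VAD flag and classifier output might steer mode selection, the following sketch maps a VAD flag and a speech/music label to placeholder low-band and high-band coding modes. The mode names are assumptions for illustration and do not reflect the actual EVS mode set or selection logic.

def select_coding_modes(vad_flag, spmu_label):
    """Return (low-band mode, high-band mode) for one frame; labels are placeholders."""
    if vad_flag == 0 and spmu_label != "music":
        return ("comfort_noise", "comfort_noise")   # inactive frame, no speech or music
    if spmu_label == "speech":
        return ("acelp", "bwe")                     # time-domain coding favored for speech
    return ("transform", "transform")               # transform-domain coding favored for music/mixed

print(select_coding_modes(vad_flag=1, spmu_label="music"))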
The operation of a SPMU classifier in the EVS codec is described in more detail in Malenovsky, et al., “Two-Stage Speech/Music Classifier with Decision Smoothing and Sharpening in the EVS Codec,” 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015, Brisbane, Australia, 19-24 Apr. 2015. The operation of a SPMU classifier in a selectable mode vocoder (SMV) is described in more detail in Song, et al., “Analysis and Improvement of Speech/Music Classification for 3GPP2 SMV Based on GMM,” IEEE Signal Processing Letters, Vol. 15, 2008.
In the case that SPMU classifier 26 classifies input audio data as music content, the best quality audio encoding may be achieved using transform domain coding techniques. If, however, conventional noise suppression is applied to music signals of the audio data during pre-processing, distortions may be introduced to the music signals by the aggressive level of noise suppression. The distorted music signals may cause SPMU classifier 26 to misclassify the input audio data as speech content. Audio encoder 20 may then select a less than ideal encoding method for the input audio data, which will reduce the quality of the music signals at the output of audio decoder 30. Furthermore, even if SPMU classifier 26 is able to properly classify the input audio data as music content, the selected encoding method will encode distorted music signals, which will also reduce the quality of the music signals at the output of audio decoder 30.
This disclosure describes techniques for performing adaptive noise suppression to improve handling of both speech signals and music signals at least up to SWB bandwidths. In some examples, the adaptive noise suppression techniques may be used to change a level of noise suppression applied to audio data during a phone call based on changes to a context or environment in which the audio data is captured.
In the illustrated example of
The devices, apparatuses, systems and methods disclosed herein may be applied to a variety of computing devices. Examples of computing devices include mobile phones, cellular phones, smart phones, headphones, video cameras, audio players (e.g., Moving Picture Experts Group-1 (MPEG-1) or MPEG-2 Audio Layer 3 (MP3) players), video players, audio recorders, desktop computers/laptop computers, personal digital assistants (PDAs), gaming systems, etc. One kind of computing device is a communication device, which may communicate with another device. Examples of communication devices include mobile phones, laptop computers, desktop computers, cellular phones, smart phones, e-readers, tablet devices, gaming systems, etc.
A computing device or communication device may operate in accordance with certain industry standards, such as International Telecommunication Union (ITU) standards or Institute of Electrical and Electronics Engineers (IEEE) standards (e.g., Wireless Fidelity or “Wi-Fi” standards such as 802.11a, 802.11b, 802.11g, 802.11n or 802.11ac). Other examples of standards that a communication device may comply with include IEEE 802.16 (e.g., Worldwide Interoperability for Microwave Access or “WiMAX”), Third Generation Partnership Project (3GPP), 3GPP Long Term Evolution (LTE), Global System for Mobile Communications (GSM) and others (where a communication device may be referred to as a User Equipment (UE), NodeB, evolved NodeB (eNB), mobile device, mobile station, subscriber station, remote station, access terminal, mobile terminal, terminal, user terminal, subscriber unit, etc., for example). While some of the devices, apparatuses, systems and methods disclosed herein may be described in terms of one or more standards, this should not limit the scope of the disclosure, as the devices, apparatuses, systems and methods may be applicable to many systems and standards.
It should be noted that some communication devices may communicate wirelessly or may communicate using a wired connection or link. For example, some communication devices may communicate with other devices using an Ethernet protocol. The devices, apparatuses, systems and methods disclosed herein may be applied to communication devices that communicate wirelessly or that communicate using a wired connection or link.
The illustrated example of
In some examples, SS unit 45 may receive the audio data captured by dual microphones 18A, 18B prior to feeding the audio data to noise suppression unit 24. SS unit 45 comprises a sound separation unit that separates out speech from noise included in the input audio data, and places the speech (plus a little residual noise) in one channel and places the noise (plus a little residual speech) in the other channel. In a dual microphone system illustrated in
Within noise suppression unit 24, the captured audio data is transformed to the frequency domain using FFT 46. For example, FFT 46 may split the input audio data into multiple frequency bands for processing at each of the frequency bands. Each frequency band or bin of FFT 46 may include the noise spectrum in one of the channels in the frequency domain and the speech spectrum in another one of the channels.
Adaptive beamforming unit 52 is then used to spatially separate the speech signals and noise signals in the input audio data, and generate a speech reference signal and a noise reference signal from the input audio data captured by dual microphones 18A, 18B. Adaptive beamforming unit 52 includes spatial filtering to identify the direction of speech and filter out all noise coming from other spatial sectors. Adaptive beamforming unit 52 feeds the speech reference signal to gain application and smoothing unit 54. Noise reference generation unit 48 receives the transformed audio data and the separated noise signal from adaptive beamforming unit 52. Noise reference generation unit 48 may generate one or more noise reference signals for input to post processing gain unit 50.
Post processing gain unit 50 performs further processing of the noise reference signals over multiple frequency bands to compute a gain factor for the noise reference signals. Post processing gain unit 50 then feeds the computed gain factor to gain application and smoothing unit 54. In one example, gain application and smoothing unit 54 may subtract the noise reference signals from the speech reference signal with a certain gain and smoothing in order to suppress noise in the audio data. Gain application and smoothing unit 54 then feeds the noise-suppressed signal to IFFT 56. IFFT 56 may combine the audio data split among the frequency bands into a single output signal.
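One possible realization of the gain application and smoothing step is sketched below, assuming a spectral-subtraction style gain computed per frequency bin and smoothed across frames. The smoothing constant and gain floor are illustrative assumptions, not values taken from this disclosure.

import numpy as np

def smooth_and_apply(speech_spec, noise_spec, prev_gain, alpha=0.7, floor=0.05):
    """Spectral subtraction with per-bin gain smoothing for one frame."""
    speech_pow = np.abs(speech_spec) ** 2
    noise_pow = np.abs(noise_spec) ** 2
    # Subtract the noise reference energy from the speech reference energy per bin.
    raw_gain = np.maximum(speech_pow - noise_pow, 0.0) / np.maximum(speech_pow, 1e-12)
    gain = alpha * prev_gain + (1.0 - alpha) * raw_gain   # smooth the gain across frames
    gain = np.maximum(gain, floor)                         # floor the gain to limit artifacts
    return gain * speech_spec, gain                        # noise-suppressed spectrum, new gain state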
The gain factor computed by post processing gain unit 50 is one main factor, among other factors, that determines how aggressively the noise signal is subtracted at gain application and smoothing unit 54, and thus how aggressively noise suppression is applied to the input audio data. Gain application and smoothing unit 54 applies noise suppression to the input audio data on a per frame basis, e.g., typically every 5-40 milliseconds.
In some examples, post processing gain unit 50 may use more advanced SNR-based post processing schemes. In these examples, after comparing speech reference signal, X(n,f), and noise reference signal, N(n,f), energies within separate frequency bands, post processing gain unit 50 computes an SNR value, S(n,f), corresponding to each frequency band f during each frame n, e.g., as a ratio of the speech reference signal energy to the noise reference signal energy in that band.
Then, post processing gain unit 50 uses the SNR value, S(n,f), to compute a gain factor, G(n,f), that is applied to the speech reference signal by gain application and smoothing unit 54 to compute the noise-suppressed signal, Y(n,f), according to the following equation.
Y(n,f)=G(n,f)·X(n,f)
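A short sketch of an SNR-based post processing gain is shown below. It assumes S(n,f) is formed as the ratio of speech reference to noise reference band energies and that a Wiener-style mapping converts S(n,f) into G(n,f); the exact formulas used by post processing gain unit 50 may differ.

import numpy as np

def snr_post_gain(X_band, N_band):
    """Compute per-band gain G(n,f) and return Y(n,f) = G(n,f) * X(n,f)."""
    snr = (np.abs(X_band) ** 2) / np.maximum(np.abs(N_band) ** 2, 1e-12)  # S(n,f): band energy ratio
    gain = snr / (1.0 + snr)                                              # G(n,f): Wiener-like mapping
    return gain * X_band                                                  # Y(n,f): noise-suppressed band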
In the case where the input audio data is captured in a valid music context, if a low or small gain factor is applied to the speech reference signal in certain frequency bands, the music signal within the input audio data may be heavily distorted.
In the illustrated example of
In this way, audio pre-processor 22 of source device 12 may be configured to obtain an audio context of input audio data, prior to application of a variable level of noise suppression to the input audio data, wherein the input audio data includes speech signals, music signals, and noise signals, and apply the variable level of noise suppression to the input audio data prior to bandwidth compression of the input audio data with audio encoder 20 based on the audio context. In some cases, a first portion of the input audio data may be captured by microphone 18A, and a second portion of the input audio data may be captured by microphone 18B.
Proximity sensor 40 may be a hardware unit typically included within a mobile phone that identifies the position of the mobile phone relative to the user. Proximity sensor 40 may output a signal to control unit 44 indicating whether the mobile phone is positioned near the user's face or away from the user's face. In this way, proximity sensor 40 may aid control unit 44 in determining whether the mobile phone is oriented proximate to a mouth of the user or whether the device is oriented distally away from the mouth of the user. In some examples, when the mobile phone is rotated by a certain angle, e.g., the user is listening and not talking, the earpiece of the mobile phone may be near the user's face or ear but the front microphone may not be near the user's mouth. In this case, proximity sensor 40 may still determine that the mobile phone is oriented proximate to the user even though the mobile phone is further away from the user but positioned directly in front of the user.
For example, proximity sensor 40 may include one or more infrared (IR)-based proximity sensors to detect the presence of human skin when the mobile phone is placed near the user's face (e.g., right next to the user's cheek or ear for use as a traditional phone). Typically, mobile devices perform this proximity sensing for two purposes: to reduce display power consumption by turning off a display screen backlight, and to disable a touch screen to avoid inadvertent touches by the user's cheek. In this disclosure, proximity sensor 40 may be used for yet another purpose, i.e., to control the behavior of noise suppression unit 24. In this way, proximity sensor 40 may be configured to aid control unit 44 in determining an audio context of the input audio data.
SPMU classifier 42 may be a software module executed by audio pre-processor 22 of source device 12. In this way, SPMU classifier 42 is integrated into the one or more processors of source device 12. SPMU classifier 42 may output a signal to control unit 44 classifying the input audio data as one or both of speech content or music content. For example, SPMU classifier 42 may perform audio data classification based on one or more of linear discrimination, SNR-based metrics, or Gaussian mixture modelling (GMM). SPMU classifier 42 may be run in parallel to noise suppression unit 24 with no increase in delay.
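As an illustration of GMM-based classification, the following sketch trains one Gaussian mixture model per class on placeholder feature frames and labels a new frame by comparing log-likelihoods. It assumes scikit-learn and synthetic features; a deployed SPMU classifier would use tuned acoustic features (e.g., spectral flux or tonality measures) and decision smoothing across frames.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
speech_feats = rng.normal(0.0, 1.0, size=(200, 4))   # placeholder per-frame speech features
music_feats = rng.normal(1.5, 1.0, size=(200, 4))    # placeholder per-frame music features

gmm_speech = GaussianMixture(n_components=2, random_state=0).fit(speech_feats)
gmm_music = GaussianMixture(n_components=2, random_state=0).fit(music_feats)

def classify_frame(features):
    """Return 'speech' or 'music' by comparing model log-likelihoods for one frame."""
    feats = np.atleast_2d(features)
    return "speech" if gmm_speech.score(feats) > gmm_music.score(feats) else "music"

print(classify_frame(rng.normal(1.5, 1.0, size=4)))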
SPMU classifier 42 may be configured to provide at least two classification outputs of the input audio data. In some examples, SPMU classifier 42 may provide additional classification outputs based on a number of microphones used to capture the input audio data. In some cases, one of the at least two classification outputs is music, and another one of the at least two classification outputs is speech. According to the techniques of this disclosure, control unit 44 may control noise suppression unit 24 to adjust one gain value for the input audio data based on the one of the at least two classification outputs being music. Furthermore, control unit 44 may control noise suppression unit 24 to adjust one gain value based on the one of the at least two classification outputs being speech.
As illustrated in
In general, SPMU classifiers are typically included in audio encoders configured to operate according to the EVS codec, e.g., SPMU classifier 26 of audio encoder 20 from
In the example illustrated in
Control unit 44 may combine the signals from both proximity sensor 40 and SPMU classifier 42 with some hysteresis to determine a context of the input audio data as one of a valid speech context (i.e., the user intends to primarily transmit speech signals to engage in a conversation with a listener) or a valid music context (i.e., the user intends to primarily transmit music signals or both music and speech signals for a listener to experience). In this way, control unit 44 may differentiate between audio data captured with environmental, background, or ambient noise to be suppressed, and audio data captured in a valid music context in which the music signals should be retained and encoded to recreate the rich audio scene. Control unit 44 feeds the determined audio context to post processing gain unit 50 of noise suppression unit 24. In this way, control unit 44 may be integrated into the one or more processors of source device 12 and configured to determine the audio context of the input audio data when the one or more processors are configured to obtain the audio context of the input audio data.
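One possible way to combine the proximity output and the classifier output with hysteresis is sketched below, so that the determined context does not flip on single frames. The frame-count thresholds are illustrative assumptions.

class ContextTracker:
    def __init__(self, enter_music_frames=25, exit_music_frames=10):
        self.context = "speech"            # default to the valid speech context
        self.music_run = 0
        self.speech_run = 0
        self.enter_music_frames = enter_music_frames
        self.exit_music_frames = exit_music_frames

    def update(self, near_face, spmu_label):
        """Update the determined context for one frame and return it."""
        music_vote = (not near_face) and spmu_label == "music"
        self.music_run = self.music_run + 1 if music_vote else 0
        self.speech_run = self.speech_run + 1 if not music_vote else 0
        if self.context == "speech" and self.music_run >= self.enter_music_frames:
            self.context = "music"         # sustained music evidence: relax suppression
        elif self.context == "music" and self.speech_run >= self.exit_music_frames:
            self.context = "speech"        # fall back to aggressive suppression
        return self.context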
In some examples, the audio context determined by control unit 44 may act as an override of a default level of noise suppression, e.g., post processing gain, G(n,f), that is used to generate the noise-suppressed signal within noise suppression unit 24. For example, if a valid music context is identified by control unit 44, the post processing gain may be modified, among other changes within noise suppression unit 24, to set a less aggressive level of noise suppression in order to preserve SWB or FB music quality. One example technique is to modify the post processing gain, G(n,f), based on the identified audio context, according to the following equation.
Gmod(n,f)=G(n,f)·M(n)
In the above equation, M(n) is derived by control unit 44 and denotes a degree to which the input audio data can be considered to have a valid music context.
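The following sketch illustrates one possible mapping from a music-context confidence to M(n), chosen so that the modified gain Gmod(n,f) moves toward unity (i.e., no suppression) as the confidence increases. The specific mapping is an assumption for illustration only.

import numpy as np

def override_gain(gain, music_confidence):
    """Scale the post processing gain toward 1.0 as the music context strengthens."""
    # M(n) interpolates between 1 (speech context) and 1/G (full music context).
    m = 1.0 + music_confidence * (1.0 / np.maximum(gain, 1e-3) - 1.0)
    return np.minimum(gain * m, 1.0)   # Gmod(n,f) = G(n,f) * M(n), clamped to unity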
In the example noise suppression configuration of
In the case where control unit 44 determines that the input audio data was captured in a valid music context, e.g., a music signal is detected in primary microphone 18A and the mobile phone is away from the user's face, noise suppression unit 24 may temporarily set a less aggressive level of noise suppression to allow music signals of the audio data to pass through noise suppression unit 24 with minimal distortion. Noise suppression unit 24 may then fall back to a default, aggressive level of noise suppression when control unit 44 again determines that the input audio data has a valid speech context, e.g., a speech signal is detected in primary microphone 18A or the mobile phone is proximate to the user's face.
In some examples, noise suppression unit 24 may store a set of default noise suppression parameters for the aggressive level of noise suppression, and other sets of noise suppression parameters for one or more less aggressive levels of noise suppression. In some examples, the default aggressive level of noise suppression may be overridden for a limited period of time based on user input. This example is described in more detail with respect to
In this way, gain application and smoothing unit 54 may be configured to attenuate the input audio data by one level when the audio context of the input audio data is music and attenuate the input audio data by a different level when the audio context of the input audio data is speech. In one example, a first level of attenuation of the input audio data when the audio context of the input audio data is speech in a first audio frame may be within fifteen percent of a second level of attenuation of the input audio data when the audio context of the input audio data is music in a second audio frame. In this example, the first frame may be within fifty audio frames before or after the second audio frame. In some cases, noise suppression unit 24 may be referred to as a noise suppressor, and gain application and smoothing unit 54 may be referred to as a gain adjuster within the noise suppressor.
In a first example use case, a user of the mobile phone may be talking during a phone call in an environment with loud noise and music (e.g., a noisy bar, a party, or on the street). In this case, proximity sensor 40 detects that the mobile phone is positioned near the user's face, and SPMU classifier 42 determines that the input audio data from primary microphone 18A includes high speech content with a high level of noise and music content, and that the input audio data from a secondary microphone 18B has a high level of noise and music content and possibly some speech content similar to babble noise. In this case, control unit 44 may determine that the context of the input audio data is the valid speech context, and control noise suppression unit 24 to set an aggressive level of noise suppression for application to the input audio data.
In a second example use case, a user of the mobile phone may be listening during a phone call in an environment with loud noise and music. In this case, proximity sensor 40 detects that the mobile phone is positioned near the user's face, and SPMU classifier 42 determines that the input audio data from primary microphone 18A includes high noise and music content with no speech content, and that the input audio data from secondary microphone 18B includes similar content. In this case, even though the input audio data includes no speech content, control unit 44 may use the proximity of the mobile device to the user's face to determine that the context of the input audio data is the valid speech context, and control noise suppression unit 24 to set an aggressive level of noise suppression for application to the input audio data.
In a third example use case, a user may be holding the mobile phone up in the air or away from the user's face in an environment with music and little or no noise (e.g., to capture someone singing or playing an instrument in a home setting or concert hall). In this case, proximity sensor 40 detects that the mobile phone is positioned away from the user's face, and SPMU classifier 42 determines that the input audio data from primary microphone 18A includes high music content and that the input audio data from secondary microphone 18B also includes some music content. In this case, based on the absence of background noise, control unit 44 may determine that the context of the input audio data is the valid music context, and control noise suppression unit 24 to set a low level of noise suppression or no noise suppression for application to the input audio data.
In a fourth example use case, a user may be holding the mobile phone up in the air or away from the user's face in an environment with loud noise and music (e.g., to capture music played in a noisy bar, a party, or an outdoor concert). In this case, proximity sensor 40 detects that the mobile phone is positioned away from the user's face, and SPMU classifier 42 determines that the input audio data from primary microphone 18A includes a high level of noise and music content and that the input audio data from secondary microphone 18B includes similar content. In this case, even though background noise is present, control unit 44 may use the absence of speech content in the input audio data and the position of the mobile device away from the user's face to determine that the context of the input audio data is the valid music context, and control noise suppression unit 24 to set a low level of noise suppression or no noise suppression for application to the input audio data.
In a fifth example use case, a user may be recording someone singing along to music in an environment with little or no noise (e.g., to capture singing and Karaoke music in a home or private booth setting). In this case, proximity sensor 40 detects that the mobile phone is positioned away from the user's face, and SPMU classifier 42 determines that the input audio data from primary microphone 18A includes high music content and that the input audio data from secondary microphone 18B includes some music content. In this case, control unit 44 may determine that the context of the input audio data is the valid music context, and control noise suppression unit 24 to set a low level of noise suppression or no noise suppression for application to the input audio data. In some examples, described in more detail with respect to
In a sixth example use case, a user may be recording someone singing along to music in an environment with loud noise (e.g., to capture singing and Karaoke music in a party or bar setting). In this case, proximity sensor 40 detects that the mobile phone is positioned away from the user's face, and SPMU classifier 42 determines that the input audio data from primary microphone 18A includes high noise and music content and that the input audio data from secondary microphone 18B includes similar content. In this case, even though background noise is present, control unit 44 may use a combination of multiple indicators, such as the absence of speech content in the input audio data, the position of the mobile device away from the user's face, control signals given by a Karaoke machine, or control signals given by a wearable device worn by the user, to determine that the context of the input audio data is the valid music context, and control the noise suppression unit 24 to set a low level of noise suppression or no noise suppression for application to the input audio data.
In general, according to the techniques of this disclosure, when control unit 44 determines that the context of the input audio data is a valid music context, a level of noise suppression is applied to the input audio data that is more favorable to retaining the quality of music signals included in the input audio data. Conversely, when control unit 44 determines that the context of the input audio data is a valid speech context, a default, aggressive level of noise suppression is applied to the input audio data in order to highly suppress background noise (including music).
As one example, different levels of noise suppression in dB may be mapped as follows: an aggressive or high level of noise suppression may be greater than approximately 15 dB, a mid-level of noise suppression may range from approximately 10 dB to approximately 15 dB, and a low-level of noise suppression may range from no noise suppression (i.e., 0 dB) to approximately 10 dB. It should be noted that the provided values are merely examples and should not be construed as limiting.
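A sketch of mapping the determined audio context to a suppression depth consistent with the example ranges above, and converting that depth to a linear gain, is shown below. The specific per-context values are assumptions for illustration.

def suppression_for_context(context):
    """Return (suppression_db, linear_gain) to apply in noise-dominated bands."""
    depth_db = {"speech": 18.0,         # aggressive or high level: greater than ~15 dB
                "mixed": 12.0,          # mid-level: ~10 dB to ~15 dB
                "music": 3.0}[context]  # low-level: 0 dB to ~10 dB
    return depth_db, 10.0 ** (-depth_db / 20.0)

print(suppression_for_context("music"))   # (3.0, ~0.708)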
As illustrated in
Based on the detected user override signal, control unit 66 may determine that the audio data currently captured by microphones 18 has a valid music context and control noise suppression unit 24 to set a lower level of noise suppression for the audio data. In some examples, the override setting may be set to expire automatically within a predetermined period of time such that noise suppression unit 24 returns to the default level of noise suppression, i.e., an aggressive level of noise suppression. Without this override timeout, the user may neglect to disable or unselect the override setting. In this case, noise suppression unit 24 may continue to apply the less aggressive noise suppression, or no noise suppression, to all received audio signals, which may result in degraded or low quality speech signals when captured in a noisy environment.
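One possible implementation of a user override that expires automatically is sketched below. The timeout value and the use of time.monotonic() are illustrative assumptions rather than details of the disclosure.

import time

class NoiseSuppressionOverride:
    def __init__(self, timeout_seconds=300.0):
        self.timeout_seconds = timeout_seconds
        self.expires_at = None

    def activate(self):
        """Called when the user selects the override setting in the UI."""
        self.expires_at = time.monotonic() + self.timeout_seconds

    def level(self):
        """Return the noise suppression level to use for the current frame."""
        if self.expires_at is not None and time.monotonic() < self.expires_at:
            return "low"           # override active: preserve music signals
        self.expires_at = None     # override expired: return to the default
        return "aggressive"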
Karaoke machine signal detector 62 may detect a signal from an external Karaoke machine in communication with source device 12. The detected signal may indicate that the Karaoke machine is playing music while microphones 18 of source device 12 are recording vocal singing by a user. The signal detected by Karaoke machine signal detector 62 may be used to override a default level of noise suppression, i.e., an aggressive level of noise suppression. Based on the detected Karaoke machine signal, control unit 66 may determine that the audio data currently captured by microphones 18 has a valid music context and control noise suppression unit 24 to set a lower level of noise suppression for the audio data to avoid music distortion while source device 12 is used to record the user's vocal singing.
Karaoke is a common example of a valid music context in which music played by a Karaoke machine and vocal singing by a user both need to be recorded for later playback or transmission to a receiver end device, e.g., destination device 14 from
In addition, sensor signal detector 64 may detect signals from one or more external sensors, such as a wearable device, in communication with source device 12. As an example, the wearable device may be a device worn by a user on his or her body, such as a smart watch, a smart necklace, a fitness tracker, etc., and the detected signal may indicate that the user is dancing. Based on the detected sensor signal along with input from one or both of proximity sensor 40 and SPMU classifier 42, control unit 66 may determine that the audio data currently captured by microphones 18 has a valid music context and control noise suppression unit 24 to set a lower level of noise suppression for the audio data. In other examples, sensor signal detector 64 may detect signals from other external sensors or control unit 66 may receive input from additional detectors to further improve the audio context determination performed by control unit 66.
According to the disclosed techniques, an operation used in voice and data communications comprises obtaining an audio context of input audio data, during a conversation between a user of a source device and a user of a destination device, wherein music is playing in a background of the user of the source device, prior to application of a variable level of noise suppression to the input audio data from the user of the source device, and wherein the input audio data includes a voice of the user of the source device and the music playing in the background of the user of the source device; applying a variable level of noise suppression to the input audio data prior to bandwidth compression of the input audio data with an audio encoder based on the audio context including the audio context being speech or music, or both speech and music; bandwidth compressing the input audio data to generate at least one audio encoder packet; and transmitting the at least one audio encoder packet over the air from the source device to the destination device. The individual steps of the operation used in voice and data communications are described in more detail below.
Audio pre-processor 22 receives audio data including speech signals, music signals, and noise signals from microphones 18 (70). As described above, microphones 18 may include dual microphones with a primary microphone 18A being a “front” microphone positioned on a front side of the mobile phone near a user's mouth and secondary microphone 18B being a “back” microphone positioned at a back side of the mobile phone.
SPMU classifier 42 of audio pre-processor 22 classifies the received audio data as speech content, music content, or both speech and music content (72). As described above, SPMU classifier 42 may perform signal classification based on one or more of linear discrimination, SNR-based metrics, or Gaussian mixture modelling (GMM). For example, SPMU classifier 42 may classify the audio data captured by primary microphone 18A as speech content, music content, or both speech and music content, and feed the audio data classification for primary microphone 18A to control unit 44. In addition, SPMU classifier 42 may also classify the audio data captured by secondary microphone 18B as speech content, music content, or both speech and music content, and feed the audio data classification for secondary microphone 18B to control unit 44.
Proximity sensor 40 detects a position of the mobile phone with respect to a user of the mobile phone (74). As described above, proximity sensor 40 may detect whether the mobile phone is being held near the user's face or being held away from the user's face. Conventionally, proximity sensor 40 within the mobile device may typically be used to determine when to disable a touch screen of the mobile device to avoid inadvertent activation by a user's cheek during use as a traditional phone. According to the techniques of this disclosure, proximity sensor 40 may detect whether the mobile phone is being held near the user's face to capture the user's speech during use as a traditional phone, or whether the mobile phone is being held away from the user's face to capture music or speech from multiple people during use as a speaker phone.
Control unit 44 of audio pre-processor 22 determines the context of the audio data as either a valid speech context or a valid music context based on the classified audio data and the position of the mobile phone (76). In general, the type of content that is captured by primary microphone 18A and the position of the mobile phone may indicate whether the user intends to primarily transmit speech signals or music signals to a listener at a receiver side device, e.g., destination device 14 from
In this way, audio pre-processor 22 obtains the audio context of the input audio data during a conversation between the user of source device 12 and a user of destination device 14, where music is playing in a background of the user of source device 12. Audio pre-processor 22 obtains the audio context prior to application of a variable level of noise suppression to the input audio data from the user of source device 12. The input audio data includes both a voice of the user of source device 12 and the music playing in the background of the user of source device 12. In some cases, the music playing in the background of the user of source device 12 comes from a karaoke machine.
In some examples, audio pre-processor 22 obtains the audio context of the input audio data based on SPMU classifier 42 classifying the input audio data as speech, music, or both speech and music. SPMU classifier 42 may classify the input audio data as music at least eighty percent of the time that music is present with speech. In other examples, audio pre-processor 22 obtains the audio context of the input audio data based on proximity sensor 40 determining whether source device 12 is proximate to or distally away from a mouth of the user of source device 12 based on a position of the source device. In one example, audio pre-processor 22 obtains the audio context based on the user of source device 12 wearing a smart watch or other wearable device.
Control unit 44 feeds the determined audio context of the captured audio data to noise suppression unit 24 of audio pre-processor 22. Noise suppression unit 24 then sets a level of noise suppression for the captured audio data based on the determined audio context of the audio data (78). As described above, noise suppression unit 24 may set the level of noise suppression for the captured audio data by modifying a gain value based on the determined context of the audio data. More specifically, noise suppression unit 24 may increase a post processing gain value based on the context of the audio data being the valid music context in order to reduce the level of noise suppression for the audio data.
In the case that the context of the audio data is the valid speech context, noise suppression unit 24 may set a first level of noise suppression that is relatively aggressive in order to suppress noise signals (including music signals) and clean-up speech signals in the audio data. In the case that the context of the audio data is the valid music context, noise suppression unit 24 may set a second level of noise suppression that is less aggressive to leave music signals undistorted in the audio data. In the above example, the second level of noise suppression is lower than the first level of noise suppression. For example, the second level of noise suppression may be at least 50 percent lower than the first level of noise suppression. More specifically, in some examples, an aggressive or high level of noise suppression may be greater than approximately 15 dB, a mid-level of noise suppression may range from approximately 10 dB to approximately 15 dB, and a low-level of noise suppression may range from no noise suppression (i.e., 0 dB) to approximately 10 dB.
Noise suppression unit 24 then applies the level of noise suppression to the audio data prior to sending the audio data to an EVS vocoder for bandwidth compression or encoding (80). For example, audio encoder 20 from
In this way, audio pre-processor 22 applies a variable level of noise suppression to the input audio data prior to bandwidth compression of the input audio data by audio encoder 20 based on the audio context including the audio context being speech or music, or both speech and music. Audio encoder 20 then bandwidth compresses the input audio data to generate at least one audio encoder packet, and source device 12 transmits the at least one audio encoder packet over the air to destination device 14.
In some examples, audio pre-processor 22 adjusts a noise suppression gain so that there is one attenuation level of the input audio data when the audio context of the input audio data is music and there is a different attenuation level of the input audio data when the audio context of the input audio data is speech. In one case, the one attenuation level and the different attenuation level both have the same value. In that case, the music playing in the background of the user of source device 12 passes through noise suppression unit 24 at the same attenuation level as the voice of the user of source device 12.
A first level of attenuation of the input audio data may be applied when the user of source device 12 is talking at least 3 dB louder than the music playing in the background of the user of source device 12, and a second level of attenuation of the input audio data may be applied when the music playing in the background of the user of source device 12 is at least 3 dB louder than the talking of the user of source device 12. The bandwidth compression of the input audio data of the voice of the user of source device 12 and the music playing in the background of the user of source device 12 at the same time may provide at least 30% less distortion of the music playing in the background as compared to bandwidth compression of the input audio data of the voice of the user of source device 12 and the music playing in the background of the user of source device 12 at the same time without obtaining the audio context of the input audio data prior to application of noise suppression to the input audio data.
Any use of the term “and/or” throughout this disclosure should be understood to refer to either one or both. In other words, A and/or B should be understood to provide for either (A and B) or (A or B).
In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, computer-readable media generally may correspond to (1) tangible computer-readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term “processor,” as used herein, may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless communication device, a wireless handset, a mobile phone, an integrated circuit (IC) or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software or firmware.
Various embodiments of the invention have been described. These and other embodiments are within the scope of the following claims.