The same numbers are used throughout the disclosure and figures to reference like components and features.
The following document describes tools capable of enabling and/or generating comfort noise for voice communications over a network. The tools may adapt to changes in a speaker's background noise effective to generate comfort noise that also adapts to these changes. The tools may do so at significant bandwidth savings over some other techniques.
An environment in which the tools may enable these and other techniques is set forth first below in a section entitled Exemplary Operating Environment. This section is followed by another section describing exemplary manners in which elements of the exemplary operating environment may build and adapt a noise history, entitled Building and Adapting an Exemplary Noise History. Another section follows, which describes exemplary manners in which elements of the exemplary operating environment may use this history to generate comfort noise, entitled Adaptively Generating Comfort Noise. A final section, entitled Additional Embodiments, sets forth various ways in which the tools may act to enable and generate comfort noise.
Before describing the tools in detail, the following discussion of an exemplary operating environment is provided to assist the reader in understanding some ways in which various inventive aspects of the tools may be employed. The environment described below constitutes but one example and is not intended to limit application of the tools to any one particular operating environment. Other environments may be used without departing from the spirit and scope of the claimed subject matter.
The environment also has a communications network 114, such as a company intranet or a global internet (e.g., the Internet). The participants' devices may be capable of communicating directly with the network (e.g., a wireless-Internet enabled laptop, PDA, or a Tablet PC, or a desktop computing device or VoIP-enabled telephone or cellular phone wired or wirelessly connected to the Internet) or indirectly (e.g., the telephone connected to the phone-to-network device). The conversation or conference may be enabled through a distributed or central network topology (or a combination of these). Exemplary distributed and central network topologies are illustrated as part of an example described below.
The communication network and/or any of these devices, including the phone-to-network device, may be a computing device having one or more processor(s) 116 and computer-readable media 118 (each device marked with “◯” to indicate this possibility). The computer-readable media comprises a voice handler 120 having one or more of a voice activity detector 122, an encoder 124, a decoder 126, an adaptive history module 128, a noise history 130, and a comfort noise generator 132. The noise history may comprise or have access to a frequency template 134 and an excitation template 136.
The processor(s) are capable of accessing and/or executing the computer-readable media. The voice handler is capable of sending and receiving audio communications over a network, e.g., according to a Voice-over-Internet Protocol (VoIP). The voice handler is shown as one cohesive unit with the mentioned discrete elements 122-136, though portions of it may be disparately placed, such as some elements residing in network 114 and some residing in one of the other devices.
Each of the participants may contribute and receive audio signals. The voice activity detector is capable of determining whether contributed audio is likely a participant's speech or not. Thus, if participant A (“Albert”) stops speaking, the voice activity detector executing on Albert's communication device may determine that the audio signal just received from Albert comprises background noise and not speech. It may do so, for instance, by measuring the intensity and duration of the audio signal.
The encoder converts the audio signal from an analog format to a digital format and into packets suitable for communication over the network (each typically with a time-stamp). The decoder converts packets of audio received over the network from the encoder into an analog signal suitable for rendering to a listening participant. The decoder may also analyze packets as they are received to provide information about the energy and frequency of the payload (e.g., a frame of audio contained in a packet).
The adaptive history module is capable of building and adapting noise history 130 based on information about background noise in audio received from one or more speaking participants. In some cases the information includes frequency and excitation information for a participant's background noise. In these cases the history module is capable of building the noise history to include frequency template 134 and excitation template 136 for that participant. The noise history may be used by the comfort noise generator to generate comfort noise that adapts to changes in a speaker's background noise. Many of the elements of the operating environment are mentioned and further described as part of the description below.
The following discussion describes exemplary ways in which the tools may build and adapt a noise history for later use in generating comfort noise. This discussion uses elements of operating environment 100 of
For this example assume that participant A of
Albert's communication device 102 receives this audio signal having speech and background noise. As shown in
Albert's device is shown with its own voice handler marked as 120a rather than 120 to show that it is associated with Albert. For simplicity, Albert's voice handler 120a is shown only with voice activity detector 122 and encoder 124. Calvin's device is shown with Calvin's voice handler 120c having only (again for simplicity) decoder 126, adaptive history module 128, noise history 130, and comfort noise generator 132. This ongoing example and the tools in general may use either a network having a distributed topology, centralized topology, or a combination of both (combination not shown).
In any of these topologies, Albert's communication device receives his audio signal in analog form, namely “Calvin . . . how are you? . . . ”. Albert's device's voice handler receives the audio in analog form, converts it into a digital form (e.g., with a voice card), and determines which parts of the signal are speech and which are background noise. Here the voice activity detector determines that the signal comprises the four portions shown in
Note, however, that a talk-and-noise portion may include background noise segments that are not at the end of the talk-spurt. For example, if Albert paused for ¼ second between “how” and “are you”, the pause would likely be considered background noise. The voice handler may send a talk-and-noise portion having just this ¼ second of background noise with or without any background noise following “are you”. If the voice handler does so, the segment of background noise surrounded by speech in a talk-and-noise portion may be used by the tools similarly to the background noise received after a talk-spurt, including to adapt a noise history.
Calvin's device receives packets A through P at decoder 126, shown at action 1. These packets are received from the network and include digital data for both talk-and-noise portions of
The decoder receives packets for the talk-and-noise portions at which time it strips the data from each packet to provide data frames. Assume, for simplicity, that the decoder receives packets A, B, C, D, E, and F in turn. Packets A-D represent part of the talk-spurt portion of the first talk-and-noise portion (from when Albert said: “Calvin”). Packets E and F represent background noise in the segment following the talk-spurt. On receiving each of these packets, the decoder provides frames for each, shown at action 2. Also on receiving each packet, the decoder determines an excitation signal (X) and Linear Spectral Parameters (LSP) for each frame (Xi and LSPi for each frame, with “i” being the frame at issue).
The excitation signal and LSP of a frame are used by the adaptive history module when the energy of that frame is consistent with background noise rather than speech. The adaptive history module receives each frame at action 2, with which it determines each frame's energy (Ei) at action 5. At action 6, the module uses the frame's energy, whether background noise or speech, to better assess in the future what is speech and what is background. Here the module uses a frame's energy to train a background noise level, represented by Ebg. The module may train the Ebg to represent a running average of minimum-energy frames.
At action 7 the adaptive history module determines whether the frame at issue (here frames A-F in turn) is background noise. The module does so by subtracting the background noise level (Ebg) from the energy of the current frame (Ei); if the difference is less than a threshold energy, the module determines that this frame is background noise. This threshold may be predetermined or adaptive based on energy information. Here the threshold is a predetermined constant expressed in dB (decibels). If the frame is determined not to be background noise, the adaptive history module proceeds to analyze the next frame's energy at action 8. If the frame is determined to be background noise and not speech (the “Yes” arrow), the module proceeds to action 9.
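Actions 5 through 7 can be illustrated with a minimal Python sketch. It is not the module's actual implementation: the names `frame_energy_db` and `BackgroundTracker`, the 6 dB threshold, and the two smoothing weights are assumptions chosen for illustration, and the asymmetric averaging is only one plausible way to approximate a running average of minimum-energy frames.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame of samples, in dB (action 5)."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)  # small floor avoids log(0)


class BackgroundTracker:
    """Trains a background level Ebg and classifies frames against it."""

    def __init__(self, threshold_db=6.0, up_weight=0.99, down_weight=0.5):
        # All three defaults are hypothetical values, not taken from the text.
        self.threshold_db = threshold_db
        self.up_weight = up_weight      # slow rise: speech barely lifts Ebg
        self.down_weight = down_weight  # fast fall: quiet frames pull Ebg down
        self.ebg = None

    def update(self, ei):
        """Train Ebg on energy Ei (action 6), then classify (action 7)."""
        if self.ebg is None:
            self.ebg = ei  # first frame seeds the level
        elif ei < self.ebg:
            self.ebg = self.down_weight * self.ebg + (1 - self.down_weight) * ei
        else:
            self.ebg = self.up_weight * self.ebg + (1 - self.up_weight) * ei
        # Background noise if Ei - Ebg is below the threshold.
        return (ei - self.ebg) < self.threshold_db


tracker = BackgroundTracker()
quiet = frame_energy_db([0.01] * 160)  # low-energy frame
loud = frame_energy_db([0.5] * 160)    # high-energy frame, stands in for speech
decisions = [tracker.update(e) for e in (quiet, loud, quiet)]
```

One limitation of this sketch is that the first frame always seeds Ebg, so a communication that begins with speech would need several quiet frames before the level settles.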
At action 9 the module builds and/or adapts noise history 130 of
For Albert's talk-spurt of “Calvin”, which was received by Calvin's communication device with packets A, B, C, and D, the adaptive history module determines that none of the frames for these packets contain just background noise. Thus, for time T=0 through T=1 in
For the segment of background noise after the talk-spurt of “Calvin”, which was received by Calvin's communication device with packets E and F, the adaptive history module determines that both frames for these packets contain background noise and not speech. Thus, for times T=1 to T=1.5 in
Here the decoded excitation signals X(E) (for the frame of packet E) and X(F) (for the frame of packet F) are used to update the excitation template ET. These excitation signals X(E) and X(F) are noise vectors representing an average energy of the signal in their respective frames E and F. The adaptive history module updates the excitation template based on each of these vectors.
The module updates the excitation template ET according to the following formula:
ET(j)=α·ET(j)+(1−α)·|X(j)|
where j=1, . . . , N and N is the frame length, α is a training weight (e.g., 0.9 or 0.99), and X is the current excitation signal.
Thus, for the frame of packet E, assuming it is the first frame of background noise and the training weight is 0.9, the excitation template is:
ET(E)=0.9·0+(1−0.9)·|X(E)|=0.1|X(E)|
For frame F, the starting excitation template would be 0.1|X(E)| resulting in an adapted excitation template based on frame F of:
ET(F)=0.9·0.1|X(E)|+(1−0.9)·|X(F)|
ET(F)=0.09|X(E)|+0.1|X(F)|
At first it may seem that the value of the excitation template should be larger. With the large number of packets typically received in a segment of background noise, however, the module may quickly adapt the excitation template to a value that closely approximates the background noise's excitation. Also, for the first frame used (here E), the adaptive history module may set the training weight to a smaller value (giving that frame a larger effect). If the training weight were set to 0 for the first frame, for example, the template after frame E would be |X(E)| and the excitation template following adaptation based on frame F would be:
ET(F)=0.9|X(E)|+0.1|X(F)|
If the excitation of E and F were about equal, then the excitation template would be:
ET(F)≈|X(F)|
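The excitation-template update and the two worked cases above can be checked with a short Python sketch. The function name `update_excitation_template` and the four-sample vectors standing in for X(E) and X(F) are illustrative assumptions. Real frames would have N samples.

```python
def update_excitation_template(template, excitation, alpha=0.9):
    """Return alpha*ET(j) + (1 - alpha)*|X(j)| for each sample index j."""
    if len(template) != len(excitation):
        raise ValueError("template and excitation must have the same length")
    return [alpha * t + (1.0 - alpha) * abs(x)
            for t, x in zip(template, excitation)]


# Made-up excitation vectors standing in for X(E) and X(F).
x_e = [0.4, -0.2, 0.1, -0.3]
x_f = [-0.5, 0.1, 0.2, 0.3]

# Case 1: template starts at zero and alpha is 0.9 for both frames,
# giving 0.09|X(E)| + 0.1|X(F)|.
et = [0.0] * len(x_e)
et = update_excitation_template(et, x_e)
et_case1 = update_excitation_template(et, x_f)

# Case 2: alpha is 0 for the first frame, so the template jumps to |X(E)|,
# giving 0.9|X(E)| + 0.1|X(F)| after frame F.
et = [0.0] * len(x_e)
et = update_excitation_template(et, x_e, alpha=0.0)
et_case2 = update_excitation_template(et, x_f)
```

Because the update stores absolute values, the template keeps only magnitudes. Signs are reintroduced later when the comfort noise generator randomizes them.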
The adaptive history module also updates the noise history's frequency template. Here Linear Spectral Parameters (LSP) for frames from packets E and F, namely L(E) and L(F), are used to update the frequency template LT. These LSPs represent linear prediction filters for their frames E and F. The adaptive history module updates the frequency template based on each of these LSPs.
Here the module first updates the frequency template LT according to the following formula:
LT(j)=β·LT(j)+(1−β)·L(j)
where j=1, . . . , M and M is the order of the linear prediction filter (e.g., 10 or 16), β is a training weight (e.g., 0.9 or 0.99), and L is the current LSP. Initially (e.g., at receipt of the first packet) the adaptive history module may initialize the template with the very first received packet's LSP or with a uniformly spaced LSP. A uniformly spaced LSP generates a flat spectrum in the frequency domain. Here we assume that the initial LSP used is the LSP of frame E. Thus, for the frame of packet E, assuming a training weight of 0.9, the frequency template is:
LT(E)=0.9·L(E)+(1−0.9)·L(E)=1.0L(E)
For frame F, the starting frequency template would be 1.0 L(E) resulting in an adapted frequency template based on frame F of:
LT(F)=0.9·1.0L(E)+(1−0.9)·L(F)
LT(F)=0.9L(E)+0.1L(F)
Similarly to the excitation template above, the module may quickly adapt the frequency template to a value that closely approximates the background noise's spectral shape. Again, for the first frames used, the adaptive history module may set the training weight to a smaller value (giving those frames a larger effect). If the training weight were set to 0.2 for the first frame (E) and 0.3 for the second (F), increasing by 0.1 per frame until it reaches 0.9, for example, the frequency template following adaptation based on frames E and F would be:
LT(E)=0.2·L(E)+(1−0.2)·L(E)=1.0L(E)
LT(F)=0.3·1.0L(E)+(1−0.3)·L(F)=0.3L(E)+0.7L(F)
If the LSPs of E and F were about equal, then the frequency template would be:
LT(F)≈1.0L(F)
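The frequency-template update with a ramped training weight might be sketched in Python as follows. The names `update_frequency_template` and `ramped_weights`, and the order-4 LSP vectors standing in for L(E) and L(F), are assumptions made for illustration. The text gives orders such as 10 or 16.

```python
def update_frequency_template(template, lsp, beta=0.9):
    """Return beta*LT(j) + (1 - beta)*L(j) for each coefficient index j."""
    if len(template) != len(lsp):
        raise ValueError("template and LSP must have the same length")
    return [beta * t + (1.0 - beta) * l for t, l in zip(template, lsp)]


def ramped_weights(start=0.2, step=0.1, stop=0.9):
    """Yield training weights start, start+step, ..., then hold at stop."""
    k = 0
    while True:
        yield min(start + k * step, stop)
        k += 1


# Made-up order-4 LSP vectors standing in for L(E) and L(F).
l_e = [0.3, 0.8, 1.5, 2.4]
l_f = [0.4, 0.9, 1.4, 2.5]

weights = ramped_weights()
lt = list(l_e)  # initialize the template with the first frame's LSP
lt = update_frequency_template(lt, l_e, beta=next(weights))  # 0.2: stays L(E)
lt = update_frequency_template(lt, l_f, beta=next(weights))  # 0.3L(E) + 0.7L(F)
```

Each update is a convex combination of two LSP vectors, so if both inputs are in ascending order the adapted template is also in ascending order.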
In practice the segment of background noise sent with the talk-spurt in the talk-and-noise portion 502 often has enough packets that the excitation template and the frequency template are each a weighted average of these parameters for the noise received, with more-recently received noise having greater weight.
At some point, however, the decoder does not receive additional packets for the ongoing communication; here there is a lull after packet F is received. This lull may be determined analytically or be indicated in a packet (e.g., in packet F that F is the last packet). Responsive to this lull, the tools generate comfort noise to fill in noise after packet F is received and rendered to the listener (e.g., Calvin). An overview of these actions of the tools is set forth in
At block 702, the voice handler determines if it has received packets for Albert's audio signal. If packets are being received and are of an appropriate time-stamp (e.g., not for audio to be rendered later for a future-rendered talk-spurt), the process continues along the “Yes” path to block 704.
At block 704 the voice handler outputs samples of the frames for the packets effective to enable a participant to hear the actual audio received in the packets. Here the loudspeakers on Calvin's communication device (his telephone) act responsive to a signal from his phone-to-network device 108 to broadcast the signal for talk-and-noise portion 502 (“Calvin” with a segment of background noise) based on the output samples. Thus, Calvin hears Albert say: “Calvin” and some actual background noise.
If, however, packets are not received of an appropriate time-stamp, the voice handler proceeds to block 706. At block 706, comfort noise generator 132 of
The voice handler outputs samples for rendering the comfort noise to a participant at block 708. Here again, Calvin's telephone acts responsive to a signal from his phone-to-network device to broadcast sounds, only here the sounds are comfort noise.
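The decision made across blocks 702 through 708 can be summarized in a brief Python sketch. The `Packet` type, its integer time-stamp unit, and the `next_output` function are hypothetical. Decoding is omitted by storing already-decoded samples in the packet.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Packet:
    timestamp: int      # playout time in frame periods (hypothetical unit)
    frame: List[float]  # samples, shown pre-decoded to keep the sketch short


def next_output(packet: Optional[Packet], now: int,
                comfort_noise: Callable[[], List[float]]) -> List[float]:
    """Return the samples to render for the current frame period."""
    # Block 702: is there a packet whose time-stamp is due for rendering now?
    if packet is not None and packet.timestamp <= now:
        return packet.frame   # block 704: render the actual audio
    return comfort_noise()    # blocks 706 and 708: fill the lull


silence_filler = lambda: [0.01, -0.01]  # stand-in for the comfort noise generator
```

A packet time-stamped for later rendering is treated the same as no packet, so comfort noise fills the gap until that packet's audio is due.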
With the overview of process 700 set out, the discussion turns to exemplary and more-detailed ways in which the comfort noise generator generates comfort noise shown in overview with block 706 above.
At action 10 in
At action 11, the generator randomizes the order of the excitation template. At action 12, the generator randomizes the signs of the excitation template as well. By randomizing the order and sign but not the absolute values of the amplitude of the excitation template, the energy of the excitation vector is constant or nearly constant. Thus, the comfort noise generated can be of constant energy (i.e., volume). Comfort noise of a constant volume may be pleasing and non-disruptive to listeners. The randomizations of actions 11 and 12 may be described mathematically as:
The output of actions 11 and 12 is a randomized noise excitation. Optionally at arrow 13, however, the generator may reduce the amplitude of excitation (e.g., progressively over time). Thus, at the first comfort noise sample the excitation may be nearly equal to the randomized noise excitation produced by actions 11 and 12. Over the next ¼ second, ½ second, or more, the generator may gradually reduce the energy of the randomized noise excitation. In some cases listeners prefer that comfort noise progressively get quieter, though often at a rate that is not immediately noticeable. If Albert is talking on a cell phone in heavy traffic, for instance, the background noise could be annoying for Calvin. For example, the generator may start the comfort noise at about the same excitation (volume) as the actual noise and then reduce it by about ¼ over the first five seconds and by another ¼ over the next five seconds, until the high-volume background noise is noticeable but not annoying.
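A minimal Python sketch of actions 11 through 13 follows. The function names, the linear decay schedule, and the floor of one half are assumptions. The schedule is only one reading of the example of reducing the noise by about ¼ in each of two five-second intervals, and it is applied here to amplitude.

```python
import random


def randomized_excitation(template, rng=None, gain=1.0):
    """Shuffle the template's order and randomize signs, keeping magnitudes."""
    rng = rng or random.Random()
    shuffled = list(template)
    rng.shuffle(shuffled)                                  # action 11: order
    signed = [rng.choice((-1.0, 1.0)) * v for v in shuffled]  # action 12: sign
    return [gain * v for v in signed]                      # optional action 13


def decay_gain(elapsed_s, rate_per_s=0.05, floor=0.5):
    """Hypothetical schedule: lose 1/4 of the gain every 5 s, down to a floor."""
    return max(floor, 1.0 - rate_per_s * elapsed_s)


def energy(vector):
    return sum(v * v for v in vector)


template = [0.1, 0.2, 0.3, 0.4]  # made-up excitation template magnitudes
noise = randomized_excitation(template, rng=random.Random(0))
```

With a gain of 1.0, the output has the same energy as the template whatever the random draws are, because shuffling and sign flips leave every squared magnitude unchanged.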
At action 14, the generator receives the frequency template LT(F) adapted by the adaptive history module at action 9 in
Assume, for example, that the frequency template represents a frequency spectrum as shown in
At action 16 the generator converts the frequency template LT(F) to a Linear Predictive Coding (LPC) template. This template is suitable for acting as a linear prediction synthesis filter with the excitation to generate the comfort noise.
At action 17 the generator passes the randomized noise excitation from action 12 or 13 to the LPC synthesis filter. The LPC may result from actions 15 and 16 or just 16. The result is a sample that may be rendered to produce comfort noise. The comfort noise sample is provided at action 18.
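Actions 16 and 17 might be sketched in Python as below. The text does not specify the conversion, so this sketch substitutes the textbook line-spectral construction and makes several assumptions: the LSPs are treated as frequencies in radians, in ascending order between 0 and π, and the filter order is even. The function names are illustrative.

```python
import math


def _polymul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def lsp_to_lpc(lsp):
    """Convert M ascending LSP frequencies to coefficients of A(z), a[0] = 1."""
    m = len(lsp)
    if m % 2:
        raise ValueError("this sketch handles even filter orders only")
    p = [1.0, 1.0]   # symmetric polynomial, fixed root at z = -1
    q = [1.0, -1.0]  # antisymmetric polynomial, fixed root at z = +1
    for i, w in enumerate(lsp):
        factor = [1.0, -2.0 * math.cos(w), 1.0]
        if i % 2 == 0:
            p = _polymul(p, factor)  # 1st, 3rd, ... frequencies
        else:
            q = _polymul(q, factor)  # 2nd, 4th, ... frequencies
    # A(z) = (P(z) + Q(z)) / 2; the highest-order term cancels.
    return [0.5 * (pc + qc) for pc, qc in zip(p, q)][: m + 1]


def synthesize(excitation, a):
    """All-pole synthesis: y[n] = x[n] - sum over k of a[k] * y[n - k]."""
    m = len(a) - 1
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k in range(1, min(m, n) + 1):
            acc -= a[k] * y[n - k]
        y.append(acc)
    return y


# A uniformly spaced LSP should give a flat spectrum, i.e. A(z) = 1.
uniform = [(i + 1) * math.pi / 11 for i in range(10)]
flat = lsp_to_lpc(uniform)
```

The uniformly spaced case gives a quick sanity check: the resulting filter passes the excitation through unchanged, which matches the flat spectrum described for the uniform initialization.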
The generator continues to provide comfort noise samples until the next talk-and-noise portion is received by Calvin's phone-to-network device 108. The adaptive history module 128 continues to receive frames, excitation signals, and LSPs for packets G-P in the ongoing communication, shown in
The energy of the audio rendered for all of the audio signal received from Albert (“Calvin . . . how are you? . . . ”) is presented in
The following discussion, which is illustrated in
Block 1102 determines information about a segment of background noise in an audio signal. This segment may reside in any part of an audio signal, such as following a talk-spurt in a talk-and-noise portion as set forth above, or residing within a talk-spurt, such as a short period of background noise between two pieces of speech, or even background noise not immediately before or after a talk-spurt. This segment information indicates parameters of the actual background noise, such as its energy and frequency spectrum. In the embodiments described above, for example, this information includes an excitation signal and Linear Spectral Parameters (LSP) for frames of audio decoded from packets received over a communication network according to VoIP.
Block 1102 may determine this information frame-by-frame for a segment of background noise, such as for a segment received immediately after or within a talk-spurt (e.g., as part of a talk-and-noise portion of an audio signal) as described above. The tools may determine this just for packets known to contain background noise or for all packets, as is performed by decoder 126 in the above examples. An encoder on a speaker's communication device may indicate which packets represent background noise and which do not. Block 1104 assumes that the packets do not indicate or do not indicate accurately which represent background noise and which do not. Thus, these blocks act to determine which packets have frames of background noise. If the packets accurately indicate which represent background noise, the tools may skip block 1104 and proceed to block 1106.
Block 1104 determines which frames represent background noise. In one embodiment, the tools do so according to blocks 1104a, 1104b, and 1104c, though other manners may also be used in conjunction with or alternatively to the manners set forth in blocks 1104a through 1104c. These other manners may include, for example, determining which frame represents background noise based on: signal analysis of a frame; features extracted from a frame; side information or metadata about the nature of the frame embedded in the packet having the frame; the rate at which packets are received or the size of the packet having the frame; or an indication in the frame itself that the frame is speech or background noise.
Block 1104a calculates frame energies for frames of an audio signal received over a communication network. Block 1104b trains a background noise level based on the frame energies. Thus, as new frames are received, the tools update the background noise level to better determine which frames contain just background noise and which do not. The background noise, as noted in the above examples, may change over time. Some frames that would have been considered noise at one point may not be considered noise at a later point in time, or vice versa. By updating and adapting to changes in background noise, the tools may more accurately determine which frames represent background noise and which do not.
Block 1104c compares each frame's energy with the background noise level. The tools may determine which frames represent background noise by comparing the frame's energy with an adapting background noise level. In
Block 1106 receives information about background noise. Whether following block 1104 or 1102, block 1106 knows which frames are considered background noise and their information. In some of the above examples, for instance, the tools receive a talk-and-noise portion of an audio signal, determine which frames represent background noise based on their energy, and proceed with the information from the frames determined to be background noise. The segment of the audio signal determined to be background noise may include information for one or many frames determined to represent background noise. In the talk-and-noise portion 502 of
Block 1108 builds and/or adapts a noise history based on segment information about background noise in an audio signal of an ongoing communication. The tools provide updates or directly adapt this noise history responsive to changes in background noise to better enable generation of comfort noise. In the above examples, for instance, this segment information about the background noise includes excitation signals and LSPs for frames decoded from packets received over communication network 114 of
Block 1110 optionally alters the noise history to enable production of a more-pleasing comfort noise. In some cases the noise history, while accurate, may be altered to enable more-pleasing but possibly less-accurate comfort noise. If, for example, the frequency template contains a frequency peak that may be annoying or if the excitation template is simply too loud for comfort, the tools may alter these templates. As noted later, the tools may also or instead alter the templates during generation of comfort noise. In either case, whether following block 1108 or 1110, the tools provide a noise history effective to enable generation of comfort noise.
In all of process 1100, the tools may act at the listener's communication device. Thus, the outputting communication device (e.g., an encoder at the speaker's device) does not necessarily need to do anything more than provide audio containing speech and at least some audio containing background noise.
All of blocks 1102-1110 may be repeated. As new frames or segments of background noise are received, their information may be used to adapt the noise history. In the example illustrated in
Block 1112 receives a noise history indicating information about actual background noise in an audio signal received over a communication network. This noise history may have been built at the receiver, such as is described in some of the above examples. This noise history includes information usable to generate comfort noise and may be altered adaptively based on new background noise received. Thus, newer, adapted noise histories or updates to the noise history may be used, thereby enabling comfort noise to dynamically adapt to changes in background noise. This noise history may comprise, as described above, the frequency and excitation templates. In some cases block 1112 (e.g., the comfort noise generator) receives the noise history by actively accessing the noise history as needed to keep up-to-date.
Block 1114 generates comfort noise adaptively based on changes in background noise of an audio signal, such as based on how those changes are reflected in a changing noise history. If the noise history changes, such as when it is adapted based on changes in background noise, a different, adapted noise history is instead received or the prior history is altered (e.g., with an update). Block 1114 may generate comfort noise based on the most-recent noise history. Thus, the tools may generate comfort noise at one point in time and later generate different comfort noise based on changes to the actual background noise in the audio signal, effective to dynamically adapt comfort noise to changes in background noise in real-time and as a communication progresses.
The tools may perform various actions to generate comfort noise, such as those set forth in
The above-described tools are capable of enabling and/or generating comfort noise for voice communications over a network. The tools may adapt to changes in a speaker's background noise effective to generate comfort noise that also adapts to these changes. The tools may also do so at significant bandwidth savings over some other techniques. Although the tools have been described in language specific to structural features and/or methodological acts, it is to be understood that the tools defined in the appended claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the appended claims.