The invention pertains to systems and methods for modifying noise captured at nodes of a teleconferencing system during a teleconference, so that the modified noise produced at the different nodes is more consistent in spectral and spatial properties. Typically, noise (but not speech) captured at each teleconferencing system endpoint is modified to generate modified noise having a frequency-amplitude spectrum which matches at least substantially a target spectrum and a spatial property set (e.g., a spatial property) which matches at least substantially a target spatial property set.
It is typically desirable to provide teleconference participants a sense of perceptual continuity as they listen to rendered versions of audio captured during a multiparty conference. When different participants (users of teleconferencing system endpoints having different capabilities, spatial sound characteristics, and acoustic environments) join or leave a conference, it is often desirable to create a sense of continuity and consistency, regardless of which endpoints are present (and active) at different times in the mix (to be rendered) of content (speech signal and noise) captured at endpoints. When the mix is spatially rendered, it may be desirable for the mix to provide spatial cues as to which participant is currently active (e.g., to render speech from each different endpoint so that it is perceived as emanating from a different apparent source position). The inventors have recognized that even when a teleconference system endpoint implements such spatial rendering, it is typically desirable to provide teleconference participants a sense of perceptual continuity (e.g., so that the participants are not distracted by perceived changes in rendered noise at times of low or absent speech activity) throughout a conference.
Typical embodiments of the invention provide plausible consistency throughout a conference with minimal artifacts and processing distractions in the resulting rendered audio, even when each user listens to a spatially rendered mix of audio captured at the conference system endpoints.
PCT Application International Publication No. WO 2012/109384, having international filing date Feb. 8, 2012, published on Aug. 16, 2012, and assigned to the assignee of the present invention, describes a method and system for suppression of noise in audio captured (e.g., using a single microphone or an array of microphones) at a conferencing system endpoint. The noise suppression is applied to the signal captured at a single endpoint, as a function of both spatial properties (e.g., to suppress noise more if it would be perceived when rendered as emanating from a source at a different location than the source of speech uttered at the endpoint) and frequency (e.g., by determining a frequency dependent suppression depth, which is an amount of gain reduction per frequency band, and reducing the gain of the noise in each frequency band in accordance with the frequency dependent suppression depth).
The present inventors have recognized the desirability of applying noise suppression as a function of both spatial content and frequency to noise captured at each endpoint of multiple endpoints of a conferencing system, to produce processed (modified) noise having spectral and spatial properties which match (at least nearly) a common target (e.g., a target frequency amplitude spectrum which is common to all endpoints, and a target spatial property set (e.g., a single target spatial property) which is common to all endpoints). By producing modified noise which is more consistent (from endpoint to endpoint) in spectral and spatial properties than is the original unmodified noise from the endpoints, typical embodiments of the present invention provide conference participants a sense of perceptual continuity when sound is rendered in response to content (including the modified noise) captured at multiple endpoints (e.g., a mix of content captured at the endpoints, including the modified noise from each endpoint), for example, so that the participants are not distracted by perceived changes in rendered noise at times of low or absent speech activity during a conference.
Above-cited PCT International Publication No. WO 2012/109384 suggests applying noise suppression to noise captured by microphones at a single endpoint, as a function of spatial properties (e.g., to suppress the noise more if it would be perceived when rendered as coming from a source at a different location than the source of speech uttered at the endpoint). In contrast, typical embodiments of the present invention perform noise suppression as a function of spatial properties (sometimes referred to herein as “spatial warping”) to make noise (e.g., room noise) captured at each of multiple endpoints of a conferencing system more similar from endpoint to endpoint. In a class of embodiments of the invention, spatial warping (e.g., determined by a warping matrix) and spectral modification (which may be implemented by the same warping matrix, or by a separate gain stage) is applied to noise captured at each endpoint of a set of at least two endpoints of a conferencing system, and the audio (including the modified noise) captured at the endpoints is typically then mixed (e.g., at a server) to generate a mixed signal which can be rendered (e.g., at an endpoint) to enable a user to listen to a mix of sound captured at multiple endpoints of a conferencing system.
In a first class of embodiments, the invention is a method for modifying noise captured during a conference at each endpoint of a set of at least two endpoints of a teleconferencing system, said method including steps of:
(a) generating first noise samples indicative of noise captured at a first one of the endpoints and second noise samples indicative of noise captured at a second one of the endpoints; and
(b) modifying the first noise samples to generate first modified noise samples indicative of modified noise having a frequency-amplitude spectrum which at least substantially matches a target spectrum, and at least one spatial property which at least substantially matches at least one target spatial property, and modifying the second noise samples to generate second modified noise samples indicative of modified noise having a frequency-amplitude spectrum which at least substantially matches the target spectrum, and at least one spatial property which at least substantially matches the target spatial property.
Typically, a set of frames of audio samples is captured at each endpoint during the conference, the first noise samples are a first subset of the set of frames, each frame of a second subset of the set of frames is indicative of speech uttered by a conference participant (and typically also noise), each frame of the first subset is indicative of noise but not a significant level of speech, and the first modified noise samples are generated by modifying the first subset of the set of frames.
In another class of embodiments, the invention is a teleconferencing method including the steps of:
(a) at each endpoint of a set of at least two endpoints of a teleconferencing system, determining a sequence of audio frames indicative of audio captured at the endpoint during a conference, wherein each frame of a first subset of the frames is indicative of speech uttered by a conference participant (and typically also noise), and each frame of a second subset of the frames is indicative of noise but not a significant level of speech;
(b) at said each endpoint, generating modified frames by modifying each frame of the second subset of the frames, such that each of the modified frames is indicative of modified noise having a frequency-amplitude spectrum which at least substantially matches a target spectrum, and a spatial property set (e.g., a spatial property) which at least substantially matches a target spatial property set; and
(c) at said each endpoint, generating encoded audio including by encoding the modified frames and encoding each frame of the first subset of the frames.
Typically also, the method includes steps of: transmitting the encoded audio generated at said each endpoint to a server of the teleconferencing system, and at the server, generating conference audio indicative of a mix or sequence of audio captured at different ones of the endpoints.
In typical embodiments, the invention is a method and system for applying noise suppression at each endpoint (of multiple endpoints of a conferencing system) to modify noise (but typically not speech) captured at the endpoint in accordance with a common target (e.g., a target frequency amplitude spectrum and a target spatial property set) so that the modified noise at the different endpoints is more consistent in spectral and spatial properties, where the same target is used for all endpoints. This typically includes determining a set of frequency dependent values SNRi, which determine a noise suppression gain per frequency band (where index “i” identifies the band) equal to the difference between the level of noise (captured at an individual endpoint) in the band and a target level (e.g., common to all endpoints) for the band, and suppressing the noise in accordance with the set of frequency dependent values SNRi.
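As a minimal illustration of this per-band computation (a sketch in Python, not the patent's implementation; the band levels, target values, and function name are assumptions for the example):

```python
import numpy as np

def snr_i_gains_db(noise_level_db, target_level_db):
    """Per-band suppression gains in dB (negative values attenuate).

    Each band is suppressed by the amount SNR_i by which its estimated
    noise level exceeds the target level common to all endpoints;
    bands already at or below the target are left unchanged.
    """
    excess_db = np.maximum(noise_level_db - target_level_db, 0.0)
    return -excess_db

noise_db = np.array([-45.0, -52.0, -63.0])  # assumed per-band noise estimates
target_db = np.full(3, -60.0)               # assumed common target level
print(snr_i_gains_db(noise_db, target_db))  # -> [-15.  -8.  -0.]
```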
More generally, there are two components to modification of captured noise in accordance with the invention. These are spectral matching (the matching of a frequency-amplitude spectrum of captured noise to a target spectrum) and spatial matching.
To perform spectral matching, typical embodiments of the invention apply a noise suppression rule which is adaptively configured to suppress captured noise so that the suppressed noise has, on average, a spectral profile which matches (or substantially matches) a target spectrum. Typically, this is implemented during pre-processing, before each frame of suppressed noise (and each frame of speech captured at the teleconferencing system node at which the noise was captured) undergoes encoding. Typically, the frequency dependent pre-processing for noise suppression is applied, at each of two or more endpoints of a teleconferencing system, equally across a set of soundfield channels (each indicative of sound captured at a different one of the endpoints) so that when the soundfield channels are mixed or switched (e.g., at a server) there is no perceived spatial movement of audio sources between time intervals of voice and noise activity. There is one spectral suppression target common to all the endpoints, and this target is employed to modify noise captured at each of the endpoints.
In typical embodiments, spatial matching is implemented by applying a spatial warping matrix to noise after application of spectral shaping suppression gain to the noise. The spatial warping matrix is applied to modify at least one spatial characteristic of the noise asserted thereto (e.g., to make the modified noise isotropic, or to rotate the apparent source of the noise to make the modified apparent source position coincide with the apparent position of a conference participant who utters speech captured at the same endpoint at which the noise is captured). In other embodiments of the invention, spatial and spectral modification are not performed separately (and sequentially), and instead, both spatial modification (e.g., warping) of input audio and spectral modification of the input audio is performed in a single operation (e.g., by application of a single, non-unitary matrix to the input audio).
It is known to artificially add noise to teleconferencing audio in periods of complete silence. Such noise is referred to as “comfort noise” or “presence noise.” Adding presence noise can overcome the high impact of complete silence in between times of speech activity. In typical embodiments of the invention, a goal of the spatial matching is to achieve a modified background soundfield (i.e., the soundfield determined by the modified noise) that is similar in spatial properties to comfort noise that is injected somewhere between the point of capture and the process of rendering at a receiving endpoint. For example, the modified background soundfield (for each endpoint) may have spatial properties similar to the spatial properties of comfort noise to be applied to a mix (to be rendered) of all the modified individual endpoint signals. One example of a desired background soundfield spatial property is isotropy. For example, in some embodiments spatial modification is performed on multichannel noise in horizontal B-format (noise in “WXY” format) to generate isotropic modified noise, so that the components W, X and Y of the modified data will be uncorrelated, and the components X and Y of the modified data will have similar or equal power. Such modified noise will not be perceived to have a spatial preference or orientation bias in the modified soundfield.
It is possible to overlay (mix) continuously present comfort noise (also referred to as “presence noise”) with a mix of speech signals captured at endpoints of a conferencing system. This can overcome the high impact of complete silence in between times of speech activity. A class of embodiments of the invention provides a complementary approach, whereby background soundfields (of different endpoints of a teleconferencing system) are modified at the point of capture (i.e., at each individual capturing endpoint) to be more consistent both in spectral and spatial properties (i.e., to match target spectral and spatial properties that are common to all the endpoints). This has the advantage of producing modified versions of background soundfields captured at different locations which are more consistent, thereby decreasing the burden of processing (at the point of implementing a final mix of sound from multiple endpoints, or of switching between active endpoints) to achieve perceptual continuity. Some embodiments of the invention make use of presence noise (e.g., overlayed by a server with a mix or sequence of speech from speech-signal-capturing endpoints of a conferencing system), as well as modification (at each endpoint) in accordance with the invention of the noise portion of audio captured at each speech-signal-capturing endpoint.
The modification of noise in accordance with typical embodiments of the invention to match target spectral and spatial characteristics, and generation of a mix or sequence of teleconference endpoint signals (where each of the signals input to the mixing stage is indicative of speech captured at a different endpoint of a conferencing system, and noise that has been modified in accordance with the invention) ensures consistency (at least moderate consistency) of the frequency-amplitude spectrum and spatial properties of perceived noise when the mix or sequence is rendered. This can help to create a sense of connectedness and plausibility which lowers the distraction and loss of intelligibility associated with sudden shifts in the source and content of the signal(s) included in the mix.
Typical embodiments of the invention provide a means for achieving consistency in at least one spatial property (e.g., correlation and/or diversity) and spectral properties between noise (background or residual), captured with speech during a teleconference at different endpoints (typically in different rooms), when a mix of the noise and speech captured at a sequence of different subsets of the endpoints is rendered as a soundfield. Typically, both a target spectrum (a target noise value for each band of a set of frequency bands), and a target spatial property (e.g., a target spatial bias, which is typically fairly smooth and isotropic or spatially broad), are specified. Noise captured at each endpoint of a teleconferencing system is modified to generate modified noise having a frequency-amplitude spectrum which matches (at least substantially) the target spectrum and a spatial property which matches (at least substantially) the target spatial property. Typically, the speech captured at the endpoints is not so modified. Typically, during periods of relatively low speech activity (e.g., as determined by classification of signal activity and scene analysis), a corrective suppression depth (as a function of frequency) and spatial mixing are applied to the noise to achieve a match (at least approximately) to the target spectrum and spatial property, and during periods of relatively high speech activity, the corrective suppression depth (as a function of frequency) and spatial mixing are not applied to the noise.
Typically, the target is chosen so as not to require too much departure from typical conference rooms, and also to provide an acceptably pleasant background presence in the mix rendered by the endpoints of the conferencing system.
Aspects of the invention include a system configured (e.g., programmed) to perform any embodiment of the inventive method, and a computer readable medium (e.g., a disc) which stores code (in tangible form) for implementing any embodiment of the inventive method or steps thereof. For example, the inventive system can be or include a programmable general purpose processor, digital signal processor, or microprocessor (e.g., included in, or comprising, a teleconferencing system endpoint or server), programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of the inventive method or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform an embodiment of the inventive method (or steps thereof) in response to data asserted thereto.
Throughout this disclosure, including in the claims, the terms “speech” and “voice” are used interchangeably in a broad sense to denote audio content perceived as a form of communication by a human being, or a signal (or data) indicative of such audio content. Thus, “speech” determined or indicated by an audio signal may be audio content of the signal which is perceived as a human utterance upon reproduction of the signal by a loudspeaker (or other sound-emitting transducer).
Throughout this disclosure, including in the claims, the term “noise” is used in a broad sense to denote audio content other than speech, or a signal (or data) indicative of such audio content (but not indicative of a significant level of speech). Thus, “noise” determined or indicated by an audio signal captured during a teleconference (or by data indicative of samples of such a signal) may be audio content of the signal which is not perceived as a human utterance upon reproduction of the signal by a loudspeaker (or other sound-emitting transducer).
Throughout this disclosure, including in the claims, “speaker” and “loudspeaker” are used synonymously to denote any sound-emitting transducer (or set of transducers) driven by a single speaker feed. A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), all driven by a single, common speaker feed (the speaker feed may undergo different processing in different circuitry branches coupled to the different transducers).
Throughout this disclosure, including in the claims, each of the expressions “monophonic” audio, “monophonic” audio signal, “mono” audio, and “mono” audio signal, denotes an audio signal capable of being rendered to generate a single speaker feed for driving a single loudspeaker to emit sound perceivable by a listener as emanating from one or more sources, but not to emit sound perceivable by a listener as originating at an apparent source location (or two or more apparent source locations) distinct from the loudspeaker's actual location.
Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
Throughout this disclosure including in the claims, the term “couples” or “coupled” is used to mean either a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections.
Many embodiments of the present invention are technologically possible. It will be apparent to those of ordinary skill in the art from the present disclosure how to implement them. Embodiments of the inventive system and method will be described with reference to the accompanying figures.
Each node of the system described below is either an endpoint (e.g., endpoint 1, 3, or 4) or a server (e.g., server 5).
Endpoint 1 captures multichannel audio using microphones M1, M2, and M3, and is coupled by link 2 to server 5 of the teleconferencing system.
More specifically, endpoint 1 includes processor 10, banded spatial feature estimator 12, spectral banding element 14, spectral estimator 16, voice activity detector 18, gain determination stage 20, and gain stage 22, coupled as described below.
Processor 10 is configured to accept (as input audio) the captured audio signals that are output from microphones M1, M2, and M3 (or another set of microphones, in variations on the embodiment shown), and to perform pre-processing (including transformation into the frequency domain) on the input audio.
Each channel of the pre-processed, frequency-domain output of processor 10 (e.g., each of the W, X, and Y channels of a multi-channel, frequency-domain signal output from processor 10) is a sequence of frames (or blocks) of audio samples. For simplicity, we shall refer to each block or frame of samples of all the channels (which may consist of a single channel) output from processor 10 as a “frame” of samples. The samples in each channel of a frame output from processor 10 can be denoted as a sequence of samples Yn, n=0, . . . , F−1. In response to each channel of each frame of samples output from processor 10, spectral banding element 14 generates a sequence of frequency banded samples, which can be denoted as Yb′, b=1, . . . , B, where index b denotes frequency band. For each frame, the sequence of samples Yb′ includes, for each frequency band, b, a set of K samples, Ybk′, where k is an index in the range from 0 through K−1, and K is the number of channels of the frame.
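The banding step can be sketched as follows (a Python illustration under assumed band edges; a real implementation would use perceptually motivated bands and the system's own transform):

```python
import numpy as np

def band_powers(frame, band_edges):
    """Group the F frequency-domain samples of each of K channels into
    B coarser bands by summing power within each band.

    frame: complex frequency-domain samples, shape (K, F).
    band_edges: B+1 increasing bin indices (assumed, illustrative).
    Returns per-band power, shape (K, B).
    """
    K, F = frame.shape
    B = len(band_edges) - 1
    out = np.empty((K, B))
    for b in range(B):
        lo, hi = band_edges[b], band_edges[b + 1]
        out[:, b] = np.sum(np.abs(frame[:, lo:hi]) ** 2, axis=1)
    return out

# Example: 3 channels (W, X, Y), 256 bins, 4 illustrative bands.
rng = np.random.default_rng(0)
frame = rng.standard_normal((3, 256)) + 1j * rng.standard_normal((3, 256))
print(band_powers(frame, [0, 8, 32, 128, 256]).shape)  # -> (3, 4)
```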
Banded spatial feature estimator 12 is configured to generate spatial probability indicators Sb in response to the banded samples generated in processor 10 (or in response to the raw outputs of microphones M1, M2, and M3). These indicators can indicate an apparent source location (or range of locations) of sound indicated by the samples in some or all of the frequency bands of a frame of the captured audio, and can be used to spatially separate a signal into components originating from a desired location and components that do not. Beamforming in processor 10 may provide some degree of spatial selectivity, e.g., to achieve some suppression of out-of-position signal power and some suppression of noise.
We sometimes denote as Pb the total power spectrum (or other amplitude metric spectrum) of the banded, mixed-down samples, Yb′, in band b of a frame of the captured audio output from stage 14 (or in a channel of such frame of captured audio). The spectrum Pb is determined by spectral estimator 16. For example, if the captured audio output from stage 14 consists of a single channel, stage 16 may determine a power spectrum from each frame of samples of this channel. For another example, if the captured audio output from stage 14 comprises channels W, X, and Y, stage 16 may determine a power spectrum from each frame of samples of the W channel only.
We sometimes denote as Nb′ the power spectrum (or other amplitude metric spectrum) of the noise component (not indicative of speech by a conference participant) of the samples, Yb′, in band b of each channel (or of at least one channel) of the captured audio output from stage 14. Each such spectrum may also be determined by spectral estimator 16. Out-of-position power, sometimes denoted as PowerOutOfBeam′ and sometimes called out-of-beam power or out-of-location power, is the power or power spectrum (or other amplitude metric spectrum) determined from the samples, Yb′, that does not have an appropriate phase or amplitude mapping consistent with sound incident from a desired location (e.g., a known or expected location of a conference participant relative to a capturing microphone array), and desired signal power, sometimes denoted as PowerDesired′, is the remainder of Pb that is neither noise Nb′ (or noise and echo) nor PowerOutOfBeam′.
Voice activity detector (VAD) 18 is configured to generate a control value (denoted as “V” in the figure) indicating whether each frame of the captured audio is indicative of speech or of noise only.
In response to the control value V, the power spectrum Pb, and the spatial probability indicators Sb for each frame indicative of noise (but not speech), gain determination stage 20 determines a set of gain control values for configuring gain stage 22 to apply an appropriate gain to each frequency band of each channel of the samples of the frame, to cause the modified samples (of each channel) that are output from stage 22 to have a frequency amplitude spectrum which matches a target spectrum, and to have spatial properties which match a target set of spatial properties. The gain control values generated in stage 20 for each frame are indicative of a gain value for each of the frequency bands of the samples of each channel of the frame.
In accordance with a class of embodiments of the invention, an implementation of stage 20 (or a similar gain determination stage) in each endpoint of the system determines the gain control values using the same target spectrum and the same target spatial property set, so that the modified noise produced at the different endpoints is consistent in spectral and spatial properties.
With reference again to endpoint 1, the modified samples output from gain stage 22 are then encoded for transmission.
Endpoint 1 typically performs other (conventional) processing on input audio signals captured by microphones M1, M2, and M3, to generate the encoded audio output which is asserted to link 2, e.g., in additional subsystems or stages (not shown in the figure).
Endpoints 3 and 4 can (but need not) be identical to endpoint 1.
Server 5 is configured to generate conference audio indicative of a mix or sequence of audio captured at different ones of the endpoints of the system, and to assert the conference audio to one or more of the endpoints for rendering.
There are two components to modification of captured noise in accordance with typical embodiments of the invention. These are spectral matching (the matching of a frequency-amplitude spectrum of captured noise to a target spectrum) and spatial matching.
To perform spectral matching, typical embodiments of the invention apply a noise suppression rule which is adaptively configured to suppress captured noise so that the suppressed noise has, on average, a spectral profile which matches (or substantially matches) a target spectrum. Typically, this is implemented during pre-processing, before each frame of suppressed noise (and each frame of speech captured at the teleconferencing system node at which the noise was captured) undergoes encoding. Typically, the frequency dependent pre-processing for noise suppression is applied, at each of two or more endpoints of a teleconferencing system, equally across a set of soundfield channels (each indicative of sound captured at a different one of the endpoints) so that when the soundfield channels are mixed (e.g., at a server) there is no perceived spatial movement of audio sources between time intervals of voice and noise activity. As such, there is one spectral suppression target common to all the endpoints, and this target is employed to modify noise captured at each of the endpoints.
The spectral noise modification (suppression) may be implemented so as to impart a constant signal to noise ratio improvement. Alternatively, it may be implemented so as to impart not less than a predetermined minimum gain (to the noise to be modified), to prevent overly deep noise suppression. Typically, the perceived noise level (when the modified soundfields are rendered) will vary in much the same way as would occur if the original (unmodified) soundfields were rendered, albeit at a lower level after modification (suppression) in accordance with the invention. The target spectrum may be flat (or substantially flat) as a function of frequency, or it may have another shape, but the noise modification is typically applied in an effort to make the modified noise match the target, independently of the actual noise level.
For example, consider the upper graph of the figure, which shows an exemplary spectrum of noise captured in a room together with a target noise spectrum (labeled “Target”).
Typically, it is desirable to apply gain to audio indicative of captured speech, to achieve a flat pass-through of the desired speech signal. Alternatively, it may be desirable to apply gain to audio indicative of captured speech to achieve some predetermined spectral shape, which typically differs from the desired spectral shape for audio indicative of background noise.
Thus, a target spectrum for noise (e.g., the power spectrum labeled “Target” in the upper graph of the figure) typically differs from the pass-through spectrum desired for speech. In the example shown, the noise suppression gains (e.g., the values GN of the lower graph of the figure) are determined, in each frequency band, from the difference between the estimated noise spectrum and the target spectrum.
As is well known, minimum statistics or decision-based averaging may be used to estimate the noise level of the input signal.
A maximum suppression depth (an amount of gain suppression determined by one of the values GN of the lower graph of the figure) is typically imposed, to prevent overly deep suppression of the noise in any frequency band.
Where the incoming noise level (for a frequency band) is already below the desired target, typical embodiments apply no suppression to the incoming noise in that band.
The pass-through gain for speech (voice) is typically set to unity. Optionally, an alternative pass-through spectrum for speech is obtained by performing a fixed equalization after pre-processing in accordance with the invention. If this equalization is also applied to the modified noise (generated in accordance with the invention), the noise target can still be achieved by performing the inventive noise suppression so as to compensate for the equalization to be later applied.
When noise suppression in accordance with typical embodiments of the invention is active, the gains applied to the input audio (in different frequency bands) vary between unity (for audio indicative of speech, the desired signal), and the gain for achieving maximum suppression (for audio indicative of background noise; not speech).
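One way to realize such continuously varying gains is to interpolate, in each band, between unity gain and the full suppression depth using a per-frame speech probability (a sketch; the probability-weighted blend and the 18 dB depth cap are assumptions, consistent with the continuous-VAD variants discussed below):

```python
import numpy as np

def blended_gains_db(noise_db, target_db, p_speech, max_depth_db=18.0):
    """Per-band gains between unity (speech) and full suppression (noise),
    controlled by a speech probability p_speech in [0, 1]."""
    full_depth_db = np.clip(noise_db - target_db, 0.0, max_depth_db)
    return -(1.0 - p_speech) * full_depth_db  # p=1 -> 0 dB; p=0 -> full depth

print(blended_gains_db(np.array([-48.0]), np.array([-60.0]), p_speech=0.5))
# -> [-6.]  (half of the 12 dB depth when the speech probability is 0.5)
```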
A target spectrum for modified noise is plotted as curve T in the figure. Modifying noise captured in different rooms (in different conferencing system endpoints) to match a single common target spectrum (as in typical embodiments of the invention) typically provides perceptual continuity and consistency to conference participants when the noise is rendered. When target spectrum T of the figure is employed, noise captured in rooms having different noise spectra is modified so that the rendered noise from each room at least substantially matches the common spectral shape of curve T.
It should be appreciated that if a room has a noise spectrum similar in shape but higher in absolute level than a noise target spectrum employed in accordance with the invention, the noise suppression curve generated in accordance with the invention would be (for processing captured audio indicative of background noise) at least substantially flat over frequency.
As noted above, spatial matching is the second main component of modification of captured noise in accordance with typical embodiments of the invention. Spatial matching is typically implemented by applying a spatial warping matrix (e.g., as implemented by warping stage 62 of the embodiment described below) to noise, after application of spectral shaping suppression gain to the noise.
Typically, a goal of the spatial matching is to achieve a modified background soundfield (i.e., the soundfield determined by the modified noise) that is similar in spatial properties to presence (or “comfort”) noise that is injected somewhere between the point of capture and the process of rendering at a receiving endpoint. For example, the modified background soundfield (for each endpoint) may have spatial properties similar to the spatial properties of comfort noise to be applied to a mix (to be rendered) of all the modified individual endpoint signals. One example of a desired background soundfield spatial property is isotropy. For example, in some embodiments spatial modification is performed on multichannel noise in horizontal B-format (noise in “WXY” format) to generate isotropic modified noise, so that the components W, X and Y of the modified data will be uncorrelated, and the components X and Y of the modified data will have similar or equal power. Such modified noise will not be perceived to have a spatial preference or orientation bias in the modified soundfield.
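The stated isotropy criteria (W, X, and Y mutually uncorrelated; X and Y of similar power) can be checked numerically, as in the following sketch (the tolerances are assumptions, not values from the text):

```python
import numpy as np

def is_isotropic_wxy(wxy, corr_tol=0.1, power_tol_db=1.0):
    """wxy: real samples, shape (3, N), for channels (W, X, Y)."""
    c = np.corrcoef(wxy)  # 3x3 normalized correlation matrix
    off_diag_ok = np.all(np.abs(c[~np.eye(3, dtype=bool)]) < corr_tol)
    px, py = np.mean(wxy[1] ** 2), np.mean(wxy[2] ** 2)  # X and Y powers
    power_ok = abs(10 * np.log10(px / py)) < power_tol_db
    return off_diag_ok and power_ok

rng = np.random.default_rng(1)
print(is_isotropic_wxy(rng.standard_normal((3, 100000))))  # -> True
```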
It is possible to overlay (mix) continuously present noise (sometimes referred to as “comfort noise” or “presence noise”) with a mix of speech signals captured at endpoints of a conferencing system. This overcomes the high impact of complete silence in between times of speech activity. A class of embodiments of the invention provides a complementary and alternative approach, whereby background soundfields (of different endpoints of a teleconferencing system) are modified at the point of capture (i.e., at each individual capturing endpoint) to be more consistent both in spectral and spatial properties (i.e., to match target spectral and spatial properties that are common to all the endpoints). This has the advantage of producing modified versions of background soundfields captured at different locations that are more consistent, thereby decreasing the burden of processing (at the point of implementing a final mix of sound from multiple endpoints, or of switching between active endpoints) to achieve perceptual continuity. It is envisaged that some embodiments of the invention will make use of presence noise (overlayed by a server with a mix of speech from speech-signal-capturing endpoints of a conferencing system), as well as modification (at each endpoint) in accordance with the invention of the noise portion of audio captured at each speech-signal-capturing endpoint.
The modification of noise in accordance with typical embodiments of the invention to match target spectral and spatial characteristics, and generation of a mix of teleconference endpoint signals (where each of the signals input to the mixing stage is indicative of speech captured at a different endpoint of a conferencing system, and noise that has been modified in accordance with the invention) ensures consistency (at least moderate consistency) of the frequency-amplitude spectrum and spatial properties of perceived noise when the mix is rendered. This can help to create a sense of connectedness and plausibility which lowers the distraction and loss of intelligibility associated with sudden shifts in the source and content of the signal(s) included in the mix. In some embodiments, the inventive system is operable in a selected one of a first mode in which it modifies captured noise in accordance with the invention, and a second mode in which it does not so modify captured noise. The selection of the mode in which the endpoint operates may be user controlled or otherwise controlled.
Typical embodiments of the invention provide a means for achieving consistency in at least one spatial property (e.g., correlation and/or diversity) and spectral properties between noise (background or residual), captured with speech during a teleconference at different endpoints (typically in different rooms), when a mix of the noise and speech captured at a sequence of different subsets of the endpoints is rendered as a soundfield. Typically, both a target spectrum (a target noise value for each band of a set of frequency bands), and a target spatial property (e.g., a target spatial bias, which is typically fairly smooth and isotropic), are specified. Noise captured at each endpoint of a teleconferencing system is modified to generate modified noise having a frequency-amplitude spectrum which matches (at least substantially) the target spectrum and a spatial property which matches (at least substantially) the target spatial property. Typically, the speech captured at the endpoints is not so modified. Typically, during periods of relatively low speech activity (e.g., as determined by classification of signal activity and scene analysis), a corrective suppression depth (as a function of frequency) and spatial mixing are applied to the noise to achieve a match (at least approximately) to the target spectrum and spatial property, and during periods of relatively high speech activity, the corrective suppression depth (as a function of frequency) and spatial mixing are not applied to the noise.
Typically, the target is chosen so as not to require too much departure from typical conference rooms, and also to provide an acceptably pleasant background presence in the mix rendered by the endpoints of the conferencing system.
It has been found that captured room noise and/or capsule noise typically has some form of bias. In modifying such noise in accordance with typical embodiments, the desired spatial correction (to be achieved in accordance with the invention) may not be large; however, it may be noticeable, particularly when switching rapidly between modified soundfields.
By aggregating a spatial representation of the input noise to be modified, some embodiments of the invention determine at least one noise covariance matrix (or other data structure) indicative of spatial properties in a frequency band of the input noise to be modified. Each such matrix (or other structure) is tracked, e.g., in a manner similar to (and with the same update rates as) the manner in which the estimated power spectrum of the input is tracked. Although only a mono (single channel) version of the noise power spectrum estimate (e.g., an estimate of the power spectrum of the W channel of noise in WXY format) is typically used for spectral control (and maximum noise suppression), a full covariance matrix can be aggregated (and tracked) during identified intervals of speech inactivity during a conference. In some embodiments, a more conservative approach is taken to the estimation and tracking of the spatial properties of the input noise than to estimation and tracking of its power spectrum, with a longer time window in any minimum follower and/or less aggressive updates (than those employed for power spectrum tracking).
The covariance matrix (or other indication of spatial properties of the input noise) may be calculated for each of a set of frequency bands. In simpler embodiments, it can be averaged across the main spectral bands for spatial perception (e.g., across bands in the range from about 200 Hz to about 2 kHz).
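Such aggregation might be sketched as follows (the update weight and initialization are assumptions, chosen to illustrate the slower, more conservative tracking described above):

```python
import numpy as np

class NoiseCovarianceTracker:
    """Tracks a channel covariance (second-moment) estimate R,
    aggregated only during frames classified as noise (speech inactivity)."""

    def __init__(self, n_channels=3, alpha=0.02):
        self.R = np.eye(n_channels)  # running estimate (assumed initialization)
        self.alpha = alpha           # small update weight -> long time window

    def update(self, frame, is_noise):
        """frame: real samples, shape (n_channels, N); is_noise: VAD decision."""
        if is_noise:                 # aggregate only during speech inactivity
            R_frame = (frame @ frame.T) / frame.shape[1]
            self.R = (1.0 - self.alpha) * self.R + self.alpha * R_frame
        return self.R

# Usage: feed each frame together with the VAD decision for that frame.
tracker = NoiseCovarianceTracker()
frame = np.random.default_rng(2).standard_normal((3, 512))
print(tracker.update(frame, is_noise=True).shape)  # -> (3, 3)
```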
Next, another embodiment of the inventive noise modification system is described, with reference to the figure.
In the embodiment shown, the input audio is a three-channel signal in horizontal B-format (“WXY” format). The system includes voice activity detector 50, noise estimation block 52, blocks 54, 55, and 56, switches 58 and 61, gain stage 60, and spatial warping stage 62, coupled as described below.
Voice activity detector (VAD) 50 is configured to determine whether each frame of the input audio WXY (i.e., the samples indicative of each frame of the input audio WXY) is indicative of speech (e.g., speech in the presence of noise) or only noise. In response to determining that a frame of the input audio is indicative of speech, VAD 50 places switch 58 in a state in which it asserts default gain control values (e.g., values GV of the type described above) to gain stage 60, so that stage 60 applies a default pass-through gain to the frame. In response to determining that a frame of the input audio is indicative of noise only, VAD 50 places switch 58 in a state in which it asserts the corrected gain control values determined in block 56 to gain stage 60.
For the sake of brevity, VAD 50 is described as making a binary decision. It will be apparent to one skilled in the art that in alternate embodiments, the VAD output may be a continuous value (e.g., a probability of speech), and the decision to apply any of the described signal processing may be implemented in a continuously varying way; such approaches are well known in the art.
In response to each frame of frequency-domain samples of input audio channel W, noise estimation block 52 determines the power of input audio channel W as a function of frequency band (i.e., the power Pi of W in the “i”th frequency band, for all values of index i) and asserts the Pi values to block 56. In response, block 56 determines gain control values GNi (one value GNi for each frequency band) for configuring gain stage 60 to apply corrected gain (rather than the default gain which would otherwise be applied by stage 60) to each frequency band of the frame of input audio WXY (i.e., to each frame of input audio which consists of noise), to cause the power spectrum of the output of stage 60 to match a predetermined target noise spectrum. The same predetermined target noise spectrum (e.g., the target spectrum described above) is employed at each endpoint of the teleconferencing system.
Similarly, in response to determining that a frame of the input audio is indicative of speech (e.g., speech in the presence of noise), VAD 50 places switch 61 in a state in which it asserts default control values (generated or stored in element 55) to spatial warping stage 62 to cause stage 62 to pass through (unchanged) the output of gain stage 60. In this state, stage 62 does not modify the spatial characteristics of the audio output from stage 60.
In response to determining that a frame of the input audio is indicative of noise (not speech), VAD 50 places switch 61 in a state in which it asserts gain control values, M, generated in block 54 to warping stage 62, to cause stage 62 to modify (spatially warp) at least one spatial characteristic of the gain-modified version of the frame which is output from stage 60, such that the output of stage 62 has a spatial characteristic set which matches a predetermined target spatial characteristic set. Gain control values M configure stage 62 to perform warping matrix multiplication on each frequency band (or bin) of each of the three channels of the current gain-modified audio frame output from stage 60.
As will be described below, the gain control values M typically determine (or are organized as) a matrix M, of form M=(RT/R)^1/2, where R and RT are matrices described below, and RT/R denotes matrix multiplication of matrix RT by the inverse of matrix R. Thus, stage 54 determines a specific spatial warping matrix M to be applied by stage 62 to each frequency band of each frame of the input audio WXY.
The matrix M=(RT/R)^1/2 is a matrix square root of the form A^1/2=VSV^−1, where A=RT/R is diagonalizable as A=VDV^−1, D is a diagonal matrix, and S is the diagonal matrix which is the (element-wise) square root of D.
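The following sketch computes M by exactly this diagonalization (assuming, as the text does, that A=RT/R is diagonalizable; the complex cast and the real_if_close cleanup are implementation assumptions):

```python
import numpy as np

def warp_matrix(R_target, R_noise):
    """M = (R_target / R_noise)^(1/2), computed as A^(1/2) = V S V^-1,
    where A = R_target @ inv(R_noise) = V D V^-1 and S = sqrt(D)."""
    A = R_target @ np.linalg.inv(R_noise)
    d, V = np.linalg.eig(A)                  # A assumed diagonalizable
    S = np.diag(np.sqrt(d.astype(complex)))  # element-wise square root of D
    return np.real_if_close(V @ S @ np.linalg.inv(V))

# Isotropic target (identity) and an assumed diagonal noise covariance:
M = warp_matrix(np.eye(3), np.diag([1.0, 1.2, 0.8]))
print(np.round(M, 4))  # diagonal 1/sqrt([1.0, 1.2, 0.8]); last entry 1.118
```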
The gain control values M determine a warp matrix to be applied by stage 62 to a frame of audio data WXY (e.g., they are themselves such a warp matrix). When a frame of audio data is indicative of speech (e.g., speech in the presence of noise), stage 62 does not spatially warp the frame and typically instead implements and applies to the frame an identity matrix determined by control values from block 55. When a frame of audio data is indicative of noise, stage 62 implements and applies to the frame the warp matrix determined by values M output from block 54. Conditioning may be applied to each sequence of warp matrices determined by a sequence of sets of values M asserted from block 54 to stage 62, e.g., to ensure that the spatial warping applied to a sequence of frames of noise by the conditioned sequence of warp matrices does not include any perceptually disturbing warp (e.g., rotation).
Similar conditioning (filtering) can also (or alternatively) be applied to a sequence of gain control values (determined by stage 56) to be applied to frames of noise data by stage 60, e.g., to prevent (filter out) undesired changes between the successively determined power spectra of the resulting modified frames of noise data.
The conditioning of successive sets of gain control values M output from block 54 (or successive sets of gain control values output from block 56) can be applied in any of many different ways. For example, in some embodiments, the conditioning is performed in any of the ways suggested by above-cited PCT Application International Publication No. WO 2012/109384. The conditioning may employ the concept of a speech probability in each frequency bin or band. In general, such a speech probability is any monotonic function whose range is from 0 to 1 which is indicative of a probability that audio content (e.g., of a frame of input audio) in the bin (or band) is indicative of speech (if the allowed values of speech probability are more constrained, the modified noise generated in accordance with some embodiments of the invention may not be optimal in the perceptual sense). Typically, the speech probability for each bin (or band) sensibly depends on the current input audio frame (or block) and a few previous input audio frames (or blocks) of the bin (band) and neighboring bins (bands).
The conditioning of a sequence of sets of gain control values (or spatial warping matrices) may be determined by a leaky minimum follower with a tracking rate defined by at least one minimum follower leak rate parameter. The leak rate parameter(s) of the leaky minimum follower may be controlled by the probability of speech being indicated (e.g., as determined by VAD 50 of the embodiment shown).
In some embodiments, the conditioning may be performed by determining a covariance matrix (e.g., a broadband matrix R, of the type to be described below, determined in stage 54) over the range of interest, and determining a set of control values M (which determine a warping matrix) associated with this covariance matrix. The sequence of control value sets M (for a sequence of frames of input audio) is filtered so that changes between successive sets have gentle onsets (at transitions from input speech to input noise) but are allowed to be sudden (at transitions from input noise to input speech), much like in a conventional phrasing VAD. The application of the spatial warp could fade in (e.g., over 150 ms or a similar time interval) at each transition from speech to noise, and thus be largely concealed.
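A minimal sketch of such conditioning (the frame rate, the roughly 150 ms fade, and the linear matrix blend are assumptions; the text requires only gentle onsets and sudden releases):

```python
import numpy as np

class WarpSmoother:
    """Fades the spatial warp in gradually at speech-to-noise transitions
    and snaps back to identity immediately when speech returns."""

    def __init__(self, n_channels=3, fade_frames=15):  # e.g., 15 x 10 ms = 150 ms
        self.w = 0.0                  # current warp weight in [0, 1]
        self.step = 1.0 / fade_frames
        self.I = np.eye(n_channels)

    def apply(self, M, is_noise):
        if is_noise:
            self.w = min(1.0, self.w + self.step)  # gentle onset
        else:
            self.w = 0.0                           # sudden release for speech
        # Linear blend between identity (no warp) and the target warp M.
        return (1.0 - self.w) * self.I + self.w * M
```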
In the embodiment shown, the elements rjk of the covariance matrix R (determined in stage 54 from the channels of each frame of input audio) may be computed as:

rjk = (1/N) Σn=1, . . . , N (rj,n−av(rj))(rk,n−av(rk)),

where rj,n denotes the “n”th of the N samples in the “j”th channel, “av(rj)” denotes the average of the N samples in the “j”th channel, and “av(rk)” denotes the average of the N samples in the “k”th channel. One can accumulate a number of frames (e.g., T frames), and do the averaging over samples in each set of accumulated frames, e.g., so that the number N in the foregoing equation is the number of samples in each set of T accumulated frames. In typical implementations, the terms that are summed are weighted (typically with weights that decay exponentially over time, and correspond to some frequency banding envelope).
More generally, in variations on the embodiment shown in which the input audio comprises K channels, the elements rjk of the covariance matrix R are determined in the same manner:

rjk = (1/N) Σn=1, . . . , N (rj,n−av(rj))(rk,n−av(rk)),

where “av(rj)” denotes the average of the N samples in the “j”th channel and “av(rk)” denotes the average of the N samples in the “k”th channel. In typical implementations, the terms that are summed are weighted (typically with weights that decay exponentially over time, and correspond to some frequency banding envelope). One can accumulate a number of frames (e.g., T frames), and do the averaging over samples in each set of accumulated frames, e.g., so that the number N in the foregoing equation is the number of samples in each set of T accumulated frames.
In the general case in which the input audio data comprises K channels, the target spatial properties are indicated by a K×K target matrix RT for each frequency band, whose elements tjk are covariance values, with index j in the range from 1 to K and index k in the range from 1 to K. In the embodiment shown, K=3 (the W, X, and Y channels).
Thus, the elements of matrix R determine a second order statistic of how the input audio is spatially biased. Similarly, the elements of target matrix RT (for each frequency band) determine how target noise (in the band) is spatially biased. If the channels of the input audio were independent over a significant length of time, matrix R would approach an identity matrix, asymptotically. If the elements of matrix R are accumulated only during intervals in which the input audio is indicative of only noise (not speech), e.g., as determined by VAD 50 of the embodiment shown, then matrix R is indicative of the spatial bias of the captured noise.
For example, for input audio in WXY format (as in the embodiment shown), an exemplary covariance matrix R accumulated during intervals of noise is given by Equation (1), which indicates that the input noise is slightly biased in the zero azimuth direction.
In this and following examples, for simplicity we present a single real covariance matrix. It is evident to one of ordinary skill in the art that this approach can be applied generally in any transform or filterbank implementation where each covariance matrix represents some bin or band across a range of frequencies, and may potentially include complex terms in the off diagonal components.
Assuming the exemplary input audio of Equation (1), and a target matrix RT which is isotropic (RT=the identity matrix), it is apparent that the warping matrix implemented by stage 62 can be determined by the matrix, (RT/R)^1/2, which is a matrix square root (whose elements are gain control values for controlling operation of stage 62). Thus, if R is as shown in Equation (1) and RT is the identity matrix of Equation (2), the resulting warping matrix M is given by Equation (3).
The matrix M of Equation (3) is a suitable warping matrix to be applied by stage 62 to modify a frequency band of a frame of noise (i.e., a frame of modified noise output from stage 60) to make its spatial properties match those determined by the target matrix RT. Note that in Eq. (3), the third diagonal value (equal to 1.1180) is the square root of 1.25, so that matrix M of Eq. (3) scales as well as rotates the input noise, and is not a unitary matrix. In some implementations of the embodiment shown, the warping matrix applied by stage 62 is instead constrained to be unitary, with all spectral gain applied by stage 60 (as discussed below).
If stage 62 applies the 3×3 matrix M of Eq. (3) to each frequency band of each frame of the three channels of audio output from stage 60 (i.e., channels W, X, and Y, which can be denoted as a time sequence of samples of an audio data vector v=(v1, v2, v3)), then the covariance matrix (“Rout”) of the output of stage 62 is the covariance matrix of vector Mv:

Rout=E{(Mv)(Mv)^T}=M E{vv^T} M^T=MRM^T=I=RT,   Eq. (4)

where M^T is the transpose of matrix M, and I is the identity matrix (as in Equation (2)). Since gain stage 60 does not modify the spatial properties of the input noise WXY, Equation (4) shows that matrix M of Equation (3) is a suitable warping matrix to be applied by stage 62 to modify the input noise WXY to make its spatial properties match those determined by the target matrix RT.
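Equation (4) is easy to verify numerically for an isotropic target. In the sketch below the covariance values are assumptions, not values from the patent (the third diagonal value of R is set to 0.8 so that the corresponding diagonal value of M is 1/√0.8 ≈ 1.1180, consistent with the value quoted above):

```python
import numpy as np

R = np.array([[1.0, 0.1, 0.0],
              [0.1, 1.2, 0.0],
              [0.0, 0.0, 0.8]])  # assumed nonisotropic noise covariance

d, V = np.linalg.eigh(R)          # R is symmetric, so eigh applies
M = V @ np.diag(d ** -0.5) @ V.T  # M = R^(-1/2), i.e., (RT/R)^(1/2) with RT = I

print(np.round(M[2, 2], 4))                 # -> 1.118
print(np.allclose(M @ R @ M.T, np.eye(3)))  # -> True: M R M^T = I = RT
```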
When stage 62 (of the embodiment shown) applies such a warping matrix to each frame of noise output from stage 60, the modified noise output from stage 62 has at least substantially the target spatial properties.
If the warping matrix (for each frequency band) applied by stage 62 is unitary, stage 60 typically applies gain (as a function of frequency band) to input noise to implement the desired noise suppression depth profile (determined by stage 56) to make the frequency amplitude spectrum of stage 60's output (and stage 62's output) match the target spectrum. Alternatively, the warping matrix applied by stage 62 (for a frequency band) is not unitary, and stages 60 and 62 together apply gain to the input noise to implement the desired noise suppression depth profile (to make the frequency amplitude spectrum of the output of stage 62 match the target spectrum, although the output of stage 60 does not match the target spectrum). The choice of the target matrix RT (e.g., the choice as to whether it is unitary) determines which of these two classes of implementations of the embodiment is employed.
It should be appreciated that in operation of the embodiment shown, frames indicative of speech pass through stages 60 and 62 without spectral or spatial modification, while frames indicative of noise are both spectrally and spatially modified.
In some embodiments (including some implementations of the embodiment shown), a single broadband covariance matrix R (and thus a single warping matrix M) is determined for a broad frequency range (e.g., by averaging across the main spectral bands for spatial perception), rather than separately for each frequency band.
In another exemplary implementation of the embodiment shown, the input noise has a nonisotropic covariance matrix R whose level is approximately 20 dB above that of the desired target matrix RT, so that the matrix M=(RT/R)^1/2 implements both the desired suppression and the spectral/spatial warping. Based on this matrix M, and assuming speech probability p=0.5, matrix A of Equation (5) (described below) is A=M^(1−p). The probability of p=0.5 effects half of the desired suppression, which results in matrix A being applied with a scaling of around −10 dB, as expected.
In some embodiments (e.g., the embodiment described next), spectral modification and spatial warping are not performed by separate, sequential stages; instead, both are performed by application of a single matrix to each frequency band of the input audio.
It will be apparent to those of ordinary skill in the art that some embodiments of the invention implement scaling and rotation of multiple channels such that a certain covariance or second order statistical target is achieved, where the second order statistic is expressed, considered, and matched over space and frequency, and where use is made of time averaging of the input statistics during periods of identified low or absent signal-of-interest level. Given the algebraic nature of the overall signal manipulation, it may be implemented by varying sub-operations of scaling individual channels, scaling the complete channel set, and/or performing a linear operation on the set of channels that performs (or is capable of performing) some scaling and/or rotation (cross-talk mixing) of the multi-channel set. It is envisaged that this operation can be applied at various points in the chain where there is access to: (a) the target second order statistics, (b) the estimate of the recent input second order statistics integrated over periods of low activity, and (c) the degree or extent of application and any associated time constants or controls. This application and operation may be singular or distributed, and consideration of the desired signal manipulation outcome leads to various possible stages and/or optimizations that are well known to ordinarily skilled practitioners in the art of multichannel signal processing.
Thus, the warping matrix applied (e.g., by stage 62 of the embodiment shown) may implement any combination of such scaling and rotation (cross-talk mixing) operations on the channel set.
The system of the next embodiment implements spectral modification and spatial warping of the input audio in a single operation. The system includes processor 154 and matrix application stage 162, which operate on input audio WXY. Processor 154 is configured to perform the functions of noise estimation block 52 of the previously described embodiment, and also to determine the combined spectral modification and spatial warping matrices to be applied by stage 162.
More specifically, processor 154 is configured to determine, for each frequency band of each frame of input audio WXY, a set of gain control values M which determine (or are organized as) a matrix M, of form M=(RT/R)^1/2, where each matrix R is a source noise covariance matrix (or other data structure) indicative of spatial properties of the frame of input audio in the band, the matrices R (for all the bands) determine the frequency amplitude (e.g., power) spectrum of the frame of input audio, each matrix RT determines a set of target spatial properties of output audio in the band, the matrices RT (for all the bands) determine a target spectrum (e.g., a target power spectrum) for output audio, and RT/R denotes matrix multiplication of a matrix RT by the inverse of a matrix R. Thus, stage 154 determines a specific spectral modification and spatial warping matrix M to be applied by stage 162 to each frequency band of each frame of the input audio WXY.
In a typical implementation of this embodiment, the matrices are updated only during intervals in which the input audio is indicative of noise (not speech), and frames indicative of speech are passed through without spectral or spatial modification.
In a class of embodiments of the invention, a target matrix (e.g., target matrix RT of the type described above) determines both a target spectrum and a target spatial property set, and each frame of input noise is modified by applying a matrix Ai to the samples Gi in each frequency band of the frame, to generate modified samples Gi′:
Gi′=AiGi,   Eq. (5)
where each matrix Ai is applied to samples in a different frequency band Gi of the frame, each matrix Ai has form Ai=Mi^(1−p) (where p is the speech probability), each matrix Mi has form Mi=(RT/R)^1/2, where R is a source noise covariance matrix (or other data structure) indicative of spatial properties of the frame of input audio in the “i”th band, the matrices R (for all the bands) determine the frequency amplitude (e.g., power) spectrum of the frame of input audio, and each matrix RT determines a target gain and a target spatial property set for a different one (the “i”th one) of the bands.
Using such a matrix Ai, with input audio in the “i”th band and a target matrix RT for the “i”th band which satisfy the above equations (1), (2), and (3), with an input noise estimate at −40 dB, and target isotropic noise at −60 dB (20 dB SNRi), and speech probability equal to p=0.5, it is apparent that Equation (5) determines the complete desired noise suppression matrix for the filterbank of the inventive system for the “i”th band. As expected, the diagonal elements of matrix Ai in this example represent approximately 10 dB of attenuation of the input audio, which is half of the desired noise suppression gain (SNRi) due to the speech probability at that point. In the example, the matrix Ai also provides spatial warping to correct for the nonisotropic nature of the input audio.
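A worked check of this example (a sketch assuming the isotropic case, so that Mi reduces to a scaled identity; scipy's fractional_matrix_power computes Mi^(1−p)):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

# 20 dB (power) gap between input noise and target -> amplitude gain 0.1.
M = 0.1 * np.eye(3)                    # assumed isotropic full-suppression matrix
p = 0.5                                # speech probability from the example
A = fractional_matrix_power(M, 1 - p)  # A = M^(1/2)

print(20 * np.log10(A[0, 0]))  # -> about -10.0 dB, half the full 20 dB depth
```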
Aspects of the invention include a system or device configured (e.g., programmed) to perform any embodiment of the inventive method, and a computer readable medium (e.g., a disc) which stores code for implementing any embodiment of the inventive method or steps thereof. For example, the inventive system can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of the inventive method or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform an embodiment of the inventive method (or steps thereof) in response to data asserted thereto.
The system of each disclosed embodiment may be implemented in such a programmable processor (e.g., a processor included in a teleconferencing system endpoint or server).
Another aspect of the invention is a computer readable medium (e.g., a disc) which stores code for implementing any embodiment of the inventive method or steps thereof.
While specific embodiments of the present invention and applications of the invention have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the invention described and claimed herein. It should be understood that while certain forms of the invention have been shown and described, the invention is not to be limited to the specific embodiments described and shown or the specific methods described.
This application claims benefit of priority to related, co-pending U.S. Provisional Patent Application No. 61/781,669, filed on Mar. 14, 2013, entitled “Spectral and Spatial Modification of Noise Captured During Teleconferencing,” which is hereby incorporated by reference in its entirety.