1. Field
This disclosure is related to audio signal processing.
2. Background
An existing approach to audio masking applies the fundamental concept that a tone can mask other tones that are at nearby frequencies and are below a certain relative level. With a high enough level, a white noise signal may be used to mask speech, and such a sound masking design may be used to support secure conversations in offices.
Other approaches to restricting the area within which a sound may be heard include ultrasonic loudspeakers, which require different fundamental hardware designs; headphones, which provide no freedom if the user desires ventilation at his or her head; and general sound maskers as may be used in a national security office, which typically involve large-scale fixed construction.
A method of signal processing according to a general configuration includes producing a multichannel source signal that is based on a speech signal; producing an obfuscated speech signal that is based on the speech signal; and producing a multichannel masking signal that is based on the obfuscated speech signal. This method also includes driving a directionally controllable transducer, in response to the multichannel source signal and the multichannel masking signal, to produce a sound field comprising (A) a source component that is based on the multichannel source signal and (B) a masking component that is based on the multichannel masking signal. Computer-readable storage media (e.g., non-transitory media) having tangible features that cause a machine reading the features to perform such a method are also disclosed.
An apparatus for signal processing according to a general configuration includes means for producing a multichannel source signal that is based on a speech signal; means for producing an obfuscated speech signal that is based on the speech signal; and means for producing a multichannel masking signal that is based on the obfuscated speech signal. This apparatus also includes means for driving a directionally controllable transducer, in response to the multichannel source signal and the multichannel masking signal, to produce a sound field comprising (A) a source component that is based on the multichannel source signal and (B) a masking component that is based on the multichannel masking signal.
An apparatus for signal processing according to another general configuration includes a first spatially directive filter configured to produce a multichannel source signal that is based on a speech signal; a masking signal generator configured to produce an obfuscated speech signal that is based on the speech signal; and a second spatially directive filter configured to produce a multichannel masking signal that is based on the obfuscated speech signal. This apparatus also includes an audio output stage configured to drive a directionally controllable transducer, in response to the multichannel source signal and the multichannel masking signal, to produce a sound field comprising (A) a source component that is based on the multichannel source signal and (B) a masking component that is based on the multichannel masking signal.
The systems, methods, and apparatus described herein include arrangements that may be used to reduce the intelligibility of a speech signal using a masking signal that is an obfuscated yet correlated version of the speech signal. In this context, obfuscation of a speech signal indicates reducing intelligibility of the speech signal.
Unless expressly limited by its context, the term “signal” is used herein to indicate any of its ordinary meanings, including a state of a memory location (or set of memory locations) as expressed on a wire, bus, or other transmission medium. Unless expressly limited by its context, the term “generating” is used herein to indicate any of its ordinary meanings, such as computing or otherwise producing. Unless expressly limited by its context, the term “calculating” is used herein to indicate any of its ordinary meanings, such as computing, evaluating, estimating, and/or selecting from a plurality of values. Unless expressly limited by its context, the term “obtaining” is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from an external device), and/or retrieving (e.g., from an array of storage elements). Unless expressly limited by its context, the term “selecting” is used to indicate any of its ordinary meanings, such as identifying, indicating, applying, and/or using at least one, and fewer than all, of a set of two or more. Unless expressly limited by its context, the term “determining” is used to indicate any of its ordinary meanings, such as deciding, establishing, concluding, calculating, selecting, and/or evaluating. Where the term “comprising” is used in the present description and claims, it does not exclude other elements or operations. The term “based on” (as in “A is based on B”) is used to indicate any of its ordinary meanings, including the cases (i) “derived from” (e.g., “B is a precursor of A”), (ii) “based on at least” (e.g., “A is based on at least B”) and, if appropriate in the particular context, (iii) “equal to” (e.g., “A is equal to B” or “A is the same as B”). 
Similarly, the term “in response to” is used to indicate any of its ordinary meanings, including “in response to at least.” Unless otherwise indicated, the terms “at least one of A, B, and C,” “one or more of A, B, and C,” “at least one among A, B, and C,” and “one or more among A, B, and C” indicate “A and/or B and/or C.” Unless otherwise indicated, the terms “each of A, B, and C” and “each among A, B, and C” indicate “A and B and C.”
References to a “location” of a microphone of a multi-microphone audio sensing device indicate the location of the center of an acoustically sensitive face of the microphone, unless otherwise indicated by the context. The term “channel” is used at times to indicate a signal path and at other times to indicate a signal carried by such a path, according to the particular context. Unless otherwise indicated, the term “series” is used to indicate a sequence of two or more items. The term “logarithm” is used to indicate the base-ten logarithm, although extensions of such an operation to other bases are within the scope of this disclosure. The term “frequency component” is used to indicate one among a set of frequencies or frequency bands of a signal, such as a sample (or “bin”) of a frequency domain representation of the signal (e.g., as produced by a fast Fourier transform) or a subband of the signal (e.g., a Bark scale or mel scale subband).
Unless indicated otherwise, any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa), and any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa). The term “configuration” may be used in reference to a method, apparatus, and/or system as indicated by its particular context. The terms “method,” “process,” “procedure,” and “technique” are used generically and interchangeably unless otherwise indicated by the particular context. A “task” having multiple subtasks is also a method. The terms “apparatus” and “device” are also used generically and interchangeably unless otherwise indicated by the particular context. The terms “element” and “module” are typically used to indicate a portion of a greater configuration. Unless expressly limited by its context, the term “system” is used herein to indicate any of its ordinary meanings, including “a group of elements that interact to serve a common purpose.”
Any incorporation by reference of a portion of a document shall also be understood to incorporate definitions of terms or variables that are referenced within the portion, where such definitions appear elsewhere in the document, as well as any figures referenced in the incorporated portion. Unless initially introduced by a definite article, an ordinal term (e.g., “first,” “second,” “third,” etc.) used to modify a claim element does not by itself indicate any priority or order of the claim element with respect to another, but rather merely distinguishes the claim element from another claim element having a same name (but for use of the ordinal term). Unless expressly limited by its context, each of the terms “plurality” and “set” is used herein to indicate an integer quantity that is greater than one.
It may be assumed that in the near-field and far-field regions of an emitted sound field, the wavefronts are spherical and planar, respectively. The near-field may be defined as that region of space which is less than one wavelength away from a sound receiver (e.g., a microphone array). Under this definition, the distance to the boundary of the region varies inversely with frequency. At frequencies of 200, 700, and 2000 hertz, for example, the distance to a one-wavelength boundary is about 170, 49, and 17 centimeters, respectively. It may be useful instead to consider the near-field/far-field boundary to be at a particular distance from the microphone array (e.g., 50 centimeters from a microphone of the array or from the centroid of the array, or one meter or 1.5 meters from a microphone of the array or from the centroid of the array).
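The one-wavelength distances quoted above follow from dividing the speed of sound by the frequency. The following minimal sketch reproduces that arithmetic; the speed of sound (343 m/s) and the function name are assumptions chosen for illustration and do not appear in the text.

```python
# Assumed value: speed of sound in air at roughly 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0


def one_wavelength_boundary_cm(frequency_hz: float,
                               speed_of_sound: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """Return the distance, in centimeters, equal to one wavelength."""
    return 100.0 * speed_of_sound / frequency_hz


# Print the boundary distance for the three example frequencies.
for frequency in (200.0, 700.0, 2000.0):
    print(f"{frequency:6.0f} Hz: {one_wavelength_boundary_cm(frequency):6.1f} cm")
```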
Examples of audio sensing devices that may be implemented to include a multi-microphone array and to perform a method as described herein include portable computing devices (e.g., laptop computers, notebook computers, netbook computers, ultra-portable computers, tablet computers, mobile Internet devices, smartbooks, smartphones, etc.), audio- or video-conferencing devices, and display screens (e.g., computer monitors, television sets).
It may be desirable to obfuscate a speech signal (i.e., to reduce intelligibility). For a case in which the speech signal is part of a confidential conversation, it may be desirable to direct an obfuscated version of the speech signal into a surrounding space to prevent a bystander or intentional eavesdropper from understanding the words being spoken. For a case in which the speech signal is part of a scene being recorded (e.g., a surveillance video), it may be desirable to obfuscate the speech signal to provide an accurate representation of the acoustic environment while maintaining the privacy of the spoken communication.
Examples of methods of reducing speech intelligibility include replacing linear prediction coding (LPC) coefficients of the speech signal as described in U.S. Pat. No. 8,140,326 B2 (Chen et al.). Like a noise-based masking signal, however, such a signal is likely to create a perception of two different sources to a bystander. Another approach to making voice sounds unintelligible to persons nearby includes non-acoustically sensing and processing a user's speech as described in US Publ. Pat. Appl. No. 2012/0053931 A1 (Holzrichter).
A further approach to reducing intelligibility of a speech signal is to change the order of the frames of the speech signal in time as described in US Publ. Pat. Appl. No. 2010/0208912 A1 (Tohyama et al.). While such rearrangement may reduce intelligibility of the speech content, it is likely to alter non-semantic aspects of the speech signal as well (e.g., prosodic information, which carries emotional content). The speech signal may also contain other sounds (e.g., non-speech sounds) as part of the recorded environment, and such rearrangement may also degrade these other sounds.
Methods, systems, and apparatus as described herein may be configured to process the speech signal as a series of segments. Typical segment lengths range from about five or ten milliseconds to about forty or fifty milliseconds, and the segments may be overlapping (e.g., with adjacent segments overlapping by 25% or 50%) or nonoverlapping. In one particular example, the speech signal is divided into a series of nonoverlapping segments or “frames”, each having a length of ten milliseconds. In another particular example, each frame has a length of twenty milliseconds. Examples of sampling rates for the speech signal include (without limitation) eight, twelve, sixteen, 32, 44.1, 48, and 192 kilohertz.
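The segmentation described above may be sketched as follows. The function name, the default frame length, and the choice to drop a trailing partial frame are assumptions made for illustration only.

```python
from typing import List, Sequence


def split_into_frames(signal: Sequence[float], sample_rate_hz: int,
                      frame_ms: float = 20.0, overlap: float = 0.0) -> List[List[float]]:
    """Split a signal into fixed-length frames.

    overlap is the fraction of each frame shared with the next one
    (0.0 for nonoverlapping frames, 0.5 for 50% overlap).
    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop = int(round(frame_len * (1.0 - overlap)))
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame length and hop must be positive")
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(list(signal[start:start + frame_len]))
    return frames


# 100 ms of silence at 8 kHz, split into 10 ms frames.
silence = [0.0] * 800
print(len(split_into_frames(silence, 8000, frame_ms=10.0)))
print(len(split_into_frames(silence, 8000, frame_ms=10.0, overlap=0.5)))
```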
Voiced segments of a speech signal are typically characterized by a pitch component, which is generated by movement of the vocal cords. It may be desirable to implement method M100 to preserve prosodic information (i.e., change in pitch frequency of the speech signal over time). For example, it may be desirable to implement task T100 such that the plurality of different frequencies of the speech signal are related to a pitch fundamental frequency f0 of the speech signal. In such case, task T100 may be implemented to calculate the envelopes at frequencies that are harmonics of frequency f0 (i.e., frequencies fk=k×f0 for integer values of k from 1 to K). Examples of values for the number K of harmonics include four, five, six, seven, eight, nine, and ten, although K may have any other positive non-zero integer value.
Typical values of frequency f0 range from about 70 to 100 Hz for a male speaker to about 150 to 200 Hz for a female speaker.
Task T50 may be implemented to estimate a pitch frequency for each voiced frame of the speech signal, where the pitch frequency may vary from one frame to another. For example, task T50 may be implemented to perform a pitch estimation procedure as described in section 4.6.3 (pp. 4-44 to 4-49) of EVRC (Enhanced Variable Rate Codec) document C.S0014-C, available online at www-dot-3gpp-dot-org. Alternatively, for a case in which the speech signal has been decoded from an encoded speech signal obtained from a transmission channel (e.g., a far-end communications signal, as in a telephone call) or from storage, a current estimate of the pitch frequency (e.g., in the form of an estimate of the pitch period or “pitch lag”) will typically already be available. In voice communications using codecs that include pitch estimation, such as code-excited linear prediction (CELP) and prototype waveform interpolation (PWI), an encoded frame may include a current estimate of the pitch frequency in the form of an estimate of the pitch period or “pitch lag.”
Task T122 may be implemented such that each of the plurality N of narrowband filters is centered at a corresponding one of N pitch harmonics (e.g., for N=K). In such case, task T122 may be implemented to reconfigure the narrowband filters (e.g., periodically and/or upon some event) according to a current pitch estimate. For example, such reconfiguration may be performed at each frame, at some other interval (e.g., every two, three, five, or ten frames), or in response to some event (e.g., detection of a change in frequency f0). It may be desirable to implement task T122 to perform such reconfiguration only when the corresponding frame of the speech signal is voiced.
It may be desirable to implement each of the plurality of narrowband filters as a biquad filter (i.e., a second-order infinite-impulse-response filter) or according to another reconfigurable design. For example, task T122 may be implemented to calculate the coefficients of a biquad bandpass implementation of the narrowband filters from desired values of center frequency (e.g., corresponding pitch harmonic frequency), bandwidth, and sampling rate according to any of several known algorithms.
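One of the known algorithms for such a calculation is the bandpass form (constant 0 dB peak gain) of the widely used "Audio EQ Cookbook" biquad formulas. The sketch below uses that form as an illustrative choice, not as the design specified by this disclosure. The function names are hypothetical, and the quality factor is approximated as center frequency divided by bandwidth, which is an assumption that holds best for narrow bands.

```python
import cmath
import math
from typing import Tuple

Coefficients = Tuple[float, float, float]


def bandpass_biquad(center_hz: float, bandwidth_hz: float,
                    sample_rate_hz: float) -> Tuple[Coefficients, Coefficients]:
    """Return normalized (b, a) coefficients of a second-order bandpass filter."""
    q = center_hz / bandwidth_hz          # assumed narrowband approximation of Q
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def gain_at(b: Coefficients, a: Coefficients,
            frequency_hz: float, sample_rate_hz: float) -> float:
    """Magnitude response of the biquad at one frequency."""
    z1 = cmath.exp(-2j * math.pi * frequency_hz / sample_rate_hz)  # z^-1
    numerator = b[0] + b[1] * z1 + b[2] * z1 * z1
    denominator = a[0] + a[1] * z1 + a[2] * z1 * z1
    return abs(numerator / denominator)


# Filter centered on the second harmonic of an assumed 100 Hz pitch fundamental.
b, a = bandpass_biquad(center_hz=200.0, bandwidth_hz=40.0, sample_rate_hz=8000.0)
print(gain_at(b, a, 200.0, 8000.0), gain_at(b, a, 1000.0, 8000.0))
```

Reconfiguring the filter for a new pitch estimate then amounts to calling the coefficient function again with the updated harmonic frequency.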
Task T126 calculates envelopes of the outputs of the plurality of narrowband filters. In one example, task T126 is implemented to calculate an amplitude envelope of the output of each filter (e.g., as a magnitude of each sample of the filter output). In another example, task T126 is implemented to calculate an energy envelope of the output of each filter (e.g., as a squared magnitude of each sample of the filter output). In a further example, task T126 is implemented to calculate a complex envelope of the output of each filter (e.g., at the corresponding pitch harmonic).
In a related approach, the speech signal is modeled as a superposition of modulated carrier signals. The envelopes of these modulated carrier signals may be expected to carry intelligible cues. In one such example, the carrier signals are harmonics of the pitch fundamental f0.
where n is a sample index, f0 is the pitch fundamental frequency, and fs is the sampling frequency.
As described above, method M100 may be implemented to calculate or receive an estimate of frequency f0 for each voiced frame of the speech signal. It may be desirable to avoid an abrupt shift in frequency of the carrier signals from one pitch estimate to the next, as such a shift may introduce artifacts into the calculated envelopes.
Task T135 also includes an implementation T132A of task T132 that calculates the carrier signals as harmonics of the frequency indicated by the pitch track. In one example, task T132A is implemented to calculate each carrier signal Ck, 1<=k<=K, as a complex (i.e., quadrature) sinusoid at the corresponding frequency according to an expression such as
where f0[n] is the pitch fundamental at sample n.
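The carrier expression itself is not reproduced in the text above, so the following is only a sketch of one plausible construction. It assumes that each carrier is a unit-magnitude complex sinusoid whose phase is accumulated sample by sample from the pitch track, so that a change in f0[n] does not produce a phase discontinuity. The function name and the phase-accumulation scheme are assumptions.

```python
import cmath
import math
from typing import List, Sequence


def harmonic_carriers(pitch_track_hz: Sequence[float], sample_rate_hz: float,
                      num_harmonics: int) -> List[List[complex]]:
    """Return carriers[k-1][n], a complex sinusoid at k times the tracked pitch."""
    carriers: List[List[complex]] = [[] for _ in range(num_harmonics)]
    phase = 0.0  # running phase of the fundamental, in radians
    for f0 in pitch_track_hz:
        for k in range(1, num_harmonics + 1):
            carriers[k - 1].append(cmath.exp(1j * k * phase))
        # Advance the fundamental phase by one sample at the current pitch.
        phase += 2.0 * math.pi * f0 / sample_rate_hz
    return carriers


# Constant 100 Hz pitch at 8 kHz: the fundamental advances pi/2 every 20 samples.
carriers = harmonic_carriers([100.0] * 40, 8000.0, num_harmonics=2)
print(carriers[0][20], carriers[1][20])
```

For a constant pitch, this construction reduces to a sinusoid at frequency k×f0 sampled at rate fs.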
Task T136 calculates an envelope of the speech signal at the frequency of each carrier signal. For example, task T136 may be implemented to generate each envelope by demodulating the speech signal at the frequency of the corresponding carrier signal. In one example, task T136 is implemented to calculate each envelope Ek, 1<=k<=K, as a complex envelope according to an expression such as
$$E_k[n] = s[n] \times C_k^{*}[n],$$
where s[n] denotes the speech signal and the asterisk denotes the complex conjugate.
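The demodulation expression above may be illustrated with a short sketch. The test signal, its amplitude, and the function name are assumptions chosen so that the result is easy to check: demodulating a real cosine of amplitude A at its own frequency yields a complex envelope whose average over whole periods is A/2.

```python
import cmath
import math
from typing import List, Sequence


def demodulate(speech: Sequence[float], carrier: Sequence[complex]) -> List[complex]:
    """Complex envelope: each speech sample times the conjugate of the carrier."""
    return [s * c.conjugate() for s, c in zip(speech, carrier)]


sample_rate_hz = 8000.0
frequency_hz = 200.0
amplitude = 0.8
num_samples = 400  # ten whole periods of the 200 Hz tone

tone = [amplitude * math.cos(2.0 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(num_samples)]
carrier = [cmath.exp(2j * math.pi * frequency_hz * n / sample_rate_hz)
           for n in range(num_samples)]

envelope = demodulate(tone, carrier)
mean_envelope = sum(envelope) / num_samples
print(abs(mean_envelope))
```

The component at twice the carrier frequency that remains in the envelope is removed by subsequent low-pass filtering.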
Task T200 filters the plurality of calculated envelopes to produce a corresponding plurality of filtered envelopes. It may be desirable to implement task T200 to remove information from the envelopes that is important to intelligibility. For example, task T200 may be implemented to attenuate high-frequency components of the envelope, which may contribute to semantic content of the speech signal, while retaining low-frequency components of the envelope, which may carry prosodic information. In one example, task T200 is implemented to apply a low-pass filter having a cutoff frequency fc of five Hz to each envelope to produce the corresponding filtered envelope. Examples of values for fc that may be used in other such implementations of task T200 include, without limitation, three, four, six, and seven Hz. In another example, task T200 is implemented to apply low-pass filters having different cutoff frequencies to different envelopes (e.g., a lower cutoff frequency for the envelope that corresponds to the fundamental than for the envelope that corresponds to the highest harmonic).
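The envelope filtering described above may be sketched with a simple one-pole low-pass filter. The filter order and structure are not specified in the text, so the one-pole recursion, the function name, and the coefficient formula are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def lowpass_envelope(envelope: Sequence[complex], sample_rate_hz: float,
                     cutoff_hz: float = 5.0) -> List[complex]:
    """Smooth an envelope with a one-pole low-pass filter (assumed design)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    state = 0j
    filtered = []
    for value in envelope:
        # Move the state a small step toward the current input value.
        state += alpha * (value - state)
        filtered.append(state)
    return filtered


# A constant envelope passes through; a rapidly alternating one is suppressed.
slow = lowpass_envelope([1.0] * 8000, 8000.0)
fast = lowpass_envelope([(-1.0) ** n for n in range(8000)], 8000.0)
print(abs(slow[-1]), max(abs(v) for v in fast))
```

Different cutoff frequencies for different harmonics may be obtained by calling the function once per envelope with the desired value of the cutoff.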
Task T300 modulates a plurality of carrier signals with corresponding ones of the filtered envelopes to produce a plurality of modulated carrier signals. The carrier signals may be narrowband signals at harmonics of the current pitch fundamental f0, or the complex sinusoids Ck[n] as described above, which may have pitch-track-based frequencies. For example, task T300 may be implemented to produce the modulated carrier signals Mk, 1<=k<=K, according to an expression such as
$$M_k[n] = E_k[n]\, C_k[n],$$
where Ek in this expression denotes the corresponding filtered envelope.
If the harmonics modulated in task T300 are exact integer multiples of the pitch fundamental f0, the resulting obfuscated speech signal may sound a bit mechanical. It may be desirable to implement task T300 to modulate a plurality of carrier signals at harmonics of frequency f0 that are obtained by adding noise to the complex sinusoids Ck[n] as described above. In one example, task T300 is configured to calculate the carrier signals Ck′[n] according to an expression such as
$$C_k'[n] = C_k[n] + z_k[n],$$
where zk[n] denotes a noise signal (e.g., white or pink noise) that shifts the frequency of the carrier signal slightly to provide a jitter to the synthesized pitch. In such case, task T300 may be implemented to produce the modulated carrier signals according to an expression such as
$$M_k[n] = E_k[n]\, C_k'[n].$$
Task T400 combines the modulated carrier signals to produce the obfuscated speech signal. In one example, task T400 is implemented to produce the obfuscated speech signal according to an expression such as
where m[n] denotes the obfuscated speech signal.
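The modulation and combination steps may be sketched together as follows. The combining expression is not reproduced in the text above, so this sketch assumes that the modulated carriers are summed over the harmonics and that the real part of the sum is taken as the output. The function names, the constant toy envelopes, and the absence of any scaling factor are likewise assumptions.

```python
import cmath
import math
from typing import List, Sequence


def modulate(filtered_envelope: Sequence[complex],
             carrier: Sequence[complex]) -> List[complex]:
    """Modulated carrier: filtered envelope times carrier, sample by sample."""
    return [e * c for e, c in zip(filtered_envelope, carrier)]


def combine(modulated_carriers: Sequence[Sequence[complex]]) -> List[float]:
    """Sum the modulated carriers over k and keep the real part (assumed)."""
    return [sum(samples).real for samples in zip(*modulated_carriers)]


sample_rate_hz = 8000.0
pitch_hz = 100.0
num_samples = 80

# Two harmonic carriers and two constant toy envelopes.
carriers = [[cmath.exp(2j * math.pi * k * pitch_hz * n / sample_rate_hz)
             for n in range(num_samples)] for k in (1, 2)]
envelopes = [[0.5] * num_samples, [0.25] * num_samples]

obfuscated = combine([modulate(e, c) for e, c in zip(envelopes, carriers)])
print(obfuscated[0], obfuscated[20])
```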
While voice-based methods as described in U.S. Pat. No. 8,140,326 B2 and US Publ. Pat. Appl. No. 2012/0053931 A1 are active only during voiced segments (e.g., vowels), a modulation-based scheme as described herein may be used during voiced segments only, during both voiced and unvoiced segments, or during all segments. It is also noted that a modulation-based obfuscated speech signal as produced by an implementation of method M100 may be used in addition to other maskers, such as white or pink noise, waterfall noise, etc. For applications in which it is desired to mask speech from more than one speaker, method M100 may be implemented to perform a multi-pitch analysis to calculate a corresponding pitch track for each speaker.
Use cases for an obfuscated yet correlated speech signal include masking intelligibility of speech within a source signal. For example, it may be desirable to preserve an accurate record of an acoustic environment (e.g., an environment that is being monitored or recorded) without compromising the privacy of individuals speaking within that environment. In such case, an obfuscated speech signal as produced by an implementation of method M100 may be combined with the recorded signal in order to obscure the intelligibility of the speech.
Other applications of using pitch analysis and demodulation to separate an information-carrying component of the speech signal from a speaker-characterizing component include voice morphing.
An obfuscated speech signal as produced by an implementation of method M100 may be used to provide a privacy zone. For example, it may be desirable to confine the intelligible content of a person's voice to a particular space, such as the cubicle, office, or conference room in which the person is speaking, and to prevent persons outside that space (e.g., in an adjoining room or cubicle) from understanding that speech. In such cases, method M100 may be implemented to receive the speech signal via one or more microphones, and the resulting obfuscated speech signal may be used to drive a transducer (e.g., a loudspeaker) to create a masking sound field directed away from the privacy zone. In one example, a handset is implemented to perform method M100 and to drive a rear speaker of the handset to create a masking sound field directed away from the user's ear.
Envelope calculator 120 may be configured to apply, to each of the plurality of frames of the speech signal and for each of the plurality of different frequencies, a narrowband filter at the frequency (e.g., as described herein with reference to task T122) and to calculate, for each of the plurality of frames of the speech signal and for each of the plurality of different frequencies, an envelope of the output of the corresponding narrowband filter (e.g., as described herein with reference to task T126).
Alternatively, envelope calculator 120 may be configured to calculate, for each of the plurality of frames of the speech signal and for each of the plurality of different frequencies, a carrier signal at the frequency (e.g., as described herein with reference to task T132) and to calculate, for each of the plurality of frames of the speech signal and for each of the plurality of different frequencies, an envelope of the corresponding carrier signal (e.g., as described herein with reference to task T136).
In another example, it may be desirable to confine the intelligible content of a reproduced speech signal (e.g., a far-end voice communications signal, such as the received channel of a telephone call, or a recorded voice signal) to a particular space. In this case, a directionally controllable transducer (e.g., an array of loudspeakers) may be used to steer beams with different characteristics in various directions of emission and/or to create a private listening zone. By combining different audio contents that are beamed in different directions, we can direct a main beam to carry the communication channel towards the user and masking beams to obscure the communication channel in other directions without interfering with the main beam.
A problem may arise when the loudspeaker array is used in a public area, where people in the dark zone may be normal bystanders rather than eavesdroppers, or in a workplace, where the dark zone may encompass people at work. While such a method may be used to preserve the user's privacy, the masking signals are usually unwanted sound pollution with respect to bystanders in the surrounding environment. It may be desirable to provide a system that can achieve good privacy protection for the user and minimal sound pollution to others at the same time.
The effectiveness of an audio masking signal may be dependent on factors such as signal intensity, frequency, and/or content as well as psychoacoustic factors. A critical masking condition is typically a function of several (and possibly all) of these factors. For simplicity in explanation,
Generating a masking signal by rearranging frames of the speech signal in time, or by substituting components of the speech signal (e.g., LPC coefficients) with components from other signals, is likely to produce a signal that is uncorrelated with the speech signal. A low degree of correlation increases the likelihood that a bystander hearing both signals will perceive two different sources. A potential advantage of an obfuscated speech signal as produced by an implementation of method M100 is a high degree of correlation with the original speech signal. Such correlation increases the likelihood that a bystander will perceive only one source, providing a masking operation that may be more effective (e.g., at the same power level) and less distracting than other approaches. The bystander may not even notice that a masking activity is being performed.
Task T700 produces a second multichannel signal (a “multichannel masking signal”) that is based on the obfuscated speech signal. Task T800 drives a directionally controllable transducer to produce a sound field to include a source component that is based on the multichannel source signal and a masking component that is based on the multichannel masking signal. The source component may have an intensity (e.g., magnitude or energy) which is higher in a source direction relative to the array than in a leakage direction relative to the array that is different from the source direction, and task T700 may be implemented to produce the masking signal based on an estimated intensity of the source component in the leakage direction.
It may be desirable to implement method M300 to produce the source component by inducing constructive interference in a desired direction of the produced sound field (e.g., in the first direction) while inducing destructive interference in other directions of the produced sound field (e.g., in the second direction). Such a technique may include implementing task T500 to produce the multichannel source signal by steering a beam in a desired source direction while creating a null (implicitly or explicitly) in another direction. A beam is defined as a concentration of energy along a particular direction relative to the emitter (e.g., the loudspeaker array), and a null is defined as a valley, along a particular direction relative to the emitter, in a spatial distribution of energy.
Task T500 may be implemented, for example, to produce the multichannel source signal by applying a spatially directive filter (the “source spatially directive filter”) to the speech signal. By appropriately weighting and/or delaying the speech signal to generate each channel of the multichannel source signal, such an implementation of task T500 may be used to obtain a desired spatial distribution of the source component within the produced sound field. Task T500 may be implemented to apply a precalculated filter, to select the source spatially directive filter from among a set of precalculated filters (e.g., according to a desired beam direction and/or width), or to calculate the coefficients of the source spatially directive filter (e.g., according to any of expressions (1)-(3b) below).
Task T500 may be implemented according to a phased-array technique such that each channel of the multichannel source signal has a respective phase (i.e., time) delay. One example of such a technique is a delay-sum beamforming (DSB) filter. In such case, task T500 may be implemented to direct the source component in a desired source direction by applying a respective time delay to the speech signal to produce each channel of signal MCS10. For a case in which task T800 drives a uniformly spaced linear loudspeaker array, for example, the coefficients of channels w1 to wN of the source spatially directive filter may be calculated according to the following expression for a DSB filtering operation in the frequency domain:
$$w_n(f) = \exp\left(-j\,\frac{2\pi f}{c}\,(n-1)\,d\,\cos\varphi_s\right) \qquad (1)$$
for 1≤n≤N, where d is the spacing between the centers of the radiating surfaces of adjacent loudspeakers in the array, N is the number of loudspeakers to be driven (which may be less than or equal to the number of loudspeakers in the array), f is a frequency bin index, c is the velocity of sound, and φs is the desired angle of the beam relative to the axis of the array (e.g., the desired source direction, or the desired direction of the main lobe of the source component). For an equivalent time-domain implementation of the filter configuration, elements w1 to wN may be implemented as corresponding delays. In either domain, task T500 may also include normalization of signal MCS10 by scaling each channel of signal MCS10 by a factor of 1/N (or, equivalently, scaling source signal SS10 by 1/N).
For a frequency f1 at which the spacing d is equal to half of the wavelength λ (where λ=c/f1), expression (1) reduces to the following expression:
$$w_n(f_1) = \exp\left(-j\pi(n-1)\cos\varphi_s\right). \qquad (2)$$
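The delay-sum weights for a uniformly spaced linear array may be sketched as follows. The general phase term used here, 2πf(n−1)d cos φs/c, is an assumption of this sketch; it is chosen because it reduces to expression (2) when the spacing equals half a wavelength. The function name, the spacing, and the speed of sound are illustrative values.

```python
import cmath
import math
from typing import List


def dsb_weights(num_speakers: int, spacing_m: float, frequency_hz: float,
                steer_angle_rad: float, speed_of_sound: float = 343.0) -> List[complex]:
    """Frequency-domain delay-sum weights w_1..w_N for a uniform linear array."""
    # Phase difference between adjacent loudspeakers (assumed general form).
    phase_step = (2.0 * math.pi * frequency_hz * spacing_m
                  * math.cos(steer_angle_rad) / speed_of_sound)
    return [cmath.exp(-1j * phase_step * (n - 1))
            for n in range(1, num_speakers + 1)]


spacing_m = 0.05
half_wavelength_hz = 343.0 / (2.0 * spacing_m)  # frequency f1 at which d = lambda/2

# At f1 and a 60 degree steering angle, adjacent weights differ by pi/2 radians.
weights = dsb_weights(4, spacing_m, half_wavelength_hz, math.radians(60.0))
print(weights)
```

Scaling each weight by 1/N gives the normalization mentioned above.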
It is noted that the filter beam patterns shown in
It is also possible to implement method M300 to include multiple instances of task T500 such that portions of a directionally selective transducer (e.g., subarrays of array LA100) may be driven differently for different frequency ranges. Such an implementation may provide better directivity for wideband reproduction. In one example, a second instance of task T502 is implemented to produce an N/2-channel multichannel signal (e.g., using alternate ones of the channels w1 to wN) from a frequency band of the speech signal that is limited to a maximum frequency of c/4d, and this second multichannel signal is used to drive alternate loudspeakers of the array (i.e., a subarray that has an effective spacing of 2d).
It may be desirable to implement task T500 to apply different respective weights to channels of the multichannel source signal. For example, it may be desirable for the source spatially selective filter to include a spatial windowing function applied to the filter coefficients. Examples of such a windowing function include, without limitation, triangular and raised cosine (e.g., Hann or Hamming) windows. Use of a spatial windowing function tends to reduce both sidelobe magnitude and angular resolution (e.g., by widening the mainlobe).
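In one example, the coefficients of each channel wn of the source spatially directive filter include a respective factor sn of a spatial windowing function. In such case, expressions (1) and (2) may be modified to the following expressions, respectively:
$$w_n(f) = s_n \exp\left(-j\,\frac{2\pi f}{c}\,(n-1)\,d\,\cos\varphi_s\right) \qquad (3a)$$
$$w_n(f_1) = s_n \exp\left(-j\pi(n-1)\cos\varphi_s\right). \qquad (3b)$$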
An array having more loudspeakers allows for more degrees of freedom and may typically be used to obtain a narrower mainlobe.
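Applying a spatial windowing function to the filter coefficients may be sketched as follows. The Hamming taper is one of the raised cosine windows named above; the function names and the example steering weights (half-wavelength spacing, 60 degree beam angle) are assumptions for illustration.

```python
import cmath
import math
from typing import List, Sequence


def hamming_taper(num_speakers: int) -> List[float]:
    """Hamming window factors s_1..s_N across the array aperture."""
    if num_speakers == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_speakers - 1))
            for n in range(num_speakers)]


def apply_taper(weights: Sequence[complex], taper: Sequence[float]) -> List[complex]:
    """Scale each channel weight by its window factor."""
    return [s * w for s, w in zip(taper, weights)]


num_speakers = 5
# Steering weights at half-wavelength spacing for a 60 degree beam angle.
steering = [cmath.exp(-1j * math.pi * n * math.cos(math.radians(60.0)))
            for n in range(num_speakers)]
tapered = apply_taper(steering, hamming_taper(num_speakers))
print([round(abs(w), 2) for w in tapered])
```

The taper changes only the magnitude of each weight, so the beam direction set by the phase terms is unchanged.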
It may be desirable to implement task T500 and/or task T700 to apply a superdirective beamformer, which maximizes gain in a desired direction while minimizing the average gain over all other directions. Examples of superdirective beamformers include the minimum variance distortionless response (MVDR) beamformer (cross-covariance matrix), and the linearly constrained minimum variance (LCMV) beamformer. Other fixed or adaptive beamforming techniques, such as generalized sidelobe canceller (GSC) techniques, may also be used.
The design goal of an MVDR beamformer is to minimize the output signal power subject to a distortionless constraint in the steering direction, i.e., $\min_W W^H \Phi_{XX} W$ subject to $W^H d = 1$, where $W$ denotes the filter coefficient matrix, $\Phi_{XX}$ denotes the normalized cross-power spectral density matrix of the loudspeaker signals, and $d$ denotes the steering vector. Such a beam design may be expressed as
where dT is a farfield model for linear arrays that may be expressed as
$d^T = [1, \exp(-j\Omega f_s c^{-1} l \cos(\theta_0)), \exp(-j\Omega f_s c^{-1} 2l \cos(\theta_0)), \ldots, \exp(-j\Omega f_s c^{-1} (N-1) l \cos(\theta_0))],$
and Γv
In these equations, $\mu$ denotes a regularization parameter (e.g., a stability factor), $\theta_0$ denotes the beam direction, $f_s$ denotes the sampling rate, $\Omega$ denotes angular frequency of the signal, $c$ denotes the speed of sound, $l$ denotes the distance between the centers of the radiating surfaces of adjacent loudspeakers, $l_{nm}$ denotes the distance between the centers of the radiating surfaces of loudspeakers $n$ and $m$, $\Phi_{VV}$ denotes the normalized cross-power spectral density matrix of the noise, and $\sigma^2$ denotes transducer noise power.
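One way to see how the constraint, the steering vector, and the regularization parameter interact is a small numerical sketch. The code below uses the textbook closed-form MVDR solution, in which the weights are the regularized inverse of the noise coherence matrix applied to d and then normalized by the quadratic form in d; this is a swapped-in standard form and is not asserted to be the exact expression of this disclosure. The diffuse-field (sinc) coherence model, the speed of sound of 343 m/s, the function names, and all numeric values are likewise assumptions.

```python
import cmath
import math


def steering_vector(num, omega, fs, spacing, theta0, c=343.0):
    """Far-field steering vector d for a uniform linear array."""
    return [cmath.exp(-1j * omega * fs * n * spacing * math.cos(theta0) / c)
            for n in range(num)]


def diffuse_coherence(num, omega, fs, spacing, c=343.0):
    """Assumed noise coherence: sinc of (omega * fs * l_nm / c), l_nm = |n - m| * l."""
    def sinc(x):
        return 1.0 if x == 0 else math.sin(x) / x
    return [[sinc(omega * fs * abs(n - m) * spacing / c) for m in range(num)]
            for n in range(num)]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0j] * size
    for i in reversed(range(size)):
        acc = aug[i][size] - sum(aug[i][k] * x[k] for k in range(i + 1, size))
        x[i] = acc / aug[i][i]
    return x


def mvdr_weights(num, omega, fs, spacing, theta0, mu=0.01):
    """Return (W, d) with W = (Gamma + mu*I)^-1 d / (d^H (Gamma + mu*I)^-1 d)."""
    d = steering_vector(num, omega, fs, spacing, theta0)
    gamma = diffuse_coherence(num, omega, fs, spacing)
    for n in range(num):
        gamma[n][n] += mu  # regularization (stability factor)
    x = solve(gamma, d)
    denom = sum(dn.conjugate() * xn for dn, xn in zip(d, x))
    return [xn / denom for xn in x], d


# omega is in radians per sample, so omega * fs is in radians per second.
W, d = mvdr_weights(num=4, omega=2 * math.pi * 1000 / 16000, fs=16000,
                    spacing=0.04, theta0=math.radians(60))
print(abs(sum(wn.conjugate() * dn for wn, dn in zip(W, d))))
```

The printed magnitude is the response in the steering direction, which the normalization holds at unity regardless of the value chosen for the regularization parameter.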
Task T500 may be implemented to produce the multichannel source signal to obtain a desired spatial response with a linear loudspeaker array with uniform spacing, a linear loudspeaker array with nonuniform spacing, or a nonlinear (e.g., shaped) array, such as an array having more than one axis. In one example, task T500 is implemented to produce the multichannel source signal to obtain a desired spatial response with an array having more than one axis by using a pairwise beamforming-nullforming (BFNF) configuration as described herein with reference to a microphone array. Such an application may include a loudspeaker that is shared among two or more of the axes. Task T500 may also be performed using other directional field generation principles, such as a wave field synthesis (WFS) technique based on, e.g., the Huygens principle of wavefront propagation.
Task T800 drives the loudspeaker array, in response to the multichannel source and masking signals, to produce the sound field. Typically the produced sound field is a superposition of a source component based on the multichannel source signal and a masking component based on the masking signal. In such case, task T800 may be implemented to produce the source component of the sound field by driving the array in response to the multichannel source signal to create a corresponding beam of acoustic energy that is concentrated in the direction of the user and to create a valley in the beam response at other locations.
Task T800 may be configured to amplify, apply a gain to, and/or control a gain of the multichannel source signal, and/or to filter the multichannel source and/or masking signals. As shown in
Additionally or in the alternative to mixing corresponding channels of the multichannel source and masking signals, task T800 may be implemented to drive different loudspeakers of the array to produce the source and masking components of the field. For example, task T800 may be implemented to drive a first plurality (i.e., at least two) of the loudspeakers of the array to produce the source component and to drive a second plurality (i.e., at least two) of the loudspeakers of the array to produce the masking component, where the first and second pluralities may be separate, overlapping, or the same.
Task T800 may also be implemented to perform one or more other audio processing operations on the mixed channels to produce the driving signals. Such operations may include amplifying and/or filtering one or more (possibly all) of the mixed channels. For example, it may be desirable to implement task T800 to apply an inverse filter to compensate for differences in the array response at different frequencies and/or to implement task T800 to compensate for differences between the responses of the various loudspeakers of the array. Alternatively or additionally, it may be desirable to implement task T800 to provide impedance matching to the loudspeakers of the array (and/or to an audio-frequency transmission path that leads to the loudspeaker array).
Task T500 may be implemented to produce the multichannel source signal according to a desired direction. As described above, for example, task T500 may be implemented to produce the multichannel source signal such that the resulting source component is oriented in a desired source direction. Examples of such source direction control include, without limitation, the following:
In a first example, task T500 is implemented such that the source component is oriented in a fixed direction (e.g., center zone). For example, task T510 may be implemented such that the coefficients of channels w1 to wN of the source spatially directive filter are calculated offline (e.g., during design and/or manufacture) and applied to the speech signal at run-time. Such a configuration may be suitable for applications such as listening to a recorded speech signal and browse-talk (i.e., web surfing while on a telephone call). Typical use scenarios include on an airplane, in a transportation hub (e.g., an airport or rail station), and at a coffee shop or café. Such an implementation of task T500 may be configured to allow selection (e.g., automatically according to a detected use mode, or by the user) among different source beam widths to balance privacy (which may be important for a telephone call) against sound pollution generation (which may be a problem for speakerphone use in close public areas).
In a second example, task T500 is implemented such that the source component is oriented in a direction that is selected by the user from among two or more fixed options. For example, task T500 may be implemented such that the source component is oriented in a direction that corresponds to the user's selection from among a left zone, a center zone, and a right zone. In such case, task T510 may be implemented such that, for each direction to be selected, a corresponding set of coefficients for the channels w1 to wN of the source spatially directive filter is calculated offline (e.g., during design and/or manufacture) for selection and application to the speech signal at run-time. One example of corresponding respective directions for the left, center, and right zones (or sectors) in such a case is (45, 90, 135) degrees. Other examples include, without limitation, (30, 90, 150) and (60, 90, 120) degrees.
In a third example, task T500 is implemented such that the source component is oriented in a direction that is automatically selected from among two or more fixed options according to an estimated user position. Such a configuration may be suitable for a speakerphone application. For example, task T500 may be implemented such that the source component is oriented in a direction that corresponds to the user's estimated position from among a left zone, a center zone, and a right zone. In such case, task T510 may be implemented such that, for each direction to be selected, a corresponding set of coefficients for the channels w1 to wN of the source spatially directive filter is calculated offline (e.g., during design and/or manufacture) for selection and application to the speech signal at run-time. One example of corresponding respective directions for the left, center, and right zones in such a case is (45, 90, 135) degrees. Other examples include, without limitation, (30, 90, 150) and (60, 90, 120) degrees. It is also possible for such an implementation of task T500 to select among different source beam widths for the selected direction according to an estimated user range. For example, a narrower beam may be selected when the user is more distant from the array (e.g., to obtain a similar beam width at the user's position at different ranges).
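The automatic selection described in the third example may be sketched as a nearest-zone lookup followed by a range-dependent beam width. This is an illustrative sketch only: the nearest-direction rule, the 0.6 m target width at the user's position, and the function names are assumptions; the (45, 90, 135) degree zone directions are taken from the example above.

```python
import math

# Zone directions in degrees relative to the array axis (from the example above).
ZONE_DIRECTIONS_DEG = {"left": 45.0, "center": 90.0, "right": 135.0}


def select_zone(estimated_direction_deg, zones=ZONE_DIRECTIONS_DEG):
    """Pick the fixed zone whose direction is closest to the estimate."""
    return min(zones, key=lambda name: abs(zones[name] - estimated_direction_deg))


def beam_width_for_range(range_m, target_width_m=0.6):
    """Angular beam width (degrees) that spans target_width_m at range_m.

    A more distant user gets a narrower angular beam, so that the beam
    covers a similar physical width at the user's position.
    """
    return math.degrees(2 * math.atan(target_width_m / (2 * range_m)))


zone = select_zone(52.0)
width = beam_width_for_range(2.0)
print(zone, round(width, 1))
```

In a device, the returned zone name would index a set of filter coefficients calculated offline, and the returned width would select among the available precalculated beam widths for that zone.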
In a fourth example, task T500 is implemented such that the source component is oriented in a direction that may vary over time in response to changes in an estimated direction of the user. In such case, task T510 may be implemented to calculate the coefficients of the channels w1 to wN of the source spatially directive filter at run-time such that the orientation angle of the filter (i.e., angle φs) corresponds to the estimated direction of the user. Such an implementation of task T510 may be configured to perform an adaptive beamforming operation.
In a fifth example, task T500 is implemented such that the source component is oriented in a direction that is initially selected from among two or more fixed options according to an estimated user position (e.g., as in the third example above) and then adapted over time according to changes in the estimated user position (e.g., changes in direction and/or distance). In such case, task T510 may also be implemented to switch to (and then adapt) another of the fixed options in response to a determination that the current estimated direction of the user is within a zone corresponding to the new fixed option.
Generation of the multichannel source signal by task T500 leads to a concentration of energy of the source component in a source direction relative to an axis of the array (e.g., in the direction of angle φs). As shown in
It may be desirable to implement task T700 to direct the masking component such that its intensity is higher in one direction than another. For example, task T700 may be implemented to produce the multichannel masking signal such that an intensity of the masking component is higher in the leakage direction than in the source direction. The source direction is typically the direction of a main lobe of the source component, and the leakage direction may be the direction of a sidelobe of the source component. A sidelobe is an energy concentration of the component that is not within the main lobe.
In one example, the leakage direction is determined as the direction of a sidelobe of the source component that is adjacent to the main lobe. In another example, the leakage direction is the direction of a sidelobe of the source component whose peak intensity is not less than (e.g., is greater than) the peak intensities of all other sidelobes of the source component.
In a further alternative, the leakage direction may be based on directions of two or more sidelobes of the source component. For example, these sidelobes may be the highest sidelobes of the source component, the sidelobes having estimated intensities not less than (alternatively, greater than) a threshold value, and/or the sidelobes that are closest in direction to the same side of the main lobe of the source component. In such case, the leakage direction may be calculated as an average direction of the sidelobes, such as a weighted average among two or more directions (e.g., each weighted by intensity of the corresponding sidelobe).
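The weighted-average alternative above may be sketched as follows. The representation of each sidelobe as a (direction, linear intensity) pair, the optional threshold, and the function name are assumptions made for illustration; a plain arithmetic average of angles is meaningful here only when the selected sidelobes lie on the same side of the main lobe.

```python
def leakage_direction(sidelobes, threshold=None):
    """Intensity-weighted average direction (degrees) of selected sidelobes.

    sidelobes: list of (direction_deg, linear_intensity) pairs.
    threshold: if given, only sidelobes with intensity not less than the
               threshold contribute to the average.
    """
    chosen = [(d, i) for d, i in sidelobes if threshold is None or i >= threshold]
    if not chosen:
        raise ValueError("no sidelobe meets the threshold")
    total = sum(i for _, i in chosen)
    return sum(d * i for d, i in chosen) / total


# Two sidelobes on the same side of the main lobe; the stronger one dominates.
print(leakage_direction([(40.0, 1.0), (60.0, 3.0)]))
```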
Selection of the leakage direction may be performed during a design phase, based on a calculated response of the source spatially directive filter and/or from observation of a sound field produced using such a filter. Alternatively, task T700 may be implemented to select the leakage direction at run-time, similarly based on such a calculation and/or observation.
It may be desirable to implement task T700 to produce the masking component by inducing constructive interference in a desired direction of the produced sound field (e.g., in a leakage direction) while inducing destructive interference in other directions of the produced sound field (e.g., in the source direction). Such a technique may include implementing task T700 to produce the multichannel masking signal by steering a beam in a desired masking direction (i.e., in a leakage direction) while creating a null (implicitly or explicitly) in another direction.
Task T700 may be implemented, for example, to produce the masking signal by applying a second spatially directive filter (the “masking spatially directive filter”) to the obfuscated speech signal.
Task T700 may be implemented according to a phased-array technique such that each channel of the masking signal has a respective phase (i.e., time) delay. For example, task T700 may be implemented to perform a DSB filtering operation to direct the masking component in the leakage direction by applying a respective time delay to the noise signal to produce each channel of signal MCS20. For a case in which task T800 drives a uniformly spaced linear loudspeaker array, for example, the coefficients of channels v1 to vN of the masking spatially directive filter may be calculated according to an expression for a DSB filtering operation in the frequency domain such as expression (1) or (3a) above, where the angle φs is replaced by the desired angle φm of the beam relative to the axis of the array (e.g., the leakage direction).
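A delay-and-sum (DSB) steering of this kind may be sketched in the frequency domain as below. The coefficient form used here, a unit-magnitude phase term per channel that depends on the channel index, the spacing, and the cosine of the steering angle, is a commonly used DSB form and is assumed rather than copied from expression (1) or (3a); the speed of sound, the array size, and the function names are also assumptions.

```python
import cmath
import math


def dsb_coefficients(num, spacing_m, freq_hz, steer_deg, c=343.0):
    """Frequency-domain DSB coefficients v_1..v_N for one frequency.

    Each channel receives a phase (i.e., time) delay proportional to its
    position along the array and to cos(steer angle).
    """
    k = 2 * math.pi * freq_hz / c
    cos_steer = math.cos(math.radians(steer_deg))
    return [cmath.exp(-1j * k * n * spacing_m * cos_steer) / num for n in range(num)]


def array_response(coeffs, spacing_m, freq_hz, look_deg, c=343.0):
    """Magnitude of the array response in direction look_deg (peak is 1)."""
    k = 2 * math.pi * freq_hz / c
    cos_look = math.cos(math.radians(look_deg))
    return abs(sum(v * cmath.exp(1j * k * n * spacing_m * cos_look)
                   for n, v in enumerate(coeffs)))


# Steer the masking beam toward an assumed leakage direction of 60 degrees.
v = dsb_coefficients(num=4, spacing_m=0.04, freq_hz=2000.0, steer_deg=60.0)
print(array_response(v, 0.04, 2000.0, 60.0))   # response in the masking direction
print(array_response(v, 0.04, 2000.0, 120.0))  # response in another direction
```

The response is largest in the steering direction because the per-channel phase terms cancel the propagation phase there, so the channels add coherently.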
To avoid spatial aliasing, it may be desirable to limit the maximum frequency of the noise signal to c/2d. It is also possible to implement method M300 to include multiple instances of task T700 such that subarrays of array LA100 are driven differently for different frequency ranges.
The masking component may include more than one subcomponent. For example, the masking spatially directive filter may be configured such that the masking component includes a first masking subcomponent whose energy is concentrated in a beam on one side of the main lobe of the source component, and a second masking subcomponent whose energy is concentrated in a beam on the other side of the main lobe of the source component. The masking component typically has a null in the source direction.
Examples of masking direction control that may be performed by respective implementations of task T700 include, without limitation, the following:
1) For a case in which the direction of the source component is fixed (e.g., determined during a design phase), it may be desirable also to fix (i.e., to precalculate) the masking direction.
2) For cases in which the direction of the source component is selected (e.g., by the user or automatically) from among several fixed options, it may be desirable for each of such fixed options to also indicate a corresponding masking direction. It may also be desirable to allow for multiple masking options for a single source direction (to allow selection among different respective masking component patterns, for example, for a case in which source beam width is selectable).
3) For a case in which the source component is adapted according to a direction that may vary over time, it may be desirable to select a corresponding masking direction from among several preset options and/or to adapt the masking direction according to the changes in the source direction.
It may be desirable to design the masking spatially directive filter to have a response that is similar to the response of the source spatially directive filter in one or more leakage directions and that has a null in the source direction.
As illustrated in
The estimated intensity of the source component in a given direction φ may be based on an estimated response of the source spatially directive filter in that direction, which is typically expressed relative to an estimated peak response of the filter (e.g., the estimated response of the filter in the source direction). Task T720 may be implemented to apply a gain factor value to the obfuscated speech signal that is based on a local maximum of an estimated response of the source spatially directive filter in a direction other than the source direction (e.g., in the leakage direction). For example, task T720 may be implemented to apply a gain factor value that is based on the maximum sidelobe peak intensity of the filter response. In another example, the value of the gain factor is based on a maximum of the estimated filter response in a direction that is at least a minimum angular distance (e.g., ten or twenty degrees) from the source direction.
For a case in which a source spatially directive filter of task T500 comprises channels w1 to wN as in expression (1) above, the response $H_{\varphi_s}(\varphi, f)$ of the filter, at angle φ and frequency f and relative to the response at source direction angle φs, may be estimated as a magnitude of a sum of the relative responses of the channels w1 to wN. Such an estimated response may be expressed in decibels as:
Similar application of the principle of this example to calculate an estimated response for a spatially directive filter that is otherwise expressed will be easily understood.
Such calculation of a filter response may be performed according to a desired resolution of angle φ and frequency f. Alternatively, it may be decided for some applications that calculation of the response at a single value of frequency f (e.g., frequency f1) is sufficient. Such calculation may also be performed for each of a plurality of source spatially selective filters, each oriented in a different corresponding source direction (e.g., for each of a set of fixed options as described above with reference to examples 1, 2, 3, and 5 of task T500), such that task T720 selects the estimated response corresponding to the current source direction at run-time.
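The response estimate and the gain selection of task T720 described above may be sketched together. The sketch assumes a delay-and-sum source filter on a uniformly spaced linear array, evaluates the relative response on a one-degree grid at a single frequency, and takes the gain factor from the largest response found at least a minimum angular distance from the source direction. The filter form, the grid resolution, the numeric values, and the function names are assumptions and do not reproduce expression (5).

```python
import cmath
import math


def relative_response_db(num, spacing_m, freq_hz, source_deg, look_deg, c=343.0):
    """Estimated response at look_deg, in dB relative to the source direction."""
    k = 2 * math.pi * freq_hz / c
    delta = math.cos(math.radians(look_deg)) - math.cos(math.radians(source_deg))
    total = sum(cmath.exp(1j * k * n * spacing_m * delta) for n in range(num))
    return 20 * math.log10(max(abs(total) / num, 1e-12))


def masking_gain(num, spacing_m, freq_hz, source_deg,
                 min_separation_deg=20.0, step_deg=1.0):
    """Return (direction_deg, level_db, linear_gain) for the masking signal.

    Scans 0..180 degrees, skips directions closer than min_separation_deg
    to the source direction, and keeps the largest estimated response.
    """
    best_deg, best_db = None, float("-inf")
    for i in range(int(180 / step_deg) + 1):
        angle = i * step_deg
        if abs(angle - source_deg) < min_separation_deg:
            continue
        level = relative_response_db(num, spacing_m, freq_hz, source_deg, angle)
        if level > best_db:
            best_deg, best_db = angle, level
    return best_deg, best_db, 10 ** (best_db / 20)


direction, level_db, gain = masking_gain(4, 0.04, 2000.0, source_deg=90.0)
print(direction, round(level_db, 1), round(gain, 3))
```

Such a table of results may be calculated offline for each fixed source direction and looked up at run-time, or replaced with values measured from an actual sound field.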
Calculating a filter response as defined by the values of its coefficients (e.g., as described above with reference to expression (5)) produces a theoretical result that may differ from the actual response of the device with respect to direction (and frequency) as observed in service. It may be expected that in-service masking performance may be improved by compensating for such difference. For example, the response of the source spatially directive filter with respect to direction (and frequency, if desired) may be estimated by measuring the intensity distribution of an actual sound field that is produced using a copy of the filter. Such direct measurement of the estimated intensity may also be expected to account for other effects that may be observed in service, such as a response of the loudspeaker array, acoustic reflectance of the surfaces of the device, resonances of the housing, etc. The response of the source spatially directive filter may be estimated and stored before run-time, such as during design and/or manufacture, to be accessed by task T720 at run-time.
Task T720 may be implemented to calculate the gain factor such that the masking component has the same intensity in the leakage direction as the source component, or to obtain a different relation between these intensities (e.g., based on a loudness weighting function or other perceptual response function, such as an A-weighting curve). The value of the gain factor may also be based on an estimated intensity of the source component in one or more other directions. For example, the gain factor value may be based on estimated filter responses at two or more source sidelobes (e.g., relative to the source main lobe level). In such case, the two or more sidelobes may be selected as the highest sidelobes, the sidelobes having estimated intensities not less than (alternatively, greater than) a threshold value, and/or the sidelobes that are closest in direction to the main lobe. The gain factor value (which may be precalculated, or calculated at run-time by task T720) may be based on an average of the estimated responses at the two or more sidelobes.
The source component may have a frequency distribution that differs from one direction to another. Such variations may arise from task T500 (e.g., from the operation of applying a source spatially directive filter to generate the source component). Such variations may also arise from the response of the audio output stage and/or loudspeaker array. It may be desirable to produce the masking component according to an estimation of frequency- and direction-dependent variations in the source component. For example, it may be desirable to implement task T720 to apply different respective gain factors to different frequency bands of the obfuscated speech signal, where the gain factors are based on estimated intensities of the source component in those frequency bands and on a desired masking level.
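A per-band version of the gain calculation may be sketched as a simple mapping from estimated source levels to linear gain factors. The band names, the example levels, the representation of the desired masking level as a margin in decibels, and the function name are all hypothetical.

```python
def band_gains(source_levels_db, masking_margin_db=0.0):
    """Linear gain factor per band for the obfuscated speech signal.

    source_levels_db: estimated source-component level in the leakage
        direction for each band, in dB relative to the main lobe level.
    masking_margin_db: desired masking level relative to the estimated
        source level (0 dB means equal intensity).
    """
    return {band: 10 ** ((level + masking_margin_db) / 20)
            for band, level in source_levels_db.items()}


# Assumed estimates: the source leaks more strongly in the low band.
print(band_gains({"low": -6.0, "mid": -12.0, "high": -20.0}, masking_margin_db=3.0))
```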
Method M300 may be used in any of a wide variety of different applications. For example, method M300 may be used to reproduce the far-end communications signal in a two-way voice communication, such as a telephone call. In such a case, a primary concern may be to protect the privacy of the user (e.g., by obscuring the sidelobes of the source component). It may be desirable for the device to activate a privacy masking mode in response to an incoming and/or an outgoing telephone call.
Method M300 may also be implemented to drive a loudspeaker array to generate a sound field that includes more than one source component.
In one example of a multi-source use case, method M300 is implemented to generate source components having unrelated audio content into different respective directions. For example, each of two or more of the source components may carry far-end audio content for a different voice communication (e.g., telephone call). Alternatively or additionally, each of two or more of the source components may include an audio track for a different respective media reproduction (e.g., music, video program, etc.).
For a case in which multiple source signals are supported, each source component may be oriented in a respective direction that is fixed (e.g., selected, by a user or automatically, from among two or more fixed options), as described herein with reference to task T500. Alternatively, each of at least one (possibly all) of the source components may be oriented in a respective direction that may vary over time in response to changes in an estimated direction of a corresponding user. Typically it is desirable to implement independent direction control for each source, such that each source component or beam is steered independently of the other(s) (e.g., by a corresponding instance of task T500).
In a typical multi-source application, it may be desirable to provide about thirty or forty to sixty degrees of separation between the directions of orientation of adjacent source components. One typical application is to provide different respective source components to each of two or more users who are seated shoulder-to-shoulder (e.g., on a couch) in front of the loudspeaker array. At a typical viewing distance of 1.5 to 2.5 meters, the span occupied by a viewer is about thirty degrees. With an array of four microphones, a resolution of about fifteen degrees may be possible. With an array having more microphones, a narrower beam may be obtained.
As for a single-source case, privacy may be a concern for multi-source cases, especially if at least one of the source signals is a far-end voice communication (e.g., a telephone call). For a typical multiple-source case, however, leakage of one source component to another may be a greater concern, as each source component is potentially an interferer to other source components being produced at the same time. Accordingly, it may be desirable to generate a source component to have a null in the direction of another source component. For example, each source beam may be directed to a respective user, with a corresponding null being generated in the direction of each of one or more other users. Such design will typically cope with a “waterbed” effect, as the energy suppressed by creating a null on one side of a beam is likely to re-emerge as a sidelobe on the other side. The beam and null (or nulls) of a source component may be designed together or separately. It may be desirable to direct two or more narrow nulls of a source component next to each other to obtain a broader null. For each source signal to be obfuscated, an instance of method M300 may be performed to produce a corresponding source component and a masking component according to an estimated spatial distribution of the source component.
It may be desirable to implement method M300 to adapt the direction of the source component, and/or the direction of the masking component, in response to changes in the location of the user. For a multiple-user case, it may be desirable to implement method M300 to perform such adaptation individually for each of two or more users. In order to determine the respective source and/or masking directions, such a method may be implemented to perform user tracking.
Additionally or in the alternative, task T900 may be configured to perform passive tracking by applying a multi-microphone speech tracking algorithm to a multichannel sound signal produced by a microphone array (e.g., in response to sound emitted by the user or users). Examples of multi-microphone approaches to localization of one or more sound sources include directionally selective filtering operations, such as beamforming (e.g., filtering a sensed multichannel signal in parallel with several beamforming filters that are each fixed in a different direction, and comparing the filter outputs to identify the direction of arrival of the speech), blind source separation (e.g., independent component analysis, independent vector analysis, and/or a constrained implementation of such a technique), and estimating direction-of-arrival by comparing differences in level and/or phase between a pair of channels of the multichannel microphone signal. Such a task may include performing an echo cancellation operation on the multichannel microphone signal to block sound components that were produced by the loudspeaker array and/or performing a voice recognition operation on at least one channel of the multichannel microphone signal.
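Of the localization approaches listed above, direction-of-arrival estimation from the phase difference between a pair of channels is the simplest to sketch. The example below assumes a far-field source, a single frequency bin, a microphone spacing below half a wavelength (so that the phase difference is unambiguous), and an angle measured from the axis of the pair; the function names and numeric values are hypothetical.

```python
import cmath
import math


def phase_difference(bin_a, bin_b):
    """Phase of channel b relative to channel a for one frequency bin."""
    return cmath.phase(bin_b * bin_a.conjugate())


def doa_from_phase(phase_diff_rad, freq_hz, spacing_m, c=343.0):
    """Direction of arrival (degrees from the pair axis) from a phase difference."""
    cos_theta = c * phase_diff_rad / (2 * math.pi * freq_hz * spacing_m)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against noise overshoot
    return math.degrees(math.acos(cos_theta))


# Simulate a 1 kHz plane wave arriving from 60 degrees at a 5 cm pair:
# the wavefront reaches microphone b earlier by spacing * cos(theta) / c.
true_deg, freq, spacing = 60.0, 1000.0, 0.05
lead = 2 * math.pi * freq * spacing * math.cos(math.radians(true_deg)) / 343.0
bin_a = 1 + 0j
bin_b = cmath.exp(1j * lead)
print(doa_from_phase(phase_difference(bin_a, bin_b), freq, spacing))
```

In practice such an estimate would typically be combined over many frequency bins and frames, and echo cancellation would be applied first so that sound produced by the loudspeaker array does not bias the result.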
For accurate tracking results, it may be desirable for the microphone array (or other sensing device) to be aligned in space with the loudspeaker array in a reciprocal arrangement. In an ideally reciprocal arrangement, the direction to a point source P as indicated by a sensing device (e.g., a microphone array and associated tracking logic) is the same as the source direction used to direct a beam from the loudspeaker array to the point source P. A reciprocal arrangement may be used to create the privacy zones (e.g., by beamforming and nullforming) at the actual locations of the users. If the sensing and emitting arrays are not arranged reciprocally, the accuracy of creating a beam or null for designated source locations may be unacceptable. The quality of the null especially may suffer from such a mismatch, as a nullforming operation typically requires a higher level of accuracy than a comparable beamforming operation.
With an array of many microphones, a narrow beam may be produced. With a four-microphone array, for example, a resolution of about fifteen degrees is possible. For a typical television viewing distance of two meters, a span of fifteen degrees corresponds to a shoulder-to-shoulder width, and a span of thirty degrees corresponds to a typical angle between the directions of adjacent users seated on a couch. A typical application is to provide forty to sixty degrees between the directions of adjacent source beams.
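The correspondence between angular span and physical width at the viewing distance follows from simple trigonometry, as the following sketch shows. The function name is hypothetical; the two-meter distance and the fifteen- and thirty-degree spans are the values given above.

```python
import math


def span_width_m(distance_m, span_deg):
    """Width (meters) subtended by an angular span at a given distance."""
    return 2 * distance_m * math.tan(math.radians(span_deg) / 2)


# At two meters, fifteen degrees spans roughly a shoulder-to-shoulder width,
# and thirty degrees spans roughly the distance between adjacent seats.
print(round(span_width_m(2.0, 15.0), 3))
print(round(span_width_m(2.0, 30.0), 3))
```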
It may be desirable to direct two or more narrow nulls together to obtain a broad null. The beam and nulls may be designed together or separately. Such design will typically cope with a “waterbed” effect, as creating a null on one side is likely to create a sidelobe on the other side.
As described above, it may be desirable to implement method M300 to support privacy zones for multiple listeners. In such an implementation of method M330, task T900 may be implemented to track multiple users. Multiple source beams may be directed to respective users, with corresponding nulls being generated in other user directions.
Any beamforming method may be used to estimate the direction of each of one or more users as described above. For example, a reciprocal implementation of a method used to generate the source and/or masking components may be applied.
For a one-dimensional (1-D) array of microphones, a direction of arrival (DOA) for a source may be easily defined in a range of, for example, −90° to +90°. For an array that includes more than two microphones at arbitrary relative locations (e.g., a non-coaxial array), it may be desirable to use a straightforward extension of one-dimensional principles as described above, e.g. (θ1, θ2) in a two-pair case in two dimensions; (θ1, θ2, θ3) in a three-pair case in three dimensions, etc. A key problem is how to apply spatial filtering to such a combination of paired 1-D DOA estimates.
We may apply a beamformer/null beamformer (BFNF) as shown in
As the approach shown in
where lp indicates the distance between the microphones of pair p (reciprocally, between a pair of loudspeakers), w indicates the frequency bin number, and fs indicates the sampling frequency.
A method as described herein (e.g., method M300) may be combined with automatic speech recognition (ASR) for system control. The method may be configured, for example, to use an embedded speech recognition engine to create a privacy zone whenever an activation code is uttered (e.g., a particular phrase, such as “Qualcomm voice”). Such a method may also be configured to recognize words spoken after the activation code as command and/or payload parameters. Examples of such parameters include a command to initiate a telephone call to a particular person (e.g., “call Mom”).
Audio output stage 800 may be configured to mix the multichannel source and masking signals to produce a plurality of driving signals SD10-1 to SD10-N (e.g., as described herein with reference to tasks T800 and T810). Audio output stage 800 may be implemented to perform such mixing in the digital domain or in the analog domain. For example, audio output stage 800 may be configured to produce a driving signal for each loudspeaker channel by converting digital source and masking signals to analog, or by converting a digital mixed signal to analog. Audio output stage 800 may also be configured to amplify, apply a gain to, and/or control a gain of the source signal; to filter the source and/or masking signals; to provide impedance matching to the loudspeakers of the array; and/or to perform any other desired audio processing operation.
Each of the microphones for direction estimation as discussed herein (e.g., with reference to location and tracking of one or more users) may have a response that is omnidirectional, bidirectional, or unidirectional (e.g., cardioid). The various types of microphones that may be used include (without limitation) piezoelectric microphones, dynamic microphones, and electret microphones. It is expressly noted that the microphones may be implemented more generally as transducers sensitive to radiations or emissions other than sound. In one such example, the microphone array is implemented to include one or more ultrasonic transducers (e.g., transducers sensitive to acoustic frequencies greater than fifteen, twenty, twenty-five, thirty, forty, or fifty kilohertz or more).
Each of apparatus A100, A102, A105, A150, A200, A300, A302, A330, MF100, MF102, MF150, MF200, MF300, MF302, and MF330 may be implemented as a combination of hardware (e.g., a processor) with software and/or with firmware. Such apparatus may also include an audio preprocessing stage AP10 as shown in
It may be desirable for audio preprocessing stage AP10 to produce each microphone signal as a digital signal, that is to say, as a sequence of samples. Audio preprocessing stage AP20, for example, includes analog-to-digital converters (ADCs) C10a, C10b, and C10c that are each arranged to sample the corresponding analog signal. Typical sampling rates for acoustic applications include 8 kHz, 12 kHz, 16 kHz, and other frequencies in the range of from about 8 to about 16 kHz, although sampling rates as high as about 44.1, 48, or 192 kHz may also be used. Typically, converters C10a, C10b, and C10c will be configured to sample each signal at the same rate.
In this example, audio preprocessing stage AP20 also includes digital preprocessing stages P20a, P20b, and P20c that are each configured to perform one or more preprocessing operations (e.g., spectral shaping) on the corresponding digitized channel to produce a corresponding one of a left microphone signal AL10, a center microphone signal AC10, and a right microphone signal AR10 for input to task T900 or direction estimator 900. Typically, stages P20a, P20b, and P20c will be configured to perform the same functions on each signal. It is also noted that preprocessing stage AP10 may be configured to produce a different version of a signal from at least one of the microphones (e.g., at a different sampling rate and/or with different spectral shaping) for content use, such as to provide a near-end speech signal in a voice communication (e.g., a telephone call). Although
Loudspeaker array LA100 may include cone-type and/or rectangular loudspeakers. The spacings between adjacent loudspeakers may be uniform or nonuniform, and the array may be linear or nonlinear. As noted above, techniques for generating the multichannel signals for driving the array may include pairwise BFNF and MVDR.
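As a minimal illustration of how per-loudspeaker driving weights can shape a directional pattern, the following Python sketch uses plain delay-and-sum steering for a uniform linear array. It is a simpler stand-in chosen for brevity and is not the pairwise BFNF or MVDR technique noted above. The function names, the single-frequency far-field model, the angle convention (degrees measured from broadside), and the speed of sound of 343 m/s are all assumptions made for illustration only.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def steering_weights(num_speakers, spacing_m, freq_hz, steer_deg):
    """Delay-and-sum weights (hypothetical helper) for a uniform linear array.

    Each weight applies the conjugate of the far-field propagation phase
    toward steer_deg, so the contributions add coherently in that direction.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND_M_S  # wavenumber
    s = math.sin(math.radians(steer_deg))
    return [
        cmath.exp(-1j * k * n * spacing_m * s) / num_speakers
        for n in range(num_speakers)
    ]


def array_response(weights, spacing_m, freq_hz, look_deg):
    """Magnitude of the far-field array response toward look_deg."""
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND_M_S
    s = math.sin(math.radians(look_deg))
    total = sum(
        w * cmath.exp(1j * k * n * spacing_m * s)
        for n, w in enumerate(weights)
    )
    return abs(total)
```

With these weights the response has unit magnitude in the steering direction and is generally lower elsewhere; a practical design would apply such weights per frequency band and could substitute an MVDR or other beamformer design for the fixed phase compensation.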
When beamforming techniques are used to produce spatial patterns for broadband signals, selection of the transducer array geometry involves a trade-off between low and high frequencies. To enhance the direct handling of low frequencies by the beamformer, a larger loudspeaker spacing is preferred. At the same time, if the spacing between loudspeakers is too large, the ability of the array to reproduce the desired effects at high frequencies will be limited by a lower aliasing threshold. To avoid spatial aliasing, the wavelength of the highest frequency component to be reproduced by the array should be greater than twice the distance between adjacent loudspeakers.
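The spatial aliasing condition stated above (wavelength greater than twice the loudspeaker spacing) may be expressed as a frequency limit of c/(2d), where c is the speed of sound and d is the spacing between adjacent loudspeakers. The following Python sketch evaluates that limit; the function names and the speed of sound of 343 m/s are assumptions for illustration and do not appear in the disclosure.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def max_alias_free_frequency(spacing_m, speed_m_s=SPEED_OF_SOUND_M_S):
    """Frequency (Hz) at which the wavelength equals twice the spacing.

    Components below this frequency satisfy wavelength > 2 * spacing.
    """
    if spacing_m <= 0:
        raise ValueError("spacing must be positive")
    return speed_m_s / (2.0 * spacing_m)


def max_spacing_for_frequency(freq_hz, speed_m_s=SPEED_OF_SOUND_M_S):
    """Spacing (m) at which a component at freq_hz reaches the limit."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return speed_m_s / (2.0 * freq_hz)


def is_alias_free(spacing_m, highest_freq_hz, speed_m_s=SPEED_OF_SOUND_M_S):
    """True if the highest reproduced component has wavelength > 2 * spacing."""
    return highest_freq_hz < max_alias_free_frequency(spacing_m, speed_m_s)
```

Under the assumed speed of sound, a spacing of 2 cm gives a limit of about 8.6 kHz, while a spacing of 5 cm gives a limit of about 3.4 kHz, which illustrates the trade-off between low-frequency handling and the aliasing threshold.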
As consumer devices become smaller and smaller, the form factor may constrain the placement of loudspeaker arrays. For example, it may be desirable for a laptop, netbook, or tablet computer or a high-definition video display to have a built-in loudspeaker array. Due to the size constraints, the loudspeakers may be small and unable to reproduce a desired bass region. Alternatively, the loudspeakers may be large enough to reproduce the bass region but spaced too closely to support beamforming or other acoustic imaging. Thus it may be desirable to provide the processing to produce a bass signal in a closely spaced loudspeaker array in which beamforming is employed.
It is expressly noted that the principles described herein are not limited to use with a uniform linear array of loudspeakers (e.g., as shown in
In the example of
Although particular examples of directional masking in a range of 180 degrees are shown, the principles described herein may be extended to provide directional masking across any desired angular range in a plane (e.g., a two-dimensional range). Such extension may include the addition of appropriately placed loudspeakers to the array. For example,
Such principles may also be extended to provide directional masking across any desired angular range in space (e.g., in three dimensions).
A psychoacoustic phenomenon exists that listening to higher harmonics of a signal may create a perceptual illusion of hearing the missing fundamentals. Thus, one way to achieve a sensation of bass components from small loudspeakers is to generate higher harmonics from the bass components and play back the harmonics instead of the actual bass components. Descriptions of algorithms for substituting higher harmonics to achieve a psychoacoustic sensation of bass without an actual low-frequency signal presence (also called “psychoacoustic bass enhancement” or PBE) may be found, for example, in U.S. Pat. No. 5,930,373 (Shashoua et al., issued Jul. 27, 1999) and U.S. Publ. Pat. Appls. Nos. 2006/0159283 A1 (Mathew et al., published Jul. 20, 2006), 2009/0147963 A1 (Smith, published Jun. 11, 2009), and 2010/0158272 A1 (Vickers, published Jun. 24, 2010). Such enhancement may be particularly useful for reproducing low-frequency sounds with devices that have form factors which restrict the integrated loudspeaker or loudspeakers to be physically small. For example, task T800 may be implemented to perform PBE to produce the driving signals that drive the array of loudspeakers to produce the combined sound field.
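The general idea of substituting higher harmonics for bass components can be sketched in Python as follows. This is only an illustrative sketch and is not the algorithm of any of the references cited above: it isolates the bass with a one-pole low-pass filter, generates harmonics by full-wave rectification (which yields even harmonics of a bass tone), and high-pass filters both the harmonics and the original signal before mixing. The function names, the one-pole filters, the 100 Hz cutoff, and the unit harmonic gain are all assumptions.

```python
import math


def one_pole_lowpass(x, cutoff_hz, fs_hz):
    """First-order recursive low-pass filter (unity gain at DC)."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs_hz)
    y, state = [], 0.0
    for sample in x:
        state += a * (sample - state)
        y.append(state)
    return y


def one_pole_highpass(x, cutoff_hz, fs_hz):
    """Complementary high-pass: input minus its low-passed version."""
    low = one_pole_lowpass(x, cutoff_hz, fs_hz)
    return [sample - l for sample, l in zip(x, low)]


def pbe_sketch(x, fs_hz, cutoff_hz=100.0, harmonic_gain=1.0):
    """Replace part of the bass content with generated harmonics."""
    bass = one_pole_lowpass(x, cutoff_hz, fs_hz)
    # Full-wave rectification creates even harmonics of the bass content.
    rectified = [abs(sample) for sample in bass]
    # High-pass removes the DC term introduced by rectification.
    harmonics = one_pole_highpass(rectified, cutoff_hz, fs_hz)
    # Attenuate the actual bass in the direct path.
    passthrough = one_pole_highpass(x, cutoff_hz, fs_hz)
    return [p + harmonic_gain * h for p, h in zip(passthrough, harmonics)]


def tone_amplitude(x, freq_hz, fs_hz):
    """Amplitude estimate of the component at freq_hz (single-bin DFT)."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    re = sum(sample * math.cos(w * n) for n, sample in enumerate(x))
    im = sum(sample * math.sin(w * n) for n, sample in enumerate(x))
    return 2.0 * math.hypot(re, im) / len(x)
```

For a 60 Hz input tone, for example, the output of `pbe_sketch` contains a component at 120 Hz that is absent from the input, while the 60 Hz component is attenuated. A practical implementation would likely use steeper filters and a nonlinearity that also produces odd harmonics.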
It may be desirable to apply PBE not only to reduce the effect of low-frequency reproducibility limits, but also to reduce the effect of directivity loss at low frequencies. For example, it may be desirable to combine PBE with spatially directive filtering (e.g., beamforming) to create the perception of low-frequency content in a range that is steerable by a beamformer. In one example, any of the implementations of task T500 as described herein is modified to perform PBE on the source signal and to produce the multichannel source signal from the PBE-processed source signal. In the same example or in an alternative example, any of the implementations of task T700 as described herein is modified to perform PBE on the masking signal and to produce the multichannel masking signal from the PBE-processed masking signal.
The use of a loudspeaker array to produce directional beams from an enhanced signal results in an output that has a much lower perceived frequency range than an output from the audio signal without such enhancement. Additionally, it becomes possible to use a more relaxed beamformer design to steer the enhanced signal, which may support a reduction of artifacts and/or computational complexity and allow more efficient steering of bass components with arrays of small loudspeakers. At the same time, such a system can protect small loudspeakers from damage by low-frequency signals (e.g., rumble). Additional description of such enhancement techniques, which may be combined with directional masking as described herein, may be found in, e.g., U.S. patent application Ser. No. 13/190,464, entitled “SYSTEMS, METHODS, AND APPARATUS FOR ENHANCED ACOUSTIC IMAGING” (filed Jul. 25, 2011).
The methods and apparatus disclosed herein may be applied generally in any transceiving and/or audio sensing application, including mobile or otherwise portable instances of such applications and/or sensing of signal components from far-field sources. For example, the range of configurations disclosed herein includes communications devices that reside in a wireless telephony communication system configured to employ a code-division multiple-access (CDMA) over-the-air interface. Nevertheless, it would be understood by those skilled in the art that a method and apparatus having features as described herein may reside in any of the various communication systems employing a wide range of technologies known to those of skill in the art, such as systems employing Voice over IP (VoIP) over wired and/or wireless (e.g., CDMA, TDMA, FDMA, and/or TD-SCDMA) transmission channels.
It is expressly contemplated and hereby disclosed that communications devices disclosed herein may be adapted for use in networks that are packet-switched (for example, wired and/or wireless networks arranged to carry audio transmissions according to protocols such as VoIP) and/or circuit-switched. It is also expressly contemplated and hereby disclosed that communications devices disclosed herein may be adapted for use in narrowband coding systems (e.g., systems that encode an audio frequency range of about four or five kilohertz) and/or for use in wideband coding systems (e.g., systems that encode audio frequencies greater than five kilohertz), including whole-band wideband coding systems and split-band wideband coding systems.
The foregoing presentation of the described configurations is provided to enable any person skilled in the art to make or use the methods and other structures disclosed herein. The flowcharts, block diagrams, and other structures shown and described herein are examples only, and other variants of these structures are also within the scope of the disclosure. Various modifications to these configurations are possible, and the generic principles presented herein may be applied to other configurations as well. Thus, the present disclosure is not intended to be limited to the configurations shown above but rather is to be accorded the widest scope consistent with the principles and novel features disclosed in any fashion herein, including in the attached claims as filed, which form a part of the original disclosure.
Those of skill in the art will understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, and symbols that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Important design requirements for implementation of a configuration as disclosed herein may include minimizing processing delay and/or computational complexity (typically measured in millions of instructions per second or MIPS), especially for computation-intensive applications, such as playback of compressed audio or audiovisual information (e.g., a file or stream encoded according to a compression format, such as one of the examples identified herein) or applications for wideband communications (e.g., voice communications at sampling rates higher than eight kilohertz, such as 12, 16, 32, 44.1, 48, or 192 kHz).
Goals of a multi-microphone processing system may include achieving ten to twelve dB in overall noise reduction, preserving voice level and color during movement of a desired speaker, obtaining a perception that the noise has been moved into the background instead of an aggressive noise removal, dereverberation of speech, and/or enabling the option of post-processing for more aggressive noise reduction.
An apparatus as disclosed herein (e.g., any among apparatus A100, A102, A105, A150, A200, A300, A302, A330, MF100, MF102, MF150, MF200, MF300, MF302, and MF330) may be implemented in any combination of hardware with software, and/or with firmware, that is deemed suitable for the intended application. For example, the elements of such an apparatus may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of the elements of the apparatus may be implemented within the same array or arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips).
One or more elements of the various implementations of the apparatus disclosed herein may also be implemented in whole or in part as one or more sets of instructions arranged to execute on one or more fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, digital signal processors, FPGAs (field-programmable gate arrays), ASSPs (application-specific standard products), and ASICs (application-specific integrated circuits). Any of the various elements of an implementation of an apparatus as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions, also called “processors”), and any two or more, or even all, of these elements may be implemented within the same such computer or computers.
A processor or other means for processing as disclosed herein may be fabricated as one or more electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips). Examples of such arrays include fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, DSPs, FPGAs, ASSPs, and ASICs. A processor or other means for processing as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions) or other processors. It is possible for a processor as described herein to be used to perform tasks or execute other sets of instructions that are not directly related to a directional sound masking procedure as described herein, such as a task relating to another operation of a device or system in which the processor is embedded (e.g., an audio sensing device). It is also possible for part of a method as disclosed herein to be performed by a processor of the audio sensing device and for another part of the method to be performed under the control of one or more other processors.
Those of skill will appreciate that the various illustrative modules, logical blocks, circuits, and tests and other operations described in connection with the configurations disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. Such modules, logical blocks, circuits, and operations may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an ASIC or ASSP, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to produce the configuration as disclosed herein. For example, such a configuration may be implemented at least in part as a hard-wired circuit, as a circuit configuration fabricated into an application-specific integrated circuit, or as a firmware program loaded into non-volatile storage or a software program loaded from or into a data storage medium as machine-readable code, such code being instructions executable by an array of logic elements such as a general purpose processor or other digital signal processing unit. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. A software module may reside in a non-transitory storage medium such as RAM (random-access memory), ROM (read-only memory), nonvolatile RAM (NVRAM) such as flash RAM, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, hard disk, a removable disk, or a CD-ROM; or in any other form of storage medium known in the art. 
An illustrative storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
It is noted that the various methods disclosed herein (e.g., any among methods M100, M102, M150, M200, M300, M310, M320, M330, and other methods disclosed by way of description of the operation of the various apparatus described herein) may be performed by an array of logic elements such as a processor, and that the various elements of an apparatus as described herein may be implemented as modules designed to execute on such an array. As used herein, the term “module” or “sub-module” can refer to any method, apparatus, device, unit or computer-readable data storage medium that includes computer instructions (e.g., logical expressions) in software, hardware or firmware form. It is to be understood that multiple modules or systems can be combined into one module or system and one module or system can be separated into multiple modules or systems to perform the same functions. When implemented in software or other computer-executable instructions, the elements of a process are essentially the code segments to perform the related tasks, such as with routines, programs, objects, components, data structures, and the like. The term “software” should be understood to include source code, assembly language code, machine code, binary code, firmware, macrocode, microcode, any one or more sets or sequences of instructions executable by an array of logic elements, and any combination of such examples. The program or code segments can be stored in a processor-readable storage medium or transmitted by a computer data signal embodied in a carrier wave over a transmission medium or communication link.
The implementations of methods, schemes, and techniques disclosed herein may also be tangibly embodied (for example, in tangible, computer-readable features of one or more computer-readable storage media as listed herein) as one or more sets of instructions readable and/or executable by a machine including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The term “computer-readable medium” may include any medium that can store or transfer information, including volatile, nonvolatile, removable and non-removable media. Examples of a computer-readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an erasable ROM (EROM), a floppy diskette or other magnetic storage, a CD-ROM/DVD or other optical storage, a hard disk, a fiber optic medium, a radio frequency (RF) link, or any other medium which can be used to store the desired information and which can be accessed. The computer data signal may include any signal that can propagate over a transmission medium such as electronic network channels, optical fibers, air, electromagnetic, RF links, etc. The code segments may be downloaded via computer networks such as the Internet or an intranet. In any case, the scope of the present disclosure should not be construed as limited by such embodiments.
Each of the tasks of the methods described herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. In a typical application of an implementation of a method as disclosed herein, an array of logic elements (e.g., logic gates) is configured to perform one, more than one, or even all of the various tasks of the method. One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions), embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other nonvolatile memory cards, semiconductor memory chips, etc.), that is readable and/or executable by a machine (e.g., a computer) including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The tasks of an implementation of a method as disclosed herein may also be performed by more than one such array or machine. In these or other implementations, the tasks may be performed within a device for wireless communications such as a cellular telephone or other device having such communications capability. Such a device may be configured to communicate with circuit-switched and/or packet-switched networks (e.g., using one or more protocols such as VoIP). For example, such a device may include RF circuitry configured to receive and/or transmit encoded frames.
It is expressly disclosed that the various methods disclosed herein may be performed by a portable communications device such as a handset, headset, or portable digital assistant (PDA), and that the various apparatus described herein may be included within such a device. A typical real-time (e.g., online) application is a telephone conversation conducted using such a mobile device.
In one or more exemplary embodiments, the operations described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, such operations may be stored on or transmitted over a computer-readable medium as one or more instructions or code. The term “computer-readable media” includes both computer-readable storage media and communication (e.g., transmission) media. By way of example, and not limitation, computer-readable storage media can comprise an array of storage elements, such as semiconductor memory (which may include without limitation dynamic or static RAM, ROM, EEPROM, and/or flash RAM), or ferroelectric, magnetoresistive, ovonic, polymeric, or phase-change memory; CD-ROM or other optical disk storage; and/or magnetic disk storage or other magnetic storage devices. Such storage media may store information in the form of instructions or data structures that can be accessed by a computer. Communication media can comprise any medium that can be used to carry desired program code in the form of instructions or data structures and that can be accessed by a computer, including any medium that facilitates transfer of a computer program from one place to another. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technology such as infrared, radio, and/or microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technology such as infrared, radio, and/or microwave are included in the definition of medium. 
Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray Disc™ (Blu-Ray Disc Association, Universal City, Calif.), where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
An acoustic signal processing apparatus as described herein (e.g., any among apparatus A100, A102, A105, A150, A200, A300, A302, A330, MF100, MF102, MF150, MF200, MF300, MF302, and MF330) may be incorporated into an electronic device that accepts speech input in order to control certain operations, or may otherwise benefit from separation of desired noises from background noises, such as communications devices. Many applications may benefit from enhancing or separating clear desired sound from background sounds originating from multiple directions. Such applications may include human-machine interfaces in electronic or computing devices which incorporate capabilities such as voice recognition and detection, speech enhancement and separation, voice-activated control, and the like. It may be desirable to implement such an acoustic signal processing apparatus to be suitable in devices that only provide limited processing capabilities.
The elements of the various implementations of the modules, elements, and devices described herein may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or gates. One or more elements of the various implementations of the apparatus described herein may also be implemented in whole or in part as one or more sets of instructions arranged to execute on one or more fixed or programmable arrays of logic elements such as microprocessors, embedded processors, IP cores, digital signal processors, FPGAs, ASSPs, and ASICs.
It is possible for one or more elements of an implementation of an apparatus as described herein to be used to perform tasks or execute other sets of instructions that are not directly related to an operation of the apparatus, such as a task relating to another operation of a device or system in which the apparatus is embedded. It is also possible for one or more elements of an implementation of such an apparatus to have structure in common (e.g., a processor used to execute portions of code corresponding to different elements at different times, a set of instructions executed to perform tasks corresponding to different elements at different times, or an arrangement of electronic and/or optical devices performing operations for different elements at different times).
The present Application for Patent claims priority to Provisional Application No. 61/666,196, entitled “SYSTEMS, METHODS, APPARATUS, AND COMPUTER-READABLE MEDIA FOR GENERATING CORRELATED MASKING SIGNAL,” filed Jun. 29, 2012, and assigned to the assignee hereof.
Number | Date | Country
---|---|---
61666196 | Jun 2012 | US