1. Field
This disclosure is related to audio signal processing.
2. Background
An existing approach to audio masking applies the fundamental concept that a tone can mask other tones that are at nearby frequencies and are below a certain relative level. With a high enough level, a white noise signal may be used to mask speech, and such a sound masking design may be used to support secure conversations in offices.
Other approaches to restricting the area within which a sound may be heard include ultrasonic loudspeakers, which require different fundamental hardware designs; headphones, which provide no freedom if the user desires ventilation at his or her head; and general sound maskers as may be used in a national security office, which typically involve large-scale fixed construction.
A method of signal processing according to a general configuration includes determining a frequency profile of a source signal. This method also includes, based on said frequency profile of the source signal, producing a masking signal according to a masking frequency profile, wherein the masking frequency profile is different than the frequency profile of the source signal. This method also includes producing a sound field comprising (A) a source component that is based on the source signal and (B) a masking component that is based on the masking signal. Computer-readable storage media (e.g., non-transitory media) having tangible features that cause a machine reading the features to perform such a method are also disclosed.
An apparatus for signal processing according to a general configuration includes means for determining a frequency profile of a source signal. This apparatus also includes means for producing a masking signal, based on said frequency profile of the source signal, according to a masking frequency profile, wherein the masking frequency profile is different than the frequency profile of the source signal. This apparatus also includes means for producing the sound field comprising (A) a source component that is based on the source signal and (B) a masking component that is based on the masking signal.
An apparatus for signal processing according to another general configuration includes a signal analyzer configured to determine a frequency profile of a source signal. This apparatus also includes a signal generator configured to produce a masking signal, based on said frequency profile of the source signal, according to a masking frequency profile, wherein the masking frequency profile is different than the frequency profile of the source signal. This apparatus also includes an audio output stage configured to drive an array of loudspeakers to produce the sound field, wherein the sound field comprises (A) a source component that is based on the source signal and (B) a masking component that is based on the masking signal.
In monophonic signal masking, a single-channel masking signal drives a loudspeaker to produce the masking field. Descriptions of such masking may be found, for example, in U.S. patent application Ser. No. 13/155,187, filed Jun. 7, 2011, entitled “GENERATING A MASKING SIGNAL ON AN ELECTRONIC DEVICE.” When the intensity of such a masking field is high enough to effectively interfere with a potential eavesdropper, the masking field may also be distracting to the user and/or may be unnecessarily loud to bystanders.
When more than one loudspeaker is available to produce the masking field, the spatial pattern of the emitted sound can be designed and controlled. A loudspeaker array may be used to steer beams with different characteristics in various directions of emission and/or to create a personal surround-sound bubble. By combining different audio contents that are beamed in different directions, a private listening zone may be created in which the communication channel beam is targeted toward the user while noise or masking beams are targeted in other directions to mask and obscure the communication channel.
While such a method may be used to preserve the user's privacy, the masking signals are usually unwanted sound pollution with respect to bystanders in the surrounding environment. Masking principles may be applied as disclosed herein to generate a masker that is as efficient as possible and has the minimum level needed, according to spatial location and source signal contents. Such principles may be used to implement an automatically controlled system that uses information about the spatial environment to generate masking signals with a reduced level of sound pollution to the environment.
Unless expressly limited by its context, the term “signal” is used herein to indicate any of its ordinary meanings, including a state of a memory location (or set of memory locations) as expressed on a wire, bus, or other transmission medium. Unless expressly limited by its context, the term “generating” is used herein to indicate any of its ordinary meanings, such as computing or otherwise producing. Unless expressly limited by its context, the term “calculating” is used herein to indicate any of its ordinary meanings, such as computing, evaluating, estimating, and/or selecting from a plurality of values. Unless expressly limited by its context, the term “obtaining” is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from an external device), and/or retrieving (e.g., from an array of storage elements). Unless expressly limited by its context, the term “selecting” is used to indicate any of its ordinary meanings, such as identifying, indicating, applying, and/or using at least one, and fewer than all, of a set of two or more. Unless expressly limited by its context, the term “determining” is used to indicate any of its ordinary meanings, such as deciding, establishing, concluding, calculating, selecting, and/or evaluating. Where the term “comprising” is used in the present description and claims, it does not exclude other elements or operations. The term “based on” (as in “A is based on B”) is used to indicate any of its ordinary meanings, including the cases (i) “derived from” (e.g., “B is a precursor of A”), (ii) “based on at least” (e.g., “A is based on at least B”) and, if appropriate in the particular context, (iii) “equal to” (e.g., “A is equal to B”). Similarly, the term “in response to” is used to indicate any of its ordinary meanings, including “in response to at least.”
References to a “location” of a microphone of a multi-microphone audio sensing device indicate the location of the center of an acoustically sensitive face of the microphone, unless otherwise indicated by the context. The term “channel” is used at times to indicate a signal path and at other times to indicate a signal carried by such a path, according to the particular context. Unless otherwise indicated, the term “series” is used to indicate a sequence of two or more items. The term “logarithm” is used to indicate the base-ten logarithm, although extensions of such an operation to other bases are within the scope of this disclosure. The term “frequency component” is used to indicate one among a set of frequencies or frequency bands of a signal, such as a sample of a frequency domain representation of the signal (e.g., as produced by a fast Fourier transform) or a subband of the signal (e.g., a Bark scale or mel scale subband).
Unless indicated otherwise, any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa), and any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa). The term “configuration” may be used in reference to a method, apparatus, and/or system as indicated by its particular context. The terms “method,” “process,” “procedure,” and “technique” are used generically and interchangeably unless otherwise indicated by the particular context. A “task” having multiple subtasks is also a method. The terms “apparatus” and “device” are also used generically and interchangeably unless otherwise indicated by the particular context. The terms “element” and “module” are typically used to indicate a portion of a greater configuration. Unless expressly limited by its context, the term “system” is used herein to indicate any of its ordinary meanings, including “a group of elements that interact to serve a common purpose.” The term “plurality” means “two or more.” Any incorporation by reference of a portion of a document shall also be understood to incorporate definitions of terms or variables that are referenced within the portion, where such definitions appear elsewhere in the document, as well as any figures referenced in the incorporated portion.
It may be assumed that in the near-field and far-field regions of an emitted sound field, the wavefronts are spherical and planar, respectively. The near-field may be defined as that region of space which is less than one wavelength away from a sound emitter (e.g., a loudspeaker array). Under this definition, the distance to the boundary of the region varies inversely with frequency. At frequencies of 200, 700, and 2000 hertz, for example, the distance to a one-wavelength boundary is about 170, 49, and 17 centimeters, respectively. It may be useful instead to consider the near-field/far-field boundary to be at a particular distance from the sound emitter (e.g., fifty centimeters from a loudspeaker of the array or from the centroid of the array, or one meter or 1.5 meters from a loudspeaker of the array or from the centroid of the array). Unless otherwise indicated by the particular context, a far-field approximation is assumed herein.
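The one-wavelength distances quoted above follow from the relation λ=c/f. The following minimal Python sketch reproduces that arithmetic; the speed of sound (343 m/s, roughly air at room temperature) and the function name are assumptions made for illustration.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value; varies with temperature


def one_wavelength_boundary_cm(frequency_hz, c=SPEED_OF_SOUND_M_PER_S):
    """Distance in centimeters to a one-wavelength boundary (lambda = c / f)."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return 100.0 * c / frequency_hz


if __name__ == "__main__":
    for f in (200.0, 700.0, 2000.0):
        print(f"{f:6.0f} Hz: {one_wavelength_boundary_cm(f):6.1f} cm")
```

Because the boundary shrinks as frequency rises, a fixed boundary distance (such as the fifty-centimeter example above) may be simpler to apply across a wideband signal.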
A problem may arise when the loudspeaker array is used in a public area, where people in the dark zone may not be eavesdroppers, but rather normal bystanders who do not wish to experience unwanted sound pollution. It may be desirable to provide a system that can achieve good privacy protection for the user and minimal sound pollution to the public at the same time.
The effectiveness of an audio masking signal may be dependent on factors such as signal intensity, frequency, and/or content as well as psychoacoustic factors. A critical masking condition is typically a function of several (and possibly all) of these factors. For simplicity in explanation,
As noted above, it may be desirable to operate an apparatus to create a privacy zone using spatial patterns of components of a sound field. Such an apparatus may be implemented to include systems for design and control of a masking component of a combined sound field. Design procedures for such a masker are described herein, as well as combinations of reciprocal beam-and-nullforming and masker design for an interactive in-situ privacy zone. Extensions to multiple-user cases are also disclosed. Such principles may be applied to obtain a new system design that advances data fusion capabilities, provides better performance than a single-loudspeaker version of a masking system, and/or takes into consideration both signal contents and spatial response.
Directed source components may be combined with masker design for interactive in-situ privacy zone creation. If only one privacy zone is needed (e.g., for a single-user case), then method M100 may be configured to combine beamforming of the source signal with a spatial masker. If more than one privacy zone is desired (e.g., for a multiple-user case), then method M100 may be configured to combine beamforming and nullforming of each source signal with a spatial masker.
It is typical for each channel of the multichannel source signal to be associated with a corresponding particular loudspeaker of the array. Likewise, it is typical for each channel of the masking signal to be associated with a corresponding particular loudspeaker of the array.
It may be desirable to implement method M100 to produce the source component by inducing constructive interference in a desired direction of the produced sound field (e.g., in the first direction) while inducing destructive interference in other directions of the produced sound field (e.g., in the second direction). Such a technique may include implementing task T100 to produce the multichannel source signal by steering a beam in a desired source direction while creating a null (implicitly or explicitly) in another direction. A beam is defined as a concentration of energy along a particular direction relative to the emitter (e.g., the loudspeaker array), and a null is defined as a valley, along a particular direction relative to the emitter, in a spatial distribution of energy.
Task T100 may be implemented, for example, to produce the multichannel source signal by applying a spatially directive filter (the “source spatially directive filter”) to the source signal. By appropriately weighting and/or delaying the source signal to generate each channel of the multichannel source signal, such an implementation of task T100 may be used to obtain a desired spatial distribution of the source component within the produced sound field.
Task T100 may be implemented according to a phased-array technique such that each channel of the multichannel source signal has a respective phase (i.e., time) delay. One example of such a technique is a delay-sum beamforming (DSB) filter. Task T100 may be implemented to perform a DSB filtering operation to direct the source component in a desired source direction by applying a respective time delay to the source signal to produce each channel of signal MCS10. For a case in which task T300 drives a uniformly spaced linear loudspeaker array, for example, task T110 may be implemented to perform a DSB filtering operation in the frequency domain by calculating the coefficients of channels w1 to wN of the source spatially directive filter according to the following expression:
$$w_n(f) = \exp\left(-j\,\frac{2\pi f}{c}\,(n-1)\,d\cos\varphi_s\right) \tag{1}$$
for 1≦n≦N, where d is the spacing between the centers of the radiating surfaces of adjacent loudspeakers in the array, N is the number of loudspeakers to be driven (which may be less than or equal to the number of loudspeakers in the array), f is a frequency bin index, c is the velocity of sound, and φs is the desired angle of the beam relative to the axis of the array (e.g., the desired source direction, or the desired direction of the main lobe of the source component). Equivalent time-domain implementations of channels w1 to wN may be implemented as corresponding delays. In either domain, task T100 may also include normalization of signal MCS10 by scaling each channel of signal MCS10 by a factor of 1/N (or, equivalently, scaling source signal SS10 by 1/N).
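As an illustration, the delay-sum coefficients described above might be computed as in the following minimal Python sketch. It assumes coefficients of the form w_n(f)=exp(−j2πf(n−1)d cos φs/c), which is consistent with the half-wavelength case of expression (2). The function names, the default speed of sound (343 m/s), and the sign convention of the far-field response model are assumptions made for illustration only.

```python
import cmath
import math


def dsb_weights(num_speakers, spacing_m, freq_hz, steer_deg, c=343.0):
    """Delay-sum weights w_n(f) for n = 1..N (list index n-1)."""
    phi = math.radians(steer_deg)
    return [
        cmath.exp(-1j * 2.0 * math.pi * freq_hz * n * spacing_m * math.cos(phi) / c)
        for n in range(num_speakers)  # n here plays the role of (n - 1)
    ]


def array_response(weights, spacing_m, freq_hz, look_deg, c=343.0):
    """Far-field response magnitude toward look_deg, with 1/N normalization.

    Assumes loudspeaker n sits at position n*d along the array axis, so its
    contribution toward look_deg carries a phase of +2*pi*f*n*d*cos(theta)/c.
    """
    theta = math.radians(look_deg)
    total = sum(
        w * cmath.exp(1j * 2.0 * math.pi * freq_hz * n * spacing_m * math.cos(theta) / c)
        for n, w in enumerate(weights)
    )
    return abs(total) / len(weights)


if __name__ == "__main__":
    w = dsb_weights(num_speakers=4, spacing_m=0.04, freq_hz=2000.0, steer_deg=60.0)
    for angle in (30, 60, 90, 120):
        print(angle, round(array_response(w, 0.04, 2000.0, angle), 3))
```

With the 1/N normalization, the response is unity in the steered direction and smaller elsewhere, which is the constructive/destructive interference behavior described above.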
For a frequency f1 at which the spacing d is equal to half of the wavelength λ (where λ=c/f1), expression (1) reduces to the following expression:
$$w_n(f_1) = \exp\left(-j\pi\,(n-1)\cos\varphi_s\right). \tag{2}$$
It is noted that the filter beam patterns shown in
It is also possible to implement method M100 to include multiple instances of task T100 such that subarrays of array LA100 are driven differently for different frequency ranges. Such an implementation may provide better directivity for wideband reproduction. In one example, a second instance of task T102 is implemented to produce an N/2-channel multichannel signal (e.g., using alternate ones of the filters w1 to wN) from a frequency band of the source signal that is limited to a maximum frequency of c/4d, and this multichannel signal is used to drive alternate loudspeakers of the array (i.e., a subarray that has an effective spacing of 2d).
It may be desirable to implement task T100 to apply different respective weights to channels of the multichannel source signal. For example, it may be desirable to implement task T100 to apply a spatial windowing function to the filter coefficients. Examples of such a windowing function include, without limitation, triangular and raised cosine (e.g., Hann or Hamming) windows. Use of a spatial windowing function tends to reduce both sidelobe magnitude and angular resolution (e.g., by widening the mainlobe).
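In one example, task T100 is implemented such that the coefficients of each channel wn of the source spatially directive filter include a respective factor sn of a spatial windowing function. In such case, expressions (1) and (2) may be modified to the following expressions, respectively:
$$w_n(f) = s_n \exp\left(-j\,\frac{2\pi f}{c}\,(n-1)\,d\cos\varphi_s\right) \tag{3a}$$
$$w_n(f_1) = s_n \exp\left(-j\pi\,(n-1)\cos\varphi_s\right). \tag{3b}$$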
An array having more loudspeakers allows for more degrees of freedom and may typically be used to obtain a narrower mainlobe.
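The tradeoff introduced by a spatial windowing function can be examined numerically. The following Python sketch is illustrative only: it assumes half-wavelength spacing, uses a Hamming taper as the factors sn, and estimates the peak sidelobe level on a one-degree grid. All function names and the grid resolution are assumptions.

```python
import cmath
import math


def hamming_window(n_points):
    """Raised-cosine (Hamming) taper s_1..s_N across the array."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def windowed_weights(window, steer_deg):
    """Weights s_n * exp(-j*pi*(n-1)*cos(phi_s)) for half-wavelength spacing."""
    phi = math.radians(steer_deg)
    return [s * cmath.exp(-1j * math.pi * n * math.cos(phi))
            for n, s in enumerate(window)]


def response_db(weights, look_deg):
    """Far-field response in dB relative to the steered-direction peak."""
    theta = math.radians(look_deg)
    total = sum(w * cmath.exp(1j * math.pi * n * math.cos(theta))
                for n, w in enumerate(weights))
    norm = sum(abs(w) for w in weights)
    return 20.0 * math.log10(max(abs(total) / norm, 1e-12))


def peak_sidelobe_db(weights, steer_deg):
    """Highest response outside the main lobe, searched on a 1-degree grid."""
    levels = [response_db(weights, a) for a in range(181)]
    lo = hi = int(round(steer_deg))
    # Walk downhill from the peak to the first minimum on each side.
    while lo > 0 and levels[lo - 1] < levels[lo]:
        lo -= 1
    while hi < 180 and levels[hi + 1] < levels[hi]:
        hi += 1
    outside = levels[:lo] + levels[hi + 1:]
    return max(outside) if outside else float("-inf")


if __name__ == "__main__":
    n = 8
    uniform = windowed_weights([1.0] * n, 90.0)
    tapered = windowed_weights(hamming_window(n), 90.0)
    print("uniform peak sidelobe (dB):", round(peak_sidelobe_db(uniform, 90.0), 1))
    print("Hamming peak sidelobe (dB):", round(peak_sidelobe_db(tapered, 90.0), 1))
```

In this sketch the tapered weights yield lower sidelobes at the cost of a wider main lobe, in line with the behavior described above.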
It may be desirable to implement task T100 and/or task T200 to apply a superdirective beamformer, which maximizes gain in a desired direction while minimizing the average gain over all other directions. Examples of superdirective beamformers include the minimum variance distortionless response (MVDR) beamformer (cross-covariance matrix), and the linearly constrained minimum variance (LCMV) beamformer. Other fixed or adaptive beamforming techniques, such as generalized sidelobe canceller (GSC) techniques, may also be used.
The design goal of an MVDR beamformer is to minimize the output signal power subject to a distortionless constraint, i.e., $\min_W W^H \Phi_{XX} W$ subject to $W^H d = 1$, where W denotes the filter coefficient matrix, ΦXX denotes the normalized cross-power spectral density matrix of the loudspeaker signals, and d denotes the steering vector. Such a beam design may be expressed as
where dT is a farfield model for linear arrays that may be expressed as
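$$d^T = \left[1,\ \exp\left(-j\Omega f_s c^{-1} l\cos\theta_0\right),\ \exp\left(-j\Omega f_s c^{-1}\,2l\cos\theta_0\right),\ \ldots,\ \exp\left(-j\Omega f_s c^{-1}(N-1)\,l\cos\theta_0\right)\right],$$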
and Γv
In these equations, μ denotes a regularization parameter (e.g., a stability factor), θ0 denotes the beam direction, fs denotes the sampling rate, Ω denotes angular frequency of the signal, c denotes the speed of sound, l denotes the distance between the centers of the radiating surfaces of adjacent loudspeakers, lnm denotes the distance between the centers of the radiating surfaces of loudspeakers n and m, ΦVV denotes the normalized cross-power spectral density matrix of the noise, and σ2 denotes transducer noise power.
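For illustration, the constrained minimization stated above has a well-known closed-form solution, W=Φ⁻¹d/(dᴴΦ⁻¹d). The Python sketch below computes it with diagonal loading. Note the assumptions: the way the regularization parameter μ enters (added to the diagonal of the matrix) is a common choice and not necessarily the form intended in this disclosure; Ω is taken to be in radians per sample so that Ωfs is the angular frequency in radians per second; and the function names and the small linear solver are hypothetical helpers.

```python
import cmath
import math


def steering_vector(n_speakers, omega, fs_hz, spacing_m, theta0_deg, c=343.0):
    """Far-field model for a uniform linear array with spacing l."""
    cos_t = math.cos(math.radians(theta0_deg))
    return [cmath.exp(-1j * omega * fs_hz * n * spacing_m * cos_t / c)
            for n in range(n_speakers)]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x


def mvdr_weights(phi, d, mu=1e-3):
    """W = (Phi + mu*I)^-1 d / (d^H (Phi + mu*I)^-1 d); mu is an assumed loading."""
    n = len(d)
    loaded = [[phi[i][k] + (mu if i == k else 0.0) for k in range(n)]
              for i in range(n)]
    y = solve(loaded, d)
    denom = sum(di.conjugate() * yi for di, yi in zip(d, y))
    return [yi / denom for yi in y]


if __name__ == "__main__":
    d = steering_vector(4, 2.0 * math.pi * 1000.0 / 16000.0, 16000.0, 0.04, 60.0)
    phi = [[1.0 if i == k else 0.3 for k in range(4)] for i in range(4)]
    w = mvdr_weights(phi, d)
    print(abs(sum(wi.conjugate() * di for wi, di in zip(w, d))))
```

When the matrix is the identity, this solution reduces to the delay-sum weights scaled by 1/N, which is a convenient sanity check.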
Task T200 may be implemented to drive a linear loudspeaker array with uniform spacing, a linear loudspeaker array with nonuniform spacing, or a nonlinear (e.g., shaped) array, such as an array having more than one axis. In one example, task T200 is implemented to drive an array having more than one axis by using a pairwise beamforming-nullforming (BFNF) configuration as described herein with reference to a microphone array. Such an application may include a loudspeaker that is shared among two or more of the axes. Task T200 may also be performed using other directional field generation principles, such as a wave field synthesis (WFS) technique based on, e.g., the Huygens principle of wavefront propagation.
Task T300 drives the loudspeaker array, in response to the multichannel source and masking signals, to produce the sound field. Typically the produced sound field is a superposition of a source component based on the multichannel source signal and a masking component based on the masking signal. In such case, task T300 may be implemented to produce the source component of the sound field by driving the array in response to the multichannel source signal to create a corresponding beam of acoustic energy that is concentrated in the direction of the user and to create a valley in the beam response at other locations.
Task T300 may be configured to amplify, apply a gain to, and/or control a gain of the multichannel source signal, and/or to filter the multichannel source and/or masking signals. As shown in
Additionally or in the alternative to mixing corresponding channels of the multichannel source and masking signals, task T300 may be implemented to drive different loudspeakers of the array to produce the source and masking components of the field. For example, task T300 may be implemented to drive a first plurality (i.e., at least two) of the loudspeakers of the array to produce the source component and to drive a second plurality (i.e., at least two) of the loudspeakers of the array to produce the masking component, where the first and second pluralities may be separate, overlapping, or the same.
Task T300 may also be implemented to perform one or more other audio processing operations on the mixed channels to produce the driving signals. Such operations may include amplifying and/or filtering one or more (possibly all) of the mixed channels. For example, it may be desirable to implement task T300 to apply an inverse filter to compensate for differences in the array response at different frequencies and/or to implement task T300 to compensate for differences between the responses of the various loudspeakers of the array. Alternatively or additionally, it may be desirable to implement task T300 to provide impedance matching to the loudspeakers of the array (and/or to an audio-frequency transmission path that leads to the loudspeaker array).
Task T100 may be implemented to produce the multichannel source signal according to a desired direction. As described above, for example, task T100 may be implemented to produce the multichannel source signal such that the resulting source component is oriented in a desired source direction. Examples of such source direction control include, without limitation, the following:
In a first example, task T100 is implemented such that the source component is oriented in a fixed direction (e.g., center zone). For example, task T110 may be implemented such that the coefficients of channels w1 to wN of the source spatially directive filter are calculated offline (e.g., during design and/or manufacture) and applied to the source signal at run-time. Such a configuration may be suitable for applications such as media viewing, web surfing, and browse-talk (i.e., web surfing while on a telephone call). Typical use scenarios include on an airplane, in a transportation hub (e.g., an airport or rail station), and at a coffee shop or café. Such an implementation of task T100 may be configured to allow selection (e.g., automatically according to a detected use mode, or by the user) among different source beam widths to balance privacy (which may be important for a telephone call) against sound pollution generation (which may be a problem for media viewing in close public areas).
In a second example, task T100 is implemented such that the source component is oriented in a direction that is selected by the user from among two or more fixed options. For example, task T100 may be implemented such that the source component is oriented in a direction that corresponds to the user's selection from among a left zone, a center zone, and a right zone. In such case, task T110 may be implemented such that, for each direction to be selected, a corresponding set of coefficients for the channels w1 to wN of the source spatially directive filter is calculated offline (e.g., during design and/or manufacture) for selection and application to the source signal at run-time. One example of corresponding respective directions for the left, center, and right zones (or sectors) in such a case is (45, 90, 135) degrees. Other examples include, without limitation, (30, 90, 150) and (60, 90, 120) degrees.
In a third example, task T100 is implemented such that the source component is oriented in a direction that is automatically selected from among two or more fixed options according to an estimated user position. For example, task T100 may be implemented such that the source component is oriented in a direction that corresponds to the user's estimated position from among a left zone, a center zone, and a right zone. In such case, task T110 may be implemented such that, for each direction to be selected, a corresponding set of coefficients for the channels w1 to wN of the source spatially directive filter is calculated offline (e.g., during design and/or manufacture) for selection and application to the source signal at run-time. One example of corresponding respective directions for the left, center, and right zones in such a case is (45, 90, 135) degrees. Other examples include, without limitation, (30, 90, 150) and (60, 90, 120) degrees. It is also possible for such an implementation of task T100 to select among different source beam widths for the selected direction according to an estimated user range. For example, a narrower beam may be selected when the user is more distant from the array (e.g., to obtain a similar beam width at the user's position at different ranges).
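The automatic selection described in this third example might be sketched as follows. This Python fragment is a hypothetical illustration: it picks the fixed zone whose direction is nearest to the estimated user direction, and it derives an angular beam width that spans a constant physical width at the user's range. The zone table uses the (45, 90, 135) degree example above; the nearest-direction rule, the 0.5 meter target width, and all names are assumptions.

```python
import math

# Example zone directions in degrees, taken from the (45, 90, 135) case above.
ZONE_DIRECTIONS_DEG = {"left": 45.0, "center": 90.0, "right": 135.0}


def select_zone(estimated_user_deg, zones=ZONE_DIRECTIONS_DEG):
    """Return the name of the fixed zone nearest to the estimated direction."""
    return min(zones, key=lambda name: abs(zones[name] - estimated_user_deg))


def beam_width_deg(user_range_m, target_width_m=0.5):
    """Angular width spanning target_width_m at the user's range (assumed rule)."""
    if user_range_m <= 0:
        raise ValueError("range must be positive")
    return math.degrees(2.0 * math.atan(target_width_m / (2.0 * user_range_m)))


if __name__ == "__main__":
    print(select_zone(52.0), round(beam_width_deg(0.5), 1))
    print(select_zone(97.0), round(beam_width_deg(2.0), 1))
```

At run-time, the selected zone name would index a table of precalculated coefficient sets, and the computed width would select among the available beam widths for that zone.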
In a fourth example, task T100 is implemented such that the source component is oriented in a direction that may vary over time in response to changes in an estimated direction of the user. In such case, task T110 may be implemented to calculate the coefficients of the channels w1 to wN of the source spatially directive filter at run-time such that the orientation angle of the filter (i.e., angle φs) corresponds to the estimated direction of the user. Such an implementation of task T110 may be configured to perform an adaptive beamforming operation.
In a fifth example, task T100 is implemented such that the source component is oriented in a direction that is initially selected from among two or more fixed options according to an estimated user position (e.g., as in the third example above) and then adapted over time according to changes in the estimated user position (e.g., changes in direction and/or distance). In such case, task T110 may also be implemented to switch to (and then adapt) another of the fixed options in response to a determination that the current estimated direction of the user is within a zone corresponding to the new fixed option.
Task T200 may be implemented to generate the masking signal based on a noise signal, such as a white noise or pink noise signal. The noise signal may also be a signal whose frequency characteristics vary over time, such as a music signal, a street noise signal, or a babble noise signal. Babble noise is the sound of many speakers (actual or simulated) talking simultaneously such that their speech is not individually intelligible. In practice, use of low-level pink or white noise or another stationary noise signal, such as a constant stream or waterfall sound, may be less annoying to bystanders and/or less distracting to the user than babble noise.
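As a rough illustration of two of the stationary noise sources mentioned above, the following Python sketch generates white noise and an approximately pink noise. The pink generator uses a Voss-McCartney style sum of random rows updated at octave-spaced rates, which is one common approximation and is swapped in here only as an example; the function names, row count, and the lag-one autocorrelation check are assumptions.

```python
import random


def white_noise(n_samples, rng):
    """Independent Gaussian samples (flat spectrum on average)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def approx_pink_noise(n_samples, rng, n_rows=8):
    """Approximate pink noise: row k is redrawn every 2**k samples, rows are summed."""
    rows = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
    out = []
    for i in range(n_samples):
        for k in range(n_rows):
            if i % (1 << k) == 0:
                rows[k] = rng.gauss(0.0, 1.0)
        out.append(sum(rows) / n_rows ** 0.5)
    return out


def lag1_autocorrelation(x):
    """Normalized lag-one autocorrelation; near zero for white noise."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    num = sum(a * b for a, b in zip(centered, centered[1:]))
    den = sum(a * a for a in centered)
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)
    print(round(lag1_autocorrelation(white_noise(4096, rng)), 3))
    print(round(lag1_autocorrelation(approx_pink_noise(4096, rng)), 3))
```

The pink sequence shows strong positive correlation between adjacent samples, reflecting its emphasis on low frequencies relative to white noise.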
In a further example, the noise signal is an ambient noise signal as detected from the current acoustic environment by one or more microphones of the device. In such case, it may be desirable to implement task T200 to perform echo cancellation and/or nonstationary noise cancellation on the ambient noise signal before using it to produce the masking signal.
Generation of the multichannel source signal by task T100 leads to a concentration of energy of the source component in a source direction relative to an axis of the array (e.g., in the direction of angle φs). As shown in
It may be desirable to implement task T200 to direct the masking component such that its intensity is higher in one direction than another. For example, task T200 may be implemented to produce the masking signal such that an intensity of the masking component is higher in the leakage direction than in the source direction. The source direction is typically the direction of a main lobe of the source component, and the leakage direction may be the direction of a sidelobe of the source component. A sidelobe is an energy concentration of the component that is not within the main lobe.
In one example, the leakage direction is determined as the direction of a sidelobe of the source component that is adjacent to the main lobe. In another example, the leakage direction is the direction of a sidelobe of the source component whose peak intensity is not less than (e.g., is greater than) the peak intensities of all other sidelobes of the source component.
In a further alternative, the leakage direction may be based on directions of two or more sidelobes of the source component. For example, these sidelobes may be the highest sidelobes of the source component, the sidelobes having estimated intensities not less than (alternatively, greater than) a threshold value, and/or the sidelobes that are closest in direction to the same side of the main lobe of the source component. In such case, the leakage direction may be calculated as an average direction of the sidelobes, such as a weighted average among two or more directions (e.g., each weighted by intensity of the corresponding sidelobe).
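The leakage-direction alternatives described above can be sketched as follows. In this hypothetical Python example, each sidelobe is given as a (direction in degrees, linear peak intensity) pair; one function returns the direction of the highest sidelobe, and another returns an intensity-weighted average over sidelobes at or above a threshold. The data layout and function names are assumptions.

```python
def highest_sidelobe_direction(sidelobes):
    """Direction of the sidelobe whose peak intensity is not less than any other."""
    if not sidelobes:
        raise ValueError("no sidelobes given")
    return max(sidelobes, key=lambda s: s[1])[0]


def weighted_leakage_direction(sidelobes, threshold=0.0):
    """Intensity-weighted average direction of sidelobes at or above threshold."""
    kept = [(angle, power) for angle, power in sidelobes if power >= threshold]
    total = sum(power for _, power in kept)
    if not kept or total <= 0.0:
        raise ValueError("no sidelobes meet the threshold")
    return sum(angle * power for angle, power in kept) / total


if __name__ == "__main__":
    # Hypothetical sidelobes on one side of a main lobe steered to 90 degrees.
    lobes = [(120.0, 0.20), (150.0, 0.10), (170.0, 0.02)]
    print(highest_sidelobe_direction(lobes))
    print(weighted_leakage_direction(lobes, threshold=0.05))
```

Averaging is most meaningful for sidelobes on the same side of the main lobe; averaging across both sides could yield a direction that falls within the main lobe itself.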
Selection of the leakage direction may be performed during a design phase, based on a calculated response of the source spatially directive filter and/or from observation of a sound field produced using such a filter. Alternatively, task T200 may be implemented to select the leakage direction at run-time, similarly based on such a calculation and/or observation.
It may be desirable to implement task T200 to produce the masking component by inducing constructive interference in a desired direction of the produced sound field (e.g., in a leakage direction) while inducing destructive interference in other directions of the produced sound field (e.g., in the source direction). Such a technique may include implementing task T200 to produce the masking signal by steering a beam in a desired masking direction (i.e., in a leakage direction) while creating a null (implicitly or explicitly) in another direction.
Task T200 may be implemented, for example, to produce the masking signal by applying a second spatially directive filter (the “masking spatially directive filter”) to the noise signal.
Task T200 may be implemented according to a phased-array technique such that each channel of the masking signal has a respective phase (i.e., time) delay. For example, task T200 may be implemented to perform a DSB filtering operation to direct the masking component in the leakage direction by applying a respective time delay to the noise signal to produce each channel of signal MCS20. For a case in which task T300 drives a uniformly spaced linear loudspeaker array, for example, task T210 may be implemented to perform a DSB filtering operation by calculating the coefficients of filters v1 to vN according to an expression such as expression (1) or (3a) above, where the angle φs is replaced by the desired angle φm of the beam relative to the axis of the array (e.g., the leakage direction).
To avoid spatial aliasing, it may be desirable to limit the maximum frequency of the noise signal to c/2d. It is also possible to implement method M100 to include multiple instances of task T200 such that subarrays of array LA100 are driven differently for different frequency ranges.
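The frequency limits mentioned above follow directly from the loudspeaker spacing. The short Python sketch below computes the c/2d limit for the full array and the corresponding limit for a subarray of alternate loudspeakers (effective spacing 2d, giving c/4d). The function names, the two-band split, and the default speed of sound are assumptions for illustration.

```python
def max_unaliased_frequency_hz(spacing_m, c=343.0):
    """Highest frequency for which spacing is at most half a wavelength (c / 2d)."""
    if spacing_m <= 0:
        raise ValueError("spacing must be positive")
    return c / (2.0 * spacing_m)


def band_plan(spacing_m, c=343.0):
    """Hypothetical two-band plan: full array, and alternate-loudspeaker subarray."""
    return {
        "full_array_max_hz": max_unaliased_frequency_hz(spacing_m, c),
        "alternate_subarray_max_hz": max_unaliased_frequency_hz(2.0 * spacing_m, c),
    }


if __name__ == "__main__":
    print(band_plan(0.04))
```

For a spacing of four centimeters, for example, the noise signal would be band-limited to a little under 4.3 kHz for the full array under these assumptions.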
The masking component may include more than one subcomponent. For example, the masking spatially directive filter may be configured such that the masking component includes a first masking subcomponent whose energy is concentrated in a beam on one side of the main lobe of the source component, and a second masking subcomponent whose energy is concentrated in a beam on the other side of the main lobe of the source component. The masking component typically has a null in the source direction.
Examples of masking direction control that may be performed by respective implementations of task T200 include, without limitation, the following:
1) For a case in which the direction of the source component is fixed (e.g., determined during a design phase), it may be desirable also to fix (i.e., to precalculate) the masking direction.
2) For cases in which the direction of the source component is selected (e.g., by the user or automatically) from among several fixed options, it may be desirable for each of such fixed options to also indicate a corresponding masking direction. It may also be desirable to allow for multiple masking options for a single source direction (to allow selection among different respective masking component patterns, for example, for a case in which source beam width is selectable).
3) For a case in which the source component is adapted according to a direction that may vary over time, it may be desirable to select a corresponding masking direction from among several preset options and/or to adapt the masking direction according to the changes in the source direction.
It may be desirable to design the masking spatially directive filter to have a response that is similar to the response of the source spatially selective filter in one or more leakage directions and has a null in the source direction.
As illustrated in
As noted above, task T200 may be implemented (e.g., as task T210) to produce the masking signal by applying a masking spatially directive filter to a noise signal. In such case, it may be desirable to modify the noise signal to achieve a desired masking effect.
The intensity of the source component in a particular direction is dependent on the response of the source spatially directive filter with respect to that direction. The intensity of the source component is also determined by the level of the source signal, which may be expected to change over time.
The estimated intensity of the source component in a given direction φ may be based on an estimated response of the source spatially directive filter in that direction, which is typically expressed relative to an estimated peak response of the filter (e.g., the estimated response of the filter in the source direction). Task TA200 may be implemented to apply a gain factor value to the noise signal that is based on a local maximum of an estimated response of the source spatially directive filter in a direction other than the source direction (e.g., in the leakage direction). For example, task TA200 may be implemented to apply a gain factor value that is based on the maximum sidelobe peak intensity of the filter response. In another example, the value of the gain factor is based on a maximum of the estimated filter response in a direction that is at least a minimum angular distance (e.g., ten or twenty degrees) from the source direction.
For a case in which a source spatially directive filter of task T100 comprises channels w1 to wN as in expression (1) above, the response Hφs(φ,f) of the filter, at angle φ and frequency f and relative to the response at source direction angle φs, may be estimated as a magnitude of a sum of the relative responses of the channels w1 to wN. Such an estimated response may be expressed in decibels as:
Similar application of the principle of this example to calculate an estimated response for a spatially directive filter that is otherwise expressed will be easily understood.
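As a non-limiting illustration of such a calculation, the following sketch estimates the response of a delay-and-sum filter for a uniformly spaced linear array, in decibels relative to the response in the source direction, and finds the largest estimated response outside a minimum angular distance from the source direction. The delay-and-sum form, the function names, and the parameter values are assumptions made for this example; the sketch is not a reproduction of expression (5).

```python
import cmath
import math


def dsb_response_db(phi_deg, phi_s_deg, freq_hz, num_speakers, spacing_m, c=343.0):
    """Estimated response at angle phi, in dB relative to the source direction."""
    diff = math.cos(math.radians(phi_deg)) - math.cos(math.radians(phi_s_deg))
    total = sum(
        cmath.exp(-2j * math.pi * freq_hz * n * spacing_m * diff / c)
        for n in range(num_speakers)
    )
    magnitude = abs(total) / num_speakers  # equals 1.0 when phi == phi_s
    return 20.0 * math.log10(max(magnitude, 1e-12))  # floor avoids log(0)


def max_leakage_response_db(phi_s_deg, freq_hz, num_speakers, spacing_m,
                            min_separation_deg=20.0, step_deg=1.0):
    """Largest response at least min_separation_deg from the source direction."""
    best = -math.inf
    steps = int(round(180.0 / step_deg))
    for k in range(steps + 1):
        phi = k * step_deg
        if abs(phi - phi_s_deg) >= min_separation_deg:
            best = max(best, dsb_response_db(phi, phi_s_deg, freq_hz,
                                             num_speakers, spacing_m))
    return best
```

A gain factor value for the noise signal could then be based on the value returned by `max_leakage_response_db`, in the manner described above for task TA200.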
Such calculation of a filter response may be performed according to a desired resolution of angle φ and frequency f. Alternatively, it may be decided for some applications that calculation of the response at a single value of frequency f (e.g., frequency f1) is sufficient. Such calculation may also be performed for each of a plurality of source spatially selective filters, each oriented in a different corresponding source direction (e.g., for each of a set of fixed options as described above with reference to examples 1, 2, 3, and 5 of task T100), such that task TA100 selects the estimated response corresponding to the current source direction at run-time.
Calculating a filter response as defined by the values of its coefficients (e.g., as described above with reference to expression (5)) produces a theoretical result that may differ from the actual response of the device with respect to direction (and frequency) as observed in service. It may be expected that in-service masking performance may be improved by compensating for such difference. For example, the response of the source spatially directive filter with respect to direction (and frequency, if desired) may be estimated by measuring the intensity distribution of an actual sound field that is produced using a copy of the filter. Such direct measurement of the estimated intensity may also be expected to account for other effects that may be observed in service, such as a response of the loudspeaker array.
In this case, an instance of task T100 is performed on a second source signal (e.g., white or pink noise) to produce a second multichannel source signal, based on the source direction. The second multichannel source signal is used to drive a second array of loudspeakers to produce a second sound field that has a source component in the source direction (in this case, relative to an axis of the second array). The intensity of the second sound field is observed at each of a plurality of angles (and, if desired, at each of one or more frequency subbands), and the observed intensities are recorded to obtain an offline recording.
It may be desirable to minimize effects that may cause the second sound field to differ from the source component and thereby reduce the accuracy of the estimated response. For example, it may be desirable for loudspeaker array LA20 to be as similar as possible to loudspeaker array LA10 (e.g., for each array to have the same number of the same type of loudspeakers, and for the positioning of the loudspeakers relative to one another to be the same in each array). Physical characteristics of the device (e.g., acoustic reflectance of the surfaces, resonances of the housing) may also affect the intensity distribution of the sound field, and it may be desirable to include the effects of such characteristics in the observed results as recorded. For example, it may also be desirable for array LA20 to be mounted and/or enclosed, during the measurement, in a housing that is as similar as possible to the housing in which array LA10 is to be mounted and/or enclosed during service. Similarly, it may be desirable for the electronics used to drive each array in response to the corresponding multichannel signal to be as similar as possible, or at least to have similar frequency responses.
Recording logic RL10 receives a signal produced by each microphone of array MA20 in response to the second sound field and calculates a corresponding intensity (e.g., as the energy over a frame or other interval of the captured signal). Recording logic RL10 may be implemented to calculate the intensity of the second sound field with respect to direction (e.g., in decibels) relative to a level of the second source signal or, alternatively, relative to an intensity of the second sound field in the source direction. If desired, recording logic RL10 may also be implemented to calculate the intensity at each observation direction per frequency component or subband.
Such sound field production, measurement, and intensity calculation may be repeated for each of a plurality of source directions. For example, a corresponding instance of the measurement procedure may be performed for each of a set of fixed options as described above with reference to examples 1, 2, 3, and 5 of task T100. The calculated intensities are stored before run-time (e.g., during manufacture, during provisioning, and/or as part of a software or firmware update) as offline recording information OR10.
Calculation of a response of the source spatially directive filter may be based on an estimated response that is calculated from the filter coefficients as described above (e.g., with reference to expression (5)), on an estimated response from offline recording information OR10, or on a combination of both. In one example of such a combination, the estimated response is calculated as an average of corresponding values from the filter coefficients and from information OR10.
In another example of such a combination, the estimated response is calculated by adjusting an estimated response at angle φ, as calculated from the filter coefficients, according to one or more estimated responses from observations at nearby angles from information OR10. It may be desirable, for example, to collect and/or store offline recording information OR10 using a coarse angular resolution (e.g., five, ten, twenty, 22.5, thirty, or forty-five degrees) and to calculate the intensity from the filter coefficients using a finer angular resolution (e.g., one, five, or ten degrees). In such case, the estimated response may be calculated by compensating a response as calculated from the filter coefficients (e.g., as described above with reference to expression (5)) with a compensation factor that is based on information OR10. The compensation factor may be calculated, for example, from a difference between an observed response at a nearby angle, from information OR10, and a response as calculated from the filter coefficients for the nearby angle. In a similar manner, a compensation factor with respect to source direction and/or frequency may also be calculated from an observed response from information OR10 at a nearby source direction and/or a nearby frequency.
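As a non-limiting illustration of such compensation, the following sketch adjusts a response calculated at a fine angular resolution by the difference, at the nearest coarsely spaced observation angle, between the observed response and the calculated response. The function name, the use of a dictionary as a stand-in for offline recording information OR10, and the choice of the single nearest angle are assumptions made for this example.

```python
def compensated_response_db(phi_deg, calc_response_db, observed_db):
    """Adjust a calculated response using the nearest observed angle.

    calc_response_db: callable mapping an angle (degrees) to a calculated
        response in dB (e.g., from the filter coefficients).
    observed_db: dict mapping coarse-grid angles (degrees) to observed
        responses in dB (a stand-in for offline recording information).
    """
    nearest = min(observed_db, key=lambda a: abs(a - phi_deg))
    compensation = observed_db[nearest] - calc_response_db(nearest)
    return calc_response_db(phi_deg) + compensation


# Example with a hypothetical calculated response and three observations.
calc = lambda a: -0.1 * abs(a - 90.0)
observed = {60.0: -2.0, 90.0: 0.0, 120.0: -4.0}
estimate_at_65 = compensated_response_db(65.0, calc, observed)
```

In this example the observation at 60 degrees is 1 dB above the calculated value at that angle, so the calculated value at 65 degrees is raised by the same 1 dB.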
The response of the source spatially directive filter may be estimated and stored before run-time, such as during design and/or manufacture, to be accessed by task T220 (e.g., by task TA100) at run-time. Such precalculation may be appropriate for a case in which the source component is oriented in a fixed direction or in a selected one of a few (e.g., ten or fewer) fixed directions (e.g., as described above with reference to examples 1, 2, 3, and 5 of task T100). Alternatively, task T220 may be implemented to estimate the filter response at run-time.
The value of the gain factor may also be based on an estimated intensity of the source component in one or more other directions. For example, the gain factor value may be based on estimated filter responses at two or more source sidelobes (e.g., relative to the source main lobe level). In such case, the two or more sidelobes may be selected as the highest sidelobes, the sidelobes having estimated intensities not less than (alternatively, greater than) a threshold value, and/or the sidelobes that are closest in direction to the main lobe. The gain factor value (which may be precalculated, or calculated at run-time by task TA210) may be based on an average of the estimated responses at the two or more sidelobes.
Task T200 may be implemented to produce the masking signal based on a level of the source signal in the time domain.
It may be desirable to implement task T200 to vary the gain of the masking signal over time (e.g., to implement task TA210 to vary the gain of the noise signal over time), based on a level of the source signal over time. For example, it may be desirable to implement task T200 to control a gain of the noise signal based on a temporally smoothed level of the source signal. Such control may help to avoid annoying mimicking of speech sparsity (e.g., in a phone-call masking scenario). For applications in which a signal that indicates a voice activity state of the source signal is available, task T200 may be configured to maintain a high level of the masking signal for a hangover period (e.g., several frames) after the voice activity state changes from active to inactive.
It may be desirable to use a temporally sparse signal to mask a similarly sparse source signal, such as a far-end voice communications signal, and to use a temporally continuous signal to mask a less sparse source signal, such as a music signal. In such case, task T200 may be implemented to produce a masking signal that is active only when the source signal is active. Such implementations of task T200 may produce a masking signal whose energy changes over time in a manner similar to that of the source signal (e.g., a masking signal whose energy over time is proportional to that of the source signal).
As described above, the estimated intensity of the source component may be based on an estimated response of the source spatially directive filter in one or more directions. The estimated intensity of the source component may also be based on a level of the source signal. In such case, task TA210 may be implemented to calculate the gain factor value as a combination (e.g., as a product in the linear domain or as a sum in the decibel domain) of a value based on the estimated filter response, which may be precalculated, and a value based on the estimated source signal level. A corresponding implementation of task T220 may be configured, for example, to produce the masking signal by applying a gain factor to each frame of the noise signal, where the value of the gain factor is based on a level (e.g., an energy level) of a corresponding frame of the source signal. In one such case, the value of the gain factor is higher when the energy of the source signal within the frame is high and lower when the energy of the source signal within the frame is low.
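As a non-limiting illustration, a per-frame gain factor value may be sketched as a decibel-domain sum of a precalculated filter-response value and a frame energy level of the source signal. The function names are hypothetical, and the sketch assumes for simplicity that the noise signal has unit per-sample energy.

```python
import math


def frame_energy_db(frame):
    """Per-sample energy of one frame, in dB (floored to avoid log(0))."""
    energy = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(energy, 1e-12))


def gain_factor_db(source_frame, leakage_response_db):
    """Gain factor as a decibel-domain sum of the two component values."""
    return leakage_response_db + frame_energy_db(source_frame)


def apply_gain(noise_frame, gain_db):
    """Scale one frame of the noise signal by a gain expressed in dB."""
    scale = 10.0 ** (gain_db / 20.0)
    return [scale * x for x in noise_frame]
```

With a fixed response value, a frame of the source signal having higher energy yields a higher gain factor value, and a frame having lower energy yields a lower one.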
If the source signal is sparse over time (e.g., as for a speech signal), a masking signal whose level strictly mimics the sparse behavior of the source speech signal over time may be distracting to nearby persons by emphasizing the speech sparsity. It may be desirable, therefore, to implement task T200 to produce the masking signal to have a more gradual attack and/or decay over time than the source signal. For example, task TA200 may be implemented to control the level of the masking signal based on a temporally smoothed level of the source signal and/or to perform a temporal smoothing operation on the gain factor of the masking signal.
In one example, such a temporal smoothing operation is implemented by using a first-order infinite-impulse-response filter (also called a leaky integrator) to apply a smoothing factor to a sequence in time of values of the gain factor (e.g., to the gain factor values for a consecutive sequence of frames). The value of the smoothing factor may be fixed. Alternatively, the smoothing factor may be adapted to provide less smoothing during onset of the source signal and/or more smoothing during offset of the source signal. For example, the smoothing factor value may be based on an activity state and/or an activity state transition of the source signal. Such smoothing may help to reduce the temporal sparsity of the combined sound field as experienced by a bystander.
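As a non-limiting illustration, such a leaky integrator with an adaptive smoothing factor may be sketched as follows. The function name and the particular attack and decay values are assumptions made for this example.

```python
def smooth_gains(gains, attack=0.3, decay=0.9):
    """First-order IIR (leaky integrator) smoothing of per-frame gain values.

    A smaller factor tracks the input faster. `attack` is used while the gain
    is rising (source onset) and `decay` otherwise (e.g., source offset).
    """
    smoothed = []
    state = gains[0] if gains else 0.0
    for g in gains:
        beta = attack if g > state else decay
        state = beta * state + (1.0 - beta) * g
        smoothed.append(state)
    return smoothed


# Example: a burst of activity lasting three frames.
smoothed = smooth_gains([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
```

In this example the smoothed gain rises to 0.7 in the first active frame but falls by only ten percent per frame after the burst ends, which provides less smoothing at onset and more smoothing at offset.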
Additionally or alternatively, task T200 may be implemented to produce the masking signal to have an onset similar to that of the source signal but a prolonged offset. For example, it may be desirable to implement task TA200 to apply a hangover period to the gain factor such that the gain factor value remains high for several frames after the source signal becomes inactive. Such a hangover may help to reduce the temporal sparsity of the combined sound field as experienced by a bystander and may also help to obscure the source component via a psychoacoustic effect called “backward masking” (or pre-masking). Additionally or alternatively, for a case in which it is acceptable to delay the source signal, task T200 may be implemented to generate the masking signal to have an earlier onset than the source signal to support a psychoacoustic effect called “forward masking” (or post-masking).
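As a non-limiting illustration, a hangover period may be sketched as an operation on a per-frame activity indication of the source signal. The function name and the representation of the activity state as a list of Boolean values are assumptions made for this example.

```python
def apply_hangover(active_flags, hangover_frames=5):
    """Extend each active region by `hangover_frames` frames after it ends."""
    held = []
    remaining = 0
    for active in active_flags:
        if active:
            remaining = hangover_frames  # re-arm the hangover counter
            held.append(True)
        elif remaining > 0:
            remaining -= 1
            held.append(True)
        else:
            held.append(False)
    return held


# Example: two active frames followed by silence, with a two-frame hangover.
held = apply_hangover([True, True, False, False, False, False], hangover_frames=2)
```

The gain factor may then be kept at its high value for every frame in which the corresponding entry of `held` is true.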
Instead of being configured to produce a masking signal whose energy is similar (e.g., proportional) over time to the energy of the source signal, task T200 may be implemented to produce the masking signal such that the combined sound field has a substantially constant level over time in the direction of the masking component. In one such example, task TA210 is configured to calculate the gain factor value such that the expected energy of the combined sound field in the direction of the masking component for each frame is based on a long-term energy level of the source signal (e.g., the energy of the source signal averaged over the most recent ten, twenty, or fifty frames).
Such an implementation of task TA210 may be configured to calculate a gain factor value for each frame of the masking signal based on both the energy of the corresponding frame of the source signal and the long-term energy level of the source signal. For example, task TA210 may be implemented to produce the masking signal such that a change in the value of the gain factor from a first frame to a second frame is opposite in direction to a change in the level of the source signal from the first frame to the second frame (e.g., is complementary, with respect to the long-term energy level, to a corresponding change in the level of the source signal).
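As a non-limiting illustration, such a complementary calculation may be sketched as follows, where the energy of the masking component for each frame is chosen to raise the expected combined energy to a target that is based on the long-term energy level of the source signal. The function name, the headroom factor, and the assumption that the energies of the source and masking components add in the masking direction are made for this example only.

```python
from collections import deque


def complementary_mask_energies(frame_energies, history=20, headroom=2.0):
    """Per-frame masking energy that tops the combined field up to a slowly
    varying target (headroom times the long-term source energy)."""
    recent = deque(maxlen=history)  # most recent source frame energies
    mask = []
    for e in frame_energies:
        recent.append(e)
        long_term = sum(recent) / len(recent)
        target = headroom * long_term
        mask.append(max(target - e, 0.0))  # masking energy cannot be negative
    return mask


# Example: source energies (as estimated in the masking direction) per frame.
mask = complementary_mask_energies([1.0, 1.0, 1.0, 1.0, 3.0, 1.0], history=4)
```

In this example, when the source energy rises from 1.0 to 3.0 the masking energy falls from 1.0 to 0.0, and when the source energy returns to 1.0 the masking energy rises to 2.0, so that each change in the masking level is opposite in direction to the corresponding change in the source level.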
A masking signal whose energy changes over time in a manner similar to that of the energy of the source signal may provide better privacy. Consequently, such a configuration of task T200 may be suitable for a communications use case. Alternatively, a combined sound field having a substantially constant level over time in the direction of the masking component may be expected to have a reduced environmental impact and may be suitable for an entertainment use case. It may be desirable to implement task T200 to produce the masking signal according to a detected use case (e.g., as indicated by a current mode of operation of the device and/or by the nature of the module from which the source signal is received).
In a further example, task T200 may be implemented to modulate the level of the masking signal over time according to a rhythmic pattern. For example, task T200 may be implemented to modulate the level of the masking signal over time at a frequency of from 0.1 Hz to 3 Hz. Such modulation has been shown to provide effective masking at reduced masking power levels. The modulation frequency may be fixed or may be adaptive. For example, the modulation frequency may be based on a detected variation in the level of the source signal over time (e.g., a rhythm of a music signal), and the frequency of this variation may change over time. In such cases, task TA200 may be implemented to apply such modulation by modulating the value of the gain factor.
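As a non-limiting illustration, such modulation of the gain factor value may be sketched with a sinusoidal pattern. The function name, the sinusoidal shape, and the modulation depth are assumptions made for this example; the modulation frequency of 1 Hz is one value within the range noted above.

```python
import math


def modulated_gains(base_gain, num_frames, frame_rate_hz, mod_freq_hz=1.0, depth=0.5):
    """Gain values following a sinusoidal rhythm between base*(1-depth) and base."""
    gains = []
    for k in range(num_frames):
        t = k / frame_rate_hz  # time of frame k in seconds
        envelope = 1.0 - depth * 0.5 * (1.0 - math.cos(2.0 * math.pi * mod_freq_hz * t))
        gains.append(base_gain * envelope)
    return gains


# Example: 10-millisecond frames (100 frames per second), one second of gains.
gains = modulated_gains(base_gain=1.0, num_frames=101, frame_rate_hz=100.0)
```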
In addition to an estimated intensity of the source component, task TA210 may be implemented to calculate the value of the gain factor based on one or more other component factors as well. In one such example, task TA210 is implemented to calculate the value of the gain factor based on the type of noise signal used to produce the masking signal (e.g., white noise or pink noise). Additionally or alternatively, task TA210 may be implemented to calculate the value of the gain factor based on the identity of a current application. For example, it may be desirable for the masking component to have a higher intensity during a voice communications or other privacy-sensitive application (e.g., a telephone call) than during a media application (e.g., watching a movie). In such case, task TA210 may be implemented to scale the gain factor according to a detected use case (as indicated, for example, by a current mode of operation of the device and/or by the nature of the module from which the source signal is received). Other examples of such component factors include a ratio between the peak responses of the source and masking spatially directive filters. Task TA210 may be implemented to multiply (e.g., in a linear domain) and/or to add (e.g., in a decibel domain) such component factors to obtain the gain factor value. It may be desirable to implement task TA210 to calculate the gain factor value according to a loudness weighting function or other perceptual response function, such as an A-weighting curve.
It may be desirable to implement task T200 to produce the masking signal based on a frequency profile of the source signal (a “source frequency profile”). The source frequency profile indicates a corresponding level (e.g., an energy level) of the source signal at each of a plurality of different frequencies (e.g., subbands). In such case, it may be desirable to calculate and apply values of the gain factor to corresponding subbands of the noise signal.
Task T400 may be implemented to determine the source frequency profile according to a current use of the device (e.g., as indicated by a current mode of operation of the device and/or by the nature of the module from which the source signal is received). If the device is engaged in voice communications (for example, the source signal is a far-end telephone call), task T400 may determine that the source signal has a frequency profile that indicates a decrease in energy level as frequency increases. If the device is engaged in media playback (for example, the source signal is a music signal), task T400 may determine that the source frequency profile is flatter with respect to frequency, such as a white or pink noise profile.
Additionally or alternatively, task T400 may be implemented to determine the source frequency profile by calculating levels of the source signal at different frequencies. For example, task T400 may be implemented to determine the source frequency profile by calculating a first level of the source signal at a first frequency and a second level of the source signal at a second frequency. Such calculation may include a spectral or subband analysis of the source signal in a frequency domain or in the time domain. Such calculation may be performed for each frame of the source signal or at another interval. Typical frame lengths include five, ten, twenty, forty, and fifty milliseconds. It may be desirable to implement task T400 to calculate the source frequency profile according to a loudness weighting function or other perceptual response function, such as an A-weighting curve.
For time-domain analysis, task T400 may be implemented to determine the source frequency profile by calculating an average energy level for each of a plurality of subbands of the source signal. Such an analysis may include applying a subband filter bank to the source signal, such that the frame energy of the output of each filter (e.g., a sum of squared samples of the output for the frame or other interval, which may be normalized to a per-sample value) indicates the level of the source signal at a corresponding frequency, such as a center or peak frequency of the filter passband.
The subband division scheme may be uniform, such that each subband has substantially the same width (e.g., within about ten percent). Alternatively, the subband division scheme may be nonuniform, such as a transcendental scheme (e.g., a scheme based on the Bark scale) or a logarithmic scheme (e.g., a scheme based on the Mel scale). In one example, the edges of a set of seven Bark scale subbands correspond to the frequencies 20, 300, 630, 1080, 1720, 2700, 4400, and 7700 Hz. Such an arrangement of subbands may be used in a wideband speech processing system that has a sampling rate of 16 kHz. In other examples of such a division scheme, the lowest subband is omitted to obtain a six-subband arrangement and/or the high-frequency limit is increased from 7700 Hz to 8000 Hz. Another example of a subband division scheme is the four-band quasi-Bark scheme 300-510 Hz, 510-920 Hz, 920-1480 Hz, and 1480-4000 Hz. Such an arrangement of subbands may be used in a narrowband speech processing system that has a sampling rate of 8 kHz. Other examples of perceptually relevant subband division schemes that may be used to implement a subband filter bank for analysis of the source signal include octave band, third-octave band, critical band, and equivalent rectangular bandwidth (ERB) scales.
In one example, task T400 applies a subband filter bank that is implemented as a bank of second-order recursive (i.e., infinite-impulse-response) filters. Such filters are also called “biquad filters.”
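As a non-limiting illustration, a time-domain subband analysis using a bank of biquad filters may be sketched as follows. The band-pass coefficient formulas follow a commonly used bilinear-transform design; the function names, the center frequencies, and the quality factor are assumptions made for this example.

```python
import math


def bandpass_biquad(center_hz, q, sample_rate_hz):
    """Coefficients (b0, b1, b2, a1, a2) of a band-pass biquad, normalized by a0."""
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    return (alpha / a0, 0.0, -alpha / a0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)


def biquad_filter(samples, coeffs):
    """Direct-form I filtering of a list of samples."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def subband_levels(samples, centers_hz, sample_rate_hz, q=2.0):
    """Per-sample energy at the output of each band-pass filter."""
    levels = []
    for fc in centers_hz:
        y = biquad_filter(samples, bandpass_biquad(fc, q, sample_rate_hz))
        levels.append(sum(v * v for v in y) / len(y))
    return levels


# Example: a 1 kHz tone sampled at 8 kHz, analyzed in three subbands.
tone = [math.sin(2.0 * math.pi * 1000.0 * i / 8000.0) for i in range(800)]
levels = subband_levels(tone, [250.0, 1000.0, 3000.0], 8000.0)
```

In this example the level of the subband centered at 1 kHz is the largest of the three, which indicates that the energy of the source signal is concentrated near that frequency.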
For frequency-domain analysis, task T400 may be implemented to determine the source frequency profile by calculating a frame energy level for each of a plurality of frequency bins of the source signal or by calculating an average frame energy level for each of a plurality of groups of frequency bins of the source signal. Such a grouping may be configured according to a perceptually relevant subband division scheme, such as one of the examples listed above.
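As a non-limiting illustration, a frequency-domain calculation of the source frequency profile may be sketched as an average of frequency-bin energies over each of the four quasi-Bark subbands noted above. The function names, the use of a direct (non-fast) discrete Fourier transform for brevity, and the treatment of each subband as including its lower edge and excluding its upper edge are assumptions made for this example.

```python
import cmath
import math

QUASI_BARK_EDGES_HZ = [300.0, 510.0, 920.0, 1480.0, 4000.0]  # four subbands


def dft_bin_energies(frame):
    """Energy of each non-negative-frequency DFT bin (direct O(N^2) DFT)."""
    n = len(frame)
    energies = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        energies.append(abs(acc) ** 2)
    return energies


def source_frequency_profile(frame, sample_rate_hz, edges_hz=QUASI_BARK_EDGES_HZ):
    """Average bin energy in each subband [edges[b], edges[b+1])."""
    energies = dft_bin_energies(frame)
    bin_hz = sample_rate_hz / len(frame)
    profile = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        band = [e for k, e in enumerate(energies) if lo <= k * bin_hz < hi]
        profile.append(sum(band) / len(band) if band else 0.0)
    return profile


# Example: one 10-millisecond frame of a 700 Hz tone sampled at 8 kHz.
frame = [math.sin(2.0 * math.pi * 700.0 * i / 8000.0) for i in range(80)]
profile = source_frequency_profile(frame, 8000.0)
```

In this example the energy of the frame falls in the second subband (510-920 Hz), so the second value of `profile` is the largest.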
In another example, task T400 is implemented to determine the source frequency profile from a set of linear prediction coding (LPC) parameters, such as LPC filter coefficients. Such an implementation may be especially suitable for a case in which the source signal is provided in a form that includes LPC parameters (e.g., the source signal is provided as an encoded speech signal). In such case, the source frequency profile may be implemented to include a location and level for each of one or more spectral peaks (e.g., formants) and/or valleys of the source signal. It may be desirable, for example, to implement task T230 to filter the noise signal to have a low level at source formant peaks and a higher level in source spectral valleys. Alternatively or additionally, task T230 may be implemented to filter the noise signal to have a notch at one or more of the source pitch harmonics. Alternatively or additionally, task T230 may be implemented to filter the noise signal to have a spectral tilt that is based on (e.g., is inverse in direction to) a source spectral tilt, as indicated, e.g., by the first reflection coefficient.
Task T230 produces the masking signal based on the noise signal and according to the masking frequency profile. The masking frequency profile may indicate a distribution of energy that is more concentrated or less concentrated in particular bands (e.g., speech bands), or a frequency profile that is flat or is tilted up or down.
Based on the source frequency profile, task T230 may be implemented to select the masking frequency profile from a database. Alternatively, task T230 may be implemented to calculate the masking frequency profile, based on the source frequency profile.
It may be desirable to implement task TA110 to calculate the estimated intensity of the source component with respect to frequency, based on the source frequency profile. Such calculation may also take into account variations of the estimated response of the source spatially directive filter with respect to frequency (alternatively, it may be decided for some applications that calculation of the response at a single value of frequency f, such as frequency f1, is sufficient).
The response of the source spatially directive filter may be estimated and stored before run-time, such as during design and/or manufacture, to be accessed by task T230 (e.g., by task TA110) at run-time. Such precalculation may be appropriate for a case in which the source component is oriented in a fixed direction or in a selected one of a few (e.g., ten or fewer) fixed directions (e.g., as described above with reference to examples 1, 2, 3, and 5 of task T100). Alternatively, task T230 may be implemented to estimate the filter response at run-time.
Task TA110 may be implemented to calculate the estimated intensity for each subband as a product of the estimated response and level for the subband in the linear domain, or as a sum of the estimated response and level for the subband in the decibel domain. Task TA110 may also be implemented to apply temporal smoothing and/or a hangover period as described above to each of one or more (possibly all) of the subband levels of the source signal.
The masking frequency profile may be implemented as a plurality of masking target levels, each corresponding to one of the plurality of different frequencies (e.g., subbands). In such case, task T230 may be implemented to produce the masking signal according to the masking target levels.
Task TC150 may be implemented to calculate each of one or more of the masking target levels as a corresponding masking threshold that is based on a value of the source frequency profile in the subband and indicates a minimum masking level. Such a threshold may also be based on estimates of psychoacoustic factors such as, for example, tonality of the source signal (and/or of the noise signal) in the subband, masking effect of the noise signal on adjacent subbands, and a threshold of hearing in the subband. Calculation of a subband masking threshold may be performed, for example, as described in Psychoacoustic Model 1 or 2 of the MPEG-1 standard (ISO/IEC, JTC1/SC29/WG11MPEG, “Information technology-Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s-Part 3: Audio,” IS11172-3 1992). Additionally or alternatively, it may be desirable to implement task TC150 to calculate the masking target levels according to a loudness weighting function or other perceptual response function, such as an A-weighting curve.
It may be desirable for method M100 to produce the sound field to have a spectrum that is noise-like in one or more directions outside the privacy zone (e.g., in one or more directions other than the user's direction, such as a leakage direction). For example, it may be desirable for these regions of the combined sound field to have a white-noise distribution (i.e., equal energy per frequency), a pink-noise distribution (i.e., equal energy per octave), or another noise distribution, such as a perceptually weighted noise distribution. In such cases, task TC150 may be implemented to calculate, for at least some of the plurality of frequencies, a masking target level that is based on a masking target level for at least one other frequency.
For a combined sound field that is noise-like in a leakage direction, task T200 may be implemented to select or filter the noise signal to have a spectrum that is complementary to that of the source signal with respect to a desired intensity of the combined sound field. For example, task T200 may be implemented to produce the masking signal such that a change in the level of the noise signal from a first frequency to a second frequency is opposite in direction (e.g., is inverse) to a change in the level of the source signal from the first frequency to the second frequency (e.g., as indicated by the source frequency profile).
In this case, task TC150 may be implemented to calculate a desired combined intensity of the sound field in the leakage direction for subband i as a product of (A) the bandwidth of subband i and (B) the maximum, over all subbands j, of the estimated combined intensity of subband j as normalized by the bandwidth of subband j. Such a calculation may be performed, for example, according to an expression such as
$$\mathrm{DCI}_i = \mathrm{BW}_i \times \max_j \left( \frac{\mathrm{ECI}_j}{\mathrm{BW}_j} \right),$$
where DCIi denotes the desired combined intensity for subband i, ECIj denotes the estimated combined intensity for subband j, and BWi and BWj denote the bandwidths of subbands i and j, respectively. In the particular example of
In the example of
In this case, task TC150 may be implemented to determine the desired combined intensity of the sound field in the leakage direction for each subband as a maximum of the estimated combined intensities, as shown in the plot on the right, and to calculate a modified masking target level for each subband (for example, as the difference between the corresponding desired combined intensity and the corresponding estimated intensity of the source component in the leakage direction). For other subband division schemes (e.g., a third-octave scheme or a critical-band scheme), calculation of a desired combined intensity for each subband, and calculation of a modified masking target level for each subband, may include a suitable bandwidth compensation.
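As a non-limiting illustration, a bandwidth-compensated calculation of the desired combined intensity (i.e., the product of the subband bandwidth and the maximum bandwidth-normalized estimated combined intensity, as described above) and of the modified masking target levels may be sketched as follows. The function names and the example values are hypothetical, and all quantities are assumed to be expressed in the linear (energy) domain.

```python
def desired_combined_intensity(eci, bandwidths_hz):
    """DCI_i = BW_i * max_j(ECI_j / BW_j), all values in the linear domain."""
    peak_density = max(e / bw for e, bw in zip(eci, bandwidths_hz))
    return [bw * peak_density for bw in bandwidths_hz]


def modified_masking_targets(eci, source_intensity, bandwidths_hz):
    """Masking target = desired combined intensity minus source intensity."""
    dci = desired_combined_intensity(eci, bandwidths_hz)
    return [max(d - s, 0.0) for d, s in zip(dci, source_intensity)]


# Example with three subbands of widths 100, 200, and 400 Hz.
bandwidths = [100.0, 200.0, 400.0]
eci = [8.0, 4.0, 4.0]              # estimated combined intensities
source = [5.0, 1.0, 2.0]           # estimated source intensities (leakage direction)
dci = desired_combined_intensity(eci, bandwidths)
targets = modified_masking_targets(eci, source, bandwidths)
```

In this example the first subband has the highest intensity per unit bandwidth, so the desired combined intensities are proportional to bandwidth (8, 16, and 32), which corresponds to equal energy per unit frequency across the subbands.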
As shown in the examples of
It may be desirable to configure task T200 according to a detected use case (e.g., as indicated by a current mode of operation of the device and/or by the nature of the module from which the source signal is received). For example, a combined sound field that resembles white noise in a leakage direction may be more effective at concealing speech within the source signal, so for a communications use (e.g., when the device is engaged in a telephone call), it may be desirable for task T230 to use a white-noise spectral profile (e.g., as shown in
In a further example, it may be desirable to implement task TC150 to calculate the desired combined intensities according to a noise profile that varies over time. Such alternative noise profiles include babble noise, street noise, and car interior noise. For example, it may be desirable to select a noise profile according to (e.g., to match) a detected ambient noise profile.
Based on the masking frequency profile, task TC210 calculates a corresponding gain factor value for each subband. For example, it may be desirable to calculate the gain factor value to be high enough for the intensity of the masking component in the subband to meet the corresponding masking target level in the leakage direction. It may be desirable to implement task TC210 to calculate the gain factor values according to a loudness weighting function or other perceptual response function, such as an A-weighting curve.
Tasks TC150 and/or TC210 may be implemented to account for a dependence of the source frequency profile on the source direction, a dependence of the masking frequency profile on the masking direction, and/or a frequency dependence in a response of the audio output path (e.g., in a response of the loudspeaker array). In another example, task TC210 is implemented to modulate the values of the gain factor for one or more (possibly all) of the subbands over time according to a rhythmic pattern (e.g., at a frequency of from 0.1 Hz to 3 Hz, which modulation frequency may be fixed or may be adaptive) as described above.
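By way of a hedged example, one possible calculation of per-subband gain factors and of a rhythmic modulation of such a gain factor is sketched below. It assumes that the masking target levels and the unscaled noise levels are expressed in decibels, that the gain factor is an amplitude ratio, and that the modulation is sinusoidal with a hypothetical `depth` parameter; none of these choices is specified by the description above.

```python
import math


def subband_gain_factors(target_db, noise_db):
    """Amplitude gain per subband that raises the noise level to the target level."""
    return [10.0 ** ((t - n) / 20.0) for t, n in zip(target_db, noise_db)]


def modulate_gain(gain, t_seconds, mod_hz=0.5, depth=0.2):
    """Apply a sinusoidal modulation to a gain factor at time t_seconds.

    mod_hz is restricted to the 0.1 Hz to 3 Hz range mentioned in the text;
    depth (fraction of the base gain) is an assumption of this sketch.
    """
    if not 0.1 <= mod_hz <= 3.0:
        raise ValueError("mod_hz is expected to lie between 0.1 Hz and 3 Hz")
    return gain * (1.0 + depth * math.sin(2.0 * math.pi * mod_hz * t_seconds))
```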
Task TC200 may be configured to produce the masking signal by applying corresponding gain factor values to different frequency components of the noise signal. Task TC200 may be configured to produce the masking signal by using a subband filter bank to shape the noise signal according to the masking frequency profile. In one example, such a subband filter bank is implemented as a cascade of biquad peaking filters. The desired gain at each subband may be obtained in this case by modifying the filter transfer function with an offset that is based on the corresponding gain factor. Such a modified transfer function for each subband i may be expressed as follows:
$$H_i(z) = \frac{(b_0(i) + g_i) + b_1(i)\,z^{-1} + (b_2(i) - g_i)\,z^{-2}}{1 + a_1(i)\,z^{-1} + a_2(i)\,z^{-2}},$$
where the values of a1(i) and a2(i) are selected to define subband i, b0(i) is equal to one, the values of a1(i) and b1(i) are equal, the values of a2(i) and b2(i) are equal, and gi denotes the corresponding offset.
Offset gi may be calculated from the corresponding gain factor (e.g., based on a masking target level mi for subband i, as described above with reference to
$$g_i = \left(1 - a_2(i)\right)\left(10^{m_i/20} - 1\right) c_i,$$
where mi is the masking signal level for subband i (in decibels) and ci is a normalization factor having a value less than one. Factor ci may be tuned such that the desired gain is achieved, for example, at the center of the subband.
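The following sketch illustrates one way such a peaking section might be realized and checked. It assumes a numerator of the form (1 + g_i) + a1(i) z^-1 + (a2(i) - g_i) z^-2 over the denominator 1 + a1(i) z^-1 + a2(i) z^-2, an offset g_i = (1 - a2(i))(10^(m_i/20) - 1) c_i, and a parameterization of a1(i) and a2(i) by a pole radius and center frequency. The function names and the default c = 0.5 are likewise assumptions of the sketch rather than values given above.

```python
import cmath
import math


def peaking_biquad(center_hz, fs_hz, pole_radius, m_db, c=0.5):
    """Return (b, a) coefficients of one peaking section for a subband.

    The subband is defined by a complex pole pair at the given radius and
    center frequency (an assumed parameterization).
    """
    theta = 2.0 * math.pi * center_hz / fs_hz
    a1 = -2.0 * pole_radius * math.cos(theta)
    a2 = pole_radius ** 2
    g = (1.0 - a2) * (10.0 ** (m_db / 20.0) - 1.0) * c  # offset g_i
    b = (1.0 + g, a1, a2 - g)  # b0 = 1, b1 = a1, b2 = a2, then offset by g
    a = (1.0, a1, a2)
    return b, a


def gain_at(b, a, freq_hz, fs_hz):
    """Magnitude response of one section at freq_hz."""
    z1 = cmath.exp(-2j * math.pi * freq_hz / fs_hz)  # z^-1 on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return abs(num / den)


def cascade_gain(sections, freq_hz, fs_hz):
    """Magnitude response of a serial cascade of sections at freq_hz."""
    total = 1.0
    for b, a in sections:
        total *= gain_at(b, a, freq_hz, fs_hz)
    return total
```

With the offset set to zero, the numerator and denominator are identical and the section passes the noise signal unchanged. A nonzero offset raises the response near the center of the subband while leaving the response at zero frequency at unity.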
The subband division scheme used in task TC200 may be any of the schemes described above with reference to task T400 (e.g., uniform or nonuniform; transcendental or logarithmic; octave, third-octave, or critical band or ERB; with four, six, seven, or more subbands, such as seventeen or twenty-three subbands). Typically the same subband division scheme is used for noise synthesis in task TC200 as for source analysis in T400, and the same filters may even be used for the two tasks, although for analysis the filters are typically arranged in parallel rather than in serial cascade.
It may be desirable to implement task T200 to generate the masking signal such that levels of each of a time-domain characteristic and a frequency-domain characteristic are based on levels of a corresponding characteristic of the source signal (e.g., as described herein with reference to implementations of task T230). Other implementations of task T200 may use results from analysis of the source signal in another domain, such as an LPC domain, a wavelet domain, and/or a cepstral domain. For example, task T200 may be implemented to perform a multiresolution analysis (MRA), a mel-frequency cepstral coefficient (MFCC) analysis, a cascade time-frequency linear prediction (CTFLP) analysis, and/or an analysis based on other psychoacoustic principles, on the source signal for use in generating an appropriate masking signal. Task T200 may perform voice activity detection (VAD) such that the source characteristics include an indication of presence or absence of voice activity (e.g., for each frame of the source signal).
In another example, task T200 is implemented to generate the masking signal based on at least one entry that is selected from a database of noise signals or noise patterns according to one or more characteristics of the source signal. For example, task T200 may be implemented to use such a source characteristic to select configuration parameters for a noise signal from a noise pattern database. Such configuration parameters may include a frequency profile and/or a temporal profile. Characteristics that may be used in addition to or in the alternative to those source characteristics noted herein include one or more of: sharpness (center frequency and bandwidth), roughness and/or fluctuation strength (modulation frequency and depth), impulsiveness, tonality (proportion of loudness that is due to tonal components), tonal audibility, tonal multiplicity (number of tones), bandwidth, and N percent exceedance level. In this example, task T200 may be implemented to generate the noise signal using an entry from a database of stored PCM samples by performing a technique such as, for example, wavetable synthesis, granular synthesis, or graintable synthesis. In such cases, task TC210 may be implemented to calculate the gain factors based on one or more characteristics (e.g., energy) of the selected or generated noise signal.
In a further example, task T200 is implemented to generate the noise signal from the source signal. Such an implementation of task T200 may generate the noise signal by rearranging frames of the source signal into a different sequence in time, by calculating an average frame from multiple frames of the source signal, and/or by generating frames from parameter values extracted from frames of the source signal (e.g., pitch frequency and/or LP filter coefficients).
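As a minimal sketch of the first of these options, frames of the source signal may be rearranged into a different sequence in time as shown below. The function name, the fixed seed, and the choice to drop a trailing partial frame are assumptions of this example.

```python
import random


def rearrange_frames(samples, frame_len, seed=0):
    """Return a noise signal built by shuffling whole frames of the source.

    Any trailing partial frame is dropped (an assumption of this sketch).
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    frames = [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    rng = random.Random(seed)  # seeded so that the result is repeatable
    rng.shuffle(frames)
    return [x for frame in frames for x in frame]
```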
The source component may have a frequency distribution that differs from one direction to another. Such variations may arise from task T100 (e.g., from the operation of applying a source spatially directive filter to generate the source component). Such variations may also arise from the response of the audio output stage and/or loudspeaker array. It may be desirable to produce the masking component according to an estimation of frequency- and direction-dependent variations in the source component.
Task T200 may be implemented to produce a map of estimated intensity of the source component across a range of spatial directions relative to the array, and to produce the masking signal based on this map. It may also be desirable for the map to indicate changes in the estimated intensity across a range of frequencies. Such a map may be implemented to have a desired resolution in the frequency and direction domains. In the direction domain, for example, the map may have a resolution of five, ten, twenty, or thirty degrees over a 180-degree range. In the frequency domain, the map may have a set of direction-dependent values for each subband.
Task TC150 may be implemented to calculate the masking target levels according to such a map of estimated intensity of the source component.
Task TC200 may be implemented to use the masking target levels to select and/or to shape the noise signal. In a frequency-domain implementation, task TC200 may select a different noise signal for each of two or more (possibly all) of the subbands. For example, such an implementation of task TC200 may select, from among a plurality of noise signals or patterns, the signal or pattern that best matches the masking target levels for the subband (e.g., in a least-squares-error sense). In a time-domain implementation, task TC200 may select the masking spatially directive filter from among two or more different pre-calculated filters. For example, such an implementation of task TC200 may use the masking target levels to select a suitable masking spatially directive filter, and then to select and/or filter the noise signal to reduce remaining differences between the masking target levels and the response of the selected filter. In either domain, task TC200 may also be implemented to select a different masking spatially selective filter for each of two or more (possibly all) of the subbands, based on a best match (e.g., in a least-squares-error sense) between an estimated response of the filter and the masking target levels for the corresponding subband or subbands.
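A selection in a least-squares-error sense may be sketched, for example, as follows. The dictionary layout (pattern name mapped to a list of per-subband levels) and the example pattern names and values are hypothetical.

```python
def best_matching_pattern(target_levels, patterns):
    """Return the name of the pattern whose subband levels best match the targets.

    patterns: dict mapping a pattern name to a list of per-subband levels
              (same length and units as target_levels).
    """
    def squared_error(levels):
        return sum((t - v) ** 2 for t, v in zip(target_levels, levels))

    return min(patterns, key=lambda name: squared_error(patterns[name]))


# Hypothetical usage with three subbands (levels in decibels):
example_patterns = {
    "white": [55.0, 55.0, 55.0],
    "street": [61.0, 56.0, 49.0],
    "babble": [50.0, 60.0, 40.0],
}
selected = best_matching_pattern([60.0, 55.0, 50.0], example_patterns)
```

The same comparison may be applied to the estimated responses of candidate masking spatially directive filters in place of noise patterns.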
Method M100 may be used in any of a wide variety of different applications. For example, method M100 may be used to reproduce the far-end communications signal in a two-way voice communication, such as a telephone call. In such a case, a primary concern may be to protect the privacy of the user (e.g., by obscuring the sidelobes of the source component).
It may be desirable for the device to activate a privacy masking mode in response to an incoming and/or an outgoing telephone call. Such a device may be implemented such that when the user is in a private phone call, the input source signal is assumed to be a sparse speech signal (e.g., sparse in time and frequency) carrying an important message. In such a case, task T200 may be configured to generate a masking signal whose spectrum is complementary to the spectrum of the input source signal (e.g., just enough noise to fill in spectral valleys of the speech itself), so that nearby people in the dark zone hear a “white” spectrum of sound, and the privacy of the user is protected. In an alternative phone-call scenario, task T200 generates the masking signal as babble noise whose level is just enough to satisfy the masking frequency profile (e.g., the subband masking thresholds).
In another use case, the device is used to reproduce a recorded or streamed media signal, such as a music file, a broadcast audio or video presentation (e.g., radio or television), or a movie or video clip streamed over the Internet. In this case, privacy may be less important, and it may be desirable for the device to operate in a polite masking mode. For example, it may be desirable to configure task T200 such that the combined sound field will be less distracting to a bystander than the unmasked source component by itself (e.g., by having a substantially constant level over time in the direction of the masking component). A media signal may have a greater dynamic range and/or may be less sparse over time than a voice communications signal. Processing delays may also be less problematic for a media signal than for a voice communications signal.
Method M100 may also be implemented to drive a loudspeaker array to generate a sound field that includes more than one source component.
In one example of a multi-source use case, method M100 is implemented to generate source components that include the same audio content in different natural (e.g., spoken) languages. Typical applications for such a system include public address and/or video billboard installations in public spaces, such as an airport or railway station or another situation in which a multilingual presentation may be desired. For example, such a case may be implemented so that the same video content on a display screen is visible to each of two or more users, with the loudspeaker array being driven to provide the same accompanying audio content in different languages (e.g., two or more of English, Spanish, Chinese, Korean, French, etc.) at different respective viewing angles. Presentation of a video program with simultaneous presentation of the accompanying audio content in two or more languages may also be desirable in smaller settings, such as a home or office.
In another example of a multi-source use case, method M100 is implemented to generate source components having unrelated audio content into different respective directions. For example, each of two or more of the source components may carry far-end audio content for a different voice communication (e.g., telephone call). Alternatively or additionally, each of two or more of the source components may include an audio track for a different respective media reproduction (e.g., music, video program, etc.).
For a case in which different source components are associated with different video content, it may be desirable to display such content on multiple display screens and/or with a multiview-capable display screen. One example of a multiview-capable display screen is configured to display each of the video programs using a different light polarization (e.g., orthogonal linear polarizations, or circular polarizations of opposite handedness), and each viewer wears a set of goggles that is configured to pass light having the polarization of the desired video program and to block light having other polarizations. In another example of a multiview-capable display screen, a different video program is visible at each of two or more viewing angles. In such a case, method M100 may be implemented to direct the source component for each of the different video programs in the direction of the corresponding viewing angle.
In a further example of a multi-source use case, method M100 is implemented to generate two or more source components that include the same audio content in different natural (e.g., spoken) languages and at least one additional source component having unrelated audio content (e.g., for another media reproduction and/or for a voice communication).
For a case in which multiple source signals are supported, each source component may be oriented in a respective direction that is fixed (e.g., selected, by a user or automatically, from among two or more fixed options), as described herein with reference to task T100. Alternatively, each of at least one (possibly all) of the source components may be oriented in a respective direction that may vary over time in response to changes in an estimated direction of a corresponding user. Typically it is desirable to implement independent direction control for each source, such that each source component or beam is steered independently of the other(s) (e.g., by a corresponding instance of task T100).
In a typical multi-source application, it may be desirable to provide about thirty or forty to sixty degrees of separation between the directions of orientation of adjacent source components. One typical application is to provide different respective source components to each of two or more users who are seated shoulder-to-shoulder (e.g., on a couch) in front of the loudspeaker array. At a typical viewing distance of 1.5 to 2.5 meters, the span occupied by a viewer is about thirty degrees. With an array of four microphones, a resolution of about fifteen degrees may be possible. With an array having more microphones, a narrower beam may be obtained.
As for a single-source case, privacy may be a concern for multi-source cases, especially if at least one of the source signals is a far-end voice communication (e.g., a telephone call). For a typical multiple-source case, however, leakage of one source component to another may be a greater concern, as each source component is potentially an interferer to other source components being produced at the same time. Accordingly, it may be desirable to generate a source component to have a null in the direction of another source component. For example, each source beam may be directed to a respective user, with a corresponding null being generated in the direction of each of one or more other users. Such a design will typically have to cope with a “waterbed” effect, as the energy suppressed by creating a null on one side of a beam is likely to re-emerge as a sidelobe on the other side. The beam and null (or nulls) of a source component may be designed together or separately. It may be desirable to direct two or more narrow nulls of a source component next to each other to obtain a broader null.
In a multiple-source application, it may be desirable for the system to treat any source component as a masker to other source components being generated at the same time. In one example, the levels and/or spectral equalizations of each source signal are dynamically adjusted according to the signal contents, so that the corresponding source component functions as a good masker to other source components.
In a multi-source case, method M100 may be implemented to combine beamforming (and possibly nullforming) of the source signals with generation of one or more masking components. Such a masking component may be designed according to the spatial distributions of the source component or components to be masked, and it may be desirable to design the masking component or components to minimize disturbance to bystanders and/or users enjoying other source components at adjacent locations.
As shown in
It may be desirable to implement method M100 to adapt the direction of the source component, and/or the direction of the masking component, in response to changes in the location of the user. For a multiple-user case, it may be desirable to implement method M100 to perform such adaptation individually for each of two or more users. In order to determine the respective source and/or masking directions, such a method may be implemented to perform user tracking.
Additionally or in the alternative, task T500 may be configured to perform passive tracking by applying a multi-microphone speech tracking algorithm to a multichannel sound signal produced by a microphone array (e.g., in response to sound emitted by the user or users). Examples of multi-microphone approaches to localization of one or more sound sources include directionally selective filtering operations, such as beamforming (e.g., filtering a sensed multichannel signal in parallel with several beamforming filters that are each fixed in a different direction, and comparing the filter outputs to identify the direction of arrival of the speech), blind source separation (e.g., independent component analysis, independent vector analysis, and/or a constrained implementation of such a technique), and estimating direction-of-arrival by comparing differences in level and/or phase between a pair of channels of the multichannel microphone signal. Such a task may include performing an echo cancellation operation on the multichannel microphone signal to block sound components that were produced by the loudspeaker array and/or performing a voice recognition operation on at least one channel of the multichannel microphone signal.
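As one hedged illustration of the last of these approaches, a direction of arrival may be estimated from the phase difference between a pair of microphone channels at a single frequency. The sketch below assumes a far-field plane wave, a microphone spacing smaller than half a wavelength (so that the phase difference is not wrapped), a speed of sound of 343 m/s, and a sign convention in which a positive phase difference corresponds to a positive angle from broadside; all names are hypothetical.

```python
import math


def doa_from_phase_difference(phase_diff_rad, freq_hz, spacing_m, c=343.0):
    """Estimate the direction of arrival, in degrees from broadside.

    Uses sin(theta) = c * phase_diff / (2 * pi * f * d) for a far-field source.
    """
    s = c * phase_diff_rad / (2.0 * math.pi * freq_hz * spacing_m)
    s = max(-1.0, min(1.0, s))  # guard against small numerical overshoot
    return math.degrees(math.asin(s))


def phase_difference_for_angle(angle_deg, freq_hz, spacing_m, c=343.0):
    """Inverse relation, used here only to construct a test value."""
    return 2.0 * math.pi * freq_hz * spacing_m * math.sin(math.radians(angle_deg)) / c
```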
For accurate tracking results, it may be desirable for the microphone array (or other sensing device) to be aligned in space with the loudspeaker array in a reciprocal arrangement. In an ideally reciprocal arrangement, the direction to a point source P as indicated by a sensing device (e.g., a microphone array and associated tracking logic) is the same as the source direction used to direct a beam from the loudspeaker array to the point source P. A reciprocal arrangement may be used to create the privacy zones (e.g., by beamforming and nullforming) at the actual locations of the users. If the sensing and emitting arrays are not arranged reciprocally, the accuracy of creating a beam or null for designated source locations may be unacceptable. The quality of the null especially may suffer from such a mismatch, as a nullforming operation typically requires a higher level of accuracy than a comparable beamforming operation.
With an array of many microphones, a narrow beam may be produced. With a four-microphone array, for example, a resolution of about fifteen degrees is possible. For a typical television viewing distance of two meters, a span of fifteen degrees corresponds to a shoulder-to-shoulder width, and a span of thirty degrees corresponds to a typical angle between the directions of adjacent users seated on a couch. A typical application is to provide forty to sixty degrees between the directions of adjacent source beams.
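The relation between an angular span and the corresponding lateral width at a given distance may be checked with a short calculation such as the following, in which the span is assumed to be centered on the array axis and the function name is hypothetical.

```python
import math


def span_width_m(distance_m, span_deg):
    """Lateral width subtended by an angular span centered on the array axis."""
    return 2.0 * distance_m * math.tan(math.radians(span_deg) / 2.0)


width_15 = span_width_m(2.0, 15.0)  # about 0.53 meters at two meters
width_30 = span_width_m(2.0, 30.0)  # about 1.07 meters at two meters
```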
It may be desirable to direct two or more narrow nulls together to obtain a broad null. The beam and nulls may be designed together or separately. Such a design will typically have to cope with a “waterbed” effect, as creating a null on one side is likely to create a sidelobe on the other side.
As described above, it may be desirable to implement method M100 to support privacy zones for multiple listeners. In such an implementation of method M140, task T500 may be implemented to track multiple users. Multiple source beams may be directed to respective users, with corresponding nulls being generated in other user directions.
Any beamforming method may be used to estimate the direction of each of one or more users as described above. For example, a reciprocal implementation of a method used to generate the source and/or masking components may be applied.
For a one-dimensional (1-D) array of microphones, a direction of arrival (DOA) for a source may be easily defined in a range of, for example, −90° to 90°. For an array that includes more than two microphones at arbitrary relative locations (e.g., a non-coaxial array), it may be desirable to use a straightforward extension of one-dimensional principles as described above, e.g., (θ1, θ2) in a two-pair case in two dimensions; (θ1, θ2, θ3) in a three-pair case in three dimensions, etc. A key problem is how to apply spatial filtering to such a combination of paired 1-D DOA estimates.
We may apply a beamformer/null beamformer (BFNF) as shown in
As the approach shown in
where lp indicates the distance between the microphones of pair p (reciprocally, between a pair of loudspeakers), w indicates the frequency bin number, and fs indicates the sampling frequency.
A method as described herein (e.g., method M100) may be combined with automatic speech recognition (ASR) for system control. Such a control may support different functions (e.g., control of television and/or telephone functions) for different users. The method may be configured, for example, to use an embedded speech recognition engine to create a privacy zone whenever an activation code is uttered (e.g., a particular phrase, such as “Qualcomm voice”).
In a typical use scenario as shown in
In a similar manner, the system may be configured to enter a masking mode in response to a corresponding activation code. It may be desirable to implement the system to adapt its masking behavior to the current operating mode (e.g., to perform privacy zone generation for phone functions, and to perform environmentally-friendly masking for media functions). In a multiuser case, the system may create the source and masking components in response to the activation code and the direction from which the code is received, as in the following three-user example:
During generation of the privacy zone for user 1, a second user may prompt the system to create a second privacy zone as shown in
During generation of the privacy zones for users 1 and 2, a third user may prompt the system to create another privacy zone as shown in
Signal analyzer 400 calculates an estimated intensity of the source component. Signal analyzer 400 may be implemented (e.g., as described herein with reference to tasks T400 and TA110) to calculate the estimated intensity in different directions, and in different frequency subbands, to produce a frequency-dependent spatial intensity map (e.g., as shown in
Apparatus A130A also includes a target level calculator C150 configured to calculate a masking target level (e.g., an effective masking threshold) for each of a plurality of frequency bins or subbands over a desired masking frequency range, based on the estimated intensity of the source component (e.g., as described herein with reference to task TC150). Calculator C150 may be implemented, for example, to produce a reference map that indicates a desired masking level for each direction and frequency (e.g., as shown in
Apparatus A130A also includes an implementation 230 of masking signal generator 200. Generator 230 is configured to generate a directional masking signal, based on the masking target levels produced by target level calculator C150, that includes a null beam in the source direction (e.g., as described herein with reference to tasks TC200 and TA300).
Masking spatially directive filter 300A is configured to filter the modified noise signal to produce a multichannel masking signal that has a null in the source direction (e.g., as described herein with reference to task TA300). Masking signal generator 230 (e.g., generator 230B) may be implemented to select filter 300A from among two or more spatially directive filters according to the desired null direction (e.g., the source direction). Additionally or alternatively, such a generator may be implemented to select a different masking spatially selective filter for each of two or more (possibly all) of the subbands, based on a best match (e.g., in a least-squares-error sense) between an estimated response of the filter and the masking target levels for the corresponding subband or subbands.
Audio output stage 300 is configured to mix the multichannel source and masking signals to produce a plurality of driving signals SD10-1 to SD10-N (e.g., as described herein with reference to tasks T300 and T310). Audio output stage 300 may be implemented to perform such mixing in the digital domain or in the analog domain. For example, audio output stage 300 may be configured to produce a driving signal for each loudspeaker channel by converting digital source and masking signals to analog, or by converting a digital mixed signal to analog. Audio output stage 300 may also be configured to amplify, apply a gain to, and/or control a gain of the source signal; to filter the source and/or masking signals; to provide impedance matching to the loudspeakers of the array; and/or to perform any other desired audio processing operation.
Noise selector 650 is configured to select an appropriate type of noise signal or pattern (e.g., speech, music, babble noise, street noise, car interior noise, white noise) based on the source characteristics. For example, noise selector 650 may be implemented to select, from among a plurality of noise signals or patterns in database 700, the signal or pattern that best matches the source characteristics (e.g., in a least-squares-error sense). Database 700 is configured to produce (e.g., to synthesize or reproduce) a noise signal according to the selected noise signal or pattern indicated by noise selector 650.
In this case, it may be desirable to configure target level calculator C150 to calculate the masking target levels based on information about the selected noise signal or pattern (e.g., the energy spectrum of the selected noise signal). For example, target level calculator C150 may be configured to produce the target levels according to characteristics, such as changes over time in the energy spectrum of the selected masking signal (e.g., over several frames) and/or harmonicity of the selected masking signal, that distinguish the selected noise signal from one or more other entries in database 700 having similar time-average energy spectra. In apparatus A130B, masking signal generator 230 (e.g., generator 230B) is arranged to produce the directional masking signal by modifying, according to the masking target levels, the noise signal produced by database 700.
Any among apparatus A130, A130A, A130B, and A140 may also be realized as an implementation of apparatus A102 (e.g., such that audio output stage 300 is implemented as audio output stage 310 to drive array LA100). Additionally or alternatively, any among apparatus A130, A130A, and A130B may be realized as an implementation of apparatus A140 (e.g., including an instance of direction estimator 500).
Each of the microphones for direction estimation as discussed herein (e.g., with reference to location and tracking of one or more users) may have a response that is omnidirectional, bidirectional, or unidirectional (e.g., cardioid). The various types of microphones that may be used include (without limitation) piezoelectric microphones, dynamic microphones, and electret microphones. It is expressly noted that the microphones may be implemented more generally as transducers sensitive to radiations or emissions other than sound. In one such example, the microphone array is implemented to include one or more ultrasonic transducers (e.g., transducers sensitive to acoustic frequencies greater than fifteen, twenty, twenty-five, thirty, forty, or fifty kilohertz or more).
Apparatus A100 and apparatus MF100 may be implemented as a combination of hardware (e.g., a processor) with software and/or with firmware. Such apparatus may also include an audio preprocessing stage AP10 as shown in
It may be desirable for audio preprocessing stage AP10 to produce each microphone signal as a digital signal, that is to say, as a sequence of samples. Audio preprocessing stage AP20, for example, includes analog-to-digital converters (ADCs) C10a, C10b, and C10c that are each arranged to sample the corresponding analog signal. Typical sampling rates for acoustic applications include 8 kHz, 12 kHz, 16 kHz, and other frequencies in the range of from about 8 to about 16 kHz, although sampling rates as high as about 44.1, 48, or 192 kHz may also be used. Typically, converters C10a, C10b, and C10c will be configured to sample each signal at the same rate.
In this example, audio preprocessing stage AP20 also includes digital preprocessing stages P20a, P20b, and P20c that are each configured to perform one or more preprocessing operations (e.g., spectral shaping) on the corresponding digitized channel to produce a corresponding one of a left microphone signal AL10, a center microphone signal AC10, and a right microphone signal AR10 for input to task T500 or direction estimator 500. Typically, stages P20a, P20b, and P20c will be configured to perform the same functions on each signal. It is also noted that preprocessing stage AP10 may be configured to produce a different version of a signal from at least one of the microphones (e.g., at a different sampling rate and/or with different spectral shaping) for content use, such as to provide a near-end speech signal in a voice communication (e.g., a telephone call). Although
Loudspeaker array LA100 may include cone-type and/or rectangular loudspeakers. The spacings between adjacent loudspeakers may be uniform or nonuniform, and the array may be linear or nonlinear. As noted above, techniques for generating the multichannel signals for driving the array may include pairwise BFNF and MVDR.
When beamforming techniques are used to produce spatial patterns for broadband signals, selection of the transducer array geometry involves a trade-off between low and high frequencies. To enhance the direct handling of low frequencies by the beamformer, a larger loudspeaker spacing is preferred. At the same time, if the spacing between loudspeakers is too large, the ability of the array to reproduce the desired effects at high frequencies will be limited by a lower aliasing threshold. To avoid spatial aliasing, the wavelength of the highest frequency component to be reproduced by the array should be greater than twice the distance between adjacent loudspeakers.
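The spatial aliasing condition stated above (wavelength of the highest frequency greater than twice the loudspeaker spacing) may be expressed, for example, as the following calculation. The speed of sound of 343 m/s and the function names are assumptions of the sketch.

```python
def max_alias_free_frequency_hz(spacing_m, c=343.0):
    """Frequency whose wavelength equals twice the spacing; stay below it."""
    return c / (2.0 * spacing_m)


def max_spacing_m(highest_freq_hz, c=343.0):
    """Spacing at which the wavelength of highest_freq_hz equals twice the spacing."""
    return c / (2.0 * highest_freq_hz)


limit_hz = max_alias_free_frequency_hz(0.02)  # 8575 Hz for a 2 cm spacing
```

A smaller spacing raises this limit but reduces the aperture available for directing low frequencies, which is the trade-off noted above.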
As consumer devices become smaller and smaller, the form factor may constrain the placement of loudspeaker arrays. For example, it may be desirable for a laptop, netbook, or tablet computer or a high-definition video display to have a built-in loudspeaker array. Due to the size constraints, the loudspeakers may be small and unable to reproduce a desired bass region. Alternatively, the loudspeakers may be large enough to reproduce the bass region but spaced too closely to support beamforming or other acoustic imaging. Thus it may be desirable to provide the processing to produce a bass signal in a closely spaced loudspeaker array in which beamforming is employed.
It is expressly noted that the principles described herein are not limited to use with a uniform linear array of loudspeakers.
Although particular examples of directional masking in a range of 180 degrees are shown, the principles described herein may be extended to provide directional masking across any desired angular range in a plane (e.g., a two-dimensional range). Such extension may include the addition of appropriately placed loudspeakers to the array.
Such principles may also be extended to provide directional masking across any desired angular range in space (3D).
A psychoacoustic phenomenon exists that listening to higher harmonics of a signal may create a perceptual illusion of hearing the missing fundamentals. Thus, one way to achieve a sensation of bass components from small loudspeakers is to generate higher harmonics from the bass components and play back the harmonics instead of the actual bass components. Descriptions of algorithms for substituting higher harmonics to achieve a psychoacoustic sensation of bass without an actual low-frequency signal presence (also called “psychoacoustic bass enhancement” or PBE) may be found, for example, in U.S. Pat. No. 5,930,373 (Shashoua et al., issued Jul. 27, 1999) and U.S. Publ. Pat. Appls. Nos. 2006/0159283 A1 (Mathew et al., published Jul. 20, 2006), 2009/0147963 A1 (Smith, published Jun. 11, 2009), and 2010/0158272 A1 (Vickers, published Jun. 24, 2010). Such enhancement may be particularly useful for reproducing low-frequency sounds with devices that have form factors which restrict the integrated loudspeaker or loudspeakers to be physically small. For example, task T300 may be implemented to perform PBE to produce the driving signals that drive the array of loudspeakers to produce the combined sound field.
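The harmonic-generation step of PBE can be illustrated with a short sketch. This is not the algorithm of any of the references cited above: the polynomial nonlinearity, its coefficients `a2` and `a3`, and the function names are assumptions chosen for illustration. A complete PBE chain would typically also isolate the bass band beforehand and filter out the original fundamental afterward; both steps are omitted here.

```python
import math


def generate_harmonics(bass, a2=0.5, a3=0.5):
    """Apply a memoryless polynomial nonlinearity to a bass-band signal.

    For a pure tone at f0, the squared term contributes a component at
    2*f0 (plus a DC offset) and the cubed term contributes a component
    at 3*f0 (plus some energy at f0 itself).
    Coefficients a2 and a3 are illustrative assumptions.
    """
    shaped = [a2 * x * x + a3 * x * x * x for x in bass]
    dc = sum(shaped) / len(shaped)
    return [y - dc for y in shaped]  # remove the DC offset from the squared term


def tone_magnitude(signal, freq_hz, fs_hz):
    """Amplitude of the component at freq_hz, estimated with a single-bin DFT."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / fs_hz)
             for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / fs_hz)
             for i, s in enumerate(signal))
    return 2.0 * math.hypot(re, im) / n


if __name__ == "__main__":
    fs = 8000   # sampling rate in Hz (assumed)
    f0 = 50.0   # bass fundamental in Hz (assumed)
    bass = [math.sin(2 * math.pi * f0 * i / fs) for i in range(fs)]  # one second
    harmonics = generate_harmonics(bass)
    for f in (f0, 2 * f0, 3 * f0):
        print(f, tone_magnitude(bass, f, fs), tone_magnitude(harmonics, f, fs))
```

For a unit-amplitude 50 Hz tone, the input has no energy at 100 Hz or 150 Hz, while the processed signal contains components of about 0.25 at 100 Hz and 0.125 at 150 Hz. These are harmonics that a small loudspeaker may reproduce more readily than the 50 Hz fundamental.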
It may be desirable to apply PBE not only to reduce the effect of low-frequency reproducibility limits, but also to reduce the effect of directivity loss at low frequencies. For example, it may be desirable to combine PBE with spatially directive filtering (e.g., beamforming) to create the perception of low-frequency content in a range that is steerable by a beamformer. In one example, any of the implementations of task T100 as described herein is modified to perform PBE on the source signal and to produce the multichannel source signal from the PBE-processed source signal. In the same example or in an alternative example, any of the implementations of task T200 as described herein is modified to perform PBE on the masking signal and to produce the multichannel masking signal from the PBE-processed masking signal.
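The second half of this combination, producing a multichannel signal from an already PBE-processed signal, can be sketched as follows. The sketch substitutes a simple delay-based steering scheme for a uniform linear array. This is an assumption made for brevity and is not the beamformer design of this disclosure. The function names, the whole-sample rounding, and the 343 m/s speed of sound are likewise illustrative.

```python
import math


def steering_delays(num_speakers, spacing_m, angle_deg, fs_hz, c=343.0):
    """Per-loudspeaker delays, in whole samples, for a uniform linear array.

    angle_deg is measured from broadside; positive angles steer toward the
    higher-indexed end of the array. Delays are shifted so the smallest is
    zero. Rounding to whole samples is a simplification of this sketch.
    """
    step_s = spacing_m * math.sin(math.radians(angle_deg)) / c
    raw = [n * step_s for n in range(num_speakers)]
    earliest = min(raw)
    return [round((t - earliest) * fs_hz) for t in raw]


def to_multichannel(mono, delays):
    """Produce one delayed copy of the signal per loudspeaker, all equal length."""
    length = len(mono) + max(delays)
    return [
        [0.0] * d + list(mono) + [0.0] * (length - len(mono) - d)
        for d in delays
    ]


if __name__ == "__main__":
    pbe_signal = [1.0, 0.5, -0.5, -1.0]  # stand-in for a PBE-processed signal
    delays = steering_delays(num_speakers=4, spacing_m=0.0343,
                             angle_deg=90.0, fs_hz=10000)
    for channel in to_multichannel(pbe_signal, delays):
        print(channel)
```

Because PBE shifts the perceptually relevant content upward in frequency, the steering stage operates mainly on components at which a closely spaced array retains directivity.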
The use of a loudspeaker array to produce directional beams from an enhanced signal results in an output that has a much lower perceived frequency range than an output from the audio signal without such enhancement. Additionally, it becomes possible to use a more relaxed beamformer design to steer the enhanced signal, which may support a reduction of artifacts and/or computational complexity and allow more efficient steering of bass components with arrays of small loudspeakers. At the same time, such a system can protect small loudspeakers from damage by low-frequency signals (e.g., rumble). Additional description of such enhancement techniques, which may be combined with directional masking as described herein, may be found in, e.g., U.S. patent application Ser. No. 13/190,464, entitled “SYSTEMS, METHODS, AND APPARATUS FOR ENHANCED ACOUSTIC IMAGING” (filed Jul. 25, 2011).
The methods and apparatus disclosed herein may be applied generally in any transceiving and/or audio sensing application, including mobile or otherwise portable instances of such applications and/or sensing of signal components from far-field sources. For example, the range of configurations disclosed herein includes communications devices that reside in a wireless telephony communication system configured to employ a code-division multiple-access (CDMA) over-the-air interface. Nevertheless, it would be understood by those skilled in the art that a method and apparatus having features as described herein may reside in any of the various communication systems employing a wide range of technologies known to those of skill in the art, such as systems employing Voice over IP (VoIP) over wired and/or wireless (e.g., CDMA, TDMA, FDMA, and/or TD-SCDMA) transmission channels.
It is expressly contemplated and hereby disclosed that communications devices disclosed herein may be adapted for use in networks that are packet-switched (for example, wired and/or wireless networks arranged to carry audio transmissions according to protocols such as VoIP) and/or circuit-switched. It is also expressly contemplated and hereby disclosed that communications devices disclosed herein may be adapted for use in narrowband coding systems (e.g., systems that encode an audio frequency range of about four or five kilohertz) and/or for use in wideband coding systems (e.g., systems that encode audio frequencies greater than five kilohertz), including whole-band wideband coding systems and split-band wideband coding systems.
The foregoing presentation of the described configurations is provided to enable any person skilled in the art to make or use the methods and other structures disclosed herein. The flowcharts, block diagrams, and other structures shown and described herein are examples only, and other variants of these structures are also within the scope of the disclosure. Various modifications to these configurations are possible, and the generic principles presented herein may be applied to other configurations as well. Thus, the present disclosure is not intended to be limited to the configurations shown above but rather is to be accorded the widest scope consistent with the principles and novel features disclosed in any fashion herein, including in the attached claims as filed, which form a part of the original disclosure.
Those of skill in the art will understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, and symbols that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Important design requirements for implementation of a configuration as disclosed herein may include minimizing processing delay and/or computational complexity (typically measured in millions of instructions per second or MIPS), especially for computation-intensive applications, such as playback of compressed audio or audiovisual information (e.g., a file or stream encoded according to a compression format, such as one of the examples identified herein) or applications for wideband communications (e.g., voice communications at sampling rates higher than eight kilohertz, such as 12, 16, 32, 44.1, 48, or 192 kHz).
Goals of a multi-microphone processing system may include achieving ten to twelve dB in overall noise reduction, preserving voice level and color during movement of a desired speaker, obtaining a perception that the noise has been moved into the background instead of an aggressive noise removal, dereverberation of speech, and/or enabling the option of post-processing for more aggressive noise reduction.
An apparatus as disclosed herein (e.g., any among apparatus A100, A102, A130, A130A, A130B, A140, MF100, MF102, MF130, and MF140) may be implemented in any combination of hardware with software, and/or with firmware, that is deemed suitable for the intended application. For example, the elements of such an apparatus may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of the elements of the apparatus may be implemented within the same array or arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips).
One or more elements of the various implementations of the apparatus disclosed herein may also be implemented in whole or in part as one or more sets of instructions arranged to execute on one or more fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, digital signal processors, FPGAs (field-programmable gate arrays), ASSPs (application-specific standard products), and ASICs (application-specific integrated circuits). Any of the various elements of an implementation of an apparatus as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions, also called “processors”), and any two or more, or even all, of these elements may be implemented within the same such computer or computers.
A processor or other means for processing as disclosed herein may be fabricated as one or more electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips). Examples of such arrays include fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, DSPs, FPGAs, ASSPs, and ASICs. A processor or other means for processing as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions) or other processors. It is possible for a processor as described herein to be used to perform tasks or execute other sets of instructions that are not directly related to a directional sound masking procedure as described herein, such as a task relating to another operation of a device or system in which the processor is embedded (e.g., an audio sensing device). It is also possible for part of a method as disclosed herein to be performed by a processor of the audio sensing device and for another part of the method to be performed under the control of one or more other processors.
Those of skill will appreciate that the various illustrative modules, logical blocks, circuits, and tests and other operations described in connection with the configurations disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. Such modules, logical blocks, circuits, and operations may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an ASIC or ASSP, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to produce the configuration as disclosed herein. For example, such a configuration may be implemented at least in part as a hard-wired circuit, as a circuit configuration fabricated into an application-specific integrated circuit, or as a firmware program loaded into non-volatile storage or a software program loaded from or into a data storage medium as machine-readable code, such code being instructions executable by an array of logic elements such as a general purpose processor or other digital signal processing unit. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. A software module may reside in a non-transitory storage medium such as RAM (random-access memory), ROM (read-only memory), nonvolatile RAM (NVRAM) such as flash RAM, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, hard disk, a removable disk, or a CD-ROM; or in any other form of storage medium known in the art. 
An illustrative storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
It is noted that the various methods disclosed herein (e.g., any among methods M100, M102, M110, M120, M130, M140, and other methods disclosed by way of description of the operation of the various apparatus described herein) may be performed by an array of logic elements such as a processor, and that the various elements of an apparatus as described herein may be implemented as modules designed to execute on such an array. As used herein, the term “module” or “sub-module” can refer to any method, apparatus, device, unit or computer-readable data storage medium that includes computer instructions (e.g., logical expressions) in software, hardware or firmware form. It is to be understood that multiple modules or systems can be combined into one module or system and one module or system can be separated into multiple modules or systems to perform the same functions. When implemented in software or other computer-executable instructions, the elements of a process are essentially the code segments to perform the related tasks, such as with routines, programs, objects, components, data structures, and the like. The term “software” should be understood to include source code, assembly language code, machine code, binary code, firmware, macrocode, microcode, any one or more sets or sequences of instructions executable by an array of logic elements, and any combination of such examples. The program or code segments can be stored in a processor-readable storage medium or transmitted by a computer data signal embodied in a carrier wave over a transmission medium or communication link.
The implementations of methods, schemes, and techniques disclosed herein may also be tangibly embodied (for example, in tangible, computer-readable features of one or more computer-readable storage media as listed herein) as one or more sets of instructions readable and/or executable by a machine including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The term “computer-readable medium” may include any medium that can store or transfer information, including volatile, nonvolatile, removable and non-removable media. Examples of a computer-readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an erasable ROM (EROM), a floppy diskette or other magnetic storage, a CD-ROM/DVD or other optical storage, a hard disk, a fiber optic medium, a radio frequency (RF) link, or any other medium which can be used to store the desired information and which can be accessed. The computer data signal may include any signal that can propagate over a transmission medium such as electronic network channels, optical fibers, air, electromagnetic, RF links, etc. The code segments may be downloaded via computer networks such as the Internet or an intranet. In any case, the scope of the present disclosure should not be construed as limited by such embodiments.
Each of the tasks of the methods described herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. In a typical application of an implementation of a method as disclosed herein, an array of logic elements (e.g., logic gates) is configured to perform one, more than one, or even all of the various tasks of the method. One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions), embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other nonvolatile memory cards, semiconductor memory chips, etc.), that is readable and/or executable by a machine (e.g., a computer) including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The tasks of an implementation of a method as disclosed herein may also be performed by more than one such array or machine. In these or other implementations, the tasks may be performed within a device for wireless communications such as a cellular telephone or other device having such communications capability. Such a device may be configured to communicate with circuit-switched and/or packet-switched networks (e.g., using one or more protocols such as VoIP). For example, such a device may include RF circuitry configured to receive and/or transmit encoded frames.
It is expressly disclosed that the various methods disclosed herein may be performed by a portable communications device such as a handset, headset, or portable digital assistant (PDA), and that the various apparatus described herein may be included within such a device. A typical real-time (e.g., online) application is a telephone conversation conducted using such a mobile device.
In one or more exemplary embodiments, the operations described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, such operations may be stored on or transmitted over a computer-readable medium as one or more instructions or code. The term “computer-readable media” includes both computer-readable storage media and communication (e.g., transmission) media. By way of example, and not limitation, computer-readable storage media can comprise an array of storage elements, such as semiconductor memory (which may include without limitation dynamic or static RAM, ROM, EEPROM, and/or flash RAM), or ferroelectric, magnetoresistive, ovonic, polymeric, or phase-change memory; CD-ROM or other optical disk storage; and/or magnetic disk storage or other magnetic storage devices. Such storage media may store information in the form of instructions or data structures that can be accessed by a computer. Communication media can comprise any medium that can be used to carry desired program code in the form of instructions or data structures and that can be accessed by a computer, including any medium that facilitates transfer of a computer program from one place to another. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technology such as infrared, radio, and/or microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technology such as infrared, radio, and/or microwave are included in the definition of medium. 
Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray Disc™ (Blu-Ray Disc Association, Universal City, Calif.), where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
An acoustic signal processing apparatus as described herein (e.g., any among apparatus A100, A102, A130, A130A, A130B, A140, MF100, MF102, MF130, and MF140) may be incorporated into an electronic device that accepts speech input in order to control certain operations, or may otherwise benefit from separation of desired noises from background noises, such as communications devices. Many applications may benefit from enhancing or separating clear desired sound from background sounds originating from multiple directions. Such applications may include human-machine interfaces in electronic or computing devices which incorporate capabilities such as voice recognition and detection, speech enhancement and separation, voice-activated control, and the like. It may be desirable to implement such an acoustic signal processing apparatus to be suitable in devices that only provide limited processing capabilities.
The elements of the various implementations of the modules, elements, and devices described herein may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or gates. One or more elements of the various implementations of the apparatus described herein may also be implemented in whole or in part as one or more sets of instructions arranged to execute on one or more fixed or programmable arrays of logic elements such as microprocessors, embedded processors, IP cores, digital signal processors, FPGAs, ASSPs, and ASICs.
It is possible for one or more elements of an implementation of an apparatus as described herein to be used to perform tasks or execute other sets of instructions that are not directly related to an operation of the apparatus, such as a task relating to another operation of a device or system in which the apparatus is embedded. It is also possible for one or more elements of an implementation of such an apparatus to have structure in common (e.g., a processor used to execute portions of code corresponding to different elements at different times, a set of instructions executed to perform tasks corresponding to different elements at different times, or an arrangement of electronic and/or optical devices performing operations for different elements at different times).
The present application for patent claims priority to Provisional Application No. 61/616,836, entitled “SYSTEMS, METHODS, AND APPARATUS FOR PRODUCING A DIRECTIONAL SOUND FIELD,” filed Mar. 28, 2012, and assigned to the assignee hereof. The present application for patent claims priority to Provisional Application No. 61/619,202, entitled “SYSTEMS, METHODS, APPARATUS, AND COMPUTER-READABLE MEDIA FOR GESTURAL MANIPULATION OF A SOUND FIELD,” filed Apr. 2, 2012, and assigned to the assignee hereof. The present application for patent claims priority to Provisional Application No. 61/666,196, entitled “SYSTEMS, METHODS, APPARATUS, AND COMPUTER-READABLE MEDIA FOR GENERATING CORRELATED MASKING SIGNAL,” filed Jun. 29, 2012, and assigned to the assignee hereof. The present application for patent claims priority to Provisional Application No. 61/741,782, entitled “SYSTEMS, METHODS, AND APPARATUS FOR PRODUCING A DIRECTIONAL SOUND FIELD,” filed Oct. 31, 2012, and assigned to the assignee hereof. The present application for patent claims priority to Provisional Application No. 61/733,696, entitled “SYSTEMS, METHODS, AND APPARATUS FOR PRODUCING A DIRECTIONAL SOUND FIELD,” filed Dec. 5, 2012, and assigned to the assignee hereof.
Number | Date | Country
---|---|---
61616836 | Mar 2012 | US
61619202 | Apr 2012 | US
61666196 | Jun 2012 | US
61741782 | Oct 2012 | US
61733696 | Dec 2012 | US