The present invention relates generally to audio signal processing. More specifically, embodiments of the present invention relate to audio processing methods and audio processing apparatus for audio signal rendering based on a mono-channel audio signal.
In many audio processing applications, a mono-channel audio signal may be received and sound is output based on the mono-channel audio signal. As an example, in a voice communication system, voice is captured as a mono-channel signal by a voice communication terminal A. The mono-channel signal is transmitted to a voice communication terminal B. The voice communication terminal B receives and renders the mono-channel signal. As another example, a desired sound such as speech or music may be recorded as a mono-channel signal. The recorded mono-channel signal may be read and played back by a playback device.
To increase the intelligibility of desired sounds for the audience, noise reduction methods such as Wiener filtering may be used to reduce noise, so that the desired sounds in the rendered signal are more intelligible.
According to an embodiment of the invention, an audio processing method is provided. According to the method, a mono-channel audio signal is transformed into a plurality of first subband signals. Proportions of a desired component and a noise component are estimated in each of the subband signals. Second subband signals corresponding respectively to a plurality of channels are generated from each of the first subband signals. Each of the second subband signals comprises a first component and a second component obtained by assigning a spatial hearing property and a perceptual hearing property different from the spatial hearing property to the desired component and the noise component in the corresponding first subband signal respectively, based on a multi-dimensional auditory presentation method. The second subband signals are transformed into signals for rendering with the multi-dimensional auditory presentation method.
According to an embodiment of the invention, an audio processing apparatus is provided. The apparatus includes a time-to-frequency transformer, an estimator, a generator, and a frequency-to-time transformer. The time-to-frequency transformer is configured to transform a mono-channel audio signal into a plurality of first subband signals. The estimator is configured to estimate proportions of a desired component and a noise component in each of the subband signals. The generator is configured to generate second subband signals corresponding respectively to a plurality of channels from each of the first subband signals. Each of the second subband signals comprises a first component and a second component obtained by assigning a spatial hearing property and a perceptual hearing property different from the spatial hearing property to the desired component and the noise component in the corresponding first subband signal respectively, based on a multi-dimensional auditory presentation method. The frequency-to-time transformer is configured to transform the second subband signals into signals for rendering with the multi-dimensional auditory presentation method.
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
The embodiments of the present invention are described below with reference to the drawings. It is to be noted that, for the purpose of clarity, representations and descriptions of components and processes that are known to those skilled in the art but are not necessary for understanding the present invention are omitted from the drawings and the description.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, a device (e.g., a cellular telephone, portable media player, personal computer, television set-top box, or digital video recorder, or any media player), a method or a computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wired line, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
As illustrated in
For each mono-channel audio signal s(t), the time-to-frequency transformer 101 is configured to transform the mono-channel audio signal s(t) into a number K of subband signals D(k,t) (corresponding to K frequency bins), where k is the frequency bin index. For example, the transformation may be performed through a fast Fourier transform (FFT).
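As a non-limiting illustration, the transformation of one frame into K frequency bins, and the inverse transformation back to the time domain, may be sketched as follows. The sketch uses a naive discrete Fourier transform as a stand-in for an FFT; the function names `dft` and `idft`, the 8-sample frame length and the test signal are assumptions made for illustration and do not appear in the description.

```python
import cmath
import math


def dft(frame):
    """Naive O(K^2) discrete Fourier transform of one frame (stand-in for an FFT)."""
    K = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / K) for n in range(K))
        for k in range(K)
    ]


def idft(bins):
    """Inverse transform; returns the real part of the time-domain frame."""
    K = len(bins)
    return [
        (sum(bins[k] * cmath.exp(2j * math.pi * k * n / K) for k in range(K)) / K).real
        for n in range(K)
    ]


# Hypothetical 8-sample frame: one cycle of a sine wave.
frame = [math.sin(2 * math.pi * n / 8) for n in range(8)]
D = dft(frame)        # D[k] plays the role of D(k,t) for this frame
restored = idft(D)    # round trip back to the time domain
```

A practical implementation would typically use an FFT with windowing and overlapping frames; those details are outside the scope of this sketch.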
The estimator 102 is configured to estimate proportions of a desired component and a noise component in each subband signal D(k,t).
A noisy audio signal may be viewed as a mixture of a desired signal and a noise signal. If the human auditory system is able to extract the sound corresponding to the desired signal (also called the desired sound) from the interference corresponding to the noise signal, the audio signal is intelligible to the human auditory system. For example, in voice communication applications, the desired sound may be speech, and in recording and playback applications, the desired sound may be music. In general, depending on the specific application, the desired sound may comprise one or more sounds that the audience wants to hear, and accordingly, the noise may include one or more sounds that the audience does not want to hear, such as stationary white or pink noise, non-stationary babble noise, or interfering speech. Based on the specific spectral characteristics of the desired signal and the noise signal, it is possible to adopt an appropriate method to estimate the proportions of the desired component corresponding to the desired signal and the noise component corresponding to the noise signal in each subband signal. The proportions of the desired component and the noise component may be estimated independently. Alternatively, when one of the proportions is known, the other may be obtained by regarding the portion remaining after the estimated desired component as the noise component, or the portion remaining after the estimated noise component as the desired component.
In an example, the proportions of the desired component and the noise component may be estimated as a gain function. Specifically, it is possible to track the noise component in the audio signal to estimate a noise spectrum, and derive a gain function G(k,t) for each subband signal D(k,t) from the estimated noise spectrum and the subband signal D(k,t).
In general, the desired (e.g., speech) component Ŝ(k,t) may be obtained based on its proportion, for example, the gain function G(k,t). In the case of a gain function, the desired component Ŝ(k,t) may be obtained as below:
Ŝ(k,t) = G(k,t)D(k,t) (1).
The proportion of the noise component may be estimated as (1−G(k,t)). The noise component N̂(k,t) may be obtained as below:
N̂(k,t) = (1−G(k,t))D(k,t) (2).
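Equations (1) and (2) may be sketched as follows for a single subband value. The function name `split_by_gain` and the numeric values are hypothetical and serve only to illustrate that the two extracted components sum back to the subband signal D(k,t).

```python
def split_by_gain(D, G):
    """Split one subband value D(k,t) using a gain G(k,t) assumed to lie in [0, 1]."""
    if not 0.0 <= G <= 1.0:
        raise ValueError("gain must lie in [0, 1]")
    S_hat = G * D            # Equation (1): desired component
    N_hat = (1.0 - G) * D    # Equation (2): noise component
    return S_hat, N_hat


D = complex(3.0, -4.0)       # hypothetical subband value
S_hat, N_hat = split_by_gain(D, 0.75)
```

Because the two proportions sum to one, no signal energy is discarded at this stage; the components are only separated so that different hearing properties can be assigned to them later.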
Various gain functions may be used, including but not limited to spectral subtraction, Wiener filtering, and minimum-mean-square-error log-spectral amplitude estimation (MMSE-LSA).
In an example of spectral subtraction, a gain function GSS(k,t) may be derived as below:
In an example of Wiener Filter, a gain function GWIENER(k,t) may be derived as below:
In an example of MMSE-LSA, a gain function GMMSE-LSA(k,t) may be derived as below:
In the above examples, RPRIO(k,t) represents the a priori signal-to-noise ratio (SNR), and may be derived as below:
RPOST(k,t) represents the a posteriori SNR, and may be derived as below:
where PŜ(k,t), PN(k,t), and PD(k,t) denote the power of the desired component Ŝ(k,t), the noise component N̂(k,t), and the subband signal D(k,t), respectively. In an example, the value of the gain function may be bounded in the range from 0 to 1.
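The equations for the gain functions are not reproduced in this text. As a hedged illustration only, the following sketch uses widely cited textbook forms: a Wiener gain of R/(1+R) and a power spectral subtraction gain equal to the square root of R/(1+R), where R is the a priori SNR. It is an assumption that these forms, and the definitions of the a priori SNR as PŜ/PN and the a posteriori SNR as PD/PN, correspond to the equations intended here. All function names are hypothetical, and MMSE-LSA is omitted because it requires an exponential integral.

```python
import math


def a_priori_snr(P_S, P_N):
    """Assumed definition: power of desired component over noise power."""
    return P_S / P_N


def a_posteriori_snr(P_D, P_N):
    """Assumed definition: power of the subband signal over noise power."""
    return P_D / P_N


def clamp01(x):
    """Bound a gain value to the range from 0 to 1."""
    return min(1.0, max(0.0, x))


def wiener_gain(r_prio):
    """Textbook Wiener gain (assumed form)."""
    r = max(r_prio, 0.0)
    return clamp01(r / (1.0 + r))


def spectral_subtraction_gain(r_prio):
    """Textbook power spectral subtraction gain (assumed form)."""
    r = max(r_prio, 0.0)
    return clamp01(math.sqrt(r / (1.0 + r)))


# Hypothetical powers for one subband.
r_prio = a_priori_snr(P_S=3.0, P_N=1.0)
r_post = a_posteriori_snr(P_D=4.0, P_N=1.0)
g_wiener = wiener_gain(r_prio)
g_ss = spectral_subtraction_gain(r_prio)
```

For the same a priori SNR, the spectral subtraction gain in this form is never smaller than the Wiener gain, so it attributes a larger proportion of the subband signal to the desired component.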
It should be noted that the proportions of the desired component and the noise component are not limited to the gain function. Other methods that provide an indication of desired signal and noise classification can be equally applied. The proportions of the desired component and the noise component may also be estimated based on a probability of desired signal (e.g., speech) or noise. An example of the probability-based proportions may be found in Sun, Xuejing/Yen, Kuan-Chieh/Alves, Rogerio (2010): “Robust noise estimation using minimum correction with harmonicity control”, In INTERSPEECH-2010, 1085-1088. In this example, the speech absence probability (SAP) q(k, t) may be calculated as below:
The proportions of the desired component and the noise component may be estimated as (1−q(k,t)) and q(k,t) respectively. The desired component Ŝ(k,t) and the noise component N̂(k,t) may be obtained as below:
Ŝ(k,t) = (1−q(k,t))D(k,t) (10),
N̂(k,t) = q(k,t)D(k,t) (11).
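Equations (10) and (11) may be sketched across several frequency bins as follows. The speech absence probabilities below are hypothetical values chosen for illustration; how q(k,t) is actually computed is not shown here.

```python
def split_by_sap(D_bins, q_bins):
    """Split subband values using a speech absence probability q(k,t) per bin."""
    S_hat = [(1.0 - q) * d for d, q in zip(D_bins, q_bins)]   # Equation (10)
    N_hat = [q * d for d, q in zip(D_bins, q_bins)]           # Equation (11)
    return S_hat, N_hat


D_bins = [1 + 1j, 2 - 2j, 0.5j]    # hypothetical D(k,t) for k = 0, 1, 2
q_bins = [0.0, 0.5, 1.0]           # hypothetical SAP values per bin
S_hat, N_hat = split_by_sap(D_bins, q_bins)
```

A bin with q = 0 is treated entirely as desired signal, a bin with q = 1 entirely as noise, and intermediate values divide the bin between the two components.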
The measures of the desired component and the noise component are not limited to their power on the subband. Other measures obtained based on segmentation according to harmonicity (e.g. the harmonicity measure described in Sun, Xuejing/Yen, Kuan-Chieh/Alves, Rogerio (2010): “Robust noise estimation using minimum correction with harmonicity control”, In INTERSPEECH-2010, 1085-1088.), spectra or temporal structures may also be used.
Alternatively, to emphasize the desired component, it is also possible to relatively increase the proportion of the desired component or reduce the proportion of the noise component. For example, it is possible to apply an attenuation factor α to the proportion of the noise component, where α≦1. In a further example, 0.5<α≦1.
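A minimal sketch of applying the attenuation factor α to the proportion of the noise component is given below, assuming a gain-based proportion as in Equation (2). The function name and the additional assumption that α is non-negative are illustrative only.

```python
def split_with_attenuation(D, G, alpha):
    """Split D(k,t) by gain G(k,t) and scale the noise proportion by alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    noise_proportion = alpha * (1.0 - G)   # attenuated proportion of the noise
    S_hat = G * D                          # desired component is unchanged
    N_hat = noise_proportion * D
    return S_hat, N_hat, noise_proportion


D = complex(4.0, 0.0)                      # hypothetical subband value
S_hat, N_hat, noise_proportion = split_with_attenuation(D, G=0.75, alpha=0.8)
```

With α = 1 the split reduces to Equations (1) and (2); smaller values of α lower the level of the noise component relative to the desired component.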
For each subband signal D(k,t), the proportions of the desired component Ŝ(k,t) and the noise component N̂(k,t) are estimated by the estimator 102. A conventional way to improve the intelligibility of the mono-channel audio signal is to remove the noise component from the subband signals. However, due to the non-stationarity of noise, estimation errors, and the general requirement of actually removing the undesired signal to isolate the desired signal, conventional approaches suffer from various processing artifacts, such as distortion and musical noise. Because the undesired signal is removed, errors in the estimated proportions, such as the gain function or the probability of the desired signal and the undesired signal, can lead to the destruction or removal of some important information, or to the preservation of undesired information in the audio rendering.
When listening with two ears, the human auditory system uses several cues for sound source localization, mainly including interaural time difference (ITD) and interaural level difference (ILD). By performing sound localization, the human auditory system is able to extract the sound of a desired source out of interfering noise. Based on this observation, it is possible to assign a specific spatial hearing property (e.g., sounding as if it originates from a specific sound source location) to the desired signal by using the cues for sound source localization. The assignment of the spatial hearing property may be achieved through a multi-dimensional auditory presentation method, including but not limited to a binaural auditory presentation method, a method based on a plurality of speakers, and an ambisonics auditory presentation method. Accordingly, it is possible to assign a spatial hearing property different from that assigned to the desired signal (e.g., sounding as if it originates from a different sound source location) to the noise signal by using the cues for sound source localization.
In general, the sound source location is determined by an azimuth, an elevation and a distance of the sound source relative to the human auditory system. Depending on specific multi-dimensional auditory presentation methods, the sound source location is assigned by setting at least one of the azimuth, the elevation and the distance. Accordingly, the difference between the different spatial hearing properties comprises at least one of a difference between the azimuths, a difference between the elevations and a difference between the distances.
Alternatively, it is also possible to assign another kind of perceptual hearing property that helps reduce the perceptual attention paid to the noise signal. For example, the perceptual hearing properties may be those achieved by temporal whitening or frequency whitening (also called temporal or frequency whitening properties), such as a reflection property, a reverberation property, and a diffusivity property. Such an approach will generally aim to render the desired signal as a focused spatial sound source, whilst the noise signal is perceptually diffuse, thus aiding the segmentation and intelligibility of the desired signal by the listener.
The generator 103 is configured to generate subband signals M(k,l,t) corresponding respectively to a number L of channels from each subband signal D(k,t), where l is the channel index. The configurations of the channels depend on the requirement of the multi-dimensional auditory presentation method to be adopted to assign the spatial hearing property. Each subband signal M(k,l,t) may include a component SM(k,l,t) obtained by assigning a spatial hearing property to the desired component Ŝ(k,t) in the corresponding subband signal D(k,t), and a component SN(k,l,t) obtained by assigning a perceptual hearing property different from the spatial hearing property to the noise component N̂(k,t) in the corresponding subband signal D(k,t).
The frequency-to-time transformer 104 is configured to transform the subband signals M(k,l,t) into the signal S(t) for rendering with the multi-dimensional auditory presentation method.
By assigning a spatial hearing property and a different perceptual hearing property to the desired signal and the noise signal respectively, the desired signal and the noise signal can be assigned different virtual locations or perceptual features. This permits the use of perceptual separation to increase the perceptual isolation, and thus the intelligibility or understanding, of the desired signal without deleting or extracting signal components from the overall signal energy, thus creating fewer unnatural distortions.
As illustrated in
At step 205, proportions of a desired component and a noise component in the subband signal D(k,t) are estimated. Methods of estimating described in connection with the estimator 102 may be adopted at step 205 to estimate the proportions of the desired component and the noise component in the subband signal D(k,t).
At step 207, subband signals M(k,l,t) corresponding respectively to a number L of channels are generated from the subband signal D(k,t), where l is the channel index. The subband signal M(k,l,t) may include a component SM(k,l,t) obtained by assigning a spatial hearing property to the desired component Ŝ(k,t) in the corresponding subband signal D(k,t), and a component SN(k,l,t) obtained by assigning a perceptual hearing property different from the spatial hearing property to the noise component N̂(k,t) in the corresponding subband signal D(k,t), based on a multi-dimensional auditory presentation method. The configurations of the channels depend on the requirement of the multi-dimensional auditory presentation method to be adopted to assign the spatial hearing property. Methods of generating the subband signals M(k,l,t) described in connection with the generator 103 may be adopted at step 207.
At step 209, the subband signals M(k,l,t) are transformed into the signal S(t) for rendering with the multi-dimensional auditory presentation method.
At step 211, it is determined whether there is another mono-channel audio signal s(t+1) to be processed. If yes, the method 200 returns to step 203 to process the mono-channel audio signal s(t+1). If no, the method 200 ends at step 213.
As illustrated in
The extractor 301 is configured to extract the desired component Ŝ(k,t) and the noise component N̂(k,t) from each subband signal D(k,t) based on the respective proportions estimated by the estimator 102. In general, it is possible to extract the desired component Ŝ(k,t) and the noise component N̂(k,t) by applying the corresponding proportions to the subband signal D(k,t). Equations (1) and (2), as well as Equations (10) and (11), are examples of such an extraction method.
The filters 302-1 to 302-L correspond to the L channels respectively. Each filter 302-l is configured to filter the extracted desired component Ŝ(k,t) for each subband signal D(k,t) by applying a transfer function HS,l(k,t) for assigning the spatial hearing property, and thus generate a filtered desired component SM(k,l,t)=Ŝ(k,t)HS,l(k,t).
The filters 303-1 to 303-L correspond to the L channels respectively. Each filter 303-l is configured to filter the extracted noise component N̂(k,t) for each subband signal D(k,t) by applying a transfer function HN,l(k,t) for assigning the perceptual hearing property, and thus generate a filtered noise component SN(k,l,t)=N̂(k,t)HN,l(k,t).
The adders 304-1 to 304-L correspond to the L channels respectively. Each adder 304-l is configured to sum the filtered desired component SM(k,l,t) and the filtered noise component SN(k,l,t) for each subband signal D(k,t) to obtain a subband signal M(k,l,t)=Ŝ(k,t)HS,l(k,t)+N̂(k,t)HN,l(k,t).
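The filtering and summing performed by the filters 302-l and 303-l and the adders 304-l may be sketched for one frame as follows. The function name `generate_channels`, the nested-list layout and the numeric transfer functions are hypothetical; real transfer functions would depend on the presentation method adopted.

```python
def generate_channels(S_hat, N_hat, H_S, H_N):
    """Return M[l][k] = S_hat[k] * H_S[l][k] + N_hat[k] * H_N[l][k].

    S_hat, N_hat: extracted components per frequency bin k.
    H_S, H_N: transfer functions indexed as [channel l][bin k].
    """
    K = len(S_hat)
    return [
        [S_hat[k] * H_S[l][k] + N_hat[k] * H_N[l][k] for k in range(K)]
        for l in range(len(H_S))
    ]


# Hypothetical two-bin, two-channel example.
S_hat = [2 + 0j, 1j]
N_hat = [1 + 0j, 1 + 0j]
H_S = [[1.0, 1.0], [0.5, 0.5]]   # desired sound weighted towards channel 0
H_N = [[0.5, 0.5], [1.0, 1.0]]   # noise weighted towards channel 1
M = generate_channels(S_hat, N_hat, H_S, H_N)
```

Each channel carries both components; only the transfer functions applied to them differ from channel to channel.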
As illustrated in
At step 405, the extracted desired component Ŝ(k,t) for the subband signal D(k,t) is filtered by applying a transfer function HS,l(k,t) for assigning the spatial hearing property, thus generating a filtered desired component SM(k,l,t)=Ŝ(k,t)HS,l(k,t).
At step 407, the extracted noise component N̂(k,t) for the subband signal D(k,t) is filtered by applying a transfer function HN,l(k,t) for assigning the perceptual hearing property, thus generating a filtered noise component SN(k,l,t)=N̂(k,t)HN,l(k,t).
At step 409, the filtered desired component SM(k,l,t) and the filtered noise component SN(k,l,t) for the subband signal D(k,t) are summed up to obtain a subband signal M(k,l,t)=Ŝ(k,t)HS,l(k,t)+N̂(k,t)HN,l(k,t).
At step 411, it is determined whether there is another channel l′ to be processed. If yes, the process 400 returns to step 405 to generate another subband signal M(k,l′,t). If no, the process 400 goes to step 413.
At step 413, it is determined whether there is another subband signal D(k′,t) to be processed. If yes, the process 400 returns to step 403 to process the subband signal D(k′,t). If no, the process 400 ends at step 415.
In further embodiments of the generator and the process described in connection with
In the binaural auditory presentation method, it is also possible to assign the perceptual hearing property to the noise component.
If the perceptual hearing property is a spatial hearing property different from that assigned to the desired component, in an example, there are two channels, one for the left ear and one for the right ear. The transfer function HN,1(k,t) is a head-related transfer function (HRTF) for one of the left ear and the right ear, and the transfer function HN,2(k,t) is an HRTF for the other. The HRTFs HN,1(k,t) and HN,2(k,t) can assign to the noise component a sound location different from that assigned to the desired component. In an example, the desired component may be assigned a sound location having an azimuth of 0 degrees, and the noise component may be assigned a sound location having an azimuth of 90 degrees, with the listener as the observer. Such an arrangement is illustrated in
Alternatively, it is possible to divide the noise component into at least two portions, and provide each portion with a set of two HRTFs for assigning a different sound location. The proportions of the divided portions in the noise component may be constant, or adaptive both in time and frequency.
The perceptual hearing property may also be that assigned through temporal or frequency whitening. In case of temporal whitening, the transfer functions HN,l(k,t) are configured to spread the noise component across time to reduce the perceptual significance of the noise signal. In case of frequency whitening, the transfer functions HN,l(k,t) are configured to achieve a spectral whitening of the noise component to reduce the perceptual significance of the noise signal. One example of the frequency whitening is to use the inverse of the long term average spectrum (LTAS) as the transfer functions HN,l(k,t). It should be noted that the transfer functions HN,l(k,t) may be time varying and/or frequency dependent. Various perceptual hearing properties may be achieved through the temporal or frequency whitening, including but not limited to reflection, reverberation, or diffusivity.
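One possible reading of frequency whitening with the inverse of the long term average spectrum is sketched below. It is an assumption of this sketch that the LTAS is computed as the mean magnitude per frequency bin over a number of frames, and that a small floor is applied to avoid division by zero; neither detail is specified in the description, and the function names are hypothetical.

```python
def long_term_average_spectrum(noise_frames):
    """Mean magnitude per bin k over frames t; noise_frames is indexed [t][k]."""
    T = len(noise_frames)
    K = len(noise_frames[0])
    return [sum(abs(noise_frames[t][k]) for t in range(T)) / T for k in range(K)]


def whitening_transfer(ltas, floor=1e-6):
    """Inverse of the LTAS, used here as H_N(k); the floor is an assumed safeguard."""
    return [1.0 / max(value, floor) for value in ltas]


# Hypothetical noise components for two frames and three bins.
noise_frames = [
    [4 + 0j, 2j, 1 + 0j],
    [4j, -2 + 0j, -1j],
]
ltas = long_term_average_spectrum(noise_frames)
H_N = whitening_transfer(ltas)
whitened = [[n * h for n, h in zip(frame, H_N)] for frame in noise_frames]
```

In this toy example the noise has a steeply falling spectrum, and after filtering every bin has unit magnitude, which is the flattening effect that spectral whitening aims for.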
In further embodiments of the generator and the process described in connection with
HN,1(k,t)=j (12),
HN,2(k,t)=−j (13),
where j represents the imaginary unit. Because the speakers are placed away from the listener and the noise is of low perceptual significance, the physical position of the speakers can inherently assign a sound location to the rendered desired sound, so the transfer functions HS,l(k,t) may be reduced to a constant such as 1.
Alternatively, it is also possible to add an additional temporal or frequency whitening property to the transfer functions HN,l(k,t) as below:
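The effect of Equations (12) and (13) with HS,l(k,t) set to 1 may be worked through numerically as follows. Multiplying by j and by −j rotates the noise component by +90 and −90 degrees respectively, so the noise is in antiphase between the two channels while the desired component is identical in both. The numeric values are hypothetical.

```python
H_S = 1.0        # constant transfer function for the desired component
H_N1 = 1j        # Equation (12)
H_N2 = -1j       # Equation (13)

S_hat = 2 + 1j   # hypothetical desired component for one bin
N_hat = 1 - 1j   # hypothetical noise component for the same bin

M1 = S_hat * H_S + N_hat * H_N1   # channel 1
M2 = S_hat * H_S + N_hat * H_N2   # channel 2

noise_in_1 = M1 - S_hat * H_S
noise_in_2 = M2 - S_hat * H_S
```

Here the noise contributions in the two channels are exact negatives of each other, and the sum of the two channels equals twice the desired component.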
HN,1(k,t)=j+HW,1(k) (14),
HN,2(k,t)=−(j+HW,2(k)) (15),
where HW,l(k) is configured to assign the temporal or frequency whitening property such as reflection, diffusivity or reverberation to the noise component in the corresponding channel.
In an example of a 5-channel system (Left, Centre, Right, Left Surround, Right Surround), there are five transfer functions HS,L(k,t), HS,C(k,t), HS,R(k,t), HS,LS(k,t) and HS,RS(k,t) corresponding to the Left, Centre, Right, Left Surround and Right Surround channels respectively, for assigning the spatial hearing property to the desired component, and five transfer functions HN,L(k,t), HN,C(k,t), HN,R(k,t), HN,LS(k,t) and HN,RS(k,t) corresponding to the Left, Centre, Right, Left Surround and Right Surround channels respectively, for assigning the perceptual hearing property to the noise component. An example configuration of the transfer functions is as below:
There is a low correlation between the surround transfer functions HLS(k) and HRS(k), and therefore, a low correlation between HN,LS(k,t) and HN,RS(k,t). It should be apparent from this that other arrangements of the desired signal and the noise signal are also possible. For example, the Left and Right channels may be used rather than the Centre channel for the desired signal, or the noise signal may be distributed across more of the channels with low correlations therebetween.
In further embodiments of the generator and the process described in connection with
In this case, there are generally four channels. The transfer functions for assigning the spatial hearing property include HS,W(k,t)=a constant such as 1 or √2/2, HS,X(k,t)=cos(φ)cos(θ), HS,Y(k,t)=sin(φ)cos(θ) and HS,Z(k,t)=sin(θ), corresponding to the W, X, Y and Z channels respectively. By applying these transfer functions to the extracted desired component Ŝ(k,t), the desired sound may be assigned a specific sound location (azimuth φ, elevation θ) in the rendering. Alternatively, the sound location may be specified by only one of the azimuth φ and the elevation θ. For example, it is possible to assume the elevation θ=0. In this case, there can be three channels W, X and Y, corresponding to a first order horizontal sound field representation. It should be noted that the embodiment is also applicable to a 3D (WXYZ) or higher order planar or 3D sound field representation. The transfer functions for assigning the perceptual hearing property include HN,W(k,t), HN,X(k,t), HN,Y(k,t) and HN,Z(k,t), corresponding to the W, X, Y and Z channels respectively. HN,W(k,t), HN,X(k,t), HN,Y(k,t) and HN,Z(k,t) may apply a temporal or frequency whitening to reduce the perceptual significance of the noise signal, or a spatial hearing property different from that assigned to the desired component.
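The four transfer functions for the W, X, Y and Z channels given above may be sketched as follows. The function names, the use of degrees for the angles and the choice of √2/2 as the default W constant are illustrative assumptions.

```python
import math


def wxyz_gains(azimuth_deg, elevation_deg=0.0, w_gain=math.sqrt(2) / 2):
    """Transfer functions H_S,W, H_S,X, H_S,Y, H_S,Z for a given sound location."""
    phi = math.radians(azimuth_deg)
    theta = math.radians(elevation_deg)
    return {
        "W": w_gain,
        "X": math.cos(phi) * math.cos(theta),
        "Y": math.sin(phi) * math.cos(theta),
        "Z": math.sin(theta),
    }


def encode(component, gains):
    """Apply the per-channel gains to one extracted subband component."""
    return {channel: gain * component for channel, gain in gains.items()}


front = wxyz_gains(azimuth_deg=0.0)    # e.g. location for the desired component
side = wxyz_gains(azimuth_deg=90.0)    # e.g. a different location for the noise
encoded = encode(2 + 0j, front)        # hypothetical desired component value
```

With the elevation fixed at 0, the Z gain is zero and the three channels W, X and Y suffice, matching the first order horizontal representation mentioned above.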
As illustrated in
For each channel l and each subband signal D(k,t), the calculator 602 is configured to calculate a filter parameter H(k,l,t). Each filter parameter H(k,l,t) is a weighted sum of a transfer function HS,l(k,t) for assigning the spatial hearing property and another transfer function HN,l(k,t) for assigning the perceptual hearing property. The weight WS for the transfer function HS,l(k,t) and the weight WN for the other transfer function HN,l(k,t) are positively correlated with the proportions of the desired component and the noise component, respectively, in the corresponding subband signal D(k,t). Namely, each filter parameter H(k,l,t) may be denoted as below:
H(k,l,t) = WS HS,l(k,t) + WN HN,l(k,t).
In an example, the weight WS and the weight WN may be the proportions of the desired component and the noise component respectively.
For each subband signal D(k,t), each filter 601-l is configured to apply the filter parameter H(k,l,t) to the subband signal D(k,t) to obtain a subband signal M(k,l,t)=D(k,t)H(k,l,t).
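When the weights are taken to be the proportions themselves, applying the single filter parameter H(k,l,t) gives the same result as extracting the two components, filtering them separately and summing them. The following sketch checks this for one bin and one channel, assuming a gain-based proportion; the function name and numeric values are hypothetical.

```python
def filter_parameter(W_S, W_N, H_S, H_N):
    """Weighted sum H(k,l,t) = W_S * H_S,l(k,t) + W_N * H_N,l(k,t)."""
    return W_S * H_S + W_N * H_N


D = 2 + 2j       # hypothetical subband value D(k,t)
G = 0.75         # hypothetical gain, used as the proportion of the desired component
H_S = 1 + 0j     # hypothetical transfer function for the spatial hearing property
H_N = 1j         # hypothetical transfer function for the perceptual hearing property

H = filter_parameter(G, 1.0 - G, H_S, H_N)
M_combined = D * H                                   # single-filter form
M_extracted = (G * D) * H_S + ((1.0 - G) * D) * H_N  # extract, filter, then sum
```

The single-filter form needs one multiplication per channel and bin at rendering time, since the weighting by the proportions is folded into the filter parameter.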
As illustrated in
At step 705, each filter parameter H(k,l,t) is applied to the subband signal D(k,t) to obtain a subband signal M(k,l,t)=D(k,t)H(k,l,t).
At step 707, it is determined whether there is another subband signal D(k′,t) to be processed. If yes, the process 700 returns to step 703 to process the subband signal D(k′,t). If no, the process 700 ends at step 709.
According to the embodiments described in connection with
In further embodiments of the generator and the process described in connection with
In the binaural auditory presentation method, it is also possible to assign the perceptual hearing property to the noise component.
If the perceptual hearing property is a spatial hearing property different from that assigned to the desired component, in an example, there are two channels, one for the left ear and one for the right ear. The transfer function HN,1(k,t) is a head-related transfer function (HRTF) for one of the left ear and the right ear, and the transfer function HN,2(k,t) is an HRTF for the other. The HRTFs HN,1(k,t) and HN,2(k,t) can assign to the noise component a sound location different from that assigned to the desired component. In an example, the desired component may be assigned a sound location having an azimuth of 0 degrees, and the noise component may be assigned a sound location having an azimuth of 90 degrees, with the listener as the observer.
Alternatively, it is possible to divide the noise component into at least two portions, and provide each portion with a set of two HRTFs for assigning a different sound location. The proportions of the divided portions in the noise component may be constant, or adaptive both in time and frequency.
The perceptual hearing property may also be that assigned through temporal or frequency whitening. In case of temporal whitening, the transfer functions HN,l(k,t) are configured to spread the noise component across time to reduce the perceptual significance of the noise signal. In case of frequency whitening, the transfer functions HN,l(k,t) are configured to achieve a spectral whitening of the noise component to reduce the perceptual significance of the noise signal. One example of the frequency whitening is to use the inverse of the long term average spectrum (LTAS) as the transfer functions HN,l(k,t). It should be noted that the transfer functions HN,l(k,t) may be time varying and/or frequency dependent. Various perceptual hearing properties may be achieved through the temporal or frequency whitening, including but not limited to reflection, reverberation, or diffusivity.
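As an illustration of the frequency whitening example, the sketch below derives one gain per subband as the inverse of a long term average spectrum. It assumes that the LTAS is a plain average of magnitudes over the available frames and adds a small floor to avoid division by zero. The function names and the `floor` parameter are hypothetical and are not taken from the description.

```python
def ltas(frames):
    """Average magnitude |D(k,t)| over frames t, for each subband k."""
    n_frames = len(frames)
    n_bands = len(frames[0])
    return [sum(abs(frame[k]) for frame in frames) / n_frames
            for k in range(n_bands)]


def whitening_gains(frames, floor=1e-6):
    """Inverse-LTAS gains; `floor` guards against empty subbands."""
    return [1.0 / max(a, floor) for a in ltas(frames)]


# Two frames, two subbands; values are illustrative only.
frames = [[2 + 0j, 0.5j], [-2 + 0j, 0.5 + 0j]]
gains = whitening_gains(frames)
whitened = [[d * g for d, g in zip(frame, gains)] for frame in frames]
```

After whitening, the two subbands in this toy input have the same average magnitude, which is the flattening effect the paragraph describes.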
In further embodiments of the generator and the process described in connection with
Alternatively, it is also possible to add additional temporal or frequency whitening property to the transfer functions HN,l(k,t) as in Equations (14) and (15).
In an example of a 5-channel system—Left, Centre, Right, Left Surround, Right Surround, there are five transfer functions HS,L(k,t), HS,C(k,t), HS,R(k,t), HS,LS(k,t) and HS,RS(k,t) corresponding to Left, Centre, Right, Left Surround and Right Surround channels respectively, for assigning the spatial hearing property to the desired component, and five transfer functions HN,L(k,t), HN,C(k,t), HN,R(k,t), HN,LS(k,t) and HN,RS(k,t) corresponding to Left, Centre, Right, Left Surround and Right Surround channels respectively, for assigning the perceptual hearing property to the noise component. An example configuration of the transfer functions is as below:
There is a low correlation between the surround transfer functions HLS(k) and HRS(k), and therefore a low correlation between HN,LS(k,t) and HN,RS(k,t). It should be apparent from this that other arrangements of the desired signal and the noise signal are also possible. For example, the Left and Right channels may be used rather than the Centre channel for the desired signal, or the noise signal may be distributed across more of the channels with low correlations therebetween.
In further embodiments of the generator and the process described in connection with
In this case, there are generally four channels. The transfer functions for assigning the spatial hearing property include HS,W(k,t)=a constant such as 1 or √2/2, HS,X(k,t)=cos(φ)cos(θ), HS,Y(k,t)=sin(φ)cos(θ) and HS,Z(k,t)=sin(θ), corresponding to the W, X, Y and Z channels respectively. By applying these transfer functions, the desired sound may be assigned a specific sound location (azimuth φ, elevation θ) in the rendering. Alternatively, the sound location may be specified by only one of the azimuth φ and the elevation θ. For example, it is possible to assume the elevation θ=0. In this case, there can be three channels W, X and Y, corresponding to a first order horizontal sound field representation. It should be noted that the embodiment is also applicable to a 3D (WXYZ) or higher order planar or 3D sound field representation. The transfer functions for assigning the perceptual hearing property include HN,W(k,t), HN,X(k,t), HN,Y(k,t) and HN,Z(k,t), corresponding to the W, X, Y and Z channels respectively. HN,W(k,t), HN,X(k,t), HN,Y(k,t) and HN,Z(k,t) may apply a temporal or frequency whitening to reduce the perceptual significance of the noise signal, or a spatial hearing property different from that assigned to the desired component.
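The four spatial transfer functions above can be sketched as a small gain table. The sketch assumes angles in radians and uses √2/2 as the W constant. The function names `bformat_gains` and `encode` are hypothetical.

```python
import math


def bformat_gains(azimuth, elevation=0.0, w_gain=math.sqrt(2) / 2):
    """Gains HS,W, HS,X, HS,Y and HS,Z for a source at (azimuth, elevation)."""
    return {
        "W": w_gain,
        "X": math.cos(azimuth) * math.cos(elevation),
        "Y": math.sin(azimuth) * math.cos(elevation),
        "Z": math.sin(elevation),
    }


def encode(d, azimuth, elevation=0.0):
    """Apply the gains to one subband sample d of the desired component."""
    return {ch: d * g for ch, g in bformat_gains(azimuth, elevation).items()}


# Desired component placed at azimuth 90 degrees, elevation 0 (illustrative).
channels = encode(1 + 0j, math.pi / 2)
```

With the elevation fixed at 0, the Z gain is zero, so only the W, X and Y channels carry signal, which matches the three-channel horizontal case mentioned above.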
As illustrated in
The detector 805 is configured to detect an audio output device which is presently activated for audio rendering, and to determine the multi-dimensional auditory presentation method adopted by the audio output device. The apparatus 800 may be able to be coupled with at least two audio output devices which support audio rendering based on different multi-dimensional auditory presentation methods. For example, the audio output devices may include a headphone supporting a binaural auditory presentation method and a speaker system supporting an ambisonics auditory presentation method. A user may operate the apparatus 800 to switch between the audio output devices for audio rendering. In this case, the detector 805 is used to determine the multi-dimensional auditory presentation method presently being used. Once the detector 805 determines the multi-dimensional auditory presentation method, the generator 803 and the frequency-to-time transformer 804 operate based on the determined multi-dimensional auditory presentation method. Once the multi-dimensional auditory presentation method is determined, the generator 803 and the frequency-to-time transformer 804 perform the same functions as the generator 103 and the frequency-to-time transformer 104 respectively. The frequency-to-time transformer 804 is further configured to transmit the signals for rendering to the detected audio output device.
As illustrated in
By assigning different perceptual hearing properties to different components, there may be spectral gaps in the signals for rendering. This may create perceptual problems, particularly when a single intermediate channel can be heard in isolation.
In further embodiments of the apparatuses and the methods described above, it is possible to control the estimation of the proportions so that the proportions of the desired component and the noise component stay within corresponding limits. For example, generally, and especially in the case of the binaural auditory presentation method, the proportions of the desired component and the noise component in each subband signal D(k,t) are respectively estimated as not greater than 0.9 and not smaller than 0.1. By doing this, in an example of voice communication, it is possible to achieve a maximum noise suppression of about 20 dB on the voice channel and a minimum of about −20 dB of residual desired signal in the noise channel. Also, in the case that the multi-dimensional auditory presentation method is based on multiple speakers, such as the aforementioned 5-channel system, the proportion of the desired component in each subband signal D(k,t) is estimated as not greater than 0.7, and the proportion of the noise component in each subband signal D(k,t) is estimated as not smaller than 0. By doing this, in an example of voice communication, it is possible to achieve a theoretically unlimited maximum noise suppression on the voice channel and a minimum of about −10 dB of residual signal in the noise channel. Consequently, it is helpful to avoid the case where the background channel seems abnormal if it is gated off suddenly and can be heard in isolation.
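The limits and the decibel figures quoted above can be related with a short sketch. It assumes that the two proportions are clamped independently and that a proportion is an amplitude gain, so that its level is 20·log10 of the value. The function names and default limits are hypothetical, and the defaults simply reuse the binaural figures 0.9 and 0.1.

```python
import math


def limit_proportions(p_desired, p_noise, max_desired=0.9, min_noise=0.1):
    """Clamp the desired proportion from above and the noise proportion from below."""
    return min(p_desired, max_desired), max(p_noise, min_noise)


def to_db(gain):
    """Level of an amplitude gain in dB; a zero gain maps to minus infinity."""
    return 20.0 * math.log10(gain) if gain > 0 else float("-inf")


# An estimate of "all desired, no noise" is pulled back to the limits.
g, n = limit_proportions(1.0, 0.0)
floor_db = to_db(n)           # level corresponding to the 0.1 limit
```

Under these assumptions, the 0.1 limit corresponds to −20 dB, and a zero limit corresponds to unlimited attenuation. Similarly, 1 − 0.7 = 0.3 corresponds to roughly −10.5 dB, which is one possible reading of the −10 dB figure for the multi-speaker case.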
As a further improvement, it is possible to limit the proportions of the desired component and the noise component independently. Alternatively, the proportions of the desired component and the noise component can be derived as separate functions of the probability or the simple gain, and can therefore have different properties. For example, assuming that the proportion of the desired component is represented as G, the proportion of the noise component is estimated as √(1−G²). The squares of the two proportions then sum to one, so that energy is preserved.
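The energy-preserving choice can be checked numerically with a minimal sketch. The function name `noise_proportion` is hypothetical, and the clamp at zero is added only to guard against rounding when G is very close to 1.

```python
import math


def noise_proportion(g):
    """Noise proportion sqrt(1 - G^2) for a desired proportion G in [0, 1]."""
    return math.sqrt(max(0.0, 1.0 - g * g))


g = 0.6
n = noise_proportion(g)
total_energy = g * g + n * n   # stays at 1 for any G in [0, 1]
```

For G = 0.6, the noise proportion is 0.8 (up to floating-point rounding), and the squared proportions sum to one.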
In
The CPU 1001, the ROM 1002 and the RAM 1003 are connected to one another via a bus 1004. An input/output interface 1005 is also connected to the bus 1004.
The following components are connected to the input/output interface 1005: an input section 1006 including a keyboard, a mouse, or the like; an output section 1007 including a display such as a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, and a loudspeaker or the like; a storage section 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs a communication process via a network such as the Internet.
A drive 1010 is also connected to the input/output interface 1005 as required. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1010 as required, so that a computer program read therefrom is installed into the storage section 1008 as required.
In the case where the above-described steps and processes are implemented in software, the program that constitutes the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1011.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
The following exemplary embodiments (each an “EE”) are described.
Number | Date | Country | Kind |
---|---|---|---|
2011 1 0421777 | Dec 2011 | CN | national |
This application claims priority to Chinese Patent Application No. 201110421777.1 filed 15 Dec. 2011 and U.S. Provisional Patent Application No. 61/586,945 filed 16 Jan. 2012, hereby incorporated by reference in their entireties for all purposes.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2012/069303 | 12/12/2012 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2013/090463 | 6/20/2013 | WO | A |
Number | Date | Country | |
---|---|---|---|
20150071446 A1 | Mar 2015 | US |
Number | Date | Country | |
---|---|---|---|
61586945 | Jan 2012 | US |