METHOD AND APPARATUS FOR PROCESSING A STEREO SIGNAL

TECHNICAL FIELD

The present disclosure relates to the field of audio signal processing and reproduction. More specifically, the disclosure relates to a method for processing a stereo signal and an apparatus for processing a stereo signal. The present disclosure also relates to a computer-readable storage medium.

BACKGROUND

Three-dimensional (3D) audio effects are a group of spatial sound effects produced by stereo speakers, surround-sound speakers, speaker-arrays, or headphones. The generation of audio effects frequently involves a virtual placement of sound sources at selected positions in three-dimensional space, including behind, above or below the listener.

3D audio processing may involve a spatial domain convolution of sound waves using head-related transfer functions. Specifically, sound waves can be transformed (e.g., using head-related transfer function (HRTF) filters and/or cross-talk cancellation techniques) to mimic natural sound waves which emanate from a point in 3D space. The listener can thus perceive different sounds as coming from different 3D locations, even though the sounds may be produced by just two speakers.

HRTFs and binaural room impulse responses (BRIRs) are both important for generating immersive 3D audio signals through headphones. The immersive 3D audio signals provide spatial audio cues on which humans rely to localize sound in space: interaural level differences (ILD), interaural time differences (ITD) and spectral cues. However, HRTFs or BRIRs depend highly on individual anatomies, and the measurement of HRTFs or BRIRs in high resolution is time-consuming. Usually, non-individual HRTFs or synthesized BRIRs are applied for the binaural renderer instead.

Studies have shown that simulated directional sounds that are generated using non-individual HRTFs suffer from front-back confusion, which is a problem in static binaural rendering due to ambiguous interaural cues. In addition, the externalization of a simulated sound source may be reduced, especially for the virtual sound source in the median plane. The localization and externalization can be improved by the individual measurement of HRTFs/BRIRs, individualized HRTFs/BRIRs, and dynamic rendering that incorporates movements of the source or the listener by using head tracking devices. However, in many commercial applications, binaural rendering can use neither individual head-related impulse responses (HRIRs) nor high-quality head tracking devices.

SUMMARY OF THE INVENTION

The main technical field of the present disclosure is binaural audio reproduction over headphones. It is an object of the disclosure to improve the localization and externalization of mono or stereo signals in the median plane. This improves externalization and localization of virtual sound sources presented over headphones.

The foregoing and other objects are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

A first aspect of the disclosure provides a method for processing a stereo signal, the method comprising: obtaining a center channel signal by up-mixing the stereo signal; generating a filtered center channel signal by applying one or more peak filters and one or more notch filters to the center channel signal; and generating a binaural signal based on the filtered center channel signal.

In one embodiment, the method further comprises obtaining the stereo signal.

The method for processing a stereo signal according to the first aspect can result in good localization and externalization of the stereo signal in the median plane.

Stereophonic sound or, more commonly, stereo, is a method of sound reproduction that creates an illusion of multi-directional audible perspective. This is usually achieved by using two or more independent audio channels through a configuration of two or more loudspeakers (or stereo headphones) in such a way as to create the impression of sound heard from various directions, as in natural hearing.

A stereo signal may contain synchronized directional information from the left and right aural fields. Normally a stereo signal comprises at least two channels, one for the left field and one for the right field.

In an example, a stereo signal may be obtained by a receiver. For example, the receiver may obtain the stereo signal from another device or another system via a wired or wireless communication channel.

In another example, a stereo signal may be obtained using a processor and at least two microphones. The at least two microphones are used to record information obtained from a sound source, and the processor is used to process information recorded by the microphones, to obtain the stereo signal.

Up-mixing, in its most general sense, is the opposite of down-mixing. This means that up-mixing is a process that can take some number of audio channels and turn them into a greater number of audio channels. For example, up-mixing may transform 2-channels into 5.1 channels. Up-mixing is commonly used to better integrate legacy two-channel mono, stereo, or surround encoded content into 5.1 channel programs. Chosen properly, up-mixing further speeds the transition to 5.1 by helping out legacy content, and by assisting in the creation of new 5.1 channel material.

In an example, an audio signal processing arrangement includes a first filter for splitting off signal components from the left channel signal at least within one frequency band. Signal components are split off from the right channel signal by a second filter. The output signals of the filters are compared with the right channel signal and the left channel signal, respectively. The filter parameters of the filters are adjusted to values at which there is maximum correlation between the compared signals according to a given criterion. The center channel signal is derived in dependence on the filter adjustment. This can be effected by combining the output signals of the filters. In this manner, a center channel signal is obtained formed by the correlating left and right channel signal components, so that the stereo image is hardly disturbed by the addition of the center channel signal, whereas the perceived position of the virtual sources in the stereo image becomes less dependent on the listener's position with respect to the left and right loudspeakers.

In one embodiment of the first aspect, the method further comprises: obtaining a side channel signal by up-mixing the stereo signal; processing the side channel signal according to a first head related transfer function, to obtain a processed side channel signal; and processing the filtered center channel signal according to a second head related transfer function, to obtain a processed center channel signal; wherein the generating a binaural signal based on the filtered center channel signal comprises: generating the binaural signal based on the processed side channel signal and the processed center channel signal.

In an example, up-mixing the stereo signal to obtain the side channel signal and up-mixing the stereo signal to obtain the center channel signal are performed in one up-mixing process.
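By way of a non-limiting illustration, a single up-mixing process that yields both the center channel signal and the side channel signal may be sketched as follows. The sketch assumes a simple passive sum/difference matrix; the disclosure does not prescribe this particular up-mixing algorithm, and the function name `upmix_center_side` and the 0.5 scaling are hypothetical choices made only for the example.

```python
def upmix_center_side(left, right):
    """Passive up-mix (assumed for illustration only).

    center = (L + R) / 2 keeps what both channels share,
    side   = (L - R) / 2 keeps what differs between them.
    """
    if len(left) != len(right):
        raise ValueError("left and right must have the same length")
    center = [0.5 * (l + r) for l, r in zip(left, right)]
    side = [0.5 * (l - r) for l, r in zip(left, right)]
    return center, side


# Example input: the first sample is identical in both channels,
# the remaining samples are in anti-phase.
left = [1.0, 0.5, -0.25]
right = [1.0, -0.5, 0.25]
center, side = upmix_center_side(left, right)
```

In this sketch the identical sample ends up entirely in the center channel signal and the anti-phase samples end up entirely in the side channel signal.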

In an example, the head related transfer function, HRTF, which is used to process the side channel signal and the HRTF which is used to process the center channel signal are the same HRTF.

In another example, the HRTF which is used to process the side channel signal and the HRTF which is used to process the center channel signal are different.

In one embodiment of the first aspect, the method further comprises: obtaining a left channel signal and a right channel signal by up-mixing the stereo signal; processing the left channel signal and the right channel signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; processing the filtered center channel signal according to a pair of head related transfer functions, to obtain a processed center channel signal; and wherein the generating a binaural signal based on the filtered center channel signal comprises: generating a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal, generating a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.
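By way of a non-limiting illustration, the generation of the left and right signals of the binaural signal may be sketched in the time domain, where each head related transfer function is represented by an impulse response. The sketch reflects one plausible reading of the embodiment above: the processed left and right channel signals are taken to be the contributions of the left and right channels at each ear. The impulse responses in `TOY_HRIRS` are placeholder values, not measured data, and all names are hypothetical.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def mix(*signals):
    """Sample-wise sum of equal-length signals."""
    return [sum(samples) for samples in zip(*signals)]


# Placeholder impulse responses keyed by (source channel, ear).
# Two pairs serve the left and right channels, one pair serves the center.
TOY_HRIRS = {
    ("L", "l"): [1.0, 0.0], ("L", "r"): [0.0, 0.5],
    ("R", "l"): [0.0, 0.5], ("R", "r"): [1.0, 0.0],
    ("C", "l"): [0.7, 0.0], ("C", "r"): [0.7, 0.0],
}


def render_binaural(left, right, center, hrirs=TOY_HRIRS):
    """Return (left ear, right ear) signals; inputs must share one length."""
    ears = {}
    for ear in ("l", "r"):
        # Processed left/right channel contribution at this ear.
        processed_lr = mix(convolve(left, hrirs[("L", ear)]),
                           convolve(right, hrirs[("R", ear)]))
        # Processed (already peak/notch filtered) center channel at this ear.
        processed_c = convolve(center, hrirs[("C", ear)])
        ears[ear] = mix(processed_lr, processed_c)
    return ears["l"], ears["r"]


out_left, out_right = render_binaural([1.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

Because the placeholder center pair is identical for both ears, a center-only input produces the same signal at each ear, which corresponds to a virtual source in the median plane.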

In an example, up-mixing the stereo signal to obtain the left channel signal and the right channel signal and up-mixing the stereo signal to obtain the center channel signal are performed in one up-mixing process.

In another example, the HRTF which is used to process the left channel signal and the right channel signal and the HRTF which is used to process the center channel signal are different.

In one embodiment of the first aspect, the method further comprises: filtering the side channel signal and the center channel signal, using one or more decorrelation filters, to obtain a decorrelated side signal and a decorrelated center signal; and obtaining a reflection signal based on the decorrelated side signal and the decorrelated center signal.

In an example, one decorrelation filter is used to filter the side channel signal and the center channel signal.

In another example, the decorrelation filter which is used to filter the side channel signal and the decorrelation filter which is used to filter the center channel signal are identical.

In another example, the decorrelation filter which is used to filter the side channel signal and the decorrelation filter which is used to filter the center channel signal are different filters.
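By way of a non-limiting illustration, the decorrelation and reflection steps may be sketched as follows. The sketch assumes that each decorrelation filter is a Schroeder all-pass section and that the reflection signal is a scaled sum of the decorrelated signals; the disclosure does not mandate either choice. The delays, gains and the `mix` factor are hypothetical values.

```python
def allpass(x, delay, gain):
    """Schroeder all-pass: y[n] = -g*x[n] + x[n-D] + g*y[n-D]."""
    y = []
    for n, xn in enumerate(x):
        x_d = x[n - delay] if n >= delay else 0.0
        y_d = y[n - delay] if n >= delay else 0.0
        y.append(-gain * xn + x_d + gain * y_d)
    return y


def reflection_signal(side, center, mix=0.5):
    """Combine decorrelated side and center signals (assumed: scaled sum)."""
    # Different delays correspond to the example with different filters.
    dec_side = allpass(side, delay=3, gain=0.5)
    dec_center = allpass(center, delay=5, gain=0.5)
    return [mix * (s + c) for s, c in zip(dec_side, dec_center)]


impulse = [1.0] + [0.0] * 6
h = allpass(impulse, delay=3, gain=0.5)
reflection = reflection_signal(impulse, impulse)
```

An all-pass section leaves the magnitude spectrum unchanged while smearing the phase, which is why it is a common choice when a signal is to be decorrelated without changing its timbre.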

In one embodiment of the first aspect, the method further comprises: filtering the left channel signal, the right channel signal and the center channel signal, using one or more decorrelation filters, to obtain a decorrelated left signal, a decorrelated right signal and a decorrelated center signal; and obtaining a reflection signal based on the decorrelated left signal, the decorrelated right signal and the decorrelated center signal.

In an example, one decorrelation filter is used to filter the left channel signal, the right channel signal and the center channel signal.

In another example, the decorrelation filter which is used to filter the left channel signal and the right channel signal and the decorrelation filter which is used to filter the center channel signal are identical.

In another example, the decorrelation filter which is used to filter the left channel signal and the right channel signal and the decorrelation filter which is used to filter the center channel signal are different filters.

In an example, the decorrelation filter which is used to filter the left channel signal and the decorrelation filter which is used to filter the right channel signal are the same.

In an example, the decorrelation filter which is used to filter the left channel signal and the decorrelation filter which is used to filter the right channel signal are different.

In one embodiment of the first aspect, the method further comprises: obtaining an initial audio signal; and decomposing the initial audio signal, using one or any combination of the following methods: Ambient Phase Estimation, Principal Component Analysis or Least Squares Analysis, to obtain the stereo signal.

In one embodiment of the first aspect, the method further comprises: obtaining an initial audio signal; decomposing the initial audio signal, using one or any combination of the following methods: Ambient Phase Estimation, Principal Component Analysis or Least Squares Analysis, to obtain the stereo signal and an ambient signal; obtaining a left channel signal and a right channel signal by up-mixing the stereo signal; adding the ambient signal with the left channel signal, to obtain a left sum signal; adding the ambient signal with the right channel signal, to obtain a right sum signal; processing the left sum signal and the right sum signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; processing the filtered center channel signal according to a pair of head related transfer functions, to obtain a processed center channel signal; and wherein the generating a binaural signal based on the filtered center channel signal comprises: generating a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal, generating a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.
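By way of a non-limiting illustration, a decomposition based on Principal Component Analysis may be sketched for a two-channel initial audio signal as follows. The sketch assumes zero-mean input, treats the projection onto the principal direction as the directional (stereo) part and treats the residual as the ambient part. This is one common formulation and is not asserted to be the specific decomposition of the disclosure; the function name `pca_decompose` is hypothetical.

```python
import math


def pca_decompose(left, right):
    """Split a two-channel signal into primary and ambient parts via PCA.

    Returns two lists of (left, right) sample pairs.
    """
    # Entries of the 2x2 (unnormalized) covariance matrix [[a, b], [b, c]].
    a = sum(l * l for l in left)
    c = sum(r * r for r in right)
    b = sum(l * r for l, r in zip(left, right))
    # Angle of the eigenvector belonging to the largest eigenvalue.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    vl, vr = math.cos(theta), math.sin(theta)
    primary, ambient = [], []
    for l, r in zip(left, right):
        p = l * vl + r * vr                      # projection onto principal axis
        primary.append((p * vl, p * vr))         # directional part
        ambient.append((l - p * vl, r - p * vr)) # residual part
    return primary, ambient


# Fully correlated input: everything is primary, the ambient part vanishes.
primary, ambient = pca_decompose([1.0, -1.0, 0.5], [1.0, -1.0, 0.5])
```

By construction the primary and ambient parts sum back to the input, so the ambient part can be added to the up-mixed left and right channel signals without losing signal energy.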

In an example, up-mixing the stereo signal to obtain the left channel signal and the right channel signal and up-mixing the stereo signal to obtain the center channel signal are performed in one up-mixing process.

In another example, the HRTF which is used to process the left channel signal and the right channel signal and the HRTF which is used to process the center channel signal are different.

In one embodiment of the first aspect, the method further comprises: filtering the left channel signal, the right channel signal and the center channel signal, using one or more decorrelation filters, to obtain a decorrelated left signal, a decorrelated right signal and a decorrelated center signal; and obtaining a reflection signal based on the decorrelated left signal, the decorrelated right signal and the decorrelated center signal.

In an example, one decorrelation filter is used to filter the left channel signal, the right channel signal and the center channel signal.

In another example, the decorrelation filter which is used to filter the left channel signal and the right channel signal and the decorrelation filter which is used to filter the center channel signal are identical.

In an example, the decorrelation filter which is used to filter the left channel signal and the decorrelation filter which is used to filter the right channel signal are identical.

In an example, the decorrelation filter which is used to filter the left channel signal and the decorrelation filter which is used to filter the right channel signal are different filters.

In one embodiment of the first aspect, the method further comprises: obtaining a left channel signal and a right channel signal by up-mixing the stereo signal; convolving the stereo signal with a local reverberation to obtain a convolved stereo signal; adding the convolved stereo signal with the left channel signal, to obtain a left sum signal; adding the convolved stereo signal with the right channel signal, to obtain a right sum signal; processing the left sum signal and the right sum signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; processing the filtered center channel signal according to a pair of head related transfer functions, to obtain a processed center channel signal; and wherein the generating a binaural signal based on the filtered center channel signal comprises: generating a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal, generating a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.
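By way of a non-limiting illustration, forming a sum signal from an up-mixed channel signal and the convolved stereo signal may be sketched as follows. The sketch assumes that the local reverberation is available as an impulse response and that each up-mixed channel is summed with the corresponding channel of the convolved stereo signal. The impulse response values and the function name `add_reverb` are hypothetical.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def add_reverb(channel, stereo_channel, reverb_ir):
    """Sum an up-mixed channel with the reverberated stereo channel.

    `channel` and `stereo_channel` are assumed to have the same length.
    """
    wet = convolve(stereo_channel, reverb_ir)
    # Zero-pad the dry channel so the reverberation tail is preserved.
    dry = list(channel) + [0.0] * (len(wet) - len(channel))
    return [d + w for d, w in zip(dry, wet)]


# Placeholder reverberation: two decaying taps after the direct sound.
REVERB_IR = [0.0, 0.5, 0.25]
left_sum = add_reverb([1.0, 0.0], [1.0, 0.0], REVERB_IR)
```

The resulting left sum signal and the analogously formed right sum signal would then be processed according to the two pairs of head related transfer functions.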

In another example, the HRTF which is used to process the left channel signal and the right channel signal and the HRTF which is used to process the center channel signal are different.

In an example, one decorrelation filter is used to filter the left channel signal, the right channel signal and the center channel signal.

In an example, the decorrelation filter which is used to filter the left channel signal and the decorrelation filter which is used to filter the right channel signal are the same.

In an example, the decorrelation filter which is used to filter the left channel signal and the decorrelation filter which is used to filter the right channel signal are different.

In one embodiment of the first aspect, the method further comprises: obtaining a left channel signal and a right channel signal by up-mixing the stereo signal; convolving the stereo signal with a local reverberation to obtain a convolved stereo signal; processing the left channel signal and the right channel signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; processing the filtered center channel signal according to a pair of head related transfer functions to obtain a processed center channel signal; wherein the generating a binaural signal based on the filtered center channel signal comprises: generating a left signal of the binaural signal based on the processed left channel signal, the convolved stereo signal and the processed center channel signal, generating a right signal of the binaural signal based on the processed right channel signal, the convolved stereo signal and the processed center channel signal.

In an example, up-mixing the stereo signal to obtain the left channel signal and the right channel signal and up-mixing the stereo signal to obtain the center channel signal are performed in one up-mixing process.

In another example, the HRTF which is used to process the left channel signal and the right channel signal and the HRTF which is used to process the center channel signal are different functions.

In an example, one decorrelation filter is used to filter the left channel signal, the right channel signal and the center channel signal.

In an example, the decorrelation filter which is used to filter the left channel signal and the decorrelation filter which is used to filter the right channel signal are identical.

In an example, the decorrelation filter which is used to filter the left channel signal and the decorrelation filter which is used to filter the right channel signal are different.

In one embodiment of the first aspect, the one or more peak filters comprise a first peak filter centered at 4 kHz and having a ⅓-octave bandwidth, and a second peak filter centered at a frequency above 13 kHz and having a ¼-octave bandwidth; and wherein the one or more notch filters comprise: a notch filter centered at a frequency between 4 kHz and 8 kHz and having a 1-octave bandwidth.

In an example, the typical center frequency for the notch filter is 7 kHz, and the typical center frequency for the second peak filter is 13 kHz.
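By way of a non-limiting illustration, the peak and notch filters of this embodiment may be sketched as second-order (biquad) sections whose coefficients follow the widely used peaking-EQ formulas of the Audio EQ Cookbook. Several points are assumptions made only for the example: the 48 kHz sampling rate, the ±6 dB gains (the disclosure specifies center frequencies and bandwidths but not gains), and the modeling of the notch filter as a peaking section with negative gain rather than as an infinitely deep notch. The function names are hypothetical.

```python
import cmath
import math


def peaking_biquad(f0, bandwidth_oct, gain_db, fs=48000.0):
    """Peaking-EQ biquad; a negative gain_db produces a dip.

    Returns normalized coefficients (b, a) with a[0] == 1.
    """
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) * math.sinh(
        math.log(2.0) / 2.0 * bandwidth_oct * w0 / math.sin(w0))
    b = [1.0 + alpha * a_lin, -2.0 * math.cos(w0), 1.0 - alpha * a_lin]
    a = [1.0 + alpha / a_lin, -2.0 * math.cos(w0), 1.0 - alpha / a_lin]
    return [bi / a[0] for bi in b], [ai / a[0] for ai in a]


def gain_db_at(b, a, f, fs=48000.0):
    """Magnitude response of one biquad at frequency f, in dB."""
    z = cmath.exp(-1j * 2.0 * math.pi * f / fs)
    h = (b[0] + b[1] * z + b[2] * z * z) / (a[0] + a[1] * z + a[2] * z * z)
    return 20.0 * math.log10(abs(h))


# (center frequency in Hz, bandwidth in octaves, assumed gain in dB)
FILTER_SET = [
    (4000.0, 1.0 / 3.0, 6.0),   # first peak filter
    (7000.0, 1.0, -6.0),        # notch filter (typical center frequency)
    (13000.0, 1.0 / 4.0, 6.0),  # second peak filter (typical center frequency)
]
sections = [peaking_biquad(f0, bw, g) for f0, bw, g in FILTER_SET]
center_gains = [gain_db_at(b, a, f0)
                for (b, a), (f0, _, _) in zip(sections, FILTER_SET)]
```

With these formulas each section reaches its nominal gain exactly at its center frequency and leaves the response at 0 Hz unchanged, so the sections can be cascaded on the center channel signal without altering its overall level at low frequencies.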

In one embodiment of the first aspect, the one or more peak filters comprise a first peak filter centered at 1 kHz and having a ⅓-octave bandwidth, and a second peak filter centered at a frequency between 10 kHz and 12 kHz and having a ¼-octave bandwidth; and wherein the one or more notch filters comprise: a first notch filter centered at 9 kHz and having a ¼-octave bandwidth, and a second notch filter centered at 16 kHz and having a ¼-octave bandwidth.

In an example, the typical center frequency for the second peak filter is 11 kHz.
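By way of a non-limiting illustration, applying this set of two peak filters and two notch filters to the samples of a center channel signal may be sketched as a cascade of biquad sections. The sketch assumes a 48 kHz sampling rate, gains of ±6 dB (not specified in the disclosure), peaking-EQ coefficient formulas from the Audio EQ Cookbook, and notch filters modeled as peaking sections with negative gain. All function names are hypothetical.

```python
import math


def peaking_biquad(f0, bandwidth_oct, gain_db, fs=48000.0):
    """Peaking-EQ biquad coefficients (b, a), normalized so a[0] == 1."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) * math.sinh(
        math.log(2.0) / 2.0 * bandwidth_oct * w0 / math.sin(w0))
    b = [1.0 + alpha * a_lin, -2.0 * math.cos(w0), 1.0 - alpha * a_lin]
    a = [1.0 + alpha / a_lin, -2.0 * math.cos(w0), 1.0 - alpha / a_lin]
    return [bi / a[0] for bi in b], [ai / a[0] for ai in a]


def biquad_filter(x, b, a):
    """Direct form I realization of one biquad section."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


# (center frequency in Hz, bandwidth in octaves, assumed gain in dB)
FILTER_SET = [
    (1000.0, 1.0 / 3.0, 6.0),    # first peak filter
    (11000.0, 1.0 / 4.0, 6.0),   # second peak filter (typical center frequency)
    (9000.0, 1.0 / 4.0, -6.0),   # first notch filter
    (16000.0, 1.0 / 4.0, -6.0),  # second notch filter
]


def filter_center_channel(center, filter_set=FILTER_SET, fs=48000.0):
    """Run the center channel signal through every section in turn."""
    out = list(center)
    for f0, bw, gain_db in filter_set:
        b, a = peaking_biquad(f0, bw, gain_db, fs)
        out = biquad_filter(out, b, a)
    return out


filtered = filter_center_channel([1.0] * 4000)
```

Because the sections are linear and time-invariant, the order in which they are applied does not change the filtered center channel signal apart from rounding effects.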

A second aspect of the disclosure provides an apparatus for processing a stereo signal, the apparatus comprising processing circuitry configured to:

- obtain a center channel signal by up-mixing the stereo signal;
- obtain a filtered center channel signal by applying one or more peak filters and one or more notch filters to the center channel signal; and
- generate a binaural signal based on the filtered center channel signal.

The processing circuitry may comprise hardware and software. The hardware may comprise analog or digital circuitry, or both analog and digital circuitry. In one embodiment, the processing circuitry comprises one or more processors and a non-volatile memory connected to the one or more processors. The non-volatile memory may carry executable program code which, when executed by the one or more processors, causes the apparatus to perform the operations or methods described herein.

The filters described in this disclosure may be implemented in hardware or in software or in a combination of hardware and software.

In one embodiment of the second aspect, the processing circuitry is further configured to:

- obtain a side channel signal by up-mixing the stereo signal;
- process the side channel signal according to a first head related transfer function, to obtain a processed side channel signal; and
- process the filtered center channel signal according to a second head related transfer function, to obtain a processed center channel signal;
- wherein the binaural signal is generated based on the processed side channel signal and the processed center channel signal.

In one embodiment of the second aspect, the processing circuitry is further configured to:

- obtain a left channel signal and a right channel signal by up-mixing the stereo signal;
- process the left channel signal and the right channel signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; and
- process the filtered center channel signal according to a pair of head related transfer functions, to obtain a processed center channel signal; and
- wherein a left signal of the binaural signal is generated based on the processed left channel signal and the processed center channel signal,
- a right signal of the binaural signal is generated based on the processed right channel signal and the processed center channel signal.

In one embodiment of the second aspect, the processing circuitry is further configured to:

- filter the side channel signal and the center channel signal, to obtain a decorrelated side signal and a decorrelated center signal; and
- obtain a reflection signal based on the decorrelated side signal and the decorrelated center signal.

In one embodiment of the second aspect, the processing circuitry is further configured to:

- filter the left channel signal, the right channel signal and the center channel signal, to obtain a decorrelated left signal, a decorrelated right signal and a decorrelated center signal; and
- obtain a reflection signal based on the decorrelated left signal, the decorrelated right signal and the decorrelated center signal.

In one embodiment of the second aspect, the processing circuitry is configured to obtain an initial audio signal, and decompose the initial audio signal, using one or any combination of the following methods: Ambient Phase Estimation, Principal Component Analysis or Least Squares Analysis, to obtain the stereo signal.

In one embodiment of the second aspect, the processing circuitry is configured to:

- obtain an initial audio signal;
- decompose the initial audio signal, using one or any combination of the following methods: Ambient Phase Estimation, Principal Component Analysis or Least Squares Analysis, to obtain the stereo signal and an ambient signal;
- obtain a left channel signal and a right channel signal by up-mixing the stereo signal;
- add the ambient signal to the left channel signal, to obtain a left sum signal,
- add the ambient signal to the right channel signal, to obtain a right sum signal;
- process the left sum signal and the right sum signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal, and process the filtered center channel signal according to a pair of head related transfer functions to obtain a processed center channel signal;
- generate a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal, and generate a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.

In one embodiment of the second aspect, the processing circuitry is further configured to:

- filter the left channel signal, the right channel signal and the center channel signal, to obtain a decorrelated left signal, a decorrelated right signal and a decorrelated center signal; and
- obtain a reflection signal based on the decorrelated left signal, the decorrelated right signal and the decorrelated center signal.

In one embodiment of the second aspect, the processing circuitry is further configured to:

- obtain a left channel signal and a right channel signal by up-mixing the stereo signal;
- convolve the stereo signal with a local reverberation to obtain a convolved stereo signal;
- add the convolved stereo signal with the left channel signal, to obtain a left sum signal, add the convolved stereo signal with the right channel signal, to obtain a right sum signal;
- process the left sum signal and the right sum signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; and process the filtered center channel signal according to a pair of head related transfer functions, to obtain a processed center channel signal;
- generate a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal,
- generate a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.

In one embodiment of the second aspect, the processing circuitry is further configured to:

- filter the left channel signal, the right channel signal and the center channel signal, to obtain a decorrelated left signal, a decorrelated right signal and a decorrelated center signal; and
- obtain a reflection signal based on the decorrelated left signal, the decorrelated right signal and the decorrelated center signal.

In one embodiment of the second aspect, the processing circuitry is further configured to:

- obtain a left channel signal and a right channel signal by up-mixing the stereo signal;
- convolve the stereo signal with a local reverberation to obtain a convolved stereo signal;
- process the left channel signal and the right channel signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; and
- process the filtered center channel signal according to a pair of head related transfer functions, to obtain a processed center channel signal;
- generate a left signal of the binaural signal based on the processed left channel signal, the convolved stereo signal and the processed center channel signal, generate a right signal of the binaural signal based on the processed right channel signal, the convolved stereo signal and the processed center channel signal.

In one embodiment of the second aspect, the processing circuitry is further configured to:

- filter the left channel signal, the right channel signal and the center channel signal, to obtain a decorrelated left signal, a decorrelated right signal and a decorrelated center signal; and
- obtain a reflection signal based on the decorrelated left signal, the decorrelated right signal and the decorrelated center signal.

In one embodiment of the second aspect, the one or more peak filters comprise a first peak filter centered at 4 kHz and having a ⅓-octave bandwidth, and a second peak filter centered at a frequency above 13 kHz and having a ¼-octave bandwidth; and wherein the one or more notch filters comprise:

- a notch filter centered at a frequency between 4 kHz and 8 kHz and having a 1-octave bandwidth.

In one embodiment of the second aspect, the one or more peak filters comprise a first peak filter centered at 1 kHz and having a ⅓-octave bandwidth, and a second peak filter centered at a frequency between 10 kHz and 12 kHz and having a ¼-octave bandwidth; and wherein the one or more notch filters comprise:

- a first notch filter centered at 9 kHz and having a ¼-octave bandwidth, and a second notch filter centered at 16 kHz and having a ¼-octave bandwidth.

A third aspect of the disclosure provides an apparatus for processing a stereo signal, the apparatus comprising: an up-mix unit configured to obtain a center channel signal by up-mixing the stereo signal; one or more peak filters and one or more notch filters configured to filter the center channel signal to obtain a filtered center channel signal; and a binaural signal generate unit configured to generate a binaural signal based on the filtered center channel signal.

In one embodiment, the apparatus comprises a stereo signal obtain unit configured to obtain the stereo signal.

In one embodiment of the third aspect, the up-mix unit is further configured to obtain a side channel signal by up-mixing the stereo signal; the apparatus further comprises a head related transfer function, HRTF, unit, the HRTF unit is configured to process the side channel signal according to a first head related transfer function, to obtain a processed side channel signal; the HRTF unit is further configured to process the filtered center channel signal according to a second head related transfer function, to obtain a processed center channel signal; and wherein the binaural signal generate unit is configured to generate the binaural signal based on the processed side channel signal and the processed center channel signal.

In one embodiment of the third aspect, the up-mix unit is further configured to obtain a left channel signal and a right channel signal by up-mixing the stereo signal; the apparatus further comprises a head related transfer function, HRTF, unit, the HRTF unit is configured to process the left channel signal and the right channel signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; the HRTF unit is further configured to process the filtered center channel signal according to a pair of head related transfer functions, to obtain a processed center channel signal; and wherein the binaural signal generate unit is configured to generate a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal, the binaural signal generate unit is configured to generate a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.

In one embodiment of the third aspect, the apparatus further comprises: one or more decorrelation filters configured to filter the side channel signal and the center channel signal, to obtain a decorrelated side signal and a decorrelated center signal; and a reflection obtain unit configured to obtain a reflection signal based on the decorrelated side signal and the decorrelated center signal.

In one embodiment of the third aspect, the apparatus further comprises: one or more decorrelation filters configured to filter the left channel signal, the right channel signal and the center channel signal, to obtain a decorrelated left signal, a decorrelated right signal and a decorrelated center signal; and a reflection obtain unit configured to obtain a reflection signal based on the decorrelated left signal, the decorrelated right signal and the decorrelated center signal.

In one embodiment of the third aspect, the stereo signal obtain unit is configured to obtain an initial audio signal, and decompose the initial audio signal, using one or any combination of the following methods: Ambient Phase Estimation, Principal Component Analysis or Least Squares Analysis, to obtain the stereo signal.

In one embodiment of the third aspect, the stereo signal obtain unit is configured to obtain an initial audio signal, decompose the initial audio signal, using one or any combination of the following methods: Ambient Phase Estimation, Principal Component Analysis or Least Squares Analysis, to obtain the stereo signal and an ambient signal;

the up-mix unit is further configured to obtain a left channel signal and a right channel signal by up-mixing the stereo signal; the apparatus further comprises a head related transfer function, HRTF, unit, the HRTF unit is configured to add the ambient signal to the left channel signal, to obtain a left sum signal, add the ambient signal to the right channel signal, to obtain a right sum signal; the HRTF unit is further configured to process the left sum signal and the right sum signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal, and the HRTF unit is further configured to process the filtered center channel signal according to a pair of head related transfer functions, to obtain a processed center channel signal; and wherein the binaural signal generate unit is configured to generate a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal, generate a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.

In one embodiment of the third aspect, the up-mix unit is further configured to obtain a left channel signal and a right channel signal by up-mixing the stereo signal; the apparatus further comprises a convolve unit, the convolve unit is configured to convolve the stereo signal with a local reverberation to obtain a convolved stereo signal; the apparatus further comprises a head related transfer function, HRTF, unit, the HRTF unit is configured to add the convolved stereo signal with the left channel signal, to obtain a left sum signal, add the convolved stereo signal with the right channel signal, to obtain a right sum signal; the HRTF unit is further configured to process the left sum signal and the right sum signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal, and the HRTF unit is further configured to process the filtered center channel signal according to a pair of head related transfer functions, to obtain a processed center channel signal; and wherein the binaural signal generate unit is configured to generate a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal, generate a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.

In one embodiment of the third aspect, the up-mix unit is further configured to obtain a left channel signal and a right channel signal by up-mixing the stereo signal; the apparatus further comprises a convolve unit, the convolve unit is configured to convolve the stereo signal with a local reverberation to obtain a convolved stereo signal; the apparatus further comprises a head related transfer function, HRTF, unit, the HRTF unit is configured to process the left channel signal and the right channel signal according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; the HRTF unit is further configured to process the filtered center channel signal according to a pair of head related transfer functions, to obtain a processed center channel signal; and wherein the binaural signal generate unit is configured to generate a left signal of the binaural signal based on the processed left channel signal, the convolved stereo signal and the processed center channel signal, generate a right signal of the binaural signal based on the processed right channel signal, the convolved stereo signal and the processed center channel signal.

In one embodiment of the third aspect, the one or more peak filters comprise a first peak filter centered at 4 kHz and having a ⅓-octave bandwidth and a second peak filter centered at a frequency above 13 kHz and having a ¼-octave bandwidth; and the one or more notch filters comprise a notch filter centered at a frequency between 4 kHz and 8 kHz and having a 1-octave bandwidth.

In one embodiment of the third aspect, the one or more peak filters comprise a first peak filter centered at 1 kHz and having a ⅓-octave bandwidth, and a second peak filter centered at a frequency between 10 kHz and 12 kHz and having a ¼-octave bandwidth, and the one or more notch filters comprise a first notch filter centered at 9 kHz and having a ¼-octave bandwidth and a second notch filter centered at 16 kHz and having a ¼-octave bandwidth.

The method according to the first aspect of the disclosure can be performed by the apparatus according to the second aspect or the third aspect of the disclosure. Further features of the method according to the first aspect of the disclosure result directly from the functionality of the apparatus according to the second aspect or the third aspect of the disclosure and its different embodiment forms.

A fourth aspect of the disclosure relates to a computer-readable storage medium storing program code. The program code comprises instructions for carrying out the method of the first aspect or one of its embodiments.

The disclosure can be implemented in hardware and/or software.

BRIEF DESCRIPTION OF THE DRAWINGS

To illustrate the technical features of embodiments of the present disclosure more clearly, the accompanying drawings provided for describing the embodiments are introduced briefly in the following. The accompanying drawings in the following description are merely some embodiments of the present disclosure, but modifications on these embodiments are possible without departing from the scope of the present disclosure as defined in the claims.

FIG. 1 shows an example in which a sound space is divided into three planes: the horizontal plane, the median plane and the frontal plane;

FIG. 2 shows a schematic diagram of a method of binaural rendering with externalization and localization enhancement according to an embodiment;

FIG. 3 shows another schematic diagram of a method of binaural rendering with externalization and localization enhancement according to an embodiment;

FIG. 4 shows a block diagram of a general method to simulate a virtual sound source according to an embodiment;

FIG. 5 shows another schematic diagram of a method of binaural rendering with externalization and localization enhancement according to an embodiment;

FIG. 6 shows an example of magnitude spectra of peak and notch filters for a frontal (left panel) and a rear (right panel) sound source;

FIG. 7 shows an example of frontal and rear view direction in a rendering system;

FIG. 8 shows an example of the gain factor across different azimuth angles (θ) for the sound source located on the horizontal plane;

FIG. 9 shows a schematic diagram of a method to decorrelate the input audio signal according to an embodiment;

FIG. 10 shows a schematic diagram of a method for enhancing externalization of a mono signal according to an embodiment;

FIG. 11 shows another schematic diagram of a method for enhancing externalization of a mono signal according to an embodiment;

FIG. 12 shows another schematic diagram of a method for enhancing externalization of a mono signal according to an embodiment;

FIG. 13 shows a schematic diagram of a method for enhancing externalization of a stereo signal according to an embodiment;

FIG. 14 shows another schematic diagram of a method for enhancing externalization of a stereo signal according to an embodiment;

FIG. 15 shows another schematic diagram of a method for enhancing externalization of a stereo signal according to an embodiment;

FIG. 16 shows another schematic diagram of a method for enhancing externalization of a stereo signal according to an embodiment;

FIG. 17 shows another schematic diagram of a method for enhancing externalization of a stereo signal according to an embodiment;

FIG. 18 shows another schematic diagram of a method for enhancing externalization of a stereo signal according to an embodiment;

FIG. 19 shows a schematic diagram of a method for processing a stereo signal according to an embodiment;

FIG. 20 shows a schematic diagram illustrating an apparatus for processing a stereo signal according to an embodiment;

FIG. 21 shows a schematic diagram illustrating a device for processing a stereo signal according to an embodiment.

In the figures, identical reference signs will be used for identical or functionally equivalent features.

DETAILED DESCRIPTION OF THE EMBODIMENTS

In the following description, reference is made to the accompanying drawings, which form part of the disclosure, and in which are shown, by way of illustration, specific aspects in which the disclosure may be placed. It will be appreciated that the disclosure may be placed in other aspects and that structural or logical changes may be made without departing from the scope of the disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, as the scope of the disclosure is defined by the appended claims.

For instance, it will be appreciated that a disclosure in connection with a described method will generally also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if a specific method step is described, a corresponding device may include a unit to perform the described method step, even if such unit is not explicitly described or illustrated in the figures.

Moreover, in the following detailed description as well as in the claims, embodiments with functional blocks or processing units are described, which are connected with each other or exchange signals. It will be appreciated that the disclosure also covers embodiments which include additional functional blocks or processing units, such as pre- or post-filtering and/or pre- or post-amplification units, that are arranged between the functional blocks or processing units of the embodiments described below.

Finally, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.

A channel is a pathway for passing on information, in this context sound information. Physically, it might, for example, be a tube you speak down, or a wire from a microphone to an earphone, or connections between electronic components inside an amplifier or a computer.

A track is a physical home for the contents of a channel when recorded on magnetic tape. There can be as many parallel tracks as technology allows, but for everyday purposes there are 1, 2 or 4. Two tracks can be used for two independent mono signals in one or both playing directions, or a stereo signal in one direction. Four tracks (such as a cassette recorder) are organized to work pairwise for a stereo signal in each direction; a mono signal is recorded on one track (same track as the left stereo channel) or on both simultaneously (depending on the tape recorder or on how the mono signal source is connected to the recorder).

A mono sound signal does not contain any directional information. In an example, there may be several loudspeakers along a railway platform and hundreds around an airport, but the signal remains mono. Directional information cannot be generated simply by sending a mono signal to two “stereo” channels. However, an illusion of direction can be conjured from a mono signal by panning it from channel to channel.

A stereo sound signal may contain synchronized directional information from the left and right aural fields. Consequently, it requires at least two channels, one for the left field and one for the right field. The left channel is fed by a mono microphone pointing at the left field and the right channel by a second mono microphone pointing at the right field (you will also find stereo microphones that have the two directional mono microphones built into one piece). In an example, Quadraphonic stereo uses four channels, surround stereo has at least additional channels for anterior and posterior directions apart from left and right. Public and home cinema stereo systems can have even more channels, dividing the sound fields into narrower sectors.

It is important that the externalization and the localization accuracy can be enhanced by applying non-individual HRTFs/BRIRs for the binaural rendering system.

In an example, a sound space is divided into three specific planes: the horizontal plane, the median plane and the frontal plane, as shown in FIG. 1. The three planes are perpendicular to one another and intersect at the origin. This clockwise spherical coordinate system is also called head related coordinate system in some documents, in which the angle between the directional vector of the sound source and the horizontal plane is denoted by elevation angle φ with −90°≤φ≤90° and the angle between the horizontal projection of directional vector and the front is denoted by azimuth angle θ with −180°<θ≤180°. A sound source directly in front of the listening subject corresponds to 0° in Azimuth and Elevation.

In another example, adjustment filters based on peak and notch filters are designed to improve the sound localization in the median plane.

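TABLE 1

| Perceptual cue | Filter Type | Center Frequency | Band Width |
|---|---|---|---|
| “Frontness” | Peak | 4 kHz | 1/4 octave |
| “Frontness” | Notch | 7.5 kHz | 1 octave |
| “Frontness” | Peak | 14 kHz | 1/4 octave |
| “Aboveness” | Peak | 4 kHz | 1/4 octave |
| “Aboveness” | Peak | 8 kHz | 1/4 octave |
| “Behindness” | Peak | 4 kHz | 1/4 octave |
| “Behindness” | Notch | 9 kHz | 1/4 octave |
| “Behindness” | Peak | 11 kHz | 1/4 octave |
| “Behindness” | Notch | 16 kHz | 1/4 octave |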

The positions of the peak and notch filters for frontal, above and rear sound sources are listed in Table 1. In this method, the design of the peak and notch filters is based on the characteristics of the HRTF itself and a few psychoacoustic experiments. Since some information about peaks and notches is already included in the HRTF, applying these filters effectively enlarges the existing spectral differences, which may introduce a coloration problem. In addition, applying identical gain factors for different azimuth angles may introduce a localization problem.

In another example, the input signals are divided into 5 sub-bands by a bandpass filter bank and configured to emphasize or deemphasize each band for maximum localization ability. However, this method requires fine-tuning the gains of all band-pass filters by the user which is not very practical. In addition, the bandwidth of the sub-bands is fixed, and there is no discussion about the choice of the bandwidth. Some psychoacoustic experiments indicated that the bandwidths of filters also play an important role in enhancement of sound source localization. Some methods tried to minimize the cone-of-confusion by spectral adjustments which simulate HRTF characteristics of subjects showing good performance in front-back localization (with large protrusion angle). One method is similar to emphasizing or deemphasizing the magnitude in some special frequencies. However, this method requires individual HRTF measurements, which is not practical. These methods may increase the peak or notch components of HRTF to enlarge the spectral difference of confusion direction. However, in these methods, larger spectral differences between rendered front and rear sound sources cannot guarantee better localization when only frontal or rear sound sources are rendered. These methods are only suitable on the horizontal plane. Also, loss of direction and bad sound quality may result.

In another example, a method is disclosed to enhance externalization of a mono audio signal. As shown in FIG. 2, a mono audio signal is first filtered by a pair of modeled HRTFs, and the filtered signals are then decorrelated to enhance the spaciousness of the sound images. An image-source-method-based reverberator is designed to simulate the reverberation. Finally, a pair of notch filters is designed based on averaged HRTFs at 0° from the center for image processing and integrated computing (CIPIC) database to enhance the sound localization. In this example, the decorrelator is applied to the direct part, and thus the localization accuracy of a frontal sound source may be reduced (there is no separation between the direct sound and the early reflections in the processing). The notch filter is based on measured HRTFs and applied to the binaurally rendered signals. Any mismatch between the user's HRTF and the model used will degrade the perceived quality.

In the case of a pair of virtual stereo signals (e.g., located at −30° and 30°), the generated phantom signal (0°) is difficult to be perceived as externalized. Some methods involving up-mixing stereo signals to center (i.e. center channel signal) and side signals are proposed. In these methods, the center and two side signals can be considered as three virtual sound sources. A method is disclosed to up-mix stereo signals to virtual surround sound to enhance the spaciousness of the rendered signals. However, the externalization and localization of rendered sound sources in the median plane are not enhanced. It is an object of one embodiment of the present disclosure to further enhance externalization based on an upmixed signal.

FIG. 19 shows a schematic diagram of a method for processing a stereo signal according to an embodiment. The method comprises:

S11: obtaining the stereo signal.

In an example, a stereo signal may be obtained by a receiver. For example, the receiver may obtain the stereo signal from another device or another system over a wired or wireless communication channel.

In another example, a stereo signal may be obtained using a processor and at least two microphones. The at least two microphones are used to record information obtained from a sound source, and the processor is used to process the information recorded by the microphones, to obtain the stereo signal.

In one embodiment, the obtaining the stereo signal comprises: obtaining an initial audio signal; and decomposing the initial audio signal, using one or any combination of the following methods: Ambient Phase Estimation, Principal Component Analysis or Least Squares Analysis, to obtain the stereo signal.

S12: obtaining a center channel signal by up-mixing the stereo signal.

Up-mixing, in its most general sense, is the opposite of down-mixing. This means that up-mixing is a process that transforms a set of audio channels into a new set of audio channels which comprises more audio channels than the initial set. For example, up-mixing may transform 2 channels into 5.1 channels. Up-mixing is commonly used to better integrate legacy two-channel mono, stereo, or surround encoded content into 5.1 channel programs. Chosen properly, up-mixing further speeds the transition to 5.1 by helping out legacy content, and by assisting in the creation of new 5.1 channel material.

In an example, a strategy for up-mixing a stereo signal into a multi-channel signal is based on predicting or guessing the way in which the sound engineer would have proceeded if she or he were doing a multi-channel mix. For example, in the direct/ambient approach the ambience signals recorded at the back of the venue in the live recording could have been sent to the rear channels of the surround mix to achieve the immersion of the listener in the sound field. Or in the case of studio mix, a multi-channel reverberation unit could have been used to create this effect by assigning different reverberation levels to the front and rear channels. Also, the availability of a center channel could have helped the engineer to create a more stable frontal image for off-the-axis listening by panning the instruments among three channels instead of two. A series of techniques are disclosed for extracting and manipulating information in the stereo signals. Each signal in the stereo recording is analyzed by computing its Short-Time Fourier Transform (STFT) to obtain its time-frequency representation, and then comparing the two signals in this new domain using a variety of metrics. One or many mapping or transformation functions are then derived based on the particular metric and applied to modify the STFT's of the input signals.
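The STFT-based strategy described above can be illustrated with a minimal sketch. It is not the up-mixing method of the disclosure: the inter-channel similarity metric, the mapping function `similarity * 0.5 * (L + R)`, the window length and all function names are assumptions chosen only to show the analyse, compare, map and resynthesise sequence.

```python
import numpy as np

def stft(x, n_fft=512):
    """Short-time Fourier transform with a periodic Hann window and 50 % overlap."""
    hop = n_fft // 2
    win = np.hanning(n_fft + 1)[:-1]          # periodic Hann: shifted copies sum to 1
    padded = np.concatenate([np.zeros(hop), x, np.zeros(n_fft)])
    n_frames = (len(padded) - n_fft) // hop + 1
    frames = np.stack([padded[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length, n_fft=512):
    """Overlap-add inverse of stft(); returns the first `length` samples."""
    hop = n_fft // 2
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
    return out[hop:hop + length]

def upmix_center(left, right, n_fft=512, eps=1e-12):
    """Split a stereo pair into (side_left, center, side_right)."""
    L, R = stft(left, n_fft), stft(right, n_fft)
    # Assumed metric per time-frequency bin: 1 when both channels carry the
    # same component, 0 when a component is present in one channel only.
    similarity = 2.0 * np.abs(L * np.conj(R)) / (np.abs(L) ** 2 + np.abs(R) ** 2 + eps)
    C = similarity * 0.5 * (L + R)            # mapping function applied to the STFTs
    center = istft(C, len(left), n_fft)
    return left - center, center, right - center
```

With identical left and right inputs the sketch routes the whole signal to the center channel, and with a signal present in one channel only it leaves the center channel empty.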

In another example, in a stereo mix it is common that one featured vocalist or soloist is panned to the center. The intention of the sound engineer doing the mix is to create the auditory impression that the soloist is in the center of the stage. However, in a two-loudspeaker reproduction set up, the listener needs to be positioned exactly between the loudspeakers (e.g., the sweet spot) to perceive the intended auditory image. If the listener moves closer to one of the loudspeakers, the perception is destroyed by the precedence effect, and the image collapses towards the direction of the loudspeaker. For this reason (among others), a center channel containing the dialogue is used in movie theatres, so that the audience sitting towards either side of the room can still associate the dialogue with the image on the screen. In fact, most of the popular home multi-channel formats like 5.1 Surround now include a center channel to deal with this problem. If the sound engineer had had the option to use a center channel, he or she would have probably panned (or sent) the soloist or dialogue exclusively to this channel. Moreover, it is not only the center-panned signal that collapses for off-axis listeners. Sources panned primarily toward one side (far from the listener) might appear to be panned toward the opposite side (closer to the listener). The sound engineer could have also avoided this by panning among the three channels, for example by panning between the center and left-front channels all the sources with spatial locations on the left hemisphere, and panning between the center and right-front channels all sources with locations toward the right.

S13: generating a filtered center channel signal.

A filtered center channel signal is generated by applying one or more peak filters and one or more notch filters to the center channel signal.

In one embodiment, the one or more peak filters and one or more notch filters, comprise: a notch filter centered at a frequency between 4 kHz and 8 kHz and having a 1-octave bandwidth, a first peak filter centered at 4 kHz and having a ⅓-octave bandwidth, and a second peak filter centered at a frequency above 13 kHz and having a ¼-octave bandwidth.

In an example, the typical center frequency for the notch filter is 7 kHz, and the typical center frequency for the second peak filter is 13 kHz.

In one embodiment, the one or more peak filters and one or more notch filters comprise: a first notch filter centered at 9 kHz and having a ¼-octave bandwidth, a second notch filter centered at 16 kHz and having a ¼-octave bandwidth, a first peak filter centered at 1 kHz and having a ⅓-octave bandwidth, and a second peak filter centered at a frequency between 10 kHz and 12 kHz and having a ¼-octave bandwidth.

In an example, the typical center frequency for the second peak filter is 11 kHz.
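As a minimal sketch of how such peak and notch filters could be realized, each filter may be modeled as a second-order peaking equalizer whose bandwidth is given in octaves (the widely used "Audio EQ Cookbook" biquad). The center frequencies and bandwidths below are taken from the two embodiments above; the filter topology, the 48 kHz sampling rate, the ±6 dB gains and all names are assumptions, and a notch is modeled here simply as a peaking filter with negative gain.

```python
import numpy as np

def peaking_biquad(f0, bw_octaves, gain_db, fs=48000.0):
    """Second-order peaking EQ coefficients (b, a); negative gain_db gives a dip."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) * np.sinh(np.log(2.0) / 2.0 * bw_octaves * w0 / np.sin(w0))
    b = np.array([1.0 + alpha * A, -2.0 * np.cos(w0), 1.0 - alpha * A])
    a = np.array([1.0 + alpha / A, -2.0 * np.cos(w0), 1.0 - alpha / A])
    return b / a[0], a / a[0]

def magnitude_db(b, a, freqs, fs=48000.0):
    """Magnitude response in dB of one biquad at the given frequencies (Hz)."""
    z = np.exp(-1j * 2.0 * np.pi * np.asarray(freqs) / fs)   # z^-1 on the unit circle
    h = (b[0] + b[1] * z + b[2] * z ** 2) / (a[0] + a[1] * z + a[2] * z ** 2)
    return 20.0 * np.log10(np.abs(h))

# (center frequency in Hz, bandwidth in octaves, gain in dB).
# Frequencies and bandwidths follow the text; the +/-6 dB gains are hypothetical.
FRONT = [(4000.0, 1 / 3, +6.0), (7000.0, 1.0, -6.0), (13000.0, 1 / 4, +6.0)]
REAR = [(1000.0, 1 / 3, +6.0), (9000.0, 1 / 4, -6.0),
        (11000.0, 1 / 4, +6.0), (16000.0, 1 / 4, -6.0)]

def cascade_db(bands, freqs, fs=48000.0):
    """Magnitude response in dB of all bands applied in series."""
    return sum(magnitude_db(*peaking_biquad(f0, bw, g, fs), freqs, fs)
               for f0, bw, g in bands)
```

For example, `cascade_db(FRONT, np.linspace(100.0, 20000.0, 512))` gives a curve with peaks near 4 kHz and 13 kHz and a broad dip around 7 kHz. In practice the gains would have to be tuned, since the text does not specify them at this point.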

In an example, the filtering process may be performed according to the following formula:

$s' (t) = s (t) * p (t) = \int_{-\infty}^{\infty} p (t - τ) s (τ) dτ$

where:

- s(t) is the input signal;
- p(t) is the peak and notch filter;
- * denotes convolution in the time domain;
- t denotes time, τ is the integration variable, which is integrated from −∞ to ∞, and dτ is an infinitesimal increment of τ.

The input signal s(t) may be a mono signal or a center channel signal.
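For sampled signals the convolution integral becomes a finite sum. The following minimal sketch shows the discrete counterpart, computed directly and, equivalently, as a product of spectra; the function names are hypothetical and `p` stands for the impulse response of the combined peak and notch filter.

```python
import numpy as np

def apply_filter(s, p):
    """Discrete counterpart of s'(t) = (s * p)(t): s'[n] = sum_k p[n - k] s[k]."""
    return np.convolve(s, p)   # full convolution, length len(s) + len(p) - 1

def apply_filter_fft(s, p):
    """Same result computed as a product of spectra (zero-padded to avoid wrap-around)."""
    n = len(s) + len(p) - 1
    return np.fft.irfft(np.fft.rfft(s, n) * np.fft.rfft(p, n), n)
```

The FFT-based form is usually preferred for long impulse responses, because its cost grows more slowly with the filter length than that of the direct sum.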

S14: generating a binaural signal based on the filtered center channel signal.

The method for processing a stereo signal improves the localization and externalization of the stereo signal in the median plane.

In one embodiment, the method further comprises: obtaining a side channel signal by up-mixing the stereo signal; processing the side channel signal, according to a first head related transfer function, to obtain a processed side channel signal; processing the filtered center channel signal, according to a second head related transfer function, to obtain a processed center channel signal; and wherein the generating a binaural signal based on the filtered center channel signal comprises: generating the binaural signal based on the processed side channel signal and the processed center channel signal.

In one embodiment, a head related transfer function convolution is performed according to the formula:

$d_{i} (t) = s (t) * hrir_{i} (t) = \int_{-\infty}^{\infty} hrir_{i} (t - τ) s (τ) dτ, \quad i \in \{left, right\}$

$hrir_{i} (t) = IFFT \{HRTF_{i} (f)\}$

- s(t) is the input signal of this process, * denotes convolution, and d_i(t) is the output signal of this process.
- t denotes time, τ is the integration variable, which is integrated from −∞ to ∞, and dτ is an infinitesimal increment of τ. IFFT denotes the inverse Fourier transform.
- i∈{left,right} means that the index “i” can stand for left or right. For example, hrir_i(t) means hrir_left(t) or hrir_right(t).
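The two relations above can be sketched for sampled signals as follows. The function names are hypothetical, and `toy_hrtf` is only a stand-in for a measured HRTF (a pure delay and gain), used so that the result is easy to check by hand.

```python
import numpy as np

def hrir_from_hrtf(hrtf, n):
    """hrir_i(t) = IFFT{HRTF_i(f)} for the one-sided spectrum of a length-n real filter."""
    return np.fft.irfft(hrtf, n)

def render_direct(s, hrtf_left, hrtf_right, n):
    """d_i(t) = s(t) * hrir_i(t) for i in {left, right}."""
    return (np.convolve(s, hrir_from_hrtf(hrtf_left, n)),
            np.convolve(s, hrir_from_hrtf(hrtf_right, n)))

def toy_hrtf(delay, gain, n):
    """Hypothetical stand-in for a measured HRTF: a delay (in samples) and a gain."""
    k = np.arange(n // 2 + 1)
    return gain * np.exp(-2j * np.pi * k * delay / n)
```

Giving the right ear a longer delay and a lower gain than the left ear, for instance `render_direct(s, toy_hrtf(2, 1.0, 64), toy_hrtf(5, 0.5, 64), 64)`, produces the interaural time and level differences mentioned in the background section.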

In one embodiment, the method further comprises: obtaining a left channel signal and a right channel signal by up-mixing the stereo signal; processing the left channel signal and the right channel signal according to two pairs of head related transfer functions to obtain a processed left channel signal and a processed right channel signal; processing the filtered center channel signal according to a pair of head related transfer functions to obtain a processed center channel signal; and wherein the generating a binaural signal based on the filtered center channel signal comprises: generating a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal, generating a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.

In one embodiment, the method further comprises: filtering the side channel signal and the center channel signal, using one or more decorrelation filters, to obtain a decorrelated side signal and a decorrelated center signal; and obtaining a reflection signal based on the decorrelated side signal and the decorrelated center signal.

In an example, a decorrelated signal is generated in accordance with the following formula (which defines an example of a decorrelation filter):

$s (f_{i}, t) = IFFT \{FFT \{s (t)\} \times C (f_{i}, f)\}$, with $i = 1, 2, 3, \dots, 24$

$s_{left}^{″} (t) = \sum_{i = 1}^{24} s (f_{i}, t)$

$s_{right}^{″} (t) = \sum_{i = 1}^{24} s (f_{i}, t - τ_{i})$

wherein τ_i is randomized, f_i is a center frequency, and the coefficients C(f_i, f) represent a critical band filter bank. FFT denotes the Fourier transform, which transforms the signal from the time domain to the frequency domain. IFFT denotes the inverse Fourier transform, which transforms the signal from the frequency domain to the time domain. f is the frequency and t is the time. Σ_{i=1}^{24} s(f_i, t) denotes the sum s(f₁, t)+s(f₂, t)+s(f₃, t)+s(f₄, t)+ . . . +s(f₂₄, t).

In audiology and psychoacoustics the concept of critical bands describes the frequency bandwidth of the “auditory filter” created by the cochlea, the sense organ of hearing within the inner ear.
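A minimal sketch of the decorrelation filter defined above is given below. Several details are assumptions rather than part of the disclosure: the critical band filter bank C(f_i, f) is modeled as 24 rectangular bands with the commonly tabulated Bark band edges, the last band is extended to the Nyquist frequency, the delays τ_i are drawn uniformly between 1 and `max_delay` samples, and the delay is applied as a circular shift.

```python
import numpy as np

# Critical-band (Bark) edges in Hz; 25 edges give 24 bands.
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
              2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500]

def decorrelate(s, fs=48000.0, max_delay=48, seed=0):
    """Return (s_left, s_right): band-wise sum and band-wise randomly delayed sum."""
    n = len(s)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    spectrum = np.fft.rfft(s)                          # FFT{s(t)}
    rng = np.random.default_rng(seed)
    delays = rng.integers(1, max_delay + 1, size=24)   # tau_i in samples (randomized)
    left = np.zeros(n)
    right = np.zeros(n)
    for i in range(24):
        lo = BARK_EDGES[i]
        # Assumption: the last band is extended up to the Nyquist frequency.
        hi = BARK_EDGES[i + 1] if i < 23 else np.inf
        mask = (freqs >= lo) & (freqs < hi)            # C(f_i, f) as a rectangular band
        band = np.fft.irfft(spectrum * mask, n)        # s(f_i, t)
        left += band
        right += np.roll(band, int(delays[i]))         # s(f_i, t - tau_i), circular shift
    return left, right
```

Because the bands partition the spectrum, the left output equals the input, while the right output keeps the same energy but has a different phase in each band, which lowers the correlation between the two signals.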

In one embodiment, the method further comprises: filtering the left channel signal, the right channel signal and the center channel signal, using one or more decorrelation filters, to obtain a decorrelated left signal, a decorrelated right signal and a decorrelated center signal; and obtaining a reflection signal based on the decorrelated left signal, the decorrelated right signal and the decorrelated center signal.

In one embodiment, the location of the i-th order image sources along the x-, y- and z-coordinates {x_i, y_i, z_i} can be expressed as:

$(\begin{matrix} x_{i} \\ y_{i} \\ z_{i} \end{matrix}) = (\begin{matrix} {(- 1)}^{i} x_{s} + [i + \frac{1 - {(- 1)}^{i}}{2}] x_{r} \\ {(- 1)}^{i} y_{s} + [i + \frac{1 - {(- 1)}^{i}}{2}] y_{r} \\ {(- 1)}^{i} z_{s} + [i + \frac{1 - {(- 1)}^{i}}{2}] z_{r} \end{matrix})$

where {x_s, y_s, z_s} and {x_r, y_r, z_r} are the coordinates of the sound source and the room, respectively.

The angle (θ_i, φ_i) between each image source and the listener can be calculated as:

$θ_{i} = \arccos \frac{z_{i} - z_{r}}{\sqrt{{(x_{i} - x_{r})}^{2} + {(y_{i} - y_{r})}^{2} + {(z_{i} - z_{r})}^{2}}}$

$φ_{i} = \arccos \frac{y_{i} - y_{r}}{x_{i} - x_{r}}$

The attenuation of the early reflections is:

$α_{i} = \frac{1}{\sqrt{{(x_{i} - x_{r})}^{2} + {(y_{i} - y_{r})}^{2} + {(z_{i} - z_{r})}^{2}}}$
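The image-source position and the attenuation can be sketched as follows. The text uses {x_r, y_r, z_r} both in the position formula and in the distance term; as an assumption, the sketch keeps two separate arguments, `room` and `listener`, which can be set to the same values to follow the formulas literally. The angle formulas are not reproduced here, and all names are hypothetical.

```python
import numpy as np

def image_source(i, source, room):
    """Position of the i-th order image source, per axis:
    (-1)^i * x_s + [i + (1 - (-1)^i) / 2] * x_r."""
    source = np.asarray(source, dtype=float)
    room = np.asarray(room, dtype=float)
    sign = -1.0 if i % 2 else 1.0                 # (-1)^i, also valid for negative i
    return sign * source + (i + (1.0 - sign) / 2.0) * room

def attenuation(image, listener):
    """alpha_i = 1 / distance between the image source and the listener."""
    diff = np.asarray(image, dtype=float) - np.asarray(listener, dtype=float)
    return 1.0 / np.sqrt(np.sum(diff ** 2))
```

For i = 0 the function returns the source position itself, and for i = 1 it returns the source mirrored at the planes x = x_r, y = y_r and z = z_r.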

The early reflection can be calculated as (N is the number of early reflections):

$e_{left} (t) = \sum_{i = 1}^{N} α_{i} s_{left}^{″} (t) * hrir_{left} (t, θ_{i}, φ_{i})$

$e_{right} (t) = \sum_{i = 1}^{N} α_{i} s_{right}^{″} (t) * hrir_{right} (t, θ_{i}, φ_{i})$

t is the time, and θ_i and φ_i are the azimuth and elevation angles, respectively. * denotes convolution in the time domain.
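The summation over the N early reflections can be sketched as below. The sketch assumes that the HRIR pair for each direction (θ_i, φ_i) has already been selected from some HRIR set; the lookup itself, the tuple layout and the function name are hypothetical.

```python
import numpy as np

def early_reflections(s_left, s_right, reflections):
    """Sum of attenuated, HRIR-filtered copies of the decorrelated signals.

    `reflections` is a list of (alpha_i, hrir_left_i, hrir_right_i) tuples, where
    the HRIRs are the ones selected for the direction (theta_i, phi_i).
    """
    n_hrir = max(len(h) for _, hl, hr in reflections for h in (hl, hr))
    e_left = np.zeros(len(s_left) + n_hrir - 1)
    e_right = np.zeros(len(s_right) + n_hrir - 1)
    for alpha, hrir_l, hrir_r in reflections:
        yl = alpha * np.convolve(s_left, hrir_l)    # alpha_i * (s''_left * hrir_left)
        yr = alpha * np.convolve(s_right, hrir_r)   # alpha_i * (s''_right * hrir_right)
        e_left[:len(yl)] += yl
        e_right[:len(yr)] += yr
    return e_left, e_right
```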

In one embodiment, the obtaining the stereo signal comprises: obtaining an initial audio signal; decomposing the initial audio signal, using one or any combination of the following methods: Ambient Phase Estimation, Principal Component Analysis or Least Squares Analysis, to obtain the stereo signal and an ambient signal; wherein the method further comprises: obtaining a left channel signal and a right channel signal by up-mixing the stereo signal; adding the ambient signal with the left channel signal, to obtain a left sum signal; adding the ambient signal with the right channel signal, to obtain a right sum signal; processing the left sum signal and the right sum signal, according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; processing the filtered center channel signal, according to a pair of head related transfer functions, to obtain a processed center channel signal; and wherein the generating a binaural signal based on the filtered center channel signal comprises: generating a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal, generating a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.

In one embodiment, the method further comprises: obtaining a left channel signal and a right channel signal by up-mixing the stereo signal; convolving the stereo signal with a local reverberation to obtain a convolved stereo signal; adding the convolved stereo signal with the left channel signal, to obtain a left sum signal; adding the convolved stereo signal with the right channel signal, to obtain a right sum signal; processing the left sum signal and the right sum signal, according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; processing the filtered center channel signal, according to a pair of head related transfer functions, to obtain a processed center channel signal; and wherein the generating a binaural signal based on the filtered center channel signal comprises: generating a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal, generating a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.

In one embodiment, the method further comprises: obtaining a left channel signal and a right channel signal by up-mixing the stereo signal; convolving the stereo signal with a local reverberation to obtain a convolved stereo signal; processing the left channel signal and the right channel signal, according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; processing the filtered center channel signal, according to a pair of head related transfer functions, to obtain a processed center channel signal; and wherein the generating a binaural signal based on the filtered center channel signal comprises: generating a left signal of the binaural signal based on the processed left channel signal, the convolved stereo signal and the processed center channel signal, generating a right signal of the binaural signal based on the processed right channel signal, the convolved stereo signal and the processed center channel signal.

In one embodiment, the late reverberation is calculated, e.g., by convolution with a late reverberation synthesized or recorded in the room (h_late,left(t), h_late,right(t)), according to the following formula:

$l_{left} (t) = s (t) * h_{late,left} (t) = \int_{-\infty}^{\infty} h_{late,left} (t - τ) s (τ) dτ$

$l_{right} (t) = s (t) * h_{late,right} (t) = \int_{-\infty}^{\infty} h_{late,right} (t - τ) s (τ) dτ$

This is a convolution in the time domain: * denotes convolution, t denotes time, τ is the integration variable, which is integrated from −∞ to ∞, dτ is an infinitesimal increment of τ, and s(t) is the input signal in the time domain.

In one embodiment, the binaural signals are the sum of direct sound, early reflections and late reverberation:

Left=d_left(t)+e_left(t)+l_left(t)

Right=d_right(t)+e_right(t)+l_right(t)
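The late reverberation and the final summation can be sketched for one ear as follows. The synthesized late reverberation (exponentially decaying noise with a hypothetical reverberation time) is only one possible choice and is an assumption, as are the function names; a response recorded in a room could be used in its place.

```python
import numpy as np

def synth_late_reverb(fs=48000, rt60=0.3, length_s=0.3, seed=0):
    """Hypothetical synthesized late reverberation: exponentially decaying noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(round(fs * length_s))) / fs
    decay = 10.0 ** (-3.0 * t / rt60)           # amplitude falls by 60 dB after rt60 seconds
    return rng.standard_normal(len(t)) * decay

def mix(*parts):
    """Sum signals of different lengths after zero-padding to the longest one."""
    out = np.zeros(max(len(p) for p in parts))
    for p in parts:
        out[:len(p)] += p
    return out

def binaural_channel(direct, early, s, h_late):
    """One ear: direct sound + early reflections + late reverberation l(t) = s(t) * h_late(t)."""
    late = np.convolve(s, h_late)
    return mix(direct, early, late)
```

Calling `binaural_channel` once with the left-ear components and once with the right-ear components yields the Left and Right signals of the sums above.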

FIG. 20 shows a schematic diagram of an apparatus for processing a stereo signal according to an embodiment. The apparatus comprises: a stereo signal obtain unit configured to obtain the stereo signal; an up-mix unit configured to obtain a center channel signal by up-mixing the stereo signal; one or more peak filters and one or more notch filters configured to filter the center channel signal to obtain a filtered center channel signal; and a binaural signal generate unit (204) configured to generate a binaural signal based on the filtered center channel signal.
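The peak and notch filtering of the center channel signal can be sketched in Python as a cascade of second-order (biquad) sections. This is only one possible realization: it assumes that a "peak filter" means a peaking equalizer and uses the coefficient formulas of the widely used Audio EQ Cookbook. The function names, the center frequencies, the gains and the Q values are hypothetical and are not taken from this paragraph.

```python
import numpy as np


def peak_coeffs(f0, gain_db, q, fs):
    """Peaking-EQ biquad: boosts by gain_db at f0 (assumed filter type)."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1.0 + alpha * a_lin, -2.0 * np.cos(w0), 1.0 - alpha * a_lin])
    a = np.array([1.0 + alpha / a_lin, -2.0 * np.cos(w0), 1.0 - alpha / a_lin])
    return b / a[0], a / a[0]


def notch_coeffs(f0, q, fs):
    """Notch biquad: places a null at f0."""
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1.0, -2.0 * np.cos(w0), 1.0])
    a = np.array([1.0 + alpha, -2.0 * np.cos(w0), 1.0 - alpha])
    return b / a[0], a / a[0]


def biquad(x, b, a):
    """Direct-form I filtering with normalized coefficients (a[0] == 1)."""
    y = np.zeros(len(x))
    x1 = x2 = y1 = y2 = 0.0
    for n, xn in enumerate(x):
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y[n] = yn
    return y


def filter_center(center, fs, peaks, notches):
    """Apply every (f0, gain_db, q) peak and (f0, q) notch in series."""
    y = np.asarray(center, dtype=float)
    for f0, gain_db, q in peaks:
        y = biquad(y, *peak_coeffs(f0, gain_db, q, fs))
    for f0, q in notches:
        y = biquad(y, *notch_coeffs(f0, q, fs))
    return y


def gain_at(b, a, f, fs):
    """Magnitude response of one biquad at frequency f."""
    z = np.exp(-1j * 2.0 * np.pi * f / fs)
    return abs((b[0] + b[1] * z + b[2] * z * z) /
               (a[0] + a[1] * z + a[2] * z * z))
```

A call such as `filter_center(center, 48000, peaks=[(1000.0, 6.0, 2.0)], notches=[(8000.0, 5.0)])` would boost the region around 1 kHz and null 8 kHz; these values are placeholders chosen only to exercise the code.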

In one embodiment, the up-mix unit is further configured to obtain a side channel signal by up-mixing the stereo signal; the apparatus further comprises a head related transfer function (HRTF) unit, wherein the HRTF unit is configured to process the side channel signal, according to a first head related transfer function, to obtain a processed side channel signal; the HRTF unit is further configured to process the filtered center channel signal, according to a second head related transfer function, to obtain a processed center channel signal; and the binaural signal generate unit is configured to generate the binaural signal based on the processed side channel signal and the processed center channel signal.

In one embodiment, the up-mix unit is further configured to obtain a left channel signal and a right channel signal by up-mixing the stereo signal; the apparatus further comprises a head related transfer function (HRTF) unit, wherein the HRTF unit is configured to process the left channel signal and the right channel signal, according to two pairs of head related transfer functions, to obtain a processed left channel signal and a processed right channel signal; the HRTF unit is further configured to process the filtered center channel signal, according to a pair of head related transfer functions, to obtain a processed center channel signal; and the binaural signal generate unit is configured to generate a left signal of the binaural signal based on the processed left channel signal and the processed center channel signal, and to generate a right signal of the binaural signal based on the processed right channel signal and the processed center channel signal.