This application claims the benefit of Korean Patent Application No. 10-2017-0176720, filed on Dec. 21, 2017, and Korean Patent Application No. 10-2018-0050407, filed on May 2, 2018, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.
The present disclosure relates to a signal processing method and device for effectively reproducing an audio signal, and more particularly, to a signal processing method and device to provide an interactive and an immersive three-dimensional audio signal in a head mounted display (HMD).
A binaural rendering technology is essentially required to provide immersive and interactive audio in a head mounted display (HMD) device. Binaural rendering represents modeling a 3D audio, which provides a sound that gives a sense of presence in a three-dimensional space, into a signal to be delivered to the ears of a human being. A listener may be experienced a sense of three-dimensionality from a binaural rendered 2-channel audio output signal through a headphone, an earphone, or the like. A specific principle of the binaural rendering is described as follows. A human being listens to a sound through both ears, and recognizes the position and the direction of a sound source from the sound. Therefore, if a 3D audio may be modeled into audio signals to be delivered to both ears of a human being, the three-dimensionality of 3D audio may be reproduced through a 2-channel audio output without a large number of speakers.
Here, when the number of channels or objects included in an audio signal to be binaural rendered increases, the amount of calculation and power consumption required for binaural rendering may be increased. Therefore, a technology for efficiently performing binaural rendering on an input audio signal is required in a mobile device limited in calculation amount and power consumption.
Furthermore, the number of head related transfer functions (HRTFs) obtainable by the audio signal processing device may be limited due to limited memory capacity and constraints in the measurement process. This may cause degradation of the sound localization performance of the audio signal processing device. Therefore, additional processing of the audio signal processing device for the input HRTF may be required to increase the communicative resolution of the audio signal being reproduced on the three-dimensional space. In addition, a binaural rendered audio signal in a virtual reality may be combined with additional signals to improve reproducibility. In this case, when the audio signal processing device synthesizes the binaural rendered audio signal and the additional signal in time domain, the sound quality of the output audio signal may be degraded due to a comb-filtering effect. This is because timbre may be distorted due to binaural rendering and the different delays of additional signals. Further, when the audio signal processing device synthesizes the binaural-rendered audio signal and the additional signal in frequency domain, an additional amount of computation is required as compared with the case of using only binaural rendering. There is thus a need for techniques to preserve the timbre of an input audio signal while reducing the amount of computation in further processing and synthesis.
An object of an embodiment of the present disclosure is to reduce a distortion of timbre due to a comb-filtering effect in generating an output audio signal by binaural rendering an input audio signal based on a plurality of filters.
An audio signal processing device according to an embodiment of the present disclosure includes a processor for outputting an output audio signal generated based on an input audio signal. The processor may be obtain a first pair of head-related transfer function (HRTF)s including a first ipsilateral HRTF and a first contralateral HRTF based on a position of a virtual sound source corresponding to an input audio signal, from a first set of transfer functions including HRTFs corresponding to each specific position with respect to a listener, and generate an output audio signal by performing binaural rendering the input audio signal based on the first pair of HRTFs, and wherein a phase response of each of the plurality of ipsilateral HRTFs included in the first set of transfer functions in a frequency domain may be the same regardless of the position of the each of the plurality of ipsilateral HRTFs. A phase response of the first ipsilateral HRTF may be a linear phase response.
A contralateral group-delay corresponding to a phase response of the first contralateral HRTF may be determined based on an ipsilateral group-delay corresponding to the modified phase response of the first ipsilateral HRTF, and the phase response of the first contralateral HRTF may be a linear phase response.
The contralateral group-delay may be a value determined by using an interaural time difference (ITD) information with respect to the ipsilateral group-delay.
The ITD information may be a value obtained based on a measured pair of HRTFs, and the measured pair of HRTFs corresponds to the position of the virtual sound source with respect to the listener.
The contralateral group-delay may be a value determined by using a head modeling information of the listener with respect to the ipsilateral group-delay.
The ipsilateral group-delay and the contralateral group-delay are integer multiples of a sample according to a sampling frequency in the time domain.
The processor may be configured to generate the output audio signal, in the time domain, by delaying the input audio signal based on the contralateral group-delay and the ipsilateral group-delay, respectively.
The processor may be configured to generate a final output audio signal based on the phase response modified first pair of HRTFs and an additional audio signal in the time domain, and output the final output audio signal. An ipsilateral group-delay of the additional audio signal may be the same as the ipsilateral group-delay of the first ipsilateral HRTF group-delay and a contralateral group-delay of the additional audio signal may be the same as the contralateral group-delay of the first contralateral HRTF.
The processor may be configured to obtain a panning gain according to the position of the virtual sound source with respect to the listener, filter the input audio signal based on the panning gain, and delay the filtered input audio signal based on the ipsilateral group-delay of the first ipsilateral group-delay and the contralateral group-delay of the first contralateral group-delay to generate the additional audio signal.
The processor may be configured to generate the output signal by binaural rendering the input audio signal based on the first pair of HRTFs, generate the additional audio signal by filtering the input audio signal based on an additional filter pair including an ipsilateral additional filter and a contralateral additional filter, and generate the final output audio signal by mixing the output audio signal and the additional audio signal in the time domain. A phase response of the ipsilateral additional filter may be the same as the phase response of the first ipsilateral HRTF, and a phase response of the contralateral additional filter may be the same as the phase response of the first contralateral HRTF.
The additional filter pair may be a filter generated based on a panning gain according to the position of the virtual sound source with respect to the listener, and a magnitude component of frequency response of each of the ipsilateral additional filter and the contralateral additional filter may be constant.
The additional filter pair may be a filter generated based on a size of an object modeled by the virtual sound source and a distance from the listener to the virtual sound source.
A phase response of each of a plurality of HRTFs included in the first set of transfer functions in the frequency domain may be the same each other regardless of the position corresponding to each of the plurality of HRTFs. The processor may be configured to obtain the first pair of HRTFs based on at least two pairs of HRTFs when the position of the virtual sound source may be a position other than a position corresponding to each of the plurality of HRTFs. The at least two pairs of HRTFs may be obtained based on the position of the virtual sound source from the first set of transfer functions.
The processor may be configured to obtain the first pair of HRTFs by interpolating the at least two pairs of HRTFs in a time domain.
The processor may be configured to obtain a second pair of HRTFs including a second ipsilateral HRTF and a second contralateral HRTF, based on the position of the virtual sound source, from a second set of transfer functions other than the first set of transfer functions, and generate the output audio signal based on the first pair of HRTFs and the second pair of HRTFs. A phase response of the second ipsilateral HRTF may be same as the phase response of the first ipsilateral HRTF, and a phase response of the second contralateral HRTF may be the same as the phase response of the first contralateral HRTF.
An operation method for an audio signal processing device outputting an output audio signal generated based on an input audio signal including the steps of: obtaining a pair of head-related transfer function (HRTF)s including a ipsilateral HRTF and a contralateral HRTF based on a position of a virtual sound source corresponding to an input audio signal, from a set of transfer functions including HRTFs corresponding to each specific position with respect to a listener; and generating an output audio signal by performing binaural rendering the input audio signal based on the pair of HRTFs. A phase response of each of the plurality of ipsilateral HRTFs included in the set of transfer functions in a frequency domain may be the same regardless of the position of the each of the plurality of ipsilateral HRTFs.
An audio signal processing device according to an embodiment of the present disclosure includes a processor for outputting an output audio signal generated based on an input audio signal. The processor may be configured to obtain a first pair of head-related transfer function (HRTF)s including a first ipsilateral HRTF and a first contralateral HRTF based on a position of a virtual sound source corresponding to the input audio signal, from a first set of transfer functions including HRTFs corresponding to each specific position with respect to listener, modify a phase response of the first ipsilateral HRTF in a frequency domain to be a specific phase response that may be the same regardless of the position of the virtual sound source, and generate the output audio signal by performing binaural rendering the input audio signal based on the first pair of HRTFs of which the phase response of the first ipsilateral HRTF may be modified. The specific phase response may be a linear phase response.
The processor may be configured to determine a contralateral group-delay based on an ipsilateral group-delay corresponding to the modified phase response of the first ipsilateral HRTF in a time domain, modify a phase response of the first contralateral HRTF based on the contralateral group-delay, and generate the output audio signal by binaural rendering the input audio signal based on the phase response modified first pair of HRTFs of which phase responses of the first ipsilateral HRTF and the first contralateral are modified, and wherein the modified phase response of the first contralateral HRTF may be a linear phase response.
The processor may be configured to determine the contralateral group-delay based on a head modeling information of the listener.
The processor may be configured to obtain an interaural time difference (ITD) information based on the first pair of HRTFs obtained from the first set of transfer functions, and determine the contralateral group-delay based on the ITD information.
The ipsilateral group-delay and the contralateral group-delay are integer multiples of a sample according to a sampling frequency in the time domain.
The processor may be configured to in the time domain, generate the output audio signal by delaying the input audio signal based on the contralateral group-delay and the ipsilateral group-delay, respectively.
The processor may be configured to generate a final output audio signal based on the phase response modified first pair of HRTFs and an additional audio signal in the time domain, and wherein each group-delay of an ipsilateral and a contralateral of the additional audio signal may be the same as each of the ipsilateral group-delay and the contralateral group-delay, respectively.
The processor may be configured to determine a panning gain based on the position of the virtual sound source with respect to the listener, filter the input audio signal based on the panning gain, and delay the filtered input audio signal based on the ipsilateral group-delay and the contralateral group-delay to generate the additional audio signal.
The processor may be configured to generate the output signal by binaural rendering the input audio signal based on the phase response modified first pair of HRTFs, generate the additional audio signal by filtering the input audio signal based on an additional filter pair including an ipsilateral additional filter and a contralateral additional filter, and generate the final output audio signal by mixing the output audio signal with the additional audio signal. A phase response of the ipsilateral additional filter may be the same as the modified phase response of the first ipsilateral HRTF, and a phase response of the contralateral additional filter may be the same as the modified phase response of the first contralateral HRTF.
A magnitude component of frequency response of each of the ipsilateral additional filter and the contralateral additional filter may be constant. The processor may be configured to determine a panning gain based on the position of the virtual sound source with respect to the listener, generate the additional filter pair with setting the panning gain as the constant magnitude response, and generate the additional audio signal by filtering the input audio signal based on the additional filter pair.
The processor may be configured to generate the additional filter pair based on a size of an object modeled by the virtual sound source and a distance from the listener to the virtual sound source, and generate the additional audio signal by filtering the input audio signal based on the additional filter pair.
A phase response of each of the plurality of HRTFs included in the first set of transfer functions may be the same each other regardless of the position of the plurality of HRTFs. The processor may be configured to obtain at least two pairs of HRTFs among the first set of transfer functions based on the position of the virtual sound source, when the position of the virtual sound source may be a position other than a position corresponding to each of the plurality of HRTFs, and obtain the first pair of HRTFs by interpolating the at least two pairs of HRTFs in a time domain.
The processor may be configured to obtain a second pair of HRTFs including a second ipsilateral HRTF and a second contralateral HRTF, based on the position of the virtual sound source, from a second set of transfer functions other than the first set of transfer functions, modify a phase response of the second ipsilateral HRTF to be the modified phase response of the first ipsilateral HRTF, modify a phase response of the second contralateral HRTF to be the modified phase response of the first contralateral HRTF, and generate the output audio signal based on the phase response modified first pair of transfer functions and the phase response modified second pair of transfer functions.
An audio signal processing device and method according to an embodiment of the present disclosure may reduce the deterioration in sound quality due to the comb-filtering effect occurring in the binaural rendering process. Furthermore, the audio signal processing device and method may reduce the distortion of timbre occurring in the process of binaural rendering an input audio signal based on a plurality of filters to generate an output audio signal.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that the embodiments of the present disclosure can be easily carried out by those skilled in the art. However, the present disclosure may be implemented in many different forms and is not limited to the embodiments described herein. Some parts of the embodiments, which are not related to the description, are not illustrated in the drawings to clearly describe the embodiments of the present disclosure. Like reference numerals refer to like elements throughout the description.
When it is mentioned that a certain part “includes” or “comprises” certain elements, the part may further include other elements, unless otherwise specified. When it is mentioned that a certain part “includes” or “comprises” certain elements, the part may further include other elements, unless otherwise specified.
The present disclosure relates to a method for binaural rendering an input audio signal to generate an output audio signal. An audio signal processing device according to an embodiment of the present disclosure may generate an output audio signal based on the binaural transfer function pair whose phase response has been changed. The phase response represents the phase component of the frequency response. Further, the audio signal processing device may change a phase response of an initial binaural transfer function pair corresponding to an input audio signal. The device for processing an audio signal according to an embodiment of the present disclosure may mitigate a comb-filtering effect generated in a binaural rendering process by using a transfer function which has an adjusted phase response. In addition, the audio signal processing device may mitigate timbre distortion while maintaining the sound image localization performance of the input audio signal. In this disclosure, the transfer function may include a head-related transfer function (HRTF).
Hereinafter, the present disclosure will be described in detail with reference to the accompanying drawings
The receiving unit 110 may receive an audio signal. The receiving unit 110 may receive an input audio signal input to the audio signal processing device 100. The receiving unit 110 may receive an input audio signal to be binaural rendered by the processor 120. Here, the input audio signal may include at least one of an ambisonics signal, an object signal or a channel signal. Here, the input audio signal may be one object signal or mono signal. The input audio signal may be a multi-object or multi-channel signal. According to an embodiment, when the audio signal processing device 100 includes a separate decoder, the audio signal processing device 100 may receive an encoded bitstream of the input audio signal.
According to an embodiment, the receiving unit 110 may be equipped with a receiving means for receiving the input audio signal. For example, the receiving unit 110 may include an audio signal input port for receiving the input audio signal transmitted by wire. Alternatively, the receiving unit 110 may include a wireless audio receiving module for receiving the audio signal transmitted wirelessly. In this case, the receiving unit 110 may receive the audio signal transmitted wirelessly by using a Bluetooth or Wi-Fi communication method.
The processor 120 may control the overall operation of the audio signal processing device 100. The processor 120 may control each component of the audio signal processing apparatus 100. The processor 120 may perform operations and processes for various data and signals. The processor 120 may be implemented as hardware in the form of a semiconductor chip or electronic circuit, or may be implemented as software that controls hardware. The processor 120 may be implemented as a combination of hardware and software. For example, the processor 120 may control operations of the receiving unit 110 and the output unit 130 by executing at least one program. Furthermore, the processor 120 may execute at least one program to perform the operations of the audio signal processing device 100 described below with reference to
For example, the processor 120 may generate an output audio signal. The processor 120 may generate the output audio signal by binaural rendering the input audio signal received through the receiving unit 110. The processor 120 may output the output audio signal through the output unit 130 that will be described later. According to an embodiment, the output audio signal may be a binaural audio signal. For example, the output audio signal may be a 2-channel audio signal representing the input audio signal as a virtual sound source located in a three-dimensional space. The processor 120 may perform binaural rendering based on a transfer function pair that will be described later. The processor 120 may perform binaural rendering in a time domain or a frequency domain.
According to an embodiment, the processor 120 may generate a 2-channel output audio signal by binaural rendering the input audio signal. For example, the processor 120 may generate the 2-channel output audio signal corresponding to both ears of a listener, respectively. Here, the 2-channel output audio signal may be a binaural 2-channel output audio signal. The processor 120 may generate an audio headphone signal represented in three dimensions by binaural rendering the above-mentioned input audio signal.
According to an embodiment, the processor 120 may generate the output audio signal by binaural rendering the input audio signal based on a transfer function pair. The transfer function pair may include at least one transfer function. For example, the transfer function pair may include a pair of transfer functions corresponding to both ears of the listener. The transfer function pair may include an ipsilateral transfer function and a contralateral transfer function. In detail, the transfer function pair may include an ipsilateral head related transfer function (HRTF) corresponding to a channel for an ipsilateral ear and a contralateral HRFT corresponding to a channel for a contralateral ear. Hereinafter, for convenience of explanation, if there is no special description, transfer function (or HRTF) is used as a term indicating at least one transfer function included in the transfer function (or HRTF) pair.
According to one embodiment, the processor 120 may determine a transfer function pair based on a position of a virtual sound source corresponding to an input audio signal. In this case, the processor 120 may obtain the transfer function pair from another apparatus (not shown) other than the audio signal processing device 100. For example, the processor 120 may receive at least one transfer function from a database that includes a plurality of transfer functions. The database may be an external device that stores a set of transfer functions including a plurality of transfer function pairs. In this case, the audio signal processing device 100 may include a separate communication unit (not shown) for requesting a transfer function to the database and receiving information about the transfer function from the database. The processor 120 may obtain a transfer function pair corresponding to the input audio signal based on a set of transfer functions stored in the audio signal processing device 100. The processor 120 may binaurally render the input audio signal based on the acquired transfer function pair to generate an output audio signal.
According to an embodiment, post-processing may be additionally performed on the output audio signal of the processor 120. The post-processing may include crosstalk cancellation, dynamic range control (DRC), sound volume normalization, peak limitation, etc. Furthermore, the post-processing may include frequency/time domain conversion for the output audio signal of the processor 120. The audio signal processing device 100 may include a separate post-processing unit for performing the post-processing, and according to another embodiment, the post-processing unit may be included in the processor 120.
The output unit 130 may output the output audio signal. The output unit 130 may output the output audio signal generated by the processor 120. The output unit 130 may include at least one output channel. Here, the output audio signal may be a 2-channel output audio signal respectively corresponding to both ears of the listener. The output audio signal may be a binaural 2-channel output audio signal. The output unit 130 may output a 3D audio headphone signal generated by the processor 120.
According to an embodiment, the output unit 130 may be equipped with an output means for outputting the output audio signal. For example, the output unit 130 may include an output port for externally outputting the output audio signal. Here, the audio signal processing device 100 may output the output audio signal to an external device connected to the output port. The output unit 130 may include a wireless audio transmitting module for externally outputting the output audio signal. In this case, the output unit 130 may output the output audio signal to an external device by using a wireless communication method such as Bluetooth or Wi-Fi. The output unit 130 may include a speaker. Here, the audio signal processing device 100 may output the output audio signal through the speaker. Furthermore, the output unit 130 may additionally include a converter (e.g., digital-to-analog converter, DAC) for converting a digital audio signal to an analog audio signal.
A binaural rendered audio signal in a virtual reality may be combined with additional signals to increase reproducibility. Accordingly, an audio signal processing device may generate a binaural filter that binaural renders an input audio signal based on a plurality of filters. In addition, the audio signal processing device may synthesize the filtered audio signals based on the plurality of filters. In this case, the quality of the final output audio signal may be degraded due to the difference between the phase characteristics of a frequency response of the plurality of filters (i.e., the time delay difference in the time domain). This is because the timbre of the output audio signal may be distorted due to the comb-filtering effect.
Thus, the audio signal processing device may modify the phase response of the position-specific HRTF corresponding to each specific position with respect to the listener. For example, the location-specific HRTF may include an HRTF corresponding to each location on the unit sphere with respect to the listener. According to an embodiment of the present disclosure, the audio signal processing device may binaural render the input audio signal by using a set of transfer functions of which the phase responses of the ipsilateral HRTFs are modified to coincide with each other. The audio signal processing device may synchronize each of the phase responses of the ipsilateral HRTFs for each position to have the same linear phase response. In addition, the audio signal processing device may linearize each of the phase responses of the position-specific contralateral HRTFs.
Hereinafter, an operation method of an audio signal processing device according to an embodiment of the present disclosure will be described with reference to
The HRTF may be measured in an anechoic room. The HRTF may also include information on the HRTF estimated by simulation. The simulation methods used to estimate HRTF may be at least one of spherical head model (SHM), snowman model, finite-difference time-domain method (FDTDM), or boundary element method (BEM). In this case, the spherical head model represents a simulation technique in which a human head is assumed to be spherical. In addition, the snowman model represents a simulation technique in which the head and body are assumed to be spherical.
In addition, the set of HRTFs may include HRTF pairs defined corresponding to the angles at predetermined angular intervals. For example, the predetermined angular interval may be 1 degree or 10 degrees, but the present disclosure is not limited thereto. In the present disclosure, angles may include azimuths, elevations, and combinations thereof. For example, the set of HRTFs may include a head transfer function corresponding to each combination of the azimuths and elevations with respect to the center of sphere having the predetermined value as radius of the sphere. In addition, in the present disclosure, any coordinate system that defines the azimuth and the elevation may be either a vertical polar coordinate system (VPC) or an interaural polar coordinate system (IPC). Further, the audio signal processing device may use pairs of HRTFs defined for every predetermined angular interval to obtain a pair of HRTFs corresponding to an angle between predetermined angular intervals. This will be described later with reference to
According to an embodiment, the audio signal processing device may obtain a set of transfer functions (HRTF ‘set) whose phase responses are modified. For example, the audio signal processing device may generate the set of transfer function (HRTF’ set) whose phase responses are modified from an obtained set of transfer function (HRTF set). The audio signal processing device may obtain the set of transfer function (HRTF′ set) or a pair of HRTFs whose phase response is modified from an external device. In addition, the audio signal processing device may binaural render an input audio signal based on the set of transfer functions (HRTF ‘set) whose phase response is modified.
For example, the audio signal processing device may obtain HRTF’ whose phase response has been modified (S102). Specifically, the audio signal processing device may obtain pairs of HRTFs corresponding to the input audio signal from the set of transfer functions. For example, the audio signal processing device may obtain at least one pair of HRTFs that simulate the input audio signal based on a position of a virtual sound source corresponding to the input audio signal with respect to a listener. When there are a plurality of virtual sound sources corresponding to the input audio signal, a plurality of HRTF pairs corresponding to the input audio signals may be provided. Further, the audio signal processing device may obtain a plurality of HRTF pairs based on the position of the virtual sound source. For example, when the size of an object simulated by the virtual sound source is equal to or larger than a predetermined size, the audio signal processing device may obtain an output audio signal based on a plurality of HRTF pairs. Further, the pair of HRTFs may be a pair composed of an ipsilateral HRTF and a contralateral HRTF corresponding to different positions. For example, the audio signal processing device may obtain the ipsilateral HRTF and the contralateral HRTF corresponding to different positions based on the position of the virtual sound source corresponding to the input audio signal.
Next, the audio signal processing device may modify the phase response of the HRTF pair. In addition, the audio signal processing device may receive a set of HRTF′ whose phase response has been modified from an external device. In this case, the audio signal processing device may obtain the HRTF′ pair whose phase response has modified from the modified set of HRTF's. Next, the audio signal processing device may binaural render the input audio signal based on the HRTF′ pair whose phase response has been modified. At least some of the operations of the audio signal processing device described with reference to
Hereinafter, a method for modifying a phase response of each of a plurality of HRTFs included in an obtained set of HRTFs according to an embodiment of the present disclosure will be described with reference to
Next, the audio signal processing device may obtain a final HRTF pair by modifying the phase response of the original HRTF. The modification of the phase response in this disclosure may include a replacement, substitution or correction of the phase value corresponding to some frequency bins, of the phase response. Alternatively, the phase response for some of the plurality of HRTFs included in the set of HRTFs may be maintained. Specifically, the audio signal processing device may obtain a final ipsilateral HRTF by setting the phase response of an original ipsilateral HRTF as a common ipsilateral phase response. Here, the common ipsilateral phase response may be a single phase response for a plurality of ipsilateral HRTFs included in a set of HRTFs.
For example, the audio signal processing device may set each of the phase responses of the ipsilateral HRTFs according to each specific position with respect to the listener to be a specific phase response that is same regardless of the position corresponding to each of the ipsilateral HRTFs. The audio signal processing device may match the phase response of the final ipsilateral HRTF with the common ipsilateral phase response that is the same regardless of the position of the virtual sound source corresponding to the input audio signal. In the case of a human auditory sense, a position of a sound source may be recognized based on the difference in sound volume and the difference in arrival time, between both ears of a human being. Accordingly, the audio signal processing device may fix the phase response of either the ipsilateral or the contralateral in a position-independent response. In this way, the audio signal processing device may reduce the amount of data to be stored. For example, the audio signal processing device may fix a phase response of the ipsilateral HRTF. Because the energy of the audio signal is larger on an ipsilateral than on a contralateral. Further, the audio signal processing device may set phase responses of the non-fixed side, based on the difference between the phase responses of the ipsilateral HRTF and the contralateral HRTF included in the HRTF pair for each position. According to an embodiment, the common ipsilateral phase response may be a linear response with linear characteristics. This will be described later with reference to
Further, the audio signal processing device may modify a phase response of an original contralateral HRTF to obtain a final contralateral HRTF. The audio signal processing device may obtain a contralateral phase response for the final contralateral HRTF based on an interaural phase difference (IPD) representing a phase difference between the ipsilateral and the contralateral. For example, the audio signal processing device may determine the contralateral phase response based on the phase response of the final ipsilateral HRTF.
Specifically, the audio signal processing device may obtain an IPD corresponding to the input audio signal based on IPDs of each specific position with respect to the listener. The audio signal processing device may calculate the phase difference between the original ipsilateral HRTF and the original contralateral HRTF to obtain the IPD corresponding to the input audio signal. The audio signal processing device may obtain the contralateral phase response based on the difference between the phase response of the ipsilateral HRTF and the contralateral HRTF for each frequency bin. Meanwhile, the phase response deformation of the HRTF may be performed in the time domain. For example, the audio signal processing device may apply a group-delay to the HRIR converted from the HRTF. This will be described later with reference to
Meanwhile, a slop of the phase response of the original ipsilateral HRTF included in the original set of HRTFs may not be constant for each frequency. Because of measurement errors or over fitting to a subject, the phase response of the original HRTF is less likely to be an ideal linear phase response. In this case, the time delay of HRTF for each frequency bin varies in the time domain due to the difference between phase values for each frequency bin, so that an additional distortion of the timbre may occur. According to an embodiment, the audio signal processing device may generate an output audio signal based on the ipsilateral HRTF whose phase characteristics are linearized in a frequency domain. In the embodiment described above with reference to
phi_unwrap,lin[k]=(phi_unwrap[HN]−phi_unwrap[0])/HN*k+phi_unwrap[0], where k is an integer and 0≤k≤HN. [Equation 1]
In Equation 1, k denotes an index of a frequency bin. Also, HN denotes the Nyquist frequency bin, and phi_unwrap [HN] denotes an unwrapping phase value at the Nyquist frequency bin. phi_unwrap [0] denotes an unwrapping phase value corresponding to frequency bin DC, and phi_unwrap, lin [k] represents a linearized unwrapping phase value corresponding to frequency bin k. As in Equation 1, the audio signal processing device may obtain a phase value for each frequency bin by using the linear approximated slope of the phase response. The audio signal processing device may wrap the unwrapping phase response so as to be a value between (−π, π) in a phase-axis to obtain the wrapping phase response. In addition, as in
IPD[k]=phi_unwrap,lin,left[k]−phi_unwrap,lin,right[k] [Equation 2]
In
As described above, the audio signal processing device according to an embodiment of the present disclosure may perform at least part of the process of modifying the phase response of the HRTF in the time domain. For example, the audio signal processing device may convert HRTF to HRIR, which is a response in the time domain. In this case, the phase response of the HRTF may be a zero-phase response. In the case of the zero-phase response, the amount of calculation required for audio signal processing may be reduced as described later. The audio signal processing device may perform an inverse fast Fourier transform (IFFT) on the HRTF to obtain the HRIR. Next, the audio signal processing device may modify the phase response of the HRTF by time delaying an ipsilateral HRIR and a contralateral HRIR based on the group-delay, respectively. Also, when converting the group-delay applied HRIR to HRTF, which is a frequency domain response, a phase response of the HRTF may be the linear phase response described above.
Specifically, the audio signal processing device may generate a final ipsilateral HRIR by delaying the ipsilateral HRIR based on the ipsilateral group-delay in the time domain. In this case, the ipsilateral group-delay may be a value independent of a position of a virtual sound source simulated by the HRTF. For example, the ipsilateral group-delay may be a value set based on frame size of the input audio signal. Further, the frame size may indicate the number of samples included in one frame. Accordingly, the audio signal processing device may prevent the filter coefficient of the HRIR out of the frame size based on the time ‘0’. The audio signal processing device may apply the same ipsilateral group-delay to a plurality of ipsilateral HRIRs included in a set of HRIRs. The audio signal processing device may obtain the final ipsilateral HRIR by delaying the ipsilateral HRIR based on the ipsilateral group-delay. Further, the audio signal processing device may convert the HRIR to which the ipsilateral group-delay is applied to a response of a frequency domain to obtain the final ipsilateral HRTF.
In addition, the audio signal processing device may generate a final contralateral HRIR by delaying the contralateral HRIR based on the contralateral group-delay in the time domain. In this case, the contralateral group-delay may be a value set based on the position of the virtual sound source simulated by the contralateral HRTF, unlike the ipsilateral group-delay. This is because the interaural time difference (ITD) may be varied depending on the position of the virtual sound source corresponding to the input audio signal with respect to the listener, which indicates the arrival time difference of the audio signal between the ipsilateral and the contralateral. The audio signal processing device may determine the contralateral group-delay for applying to the contralateral HRIR based on the ITD for each specific position with respect to the listener. In this case, the contralateral group-delay may be an ITD time for the position of the virtual sound source corresponding to the input audio signal with respect to the listener added to the ipsilateral group-delay time.
Also, the audio signal processing device may convert the HRIR to which contralateral group-delay is applied to a response of the frequency domain to obtain a final contralateral HRTF. In this case, as the slope of the phase response of the contralateral HRTF increases, the contralateral group-delay value be increased. Further, the audio signal processing device may determine different contralateral group-delay for each specific position with respect to the listener, based on a group-delay of an ipsilateral HRIR and a ITD. Hereinafter, a method of obtaining the ITD by the audio signal processing device according to an embodiment of the present disclosure will be described in detail with reference to
According to an embodiment, the audio signal processing device may obtain the ITD (or IPD) based on the correlation between the ipsilateral HRIR (or HRTF) and the contralateral HRIR (or HRTF). In this case, the HRIR may be a personalized HRIR. This is because cross-correlation between ipsilateral HRIR and contralateral HRIR (or HRTF) may vary depending on the head model of the listener. The audio signal processing device may also obtain the ITD by using personalized HRIRs that is a measured response based on the head model of the listener. The audio signal processing device may calculate the ITD based on the cross-correlation between the ipsilateral HRIR and the contralateral HRIR as shown in Equation 3 below.
maxDelay=xcorr(HRIR_cont,HRIR_ipsil),
ITD=abs(maxDelay−HRIR_length) [Equation 3]
In Equation 3, xcorr(x,y) is a function of outputting an index of the delay time (maxDelay) corresponding to the highest cross-correlation among cross-correlations between x and y for each delay time. In Equation 3, HRIR_cont and HRIR_ipsil indicates the contralateral HRIR and the ipsilateral HRIR, respectively, and HRIR_length indicates the length of the HRIR filter in the time domain.
Specifically, the audio signal processing device includes an ITD for elevation corresponding to each of a plurality of points 601, 602, 603, and 604 where a sagittal plane corresponding to a first azimuth angle 630 and a unit sphere centering on the listener meet, may be obtained. In this case, the plurality of points 601, 602, 603, and 604 may have the same azimuth and different elevations in the IPC. Further, the audio signal processing device may obtain a common ITD corresponding to the first azimuth 630 based on ITD for each elevation. For example, the audio signal processing device may use any one of an average value, a median value, and a mode value of ITD for each elevation as a group ITD corresponding to the first azimuth angle 630. In this case, the audio signal processing device may determine a contralateral group-delay that equally applies to a plurality of contralateral HRTFs corresponding to the first azimuth angle 630 and having different elevation angles based on the group ITD.
Equation 4 represents an operation process of the audio signal processing device when the audio signal processing device uses the median value of ITD for each elevation as the group ITD.
t_cont=median{argmax_t(xcorr(HRIR_cont(n,a,e),HRIR_ipsil(n,a,e)))−HRIR_length}+t_pers+t_ipsil [Equation 4]
In Equation 4, xcorr(x,y) is a function of outputting an index of the delay time (maxDelay) corresponding to the highest cross-correlation among cross-correlations between x and y for each delay time. In Equation 4, HRIR_cont and HRIR_ipsil indicates the contralateral HRIR and the ipsilateral HRIR, respectively, and HRIR_length indicates the length of the HRIR filter in the time domain. t_pers indicates an additional delay for personalization for each listener, ‘a’ indicates an azimuth index, ‘e’ indicates an elevation index, and t_ipsil indicates an ipsilateral group-delay.
According to an embodiment, the audio signal processing device may obtain a contralateral phase response based on the head modeling information of the listener. This is because the ITD may vary depending on the head shape of the listener. The audio signal processing device may use the head modeling information of the listener to determine a personalized contralateral group-delay. For example, the audio signal processing device may determine the contralateral group-delay based on the head modeling information of the listener and the position of the virtual sound source corresponding to the input audio signal with respect to the listener.
d_cont=sqrt((1m)̂2+r̂2−2*r*cos(90+abs(theta)))
tau_cont=d_cont/c
d_ipsil=sqrt((1m)̂2+r̂2−2*r*cos(90−abs(theta)))
tau_ipsil=d_ipsil/c, [Equation 5]
where c is the sound velocity (343 m/s), and −90<theta<90.
In Equation 5, ‘r’ may be the radius of the approximated sphere based on the head of the listener. Alternatively, ‘r’ may be the distance from the center of the listener's head to both ears. In this case, the distance from the center of the listener's head to the ipsilateral ear and to the contralateral ear may be different each other (for example, r1 and r2). Further, ‘1 m’ indicates the distance from the center of the listener's head to the virtual sound source corresponding to the input audio signal. d_cont indicates the distance from the contralateral ear of the listener to the virtual sound source, and d_ipsil indicates the distance from the ipsilateral ear of the listener to the virtual sound source. The audio signal processing device may determine the contralateral group-delay based on the personalized ITD measured for each specific position with respect to the listener.
phi_L=−w·T_L
phi_R=−w·T_R [Equation 6]
In Equation 6, ‘w’ denotes angular frequency. The derivative values of phi_L and phi_R with respect to ‘w’ are constant as −T_L and −T_R, respectively. Thus, group-delays of each of the left side and the right side may be the same throughout the frequency domain, respectively. The audio signal processing device may obtain T_L and T_R based on the position of the virtual sound source and the head size information. For example, the audio signal processing device may obtain the T_L and T_R by calculating as shown in Equation 7, based on the distance d between the virtual sound source and the right ear, and the radius r of the approximated sphere based on the head of the listener.
T_R=d/c [Equation 7]
where, T_L=T_R+(r+pi*r/2)/c, and pi is circumference.
Further, according to an embodiment, the audio signal processing device may calculate the modified ITD′ by adding an additional delay in addition to the obtained ITD. For example, the audio signal processing device may calculate the modified ITD′ by adding different additional delays (Delay_add) according to the angle between the listener and the sound source. Equation 8 shows a method of adding the additional delay (Delay_add) by dividing a section with respect to the azimuth determined by positions of the listener and the sound source. In Equation 8, ‘slope’ may indicate the slope of the phase response set based on a user-input, for each azimuth section. Also, round (x) denotes a function for outputting the result of rounding off the x value. And d1 and d2 denote parameters for determining the slope of the phase response for each azimuth section. For example, the audio signal processing device may set the values of d1 and d2 based on the user input, respectively.
ITDs'=ITDs+Delay_add
Delay_add=round(slope*azimuth), [Equation 8]
where if 0<=azimuth <=45, then slope=1/d1 (0<d1 and, d1 is an integer), and if 45<azimuth <=90, then slope=1/d2 (0<d2 and, d2 is an integer).
Also, according to an embodiment, the group-delay may be a delay time corresponding to an integer number of sample(s) based on a sampling frequency. In this case, additional utilization of an audio signal whose characteristics have been modified may be increased. The audio signal processing device may set the ipsilateral group-delay and the contralateral group-delay which is an integer multiple(s) of the sample(s). Further, when a sample out of the frame size occurs, the audio signal processing device may truncate an area that is symmetric to the sample out of the frame size based on the peak point from the front of the HRIR sample. Thus, the audio signal processing device may reduce the deterioration in sound quality caused by the sample out of the frame size.
Meanwhile, in order to perform binaural rendering covering all points on a virtual three-dimensional space around a listener, an audio signal processing device needs to obtain HRTF corresponding to all points. However, since constraints in measurement process and capacity of storable data are limited, additional processing may be required to obtain the HRTF corresponding to all points in the virtual three-dimensional space. In addition, in the case of measurement-based HRTF, additional processing may be required due to an error in magnitude response and phase response occurred during the measurement process.
Accordingly, an audio signal processing device, by using a plurality of HRTFs obtained previously, may generate an HRTF corresponding to a position other than the position of each of the plurality of obtained HRTFs. Thus, the audio signal processing device may enhance a spatial resolution of the audio signal simulated in the virtual three-dimensional space, and correct errors in the magnitude response and the phase response. Hereinafter, the method for obtaining the HRTF corresponding to the position other than the positions corresponding to the plurality of HRTFs included in the set of HRTFs by the audio signal processing device according to an embodiment of the present disclosure will be described with reference to
In addition, the audio signal processing device may receive an input to at least one of the number (N-M) of HRTFs to be added, the position of the HRTF to be added, or the group-delay, in processing the original HRTF pair. Specifically, the original set of HRTFs may include HRTFs for each angle according to predetermined angular spacing. Where the angle may be at least one of an azimuth or an elevation on a unit sphere centered at the listener. In addition, the predetermined angular spacing may include an angular spacing in the elevation direction and an angular spacing in the azimuth direction. In this case, the angular spacings for the elevation direction and the azimuth angle direction may be set to be different from each other.
For example, the audio signal processing device may obtain an HRTF corresponding to a position between a first angle and a second angle according to the predetermined angular interval. Specifically, the first angle and the second angle may have the same azimuth value and different elevation values separated by a predetermined angle interval. In this case, the audio signal processing device may interpolate a first HRTF corresponding to the first angle and a second HRTF corresponding to the second angle to generate a third HRTF corresponding to the different angle of elevation between the first angle and the second angle. In the above-described method, the audio signal processing device may generate a plurality of HRTFs corresponding to each of a plurality of positions located between the first angle and the second angle. Here, the number of HRTFs to be subjected to interpolation is described as two, but this is merely an example, and the present disclosure is not limited thereto. A plurality of HRTFs adjacent to a specific position may be interpolated to obtain HRTF corresponding to the specific position.
In this case, as described above, when the audio signal processing device interpolates a plurality of HRTFs in a frequency domain, the amount of computation for Fourier transform and inverse Fourier transform processed in the audio signal processing device may increase. Accordingly, an audio signal processing device according to an embodiment of the present disclosure may modify the phase response of each of a plurality of original HRTFs included in an original set of HRTFs. In addition, the audio signal processing device may generate an extended set of HRIR by interpolating, in the time domain, a plurality of HRTFs whose phase response is modified. Thus, the audio signal processing device may reduce the amount of unnecessary calculation. Hereinafter, a method for increasing the spatial resolution of an audio signal by the audio signal processing device will be described in detail with reference to
For example, the same phase response may be a zero-phase response. In the case of zero-phase response, the computational process required to binaural render based on HRTF may be facilitated. If the HRTF is a zero-phase response, the HRIR in the time domain may have a peak value at time ‘0’. Thus, an audio signal processing device according to an embodiment of the present disclosure may perform interpolation for a plurality of HRIRs in the time domain to reduce the amount of computation for generating an output audio signal. At the same time, the audio signal processing device may reduce the timbre distortion due to the comb-filtering described above.
According to an embodiment, the audio signal processing device may obtain a set of HRTFs in the form of HRIR, which is a response in the time domain. In this case, in step S1101, the audio signal processing device may convert the original HRIR included in the obtained set of HRTFs to a response in the frequency domain. For example, an audio signal processing device may perform FFT on an original HRIR to obtain an original HRTF in the frequency domain. Further, the audio signal processing device may perform the above-described phase response initialization on the original HRTF transformed into the response in the frequency domain to obtain the HRTF of which the phase response is initialized.
In step S1104, the audio signal processing device may convert the HRTFs whose phase responses have been initialized to a response in the time domain, to obtain the HRIRs whose phase responses is initialized. The audio signal processing device may perform the IFFT on the HRTFs whose phase response is initialized to obtain the HRIRs whose phase response is initialized. In step S1106, the audio signal processing device may generate HRIR's corresponding to positions other than the positions corresponding to the original HRTFs by interpolating at least two HRIRs of which phase responses of each HRIR is initialized, in the time domain. This is because the temporal positions of the peak values of the plurality of HRIRs corresponding to each of the plurality of HRTFs whose phase response is initialized coincide with each other, as described above. In this case, the audio signal processing device may generate the number (N-M) of HRIR's to be added based on the position of the HRTF to be added. Hereinafter, the set of HRIRs including the HRIRs whose phase response is initialized and the additionally generated HRIR's are referred to as a first set of HRIRs.
In step S1108, the audio signal processing device may apply the group-delay to each of the plurality of a first HRIRs included in the first set of HRIRs to generate an extended set of HRIRs. If the peak value of a HRIR is located at the time ‘0’ (i.e., a phase response of a HRTF is a zero-phase response), the audio signal processing device may apply the set group-delay to each of the plurality of the first HRIRs, obtained in the step S1106, without additional editing. The audio signal processing device may obtain the group-delay applied to each of the plurality of the first HRIRs based on the method for obtaining the group-delay for each ipsilateral and contralateral, described with reference to FIGS.
For example, the audio signal processing may time delay each of the plurality of ipsilateral HRIRs included in the first set of HRIRs based on an ipsilateral group-delay which is the same value regardless of a position of a sound source. In this case, the ipsilateral group-delay may be a value set based on the frame size. Further, the audio signal processing device may determine a contralateral group-delay applied to a plurality of contralateral HRIRs included in the first set of HRIRs based on the ITD described above. In this case, the contralateral group-delay may be the ITD time according to the position of the virtual sound source corresponding to the input audio signal with respect to the listener added to the ipsilateral group-delay. Accordingly, the audio signal processing device may generate the extended set of HRTFs that includes a greater number of HRTFs than the original set of HRTFs based on the original set of HRTFs. Further, the audio signal processing device may increase a spatial resolution of the audio signal in the virtual three-dimensional space around the listener efficiently in terms of the amount of computation and the timbre distortion. The audio signal processing device may increase the spatial resolution of the audio signal to enhance a sound image localization performance.
Meanwhile, in
Hereinafter, a method for an audio signal processing device according to an embodiment of the present disclosure to generate a final output audio signal based on a plurality of HRTF sets will be described. In this way, the audio signal processing device may correct errors in size response and phase response of the HRTF obtained by measurement.
According to an embodiment, the audio signal processing device may obtain a second set of HRTFs different from a first set of HRTFs. In this case, the first set of HRTFs may include a plurality of HRTFs that phase responses of the plurality of HRTFs are modified as the process of
Next, the audio signal processing device may set the phase response of each of a plurality of second HRTF pairs included in the second set of HRTFs to the phase response of each of a plurality of first HRTF pairs included in the first set HRTFs based on a phase information. For example, the audio signal processing device may match the phase response of each of the second HRTF pairs with the phase response of the first HRTF pairs for each position. The audio signal processing device may match the plurality of first HRTF pairs and the plurality of second HRTF pairs based on a position corresponding to each of the first and second HRTF pairs. For example, a first HRTF pair corresponding to a first position among the plurality of first HRTF pairs, and a second HRTF pair corresponding to the first position among the plurality of second HRTF pairs may be matched with each other. The audio signal processing device may set the phase response of each of the plurality of second HRTF pairs to the phase response of each of the plurality of the matched first HRTF pairs based on the phase information. Here, the phase information may be phase responses information of each of the first HRTF pairs for each position, stored in the audio signal processing device or an external device. The phase information may be stored as a look-up table form.
The first HRTF pair may include a first ipsilateral HRTF and a first ipsilateral HRTF. The second HRTF pair may also include a second ipsilateral HRTF and a second ipsilateral HRTF. Further, the first HRTF pair and the second HRTF pair may be HRTF pairs corresponding to the first position, respectively. For example, the audio signal processing device may match the phase responses of the first ipsilateral HRTF and the second ipsilateral HRTF. Further, the audio signal processing device may match the phase responses of the first contralateral HRTF and the second contralateral HRTF. The audio signal processing device may set the phase response of each of the second HRTF pair to the phase response of each of the first HRTF pair to generate a second HRTF′ pair having a matched phase response.
Next, the audio signal processing device may binaural render the input audio signal based on any one of the plurality of first HRTF pairs to generate a first output audio signal (Render 1 in
Meanwhile, in addition to a method of synthesizing audio signals generated through individual rendering, a plurality of HRTFs may be linearly combined to generate a combined HRTF. In this case, the amount of calculation required for rendering may be reduced as compared with a method of synthesizing audio signals.
According to an embodiment, the audio signal processing device may linearly combine the first HRTF pair and the second HRTF′ pair which phase responses are matched as described above, to generate a combined HRTF. Here, the linear combination may mean either a median or a mean. For example, the audio signal processing device may obtain a combined ipsilateral (contralateral) HRTF by calculating based on the magnitude responses of the first ipsilateral (contralateral) HRTF and the second ipsilateral (contralateral) HRTF′, for each frequency bin. Since phase responses of the first HRTF pair and the second HRTF′ pair are matched, a separate linear combination operation is not required. Next, the audio signal processing device may binaural render the input audio signal based on the combined HRTF to generate the final output audio signal in the frequency domain. In addition, the audio signal processing device may perform IFFT on the final output audio signal in the frequency domain to generate the final output audio signal in the time domain.
Meanwhile, the audio signal processing device may synthesize a binaural-rendered audio signal with an additional signal to enhance the expressiveness of the binaural-rendered audio signal. In addition, the audio signal processing device may binaural render an audio signal based on a filter obtained by combining HRTF with an additional filter for enhancing the expressiveness of an output audio signal. In the present disclosure, the additional signal may be an audio signal generated based on the additional filter. For example, the audio signal processing device may use one or more filters in addition to the HRTF according to the position of the virtual sound source corresponding to the object audio signal to generate an output audio signal. In this case, if a phase response of the additional filter and the HRTF do not match, the sound quality may be deteriorated due to the comb-filtering effect.
In this case, the audio signal processing device may obtain the first filter whose phase response has been modified in the method described above with reference to
Further, each of the plurality of additional filters may be a filter for different effects. For example, the plurality of additional filters may comprise a plurality of HRTFs (HRIRs) obtained in different ways as described above with reference to
In
output_L,R=g_H·s(n)*h_L,R(n)+g_I·s(n)·p_L,R, [Equation 9]
In Equation 9, g_H may be a mixing gain of the first output audio signals HRIR_L and HRIR_R. Also, g_I may be a mixing gain of the second output audio signal p_L, p_r. p_L,R denote the left or right channel panning gain, and h_L,R denote the left or right HRIR. n is an integer greater than 0 and less than the total number of samples, and s (n) represents the input audio signal at the nth sample. In addition, * denotes a convolution. In this case, the audio signal processing device may filter the input audio signal by a fast convolution method through a Fourier transform and an inverse Fourier transform.
According to an embodiment, the audio signal processing device may generate an energy compensated flat response for the ipsilateral and the contralateral gain. The energy level of the output audio signal may be excessively deformed with respect to the energy level of the input audio signal in accordance with the energy level change of the flat response. For example, the audio signal processing device may generate a panning gain based on a magnitude response of the ipsilateral and contralateral HRTFs corresponding to the virtual sound source of the input audio signal. The audio signal processing device may calculate the panning gains p_L and p_R corresponding to the left and right sides, respectively, as shown in Equation 10. For example, the audio signal processing device may determine the panning gains g1 and g2 by using a linear panning method or a constant power panning method. In Equation 10, the audio signal processing device may set the sum of the panning gains corresponding to each of the ears to be 1, to maintain an auditory energy of the input audio signal. In Equation 10, H_meanL represents the mean of the magnitude responses of the left HRTFs for each frequency bin, and H_meanR represents the mean of the magnitude responses of the right HRTFs for each frequency bin. In this case, a represents an azimuth index in IPC (Interaural Polar Coordinate), and k represents an index of a frequency bin.
p_L+p_R=1,
p_L=H_meanL(a)/(H_meanL(a)+H_meanR(a)),
p_R=H_meanR(a)/(H_meanL(a)+H_meanR(a)), [Equation 10]
where H_meanL(a)=mean(abs(H_L(k))), and H_meanR(a)=mean(abs(H_R(k))).
OUT_hat=IFFT[g_H·mag{S(k)}·mag{H_L,R(k)}·pha{S(k)+H_L,R(k)}+g_I·mag {S(k)}·mag{P_L,R(k)}·pha{S(k)+P_L,R(k)} [Equation 11]
In Equation 11, H_L,R (k), P_L, R (k), and S (k) denote frequency responses of h_L, R (n), p_L, R (n), and s(n) in a time domain, respectively. In addition, k represents the index of the frequency bin, and mag {x} and pha {x} represent the magnitude component and the phase component of the frequency response ‘x’, respectively.
OUT_hat_lin=IFFT[g_H·mag{H_lin(k)}·mag{S(k)}·pha{H_lin(k)+S(k)}+g_I·mag{P_L,R(k)}·mag{S(k)}·pha{H_lin(k)+S(k)}] [Equation 12]
In addition, the audio signal processing device may omit at least a portion of the Fourier transform operations to reduce the amount of computation required for generating a final output audio signal.
OUT_hat_lin=IFFT[g_H·mag{H_lin(k)}·mag{S(k)}·pha{H_lin(k)+S(k)}]+g_I·p_L,R·s(n−t_cont,ipsil) [Equation 13]
Meanwhile, as described above, the additional filter may include a spatial filter for simulating the spatial characteristics of a virtual sound source corresponding to an input audio signal. In this case, the spatial characteristics may include at least one of spread, volumization, blur, or width control effects. A characteristic of a sound source which is sound localized by using HRTF is a point-like. Thereby, the user may be experienced a sound effect such that the input audio signal is heard from the position corresponding to the virtual sound source on the three-dimensional space.
However, in the realistic three-dimensional spatial sound, the geometrical characteristics of the sound may be changed according to size of a sound source corresponding to the sound and distance from the listener to the sound source. For example, a sound of a wave or a thunder may be a sound having an area characteristic rather than a sound heard from a specific point. Meanwhile, a binaural filter for reproducing effects on a sound source other than a point may be difficult to generate through measurements. In addition, in order to reproduce the effect on the sound source other than the point, it may be difficult to construct a system capacity for storing data corresponding to various sound source environments.
Accordingly, the audio signal processing device may generate a spatial filter based on the obtained HRTF. In addition, the audio signal processing device may generate an output audio signal based on the obtained HRTF and the spatial filter. Hereinafter, a method by which an audio signal processing device generates an output audio signal by using another additional filter will be described with reference to
y_L=g_H·h_L*s+g_D·d_L*s
y_R=g_H·h_R*s+g_D·d_R*s [Equation 14]
In Equation 14, ‘s’ denotes an input audio signal, and h_L and h_R denote left and right HRTF filters (first filters), respectively. Further, d_L and d_R denote left and right spatial filters (second filters), respectively. g_H and g_D denote the mixing gains applied to the first filter and the second filter, respectively. In addition, denotes a convolution. In this case, the audio signal processing device may filter the input audio signal by a fast convolution method through Fourier transform and inverse Fourier transform. Meanwhile, the method of
In addition, a deterioration in sound quality may occur due to a difference in phase response between the first filter and the second filter during the mixing process.
Next, the audio signal processing device may generate the combined filter including H_L_new and H_R_new by interpolating the first filter and the second filter. In this case, the audio signal processing device may generate H_L_new and H_R_new by applying the above-described mixing gain to the magnitude response of each of the first filter and the second filter. The audio signal processing device may adjust the strength of the effect of each filter by using the mixing gain.
Further, the audio signal processing device may perform interpolation for each of the left filter and the right filter, of each of the first filter and the second filter. The interpolation may be performed in the time domain, or may be performed in the frequency domain via the Fourier transform. Equation 15 shows a method for an audio signal processing device to generate a left combined filter based on the first left filter and the second left filter in the frequency domain. In Equation 15, mag {X (k)} denotes the magnitude component of the filter X for k-th frequency bin, and pha {X (k)} denotes the phase component of the filter X for k-th frequency bin. Also, g_H and g_D represent the mixing gains applied to the left first filter and the left second filter, respectively.
H_L_new(k)=mag{H_L_new(k)}·exp[pha{H_L_new(k)}], [Equation 15]
where mag{H_L_new(k)}=g_H·mag{H_L(k)}+g_D·mag{D_L(k)}, and pha{H_L_new(k)}=g_H·pha{H_L(k)}+g_D·pha{D_L(k).
Meanwhile, an audio signal processing device according to an embodiment of the present disclosure may generate a combined filter by interpolating only the magnitude response of each of a plurality of filters. The audio signal processing device may use the phase response of the HRTF which is the first filter, as the phase response of the combined filter. Thus, the audio signal processing device may generate a combined filter based on the mixing gain determined in real-time. The audio signal processing device may omit the operation required to interpolate the phase response, to reduce the total amount of computation required in real-time operation. Equation 16 shows a method for the audio signal processing device to interpolate only the magnitude response of a plurality of filters to generate the combined filter.
H_L_new′(k)=mag{H_L_new(k)}·exp[pha{H_L_new}] [Equation 16]
where, mag{H_L_new(k)}=g_H·mag{H_L(k)}+g_D·mag{D_L(k)} and, pha{H_L_new(k)}=pha{H_L(k)}
In Equation 16, mag {X (k)} denotes the magnitude component of the filter X for the k-th frequency bin, and pha {X (k)} denotes the phase component of the filter X for the k-th frequency bin. Also, g_H and g_D represent the mixing gains applied to the left first filter and the left second filter, respectively. Equation 17 and Equation 18 show a method for the audio signal processing device to generate the left and right output audio signals Y_L′(k), Y_R′ (k)) by using the combined filter generated through Equation 16. In Equation 17 and Equation 18, mag {X (k)} denotes the magnitude component of the filter X for the k-th frequency bin, and pha {X (k)} denotes the phase component of the filter X for the k-th frequency bin. Also, g_H and g_D represent the mixing gain applied to the first filter and the second filter, respectively.
Y_L′(k)=g_H·H_L(k)·S(k)+g_D·D_L(k)·S(k)
={g_H·H_L(k)+g_D·D_L(k)}·S(k)
=[g_H·mag{H_L(k)}·exp[pha{H_L(k)}]+g_D·mag{D_L(k)}·exp[pha{H_L(k)}]]·S(k)
=[g_H+g_D·mag{D_L(k)}·mag{H_L_inv(k)}]·H_L(k)·S(k)
=g_new_L(k)·H_L(k)·S(k) [Equation 17]
where, g_new_L(k)=g_H+g_D mag{D_L(k)}·mag{H_L_inv(k)}, and mag{H_L_inv(k)}=1/mag{H_L(k)}
Y_R′(k)=g_H·H_R(k)·S(k)+g_D·D_R(k)·S(k)
=g_new_R(k)·H_R(k)·S(k) [Equation 18]
where, g_new_R(k)=g_H+g_D·mag{D_R(k)}·mag{H_R_inv(k)}, and mag{H_R_inv(k)}=1/mag{H_R(k)}
In Equation 17 and 18, the audio signal processing device generate the left and right combined filter based on a mixing gain g_H, g_D, a magnitude response of the second filter mag_D_R (k), and an inverse magnitude response of the first filter mag{H_R_inv(k)}. In this case, the inverse magnitude response of the first filter mag{H_R_inv(k)} may be a value calculated previously in the HRTF database. The audio signal processing device may generate the combined filters g_new_L (k), g_new_R (k) by using the magnitude response of the first filter, not the inverse magnitude response of the first filter, as in intermediate results of Equation 17 and Equation 18.
Y_BES(k)=S(k)·H_lin(k)·g_H+S(k)·IP(k)·p_L,R·g_I
=(k)·mag{H_lin(k)}·pha{H_lin(k)}·g_H+S(k)·pha{H_lin(k)}·p_L,R·g_I
=S(k)·H_lin(k)·[g_H+g_I·p_L,R·mag{1/H_lin(k)}) [Equation 19]
Y_sprd(k)=S(k)·H_lin(k)·g_H+S(k)·D_lin(k)·g_D
=S(k)[H_lin(k)·g_H+mag{D_lin(k)}·pha{H_lin(k)}·g_D]
=S(k)·H_lin(k)[g_H+mag{D_lin(k)}·mag{1/H_lin(k)}·g_D] [Equation 20]
In Equation 19 and Equation 20, mag {X (k)} denotes the magnitude component of the filter X for the k-th frequency bin, and pha {X (k)} denotes the phase component of the filter X for the k-th frequency bin. Also, H_lin denotes the HRTF generated based on the linearized phase response, p_L, R denotes the left or right panning gain, and D_lin denotes the spatial filter generated based on the linearized phase response of the HRTF. Also, g_H, g_I, and g_D represent mixing gains corresponding to the HRTF, the panning filter, and the spatial filter, respectively. IP (k) represents an impulse response having the same phase as H_lin.
Equation 21 represents a final output audio signal Y_BES+Sprd(k). Here, the audio signal processing device may generate the final output audio signal by synthesizing an output audio signal Y_BES to which BES is applied, and an output audio signal Sprd (k) to which characteristics according to the distance and the size of the sound source is applied. In Equation 21, g_B is a mixing gain corresponding to the output audio signal to which the BES is applied.
Y_BES+Sprd(k)=Y_BES(k)·g_B+S(k)·D_lin(k)·g_D
=S(k)·H_lin(k)·g_B(g_H+g_I·p·mag{1/H_lin(k)})+S(k)·mag{D_lin(k)}·H_lin(k)·mag{1/H_lin(k)}·g_D
=S(k)·H_lin(k)·(g_B·g_H+g_B·g_I·p·mag{1/H_lin(k)}+g_D·mag{D_lin(k)}·mag{1/H_lin(k)})
=S(k)·H_lin(k)·(g_B·g_H+(g_B·g_I·p+g_D·mag{D_lin(k)}·mag{1/H_lin(k)}) [Equation 21]
Referring to
Meanwhile, according to an embodiment of the present disclosure, the input audio signal may be simulated through a plurality of virtual sound sources. For example, the input audio signal may include at least one of a plurality of channel signals or an ambisonics signal. In this case, the audio signal processing device may simulate the input audio signal through a plurality of virtual sound sources. For example, the audio signal processing device may binaural render an audio signal assigned to each virtual sound source based on a plurality of HRTFs corresponding to each of a plurality of virtual sound sources, thereby generating an output audio signal. In this case, the audio signals assigned to respective virtual sound sources may be highly correlated. In addition, the phase responses of a plurality of HRTFs corresponding to respective virtual sound sources may be different from each other. As a result, the sound quality degradation due to the above-described comb-filtering effect may occur in the output audio signal. The device for processing an audio signal according to an embodiment of the present disclosure may match the phase response of each of a plurality of HRTFs corresponding to each virtual sound source. Accordingly, the audio signal processing device may mitigate the deterioration in sound quality caused by binaural rendering of the plurality of channel signals or the ambisonics signal correlated highly.
Specifically, the audio signal processing device may generate an output audio signal by using a plurality of different HRTF pairs corresponding to each of the plurality of virtual sound sources. In this embodiment, the virtual sound source may be a channel corresponding to the channel signal or a virtual channel for rendering the ambisonics signal. Further, the audio signal processing device may convert the ambisonics signal into virtual channel signals corresponding to each of a plurality of virtual sound sources arranged with respect to the head direction of the listener. In this case, the plurality of virtual sound sources may be arranged according to a sound source layout. For example, the source layout may be a virtual cube whose entire vertex is located on a unit sphere centered at the listener. In this case, the plurality of virtual sound sources may be located at the vertices of the virtual cube, respectively.
Hereinafter, for convenience of explanation, the positions of the plurality of virtual sound sources are referred to as FLU (front-left-up), FRU (front-right-up), FLD (front-Down, Rear-Left-Up, Rear-Right-Up, Rear-Left-Down, and Rear-Right-Down. In the related description of the present disclosure, the case where the sound source layout is the vertex of the cube is described as an example, but the present disclosure is not limited thereto. For example, the sound source layout may be in a form of an octahedral vertex.
The audio signal processing device may obtain a plurality of different HRTF pairs corresponding to each of the plurality of virtual sound sources. Further, the audio signal processing device may analyze each of the plurality of HRTFs in a magnitude response and a phase response. Next, the audio signal processing device may modify the phase response of each of the plurality of HRTFs in the method described above with reference to
Further, the audio signal processing device may modify the phase response of each of the plurality of contralateral HRTFs. For example, a first HRTF pair corresponding to a first virtual sound source included in a plurality of virtual sound sources may include a first ipsilateral HRTF and a first major HRTF. In this case, the audio signal processing device may obtain a phase response of a first contralateral HRTF′ in which difference of the phase response between the first ipsilateral HRTF and the first contralateral HRTF is maintained, with respect to the phase response of a first ipsilateral HRTF′. Next, the audio signal processing device may generate a two-channel output audio signal by rendering the virtual channel signal corresponding to each of the plurality of virtual sound sources based on the plurality of pairs of HRTF′ corresponding to positions of the plurality of virtual sound sources.
According to an embodiment of the present disclosure, an audio signal processing device may generate a left phase response and a right phase response based on the sound source layout. As described above, when the sound source layout is the vertex of the virtual cube, the distance from each of the four left vertices with respect to the listener to the left ear of the listener is the same. In addition, the distance from any one of the left vertices to the left ear of the listener is the same as the distance from any one of the four right vertices to the right ear of the listener. If the distance from the source to the left or right ear of the listener is the same, the group-delay applied to the audio signal may be the same. That is, when the sound source layout is left-right symmetric with respect to the listener, the audio signal processing device may generate the HRTF having common phase response for each of the left side and the right side with respect to the listener.
Hereinafter, for convenience of explanation, the four HRTF pairs corresponding to the vertex located on the left side with respect to the listener are referred to as the left group. Also, four HRTF pairs corresponding to the vertex located on the right side of the listener are referred to as the right group. The left group may include HRTF pairs corresponding to the FLU, FLD, RLU, and RLD positions, respectively. Also, the right group may include HRTF pairs corresponding to FRU, FRD, RRU, and RRD positions, respectively.
The audio signal processing device may determine phase responses of the right group and the left group, based on the phase response of each of the plurality of ipsilateral HRTFs included in each of the right group and the left group. In this case, the ipsilateral of the left group represents the left ear of the listener, and the ipsilateral of the right group represents the right ear of the listener. The audio signal processing device may use any one of mean, median value, or mode value of the phase responses of a plurality of left HRTFs included in the left group, as the left group phase response. Further, the audio signal processing device may use any one of mean, median value, or mode value of the phase responses of a plurality of right HRTFs included in the right group, as the right group phase response. In addition, the audio signal processing device may linearize the determined group phase responses.
In addition, the audio signal processing device may generate the ipsilateral HRTF's by modifying the phase response of each of the ipsilateral HRTFs included in each group based on the group phase response obtained for each group. An embodiment described based on ipsilateral HRTFs may be applied in a same or corresponding manner to the contralateral HRTFs. According to another embodiment, the audio signal processing device may select any of the phase responses of each of the four HRTFs included in the left group as the left group phase response. Further, the audio signal processing device may select any one of the phase responses of the four HRTFs included in the right group as the right group phase response. Accordingly, the audio signal processing device may reduce the distortion of timbre while maintaining the image-localization performance in the binaural rendering of the ambisonics signal and the channel signals.
In the present embodiment, the operation of the audio signal processing device is described using the first order ambisonics (FoA) as an example, but the present disclosure is not limited thereto. For example, the above-described method may be applied to a high order ambisonics (HoA) signal including a plurality of sound sources in the same or corresponding manner. This is because the ambisonics signal may be simulated with a linear sum of the spherical harmonics corresponding to each degree even if the ambisonics signal is a higher order ambisonics signal. Also, in case of a channel signal, the above-described method may be applied in the same or corresponding method.
Some embodiments may also be implemented in the form of a recording medium including instructions executable by a computer, such as program modules, being executed by a computer. A computer readable medium can be any available medium that can be accessed by a computer, and can include both volatile and nonvolatile medium, removable and non-removable medium. The computer-readable medium may also include computer storage medium. The computer storage medium may include both volatile and nonvolatile, removable and non-removable medium implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
Although the present disclosure has been described using the specific embodiments, those skilled in the art could make changes and modifications without departing from the spirit and the scope of the present disclosure. That is, although the embodiments of binaural rendering for audio signals have been described, the present disclosure can be equally applied and extended to various multimedia signals including not only audio signals but also video signals. Therefore, any derivatives that could be easily inferred by those skilled in the art from the detailed description and the embodiments of the present disclosure should be construed as falling within the scope of right of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
10-2017-0176720 | Dec 2017 | KR | national |
10-2018-0050407 | May 2018 | KR | national |