The foregoing summary of the invention, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the accompanying drawings, which are included by way of example, and not by way of limitation with regard to the claimed invention.
In the following description of various illustrative embodiments, reference is made to the accompanying drawings, which form a part hereof, and in which is shown, by way of illustration, various embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural and functional modifications may be made without departing from the scope of the present invention.
Aspects of the present invention describe an artificial bandwidth expansion method for binaural speech signals (B-ABE). A binaural speech signal is a two-channel signal, having left and right channels, which may contain speech of one talker or several simultaneous talkers. A binaural speech signal may be produced from monophonic speech signals, for example, by head related transfer function (HRTF) processing and mixing a plurality of these signals in a conference bridge of a centralized 3D audio conferencing system. Alternatively, a binaural signal may be generated by making a recording with an artificial head, e.g., a mechanical model of a human head, and possibly torso, which has microphones in the ear canals. A KEMAR mannequin (Knowles Electronics Mannequin for Acoustic Research) is one example of a commercial artificial head. In another embodiment, a user wears a binaural headset, which includes microphones mounted in the earpieces. The binaural signal is encoded and transmitted to a receiving terminal. If narrowband coding is used, the receiving terminal may apply artificial bandwidth extension for speech intelligibility enhancement and 3D audio representation improvement.
Artificial bandwidth expansion algorithms typically double the sampling frequency of a signal from, e.g., 8 kHz to 16 kHz and add new spectral components to the high band, i.e., from 4 kHz to 8 kHz. This conversion from narrowband to wideband may be entirely artificial, in which case no extra information is transmitted, or it may rely on some transmitted side information concerning the missing frequency components. Compared to narrowband speech, artificial wideband speech has better quality and is more intelligible. An artificial bandwidth expansion method for binaural signals (B-ABE) may be used within a system in which two separately coded channels are transmitted from a conference bridge to a user terminal. In addition, aspects of the present invention are directed to other multichannel signals, such as signals of three channels, and may be applied to stereo speech codecs. Aspects of the present invention may also be utilized for bandwidth expansion towards low frequencies. New spectral components may be added to a low band, e.g., 100-300 Hz, if the bandwidth of an input signal is, e.g., 300-3400 Hz.
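By way of illustration only, the doubling of the sampling frequency, e.g., from 8 kHz to 16 kHz, may be sketched as zero insertion followed by low-pass interpolation. The function name upsample_by_two, the Hamming-windowed half-band filter, and the tap count are assumptions made for this sketch and are not taken from the description. The sketch only produces the narrowband content at the higher rate; it does not create the new high band.

```python
import math

def upsample_by_two(x, num_taps=31):
    """Double the sampling rate of x (e.g., 8 kHz -> 16 kHz).

    Zeros are inserted between samples, then a windowed-sinc low-pass
    filter with cut-off at the old Nyquist frequency removes the image.
    num_taps must be odd (illustrative default, an assumption).
    """
    # Zero insertion: x[0], 0, x[1], 0, ...
    stuffed = []
    for v in x:
        stuffed.extend((v, 0.0))

    # Half-band low-pass: normalized cut-off 0.25 cycles/sample at the new rate.
    mid = (num_taps - 1) // 2
    taps = []
    for k in range(num_taps):
        m = k - mid
        ideal = 0.5 if m == 0 else math.sin(0.5 * math.pi * m) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(2.0 * ideal * window)  # gain of 2 compensates the inserted zeros

    # Centred filtering, so the output stays time-aligned with the input.
    out = []
    for n in range(len(stuffed)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = n + mid - k
            if 0 <= j < len(stuffed):
                acc += t * stuffed[j]
        out.append(acc)
    return out

narrowband = [0.0, 1.0, 0.5, -0.5, 0.25]
wideband_rate = upsample_by_two(narrowband)  # twice as many samples
```

Because the filter is a half-band design, the original samples reappear essentially unchanged at the even output positions and only the odd positions are interpolated.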
As described herein, aspects of the present invention apply ABE for binaural, i.e., stereo, speech signals, monaural signals, amplitude panned signals, delay panned signals, and dichotic speech signals. Aspects of the present invention improve quality and intelligibility of narrowband binaural speech, while implementation may be inexpensive from a computational point of view compared to true wideband binaural speech, because all the other speech enhancement algorithms may operate in narrowband mode before the expansion. In addition, aspects of the present invention work with all ABE algorithms designed for monophonic speech.
Specifically with respect to 3D teleconferencing, aspects of the present invention improve speech intelligibility due to a wider speech bandwidth. A wider speech bandwidth improves localization accuracy, which makes it possible to use more spatial positions for sound sources, e.g., positions behind the listener or elevated positions, which in turn improves performance of the 3D teleconference system. When stereo hands-free speakers are used, only a narrowband stereo echo cancellation algorithm is required, whereas wideband echo cancellation is required with wideband codecs. Aspects of the present invention may be implemented in a terminal device or in a gateway that connects wideband and narrowband terminal devices. 3D representation and room effect may attenuate some artefacts generated in the bandwidth extension processing.
A conventional monophonic artificial bandwidth expansion (ABE) component 403 performs artificial expansion for one of the channels. Those skilled in the art will appreciate the manner in which conventional ABE may be performed. The output signal from the ABE component 403 is inputted to a high-pass filter component 405 configured to output a high band signal. The outputted high band signal is inputted into delay and energy adjustment components 407 and 409, one corresponding to each channel.
Delay and energy adjustment components 407 and 409 are configured to modify, separately for the respective right or left channel, the inputted high band signal. The modification to the high band signal is based upon the estimated delay and energy differences from ITD and ILD estimation component 401. The difference estimates are shown as inputs to the delay and energy adjustment components 407 and 409 by signal 415 shown in broken line form. Finally, the original narrowband signals are up-sampled by up-sampling components 411 and 413, the modified high bands are added to them, and a wideband binaural output signal with a doubled sampling rate, such as fs=16 kHz, is outputted. Aspects of the present invention may be implemented for additional channels and the description of two is merely illustrative. As such, aspects of the present invention may be implemented for multichannel speech signals in excess of two channels.
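By way of illustration only, the delay and energy adjustment applied to the high band signal for one channel may be sketched as follows. The function names shift, energy_gain, and adjust_high_band, the integer-sample delay, and the square-root energy ratio are assumptions made for this sketch. If the delay is estimated from the narrowband signals, e.g., at 8 kHz, and applied to a high band signal at 16 kHz, the delay in samples would need to be doubled.

```python
import math

def shift(x, delay):
    """Delay x by `delay` samples (negative values advance it).

    The output has the same length as x and is zero-padded.
    """
    n = len(x)
    if delay >= 0:
        return ([0.0] * delay + list(x))[:n]
    return (list(x[-delay:]) + [0.0] * (-delay))[:n]

def energy_gain(reference_nb, other_nb, eps=1e-12):
    """Amplitude gain mapping the reference channel level to the other channel level.

    Computed from the narrowband frame energies; eps avoids division by zero.
    """
    e_ref = sum(v * v for v in reference_nb)
    e_other = sum(v * v for v in other_nb)
    return math.sqrt((e_other + eps) / (e_ref + eps))

def adjust_high_band(high_band, delay, gain):
    """Apply the estimated delay and level difference to the high band signal."""
    return [gain * v for v in shift(high_band, delay)]

# Example: the other channel is 6 dB quieter and lags by one sample.
high = [1.0, 2.0, 3.0, 4.0]
adjusted = adjust_high_band(high, delay=1, gain=0.5)
```

A practical implementation would likely smooth the delay and gain between frames to avoid audible discontinuities; that is omitted here.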
During simultaneous speech, speakers may be positioned on opposite sides of the listener. In such a situation, the delayed speech signal of one speaker is in the left channel, whereas that of the other speaker is in the right channel. The delay estimation is still calculated the same way as in a single speaker case, and for each frame, the delay of the dominant speaker is obtained and the frames are processed accordingly.
Two illustrative manners exist for determining which one of the channels serves as the input for the monophonic ABE algorithm component 403. In one embodiment, the same channel may be used all the time. In a second embodiment, the channel that has more energy at the moment may be used. This second embodiment has an advantage in that the ABE processed channel does not need further energy or phase adjustments, thus saving computational resources. For the other channel, the delay and the energy are modified to correspond to the original estimates. The energy difference may be used as an indicator since, in a binaural signal, the polarity of the interaural time difference (ITD) is correlated with the corresponding interaural level difference (ILD) for a single sound source. As such, the signal in the contra-lateral, i.e., farther ear, channel is a delayed and low-pass filtered version of the corresponding signal in the ipsi-lateral, i.e., nearer ear, channel. In accordance with another embodiment, it should be understood that interaural time difference (ITD) estimation also may be made for frequency bands of a signal. A signal may be split into various frequency bands and an ITD component may estimate the delay between the corresponding bands. Then a combined ITD estimate may be made from these band-related estimates.
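By way of illustration only, the second embodiment, in which the channel having more energy at the moment is fed to the ABE component, may be sketched as follows. The function names frame_energy and select_abe_input and the tie-breaking rule are assumptions made for this sketch.

```python
def frame_energy(frame):
    """Sum of squared samples of one analysis frame."""
    return sum(v * v for v in frame)

def select_abe_input(left_frame, right_frame):
    """Name the channel to feed to the monophonic ABE for this frame.

    The channel with more energy is chosen; ties go to the left
    channel (an arbitrary choice made for this sketch).
    """
    if frame_energy(left_frame) >= frame_energy(right_frame):
        return "left"
    return "right"

# Example: a source on the listener's right side is louder in the right channel.
left = [0.1, -0.2, 0.1]
right = [0.4, -0.8, 0.4]
chosen = select_abe_input(left, right)  # "right"
```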
The high-pass filter component 405, used to extract the created high band for further modification, is configured to have a cut-off frequency of 4 kHz. If the expansion starts from, for example, 3.4 kHz, where a traditional telephone band ends, the cut-off frequency would be correspondingly lower.
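By way of illustration only, a high-pass filter with a cut-off frequency of 4 kHz at a 16 kHz sampling rate may be sketched as a windowed-sinc FIR filter obtained by spectral inversion of a low-pass design. The filter type, the Hamming window, the tap count, and the function names highpass_fir and fir_filter are assumptions made for this sketch; the description does not specify how component 405 is realized.

```python
import math

def highpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Design linear-phase high-pass FIR taps (num_taps must be odd)."""
    fc = cutoff_hz / fs_hz            # normalized cut-off, cycles per sample
    mid = (num_taps - 1) // 2
    lowpass = []
    for k in range(num_taps):
        m = k - mid
        ideal = 2 * fc if m == 0 else math.sin(2 * math.pi * fc * m) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        lowpass.append(ideal * window)
    dc_gain = sum(lowpass)
    lowpass = [c / dc_gain for c in lowpass]   # unity gain at 0 Hz
    # Spectral inversion: a delayed unit impulse minus the low-pass response.
    highpass = [-c for c in lowpass]
    highpass[mid] += 1.0
    return highpass

def fir_filter(taps, x):
    """Causal FIR filtering of x; the output has the same length as x."""
    return [
        sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
        for n in range(len(x))
    ]

fs = 16000
taps = highpass_fir(4000, fs)
# A 6 kHz tone lies in the high band and passes; a 1 kHz tone is suppressed.
tone_6k = [math.sin(2 * math.pi * 6000 * n / fs) for n in range(400)]
high_band = fir_filter(taps, tone_6k)
```

For an expansion starting from 3.4 kHz, the same sketch would be called with a cut-off of 3400 instead of 4000. The filter introduces a delay of (num_taps - 1) / 2 samples, which would need to be taken into account when aligning the high band with the up-sampled narrowband signal.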
With respect to the ITD and ILD estimation component 401, one illustrative manner to estimate the delay between the channels of a binaural signal includes using an average magnitude difference function, such as, in its commonly used form,
$$d(i) = \frac{1}{N} \sum_{n=0}^{N-1} \left| x_l(n) - x_r(n+i) \right|$$
where xl is the left channel, xr is the right channel, N is the analysis frame length, and i is the delay. The delay i that minimizes the average magnitude difference function, d(i), serves as an estimate of the time difference between the two signals, xl and xr. If the artificially created high band of one channel is copied to the other channel, it has to be delayed or advanced by the same amount as the time difference between the original signals. Another illustrative manner is correlation based. A correlation based method may be, for example, cross correlation, which is a generally known metric.
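By way of illustration only, the delay search may be sketched as follows. The sketch assumes that the average magnitude difference is the mean of |xl(n) - xr(n+i)| over the analysis frame and that the estimated delay is the candidate i giving the smallest value. The function names amdf and estimate_delay, the search range max_lag, and the handling of samples falling outside the signal are assumptions made for this sketch.

```python
def amdf(xl, xr, lag, frame_len):
    """Average magnitude difference between xl[n] and xr[n + lag].

    Only index pairs that fall inside both signals are averaged.
    """
    total = 0.0
    count = 0
    for n in range(min(frame_len, len(xl))):
        m = n + lag
        if 0 <= m < len(xr):
            total += abs(xl[n] - xr[m])
            count += 1
    return total / count if count else float("inf")

def estimate_delay(xl, xr, max_lag, frame_len):
    """Return the lag in [-max_lag, max_lag] that minimizes the AMDF.

    A positive result means the right channel lags the left channel.
    """
    candidates = range(-max_lag, max_lag + 1)
    return min(candidates, key=lambda i: amdf(xl, xr, i, frame_len))

# Example: the right channel is the left channel delayed by 3 samples.
left = [0.0, 0.3, -0.7, 1.0, 0.2, -0.4, 0.9, -0.1, 0.5, -0.8, 0.6, 0.1]
right = [0.0, 0.0, 0.0] + left[:-3]
lag = estimate_delay(left, right, max_lag=4, frame_len=len(left))  # lag == 3
```

At an 8 kHz sampling rate, interaural delays of well under one millisecond correspond to only a few samples, so a small max_lag keeps the search inexpensive.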
Another illustrative method is to include envelope matching metrics. Wong, Peter H. W. and Au, Oscar C.; “Fast SOLA-Based Time Scale Modification Using Envelope Matching”; Journal of VLSI Signal Processing Systems, Vol 35, Issue 1; August 2003, describes an example of where envelope matching is used for time scale modification.
In one embodiment, artificial bandwidth expansion (ABE) may be performed individually for both of the channels. However, in order to preserve the delay and level differences, some control between the expansions is needed. In one embodiment, such a control may be implemented through frame classification, because voiced speech frames, fricatives, and plosives are processed differently.
In another embodiment of the present invention, the incoming binaural signal may be analyzed to discriminate between cases in which only one speaker is talking and cases in which several speakers are talking simultaneously. Depending on the particular case, processing may be controlled differently. For example, when only one speaker is active, the processing may be performed according to one embodiment, and during simultaneous speech, bandwidth extension processing may be disabled or run individually for the channels.
One use of aspects of the present invention may be within a terminal device, such as terminal device 351. In a first embodiment, optional artificial room effect signal processing may be performed in a terminal device after the binaural artificial bandwidth expansion (B-ABE) processing. The room effect processing may take a monophonic input signal and may produce a binaural output. The monophonic downmix for the room effect may be made by mixing the input signals of different channels taken from the binaural input, before the ABE component 403 or after the ABE component 403. If the signal is taken after the ABE component, the downmix is a bandwidth expanded signal. The room effect may be processed in parallel with the binaural input signal illustrated in
The purpose of room effect processing in teleconferencing is to make the environment sound more natural and satisfactory to a listener. In addition, room effect improves externalization of sound sources in headphone listening. This means that a listener perceives sound sources to be located outside her head rather than inside it, the latter being typical in headphone listening. With respect to this first embodiment, a conference bridge, such as conference bridge 301, is configured to produce a combined narrowband binaural signal. A conference bridge performs head related transfer function (HRTF) processing, binaural mixing, and narrowband (NB) encoding. A terminal device, operatively connected to the conference bridge, is configured to perform NB decoding, binaural artificial bandwidth expansion (B-ABE) processing, room effect signal processing, and playback.
In a second embodiment, the artificial room effect may be generated and added to the binaural signal by a conference bridge. With respect to this second embodiment, a conference bridge, such as conference bridge 301, is configured to produce a combined narrowband binaural signal including an artificial room effect signal. A conference bridge performs head related transfer function (HRTF) processing, binaural mixing, room effect signal processing, and narrowband (NB) encoding. A terminal device, operatively connected to the conference bridge, is configured to perform NB decoding, binaural artificial bandwidth expansion (B-ABE) processing, and playback.
In a third embodiment, one or more aspects of the present invention may be performed by a gateway configured to receive a narrowband binaural signal and output a wideband binaural signal for a terminal device. With respect to this third embodiment, a gateway performs narrowband (NB) decoding, B-ABE processing, and wideband (WB) encoding. A terminal device, operatively connected to the gateway, is configured to perform WB decoding and playback.
In a fourth embodiment, one or more aspects of the present invention may be implemented in a conference bridge capable of processing wideband signals. In accordance with aspects of the present invention, the conference bridge generates a wideband binaural signal from a narrowband binaural input signal before mixing the wideband binaural signal with several other binaural signals. Such a configuration would be beneficial if a narrowband binaural recording is received from certain participating sites. With respect to this fourth embodiment, a conference bridge, such as conference bridge 301, is configured to perform B-ABE processing on narrowband binaural inputs before making a wideband mix. A conference bridge performs B-ABE processing, binaural mixing, and wideband (WB) encoding. A terminal device, operatively connected to the conference bridge, is configured to perform WB decoding and playback.
It should be understood by those skilled in the art that aspects of the present invention may be applied to telepresence applications, i.e., applications in which a participant is placed within a virtual environment, controlling devices to make the conference environment appear more realistic to the participant. In such a telepresence application, binaural recordings are used for teleconferencing and the remote session is recorded with a binaural microphone.
It should be further understood by those skilled in the art that the example of a high frequency bandwidth expansion described in
Proceeding to step 505, the delay and energy level differences between the left and right channels of the narrowband binaural speech signal are estimated. As described herein, an average magnitude difference function may be utilized to perform this step 505. At step 507, for one of the left and right channels, an artificial bandwidth expansion algorithm expands the channel bandwidth. In one embodiment, the same channel may be used all the time, such as the left channel. In a second embodiment, the channel that has more energy at the moment may be used. It should be understood by those skilled in the art that in one embodiment, ABE processing may be calculated only for one channel, where the created high band signal is added to both signals after adjusting the delay and energy levels separately for each. In another embodiment, ABE processing may be calculated for both channels separately.
From step 507, the process proceeds to step 511, where the ABE processed signal is inputted to a high pass filter, such as high pass filter component 405, configured to output a high band signal. Again, it should be understood by those skilled in the art that a band pass filter may be used in place of a high pass filter in step 511. In such a case, a band limited signal may be processed as well.
From step 511, the process proceeds to step 513. Returning to step 505, a second output proceeds to step 509 where the delay and energy level difference estimates for each of the right and left channel are forwarded to first and second delay and energy level adjustment components, such as delay and energy adjustment components 407 and 409. The first delay and energy level adjustment component is configured to adjust one of the two channel signals and the second delay and energy level adjustment component is configured to adjust the other.
The delay and energy level difference estimate data from step 509 and the high band signal outputted from step 511 are inputted to step 513. At step 513, the high band signal is modified by the first and second delay and energy level adjustment components based upon the delay and energy level estimate data. From step 513, the process proceeds to step 517. Returning to step 501, the original narrowband binaural speech signal is up-sampled at step 515 to increase the sampling rate of each of the two channels. The output from step 515 and the modified high band signal from step 513 proceed to step 517, where the two are added together. The output of step 517 is a wideband binaural speech signal with a doubled sampling rate, such as fs=16 kHz.
While illustrative systems and methods as described herein embodying various aspects of the present invention are shown, it will be understood by those skilled in the art, that the invention is not limited to these embodiments. Modifications may be made by those skilled in the art, particularly in light of the foregoing teachings. For example, each of the elements of the aforementioned embodiments may be utilized alone or in combination or subcombination with elements of the other embodiments. It will also be appreciated and understood that modifications may be made without departing from the true spirit and scope of the present invention. The description is thus to be regarded as illustrative instead of restrictive on the present invention.