The present disclosure is generally related to estimating a temporal offset of multiple channels.
Advances in technology have resulted in smaller and more powerful computing devices. For example, there currently exist a variety of portable personal computing devices, including wireless telephones such as mobile and smart phones, tablets and laptop computers that are small, lightweight, and easily carried by users. These devices can communicate voice and data packets over wireless networks. Further, many such devices incorporate additional functionality such as a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such devices can process executable instructions, including software applications, such as a web browser application, that can be used to access the Internet. As such, these devices can include significant computing capabilities.
A computing device may include multiple microphones to receive audio signals. Generally, a sound source is closer to a first microphone than to a second microphone of the multiple microphones. Accordingly, a second audio signal received from the second microphone may be delayed relative to a first audio signal received from the first microphone. In stereo-encoding, audio signals from the microphones may be encoded to generate a mid channel and one or more side channels. The mid channel may correspond to a sum of the first audio signal and the second audio signal. A side channel may correspond to a difference between the first audio signal and the second audio signal. The first audio signal may not be temporally aligned with the second audio signal because of the delay in receiving the second audio signal relative to the first audio signal. The misalignment (or “temporal offset”) of the first audio signal relative to the second audio signal may increase a magnitude of the side channel. Because of the increase in magnitude of the side channel, a greater number of bits may be needed to encode the side channel.
Additionally, different frame types may cause the computing device to generate different temporal offsets or shift estimates. For example, the computing device may determine that a voiced frame of the first audio signal is offset from a corresponding voiced frame in the second audio signal by a particular amount. However, due to a relatively high amount of noise, the computing device may determine that a transition frame (or unvoiced frame) of the first audio signal is offset from a corresponding transition frame (or corresponding unvoiced frame) of the second audio signal by a different amount. Variations in the shift estimates may cause sample repetition and sample-skipping artifacts at frame boundaries. Additionally, variations in shift estimates may result in higher side channel energies, which may reduce coding efficiency.
According to one implementation of the techniques disclosed herein, a method of estimating a temporal offset between audio captured at multiple microphones includes capturing a reference channel at a first microphone and capturing a target channel at a second microphone. The reference channel includes a reference frame, and the target channel includes a target frame. The method also includes estimating a delay between the reference frame and the target frame. The method further includes estimating a temporal offset between the reference channel and the target channel based on a cross-correlation value of comparison values.
According to another implementation of the techniques disclosed herein, an apparatus for estimating a temporal offset between audio captured at multiple microphones includes a first microphone configured to capture a reference channel and a second microphone configured to capture a target channel. The reference channel includes a reference frame, and the target channel includes a target frame. The apparatus also includes a processor and a memory storing instructions that are executable to cause the processor to estimate a delay between the reference frame and the target frame. The instructions are also executable to cause the processor to estimate a temporal offset between the reference channel and the target channel based on a cross-correlation value of comparison values.
According to another implementation of the techniques disclosed herein, a non-transitory computer-readable medium includes instructions for estimating a temporal offset between audio captured at multiple microphones. The instructions, when executed by a processor, cause the processor to perform operations including estimating a delay between a reference frame and a target frame. The reference frame is included in a reference channel captured at a first microphone, and the target frame is included in a target channel captured at a second microphone. The operations also include estimating a temporal offset between the reference channel and the target channel based on a cross-correlation value of comparison values.
According to another implementation of the techniques disclosed herein, an apparatus for estimating a temporal offset between audio captured at multiple microphones includes means for capturing a reference channel and means for capturing a target channel. The reference channel includes a reference frame, and the target channel includes a target frame. The apparatus also includes means for estimating a delay between the reference frame and the target frame. The apparatus further includes means for estimating a temporal offset between the reference channel and the target channel based on a cross-correlation value of comparison values.
According to another implementation of the techniques disclosed herein, a method of non-causally shifting a channel includes estimating comparison values at an encoder. Each comparison value is indicative of an amount of temporal mismatch between a previously captured reference channel and a corresponding previously captured target channel. The method also includes smoothing the comparison values to generate short-term smoothed comparison values and first long-term smoothed comparison values. The method also includes calculating a cross-correlation value between the comparison values and the short-term smoothed comparison values. The method also includes comparing the cross-correlation value with a threshold, and adjusting the first long-term smoothed comparison values to generate second long-term smoothed comparison values, in response to a determination that the cross-correlation value exceeds the threshold. The method further includes estimating a tentative shift value based on the smoothed comparison values. The method also includes non-causally shifting a target channel by a non-causal shift value to generate an adjusted target channel that is temporally aligned with a reference channel. The non-causal shift value is based on the tentative shift value. The method further includes generating, based on the reference channel and the adjusted target channel, at least one of a mid-band channel or a side-band channel.
According to another implementation of the techniques disclosed herein, an apparatus for non-causally shifting a channel includes a first microphone configured to capture a reference channel and a second microphone configured to capture a target channel. The apparatus also includes an encoder configured to estimate comparison values. Each comparison value is indicative of an amount of temporal mismatch between a previously captured reference channel and a corresponding previously captured target channel. The encoder is also configured to smooth the comparison values to generate short-term smoothed comparison values and first long-term smoothed comparison values. The encoder is further configured to calculate a cross-correlation value between the comparison values and the short-term smoothed comparison values. The encoder is further configured to compare the cross-correlation value with a threshold, and to adjust the first long-term smoothed comparison values to generate second long-term smoothed comparison values, in response to a determination that the cross-correlation value exceeds the threshold. The encoder is further configured to estimate a tentative shift value based on the smoothed comparison values. The encoder is also configured to non-causally shift a target channel by a non-causal shift value to generate an adjusted target channel that is temporally aligned with a reference channel. The non-causal shift value is based on the tentative shift value. The encoder is further configured to generate, based on the reference channel and the adjusted target channel, at least one of a mid-band channel or a side-band channel.
According to another implementation of the techniques disclosed herein, a non-transitory computer-readable medium includes instructions for non-causally shifting a channel. The instructions, when executed by an encoder, cause the encoder to perform operations including estimating comparison values. Each comparison value is indicative of an amount of temporal mismatch between a previously captured reference channel and a corresponding previously captured target channel. The operations also include smoothing the comparison values to generate short-term smoothed comparison values and first long-term smoothed comparison values. The operations also include calculating a cross-correlation value between the comparison values and the short-term smoothed comparison values. The operations also include comparing the cross-correlation value with a threshold, and adjusting the first long-term smoothed comparison values to generate second long-term smoothed comparison values, in response to a determination that the cross-correlation value exceeds the threshold. The operations also include estimating a tentative shift value based on the smoothed comparison values. The operations also include non-causally shifting a target channel by a non-causal shift value to generate an adjusted target channel that is temporally aligned with a reference channel. The non-causal shift value is based on the tentative shift value. The operations also include generating, based on the reference channel and the adjusted target channel, at least one of a mid-band channel or a side-band channel.
According to another implementation of the techniques disclosed herein, an apparatus for non-causally shifting a channel includes means for estimating comparison values. Each comparison value is indicative of an amount of temporal mismatch between a previously captured reference channel and a corresponding previously captured target channel. The apparatus also includes means for smoothing the comparison values to generate short-term smoothed comparison values and means for smoothing the comparison values to generate first long-term smoothed comparison values. The apparatus also includes means for calculating a cross-correlation value between the comparison values and the short-term smoothed comparison values. The apparatus also includes means for comparing the cross-correlation value with a threshold, and means for adjusting the first long-term smoothed comparison values to generate second long-term smoothed comparison values, in response to a determination that the cross-correlation value exceeds the threshold. The apparatus also includes means for estimating a tentative shift value based on the smoothed comparison values. The apparatus also includes means for non-causally shifting a target channel by a non-causal shift value to generate an adjusted target channel that is temporally aligned with a reference channel. The non-causal shift value is based on the tentative shift value. The apparatus also includes means for generating, based on the reference channel and the adjusted target channel, at least one of a mid-band channel or a side-band channel.
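The smoothing, cross-correlation, and adjustment steps recited in the implementations above can be sketched in Python with NumPy. The first-order IIR smoothing form, the smoothing factors, the threshold, and the 50/50 adjustment rule below are illustrative assumptions, not values taken from the disclosure:

```python
import numpy as np

def update_long_term(comp, short_prev, long_prev,
                     a_short=0.3, a_long=0.9, threshold=0.8):
    """One frame of comparison-value smoothing (illustrative sketch)."""
    # Short-term and first long-term smoothed comparison values
    # (first-order IIR smoothing across frames, per candidate shift).
    short = (1.0 - a_short) * comp + a_short * short_prev
    long_first = (1.0 - a_long) * comp + a_long * long_prev

    # Normalized cross-correlation value between the instantaneous
    # comparison values and the short-term smoothed comparison values.
    xcorr = float(np.dot(comp, short)
                  / (np.linalg.norm(comp) * np.linalg.norm(short) + 1e-12))

    # When the cross-correlation value exceeds the threshold, adjust the
    # first long-term values toward the current comparison values to
    # obtain the second long-term values (blend rule is an assumption).
    if xcorr > threshold:
        long_second = 0.5 * long_first + 0.5 * comp
    else:
        long_second = long_first
    return short, long_second, xcorr
```

The adjustment lets the long-term statistics react quickly when the instantaneous comparison values agree strongly with the recent short-term trend.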
Systems and devices operable to encode multiple audio signals are disclosed. A device may include an encoder configured to encode the multiple audio signals. The multiple audio signals may be captured concurrently in time using multiple recording devices, e.g., multiple microphones. In some examples, the multiple audio signals (or multi-channel audio) may be synthetically (e.g., artificially) generated by multiplexing several audio channels that are recorded at the same time or at different times. As illustrative examples, the concurrent recording or multiplexing of the audio channels may result in a 2-channel configuration (i.e., Stereo: Left and Right), a 5.1 channel configuration (Left, Right, Center, Left Surround, Right Surround, and a low-frequency effects (LFE) channel), a 7.1 channel configuration, a 7.1+4 channel configuration, a 22.2 channel configuration, or an N-channel configuration.
Audio capture devices in teleconference rooms (or telepresence rooms) may include multiple microphones that acquire spatial audio. The spatial audio may include speech as well as background audio that is encoded and transmitted. The speech/audio from a given source (e.g., a talker) may arrive at the multiple microphones at different times depending on how the microphones are arranged as well as where the source (e.g., the talker) is located with respect to the microphones and room dimensions. For example, a sound source (e.g., a talker) may be closer to a first microphone associated with the device than to a second microphone associated with the device. Thus, a sound emitted from the sound source may reach the first microphone earlier in time than the second microphone. The device may receive a first audio signal via the first microphone and may receive a second audio signal via the second microphone.
Mid-side (MS) coding and parametric stereo (PS) coding are stereo coding techniques that may provide improved efficiency over dual-mono coding techniques. In dual-mono coding, the Left (L) channel (or signal) and the Right (R) channel (or signal) are independently coded without making use of inter-channel correlation. MS coding reduces the redundancy between a correlated L/R channel-pair by transforming the Left channel and the Right channel to a sum-channel and a difference-channel (e.g., a side channel) prior to coding. The sum signal and the difference signal are waveform coded in MS coding. Relatively more bits are spent on the sum signal than on the side signal. PS coding reduces redundancy in each sub-band by transforming the L/R signals into a sum signal and a set of side parameters. The side parameters may indicate an inter-channel intensity difference (IID), an inter-channel phase difference (IPD), an inter-channel time difference (ITD), etc. The sum signal is waveform coded and transmitted along with the side parameters. In a hybrid system, the side-channel may be waveform coded in the lower bands (e.g., less than 2 kilohertz (kHz)) and PS coded in the upper bands (e.g., greater than or equal to 2 kHz), where inter-channel phase preservation is perceptually less critical.
The MS coding and the PS coding may be done in either the frequency domain or in the sub-band domain. In some examples, the Left channel and the Right channel may be uncorrelated. For example, the Left channel and the Right channel may include uncorrelated synthetic signals. When the Left channel and the Right channel are uncorrelated, the coding efficiency of the MS coding, the PS coding, or both, may approach the coding efficiency of the dual-mono coding.
Depending on a recording configuration, there may be a temporal shift between a Left channel and a Right channel, as well as other spatial effects such as echo and room reverberation. If the temporal shift and phase mismatch between the channels are not compensated, the sum channel and the difference channel may contain comparable energies, reducing the coding gains associated with MS or PS techniques. The reduction in the coding gains may be based on the amount of temporal (or phase) shift. The comparable energies of the sum signal and the difference signal may limit the usage of MS coding in certain frames where the channels are temporally shifted but are highly correlated. In stereo coding, a Mid channel (e.g., a sum channel) and a Side channel (e.g., a difference channel) may be generated based on the following Formula:
M = (L + R)/2, S = (L − R)/2    (Formula 1)
where M corresponds to the Mid channel, S corresponds to the Side channel, L corresponds to the Left channel, and R corresponds to the Right channel.
In some cases, the Mid channel and the Side channel may be generated based on the following Formula:
M = c · (L + R), S = c · (L − R)    (Formula 2)
where c corresponds to a frequency-dependent complex value. Generating the Mid channel and the Side channel based on Formula 1 or Formula 2 may be referred to as performing a "down-mixing" algorithm. A reverse process of generating the Left channel and the Right channel from the Mid channel and the Side channel based on Formula 1 or Formula 2 may be referred to as performing an "up-mixing" algorithm.
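Formula 1 and its reverse can be illustrated with a minimal down-mix/up-mix sketch (Python with NumPy; the function names are illustrative):

```python
import numpy as np

# Down-mix per Formula 1: M = (L + R)/2, S = (L - R)/2.
def down_mix(left, right):
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

# Up-mix (reverse of Formula 1): L = M + S, R = M - S.
def up_mix(mid, side):
    left = mid + side
    right = mid - side
    return left, right
```

The up-mix exactly inverts the down-mix, so the pair is lossless apart from any quantization applied between the two steps.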
An ad-hoc approach used to choose between MS coding and dual-mono coding for a particular frame may include generating a mid signal and a side signal, calculating energies of the mid signal and the side signal, and determining whether to perform MS coding based on the energies. For example, MS coding may be performed in response to determining that the ratio of energies of the side signal and the mid signal is less than a threshold. To illustrate, if a Right channel is shifted by at least a first time (e.g., about 0.001 seconds or 48 samples at 48 kHz), a first energy of the mid signal (corresponding to a sum of the left signal and the right signal) may be comparable to a second energy of the side signal (corresponding to a difference between the left signal and the right signal) for voiced speech frames. When the first energy is comparable to the second energy, a higher number of bits may be used to encode the Side channel, thereby reducing coding efficiency of MS coding relative to dual-mono coding. Dual-mono coding may thus be used when the first energy is comparable to the second energy (e.g., when the ratio of the first energy and the second energy is greater than or equal to the threshold). In an alternative approach, the decision between MS coding and dual-mono coding for a particular frame may be made based on a comparison of a threshold and normalized cross-correlation values of the Left channel and the Right channel.
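The ad-hoc energy-ratio decision can be sketched as follows; the threshold value of 0.25 is an illustrative assumption, not a value from the disclosure:

```python
import numpy as np

def choose_coding_mode(left, right, threshold=0.25):
    """Pick MS or dual-mono coding from the side/mid energy ratio (sketch)."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    e_mid = float(np.sum(mid * mid)) + 1e-12   # avoid division by zero
    e_side = float(np.sum(side * side))
    # MS coding is efficient only when the side energy is small
    # relative to the mid energy.
    return "MS" if e_side / e_mid < threshold else "dual-mono"
```

Highly correlated channels yield a small side signal and select MS coding; uncorrelated or anti-correlated channels yield comparable (or larger) side energy and fall back to dual-mono coding.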
In some examples, the encoder may determine a temporal mismatch value indicative of a temporal shift of the first audio signal relative to the second audio signal. The mismatch value may correspond to an amount of temporal delay between receipt of the first audio signal at the first microphone and receipt of the second audio signal at the second microphone. Furthermore, the encoder may determine the mismatch value on a frame-by-frame basis, e.g., for each 20-millisecond (ms) speech/audio frame. For example, the mismatch value may correspond to an amount of time that a second frame of the second audio signal is delayed with respect to a first frame of the first audio signal. Alternatively, the mismatch value may correspond to an amount of time that the first frame of the first audio signal is delayed with respect to the second frame of the second audio signal.
When the sound source is closer to the first microphone than to the second microphone, frames of the second audio signal may be delayed relative to frames of the first audio signal. In this case, the first audio signal may be referred to as the “reference audio signal” or “reference channel” and the delayed second audio signal may be referred to as the “target audio signal” or “target channel”. Alternatively, when the sound source is closer to the second microphone than to the first microphone, frames of the first audio signal may be delayed relative to frames of the second audio signal. In this case, the second audio signal may be referred to as the reference audio signal or reference channel and the delayed first audio signal may be referred to as the target audio signal or target channel.
Depending on where the sound sources (e.g., talkers) are located in a conference or telepresence room or how the sound source (e.g., talker) position changes relative to the microphones, the reference channel and the target channel may change from one frame to another; similarly, the temporal delay value may also change from one frame to another. However, in some implementations, the mismatch value may always be positive to indicate an amount of delay of the "target" channel relative to the "reference" channel. Furthermore, the mismatch value may correspond to a "non-causal shift" value by which the delayed target channel is "pulled back" in time such that the target channel is aligned (e.g., maximally aligned) with the "reference" channel. The down-mix algorithm to determine the mid channel and the side channel may be performed on the reference channel and the non-causal shifted target channel.
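The "pull back" of the delayed target channel can be illustrated with a minimal sketch. Zero-padding the tail is a simplifying assumption here; a practical codec would fill those samples from the next frame's look-ahead:

```python
import numpy as np

def non_causal_shift(target, shift):
    """Advance the delayed target channel by a positive shift (sketch).

    Dropping the first `shift` samples aligns the target with the
    reference; the tail is zero-padded for illustration only.
    """
    return np.concatenate([target[shift:], np.zeros(shift)])
```

After the shift, the aligned portion of the target channel matches the reference channel sample-for-sample, which keeps the side channel small in the subsequent down-mix.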
The encoder may determine the mismatch value based on the reference audio channel and a plurality of mismatch values applied to the target audio channel. For example, a first frame of the reference audio channel, X, may be received at a first time (m1). A first particular frame of the target audio channel, Y, may be received at a second time (n1) corresponding to a first mismatch value, e.g., shift1=n1−m1. Further, a second frame of the reference audio channel may be received at a third time (m2). A second particular frame of the target audio channel may be received at a fourth time (n2) corresponding to a second mismatch value, e.g., shift2=n2−m2.
The device may perform a framing or a buffering algorithm to generate a frame (e.g., 20 ms of samples) at a first sampling rate (e.g., a 32 kHz sampling rate, i.e., 640 samples per frame). The encoder may, in response to determining that a first frame of the first audio signal and a second frame of the second audio signal arrive at the same time at the device, estimate a mismatch value (e.g., shift1) as equal to zero samples. A Left channel (e.g., corresponding to the first audio signal) and a Right channel (e.g., corresponding to the second audio signal) may be temporally aligned. In some cases, the Left channel and the Right channel, even when aligned, may differ in energy due to various reasons (e.g., microphone calibration).
In some examples, the Left channel and the Right channel may not be temporally aligned due to various reasons (e.g., a sound source, such as a talker, may be closer to one of the microphones than to the other, and the two microphones may be more than a threshold distance (e.g., 1-20 centimeters) apart). A location of the sound source relative to the microphones may introduce different delays in the Left channel and the Right channel. In addition, there may be a gain difference, an energy difference, or a level difference between the Left channel and the Right channel.
In some examples, a time of arrival of audio signals at the microphones from multiple sound sources (e.g., talkers) may vary when the multiple talkers are alternately talking (e.g., without overlap). In such a case, the encoder may dynamically adjust a temporal mismatch value based on the talker to identify the reference channel. In some other examples, the multiple talkers may be talking at the same time, which may result in varying temporal mismatch values depending on which talker is loudest, which talker is closest to a microphone, etc.
In some examples, the first audio signal and second audio signal may be synthesized or artificially generated when the two signals potentially show less (e.g., no) correlation. It should be understood that the examples described herein are illustrative and may be instructive in determining a relationship between the first audio signal and the second audio signal in similar or different situations.
The encoder may generate comparison values (e.g., difference values or cross-correlation values) based on a comparison of a first frame of the first audio signal and a plurality of frames of the second audio signal. Each frame of the plurality of frames may correspond to a particular mismatch value. The encoder may generate a first estimated mismatch value based on the comparison values. For example, the first estimated mismatch value may correspond to a comparison value indicating a higher temporal-similarity (or lower difference) between the first frame of the first audio signal and a corresponding first frame of the second audio signal.
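The comparison-value search described above can be sketched as follows, using normalized cross-correlation as the comparison value (the disclosure also permits difference values) and taking the candidate mismatch with the highest similarity as the first estimated mismatch value:

```python
import numpy as np

def estimate_mismatch(ref_frame, target, max_shift):
    """Return the candidate shift with the highest similarity (sketch)."""
    n = len(ref_frame)
    comp = []
    for k in range(max_shift + 1):
        # Candidate target frame at mismatch value k.
        seg = target[k:k + n]
        denom = np.linalg.norm(ref_frame) * np.linalg.norm(seg) + 1e-12
        # Normalized cross-correlation as the comparison value.
        comp.append(float(np.dot(ref_frame, seg)) / denom)
    comp = np.asarray(comp)
    # Highest temporal similarity -> first estimated mismatch value.
    return int(np.argmax(comp)), comp
```

With difference-based comparison values the same structure applies, except the encoder would take the argmin (lowest difference) instead of the argmax.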
The encoder may determine the final mismatch value by refining, in multiple stages, a series of estimated mismatch values. For example, the encoder may first estimate a "tentative" mismatch value based on comparison values generated from stereo pre-processed and re-sampled versions of the first audio signal and the second audio signal. The encoder may generate interpolated comparison values associated with mismatch values proximate to the estimated "tentative" mismatch value. The encoder may determine a second estimated "interpolated" mismatch value based on the interpolated comparison values. For example, the second estimated "interpolated" mismatch value may correspond to a particular interpolated comparison value that indicates a higher temporal-similarity (or lower difference) than the remaining interpolated comparison values and the first estimated "tentative" mismatch value. If the second estimated "interpolated" mismatch value of the current frame (e.g., the first frame of the first audio signal) is different than the final mismatch value of a previous frame (e.g., a frame of the first audio signal that precedes the first frame), then the "interpolated" mismatch value of the current frame is further "amended" to improve the temporal-similarity between the first audio signal and the shifted second audio signal. In particular, a third estimated "amended" mismatch value may correspond to a more accurate measure of temporal-similarity obtained by searching around the second estimated "interpolated" mismatch value of the current frame and the final estimated mismatch value of the previous frame. The third estimated "amended" mismatch value is further conditioned to estimate the final mismatch value by limiting any spurious changes in the mismatch value between frames, and is further controlled so as not to switch from a negative mismatch value to a positive mismatch value (or vice versa) in two successive (or consecutive) frames, as described herein.
In some examples, the encoder may refrain from switching between a positive mismatch value and a negative mismatch value, or vice versa, in consecutive frames or in adjacent frames. For example, the encoder may set the final mismatch value to a particular value (e.g., 0) indicating no temporal-shift based on the estimated "interpolated" or "amended" mismatch value of the first frame and a corresponding estimated "interpolated", "amended", or final mismatch value in a particular frame that precedes the first frame. To illustrate, the encoder may set the final mismatch value of the current frame (e.g., the first frame) to indicate no temporal-shift, i.e., shift1=0, in response to determining that the estimated "tentative", "interpolated", or "amended" mismatch value of the current frame is positive and that the estimated "tentative", "interpolated", "amended", or "final" mismatch value of the previous frame (e.g., the frame preceding the first frame) is negative. Alternatively, the encoder may set the final mismatch value of the current frame to indicate no temporal-shift, i.e., shift1=0, in response to determining that the estimated "tentative", "interpolated", or "amended" mismatch value of the current frame is negative and that the estimated "tentative", "interpolated", "amended", or "final" mismatch value of the previous frame is positive.
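The sign-switch guard reduces to a small helper; the function below is a sketch of the rule described above:

```python
def guard_sign_switch(current_estimate, previous_final):
    """Force the final mismatch to 0 when the sign flips between frames.

    A sign flip between consecutive frames (positive -> negative or
    vice versa) is treated as unreliable, so no shift is applied.
    """
    if current_estimate * previous_final < 0:
        return 0  # shift1 = 0: no temporal shift
    return current_estimate
```

This prevents the target channel from being shifted in opposite directions in back-to-back frames, which would otherwise repeat or skip samples at the frame boundary.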
The encoder may select a frame of the first audio signal or the second audio signal as a “reference” or “target” based on the mismatch value. For example, in response to determining that the final mismatch value is positive, the encoder may generate a reference channel or signal indicator having a first value (e.g., 0) indicating that the first audio signal is a “reference” signal and that the second audio signal is the “target” signal. Alternatively, in response to determining that the final mismatch value is negative, the encoder may generate the reference channel or signal indicator having a second value (e.g., 1) indicating that the second audio signal is the “reference” signal and that the first audio signal is the “target” signal.
The encoder may estimate a relative gain (e.g., a relative gain parameter) associated with the reference signal and the non-causal shifted target signal. For example, in response to determining that the final mismatch value is positive, the encoder may estimate a gain value to normalize or equalize the energy or power levels of the first audio signal relative to the second audio signal that is offset by the non-causal mismatch value (e.g., an absolute value of the final mismatch value). Alternatively, in response to determining that the final mismatch value is negative, the encoder may estimate a gain value to normalize or equalize the power levels of the non-causal shifted first audio signal relative to the second audio signal. In some examples, the encoder may estimate a gain value to normalize or equalize the energy or power levels of the “reference” signal relative to the non-causal shifted “target” signal. In other examples, the encoder may estimate the gain value (e.g., a relative gain value) based on the reference signal relative to the target signal (e.g., the un-shifted target signal).
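One simple form of the relative gain estimate is an energy-equalizing scale factor; the formula below is an illustrative assumption consistent with the description above, not the disclosure's exact estimator:

```python
import numpy as np

def relative_gain(reference, shifted_target):
    """Scale factor equalizing target energy to reference energy (sketch)."""
    e_ref = float(np.sum(reference * reference))
    e_tgt = float(np.sum(shifted_target * shifted_target)) + 1e-12
    # Applying this gain to the shifted target matches its energy
    # (power level) to that of the reference channel.
    return float(np.sqrt(e_ref / e_tgt))
```

Multiplying the shifted target channel by the returned gain normalizes its power level relative to the reference channel before the down-mix.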
The encoder may generate at least one encoded signal (e.g., a mid signal, a side signal, or both) based on the reference signal, the target signal, the non-causal mismatch value, and the relative gain parameter. The side signal may correspond to a difference between first samples of the first frame of the first audio signal and selected samples of a selected frame of the second audio signal. The encoder may select the selected frame based on the final mismatch value. Fewer bits may be used to encode the side channel because of reduced difference between the first samples and the selected samples as compared to other samples of the second audio signal that correspond to a frame of the second audio signal that is received by the device at the same time as the first frame. A transmitter of the device may transmit the at least one encoded signal, the non-causal mismatch value, the relative gain parameter, the reference channel or signal indicator, or a combination thereof.
The encoder may generate at least one encoded signal (e.g., a mid signal, a side signal, or both) based on the reference signal, the target signal, the non-causal mismatch value, the relative gain parameter, low band parameters of a particular frame of the first audio signal, high band parameters of the particular frame, or a combination thereof. The particular frame may precede the first frame. Certain low band parameters, high band parameters, or a combination thereof, from one or more preceding frames may be used to encode a mid signal, a side signal, or both, of the first frame. Encoding the mid signal, the side signal, or both, based on the low band parameters, the high band parameters, or a combination thereof, may improve estimates of the non-causal mismatch value and inter-channel relative gain parameter. The low band parameters, the high band parameters, or a combination thereof, may include a pitch parameter, a voicing parameter, a coder type parameter, a low-band energy parameter, a high-band energy parameter, a tilt parameter, a pitch gain parameter, an FCB gain parameter, a coding mode parameter, a voice activity parameter, a noise estimate parameter, a signal-to-noise ratio parameter, a formant parameter, a speech/music decision parameter, the non-causal shift, the inter-channel gain parameter, or a combination thereof. A transmitter of the device may transmit the at least one encoded signal, the non-causal mismatch value, the relative gain parameter, the reference channel (or signal) indicator, or a combination thereof.
Referring to
The first device 104 may include an encoder 114, a transmitter 110, one or more input interfaces 112, or a combination thereof. A first input interface of the input interfaces 112 may be coupled to a first microphone 146. A second input interface of the input interface(s) 112 may be coupled to a second microphone 148. The encoder 114 may include a temporal equalizer 108 and may be configured to down mix and encode multiple audio signals, as described herein. The first device 104 may also include a memory 153 configured to store analysis data 190. The second device 106 may include a decoder 118. The decoder 118 may include a temporal balancer 124 that is configured to up-mix and render the multiple channels. The second device 106 may be coupled to a first loudspeaker 142, a second loudspeaker 144, or both.
During operation, the first device 104 may receive a first audio signal 130 (e.g., a first channel) via the first input interface from the first microphone 146 and may receive a second audio signal 132 (e.g., a second channel) via the second input interface from the second microphone 148. As used herein, “signal” and “channel” may be used interchangeably. The first audio signal 130 may correspond to one of a right channel or a left channel. The second audio signal 132 may correspond to the other of the right channel or the left channel. In the example of
A sound source 152 (e.g., a user, a speaker, ambient noise, a musical instrument, etc.) may be closer to the first microphone 146 than to the second microphone 148. Accordingly, an audio signal from the sound source 152 may be received at the input interface(s) 112 via the first microphone 146 at an earlier time than via the second microphone 148. This natural delay in the multi-channel signal acquisition through the multiple microphones may introduce a temporal shift between the first audio signal 130 and the second audio signal 132.
The temporal equalizer 108 may be configured to estimate a temporal offset between audio captured at the microphones 146, 148. The temporal offset may be estimated based on a delay between a first frame 131 (e.g., a “reference frame”) of the first audio signal 130 and a second frame 133 (e.g., a “target frame”) of the second audio signal 132, where the second frame 133 includes substantially similar content as the first frame 131. For example, the temporal equalizer 108 may determine a cross-correlation between the first frame 131 and the second frame 133. The cross-correlation may measure the similarity of the two frames as a function of the lag of one frame relative to the other. Based on the cross-correlation, the temporal equalizer 108 may determine the delay (e.g., lag) between the first frame 131 and the second frame 133. The temporal equalizer 108 may estimate the temporal offset between the first audio signal 130 and the second audio signal 132 based on the delay and historical delay data.
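The cross-correlation search described above can be sketched as follows. This is an illustrative Python sketch, not the disclosed implementation; the function names and the example shift range are assumptions.

```python
# Illustrative sketch: estimate the delay between a reference frame and a
# target frame by locating the peak of their cross-correlation over a range
# of candidate lags (shift values).

def cross_correlation(ref, targ, lag):
    """Correlation of ref with targ shifted by `lag` samples."""
    total = 0.0
    for i in range(len(ref)):
        j = i + lag
        if 0 <= j < len(targ):
            total += ref[i] * targ[j]
    return total

def estimate_delay(ref, targ, t_min=-4, t_max=4):
    """Return the candidate lag with the highest cross-correlation."""
    return max(range(t_min, t_max + 1),
               key=lambda k: cross_correlation(ref, targ, k))

# Example: the target frame is the reference frame delayed by 2 samples,
# so the estimated lag should be +2.
ref = [0.0, 1.0, 0.0, -1.0, 0.0, 0.5, 0.0, -0.5]
targ = [0.0, 0.0] + ref[:-2]
```

In a full encoder, the peak lag for each frame would then feed the historical delay data used to estimate the temporal offset.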
The historical data may include delays between frames captured from the first microphone 146 and corresponding frames captured from the second microphone 148. For example, the temporal equalizer 108 may determine a cross-correlation (e.g., a lag) between previous frames associated with the first audio signal 130 and corresponding frames associated with the second audio signal 132.
Each lag may be represented by a “comparison value.” That is, a comparison value may indicate a time shift (k) between a frame of the first audio signal 130 and a corresponding frame of the second audio signal 132. In accordance with the disclosure herein, a comparison value may additionally indicate an amount of temporal mismatch, or a measure of the similarity or dissimilarity between a first reference frame of a reference channel and a corresponding first target frame of a target channel. In some implementations, a cross-correlation function between the reference frame and the target frame may be used to measure the similarity of the two frames as a function of the lag of one frame relative to the other. According to one implementation, the comparison values (e.g., cross-correlation values) for previous frames may be stored at the memory 153. A smoother 190 of the temporal equalizer 108 may “smooth” (or average) comparison values over a long-term set of frames and use the long-term smoothed comparison values for estimating a temporal offset (e.g., “shift”) between the first audio signal 130 and the second audio signal 132.
To illustrate, if CompValN(k) represents the comparison value at a shift of k for the frame N, the frame N may have comparison values from k=T_MIN (a minimum shift) to k=T_MAX (a maximum shift). The smoothing may be performed such that a long-term smoothed comparison value CompValLTN(k) is based on the comparison values of the current frame and of one or more preceding frames (e.g., CompValLTN(k)=(1−α)*CompValN(k)+α*CompValLTN−1(k), where α (0<α<1) is a smoothing parameter).
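A minimal sketch of this long-term smoothing, assuming a single-pole recursion CompValLT_N(k) = (1−α)*CompVal_N(k) + α*CompValLT_N−1(k), where α weights the history (an assumed form, consistent with the smoothing-parameter discussion later in this section):

```python
# Illustrative sketch (assumed form): per-shift long-term smoothing of the
# comparison values. `inst` maps each shift k to this frame's comparison
# value; `prev_lt` holds the long-term smoothed values from the prior frame.

def smooth_comparison_values(inst, prev_lt, alpha=0.95):
    """Blend the current frame's comparison values with the long-term history."""
    return {k: (1.0 - alpha) * inst[k] + alpha * prev_lt.get(k, inst[k])
            for k in inst}
```

With a large α (e.g., 0.95), the smoothed curve changes slowly across frames; reducing α lets it track the current frame more quickly.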
The smoothing techniques described above may substantially normalize the shift estimate between voiced frames, unvoiced frames, and transition frames. Normalized shift estimates may reduce sample repetition and artifact skipping at frame boundaries. Additionally, normalized shift estimates may result in reduced side channel energies, which may improve coding efficiency.
The temporal equalizer 108 may determine a final mismatch value 116 (e.g., a non-causal mismatch value) indicative of the shift (e.g., a non-causal mismatch or a non-causal shift) of the first audio signal 130 (e.g., “reference”) relative to the second audio signal 132 (e.g., “target”). The final mismatch value 116 may be based on the instantaneous comparison value CompValN(k) and the long-term smoothed comparison values CompValLTN(k).
In some implementations, the third value (e.g., 0) of the final mismatch value 116 may indicate that delay between the first audio signal 130 and the second audio signal 132 has switched sign. For example, a first particular frame of the first audio signal 130 may precede the first frame 131. The first particular frame and a second particular frame of the second audio signal 132 may correspond to the same sound emitted by the sound source 152. The delay between the first audio signal 130 and the second audio signal 132 may switch from having the first particular frame delayed with respect to the second particular frame to having the second frame 133 delayed with respect to the first frame 131. Alternatively, the delay between the first audio signal 130 and the second audio signal 132 may switch from having the second particular frame delayed with respect to the first particular frame to having the first frame 131 delayed with respect to the second frame 133. The temporal equalizer 108 may set the final mismatch value 116 to indicate the third value (e.g., 0) in response to determining that the delay between the first audio signal 130 and the second audio signal 132 has switched sign.
The temporal equalizer 108 may generate a reference signal indicator 164 based on the final mismatch value 116. For example, the temporal equalizer 108 may, in response to determining that the final mismatch value 116 indicates a first value (e.g., a positive value), generate the reference signal indicator 164 to have a first value (e.g., 0) indicating that the first audio signal 130 is a “reference” signal. The temporal equalizer 108 may determine that the second audio signal 132 corresponds to a “target” signal in response to determining that the final mismatch value 116 indicates the first value (e.g., a positive value). Alternatively, the temporal equalizer 108 may, in response to determining that the final mismatch value 116 indicates a second value (e.g., a negative value), generate the reference signal indicator 164 to have a second value (e.g., 1) indicating that the second audio signal 132 is the “reference” signal. The temporal equalizer 108 may determine that the first audio signal 130 corresponds to the “target” signal in response to determining that the final mismatch value 116 indicates the second value (e.g., a negative value). The temporal equalizer 108 may, in response to determining that the final mismatch value 116 indicates a third value (e.g., 0), generate the reference signal indicator 164 to have a first value (e.g., 0) indicating that the first audio signal 130 is a “reference” signal. The temporal equalizer 108 may determine that the second audio signal 132 corresponds to a “target” signal in response to determining that the final mismatch value 116 indicates the third value (e.g., 0). Alternatively, the temporal equalizer 108 may, in response to determining that the final mismatch value 116 indicates the third value (e.g., 0), generate the reference signal indicator 164 to have a second value (e.g., 1) indicating that the second audio signal 132 is a “reference” signal. 
The temporal equalizer 108 may determine that the first audio signal 130 corresponds to a “target” signal in response to determining that the final mismatch value 116 indicates the third value (e.g., 0). In some implementations, the temporal equalizer 108 may, in response to determining that the final mismatch value 116 indicates a third value (e.g., 0), leave the reference signal indicator 164 unchanged. For example, the reference signal indicator 164 may be the same as a reference signal indicator corresponding to the first particular frame of the first audio signal 130. The temporal equalizer 108 may generate a non-causal mismatch value 162 indicating an absolute value of the final mismatch value 116.
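The sign-to-designation mapping described above can be sketched as follows. The function name is illustrative, and the zero-shift case here keeps the first signal as the reference, which is one of the implementations described; the 0/1 indicator values follow the examples above.

```python
# Illustrative sketch: derive the reference signal indicator and the
# non-causal (absolute) mismatch value from the final mismatch value.
# Positive -> first signal is reference (indicator 0);
# negative -> second signal is reference (indicator 1);
# zero -> one described implementation keeps the first signal as reference.

def designate_reference(final_mismatch):
    indicator = 1 if final_mismatch < 0 else 0
    non_causal = abs(final_mismatch)
    return indicator, non_causal
```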
The temporal equalizer 108 may generate a gain parameter 160 (e.g., a codec gain parameter) based on samples of the “target” signal and based on samples of the “reference” signal. For example, the temporal equalizer 108 may select samples of the second audio signal 132 based on the non-causal mismatch value 162. Alternatively, the temporal equalizer 108 may select samples of the second audio signal 132 independent of the non-causal mismatch value 162. The temporal equalizer 108 may, in response to determining that the first audio signal 130 is the reference signal, determine the gain parameter 160 of the selected samples based on the first samples of the first frame 131 of the first audio signal 130. Alternatively, the temporal equalizer 108 may, in response to determining that the second audio signal 132 is the reference signal, determine the gain parameter 160 of the first samples based on the selected samples. As an example, the gain parameter 160 may be based on one of the following Equations:
where gD corresponds to the relative gain parameter 160 for down mix processing, Ref(n) corresponds to samples of the “reference” signal, N1 corresponds to the non-causal mismatch value 162 of the first frame 131, and Targ(n+N1) corresponds to samples of the “target” signal. The gain parameter 160 (gD) may be modified, e.g., based on one of the Equations 1a-1f, to incorporate long-term smoothing/hysteresis logic to avoid large jumps in gain between frames. When the target signal includes the first audio signal 130, the first samples may include samples of the target signal and the selected samples may include samples of the reference signal. When the target signal includes the second audio signal 132, the first samples may include samples of the reference signal, and the selected samples may include samples of the target signal.
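Since Equations 1a-1f are not reproduced here, the following is a hypothetical sketch of one common form such a gain can take: a least-squares gain that scales the shifted target samples to best match the reference samples. This is an assumption for illustration, not necessarily any of the patent's Equations 1a-1f.

```python
# Hypothetical sketch: least-squares relative gain gD over a frame, using the
# reference samples Ref(n) and the target samples shifted by the non-causal
# mismatch value N1, i.e., Targ(n+N1).

def relative_gain(ref, targ, n1):
    """gD = sum(Ref(n)*Targ(n+N1)) / sum(Targ(n+N1)^2), guarded against zero energy."""
    num = den = 0.0
    for n in range(len(ref)):
        j = n + n1
        if 0 <= j < len(targ):
            num += ref[n] * targ[j]
            den += targ[j] * targ[j]
    return num / den if den else 1.0
```

As noted above, a real codec may additionally apply long-term smoothing/hysteresis to the gain to avoid large jumps between frames.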
In some implementations, the temporal equalizer 108 may generate the gain parameter 160 based on treating the first audio signal 130 as a reference signal and treating the second audio signal 132 as a target signal, irrespective of the reference signal indicator 164. For example, the temporal equalizer 108 may generate the gain parameter 160 based on one of the Equations 1a-1f where Ref(n) corresponds to samples (e.g., the first samples) of the first audio signal 130 and Targ(n+N1) corresponds to samples (e.g., the selected samples) of the second audio signal 132. In alternate implementations, the temporal equalizer 108 may generate the gain parameter 160 based on treating the second audio signal 132 as a reference signal and treating the first audio signal 130 as a target signal, irrespective of the reference signal indicator 164. For example, the temporal equalizer 108 may generate the gain parameter 160 based on one of the Equations 1a-1f where Ref(n) corresponds to samples (e.g., the selected samples) of the second audio signal 132 and Targ(n+N1) corresponds to samples (e.g., the first samples) of the first audio signal 130.
The temporal equalizer 108 may generate one or more encoded signals 102 (e.g., a mid channel, a side channel, or both) based on the first samples, the selected samples, and the relative gain parameter 160 for down mix processing. For example, the temporal equalizer 108 may generate the mid signal based on one of the following Equations:
M=Ref(n)+gD*Targ(n+N1), Equation 2a
M=Ref(n)+Targ(n+N1), Equation 2b
where M corresponds to the mid channel, gD corresponds to the relative gain parameter 160 for down mix processing, Ref(n) corresponds to samples of the “reference” signal, N1 corresponds to the non-causal mismatch value 162 of the first frame 131, and Targ(n+N1) corresponds to samples of the “target” signal.
The temporal equalizer 108 may generate the side channel based on one of the following Equations:
S=Ref(n)−gD*Targ(n+N1), Equation 3a
S=gD*Ref(n)−Targ(n+N1), Equation 3b
where S corresponds to the side channel, gD corresponds to the relative gain parameter 160 for down mix processing, Ref(n) corresponds to samples of the “reference” signal, N1 corresponds to the non-causal mismatch value 162 of the first frame 131, and Targ(n+N1) corresponds to samples of the “target” signal.
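Equations 2a and 3a can be sketched together as a single down-mix step. The function name is illustrative; out-of-range shifted target samples are treated as zero here, which is a simplification.

```python
# Illustrative sketch of Equations 2a and 3a:
#   M(n) = Ref(n) + gD*Targ(n+N1)   (mid channel)
#   S(n) = Ref(n) - gD*Targ(n+N1)   (side channel)

def downmix(ref, targ, n1, g_d):
    mid, side = [], []
    for n in range(len(ref)):
        j = n + n1
        t = targ[j] if 0 <= j < len(targ) else 0.0  # simplification at frame edges
        mid.append(ref[n] + g_d * t)
        side.append(ref[n] - g_d * t)
    return mid, side
```

When the shifted target is well aligned with the reference and gD is accurate, the side channel is near zero, which is exactly why fewer bits suffice to encode it.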
The transmitter 110 may transmit the encoded signals 102 (e.g., the mid channel, the side channel, or both), the reference signal indicator 164, the non-causal mismatch value 162, the gain parameter 160, or a combination thereof, via the network 120, to the second device 106. In some implementations, the transmitter 110 may store the encoded signals 102 (e.g., the mid channel, the side channel, or both), the reference signal indicator 164, the non-causal mismatch value 162, the gain parameter 160, or a combination thereof, at a device of the network 120 or a local device for further processing or decoding later.
The decoder 118 may decode the encoded signals 102. The temporal balancer 124 may perform up-mixing to generate a first output signal 126 (e.g., corresponding to first audio signal 130), a second output signal 128 (e.g., corresponding to the second audio signal 132), or both. The second device 106 may output the first output signal 126 via the first loudspeaker 142. The second device 106 may output the second output signal 128 via the second loudspeaker 144.
The system 100 may thus enable the temporal equalizer 108 to encode the side channel using fewer bits than the mid signal. The first samples of the first frame 131 of the first audio signal 130 and selected samples of the second audio signal 132 may correspond to the same sound emitted by the sound source 152, and hence a difference between the first samples and the selected samples may be lower than a difference between the first samples and other samples of the second audio signal 132. The side channel may correspond to the difference between the first samples and the selected samples.
Referring to
During operation, the first device 204 may receive more than two audio signals. For example, the first device 204 may receive the first audio signal 130 via the first microphone 146, an Nth audio signal 232 via the Nth microphone 248, and one or more additional audio signals (e.g., the second audio signal 132) via the additional microphones (e.g., the second microphone 148).
The temporal equalizer(s) 208 may generate one or more reference signal indicators 264, final mismatch values 216, non-causal mismatch values 262, gain parameters 260, encoded signals 202, or a combination thereof. For example, the temporal equalizer(s) 208 may determine that the first audio signal 130 is a reference signal and that each of the Nth audio signal 232 and the additional audio signals is a target signal. The temporal equalizer(s) 208 may generate the reference signal indicator 164, the final mismatch values 216, the non-causal mismatch values 262, the gain parameters 260, and the encoded signals 202 corresponding to the first audio signal 130 and each of the Nth audio signal 232 and the additional audio signals.
The reference signal indicators 264 may include the reference signal indicator 164. The final mismatch values 216 may include the final mismatch value 116 indicative of a shift of the second audio signal 132 relative to the first audio signal 130, a second final mismatch value indicative of a shift of the Nth audio signal 232 relative to the first audio signal 130, or both. The non-causal mismatch values 262 may include the non-causal mismatch value 162 corresponding to an absolute value of the final mismatch value 116, a second non-causal mismatch value corresponding to an absolute value of the second final mismatch value, or both. The gain parameters 260 may include the gain parameter 160 of selected samples of the second audio signal 132, a second gain parameter of selected samples of the Nth audio signal 232, or both. The encoded signals 202 may include at least one of the encoded signals 102. For example, the encoded signals 202 may include the side channel corresponding to first samples of the first audio signal 130 and selected samples of the second audio signal 132, a second side channel corresponding to the first samples and selected samples of the Nth audio signal 232, or both. The encoded signals 202 may include a mid channel corresponding to the first samples, the selected samples of the second audio signal 132, and the selected samples of the Nth audio signal 232.
In some implementations, the temporal equalizer(s) 208 may determine multiple reference signals and corresponding target signals, as described with reference to
The transmitter 110 may transmit the reference signal indicators 264, the non-causal mismatch values 262, the gain parameters 260, the encoded signals 202, or a combination thereof, via the network 120, to the second device 106. The decoder 118 may generate one or more output signals based on the reference signal indicators 264, the non-causal mismatch values 262, the gain parameters 260, the encoded signals 202, or a combination thereof. For example, the decoder 118 may output a first output signal 226 via the first loudspeaker 142, a Yth output signal 228 via the Yth loudspeaker 244, one or more additional output signals (e.g., the second output signal 128) via one or more additional loudspeakers (e.g., the second loudspeaker 144), or a combination thereof.
The system 200 may thus enable the temporal equalizer(s) 208 to encode more than two audio signals. For example, the encoded signals 202 may include multiple side channels that are encoded using fewer bits than corresponding mid channels by generating the side channels based on the non-causal mismatch values 262.
Referring to
The first audio signal 130 may correspond to a plurality of frames (e.g., a frame 302, a frame 304, a frame 306, or a combination thereof). Each of the plurality of frames may correspond to a subset of samples (e.g., corresponding to 20 ms, such as 640 samples at 32 kHz or 960 samples at 48 kHz) of the first samples 320. For example, the frame 302 may correspond to the sample 322, the sample 324, one or more additional samples, or a combination thereof. The frame 304 may correspond to the sample 326, the sample 328, the sample 330, the sample 332, one or more additional samples, or a combination thereof. The frame 306 may correspond to the sample 334, the sample 336, one or more additional samples, or a combination thereof.
The sample 322 may be received at the input interface(s) 112 of
A first value (e.g., a positive value) of the final mismatch value 116 may indicate that the second audio signal 132 is delayed relative to the first audio signal 130. For example, a first value (e.g., +X ms or +Y samples, where X and Y include positive real numbers) of the final mismatch value 116 may indicate that the frame 304 (e.g., the samples 326-332) correspond to the samples 358-364. The samples 326-332 and the samples 358-364 may correspond to the same sound emitted from the sound source 152. The samples 358-364 may correspond to a frame 344 of the second audio signal 132. Illustration of samples with cross-hatching in one or more of
It should be understood that a temporal offset of Y samples, as shown in
The temporal equalizer 108 of
Referring to
A second value (e.g., a negative value) of the final mismatch value 116 may indicate that the first audio signal 130 is delayed relative to the second audio signal 132. For example, the second value (e.g., −X ms or −Y samples, where X and Y include positive real numbers) of the final mismatch value 116 may indicate that the frame 304 (e.g., the samples 326-332) correspond to the samples 354-360. The samples 354-360 may correspond to the frame 344 of the second audio signal 132. The samples 354-360 (e.g., the frame 344) and the samples 326-332 (e.g., the frame 304) may correspond to the same sound emitted from the sound source 152.
It should be understood that a temporal offset of −Y samples, as shown in
The temporal equalizer 108 of
Referring to
During operation, the resampler 504 may generate one or more resampled signals. For example, the resampler 504 may generate a first resampled signal 530 by resampling (e.g., down-sampling or up-sampling) the first audio signal 130 based on a resampling (e.g., down-sampling or up-sampling) factor (D) (e.g., ≥1). The resampler 504 may generate a second resampled signal 532 by resampling the second audio signal 132 based on the resampling factor (D). The resampler 504 may provide the first resampled signal 530, the second resampled signal 532, or both, to the signal comparator 506. The first audio signal 130 may be sampled at a first sample rate (Fs) to generate the samples 320 of
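A crude sketch of down-sampling by an integer factor D follows. This is an assumption for illustration only: a real resampler would apply a proper anti-aliasing low-pass filter, whereas this sketch merely averages each block of D samples.

```python
# Illustrative sketch: down-sample a signal by an integer factor d by
# averaging each block of d samples (a crude stand-in for low-pass
# filtering followed by decimation).

def downsample(signal, d):
    return [sum(signal[i:i + d]) / len(signal[i:i + d])
            for i in range(0, len(signal), d)]

# e.g., first_resampled = downsample(first_audio, 2)
#       second_resampled = downsample(second_audio, 2)
```

Comparing the resampled signals rather than the originals reduces the number of samples the signal comparator 506 must process per candidate shift.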
The signal comparator 506 may generate comparison values 534 (e.g., difference values, similarity values, coherence values, or cross-correlation values), a tentative mismatch value 536, or both, as further described with reference to
In a particular implementation, the value of the smoothing parameters (e.g., α) may be based on the short-term signal level (EST) and the long-term signal level (ELT) of the channels. As an example, the short-term signal level for the frame (N) being processed (EST(N)) may be calculated as the sum of the absolute values of the downsampled reference samples plus the sum of the absolute values of the downsampled target samples. The long-term signal level may be a smoothed version of the short-term signal levels. For example, ELT(N)=0.6*ELT(N−1)+0.4*EST(N). Further, the value of the smoothing parameters (e.g., α) may be controlled according to a pseudo-code described as follows:
Set α to an initial value (e.g., 0.95).
if EST>4*ELT, modify the value of α (e.g., α=0.5)
if EST>2*ELT and EST≤4*ELT, modify the value of α (e.g., α=0.7)
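The pseudo-code above, together with the long-term level update ELT(N)=0.6*ELT(N−1)+0.4*EST(N), can be expressed as a runnable sketch (the function names are illustrative):

```python
# Illustrative sketch of the smoothing-parameter control pseudo-code above.

def update_long_term_level(e_lt_prev, e_st):
    """Long-term signal level as a smoothed version of short-term levels."""
    return 0.6 * e_lt_prev + 0.4 * e_st

def smoothing_parameter(e_st, e_lt):
    """Pick alpha from the short-term/long-term level ratio: smooth less
    (smaller alpha) when the short-term level jumps well above the
    long-term level, e.g., at an energy onset."""
    alpha = 0.95              # initial value
    if e_st > 4 * e_lt:
        alpha = 0.5           # strong onset: adapt quickly
    elif e_st > 2 * e_lt:
        alpha = 0.7
    return alpha
```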
In a particular implementation, the value of the smoothing parameters (e.g., α) may be controlled based on the correlation of the short-term and the long-term smoothed comparison values. For example, when the comparison values of the current frame are very similar to the long-term smoothed comparison values, this indicates a stationary talker, and the smoothing parameters can be adjusted to further increase the smoothing (e.g., increase the value of α). On the other hand, when the comparison values as a function of the various shift values do not resemble the long-term smoothed comparison values, the smoothing parameters can be adjusted (e.g., adapted) to reduce smoothing (e.g., decrease the value of α).
In a particular implementation, the signal comparator 506 may estimate short-term smoothed comparison values (CompValSTN(k)), for example by smoothing the comparison values over a relatively short set of recent frames (e.g., fewer frames than are used for the long-term smoothed comparison values).
In other implementations, the short-term smoothed comparison values may be the same as the comparison values generated in the frame being processed (CompValN(k)).
The signal comparator 506 may estimate a cross-correlation value of the short-term and the long-term smoothed comparison values. In some implementations, the cross-correlation value (CrossCorr_CompValN) of the short-term and the long-term smoothed comparison values may be a single value estimated per each frame (N), which may be calculated as CrossCorr_CompValN=(Σk CompValSTN(k)*CompValLTN(k))/Fac, where Fac is a normalization factor (e.g., Fac=sqrt(Σk CompValSTN(k)^2*Σk CompValLTN(k)^2)).
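This per-frame value can be sketched as a normalized inner product over the shift axis. The energy normalization used here is an assumption (the exact normalization factor is not reproduced above); the function name is illustrative.

```python
import math

# Illustrative sketch: a single per-frame cross-correlation value between the
# short-term (st) and long-term (lt) smoothed comparison values, each a
# mapping from shift k to a comparison value, normalized to lie in [-1, 1].

def compval_cross_corr(st, lt):
    num = sum(st[k] * lt[k] for k in st if k in lt)
    fac = math.sqrt(sum(v * v for v in st.values()) *
                    sum(v * v for v in lt.values()))
    return num / fac if fac else 0.0
```

A value near 1 suggests the current frame's comparison-value curve matches the long-term history (e.g., a stationary talker), which per the discussion above can justify increased smoothing.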
The signal comparator 506 may estimate another cross-correlation value of the comparison values for a single frame (“instantaneous comparison values”) and the short-term smoothed comparison values. In some implementations, this cross-correlation value of the comparison values for the frame N (“instantaneous comparison values for the frame N”) and the short-term smoothed comparison values (e.g., CompValSTN(k)) may be a single value estimated per each frame (N), calculated in a manner analogous to the cross-correlation value described above.
The first resampled signal 530 may include fewer samples or more samples than the first audio signal 130. The second resampled signal 532 may include fewer samples or more samples than the second audio signal 132. Determining the comparison values 534 based on fewer samples of the resampled signals (e.g., the first resampled signal 530 and the second resampled signal 532) may use fewer resources (e.g., time, number of operations, or both) than determining the comparison values 534 based on samples of the original signals (e.g., the first audio signal 130 and the second audio signal 132). Determining the comparison values 534 based on more samples of the resampled signals may increase precision relative to determining the comparison values 534 based on samples of the original signals. The signal comparator 506 may provide the comparison values 534, the tentative mismatch value 536, or both, to the interpolator 510.
The interpolator 510 may extend the tentative mismatch value 536. For example, the interpolator 510 may generate an interpolated mismatch value 538. To illustrate, the interpolator 510 may generate interpolated comparison values corresponding to mismatch values that are proximate to the tentative mismatch value 536 by interpolating the comparison values 534. The interpolator 510 may determine the interpolated mismatch value 538 based on the interpolated comparison values and the comparison values 534. The comparison values 534 may be based on a coarser granularity of the mismatch values. For example, the comparison values 534 may be based on a first subset of a set of mismatch values so that a difference between a first mismatch value of the first subset and each second mismatch value of the first subset is greater than or equal to a threshold (e.g., ≥1). The threshold may be based on the resampling factor (D).
The interpolated comparison values may be based on a finer granularity of mismatch values that are proximate to the resampled tentative mismatch value 536. For example, the interpolated comparison values may be based on a second subset of the set of mismatch values so that a difference between a highest mismatch value of the second subset and the resampled tentative mismatch value 536 is less than the threshold (e.g., ≥1), and a difference between a lowest mismatch value of the second subset and the resampled tentative mismatch value 536 is less than the threshold. Determining the comparison values 534 based on the coarser granularity (e.g., the first subset) of the set of mismatch values may use fewer resources (e.g., time, operations, or both) than determining the comparison values 534 based on a finer granularity (e.g., all) of the set of mismatch values. Determining the interpolated comparison values corresponding to the second subset of mismatch values may extend the tentative mismatch value 536 based on a finer granularity of a smaller set of mismatch values that are proximate to the tentative mismatch value 536 without determining comparison values corresponding to each mismatch value of the set of mismatch values. Thus, determining the tentative mismatch value 536 based on the first subset of mismatch values and determining the interpolated mismatch value 538 based on the interpolated comparison values may balance resource usage and refinement of the estimated mismatch value. The interpolator 510 may provide the interpolated mismatch value 538 to the shift refiner 511.
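The coarse-to-fine search described above can be sketched as follows. Here the comparison value is modeled as a callable over candidate shifts; in the encoder the fine-granularity values near the tentative peak would come from interpolation rather than direct evaluation, so this is a simplified sketch with illustrative names.

```python
# Illustrative sketch: evaluate comparison values on a coarse grid of shifts
# (step d, tied to the resampling factor), pick the best coarse shift, then
# refine on a step-1 grid restricted to shifts near that tentative value.

def coarse_to_fine_shift(comp, t_min, t_max, d):
    coarse = range(t_min, t_max + 1, d)
    tentative = max(coarse, key=comp)          # tentative mismatch value
    lo = max(t_min, tentative - d + 1)         # fine window around the peak
    hi = min(t_max, tentative + d - 1)
    return max(range(lo, hi + 1), key=comp)    # refined (interpolated) value
```

This evaluates far fewer candidates than a full step-1 sweep over [t_min, t_max], which is the resource/precision balance described above.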
According to one implementation, the interpolator 510 may retrieve interpolated mismatch/comparison values for previous frames and may modify the interpolated mismatch/comparison value 538 based on a long-term smoothing operation using the interpolated mismatch/comparison values for previous frames. For example, the interpolated mismatch/comparison value 538 may include a long-term interpolated mismatch/comparison value InterValLTN(k) based on the interpolated values of the current frame and of one or more preceding frames (e.g., InterValLTN(k)=(1−α)*InterValN(k)+α*InterValLTN−1(k), where α is a smoothing parameter).
The shift refiner 511 may generate an amended mismatch value 540 by refining the interpolated mismatch value 538. For example, the shift refiner 511 may determine whether the interpolated mismatch value 538 indicates that a change in a shift between the first audio signal 130 and the second audio signal 132 is greater than a shift change threshold. The change in the shift may be indicated by a difference between the interpolated mismatch value 538 and a first mismatch value associated with the frame 302 of
According to one implementation, the shift refiner may retrieve amended mismatch values for previous frames and may modify the amended mismatch value 540 based on a long-term smoothing operation using the amended mismatch values for previous frames. For example, the amended mismatch value 540 may include a long-term amended mismatch value AmendValLTN(k) based on the amended values of the current frame and of one or more preceding frames (e.g., AmendValLTN(k)=(1−α)*AmendValN(k)+α*AmendValLTN−1(k), where α is a smoothing parameter).
The shift change analyzer 512 may determine whether the amended mismatch value 540 indicates a switch or reverse in timing between the first audio signal 130 and the second audio signal 132, as described with reference to
The absolute shift generator 513 may generate the non-causal mismatch value 162 by applying an absolute function to the final mismatch value 116. The absolute shift generator 513 may provide the mismatch value 162 to the gain parameter generator 514.
The reference signal designator 508 may generate the reference signal indicator 164. For example, the reference signal indicator 164 may have a first value indicating that the first audio signal 130 is a reference signal or a second value indicating that the second audio signal 132 is the reference signal. The reference signal designator 508 may provide the reference signal indicator 164 to the gain parameter generator 514.
The reference signal designator 508 may further determine whether the final mismatch value 116 is equal to 0. For example, the reference signal designator 508 may, in response to determining that the final mismatch value 116 has the particular value (e.g., 0) indicating no time shift, leave the reference signal indicator 164 unchanged. To illustrate, the reference signal indicator 164 may indicate that the same audio signal (e.g., the first audio signal 130 or the second audio signal 132) is a reference signal associated with the frame 304 as with the frame 302.
The reference signal designator 508 may further determine that the final mismatch value 116 is non-zero, at 1202, and determine whether the final mismatch value 116 is greater than 0, at 1206. For example, the reference signal designator 508 may, in response to determining that the final mismatch value 116 has a particular value (e.g., a non-zero value) indicating a time shift, determine whether the final mismatch value 116 has a first value (e.g., a positive value) indicating that the second audio signal 132 is delayed relative to the first audio signal 130 or a second value (e.g., a negative value) indicating that the first audio signal 130 is delayed relative to the second audio signal 132.
The gain parameter generator 514 may select samples of the target signal (e.g., the second audio signal 132) based on the non-causal mismatch value 162. To illustrate, the gain parameter generator 514 may select the samples 358-364 in response to determining that the non-causal mismatch value 162 has a first value (e.g., +X ms or +Y samples, where X and Y include positive real numbers). The gain parameter generator 514 may select the samples 354-360 in response to determining that the non-causal mismatch value 162 has a second value (e.g., −X ms or −Y samples). The gain parameter generator 514 may select the samples 356-362 in response to determining that the non-causal mismatch value 162 has a value (e.g., 0) indicating no time shift.
The gain parameter generator 514 may determine whether the first audio signal 130 is the reference signal or the second audio signal 132 is the reference signal based on the reference signal indicator 164. The gain parameter generator 514 may generate the gain parameter 160 based on the samples 326-332 of the frame 304 and the selected samples (e.g., the samples 354-360, the samples 356-362, or the samples 358-364) of the second audio signal 132, as described with reference to
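The disclosure does not reproduce Equations 1a-1f, but one plausible energy-ratio formulation of the gain parameter can be sketched in Python (the function name and the square-root form are assumptions for illustration, not the actual equations):

```python
import math

def gain_parameter(ref_samples, targ_samples):
    """Illustrative gain parameter: ratio of reference-frame energy to
    target-frame energy. A value greater than 1 then corresponds to the
    reference channel carrying more energy than the target channel."""
    e_ref = sum(x * x for x in ref_samples)    # energy of reference frame
    e_targ = sum(x * x for x in targ_samples)  # energy of target frame
    return math.sqrt(e_ref / e_targ) if e_targ else 1.0
```

Under this sketch, gD > 1 corresponds to the reference (first) channel carrying more energy, consistent with the leading-channel interpretation used elsewhere in this description.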
The gain parameter generator 514 may provide the gain parameter 160, the reference signal indicator 164, the non-causal mismatch value 162, or a combination thereof, to the signal generator 516. The signal generator 516 may generate the encoded signals 102, as described with reference to
The temporal equalizer 108 may store the first resampled signal 530, the second resampled signal 532, the comparison values 534, the tentative mismatch value 536, the interpolated mismatch value 538, the amended mismatch value 540, the non-causal mismatch value 162, the reference signal indicator 164, the final mismatch value 116, the gain parameter 160, the first encoded signal frame 564, the second encoded signal frame 566, or a combination thereof, in the memory 153. For example, the analysis data 190 may include the first resampled signal 530, the second resampled signal 532, the comparison values 534, the tentative mismatch value 536, the interpolated mismatch value 538, the amended mismatch value 540, the non-causal mismatch value 162, the reference signal indicator 164, the final mismatch value 116, the gain parameter 160, the first encoded signal frame 564, the second encoded signal frame 566, or a combination thereof.
The smoothing techniques described above may substantially normalize the shift estimate between voiced frames, unvoiced frames, and transition frames. Normalized shift estimates may reduce sample repetition and artifact skipping at frame boundaries. Additionally, normalized shift estimates may result in reduced side channel energies, which may improve coding efficiency.
Referring to
The memory 153 may store a plurality of mismatch values 660. The mismatch values 660 may include a first mismatch value 664 (e.g., −X ms or −Y samples, where X and Y include positive real numbers), a second mismatch value 666 (e.g., +X ms or +Y samples, where X and Y include positive real numbers), or both. The mismatch values 660 may range from a lower mismatch value (e.g., a minimum mismatch value, T_MIN) to a higher mismatch value (e.g., a maximum mismatch value, T_MAX). The mismatch values 660 may indicate an expected temporal shift (e.g., a maximum expected temporal shift) between the first audio signal 130 and the second audio signal 132.
During operation, the signal comparator 506 may determine the comparison values 534 based on the first samples 620 and the mismatch values 660 applied to the second samples 650. For example, the samples 626-632 may correspond to a first time (t). To illustrate, the input interface(s) 112 of
The samples 654-660 may correspond to the second time (t−1). For example, the input interface(s) 112 may receive the samples 654-660 at approximately the second time (t−1). The signal comparator 506 may determine a first comparison value 614 (e.g., a difference value or a cross-correlation value) corresponding to the first mismatch value 664 based on the samples 626-632 and the samples 654-660. For example, the first comparison value 614 may correspond to an absolute value of cross-correlation of the samples 626-632 and the samples 654-660. As another example, the first comparison value 614 may indicate a difference between the samples 626-632 and the samples 654-660.
The second mismatch value 666 (e.g., +X ms or +Y samples, where X and Y include positive real numbers) may correspond to a third time (t+1). The samples 658-664 may correspond to the third time (t+1). For example, the input interface(s) 112 may receive the samples 658-664 at approximately the third time (t+1). The signal comparator 506 may determine a second comparison value 616 (e.g., a difference value or a cross-correlation value) corresponding to the second mismatch value 666 based on the samples 626-632 and the samples 658-664. For example, the second comparison value 616 may correspond to an absolute value of cross-correlation of the samples 626-632 and the samples 658-664. As another example, the second comparison value 616 may indicate a difference between the samples 626-632 and the samples 658-664. The signal comparator 506 may store the comparison values 534 in the memory 153. For example, the analysis data 190 may include the comparison values 534.
The signal comparator 506 may identify a selected comparison value 636 of the comparison values 534 that has a higher (or lower) value than other values of the comparison values 534. For example, the signal comparator 506 may select the second comparison value 616 as the selected comparison value 636 in response to determining that the second comparison value 616 is greater than or equal to the first comparison value 614. In some implementations, the comparison values 534 may correspond to cross-correlation values. The signal comparator 506 may, in response to determining that the second comparison value 616 is greater than the first comparison value 614, determine that the samples 626-632 have a higher correlation with the samples 658-664 than with the samples 654-660. The signal comparator 506 may select the second comparison value 616 that indicates the higher correlation as the selected comparison value 636. In other implementations, the comparison values 534 may correspond to difference values. The signal comparator 506 may, in response to determining that the second comparison value 616 is lower than the first comparison value 614, determine that the samples 626-632 have a greater similarity with (e.g., a lower difference to) the samples 658-664 than with the samples 654-660. The signal comparator 506 may select the second comparison value 616 that indicates a lower difference as the selected comparison value 636.
The selected comparison value 636 may indicate a higher correlation (or a lower difference) than the other values of the comparison values 534. The signal comparator 506 may identify the tentative mismatch value 536 of the mismatch values 660 that corresponds to the selected comparison value 636. For example, the signal comparator 506 may identify the second mismatch value 666 as the tentative mismatch value 536 in response to determining that the second mismatch value 666 corresponds to the selected comparison value 636 (e.g., the second comparison value 616).
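The comparison-value search and the selection of the tentative mismatch value can be sketched as follows (a simplified Python illustration; the zero-padded overlap handling and the names are assumptions, and a difference-based measure would use a minimum instead of a maximum):

```python
def comparison_values(ref, targ, t_min, t_max):
    """Absolute cross-correlation of the reference frame against the
    target frame for each candidate mismatch value k in [t_min, t_max]."""
    vals = {}
    for k in range(t_min, t_max + 1):
        acc = 0.0
        for i in range(len(ref)):
            j = i + k
            if 0 <= j < len(targ):   # out-of-range samples treated as zero
                acc += ref[i] * targ[j]
        vals[k] = abs(acc)
    return vals

def tentative_mismatch(vals):
    """Pick the mismatch value with the highest comparison value (highest
    correlation); for difference values, min() would be used instead."""
    return max(vals, key=vals.get)
```

For a target signal delayed by two samples relative to the reference, the selected mismatch value is 2, matching the highest-correlation lag.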
Referring to
The reference channel (“Ref(n)”) 701 may correspond to a first audio signal 130 and may include a plurality of reference frames including a frame N 710 of the reference channel 701. The target channel (“Targ(n)”) 702 may correspond to a second audio signal 132 and may include a plurality of target frames including a frame N 720 of the target channel 702. The encoder 114 or temporal equalizer 108 may estimate comparison values 730 for the frame N 710 of the reference channel 701 and for the frame N 720 of the target channel 702. Each comparison value may be indicative of an amount of temporal mismatch, or a measure of the similarity or dissimilarity, between the reference frame N 710 of the reference channel 701 and a corresponding target frame N 720 of the target channel 702. In some implementations, cross-correlation values between the reference frame and the target frame may be used to measure the similarity of the two frames as a function of the lag of one frame relative to the other. For example, the comparison values for frame N (CompValN(k)) 735 may be the cross-correlation values between the frame N 710 of the reference channel and the frame N 720 of the target channel.
The encoder 114 or temporal equalizer 108 may smooth the comparison values to generate short-term smoothed comparison values. The short-term smoothed comparison values (e.g., CompValST
In alternative implementations, a non-uniform weighting may be applied to the plurality of comparison values for the frame N and previous frames.
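One way to realize the short-term smoothing, under the assumption that it averages the comparison values of the current frame and a few previous frames with optionally non-uniform weights, is:

```python
def short_term_smooth(history, weights=None):
    """Short-term smoothed comparison values: a weighted average over the
    comparison values of the current frame and a few previous frames.
    `history` is a list of per-frame dicts {mismatch_value: comparison_value},
    oldest first, most recent last."""
    if weights is None:
        weights = [1.0 / len(history)] * len(history)  # uniform weighting
    keys = history[-1].keys()
    return {k: sum(w * frame[k] for w, frame in zip(weights, history))
            for k in keys}
```

A non-uniform weight vector (e.g., larger weight on the current frame) implements the alternative weighting mentioned above.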
The encoder 114 or temporal equalizer 108 may smooth the comparison values to generate first long-term smoothed comparison values 755 for the frame N based on a smoothing parameter. The smoothing may be performed such that first long-term smoothed comparison values CompValLT
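A first-order recursive form consistent with this description (the exact recursion used by the encoder is not reproduced here; the symbol alpha plays the role of the smoothing parameter) might look like:

```python
def long_term_smooth(comp_vals, prev_long_term, alpha=0.8):
    """First-order recursive (IIR) smoothing: each long-term value depends
    on the current frame's comparison value and, through the recursion
    itself, on the long-term values of previous frames:
        CompValLT_N(k) = (1 - alpha) * CompValN(k) + alpha * CompValLT_{N-1}(k)
    """
    if prev_long_term is None:  # first frame: nothing to smooth against
        return dict(comp_vals)
    return {k: (1.0 - alpha) * comp_vals[k] + alpha * prev_long_term[k]
            for k in comp_vals}
```

Larger alpha values give the history more weight, which corresponds to more smoothing.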
The encoder 114 or temporal equalizer 108 may calculate a cross-correlation value of the comparison values and the short-term smoothed comparison values. For example, the encoder 114 or temporal equalizer 108 may calculate a cross-correlation value (CrossCorr_CompValN) 765 of the comparison values CompValN(k) 735 for the frame N (e.g., the frames 710, 720) and short-term smoothed comparison values CompValST
Alternatively, the encoder 114 or temporal equalizer 108 may calculate a cross-correlation value of the short-term and the long-term smoothed comparison values. In some implementations, the cross-correlation value (CrossCorr_CompValN) 765 of the short-term smoothed comparison values CompValST
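The cross-correlation between two sets of comparison values can be illustrated as a zero-lag normalized inner product (one plausible reading of CrossCorr_CompValN; the normalization is an assumption):

```python
import math

def vectors_cross_correlation(a, b):
    """Zero-lag normalized cross-correlation between two comparison-value
    vectors (e.g., instantaneous vs. short-term smoothed values, or
    short-term vs. long-term smoothed values). Values near 1 indicate the
    two sets of comparison values have a very similar shape."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0
```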
The encoder 114 or temporal equalizer 108 may compare the cross-correlation value of the comparison values (CrossCorr_CompValN) 765 with a threshold, and may adjust all or part of the first long-term smoothed comparison values 755. In some implementations, the encoder 114 or temporal equalizer 108 may increase (or boost or bias) certain values of a subset of the first long-term smoothed comparison values 755 in response to determining that the cross-correlation value of the comparison values (CrossCorr_CompValN) 765 exceeds the threshold. For example, when the cross-correlation value of the comparison values (CrossCorr_CompValN) is greater than or equal to a threshold (e.g., 0.8), the cross-correlation between comparison values is quite strong or high, indicating small or no variations of temporal shift values between adjacent frames. Thus, the estimated temporal shift value of the current frame (e.g., frame N) is unlikely to deviate significantly from the temporal shift values of the previous frame (e.g., frame N−1) or of any other previous frames. The temporal shift values may be one of a tentative mismatch value 536, an interpolated mismatch value 538, an amended mismatch value 540, a final mismatch value 116, or a non-causal mismatch value 162. Therefore, the encoder 114 or temporal equalizer 108 may increase (or boost or bias) certain values of a subset of the first long-term smoothed comparison values 755, for example, by a factor of 1.2 (a 20% boost or increase) to generate second long-term smoothed comparison values. This boosting or biasing may be implemented by multiplying by a scaling factor or by adding an offset to the values within the subset of the first long-term smoothed comparison values 755.
In some implementations, the encoder 114 or temporal equalizer 108 may boost or bias the subset of the first long-term smoothed comparison values 755 such that the subset includes an index corresponding to the temporal shift value of the previous frame (e.g., frame N−1). Additionally or alternatively, the subset may further include indices in the vicinity of the temporal shift value of the previous frame (e.g., frame N−1). For example, the vicinity may mean within −delta and +delta of the temporal shift value of the previous frame (e.g., frame N−1), where delta is in the range of 1-5 samples in a preferred embodiment.
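The boosting of the subset of indices around the previous frame's shift value might be sketched as follows (the delta window and the multiplicative 1.2 factor are the example values from the description; the dictionary representation is an assumption):

```python
def boost_around_previous_shift(long_term, prev_shift, delta=3, factor=1.2):
    """Bias the long-term smoothed comparison values at indices within
    +/- delta of the previous frame's temporal shift value by a scaling
    factor (a 20% boost for factor=1.2). An additive offset would work
    similarly."""
    return {k: (v * factor
                if prev_shift - delta <= k <= prev_shift + delta else v)
            for k, v in long_term.items()}
```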
Referring to
The x-axis of the graphs 830, 840, 850, 860 represents shift values ranging from negative to positive, and the y-axis of the graphs 830, 840, 850, 860 represents comparison values (e.g., cross-correlation values). In some implementations, the y-axis of the graphs 830, 840, 850, 860 in the example 800 may illustrate the long-term smoothed comparison values CompValLT
The example 800 illustrates cases showing that a subset of the long-term smoothed comparison values (e.g., the first long-term smoothed comparison values CompValLT
The Case #1 in
The Case #3 in
Four cases in
Referring to
The method 900 includes calculating a gain parameter (gD) for a previous frame (e.g., frame N−1), at 910. The gain parameter in 900 may be a gain parameter 160 in
The encoder 114 or temporal equalizer 108 may compare the gain parameter with a threshold (e.g., Thr1 or Thr2), at 920, 950.
When the gain parameter 160 (gD), which is calculated based on one or more of Equations 1a-1f, is greater than 1, it may mean that the first audio signal 130 (or left channel) is a leading channel (“a reference channel”) and thus it is more likely that shift values (“temporal shift values”) would be positive values. The temporal shift values may be one of a tentative mismatch value 536, an interpolated mismatch value 538, an amended mismatch value 540, a final mismatch value 116, or a non-causal mismatch value 162. Therefore, the likelihood of determining a correct non-causal shift value may be advantageously improved by emphasizing (or increasing or boosting or biasing) the values in the positive shift side and/or by deemphasizing (or decreasing) the values in the negative shift side.
When the gain parameter 160 (gD), which is calculated based on one or more of Equations 1a-1f, is less than 1, it may mean that the second audio signal 132 (or right channel) is a leading channel (“a reference channel”) and thus it is more likely that shift values (“temporal shift values”) would be negative values. In that case, the likelihood of determining a correct non-causal shift value may be advantageously improved by emphasizing (or increasing or boosting or biasing) the values in the negative shift side and/or deemphasizing (or decreasing) the values in the positive shift side.
In some implementations, the encoder 114 or temporal equalizer 108 may compare the gain parameter 160 (gD) with a first threshold (e.g., Thr1=1.2) or another threshold (e.g., Thr2=0.8). For illustration purpose,
The encoder 114 or temporal equalizer 108 may adjust a first subset of the first long-term smoothed comparison values to generate second long-term smoothed comparison values, in response to the comparison result. For example, when the gain parameter 160 (gD) is greater than a first threshold (e.g., Thr1=1.2), the method 900 may adjust a subset of the first long-term smoothed comparison values by at least one of emphasizing the positive shift side (e.g., Case #2, at 830 and 930) and deemphasizing the negative shift side (e.g., Case #3, at 840 and 940) to avoid spurious jumps in signs (positive or negative) of temporal shift values between adjacent frames. In some implementations, both Case #2 (e.g., positive shift side emphasis) and Case #3 (negative shift side deemphasis) may be executed, in any order between them. Alternatively, when Case #2 (e.g., positive shift side emphasis) is selected to emphasize the positive shift side, the values of the other side (e.g., the negative side) may be zeroed out, instead of executing Case #3, to reduce the risk of detecting an incorrect sign of temporal shift values.
Additionally, when the gain parameter 160 (gD) is less than a second threshold (e.g., Thr2=0.8), the method 900 may adjust a subset of the first long-term smoothed comparison values by at least one of emphasizing the negative shift side (e.g., Case #1, at 860 and 960) and deemphasizing the positive shift side (e.g., Case #4, at 870 and 970) to avoid spurious jumps in signs (positive or negative) of temporal shift values between adjacent frames. In some implementations, both Case #1 (e.g., negative shift side emphasis) and Case #4 (positive shift side deemphasis) may be executed, in any order between them. Alternatively, when Case #1 (e.g., negative shift side emphasis) is selected to emphasize the negative shift side, the values of the other side (e.g., the positive side) may be zeroed out, instead of executing Case #4, to reduce the risk of detecting an incorrect sign of temporal shift values.
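The gain-dependent emphasis and deemphasis of the two shift sides might be sketched as follows; the thresholds and boost factor are the example values given above, while the `zero_other` option mirrors the alternative of zeroing out the opposite side:

```python
def bias_by_gain(long_term, g_d, thr1=1.2, thr2=0.8, factor=1.2,
                 zero_other=False):
    """If the gain parameter suggests the left channel leads (g_d > Thr1),
    emphasize positive-shift values (optionally zeroing the negative side);
    if the right channel leads (g_d < Thr2), do the mirror image."""
    out = dict(long_term)
    if g_d > thr1:       # left channel leading: favor positive shifts
        for k in out:
            if k > 0:
                out[k] *= factor
            elif k < 0 and zero_other:
                out[k] = 0.0
    elif g_d < thr2:     # right channel leading: favor negative shifts
        for k in out:
            if k < 0:
                out[k] *= factor
            elif k > 0 and zero_other:
                out[k] = 0.0
    return out
```

Gain values between the two thresholds leave the comparison values untouched, which matches the hysteresis band between Thr2 and Thr1.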
Although the method 900 shows that an adjustment may be performed, based on the gain parameter 160 (gD), on values of a subset of the first long-term smoothed comparison values, the adjustment alternatively may be performed on either instantaneous comparison values or values of a subset of the short-term smoothed comparison values. In some implementations, adjusting values may be performed using a smooth window (e.g., a smooth scaling window) over multiple lag values. In other implementations, the length of the smooth window may be adaptively changed, for example based on the value of the cross-correlation of comparison values. For example, the encoder 114 or temporal equalizer 108 may adjust the length of the smooth window based on a cross-correlation value (CrossCorr_CompValN) 765 of instantaneous comparison values CompValN(k) 735 for the frame N (e.g., the frames 710, 720) and short-term smoothed comparison values CompValST
Referring to
The cross-correlation represented in each graph 1002, 1004, 1006 may be substantially different. For example, the graph 1002 illustrates that a peak cross-correlation between a voiced frame captured by the first microphone 146 of
According to
Referring to
The method 1100 includes estimating comparison values at an encoder, at 1110. Each comparison value may be indicative of an amount of temporal mismatch, or a measure of the similarity or dissimilarity, between a first reference frame of a reference channel and a corresponding first target frame of a target channel. In some implementations, a cross-correlation function between the reference frame and the target frame may be used to measure the similarity of the two frames as a function of the lag of one frame relative to the other. For example, referring to
The method 1100 includes smoothing the comparison values to generate short-term smoothed comparison values, at 1115. For example, the encoder 114 or temporal equalizer 108 may smooth the comparison values to generate short-term smoothed comparison values. The short-term smoothed comparison values (e.g., CompValST
In some implementations, a non-uniform weighting may be applied to the plurality of comparison values for the current and previous frames. In other implementations, the short-term comparison values may be the same as the comparison values generated in the frame being processed (CompValN(k)).
The method 1100 includes smoothing the comparison values to generate first long-term smoothed comparison values based on a smoothing parameter, at 1120. For example, the encoder 114 or temporal equalizer 108 may smooth the comparison values to generate smoothed comparison values based on historical comparison value data and a smoothing parameter. The smoothing may be performed such that long-term smoothed comparison values CompValLT
According to one implementation, the smoothing parameter may be adaptive. For example, the method 1100 may include adapting the smoothing parameter based on a correlation of short-term smoothed comparison values to long-term smoothed comparison values. As the value of α increases, the amount of smoothing in the long-term smoothed comparison value increases. A value of the smoothing parameter (α) may be adjusted based on short-term energy indicators of input channels and long-term energy indicators of the input channels. Additionally, the value of the smoothing parameter (α) may be reduced if the short-term energy indicators are greater than the long-term energy indicators. According to another implementation, a value of the smoothing parameter (α) is adjusted based on a correlation of short-term smoothed comparison values to long-term smoothed comparison values. Additionally, the value of the smoothing parameter (α) may be increased if the correlation exceeds a threshold. According to another implementation, the comparison values may be cross-correlation values of down-sampled reference channels and corresponding down-sampled target channels.
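The adaptive behavior of the smoothing parameter described in this paragraph might be sketched as follows (the step size and bounds are illustrative assumptions, not values from the description):

```python
def adapt_alpha(alpha, short_energy, long_energy, st_lt_corr,
                corr_threshold=0.8, step=0.1, lo=0.0, hi=0.99):
    """Adapt the smoothing parameter alpha: reduce it (less smoothing)
    when short-term energy exceeds long-term energy (the signal is
    changing), and increase it when short-term and long-term smoothed
    comparison values are highly correlated (the shift estimate is
    stable)."""
    if short_energy > long_energy:
        alpha = max(lo, alpha - step)
    if st_lt_corr > corr_threshold:
        alpha = min(hi, alpha + step)
    return alpha
```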
The method 1100 includes calculating a cross-correlation value between the comparison values and the short-term smoothed comparison values, at 1125. For example, the encoder 114 or temporal equalizer 108 may calculate a cross-correlation value of the comparison values (CrossCorr_CompValN) 765 between the comparison values for a single frame (“instantaneous comparison values” CompValN(k)) 735 and short-term smoothed comparison values (CompValST
In alternative implementations, the method 1100 may include calculating a cross-correlation value between the short-term smoothed comparison values and the long-term smoothed comparison values, at 1125. For example, the encoder 114 or temporal equalizer 108 may calculate a cross-correlation value of the comparison values (CrossCorr_CompValN) 765 between short-term smoothed comparison values (CompValST
The method 1100 includes comparing the cross-correlation value with a threshold, at 1130. For example, the encoder 114 or temporal equalizer 108 may compare the cross-correlation value (CrossCorr_CompValN) 765 with a threshold. The method 1100 also includes adjusting the first long-term smoothed comparison values to generate second long-term smoothed comparison values, in response to a determination that the cross-correlation value exceeds the threshold, at 1135. For example, the encoder 114 or temporal equalizer 108 may adjust all or part of the first long-term smoothed comparison values 755 based on the comparison result. In some implementations, the encoder 114 or temporal equalizer 108 may increase (or boost or bias) certain values of a subset of the first long-term smoothed comparison values 755 in response to the determination that the cross-correlation value of the comparison values (CrossCorr_CompValN) 765 exceeds the threshold. For example, when the cross-correlation value of the comparison values (CrossCorr_CompValN) is greater than or equal to a threshold (e.g., 0.8), the cross-correlation between comparison values is quite strong or high, indicating small or no variations of temporal shift values between adjacent frames. Thus, the estimated temporal shift value of the current frame (e.g., frame N) is unlikely to deviate significantly from the temporal shift values of the previous frame (e.g., frame N−1) or of any other previous frames. The temporal shift values may be one of a tentative mismatch value 536, an interpolated mismatch value 538, an amended mismatch value 540, a final mismatch value 116, or a non-causal mismatch value 162. Therefore, the encoder 114 or temporal equalizer 108 may increase (or boost or bias) certain values of a subset of the first long-term smoothed comparison values 755, for example, by a factor of 1.2 (a 20% boost or increase) to generate second long-term smoothed comparison values.
This boosting or biasing may be implemented by multiplying by a scaling factor or by adding an offset to the values within the subset of the first long-term smoothed comparison values 755. In some implementations, the encoder 114 or temporal equalizer 108 may boost or bias the subset of the first long-term smoothed comparison values 755 such that the subset includes an index corresponding to the temporal shift value of the previous frame (e.g., frame N−1). Additionally or alternatively, the subset may further include indices in the vicinity of the temporal shift value of the previous frame (e.g., frame N−1). For example, the vicinity may mean within −delta and +delta of the temporal shift value of the previous frame (e.g., frame N−1), where delta is in the range of 1-5 samples in a preferred embodiment.
The method 1100 includes estimating a tentative shift value based on the second long-term smoothed comparison values, at 1140. For example, the encoder 114 or temporal equalizer 108 may estimate a tentative shift value 536 based on the second long-term smoothed comparison values. The method 1100 also includes determining a non-causal shift value based on the tentative shift value, at 1145. For example, the encoder 114 or temporal equalizer 108 may determine a non-causal shift value (e.g., the non-causal mismatch value 162) based at least in part on the tentative shift value (e.g., the tentative mismatch value 536, the interpolated mismatch value 538, the amended mismatch value 540, or final mismatch value 116).
The method 1100 includes non-causally shifting a particular target channel by the non-causal shift value to generate an adjusted particular target channel that is temporally aligned with a particular reference channel, at 1150. For example, the encoder 114 or temporal equalizer 108 may non-causally shift the target channel by the non-causal shift value (e.g., the non-causal mismatch value 162) to generate an adjusted target channel that is temporally aligned with the reference channel. The method 1100 also includes generating at least one of a mid-band channel or a side-band channel based on the particular reference channel and the adjusted particular target channel, at 1155. For example, referring to
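The final alignment and downmix steps may be illustrated as follows (a simplified sketch: the non-causal shift is modeled as advancing a delayed target channel by a non-negative shift, and the 0.5 scale factor often applied to mid/side channels is omitted):

```python
def shift_and_downmix(ref, targ, shift):
    """Non-causally shift the target channel by `shift` samples (advance
    it) to temporally align it with the reference channel, then form
    mid = ref + adjusted target and side = ref - adjusted target."""
    adjusted = targ[shift:shift + len(ref)]  # assumes shift >= 0 (target delayed)
    mid = [r + t for r, t in zip(ref, adjusted)]
    side = [r - t for r, t in zip(ref, adjusted)]
    return mid, side
```

For a perfectly aligned pair the side channel is all zeros, which is the reduced side-channel energy that motivates the alignment in the first place.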
Referring to
The method 1200 includes estimating comparison values at an encoder, at 1210. For example, the method at 1210 may be similar to the method at 1110, as described with reference to
The method 1200 includes calculating a gain parameter from a previous reference frame of a reference channel and a corresponding previous target frame of a target channel, at 1225. In some implementations, the gain parameter from the previous frame may be based on an energy of the previous reference frame and an energy of the previous target frame. In some implementations, the encoder 114 or temporal equalizer 108 may generate or calculate the gain parameter 160 (e.g., a codec gain parameter or target gain) based on samples of the target channel and based on samples of the reference channel. For example, the temporal equalizer 108 may select samples of the second audio signal 132 based on the non-causal mismatch value 162. Alternatively, the temporal equalizer 108 may select samples of the second audio signal 132 independent of the non-causal mismatch value 162. The temporal equalizer 108 may, in response to determining that the first audio signal 130 is the reference channel, determine the gain parameter 160 of the selected samples based on the first samples of the first frame 131 of the first audio signal 130. Alternatively, the temporal equalizer 108 may, in response to determining that the second audio signal 132 is the reference channel, determine the gain parameter 160 based on an energy of a reference frame of the reference channel and an energy of a target frame of the target channel. As an example, the gain parameter 160 may be calculated or generated based on one or more of the Equations 1a, 1b, 1c, 1d, 1e, or 1f. In some implementations, the gain parameter 160 (gD) may be modified or smoothed over a plurality of frames by any known smoothing algorithms or alternatively by hysteresis to avoid large jumps in gain between frames.
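The hysteresis-style smoothing of the gain parameter across frames mentioned above might be sketched as a simple step-limited update (the bound is an illustrative assumption; any conventional smoothing filter would serve the same purpose):

```python
def smooth_gain(prev_gain, new_gain, max_step=0.1):
    """Limit the frame-to-frame change of the gain parameter to avoid
    large jumps in gain between frames."""
    if new_gain > prev_gain + max_step:
        return prev_gain + max_step
    if new_gain < prev_gain - max_step:
        return prev_gain - max_step
    return new_gain
```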
The method 1200 also includes comparing the gain parameter with a first threshold, at 1230. For example, the encoder 114 or temporal equalizer 108 may compare the gain parameter with a first threshold (e.g., Thr1 or Thr2), at 1230. When the gain parameter 160 (gD), based on one or more of the Equations 1a-1f, is greater than 1, it may indicate that the first audio signal 130 (or left channel) is a leading channel (“a reference channel”) and thus it is more likely that shift values (“temporal shift values”) would be positive values. The temporal shift values may be one of a tentative mismatch value 536, an interpolated mismatch value 538, an amended mismatch value 540, a final mismatch value 116, or a non-causal mismatch value 162. Therefore, it may be advantageous to emphasize (or increase or boost or bias) the values in the positive shift side and/or deemphasize (or decrease) the values in the negative shift side. In some implementations, the encoder 114 or temporal equalizer 108 may compare the gain parameter 160 (gD) with a first threshold (e.g., Thr1=1.2) or another threshold (e.g., Thr2=0.8), as described with reference to
The method 1200 also includes adjusting a first subset of the first long-term smoothed comparison values, in response to the comparison result, to generate second long-term smoothed comparison values, at 1235. For example, the encoder 114 or temporal equalizer 108 may adjust a first subset of the first long-term smoothed comparison values CompValLT
Returning to
Four cases in
The method 1200 includes estimating a tentative shift value based on the second long-term smoothed comparison values, at 1240. For example, the method at 1240 may be similar to the method at 1140, as described with reference to
Referring to
In a particular embodiment, the device 1300 includes a processor 1306 (e.g., a central processing unit (CPU)). The device 1300 may include one or more additional processors 1310 (e.g., one or more digital signal processors (DSPs)). The processors 1310 may include a media (e.g., speech and music) coder-decoder (CODEC) 1308, and an echo canceller 1312. The media CODEC 1308 may include the decoder 118, the encoder 114, or both, of
The device 1300 may include a memory 153 and a CODEC 1334. Although the media CODEC 1308 is illustrated as a component of the processors 1310 (e.g., dedicated circuitry and/or executable programming code), in other embodiments one or more components of the media CODEC 1308, such as the decoder 118, the encoder 114, or both, may be included in the processor 1306, the CODEC 1334, another processing component, or a combination thereof.
The device 1300 may include the transmitter 110 coupled to an antenna 1342. The device 1300 may include a display 1328 coupled to a display controller 1326. One or more speakers 1348 may be coupled to the CODEC 1334. One or more microphones 1346 may be coupled, via the input interface(s) 112, to the CODEC 1334. In a particular implementation, the speakers 1348 may include the first loudspeaker 142, the second loudspeaker 144 of
The memory 153 may include instructions 1360 executable by the processor 1306, the processors 1310, the CODEC 1334, another processing unit of the device 1300, or a combination thereof, to perform one or more operations described with reference to
One or more components of the device 1300 may be implemented via dedicated hardware (e.g., circuitry), by a processor executing instructions to perform one or more tasks, or a combination thereof. As an example, the memory 153 or one or more components of the processor 1306, the processors 1310, and/or the CODEC 1334 may be a memory device, such as a random access memory (RAM), magneto-resistive random access memory (MRAM), spin-torque transfer MRAM (STT-MRAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, or a compact disc read-only memory (CD-ROM). The memory device may include instructions (e.g., the instructions 1360) that, when executed by a computer (e.g., a processor in the CODEC 1334, the processor 1306, and/or the processors 1310), may cause the computer to perform one or more operations described with reference to
In a particular embodiment, the device 1300 may be included in a system-in-package or system-on-chip device (e.g., a mobile station modem (MSM)) 1322. In a particular embodiment, the processor 1306, the processors 1310, the display controller 1326, the memory 153, the CODEC 1334, and the transmitter 110 are included in a system-in-package or the system-on-chip device 1322. In a particular embodiment, an input device 1330, such as a touchscreen and/or keypad, and a power supply 1344 are coupled to the system-on-chip device 1322. Moreover, in a particular embodiment, as illustrated in
The device 1300 may include a wireless telephone, a mobile communication device, a mobile phone, a smart phone, a cellular phone, a laptop computer, a desktop computer, a computer, a tablet computer, a set top box, a personal digital assistant (PDA), a display device, a television, a gaming console, a music player, a radio, a video player, an entertainment unit, a communication device, a fixed location data unit, a personal media player, a digital video player, a digital video disc (DVD) player, a tuner, a camera, a navigation device, a decoder system, an encoder system, or any combination thereof.
In a particular implementation, one or more components of the systems described herein and the device 1300 may be integrated into a decoding system or apparatus (e.g., an electronic device, a CODEC, or a processor therein), into an encoding system or apparatus, or both. In other implementations, one or more components of the systems described herein and the device 1300 may be integrated into a wireless telephone, a tablet computer, a desktop computer, a laptop computer, a set top box, a music player, a video player, an entertainment unit, a television, a game console, a navigation device, a communication device, a personal digital assistant (PDA), a fixed location data unit, a personal media player, or another type of device.
It should be noted that various functions performed by the one or more components of the systems described herein and the device 1300 are described as being performed by certain components or modules. This division of components and modules is for illustration only. In an alternate implementation, a function performed by a particular component or module may be divided amongst multiple components or modules. Moreover, in an alternate implementation, two or more components or modules of the systems described herein may be integrated into a single component or module. Each component or module illustrated in systems described herein may be implemented using hardware (e.g., a field-programmable gate array (FPGA) device, an application-specific integrated circuit (ASIC), a DSP, a controller, etc.), software (e.g., instructions executable by a processor), or any combination thereof.
In conjunction with the described implementations, an apparatus includes means for capturing a reference channel. The reference channel may include a reference frame. For example, the means for capturing the reference channel may include the first microphone 146 of
The apparatus may also include means for capturing a target channel. The target channel may include a target frame. For example, the means for capturing the target channel may include the second microphone 148 of
The apparatus may also include means for estimating a delay between the reference frame and the target frame. For example, the means for estimating the delay may include the temporal equalizer 108, the encoder 114, the first device 104 of
The apparatus may also include means for estimating a temporal offset between the reference channel and the target channel based on the delay and based on historical delay data. For example, the means for estimating the temporal offset may include the temporal equalizer 108, the encoder 114, the first device 104 of
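The estimation algorithm itself is not specified here; as an illustrative, non-limiting sketch (the cross-correlation search and the smoothing factor `alpha` are assumptions for illustration, not the claimed method), the two estimation steps might look like:

```python
import numpy as np

def estimate_delay(reference_frame, target_frame, max_shift=32):
    """Estimate the delay (in samples) of target_frame relative to
    reference_frame by locating the cross-correlation peak over a
    bounded range of candidate shifts."""
    best_shift, best_corr = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        shifted = np.roll(target_frame, -shift)  # undo a candidate delay
        corr = np.dot(reference_frame, shifted)
        if corr > best_corr:
            best_corr, best_shift = corr, shift
    return best_shift

def estimate_temporal_offset(delay, historical_delays, alpha=0.8):
    """Smooth the instantaneous delay estimate with historical delay
    data to obtain a stable temporal offset estimate."""
    if not historical_delays:
        return delay
    smoothed = alpha * np.mean(historical_delays) + (1 - alpha) * delay
    return int(round(smoothed))
```

In this sketch, weighting past delays heavily (`alpha` close to 1) keeps the offset estimate from jumping frame to frame when the instantaneous correlation peak is noisy.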
Referring to
The base station 1400 may be part of a wireless communication system. The wireless communication system may include multiple base stations and multiple wireless devices. The wireless communication system may be a Long Term Evolution (LTE) system, a Code Division Multiple Access (CDMA) system, a Global System for Mobile Communications (GSM) system, a wireless local area network (WLAN) system, or some other wireless system. A CDMA system may implement Wideband CDMA (WCDMA), CDMA 1×, Evolution-Data Optimized (EVDO), Time Division Synchronous CDMA (TD-SCDMA), or some other version of CDMA.
The wireless devices may also be referred to as user equipment (UE), a mobile station, a terminal, an access terminal, a subscriber unit, a station, etc. The wireless devices may include a cellular phone, a smartphone, a tablet, a wireless modem, a personal digital assistant (PDA), a handheld device, a laptop computer, a smartbook, a netbook, a cordless phone, a wireless local loop (WLL) station, a Bluetooth device, etc. The wireless devices may include or correspond to the device 1300 of
Various functions may be performed by one or more components of the base station 1400 (and/or in other components not shown), such as sending and receiving messages and data (e.g., audio data). In a particular example, the base station 1400 includes a processor 1406 (e.g., a CPU). The base station 1400 may include a transcoder 1410. The transcoder 1410 may include an audio CODEC 1408. For example, the transcoder 1410 may include one or more components (e.g., circuitry) configured to perform operations of the audio CODEC 1408. As another example, the transcoder 1410 may be configured to execute one or more computer-readable instructions to perform the operations of the audio CODEC 1408. Although the audio CODEC 1408 is illustrated as a component of the transcoder 1410, in other examples one or more components of the audio CODEC 1408 may be included in the processor 1406, another processing component, or a combination thereof. For example, a decoder 1438 (e.g., a vocoder decoder) may be included in a receiver data processor 1464. As another example, an encoder 1436 (e.g., a vocoder encoder) may be included in a transmission data processor 1482.
The transcoder 1410 may function to transcode messages and data between two or more networks. The transcoder 1410 may be configured to convert message and audio data from a first format (e.g., a digital format) to a second format. To illustrate, the decoder 1438 may decode encoded signals having a first format and the encoder 1436 may encode the decoded signals into encoded signals having a second format. Additionally, or alternatively, the transcoder 1410 may be configured to perform data rate adaptation. For example, the transcoder 1410 may down-convert a data rate or up-convert the data rate without changing a format of the audio data. To illustrate, the transcoder 1410 may down-convert 64 kbit/s signals into 16 kbit/s signals.
The audio CODEC 1408 may include the encoder 1436 and the decoder 1438. The encoder 1436 may include the encoder 114 of
The base station 1400 may include a memory 1432. The memory 1432, such as a computer-readable storage device, may include instructions. The instructions may include one or more instructions that are executable by the processor 1406, the transcoder 1410, or a combination thereof, to perform one or more operations described with reference to the methods and systems of
The base station 1400 may include a network connection 1460, such as a backhaul connection. The network connection 1460 may be configured to communicate with a core network or one or more base stations of the wireless communication network. For example, the base station 1400 may receive a second data stream (e.g., messages or audio data) from a core network via the network connection 1460. The base station 1400 may process the second data stream to generate messages or audio data and provide the messages or the audio data to one or more wireless devices via one or more antennas of the array of antennas or to another base station via the network connection 1460. In a particular implementation, the network connection 1460 may be a wide area network (WAN) connection, as an illustrative, non-limiting example. In some implementations, the core network may include or correspond to a Public Switched Telephone Network (PSTN), a packet backbone network, or both.
The base station 1400 may include a media gateway 1470 that is coupled to the network connection 1460 and the processor 1406. The media gateway 1470 may be configured to convert between media streams of different telecommunications technologies. For example, the media gateway 1470 may convert between different transmission protocols, different coding schemes, or both. To illustrate, the media gateway 1470 may convert from pulse-code modulation (PCM) signals to Real-Time Transport Protocol (RTP) signals, as an illustrative, non-limiting example. The media gateway 1470 may convert data between packet switched networks (e.g., a Voice Over Internet Protocol (VoIP) network, an IP Multimedia Subsystem (IMS), a fourth generation (4G) wireless network, such as LTE, WiMax, and UMB, etc.), circuit switched networks (e.g., a PSTN), and hybrid networks (e.g., a second generation (2G) wireless network, such as GSM, GPRS, and EDGE, a third generation (3G) wireless network, such as WCDMA, EV-DO, and HSPA, etc.).
Additionally, the media gateway 1470 may include a transcoder and may be configured to transcode data when codecs are incompatible. For example, the media gateway 1470 may transcode between an Adaptive Multi-Rate (AMR) codec and a G.711 codec, as an illustrative, non-limiting example. The media gateway 1470 may include a router and a plurality of physical interfaces. In some implementations, the media gateway 1470 may also include a controller (not shown). In a particular implementation, the media gateway controller may be external to the media gateway 1470, external to the base station 1400, or both. The media gateway controller may control and coordinate operations of multiple media gateways. The media gateway 1470 may receive control signals from the media gateway controller and may function to bridge between different transmission technologies and may add service to end-user capabilities and connections.
The base station 1400 may include a demodulator 1462 that is coupled to the transceivers 1452, 1454, the receiver data processor 1464, and the processor 1406, and the receiver data processor 1464 may be coupled to the processor 1406. The demodulator 1462 may be configured to demodulate modulated signals received from the transceivers 1452, 1454 and to provide demodulated data to the receiver data processor 1464. The receiver data processor 1464 may be configured to extract a message or audio data from the demodulated data and send the message or the audio data to the processor 1406.
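As a hedged illustration of the bit-recovery side of this chain (hard-decision demodulation of a Gray-coded QPSK constellation is assumed purely for concreteness; the disclosure does not limit the demodulator 1462 to any particular scheme), extracting bits from received symbols might look like:

```python
def qpsk_demodulate(symbols):
    """Hard-decision demodulation of Gray-coded QPSK symbols: the sign
    of the real part decides the first bit of each pair, the sign of
    the imaginary part decides the second bit (0 for positive,
    1 for negative)."""
    bits = []
    for s in symbols:
        bits.append(0 if s.real > 0 else 1)
        bits.append(0 if s.imag > 0 else 1)
    return bits
```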
The base station 1400 may include a transmission data processor 1482 and a transmission multiple input-multiple output (MIMO) processor 1484. The transmission data processor 1482 may be coupled to the processor 1406 and the transmission MIMO processor 1484. The transmission MIMO processor 1484 may be coupled to the transceivers 1452, 1454 and the processor 1406. In some implementations, the transmission MIMO processor 1484 may be coupled to the media gateway 1470. The transmission data processor 1482 may be configured to receive the messages or the audio data from the processor 1406 and to code the messages or the audio data based on a coding scheme, such as CDMA or orthogonal frequency-division multiplexing (OFDM), as illustrative, non-limiting examples. The transmission data processor 1482 may provide the coded data to the transmission MIMO processor 1484.
The coded data may be multiplexed with other data, such as pilot data, using CDMA or OFDM techniques to generate multiplexed data. The multiplexed data may then be modulated (i.e., symbol mapped) by the transmission data processor 1482 based on a particular modulation scheme (e.g., Binary phase-shift keying (“BPSK”), Quadrature phase-shift keying (“QPSK”), M-ary phase-shift keying (“M-PSK”), M-ary Quadrature amplitude modulation (“M-QAM”), etc.) to generate modulation symbols. In a particular implementation, the coded data and other data may be modulated using different modulation schemes. The data rate, coding, and modulation for each data stream may be determined by instructions executed by the processor 1406.
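For concreteness, symbol mapping for one of the listed schemes can be sketched as follows; the particular Gray-coded QPSK constellation below is an assumption for illustration, as the disclosure permits any of the named modulation schemes:

```python
import numpy as np

# One assumed Gray-coded QPSK constellation: each bit pair selects a
# quadrant (first bit -> sign of real part, second bit -> sign of
# imaginary part), so adjacent constellation points differ in one bit.
QPSK_MAP = {
    (0, 0): complex(1, 1),
    (0, 1): complex(1, -1),
    (1, 0): complex(-1, 1),
    (1, 1): complex(-1, -1),
}

def qpsk_modulate(bits):
    """Map an even-length bit sequence to unit-energy QPSK symbols."""
    assert len(bits) % 2 == 0, "QPSK consumes two bits per symbol"
    symbols = [QPSK_MAP[(bits[i], bits[i + 1])]
               for i in range(0, len(bits), 2)]
    return np.array(symbols) / np.sqrt(2)  # normalize average energy to 1
```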
The transmission MIMO processor 1484 may be configured to receive the modulation symbols from the transmission data processor 1482 and may further process the modulation symbols and may perform beamforming on the data. For example, the transmission MIMO processor 1484 may apply beamforming weights to the modulation symbols. The beamforming weights may correspond to one or more antennas of the array of antennas from which the modulation symbols are transmitted.
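Applying beamforming weights amounts to scaling the symbol stream by a per-antenna complex weight, producing one weighted copy per transmit antenna; a minimal sketch (the flat, per-antenna weight model is an assumption for illustration) is:

```python
import numpy as np

def apply_beamforming(symbols, weights):
    """Apply per-antenna complex beamforming weights to a stream of
    modulation symbols. Returns an (antennas x symbols) array in which
    row a holds the weighted copy transmitted from antenna a."""
    weights = np.asarray(weights).reshape(-1, 1)   # column: one weight per antenna
    symbols = np.asarray(symbols).reshape(1, -1)   # row: the symbol stream
    return weights @ symbols                       # outer product
```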
During operation, the second antenna 1444 of the base station 1400 may receive a data stream 1414. The second transceiver 1454 may receive the data stream 1414 from the second antenna 1444 and may provide the data stream 1414 to the demodulator 1462. The demodulator 1462 may demodulate modulated signals of the data stream 1414 and provide demodulated data to the receiver data processor 1464. The receiver data processor 1464 may extract audio data from the demodulated data and provide the extracted audio data to the processor 1406.
The processor 1406 may provide the audio data to the transcoder 1410 for transcoding. The decoder 1438 of the transcoder 1410 may decode the audio data from a first format into decoded audio data and the encoder 1436 may encode the decoded audio data into a second format. In some implementations, the encoder 1436 may encode the audio data using a higher data rate (e.g., up-convert) or a lower data rate (e.g., down-convert) than received from the wireless device. In other implementations, the audio data may not be transcoded. Although transcoding (e.g., decoding and encoding) is illustrated as being performed by a transcoder 1410, the transcoding operations (e.g., decoding and encoding) may be performed by multiple components of the base station 1400. For example, decoding may be performed by the receiver data processor 1464 and encoding may be performed by the transmission data processor 1482. In other implementations, the processor 1406 may provide the audio data to the media gateway 1470 for conversion to another transmission protocol, coding scheme, or both. The media gateway 1470 may provide the converted data to another base station or core network via the network connection 1460.
The encoder 1436 may estimate a delay between the reference frame (e.g., the first frame 131) and the target frame (e.g., the second frame 133). The encoder 1436 may also estimate a temporal offset between the reference channel (e.g., the first audio signal 130) and the target channel (e.g., the second audio signal 132) based on the delay and based on historical delay data. The encoder 1436 may quantize and encode the temporal offset (or the final shift) value at a different resolution based on the CODEC sample rate to reduce (or minimize) the impact on the overall delay of the system. In one example implementation, the encoder may estimate and use the temporal offset at a higher resolution for multi-channel downmix purposes at the encoder; however, the encoder may quantize and transmit the offset at a lower resolution for use at the decoder. The decoder 118 may generate the first output signal 126 and the second output signal 128 by decoding encoded signals based on the reference signal indicator 164, the non-causal shift value 162, the gain parameter 160, or a combination thereof. Encoded audio data generated at the encoder 1436, such as transcoded data, may be provided to the transmission data processor 1482 or the network connection 1460 via the processor 1406.
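The coarser-resolution quantization of the temporal offset for transmission can be sketched as follows; the `step` parameter is a hypothetical stand-in for the CODEC-sample-rate-dependent resolution described above, not a value drawn from the disclosure:

```python
def quantize_offset(offset_samples, step):
    """Quantize a temporal offset to a coarser resolution. A larger
    step (e.g., at lower CODEC sample rates) means fewer distinct
    indices, so fewer bits are needed to send the index to the
    decoder. Returns (index to transmit, reconstructed offset)."""
    index = round(offset_samples / step)
    return index, index * step
```

The encoder would keep `offset_samples` at full resolution for its own downmix alignment and transmit only `index`, trading decoder-side precision for bit savings.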
The transcoded audio data from the transcoder 1410 may be provided to the transmission data processor 1482 for coding according to a modulation scheme, such as OFDM, to generate the modulation symbols. The transmission data processor 1482 may provide the modulation symbols to the transmission MIMO processor 1484 for further processing and beamforming. The transmission MIMO processor 1484 may apply beamforming weights and may provide the modulation symbols to one or more antennas of the array of antennas, such as the first antenna 1442 via the first transceiver 1452. Thus, the base station 1400 may provide a transcoded data stream 1416 that corresponds to the data stream 1414 received from the wireless device to another wireless device. The transcoded data stream 1416 may have a different encoding format, data rate, or both, than the data stream 1414. In other implementations, the transcoded data stream 1416 may be provided to the network connection 1460 for transmission to another base station or a core network.
The base station 1400 may therefore include a computer-readable storage device (e.g., the memory 1432) storing instructions that, when executed by a processor (e.g., the processor 1406 or the transcoder 1410), cause the processor to perform operations including estimating a delay between the reference frame and the target frame. The operations also include estimating a temporal offset between the reference channel and the target channel based on the delay and based on historical delay data.
Those of skill would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software executed by a processing device such as a hardware processor, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or executable software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in a memory device, such as random access memory (RAM), magneto-resistive random access memory (MRAM), spin-torque transfer MRAM (STT-MRAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, or a compact disc read-only memory (CD-ROM). An exemplary memory device is coupled to the processor such that the processor can read information from, and write information to, the memory device. In the alternative, the memory device may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or a user terminal.
The previous description of the disclosed implementations is provided to enable a person skilled in the art to make or use the disclosed implementations. Various modifications to these implementations will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other implementations without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the implementations shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.
The present application claims priority from U.S. Provisional Patent Application No. 62/556,653 entitled “TEMPORAL OFFSET ESTIMATION,” filed Sep. 11, 2017, which is incorporated herein by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
20190080703 A1 | Mar 2019 | US |