The present invention generally relates to audio upmixing, more specifically, to generating higher channel surround sound audio signals from stereo audio signals.
Monophonic sound (or “mono”) refers to sound systems that utilize a single loudspeaker (or “speaker”) for reproduction. In contrast, stereophonic sound (or “stereo”) uses two separate audio channels to reproduce sound from two loudspeakers on the left and right side of the listener.
Surround sound is a broad term used to describe sound reproduction that uses more than two audio channels. Surround sound systems are generally described using the format A.B, or A.B.C, where A is the number of speakers at the listener's height (the listening plane), B is the number of subwoofers, and C is the number of overhead speakers. For example, a 5.1 surround sound system has 6 audio channels, where 5 are allocated to the listening plane speakers, and 1 is allocated to the subwoofer (which may or may not be at the listening plane). As an additional example, 7.1.4 surround sound such as that found in Dolby Atmos audio systems allocates 7 channels to listening plane speakers, 1 channel to a subwoofer, and 4 channels to overhead speakers.
Audio tracks can be made for particular speaker layouts. A track may have one or more audio channels depending on the particular speaker layout it was mixed for. “Upmixing” as used herein refers to the process of converting an audio track having M channels to an audio track having N channels, where N>M. “Downmixing,” in contrast, refers to the process of converting an audio track having Y channels to an audio track having X channels, where X<Y.
Systems and methods for audio upmixing in accordance with embodiments of the invention are illustrated. One embodiment includes a method for upmixing audio, including receiving an audio track which includes an input plurality of channels, each channel having an encoded audio signal, decoding the audio signals, calculating a first frequency spectrum for a low frequency component of the signal using a first window, calculating a second frequency spectrum for a high frequency component of the signal using a second window, determining at least one direct signal by estimating panning coefficients, estimating at least one ambient signal based on the at least one direct signal, and generating an output plurality of channels based on the at least one direct signal and the at least one ambient signal.
In another embodiment, the output plurality of channels comprises more channels than the input plurality of channels.
In a further embodiment, the method further includes determining a spatial representation of the audio track.
In still another embodiment, the input plurality of channels comprises two channels.
In a still further embodiment, the two channels comprise a right and left channel.
In yet another embodiment, the output plurality of channels comprises a center channel.
In a yet further embodiment, the center channel is determined using the at least one direct signal and the panning coefficients.
In another additional embodiment, a decorrelation method is applied to the resulting surround channels.
In a further additional embodiment, a decorrelation method is applied to the resulting left and right channels.
In another embodiment again, the low frequency component comprises frequencies up to 1000 Hz.
In a further embodiment again, calculating the first frequency spectrum and calculating the second frequency spectrum comprises using a Short-time Fourier transform (STFT).
In still yet another embodiment, the first window has a length suitable for the STFT to produce 2048 frequency coefficients.
In a still yet further embodiment, the second window has a length suitable for the STFT to produce 128 frequency coefficients.
In still another additional embodiment, the method further includes smoothing the panning coefficients.
In a still further additional embodiment, a system for upmixing audio includes a processor and a memory containing an upmixing application that configures the processor to receive an audio track comprising an input plurality of channels, each channel having an encoded audio signal, decode the audio signals, calculate a first frequency spectrum for a low frequency component of the signal using a first window, calculate a second frequency spectrum for a high frequency component of the signal using a second window, determine at least one direct signal by estimating panning coefficients, estimate at least one ambient signal based on the at least one direct signal, and generate an output plurality of channels based on the at least one direct signal and the at least one ambient signal.
In still another embodiment again, the output plurality of channels comprises more channels than the input plurality of channels.
In a still further embodiment again, the upmixing application further directs the processor to determine a spatial representation of the audio track.
In yet another additional embodiment, the input plurality of channels comprises two channels.
In a yet further additional embodiment, the two channels comprise a right and left channel.
In yet another embodiment again, the output plurality of channels comprises a center channel.
In a yet further embodiment again, the center channel is determined using the at least one direct signal and the panning coefficients.
In another additional embodiment again, the upmixing application further directs the processor to apply a decorrelation method to the resulting surround channels.
In a further additional embodiment again, the upmixing application further directs the processor to apply a decorrelation method to the resulting left and right channels.
In still yet another additional embodiment, the low frequency component comprises frequencies up to 1000 Hz.
In another additional embodiment, to calculate the first frequency spectrum and the second frequency spectrum, the upmixing application directs the processor to use a Short-time Fourier transform (STFT).
In a further additional embodiment, the first window has a length suitable for the STFT to produce 2048 frequency coefficients.
In another embodiment again, the second window has a length suitable for the STFT to produce 128 frequency coefficients.
In a further embodiment again, the upmixing application further directs the processor to smooth the panning coefficients.
Additional embodiments and features are set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the specification or may be learned by the practice of the invention. A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings, which form a part of this disclosure.
The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention.
Advancements in film sound have resulted in an increase in the number of audio channels. As a result, home surround sound systems are becoming more commonplace. Where homes may previously have only had 2-channel stereo systems, 5.1 surround sound and even higher order surround sound systems are now ubiquitous. However, music catalogues are rarely in a surround sound format. For example, recordings made by the Beatles, often cited as the most influential band of all time, are in mono and stereo. As such, surround sound systems, and even some stereo systems, are unable to provide a surround sound experience when playing back Beatles recordings.
To remedy this, systems and methods described herein provide audio upmixing techniques that enable lower channel audio to be converted into higher channel audio without introducing significant, if any, distortion. Conventional methodologies tend to focus more on cinema audio and tend to be suboptimal for music reproduction. Further, conventional methodologies can introduce artifacts and/or other distortions to the played back audio. For many applications, systems and methods described herein may need to be performed in near-real time, and therefore increased efficiency over existing methods is beneficial.
For example, home surround sound systems are often provided music as a source input that is not in 1:1 channel format with the speaker layout, but the listener expects the music they've selected to be played back immediately from all the loudspeakers in their system. As such, a track may need to be upmixed into a higher number of channels immediately with as little lag as possible. Systems and methods described herein can upmix audio tracks to higher channel formats in near real time.
The Discrete Fourier Transform (DFT) is a mathematical method used to analyze the frequency content of audio signals. The Fast Fourier Transform (FFT) is an efficient computational implementation of the DFT that reduces the number of mathematical operations needed for the analysis. In many embodiments, the entire signal is not known in advance. For example, when music is streaming from the internet, digital audio samples arrive continuously in time. The Short-time Fourier Transform (STFT) can be used to determine frequency and phase content of specific time portions (time slices) of the audio signal. The STFT computes the FFT of consecutive time slices of the incoming signal and calculates the frequency content of the signal continuously in time. One issue with STFTs (and the Fourier Transform in general) is that the transform has a fixed resolution. Specifically, the number of coefficients used in the analysis ("FFT Length") determines the frequency resolution of the analyzed frequency content of the signal. In the STFT case, the consecutive time slices are composed of a number of digital audio samples, N, and this slicing process is achieved through the use of a windowing function ("a window"). The number of audio samples per second is called the sampling rate, fs. When the number of coefficients of the FFT is set to be equal to the window size (N), the resulting spacing between analyzed frequencies (frequency resolution) of the FFT is fs/N. That implies that as the number of FFT coefficients (N) increases, the FFT has the ability to resolve frequencies that are closer together. However, an increase in the number of coefficients, N, implies that the size of the window used to create the time slices becomes larger. This results in a reduction of the ability to resolve rapid time changes of the audio signal. This time-frequency resolution tradeoff is one of the fundamental properties of the Fourier Transform.
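As a concrete illustration of the fs/N relationship above, the following sketch computes the bin spacing and window duration for two window lengths. The sampling rate and window sizes are illustrative choices only, not values fixed by this disclosure:

```python
fs = 44100  # CD-quality sampling rate (assumed for illustration)

def freq_resolution(fs, n):
    """Spacing in Hz between adjacent FFT bins for an n-sample window (fs / N)."""
    return fs / n

def window_duration_ms(fs, n):
    """Time in milliseconds spanned by an n-sample window."""
    return 1000.0 * n / fs

long_win = 4096   # fine frequency resolution, coarse time resolution
short_win = 256   # coarse frequency resolution, fine time resolution

print(freq_resolution(fs, long_win))     # ~10.8 Hz between bins
print(window_duration_ms(fs, long_win))  # ~92.9 ms per time slice
print(freq_resolution(fs, short_win))    # ~172.3 Hz between bins
print(window_duration_ms(fs, short_win)) # ~5.8 ms per time slice
```

The longer window separates frequencies roughly 16 times closer together, but each analyzed slice covers 16 times more of the signal in time, directly exhibiting the tradeoff described above.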
A wider window gives a better frequency resolution, but a worse time resolution. Conversely, a narrower window gives better time resolution, but a worse frequency resolution. An additional downside of using an STFT window that yields high frequency resolution is that significantly more computations are typically performed in order to analyze the frequency content. Systems and methods described herein can leverage this tradeoff to increase computational efficiency while maintaining quality by extracting from the audio signals for each channel a number of frequency bands that can then be separately processed.
In various embodiments, the frequency bands are selected by identifying frequency ranges that benefit from high resolution in time and those that benefit from high resolution in frequency. The bands that benefit from high resolution in frequency tend to be lower frequency bands, which can be allocated more compute resources. The power spectra of lower frequency bands in musical audio signals tend to change much more slowly than higher frequencies, but changes in frequency within lower frequency bands are much more noticeable to the human ear (e.g. the perceived difference between a 50 Hz audio signal and a 53 Hz audio signal is significantly more noticeable than the difference between a 5000 Hz audio signal and a 5003 Hz audio signal). As such, high resolution in frequency is typically more important than high resolution in time for low frequency audio signals in music. In contrast, the power spectra of higher frequency audio signals (where most melody instruments tend to reside, including the human voice) tend to change more rapidly in time, and so high resolution in time is typically more important than high resolution in frequency at higher frequency bands. As is discussed further below, extracting different frequency bands and determining the power spectra of the frequency bands by applying STFT processes using different length time windows to achieve different tradeoffs between frequency and time resolution can reduce processing load within a processing system (e.g. a CPU), and in many embodiments, increase the parallelizability of the processing. As a result, systems and methods in accordance with many embodiments of the invention can achieve low latency, near real-time upmixing of audio signals.
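One way the band extraction described above could be sketched is with a band-split at a 1000 Hz cutoff followed by STFTs with different window lengths per band. The Butterworth filter, its order, the sampling rate, and the window sizes below are assumptions for illustration, not the claimed implementation:

```python
import numpy as np
from scipy import signal

fs = 48000          # assumed sampling rate
cutoff_hz = 1000    # low/high split frequency used for illustration

rng = np.random.default_rng(0)
x = rng.standard_normal(fs)  # one second of noise as a stand-in audio signal

# 4th-order Butterworth low/high split (filter choice is an assumption).
sos_lo = signal.butter(4, cutoff_hz, btype="lowpass", fs=fs, output="sos")
sos_hi = signal.butter(4, cutoff_hz, btype="highpass", fs=fs, output="sos")
x_lo = signal.sosfilt(sos_lo, x)
x_hi = signal.sosfilt(sos_hi, x)

# Long window on the low band (fine frequency resolution),
# short window on the high band (fine time resolution).
f_lo, t_lo, Z_lo = signal.stft(x_lo, fs=fs, nperseg=4096)
f_hi, t_hi, Z_hi = signal.stft(x_hi, fs=fs, nperseg=256)

# One-sided spectra: nperseg // 2 + 1 frequency bins per frame.
print(Z_lo.shape[0], Z_hi.shape[0])  # 2049 129
```

The two bands can then be processed independently (and in parallel), each with the time-frequency tradeoff appropriate to its content.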
By way of example, turning now to
Audio upmixing processes can involve converting an audio track with a given number of channels to a version of the audio track with a higher number of channels. In many embodiments, audio upmixing processes described herein can operate in real time. For example, processes described herein can upmix a stereo audio stream to a 5.1 channel stream which is played back using speakers designed and/or placed to render 5.1 channel audio without noticeable latency to the user. As can be readily appreciated, a stereo to 5.1 upmix is merely an example, and any arbitrary number of channels can be upmixed using processes described herein. However, in order to provide a concrete example to enhance understanding, an upmix from stereo to 5.1 channel surround sound is used as an example below.
Turning now to
Same frequency band L and R channel pairs are split (230) into frames. In many embodiments, frames are generated using a sliding window. The window size can be dependent upon what frequency band is being processed. For example, a high frequency band may have a smaller window size (and therefore frame size) because, when performing an STFT (240) on the frame, high frequencies need high resolution in time but low resolution in frequency, whereas low frequencies need low resolution in time but higher resolution in frequency.
In many embodiments, the window sizes are allocated such that the high frequency window yields a first number of spectral coefficients (e.g. 128 or fewer spectral coefficients), and the low frequency window yields a second larger number of spectral coefficients (e.g. 2048 or more spectral coefficients). The specific number of spectral frequency coefficients that are generated with respect to each frequency band (and the number of frequency bands) is largely dependent upon the requirements of specific applications in accordance with various embodiments of the invention, and may be tuned based on the particular piece of content and available computational resources. For example, different musical genres may be accounted for using different numbers of spectral coefficients. Indeed, in a number of embodiments the characteristics (e.g. genre) of the music can be specified and/or detected and parameters such as (but not limited to) frequency cutoff(s), and/or number(s) of spectral coefficients with respect to one or more of the frequency bands can be adapted based upon the characteristics of the music. Further, as noted above, multiple frequency bands can be generated, and therefore different window sizes can be used as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. In numerous embodiments, the window utilized to determine the FFT of a given spectral band (e.g. using an STFT) operates in a sliding window fashion and may overlap previously processed samples from the signal. In some embodiments, the window contains between 40%-60% of samples from samples utilized to determine the FFT of the spectral band (e.g. using an STFT) during a previous time window. 
However, this number can be adjusted depending on the type of content being processed, the frequency band being processed, and/or any other parameter as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. This splitting can provide significant computational efficiency increases because, as noted, Fourier transforms break up a frequency range into spectral coefficients (or frequency sub-bands called bins), and processing requirements are roughly the square of the number of spectral coefficients.
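A minimal sketch of the sliding-window framing described above, assuming a 50% hop (within the 40%-60% overlap range noted); the function name and frame-count convention are illustrative:

```python
import numpy as np

def frame_signal(x, win_len, hop):
    """Slice x into overlapping frames of win_len samples, advancing by hop samples."""
    n_frames = 1 + (len(x) - win_len) // hop
    return np.stack([x[i * hop : i * hop + win_len] for i in range(n_frames)])

x = np.arange(1024, dtype=float)  # stand-in for one band of one channel

# 128-sample window with a 64-sample hop -> 50% overlap between frames.
frames = frame_signal(x, win_len=128, hop=64)
print(frames.shape)  # (15, 128)
```

Each row of `frames` would then be windowed and transformed; the second half of every frame reappears as the first half of the next, which is the overlap the text describes.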
In many embodiments, the Fourier transform is a Fast Fourier transform (FFT), which may be used to implement a Short-time Fourier transform (STFT). The frequency components corresponding to the spectral coefficients can be assigned (250) to new channels. An inverse Fourier transform (e.g. an inverse STFT, called iSTFT) can be performed (260) on the spectral coefficients in each new channel to produce new audio signals for each channel. These new audio signals can then be output (270).
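The analysis/synthesis round trip (STFT followed by iSTFT) can be sketched with off-the-shelf routines; `scipy.signal.stft`/`istft` and the parameter values below are used purely for illustration:

```python
import numpy as np
from scipy import signal

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz tone

# Analysis: STFT with a 256-sample window (50% overlap by default).
f, frames_t, Z = signal.stft(x, fs=fs, nperseg=256)

# ...per-bin processing / channel assignment would happen here...

# Synthesis: inverse STFT reconstructs a time-domain signal per channel.
_, x_rec = signal.istft(Z, fs=fs, nperseg=256)

# With the default COLA-satisfying window, reconstruction is near-exact.
print(np.allclose(x, x_rec[: len(x)], atol=1e-6))  # True
```

If the spectral coefficients are left unmodified, the iSTFT recovers the input, so any audible change in the output channels comes only from the processing applied between the two transforms.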
Assigning frequency components to new channels can be performed in a number of ways. Turning now to
Panning coefficients for the L and R channels are estimated (320). In many embodiments, the stereo signals are represented as a weighted sum of J source signals dj(n) and a term that corresponds to an uncorrelated ambient signal nL(n):

xL(n) = Σj aL,j·dj(n) + nL(n)
Panning coefficients aL and aR are assumed to satisfy the power summing condition:

aL² + aR² = 1
In the frequency domain, after application of a Fourier transform (e.g. an STFT), the signal model is given as:
In many embodiments, it is assumed that at any given time instant b, and frequency band k, only one dominant source D is active in the track. In various embodiments, it is assumed that the ambient left and right signals have the same amplitude, but different phase (φ) due to variations in path lengths that arise from room acoustic reflections:
NL(b, k) = N(b, k), NR(b, k) = e^(jφ)·N(b, k)
From the above, a simplified signal model can be written as:
XL(b, k) = aL(b, k)·D(b, k) + N(b, k)

XR(b, k) = aR(b, k)·D(b, k) + e^(jφ)·N(b, k)
However, it is to be understood that each equation is computed for each time frequency bin as above. As the magnitude of the ambient signal can be assumed to be significantly smaller than that of the direct signal, let:
|XL(b, k)|≈aL(b, k)|D(b, k)|
|XR(b, k)|≈aR(b, k)|D(b, k)|
which, combined with the power summing condition of the panning coefficients, gives an estimate of each coefficient based on the magnitudes of the original left and right channels:

ãL(b, k) = |XL(b, k)| / √(|XL(b, k)|² + |XR(b, k)|²)

ãR(b, k) = |XR(b, k)| / √(|XL(b, k)|² + |XR(b, k)|²)
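The magnitude-based panning estimate described above can be sketched per bin as follows; the small `eps` guard against silent bins is an implementation assumption, not part of the disclosure:

```python
import numpy as np

def estimate_panning(XL_mag, XR_mag, eps=1e-12):
    """Estimate per-bin panning coefficients from channel magnitudes,
    satisfying the power summing condition aL**2 + aR**2 = 1."""
    norm = np.sqrt(XL_mag**2 + XR_mag**2) + eps  # eps avoids division by zero
    return XL_mag / norm, XR_mag / norm

# A source panned fully left, fully right, and dead center:
XL = np.array([1.0, 0.0, 0.7])
XR = np.array([0.0, 1.0, 0.7])
aL, aR = estimate_panning(XL, XR)
print(np.round(aL**2 + aR**2, 6))  # [1. 1. 1.]
```

In practice `XL_mag` and `XR_mag` would be the per-bin STFT magnitudes of the left and right channels for one frame.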
In many embodiments, the rate of change between consecutive STFT frames is too fast, which can cause audible distortion. In order to resolve this, the estimates of the panning coefficients âL and âR are smoothed (330) over time. In numerous embodiments, smoothing is achieved using an exponential moving averaging filter:
âL(b, k) = γL(b, k)·ãL(b, k) + (1 − γL(b, k))·âL(b−1, k)

âR(b, k) = γR(b, k)·ãR(b, k) + (1 − γR(b, k))·âR(b−1, k)
where γ is a smoothing coefficient which can be tuned to minimize distortion. However, in some embodiments, smoothing can reduce variance, which tends to pull audio towards the center channel. In various embodiments, this is rectified using a different smoothing coefficient (γ1 or γ2) with a decision-directed approach which reduces artifacts while preserving a wide sound stage. That is, the value for γ may change for each STFT bin calculation. The decision-directed approach can be formalized as:
If ãL(b, k) > âL(b−1, k), then γL = γ1; else γL = γ2

If ãR(b, k) > âR(b−1, k), then γR = γ1; else γR = γ2
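A sketch of the decision-directed smoothing for one frequency bin over successive frames; the specific γ1 and γ2 values below are illustrative assumptions, not values given in this disclosure:

```python
import numpy as np

def smooth_panning(a_tilde, gamma1=0.1, gamma2=0.9):
    """Decision-directed exponential smoothing of raw panning estimates
    a_tilde (one bin over time). gamma1/gamma2 are illustrative values."""
    a_hat = np.empty_like(a_tilde)
    a_hat[0] = a_tilde[0]
    for b in range(1, len(a_tilde)):
        # Choose the smoothing coefficient per frame, per the decision rule.
        gamma = gamma1 if a_tilde[b] > a_hat[b - 1] else gamma2
        a_hat[b] = gamma * a_tilde[b] + (1 - gamma) * a_hat[b - 1]
    return a_hat

raw = np.array([0.5, 0.9, 0.4, 0.5])
print(np.round(smooth_panning(raw), 3))
```

With these example values, upward jumps in the raw estimate are tracked slowly (γ1 small) while downward moves are tracked quickly (γ2 large); in a real tuning the two coefficients would be chosen to trade artifact suppression against sound-stage width, as the text describes.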
For notational simplicity, (b,k) is dropped in the equations below. Using the panning coefficients, direct and ambient components can be estimated (340). In many embodiments, using the panning coefficients in the above simplified signal model and solving for direct and ambient signals gives the following estimates:
With the estimate of the direct component from the generalized model above, a left, center and right channel can be derived (350) from the original stereo channels (L and R) using vector analysis:
XL = L + √0.5·C

XR = R + √0.5·C
In many embodiments, it is assumed that the ambient components are uncorrelated and that the L and R components do not usually contain a common dominant source, so:
L·R=0
which can be written using the above equation as:
(XL − √0.5·C)·(XR − √0.5·C) = 0
This produces a quadratic equation for |C|. In many embodiments, the solution with the negative sign (for minimum energy) is selected to find |C| (but it is not required):
|C| = √0.5·(|XL + XR| − |XL − XR|)
The C channel component can be represented as a vector in the direction of the vector sum of XL+XR and is weighted by the magnitude estimate |C|:
In many embodiments, the center channel can alternatively be estimated by using DL = aL×D and DR = aR×D to estimate |C| and C using the panning coefficients above. Once the center channel is determined, new L and R channels can be found by subtracting the center channel from the original L and R:
L = XL − √0.5·C

R = XR − √0.5·C
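Putting the center-channel derivation together, a per-bin sketch using the minimum-energy magnitude |C| = √0.5·(|XL + XR| − |XL − XR|), with C oriented along XL + XR; the direction normalization and its numerical guard are implementation assumptions:

```python
import numpy as np

def extract_center(XL, XR):
    """Estimate complex center-channel bins from stereo STFT bins and
    subtract the center from L/R (minimum-energy solution for |C|)."""
    C_mag = np.sqrt(0.5) * (np.abs(XL + XR) - np.abs(XL - XR))
    direction = XL + XR
    norm = np.abs(direction) + 1e-12      # guard against all-zero bins
    C = C_mag * direction / norm          # vector in the direction of XL + XR
    L_new = XL - np.sqrt(0.5) * C         # residual left channel
    R_new = XR - np.sqrt(0.5) * C         # residual right channel
    return L_new, C, R_new

# Identical (fully centered) content collapses almost entirely into C:
XL = XR = np.array([1.0 + 0.0j])
L_new, C, R_new = extract_center(XL, XR)
print(np.abs(C))      # [1.41421356]  (sqrt(2))
print(np.abs(L_new))  # essentially zero
```

Conversely, a source present in only one channel gives |XL + XR| = |XL − XR|, so |C| = 0 and the side channels pass through unchanged, matching the L·R = 0 assumption above.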
Left and right surround channels are assigned (360) as the left and right ambient estimates above. In some embodiments, it is advantageous to further process the surround channels using decorrelation. While some degree of decorrelation is achieved through the addition of a phase rotation in one of the two channels, several other methods for decorrelation can be used. In some embodiments in which a realistic acoustic reproduction is desired, the L, R, and C channels are intended to be precisely localized by the listener while the surround channels (LS and RS) are intended to sound diffuse and not localizable. This can be achieved by adding a decorrelation processing block to the surround signals prior to directing them to the loudspeakers. Decorrelation methods include phase changes, frequency-dependent delay, frequency subband based randomization of phase, all-pass filters and other methods. These methods can be particularly advantageous when the surround channel is directed to a single loudspeaker behind the listener as is described in U.S. patent application Ser. No. 16/839,021 titled “Systems and Methods for Spatial Audio Rendering”. In some embodiments, decorrelation can be applied to the upmixed XL and XR signals to enhance the spatial impression of the track when all of the upmixed channels are reproduced from a single loudspeaker (as is described in U.S. patent application Ser. No. 16/839,021 titled “Systems and Methods for Spatial Audio Rendering”) placed in front of the listener.
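One of the decorrelation methods listed above, frequency subband based randomization of phase, can be sketched as a whole-signal operation; a real-time system would instead apply this per STFT frame, and this offline version is for illustration only:

```python
import numpy as np

def decorrelate(x, seed=0):
    """Decorrelate a surround signal by randomizing its spectral phase
    while preserving the magnitude spectrum (offline illustration)."""
    X = np.fft.rfft(x)
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, size=X.shape)
    phase[0] = 0.0    # DC bin must stay real
    phase[-1] = 0.0   # Nyquist bin must stay real for even-length signals
    Y = np.abs(X) * np.exp(1j * phase)
    return np.fft.irfft(Y, n=len(x))

x = np.sin(2 * np.pi * np.arange(1024) / 32)
y = decorrelate(x)

# The magnitude spectrum (and hence per-band energy) is preserved,
# so the decorrelated surround channel sounds tonally similar but diffuse.
print(np.allclose(np.abs(np.fft.rfft(x)), np.abs(np.fft.rfft(y))))  # True
```

All-pass filters and frequency-dependent delays achieve a similar effect with lower latency; the common property is that phase relationships between the surround channels are disrupted without altering their spectra.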
While a particular method for upmixing and assigning frequencies to new channels is illustrated in
Upmixing systems in accordance with many embodiments of the invention can upmix audio tracks in near real time to enable a pleasing live listening experience on surround sound audio setups being fed by suboptimal input channel configurations. In many embodiments, the upmixing is performed on streaming media content with an imperceptible amount of latency as experienced by the listener. However, upmixing systems can perform on any number of tracks provided in a non-live context as well.
Turning now to
Further, in many embodiments, the connected speaker layout may be a spatial audio system such as that described in U.S. patent application Ser. No. 16/839,021. In various embodiments, the audio upmixer can provide upmixed audio as input to a virtual speaker layout used to render spatial audio. An audio upmixer connected to an example spatial audio system in accordance with an embodiment of the invention is illustrated in
Turning now to
The audio upmixer 1000 further includes a memory 1030. The memory can be implemented using volatile memory, nonvolatile memory, or any combination thereof. The memory contains an upmixing application 1032 which can configure the processor to perform various audio upmixing processes. In many embodiments, the memory further contains audio data 1034 which describes one or more audio tracks, and/or a filter bank 1036. In many embodiments, the filter bank is a data structure that contains a list of different bandpass filters to use in splitting channels as described above. However, in many embodiments, the filter bank can be implemented as its own distinct circuit.
While particular audio upmixing systems are illustrated in
The current application claims the benefit of and priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 63/125,896 entitled “Systems and Methods for Audio Upmixing” filed Dec. 15, 2020, which is hereby incorporated by reference in its entirety for all purposes.