The invention relates to methods of enhancing audio imagery, and, in particular, though not exclusively, to audio up-mixing methods and devices.
The quality of loudspeaker audio has been increasing at a steady rate for over a century. In terms of timbre, there is a strong argument that the recreation of a recorded sound is as good as it is going to get. However, spatial quality has some way to go before an analogous plateau is reached. This discrepancy is due to the relatively recent arrival of multi-channel audio systems for home and vehicle use, which provide the means to reproduce sound in a way that seems engaging and aesthetically "natural." Yet the vast majority of our musical recordings are stored in a two-channel stereo format recorded using two microphones.
There have been attempts at processing two-channel recordings so as to derive additional channels that contain reverberance information that can be played in an audio system including more than two loudspeakers. Such upmixing systems can be classified as spatial audio enhancers. Moreover, the goal of a commercial loudspeaker spatial audio system for music reproduction is generally to increase the enjoyment of the listening experience in a way that the listener can describe in terms of spatial aspects of the perceived sound. More generally, spatial audio enhancers take an audio recording, including one or more channels, and produce additional channels in order to enhance audio imagery. Examples of previously developed spatial audio enhancers include the Dolby Pro Logic II™ system, the Maher "spatial enhancement" system, the Aarts/Irwan 2-to-5 channel upmixer, the Logic 7 2-to-7 upmixer and the Avendano/Jot upmixer.
At least one exemplary embodiment of the invention is related to a method of up-mixing a plurality of audio signals comprising: filtering a first one of the plurality of audio signals with respect to a respective set of filtering coefficients, generating a filtered first one; time-shifting a second one of the plurality of audio signals with respect to the filtered first one, generating a shifted second one; determining a respective first difference between the filtered first one and the shifted second one, wherein the respective first difference is an up-mixed audio signal; and adjusting the respective set of filtering coefficients based on the respective first difference so that the respective first difference is essentially orthogonal (i.e., having about a zero correlation) to the first one.
In at least one exemplary embodiment each of the plurality of audio signals can include a source image component and a reverberance image component, where at least some of the respective source image components included in the plurality of audio signals are correlated with one another. In at least one further exemplary embodiment the plurality of audio signals includes a left front channel and a right front channel, and the respective first difference corresponds to a left rear channel including some portion of the respective reverberance image of the left front and right front channels.
At least one exemplary embodiment is directed to a method comprising: filtering the second one with respect to another respective set of filtering coefficients; time-shifting the first one with respect to the filtered second one, generating a shifted first one; determining a respective second difference between the filtered second one and the shifted first one; and, adjusting the another respective set of filtering coefficients based on the respective second difference so that the respective second difference is essentially orthogonal to the second one, and wherein the respective second difference corresponds to a right rear channel including some portion of the respective reverberance image of the left front and right front channels.
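A minimal per-sample sketch of these steps is given below in Python, assuming an NLMS-style coefficient adjustment; the function name, step size, regularization constant and buffer convention are illustrative assumptions rather than a definitive implementation of the claimed method.

```python
import numpy as np

def upmix_sample(m_first_buf, m_second_shifted, w, mu=0.5, eps=1e-6):
    """One up-mixing step: m_first_buf holds the most recent M samples of the
    first signal (newest first), m_second_shifted is the current sample of the
    time-shifted second signal, and w holds the M filtering coefficients."""
    filtered_first = np.dot(w, m_first_buf)         # filter the first signal
    diff = m_second_shifted - filtered_first        # the up-mixed (difference) sample
    power = np.dot(m_first_buf, m_first_buf) + eps  # power estimate for normalization
    w = w + (mu / power) * diff * m_first_buf       # drive the difference toward zero
    return diff, w                                  # correlation with the first signal
```

Running this loop over a left/right pair, and again with the roles of the two signals swapped, would produce the two rear (reverberance) channels described above.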
In at least one exemplary embodiment the first and second audio signals are adjacent audio channels.
In at least one exemplary embodiment the time-shifting includes one of delaying or advancing one audio signal with respect to another. In at least one exemplary embodiment a time-shift value is in the approximate range of 2 ms-10 ms.
In at least one exemplary embodiment the filtering of the first one includes equalizing the first one such that the respective difference is minimized. The respective set of filtering coefficients can also be adjusted according to one of the Least Mean Squares (LMS) method or the Normalized LMS (NLMS) method.
At least one exemplary embodiment is directed to a method comprising: determining a respective level of panning between a first and second audio signal; and, introducing cross-talk between the first and second audio signals if the level of panning is considered hard. For example, in at least one exemplary embodiment, the level of panning is considered hard if the first and second audio signals are essentially uncorrelated.
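A hedged sketch of this idea follows; the zero-lag correlation measure, the threshold and the cross-talk gain are assumptions chosen for illustration only.

```python
import numpy as np

def inject_crosstalk_if_hard_panned(first, second, threshold=0.1, gain=0.05):
    """If the two signals are essentially uncorrelated (hard panning),
    mix a small amount of each signal into the other."""
    denom = np.sqrt(np.dot(first, first) * np.dot(second, second)) + 1e-12
    rho = np.dot(first, second) / denom        # normalized correlation at zero lag
    if abs(rho) < threshold:                   # panning considered "hard"
        first, second = first + gain * second, second + gain * first
    return first, second
```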
At least one exemplary embodiment is directed to a computer program including computer usable program code configured to create at least one reverberance channel output from a plurality of audio signals, the computer usable program code including program instructions for: filtering a first one of the plurality of audio signals with respect to a respective set of filtering coefficients; time-shifting a second one of the plurality of audio signals with respect to the filtered first one; determining a respective first difference between the filtered first one and the time-shifted second one, where the respective first difference is a reverberance channel; and, adjusting the respective set of filtering coefficients based on the respective first difference so that the respective first difference is essentially orthogonal to the first one.
In at least one exemplary embodiment, the plurality of audio signals includes a left front channel and a right front channel, and the respective first difference corresponds to a left rear channel including some portion of the respective reverberance image of the left front and right front channels. In at least one exemplary embodiment, the computer usable program code also includes program instructions for: filtering the second one with respect to another respective set of filtering coefficients; time-shifting the first one with respect to the filtered second one of the plurality of audio signals; determining a respective second difference between the filtered second one and the time-shifted first one; and, adjusting the another respective set of filtering coefficients based on the respective second difference so that the respective second difference is essentially orthogonal to the second one, and where the respective second difference corresponds to a right rear channel including some portion of the respective reverberance image of the left front and right front channels.
In at least one exemplary embodiment a device including the computer program also includes at least one port for receiving the plurality of audio signals.
In at least one exemplary embodiment a device including the computer program also includes a plurality of outputs for providing a respective plurality of output audio signals that includes some combination of the original plurality of audio signals and at least one reverberance channel signal.
In at least one exemplary embodiment a device including the computer program also includes a data storage device for storing a plurality of output audio signals that includes some combination of the original plurality of audio signals and at least one reverberance channel signal.
In at least one exemplary embodiment a device including the computer program also includes: a hard panning detector; a cross-talk inducer; and, where the computer usable program code also includes program instructions for employing the cross-talk inducer to inject cross-talk into some of the plurality of audio signals if hard panning is detected.
At least one exemplary embodiment is directed to creating a modified audio channel comprising: a plurality of audio channels including a first audio channel, a second audio channel and a third audio channel, wherein the third audio channel is a combination of the first and second audio channels produced by: filtering the first audio channel with respect to a respective set of filtering coefficients; time-shifting the second audio channel with respect to the filtered first audio channel; creating the third audio channel by determining a respective first difference between the filtered first audio channel and the time-shifted second audio channel, where the respective first difference is the third audio channel; and, adjusting the respective set of filtering coefficients based on the respective first difference so that the third audio channel is essentially orthogonal to the first audio channel.
Further areas of applicability of exemplary embodiments of the present invention will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating exemplary embodiments of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Exemplary embodiments of the present invention will become more fully understood from the detailed description and the accompanying drawings, wherein:
The following description of exemplary embodiment(s) is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Exemplary embodiments are directed to or can be operatively used on various wired or wireless audio devices. Additionally, exemplary embodiments can be used with digital and non-digital acoustic systems. Additionally, various receivers and microphones can be used, for example MEMS transducers and diaphragm transducers, such as Knowles FG and EG series transducers.
Processes, techniques, apparatus, and materials as known by one of ordinary skill in the art may not be discussed in detail but are intended to be part of the enabling description where appropriate. For example, the correlation of signals and the computer code to check correlation are intended to fall within the scope of at least one exemplary embodiment.
Notice that similar reference numerals and letters refer to similar items in the following figures, and thus once an item is defined in one figure, it may not be discussed or further defined in the following figures.
At least one exemplary embodiment is directed to a new spatial audio enhancing system including a novel Adaptive Sound Upmixing System (ASUS). In some specific exemplary embodiments the ASUS provided converts a two-channel recording into an audio signal including four channels that can be played over four different loudspeakers. In other specific exemplary embodiments the ASUS provided converts a two-channel recording into an audio signal including five channels that can be played over five different loudspeakers. In still other specific embodiments the ASUS provided converts a five-channel recording (such as those for DVDs) into an audio signal including eight channels that can be played over eight different loudspeakers. More generally, in view of this disclosure those skilled in the art will be able to adapt the ASUS to process and provide an arbitrary number of audio channels both at the input and the output.
In at least one exemplary embodiment, the ASUS is for sound reproduction, using multi-channel home theater or automotive loudspeaker systems, where the original recording has fewer channels than those available in the multi-channel system. Multi-channel systems typically have four or five loudspeakers. However, keeping in mind that two-channel recordings are created using two microphones, an underlying aspect of the invention is that the audio imagery created be consistent with that in a conventional two-loudspeaker sound scene created using the same recording. The general maxim governing the reproduction of a sound recording is that the mixing intentions of the sound engineer are to be respected. Accordingly, in some exemplary embodiments of the invention the aforementioned general maxim translates into meaning that the spatial imagery associated with the recorded musical instruments remains essentially the same in the upmixed sound scene. The enhancement is therefore in terms of the imagery that contributes to the listeners' sense of the recording space, which is known as reverberance imagery. In quantitative terms the reverberance imagery is generally considered the sound reflections impinging on a point that can be modeled as a stochastic ergodic function, such as random noise. Put another way, at least one exemplary embodiment is arranged so that in operation there is an attempt made to substantially separate and independently deliver to a listener all those reverberance components from a recording of a live musical performance that enable the listener to describe the perception of reverberance.
Features of at least one exemplary embodiment can be embodied in a number of forms. For example, various features can be embodied in a suitable combination of hardware, software and firmware. In particular, some exemplary embodiments include, without limitation, entirely hardware, entirely software, entirely firmware or some suitable combination of hardware, software and firmware. In at least one exemplary embodiment, features can be implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Additionally and/or alternatively, features can be embodied in the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
A computer-readable medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor and/or solid-state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include, without limitation, compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
In accordance with features of at least one exemplary embodiment, a data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output (i.e. I/O devices)—including but not limited to keyboards, displays, pointing devices, etc. —can be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable communication between multiple data processing systems, remote printers, or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
Referring to
More specifically, in addition to the ASUS 13 the surround sound system 10 includes an audio source 11, respective left and right front speakers 21 and 23, respective left and right rear speakers 25 and 27, and respective left and right delay elements 22 and 24.
The left and right delay elements 22 and 24 are respectively connected between the audio source 11 and the left and right front speakers 21 and 23 so that the left (L) and right (R) audio channels are delivered to the left and right front speakers. The left (L) and right (R) audio channels are also coupled to the ASUS 13, which performs the upmixing function to produce left reverberance (LS) and right reverberance (RS) channels that are in turn delivered to the left and right rear speakers 25 and 27.
In operation, the ASUS 13 receives the left (L) and right (R) audio channels and produces the new left reverberance (LS) and right reverberance (RS) channels, which are not a part of the original two-channel recording. In turn, each of the speakers 21, 23, 25 and 27 is provided with a corresponding one of the respective audio channels [L, R, LS, RS] and auditory images are created. Specifically, a first auditory image corresponds to a source image 31 produced primarily by the left and right front speakers 21 and 23; a second auditory image corresponds to a first reverberance image 33 produced primarily by the left front and left rear speakers 21 and 25; and, a third auditory image corresponds to a second reverberance image 35 produced primarily by the right front and right rear speakers 23 and 27.
With continued reference to
For now, with reference to
According to the principles of pair-wise panning if the source components SL and SR are coherent (i.e. with a high absolute cross-correlation peak at a lag less than about 1 ms) then radiation of these signals with two loudspeakers either in front (as with a conventional 2/0 loudspeaker system) or to the side of the listener will create a phantom source image 31 between the speakers 21 and 23. The same applies to the radiation of the reverberance components; so if RS could be extracted from the right channel and radiated from the rear-right speaker 27, a listener would perceive the second reverberance image 35 on the right-hand side, as shown in
The two subjective design criteria regarding source and reverberance imagery are now translated into a method which can be undertaken empirically on the output signals of the ASUS 13:
1. Spatial distortion of the source image in the upmixed scene should be minimized.
To maximize the source image fidelity in the upmixed audio scene, source image components SL and SR should not be radiated from the rear loudspeakers in the upmixed sound scene. If they were, then they could perceptually interact with the source image components radiated from the front loudspeakers and cause the source image to be distorted. Therefore, all those sound components which contribute to the formation of a source image should be removed from the rear loudspeaker signals, yet those source image components radiated from the front loudspeakers should be maintained. A way of measuring this in electronic terms is to ensure that the signal RS is uncorrelated with signal L, and that LS is uncorrelated with R. For a signal sampled at time n, this is mathematically expressed in (4.1):

$E\{L(n)\,RS(n-k)\} \approx 0 \quad \text{and} \quad E\{R(n)\,LS(n-k)\} \approx 0, \quad 0 \le k \le N. \qquad (4.1)$
The lag range N should be equal to 10-20 ms (500-1000 samples for a 44.1 kHz sample-rate digital system), as it is the early sound after the direct-path sound which primarily contributes to spatial aspects of source imagery (such as source width) and the latter part to reverberance imagery. For lag times (k) greater than 20 ms or so, the two signals may be somewhat correlated at low frequencies—as explained later.
2. Reverberance imagery should have a homogeneous distribution in the horizontal plane; in particular, reverberance image directional strength should be high from lateral (±90 degrees) directions.
The implication of this statement is that in order to create new reverberance images to the side of the listener, the side loudspeaker channels (e.g. R and RS) should have some degree of correlation. Under such circumstances, pair-wise amplitude panning can occur between the two loudspeakers, with the perceptual consequence that the reverberance image is pulled away from the side loudspeakers and to a region between them. This is summarized in (4.2):

$E\{R(n)\,RS(n-k)\} \neq 0 \quad \text{and} \quad E\{L(n)\,LS(n-k)\} \neq 0, \quad \text{for some } 0 \le k \le N. \qquad (4.2)$
Again, N would be equal to 10-20 ms in many embodiments.
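The two criteria can be checked numerically on recorded channel signals. The sketch below estimates the largest normalized cross-correlation magnitude over the lag range N; under (4.1) this value should be near zero for diagonally opposite channels (e.g. L and RS), while under (4.2) it should be clearly nonzero for adjacent side channels (e.g. R and RS). The function name and the default lag count (about 20 ms at 44.1 kHz) are assumptions made for this sketch.

```python
import numpy as np

def max_lagged_correlation(x, y, n_lags=882):
    """Largest |normalized cross-correlation| between x and y over lags 0..n_lags-1."""
    x = x - np.mean(x)
    y = y - np.mean(y)
    scale = np.std(x) * np.std(y) * len(x) + 1e-12
    peaks = [abs(np.dot(x[:len(x) - k], y[k:])) / scale for k in range(n_lags)]
    return max(peaks)

# Criterion (4.1): max_lagged_correlation(L, RS) and max_lagged_correlation(R, LS)
# should be close to zero; criterion (4.2): max_lagged_correlation(R, RS) should not be.
```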
Regarding the degree of correlation between the two rear channels (i.e. the "extracted ambiance" signals), the optimal relationship is not as straightforward as with the above two electronic criteria. Although low interaural coherence at low frequencies is conducive to enveloping, close-sounding and wide auditory imagery, this does not necessarily mean the rear loudspeaker channels should be uncorrelated de facto. The correlation between two locations in a reverberant field depends on the distance between them and is frequency dependent. For instance, at 100 Hz the measuring points in a reverberant field must be approximately 1.7 m apart to have a coherence of zero (assuming the Schroeder frequency of the hall is less than 100 Hz). Microphone-pair recordings in concert halls therefore rarely have total decorrelation at low frequencies. Furthermore, for sound reproduced with a loudspeaker pair in normal echoic rooms, due to loudspeaker cross-talk, head diffraction and room reflections, the interaural coherence at low frequencies is close to unity regardless of the interchannel coherence of the loudspeaker signals.
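The 1.7 m figure is consistent with the textbook expression for the coherence between two points in an ideal diffuse field (a standard relation assumed here, not stated in the disclosure), taking the speed of sound as roughly 343 m/s:

$\gamma(f) = \dfrac{\sin(2\pi f d / c)}{2\pi f d / c}, \qquad \gamma(f) = 0 \;\Rightarrow\; d = \dfrac{c}{2f} = \dfrac{343\ \text{m/s}}{2 \times 100\ \text{Hz}} \approx 1.7\ \text{m}.$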
Before describing a specific exemplary embodiment of the novel ASUS 13 provided in accordance with features of at least one exemplary embodiment, it is useful to first look at the impulse response model of an example recording environment. Turning to
As noted above, the ASUS 13 can be adapted for any number of input channels (two or more). In the description of the ASUS 13 herein, it is assumed that the two input signals are taken directly from the microphone pair M1 61 and M2 63; the recording media can therefore be eliminated from the discussion for the time being. These two signals from each microphone at sample time n are m1(n) and m2(n). As discussed in the electronic design criteria, the goal of the ASUS 13 is to remove those sound-image components in the two microphone signals which are correlated (i.e. the source image components), leaving the reverberance-image components to be radiated from the rear speakers 25 and 27 shown in
With continued reference to
The IR is affected by the level of the excitation signal due to non-linearities in the mechanical, electronic or acoustic parts involved in the IR measurement (e.g. an IR measured using loudspeakers is affected in a non-linear way by the signal level). An impulse response can also apply to the time-domain output of a (digital) electronic system when excited with a signal shaped like a Kronecker delta function. Therefore, to avoid confusion the term acoustic impulse response will be used to refer to any impulse response which involves the transmission of the excitation signal through air, as distinguished from a purely electronic IR.
As noted above, in a recording of a solo musical performance using two microphones M1 61 and M2 63, there are three acoustic impulse responses 51, 52 and 53: the intermicrophone impulse response IRm1-m2 53; and the two impulse responses between the sound source and the two microphones 51 and 52 (IRS-m1 and IRS-m2). All three IR's can change due to various factors, and these factors can be distinguished as being related to either the sound source or to its surrounding environment:
Clearly, the first two factors which affect the acoustic IR's in the above list are source-related and the second two are environment-related, with the source-related factors only affecting the source-microphone IR. These factors will be investigated later with a real-time system; however, the algorithm for the ASUS will be described for time-invariant IR's and stationary source signals. The word stationary here means that the statistical properties of the microphone signals (such as mean and autocorrelation) are invariant over time, i.e. they are both strictly stationary and wide-sense stationary. Of course, when dealing with live musical instruments the signals at the microphones are non-stationary; it will be shown later how time-varying signals such as recorded music affect the performance of the algorithm. Finally, for the time being any sound in the room which is caused by sources other than our single source S is ignored; that is, a noise-free (or at least very low noise) acoustic and electronic environment is assumed. For the following analysis in this section, these three major assumptions are summarized:
The time-domain acoustic transfer function between two locations in an enclosed space—in particular between a radiated acoustic signal and a microphone diaphragm—can be modeled as a two-part IR.
In this model the L-length acoustic IR is represented as two decaying time sequences: one defined between sample times n=0 and n=Lr−1, the other between n=Lr and n=L−1. The first of these sequences represents the IR from the direct sound and early reflections (ER's), and the other sequence represents the reverberation; they are accordingly called the "direct-path" and "reverberant-path" components of the IR. In acoustical terms, reflected sound can be thought of as consisting of two parts: early reflections (ER's) and reverberation (reverb). ER's are defined as "those reflections which arrive at the ear via a predictable, non-stochastic directional path, generally within 80 ms of the direct sound," whereas reverberation is generally considered to be sound reflections impinging on a point (e.g. a microphone) which can be modeled as a stochastic process, with a Gaussian distribution and a mean of zero.
The source signals involved in the described filtering processes are also modeled as discrete-time stochastic processes. This means a random process whose time evolution can (only) be described using probabilistic laws; it is not possible to define exactly how the process will evolve once it has started, but it can be modeled according to a number of statistical criteria.
As discussed, it is the direct component of the IR which affects source imagery, such as perceived source direction, width and distance, and the reverberant component which affects reverberance imagery, such as envelopment and a feeling for the size of the room. The time boundary between these two components is called the mixing time: "The mixing time defines how long it takes for there to be no memory of the initial state of the system. There is statistically equal energy in all regions of the space (in the concert hall) after the mixing time [creating a diffuse sound field]". The mixing time is approximated by (4.3):
$L_r \approx \sqrt{V}\ \text{(ms)}, \qquad (4.3)$
where V is the volume of the room (in m³).
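As a worked instance of (4.3), assuming (purely for illustration) a room volume of 400 m³:

$L_r \approx \sqrt{400} = 20\ \text{ms} \approx 882\ \text{samples at a 44.1 kHz sample rate}.$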
The mixing time can also be defined in terms of the local statistics of the impulse response. Individual, late-arriving sound reflections in a room impinging upon a point (say, a microphone capsule) each give a pressure which can be modeled as statistically independent of the others; that is, they are independent and identically distributed (IID). According to the central limit theorem, the summation of many IID signals gives a Gaussian distribution. The distribution can therefore be used as a basis for determining the mixing time.
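One common way to apply this idea, sketched below under assumed window and threshold values, is to slide a short window along a measured impulse response and report the first point where the local excess kurtosis stays near zero (i.e. where the local statistics look Gaussian); this is an illustrative estimator, not a procedure prescribed by the present description.

```python
import numpy as np

def estimate_mixing_time(h, fs, win_ms=10.0, threshold=0.5):
    """Estimate the mixing time (in ms) of impulse response h sampled at fs Hz
    as the first window whose excess kurtosis is close to zero (Gaussian-like)."""
    win = int(fs * win_ms / 1000)
    for start in range(0, len(h) - win):
        seg = h[start:start + win]
        m2 = np.mean((seg - seg.mean()) ** 2) + 1e-18
        excess_kurtosis = np.mean((seg - seg.mean()) ** 4) / m2 ** 2 - 3.0
        if abs(excess_kurtosis) < threshold:
            return 1000.0 * start / fs
    return None
```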
After establishing the two-component acoustic IR model, the input signals m1(n) and m2(n) can be described by the acoustic convolution between the sound source s(n) and the Lr-length direct-path coefficients, summed with the convolution of s(n) with the (L−Lr)-length reverberant-path coefficients. The convolution is undertaken acoustically, but to simplify the mathematics we will consider that all signals are electronic, as if there were a direct mapping of pressure to voltage, sampled at time n. Furthermore, for simplicity the two microphone signals m1 and m2 are not referred to explicitly; instead each system is generalized using the subscripts i and j, where i or j=1 or 2 and i≠j. This convolution can therefore be written as:

$m_i(n) = \sum_{k=0}^{L_r-1} d_{i,k}\,s(n-k) + \sum_{k=L_r}^{L-1} r_{i,k-L_r}\,s(n-k). \qquad (4.4)$
A vector formulation of the convolution in (4.4) is now developed, as vector representations of discrete summations are visually simpler to understand and will be used throughout this description of the ASUS. In keeping with convention, vectors will always be represented as bold text, contrasted with the italic text style used to represent discrete signal samples in the time domain.
As mentioned, the direct-path IR coefficients are the first Lr samples of the L-length IR between the source and two microphones, and the reverberant path IR coefficients are the remaining (L−Lr) samples of these IR's. The time-varying source samples and time-invariant IR's are now defined as the vectors:
$\mathbf{s}_d(n) = [s(n), s(n-1), \ldots, s(n-L_r+1)]^T$
$\mathbf{s}_r(n) = [s(n-L_r), s(n-L_r-1), \ldots, s(n-L+1)]^T$
$\mathbf{d}_i = [d_{i,0}, d_{i,1}, \ldots, d_{i,L_r-1}]^T$
$\mathbf{r}_i = [r_{i,0}, r_{i,1}, \ldots, r_{i,L-L_r-1}]^T.$
And the acoustic convolutions between the radiated acoustic source and the early and reverberant-path IR's in (4.4) can now be written as:
$m_i(n) = \mathbf{s}_d^T(n)\,\mathbf{d}_i + \mathbf{s}_r^T(n)\,\mathbf{r}_i, \qquad (4.5)$
For convenience, the early and reverberant path convolutions are replaced with:
$s_{di}(n) = \mathbf{s}_d^T(n)\,\mathbf{d}_i$
and
$s_{ri}(n) = \mathbf{s}_r^T(n)\,\mathbf{r}_i, \qquad (4.6)$
So (4.5) becomes:
$m_i(n) = s_{di}(n) + s_{ri}(n). \qquad (4.7)$
With the following definitions for the last L samples of the early and reverberant path sound arriving at time n:
$\mathbf{s}_{di}(n) = [s_{di}(n), s_{di}(n-1), \ldots, s_{di}(n-L+1)]^T$
$\mathbf{s}_{ri}(n) = [s_{ri}(n), s_{ri}(n-1), \ldots, s_{ri}(n-L+1)]^T,$
the following assumptions about these early and reverberant path sounds are expressed using the statistical expectation operator E{·}:
The early parts of both IR's (the "direct paths") are at least partially correlated:
$E\{\mathbf{d}_i^T\,\mathbf{d}_j\} \neq 0,$
$E\{\mathbf{s}_{di}^T(n)\,\mathbf{s}_{dj}(n)\} \neq 0.$
The late parts of the IR's (the "reverberant paths") are uncorrelated with each other:
$E\{\mathbf{r}_i^T\,\mathbf{r}_j\} = 0,$
$E\{\mathbf{s}_{ri}^T(n)\,\mathbf{s}_{rj}(n)\} = 0.$
The two reverberant-path IR's are uncorrelated with both early parts:
$E\{\mathbf{r}_i^T\,\mathbf{d}_j\} = 0,$
$E\{\mathbf{s}_{ri}^T(n)\,\mathbf{s}_{dj}(n)\} = 0, \quad \text{for } i, j \in \{1, 2\}.$
The reverberant-path IR is decaying random noise with a normal distribution and a mean of zero:
$E\{\mathbf{r}_i\} = \mathbf{0},$
$E\{s_{ri}(n)\} = 0.$
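A short simulation that honors these assumptions can be useful for testing. The sketch below builds the two microphone signals of (4.7) from a stationary noise source, partially correlated early parts and independent decaying reverberant tails; all lengths, decay constants and the correlation structure are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, L, Lr = 44100, 4096, 882                      # assumed sample rate, IR length, mixing time

s = rng.standard_normal(2 * fs)                   # stationary stochastic source signal

# Direct/early parts d_i: partially correlated across the two microphones.
env_d = np.exp(-np.arange(Lr) / 200.0)
d_common = rng.standard_normal(Lr) * env_d
d1 = d_common
d2 = 0.8 * d_common + 0.2 * rng.standard_normal(Lr) * env_d

# Reverberant parts r_i: independent decaying Gaussian noise with zero mean.
env_r = np.exp(-np.arange(L - Lr) / 800.0)
r1 = rng.standard_normal(L - Lr) * env_r
r2 = rng.standard_normal(L - Lr) * env_r

def mic_signal(d_i, r_i):
    # m_i(n) = s_d^T(n) d_i + s_r^T(n) r_i: direct-path plus delayed reverberant-path sound.
    direct = np.convolve(s, d_i)[:len(s)]
    reverb = np.concatenate([np.zeros(Lr), np.convolve(s, r_i)])[:len(s)]
    return direct + reverb

m1, m2 = mic_signal(d1, r1), mic_signal(d2, r2)
```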
One possible function of any sound reproduction system is to play back a sound recording. In a conventional two-channel sound reproduction system (i.e. commonly referred to as a stereo system) having two speakers, the microphone signals m1(n) and m2(n) are played for the listener(s) using the left (L) and right (R) speakers. With reference to FIG. 2B, and with continued reference to
The left channel (L) is coupled in parallel to a delay element 77, an adaptive filter 71 and another delay element 73. Similarly, the right channel (R) is coupled in parallel to a delay element 78, an adaptive filter 72, and another delay element 74. The output of the delay element 77, being simply a delayed version of the left channel signal, is coupled to the front left speaker 21. Similarly, the output of the delay element 78, being simply a delayed version of the right channel signal, is coupled to the front right speaker 23.
In order to produce the reverberance channels for the left and right rear speakers 25 and 27, outputs of the adaptive filters are subtracted from delayed versions of signals from the corresponding adjacent front channel. Thus, in order to create the right reverberance channel RS the output of the adaptive filter 71, which produces a filtered version of the left channel signal, is subtracted from a delayed version of the right channel signal provided by the delay element 74 by way of the summer 75. Likewise, in order to create the left reverberance channel LS the output of the adaptive filter 72, which produces a filtered version of the right channel signal, is subtracted from a delayed version of the left channel signal provided by the delay element 73 by way of the summer 76.
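The wiring just described can be summarized in a short sketch; the filters w12 and w21 are treated as given here (their adaptation is developed below), and using a single common delay value for the delay elements is an assumption made for this sketch.

```python
import numpy as np

def asus_wiring(left, right, w12, w21, delay):
    """Signal flow of the ASUS 13: delayed front channels plus two rear
    reverberance channels formed by subtracting each adaptive filter output
    from the delayed adjacent channel."""
    left_d = np.concatenate([np.zeros(delay), left])[:len(left)]     # delay elements 77/73
    right_d = np.concatenate([np.zeros(delay), right])[:len(right)]  # delay elements 78/74
    y_rs = np.convolve(left, w21)[:len(left)]    # adaptive filter 71 (filters the left channel)
    y_ls = np.convolve(right, w12)[:len(right)]  # adaptive filter 72 (filters the right channel)
    rs = right_d - y_rs                          # summer 75: right reverberance channel RS
    ls = left_d - y_ls                           # summer 76: left reverberance channel LS
    return left_d, right_d, ls, rs               # to speakers 21, 23, 25 and 27
```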
The adaptive filters 71 and 72 are similar although not necessarily identical. To reiterate, in operation the ASUS 13, in some specific embodiments, operates in such a way that diagonally opposite speaker signals (e.g. L and RS) are uncorrelated. For example, referring to
Each input signal m1 and m2 is filtered by an M-sample length filter (w21 and w12, respectively). As mentioned, these filters model the early component of the impulse response between the two microphone signals, so ideally M=Lr. However, for the following analysis there are no assumptions about "knowing" Lr a priori, so we will just call the time-domain filter size M. A delay is added to each input channel mi before the filtered signal yi is subtracted. This is to allow for non-minimum phase impulse responses, which can occur if the sound source is closer to one microphone than the other. However, for the following analysis we will not consider this delay, as omitting it makes the mathematical description more straightforward (and it would make no difference to the theory if it was included).
The filtering of signal mj by the adaptive filter wij gives signal yi(n). This subscript notation may seem confusing, but it helps in describing the loudspeaker output signals, because signals mi and ei are both phase-coherent (have a nonzero correlation) and are reproduced by loudspeakers on the same side (e.g. signals mi and ei are both reproduced with loudspeakers on the left-hand side). This filtering process is shown in (4.11):

$y_i(n) = \sum_{k=0}^{M-1} w_{ij,k}\,m_j(n-k), \qquad (4.11)$
which with the following definitions:
$\mathbf{m}_i(n) = [m_i(n), m_i(n-1), \ldots, m_i(n-M+1)]^T$
$\mathbf{w}_{ij} = [w_{ij,0}, w_{ij,1}, \ldots, w_{ij,M-1}]^T$
allow the linear convolution to be written in vector form as:
$y_i(n) = \mathbf{m}_j^T(n)\,\mathbf{w}_{ij}. \qquad (4.12)$
If we look at filter w12 in
$e_i(n) = m_i(n) - y_i(n). \qquad (4.13)$
The output signal is conventionally called an error signal, as it can be interpreted as a mismatch between yi and mi caused by the filter coefficients wij being "not good enough" to model mi as a linear transformation of mj. These terms are used for the sake of convention; the two error signals are the output signals of the system, which are reproduced with separate loudspeakers behind the listener.
If the filter coefficients wij can be adapted so as to approximate the early part of the inter-microphone impulse response, then the correlated sound component will be removed and the "left-over" signal will be the reverberant (or reverberance-image) component in the mi channel, minus a filtered version of the reverberant component in the mj channel. In this case, the error signal will be smaller than the original level of mi. The "goal" of the algorithm which changes the adaptive filter coefficients can therefore be interpreted as minimizing the level of the error signals. This level can simply be calculated as a power estimate of the output signal ei, which is an average of the squares of the individual samples, and it is for this reason that the algorithm is called the Least Mean Square (LMS) algorithm. This goal is formally expressed as a "performance index" or "cost" scalar J, where for a given filter vector wij:
$J_i(\mathbf{w}_{ij}) = E\{e_i^2(n)\}, \qquad (4.14)$
and E{·} is the statistical expectation operator. The requirement for the algorithm is to determine the operating conditions for which J attains its minimum value; this state of the adaptive filter is called the "optimal state".
When a filter is in the optimal state, the rate of change in the error signal level (i.e. J) with respect to the filter coefficients w will be minimal. This rate of change (or gradient operator) is an M-length vector ∇, and applying it to the cost function J gives:

$\nabla J_i(\mathbf{w}_{ij}) = \frac{\partial J_i(\mathbf{w}_{ij})}{\partial \mathbf{w}_{ij}}. \qquad (4.15)$
The right-hand side of (4.15) is expanded using partial derivatives in terms of the error signal e(n) from (4.14):

$\nabla J_i(\mathbf{w}_{ij}) = 2\,E\!\left\{e_i(n)\,\frac{\partial e_i(n)}{\partial \mathbf{w}_{ij}}\right\}, \qquad (4.16)$
and the general solution to this differential equation, for any filter state, can be obtained by first substituting (4.12) into (4.13):

$e_i(n) = m_i(n) - \mathbf{m}_j^T(n)\,\mathbf{w}_{ij}, \qquad (4.17)$
and differentiating with respect to wij(n):

$\frac{\partial e_i(n)}{\partial \mathbf{w}_{ij}} = -\mathbf{m}_j(n). \qquad (4.18)$
So (4.16) is solved as:
$\nabla J_i(\mathbf{w}_{ij}) = -2\,E\{\mathbf{m}_j(n)\,e_i(n)\}. \qquad (4.19)$
Updating the filter vector wij(n) from time n−1 to time n is done by adding the negative of the gradient operator multiplied by a constant scalar μ. The expectation operator in equation (4.19) is replaced with a vector multiplication and the filter update (or steepest descent gradient algorithm) is:
$\mathbf{w}_{ij}(n) = \mathbf{w}_{ij}(n-1) + \mu\,\mathbf{m}_j(n)\,e_i(n), \qquad (4.20)$
It should be noted that the adaptive filtering algorithm which is used (i.e. based on the LMS algorithm) is chosen because of its relative mathematical simplicity compared with others.
From the filter update equation (4.20) it can be seen that the adjustment from wij(n−1) to wij(n) is proportional to the input vector mj(n). When the filter has converged to the optimal solution, the gradient ∇ in (4.15) should be zero, but the actual ∇ will be equal to μmj(n)ei(n). This product may not be equal to zero and results in gradient noise which is proportional to the level of mj(n). This undesirable consequence can be mitigated by normalizing the gradient estimate with another scalar which is inversely proportional to the power of mj(n), and the algorithm is therefore called the Normalized Least-Mean-Square (NLMS) algorithm. The tap-weight adaptation is then:

$\mathbf{w}_{ij}(n) = \mathbf{w}_{ij}(n-1) + \frac{\mu}{\mathbf{m}_j^T(n)\,\mathbf{m}_j(n)}\,\mathbf{m}_j(n)\,e_i(n). \qquad (4.21)$
When the input signals m1(n) and m2(n) are very small, inverting the power estimate could become computationally problematic. Therefore a small constant δ is added to the power estimate in the denominator of the gradient estimate, a process called regularization. How the regularization parameter affects filter convergence properties is investigated empirically with a variety of input signals later in this description.
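The regularized tap-weight update then takes the following form, sketched here as a small function; the step size and δ values are illustrative assumptions.

```python
import numpy as np

def nlms_update(w_ij, m_j_vec, e_i, mu=0.5, delta=1e-6):
    """Regularized NLMS update: the LMS correction of (4.20) divided by the
    power estimate of m_j(n) plus a small regularization constant delta."""
    power = np.dot(m_j_vec, m_j_vec)
    return w_ij + (mu / (delta + power)) * m_j_vec * e_i
```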
As mentioned, when the “optimal state” is attained the gradient operator is equal to zero, so under these conditions at sample time n, (4.19) becomes:
$E\{\mathbf{m}_j(n)\,e_i(n)\} = \mathbf{0}_{M\times 1}. \qquad (4.22)$
This last statement represents the Principle of Orthogonality (PoO). This elegant relationship means that when the optimal filter state is attained, e1 (the rear-left loudspeaker signal) is uncorrelated with m2 (the front-right loudspeaker signal). In other words, when the adaptive filter is at its optimal solution, diagonally opposite loudspeaker signals are uncorrelated: Quod Erat Demonstrandum.
Under such a condition, distortion of the source image is minimized because signal ei contains reverberance-image components which are unique to mi, and as the source image is only affected by correlated components within mi and mj (by definition, correlated components within an approximately 20 ms window), a radiated signal which is uncorrelated with either mi or mj cannot contain a sound component which affects source imagery. This is a very important idea behind the ASUS, and the degree to which the PoO operates was assessed by measuring both the electronic correlation between signals mj and ei and also the subjective differences in auditory spatial imagery of the source image within a conventional 2/0 audio scene and an upmixed audio scene created with the ASUS.
For optimal state conditions, using (4.17) to rewrite (4.22) and then expanding gives:

$E\{\mathbf{m}_j(n)\,m_i(n)\} - E\{\mathbf{m}_j(n)\,\mathbf{m}_j^T(n)\}\,\mathbf{w}_{ij} = \mathbf{0}_{M\times 1}. \qquad (4.23)$
These equations—called the normal equations because they are constructed using the equations supporting the corollary to the principle of orthogonality—can now be written in terms of the correlation between the input signals mj and mi, which is called the M-by-1 vector r:
$\mathbf{r}_{m_j m_i} = E\{\mathbf{m}_j(n)\,m_i(n)\}$
and the autocorrelation of each signal is the M-by-M matrix R:
$\mathbf{R}_{m_j m_j} = E\{\mathbf{m}_j(n)\,\mathbf{m}_j^T(n)\}.$
This allows (4.23) to be expressed as:
$\mathbf{0}_{M\times 1} = \mathbf{r}_{m_j m_i} - \mathbf{R}_{m_j m_j}\,\mathbf{w}_{ij}. \qquad (4.24)$
The filter in this state is called the Wiener solution, and the normal equation becomes:
$\mathbf{w}_{ij} = \mathbf{R}_{m_j m_j}^{-1}\,\mathbf{r}_{m_j m_i}. \qquad (4.25)$
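For a batch of recorded samples, (4.25) can be checked directly by estimating R and r with sample averages and solving the normal equation; the sketch below is illustrative, with an assumed filter length M.

```python
import numpy as np

def wiener_filter(m_j, m_i, M=64):
    """Batch Wiener solution of (4.25): estimate the M-by-M autocorrelation matrix R
    of m_j and the M-by-1 cross-correlation vector r between m_j and m_i, then solve."""
    N = len(m_j) - M
    X = np.stack([m_j[n:n + M][::-1] for n in range(N)])  # rows are the vectors m_j(n)
    d = m_i[M - 1:M - 1 + N]                               # time-aligned samples of m_i
    R = X.T @ X / N
    r = X.T @ d / N
    return np.linalg.solve(R, r)
```

With e_i = m_i − X w, the orthogonality condition (4.22) can then be verified numerically.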
For the sake of further clarity, the above description can be summarized using simplified flow-charts depicting only the broad and general steps of the operation of an ASUS in accordance with features of at least one exemplary embodiment. To that end
Referring first to
Turning to
In some exemplary embodiments the created reverberance channels are stored on a data storage medium such as a CD, DVD, flash memory, a computer hard-drive and the like. To that end,
The system 200 includes a user interface 203, a controller 201, and an ASUS 213. The system 200 is functionally connectable to an audio source 205 having a number (N) of audio channel signals and a storage device 207 for storing the original audio channel signals (N) and the upmixed reverberance channel signals (M) (i.e. on which the N+M channels are recorded). In operation a user uses the user interface 203 to control the process of upmixing and recording using the controller 201 and the ASUS 213. Those skilled in the art will understand that a workable system includes a suitable combination of associated structural elements, mechanical systems, hardware, firmware and software that is employed to support the function and operation of the system 200. Such items include, without limitation, wiring, sensors, regulators, mounting brackets, and electromechanical controllers. At least one exemplary embodiment is directed to a method including: determining the level of panning between first and second audio signals, where the level of panning is considered hard if the first and second audio signals are essentially uncorrelated; and adjusting the introduced cross-talk to improve upmixing quality. For example . . . is an example of an improved upmixing quality.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all modifications, equivalent structures and functions of the relevant exemplary embodiments. Thus, the description of the invention is merely exemplary in nature and, thus, variations that do not depart from the gist of the invention are intended to be within the scope of the exemplary embodiments of the present invention. Such variations are not to be regarded as a departure from the spirit and scope of the present invention.
This application claims the benefit of U.S. provisional patent application No. 60/823,156, filed on 22 Aug. 2006, the disclosure of which is incorporated herein by reference in its entirety.