The present invention relates to source separation, signal enhancement, and signal processing. The present invention further relates to a method for extracting target mid and side audio signals from a stereo audio signal. The present invention further relates to a processing device employing the aforementioned method.
In the field of audio processing, audio source separation is important in several applications. In one simple application, a separated speech audio signal is kept or provided with additional gain whereas the background audio signal is omitted or attenuated relative to the speech audio signal. This can enhance the intelligibility of the speech audio signal.
In audio source separation, the signal of interest which is targeted for estimation or extraction is termed the target signal. The target signal is not limited to speech and may be any general audio signal (such as a musical instrument) or even a plurality of audio signals which are to be separated from an audio mix comprising noise and/or additional audio signals which are not of interest.
For stereo audio signals which comprise two audio signals, associated with a left and right audio channel respectively, one method of processing, which also implicitly performs some source separation, is to extract mid and side audio signals from the left and right audio signals, wherein the mid and side audio signals are proportional to the sum and difference of the left and right audio signals respectively. The mid signal will emphasize audio components that are equal in magnitude and in-phase between the left and right audio signals, whereas the side audio signal will attenuate or eliminate such signals. Accordingly, calculation of mid and side signals from the left and right audio signals constitutes an efficient method for respectively boosting or attenuating in-phase center-panned sources. Mid and side signals can further be converted back to left and right (conventional stereo) audio signals.
To the extent that a center-panned source is of interest, it is noted that source separation has been implicitly performed: the mid signal will contain this source while the side signal will not. The side audio signal will comprise mainly potentially uninteresting background audio signals which could be attenuated or omitted to enhance the intelligibility of the center-panned in-phase audio source.
A drawback of conventional mid and side signal calculation is that when the desired audio source is not center-panned, the implicit source separation based on mid/side audio signal separation fails. To address this, specific mid and side signal calculation techniques have been developed for stereo audio signals with specific panning properties, techniques which are capable of successfully targeting stationary non-center panned audio sources. See e.g. Stereo Music Source Separation via Bayesian Modeling, by Master, Aaron, Ph.D. dissertation, Stanford University, 2006.
While these techniques alleviate some of the shortcomings of basic mid and side signal extraction, such panning-specific extraction techniques are still incapable of separating more general audio sources present in a stereo audio signal, such as moving audio sources, sources with reverberation, or sources which are spatially dominant at various points in time-frequency space.
It is therefore a purpose of this disclosure to provide an improved method and an audio processing system for performing enhanced audio source separation for stereo audio signals.
According to a first aspect of the invention there is provided a method for extracting a target mid audio signal from a stereo audio signal, the stereo audio signal comprising a left audio signal and a right audio signal. The method comprises the steps of obtaining a plurality of consecutive time segments of the stereo audio signal, wherein each time segment comprises a representation of a portion of the stereo audio signal, and obtaining, for each frequency band of a plurality of frequency bands of each time segment of the stereo audio signal, at least one of a target panning parameter and a target phase difference parameter. The target panning parameter represents a distribution over the time segment of a magnitude ratio between the left and right audio signals in the frequency band and the target phase difference parameter represents a distribution over the time segment of the phase difference between the left and right audio signals of the stereo audio signal.
The method further comprises extracting, for each time segment and each frequency band, a partial mid signal representation, wherein the partial mid signal representation is based on a weighted sum of the left and right audio signal, wherein a weight of each of the left and right audio signals is based on at least one of the target panning parameter (Θ) and the target phase difference parameter of each frequency band and time segment, and forming the target mid audio signal by combining the partial mid signal representations for each frequency band and time segment.
Obtaining at least one of the target panning parameter and target phase difference parameter may comprise receiving, accessing or determining the target panning parameter and/or target phase difference parameter. At least one of the target panning parameter and/or target phase difference parameter may be replaced with a default value for at least one time segment and frequency band.
By consecutive time segments is meant segments that describe a time portion of the audio signal, wherein later time segments describe later time portions of the audio signal and earlier time segments describe earlier time portions. The consecutive time segments may be overlapping or non-overlapping in time.
The representation of a portion of the stereo audio signal may be any time domain representation or any frequency domain representation. The frequency domain representation may be any linear time-frequency representation, such as a Short-Time Fourier Transform, STFT, representation or a Quadrature Mirror Filter, QMF, representation.
The invention is at least partially based on the understanding that by obtaining a target panning parameter and/or target phase difference parameter for each time segment and frequency band, a target mid audio signal may be extracted which at all times targets a time and/or frequency variant audio source in the stereo audio signal. Moreover, the method allows extraction of a target mid audio signal targeting two or more audio sources simultaneously which in the stereo audio signal are separated in frequency, as the target mid audio signal is created using individual target panning parameters for each time segment and frequency band.
In some implementations, the weight of the left and right audio signals is based on the target panning parameter such that the left or right audio signal with a greater magnitude is provided with a greater weight.
That is, the left or right audio signal which is associated with a greater magnitude or power (which is indicated by the target panning parameter) will be provided with a greater weight and contribute more to the formation of the target mid audio signal.
In some implementations, the method further comprises extracting, for each time segment and frequency band, a partial side signal representation, wherein the partial side signal representation is based on a weighted difference between the left and right audio signals, wherein a weight of each of the left and right audio signals is based on at least one of the target panning parameter and the target phase difference parameter of each frequency band and time segment and forming a target side audio signal by combining each partial side signal representation for each frequency band and time segment.
In other words, a target side audio signal may be formed in parallel with the target mid audio signal and the target mid and side audio signals form a complete representation of the stereo audio signal. It is understood that any implementation relating to the formation or processing of the target mid audio signal may be performed analogously with the forming and processing of the target side audio signal as will be described in the below.
If the stereo audio signal comprises e.g. an ideal center panned audio source, or a single audio source panned under the constant power law, the target mid audio signal will ideally capture all signal energy of the audio source and the target side audio signal will capture no such energy. For a general audio source(s) however, the target mid audio signal will capture the target audio source(s) and, likely, some other non-target sounds while the target side audio signal will capture non-target sounds while likely eliminating, or nearly eliminating, the target source(s). Inclusion of a target side audio signal in addition to the target mid audio signal ensures that all audio signals of the left and right audio signal pair are present in the target mid and side audio signal pair. For instance, this allows for lossless reconstruction of the left and right audio signals.
Any functions described in relation to a method may have corresponding features in a system or device and vice versa.
The present invention will be described in more detail with reference to the appended drawings, showing currently preferred embodiments of the invention.
Systems and methods disclosed in the present application may be implemented as software, firmware, hardware, or a combination thereof. In a hardware implementation, the division of tasks does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation.
The computer hardware may for example be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that computer hardware. Further, the present disclosure shall relate to any collection of computer hardware that individually or jointly execute instructions to perform any one or more of the concepts discussed herein.
Certain or all components may be implemented by one or more processors that accept computer-readable (also called machine-readable) code containing a set of instructions that when executed by one or more of the processors carry out at least one of the methods described herein. Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken is included. Thus, one example is a typical processing system (i.e. computer hardware) that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system further may include a memory subsystem including a hard drive, SSD, RAM and/or ROM. A bus subsystem may be included for communicating between the components. The software may reside in the memory subsystem and/or within the processor during execution thereof by the computer system.
The one or more processors may operate as a standalone device or may be connected, e.g., networked to other processor(s). Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
The software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, physical (non-transitory) storage media in various forms, such as EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by a computer. Further, it is well known to the skilled person that communication media (transitory) typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
A detailed description of currently preferred embodiments is included below, together with an introductory section on the extraction of basic mid and side audio signals.
Stereo audio signals comprising a left L and a right R audio signal are alternatively represented as a mid audio signal M and a side audio signal S. Basic mid and side audio signals M, S are constructed from the left and right audio signals L, R using the analysis equations

M = 0.5 (L + R)    (1)
S = 0.5 (L − R)    (2)
and the original left and right audio signals L, R may be reconstructed from the mid and side audio signals M, S using the following synthesis equations

L = M + S    (3)
R = M − S    (4)
The mid audio signal M boosts audio signal features that are in-phase and centered in the stereo mix, whereas the side audio signal S attenuates in-phase audio signal features. For instance, if the stereo audio signal contains a center-panned left and right audio signal pair L, R (equal magnitude, in-phase components in each of the left and right audio signals), the mid audio signal M will contain this content while the side audio signal S will eliminate it; basic mid/side extraction thus targets a center-panned audio signal for inclusion in the mid audio signal M and for exclusion from the side audio signal S.
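The basic analysis and synthesis steps above can be sketched as follows. This is a minimal illustration using NumPy arrays of time-domain samples; the function names are assumptions of this example, not taken from any actual implementation.

```python
# Sketch of basic mid/side analysis and synthesis, assuming
# simple NumPy arrays of time-domain samples.
import numpy as np

def basic_mid_side(left, right):
    """Basic analysis: mid boosts in-phase centered content, side cancels it."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

def basic_left_right(mid, side):
    """Basic synthesis: losslessly reconstructs the left/right pair."""
    return mid + side, mid - side

# A center-panned source appears only in the mid signal ...
t = np.linspace(0, 1, 100)
source = np.sin(2 * np.pi * 5 * t)
mid, side = basic_mid_side(source, source)     # identical L and R
assert np.allclose(mid, source) and np.allclose(side, 0.0)

# ... and the left/right pair is recovered exactly.
l, r = basic_left_right(mid, side)
assert np.allclose(l, source) and np.allclose(r, source)
```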
The left, right, mid and side audio signals L, R, M, S may be represented in the time domain or frequency domain. The frequency domain representation may for example be a Short-Time Fourier Transform, STFT, representation or a Quadrature Mirror Filter, QMF, representation. For example, the left and right audio signals L, R can be represented in the frequency domain as

L = |L| e^(i∠L)    (5)
R = |R| e^(i∠R)    (6)
wherein |L| and |R| denote the signal magnitudes of the left and right audio signals L, R respectively, and wherein ∠L and ∠R denote the phases of the left and right audio signals L, R respectively. In general |L|, |R|, ∠L and ∠R are all functions of frequency, ω, and time, t; however, the time and frequency dependence ω, t will be excluded from all equations for compactness. There are, therefore, many values represented by each of the symbols |L|, |R|, ∠L and ∠R. Each such value corresponds to the audio content of the left or right signals, and does not necessarily characterize a single target source. The values may generally be described as “detected” because they are considered to be detected from a stereo input signal in many contexts.
The left and right audio signals L, R may alternatively be represented with a combination of a detected magnitude parameter U and detected panning parameter θ, defined as

U = √(|L|² + |R|²)    (7)
θ = arctan(|R| / |L|)    (8)
meaning that the left and right audio signals L, R from equations 5 and 6 may be represented in terms of U, θ instead of in terms of |L|, |R|. For instance, the left and right audio signals L, R may be expressed as

L = U cos(θ) e^(i∠L)    (9)
R = U sin(θ) e^(i∠R)    (10)
The detected magnitude parameter U and detected panning parameter θ form a representation of the stereo audio signal which is referred to as a Stereo-Polar Magnitude (SPM) representation. While the SPM representation may be replaced with any equivalent representation, the SPM representation of the stereo audio signal will be utilized in the following description.
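A minimal sketch of computing the SPM representation for complex time-frequency tiles might look as follows; the function name `spm` and the use of `arctan2` are assumptions of this illustration.

```python
# Sketch of the Stereo-Polar Magnitude (SPM) representation for
# time-frequency tiles, assuming complex STFT values L and R.
import numpy as np

def spm(L, R):
    """Return detected magnitude U and detected panning parameter theta."""
    U = np.sqrt(np.abs(L) ** 2 + np.abs(R) ** 2)
    theta = np.arctan2(np.abs(R), np.abs(L))   # lies in [0, pi/2]
    return U, theta

# Left-only tile -> theta = 0; right-only -> pi/2; equal magnitudes -> pi/4.
assert np.isclose(spm(1.0 + 0j, 0.0 + 0j)[1], 0.0)
assert np.isclose(spm(0.0 + 0j, 1.0 + 0j)[1], np.pi / 2)
assert np.isclose(spm(1.0 + 0j, 1j)[1], np.pi / 4)
```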
The detected panning parameter θ ranges from 0 to π/2, wherein 0 indicates a stereo audio signal in which only the left audio signal L is non-zero and π/2 indicates a stereo audio signal in which only the right audio signal R is non-zero. A detected panning parameter value of π/4 indicates a centered stereo audio signal with equal magnitudes for the left and right audio signals L, R. While the detected panning parameter is derived from a stereo input signal, it also agrees mathematically with a model for source mixing. For an audio source Sx with magnitude |Sx|, phase ψx, and a known panning parameter Θx, the left and right audio signals L, R may be expressed as

L = |Sx| cos(Θx) e^(iψx)    (11)
R = |Sx| sin(Θx) e^(iψx)    (12)
It can be shown that the detected panning parameter θ for time-frequency points in a signal created with equations 11 and 12 will equal Θx of the audio source Sx, as:

θ = arctan(|R| / |L|) = arctan(|Sx| sin(Θx) / (|Sx| cos(Θx))) = arctan(tan(Θx)) = Θx
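This identity can be checked numerically; the sketch below mixes one source with an assumed panning parameter and recovers that parameter from the channel magnitudes. All values are illustrative.

```python
# Numerical check: for a single source mixed with panning parameter
# Theta_x (equations 11 and 12), the detected panning parameter
# equals Theta_x at every tile. Values are illustrative.
import numpy as np

theta_x = 0.3                      # assumed source panning parameter
s_mag, psi = 2.0, 0.7              # assumed source magnitude and phase
Sx = s_mag * np.exp(1j * psi)

L = np.cos(theta_x) * Sx           # equation 11
R = np.sin(theta_x) * Sx           # equation 12

detected = np.arctan2(np.abs(R), np.abs(L))
assert np.isclose(detected, theta_x)
```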
Mid and Side Audio Signals with Fixed Panning
The basic mid and side audio signals M, S are extracted by weighting each of the left and right audio signals L, R equally to target an audio source assumed to exist in equal proportions in the left and right audio signals L, R (a centered audio source). For non-centered audio sources, the weighting of the left and right audio signals L, R in equations 1 and 2 should be adjusted.
For instance, the coefficients of the left and right audio signals L, R in equations 1 and 2 may be changed from 0.5 under the constraint that the sum of the coefficients should equal 1. This enables the coefficients for the left and right audio signals L, R in equation 1 to be e.g. 0.7 and 0.3 (and equivalently for the side signal construction in equation 2) for an audio source which appears mainly in the left audio signal. However, this leads to an undesirable situation wherein the magnitude of the mid and side audio signals M, S varies considerably when the audio source moves between the left and right audio signals L, R (assuming constant power law mixing).
To circumvent this issue the mid and side audio signals M, S may be extracted from the left and right audio signals L, R in accordance with

M = cos(Θx) L + sin(Θx) R    (13)
S = sin(Θx) L − cos(Θx) R    (14)
for an audio source Sx with a known or estimated “target” panning parameter Θx, wherein the left and right audio signals L, R can be reconstructed as

L = cos(Θx) M + sin(Θx) S    (15)
R = sin(Θx) M − cos(Θx) S    (16)
For example, if Θx=0, indicating an audio source appearing only in the left audio signal, the mid audio signal M becomes equal to the left audio signal L and the side audio signal S is equal (up to sign) to the right audio signal R, and vice versa for Θx = π/2.
Similarly, a mid audio signal M for an audio source appearing with equal magnitudes in the left and right audio signals L, R (Θx = π/4) weights the left and right audio signals L, R equally. Furthermore, an audio source located center to left or center to right will result in a weighting of the left and right audio signals L, R with Θx in the range of 0 to π/4 and π/4 to π/2, respectively. The coefficients used to weight the left and right audio signals L, R to construct the mid and side audio signals M, S, and vice versa, are thus based on a fixed target panning parameter Θx.
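The fixed-panning weighting described above can be sketched as follows, assuming real-valued constant-power mixing; the cosine/sine coefficient convention shown is one consistent choice, and all names are illustrative.

```python
# Sketch of mid/side extraction with a fixed target panning parameter
# Theta_x, and the matching reconstruction. The cos/sin coefficient
# convention is an assumption of this illustration.
import numpy as np

def fixed_panning_mid_side(left, right, theta_x):
    mid = np.cos(theta_x) * left + np.sin(theta_x) * right
    side = np.sin(theta_x) * left - np.cos(theta_x) * right
    return mid, side

def fixed_panning_left_right(mid, side, theta_x):
    left = np.cos(theta_x) * mid + np.sin(theta_x) * side
    right = np.sin(theta_x) * mid - np.cos(theta_x) * side
    return left, right

# A source panned at Theta_x lands entirely in the mid signal,
# and the left/right pair is reconstructed losslessly.
rng = np.random.default_rng(0)
src = rng.standard_normal(64)
theta_x = 0.2
l, r = np.cos(theta_x) * src, np.sin(theta_x) * src
mid, side = fixed_panning_mid_side(l, r, theta_x)
assert np.allclose(mid, src) and np.allclose(side, 0.0)
assert np.allclose(fixed_panning_left_right(mid, side, theta_x), (l, r))
```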
An aspect of the present invention relates to creation of a target mid and/or side audio signal M, S based on at least one of a target panning parameter Θ and a target phase difference parameter Φ obtained for each time segment and frequency band of a stereo audio signal. The target panning parameter Θ represents an estimate or indication of the magnitude ratio between the left and right audio signals L, R in each time segment and frequency band, for sounds corresponding to the target source or sources. The target phase difference parameter Φ represents an estimate or indication of the phase difference between the left and right audio signals L, R in each time segment and frequency band, for sounds corresponding to the target source or sources.
In some implementations, the panning parameter Θ and/or phase difference parameter Φ of each time segment and frequency band 111, 112 is the median, mean, mode, numbered percentile, maximum or minimum panning parameter Θ and/or phase difference parameter Φ of the time segment. In general, the detected panning parameter θ and/or detected phase difference parameter ϕ exists for each sample of the stereo audio signal. However, the target panning parameter Θ and/or phase difference parameter Φ used to create the target mid and side audio signals M, S are not necessarily of such fine granularity and may represent a statistical feature (e.g. the average) of multiple samples, such as an average of all samples within a predetermined time segment. For example, whereas the detected panning parameter θ and/or detected phase difference parameter ϕ may assume different values more than one thousand times per second, the target panning parameter Θ and/or phase difference parameter Φ may only change a few times per second.
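Collapsing finely sampled detected values to one coarse target parameter per time segment can be sketched as below; the statistic names and the `median` default are illustrative assumptions.

```python
# Sketch of deriving a coarse target parameter from finely sampled
# detected values: one statistic over all frames in a segment/band.
import numpy as np

def target_panning_for_segment(detected_theta, statistic="median"):
    """Collapse per-frame detected values to one target value per segment."""
    reducers = {
        "median": np.median,
        "mean": np.mean,
        "max": np.max,
        "min": np.min,
    }
    return reducers[statistic](detected_theta)

# Detected values fluctuate frame by frame; the target parameter does not.
detected = np.array([0.70, 0.72, 0.69, 0.75, 0.71])
assert np.isclose(target_panning_for_segment(detected), 0.71)
```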
The extraction unit 12 obtains the left and right audio signals L, R (stereo audio signal) as a plurality of consecutive time segments, each time segment comprising a representation of a portion of the left and right audio signals L, R. Alternatively, the extraction unit 12 is configured to divide the left and right audio signals L, R into the plurality of consecutive time segments. Each time segment comprises two or more frequency bands of the stereo audio signal, wherein each frequency band in each time segment is associated with at least one of an obtained target panning parameter Θ and an obtained target phase difference parameter Φ.
In the embodiment shown in
With the left and right audio signals L, R and a target panning parameter Θ and/or target phase difference parameter Φ the extraction unit 12 extracts the target mid audio signal M and the target side audio signal S. In some implementations, the extraction unit 12 extracts only one of the target mid M and side S audio signal, such as only the mid audio signal M.
To allow modeling of the interchannel phase difference between the left and right audio signals L, R, the genuine source phase of a target audio source is defined as being proportionally closer to the phase of the left or right audio signal which exhibits the greater magnitude. That is, for a stereo audio signal which is dominated in power or magnitude by the left audio signal L, the phase of the left audio signal L will be closer to the genuine source phase of the audio source compared to the phase of the right audio signal R, and vice versa for a stereo audio signal dominated by the right audio signal R. An audio source Sx is modeled to appear in the left and right audio signals L, R according to the following mixing equations

L = |Sx| cos(Θx) e^(i(ψx + Φx sin²(Θx)))    (17)
R = |Sx| sin(Θx) e^(i(ψx − Φx cos²(Θx)))    (18)
wherein ψx is the phase of the audio source Sx and Φx is the phase difference parameter for the target source.
Based on the mixing model described in equations 17 and 18 above, the extraction device 12 may employ the following analysis equations to target a source with target panning parameter Θ and target phase difference parameter Φ:

M = cos(Θ) e^(i(−Φ sin²(Θ))) L + sin(Θ) e^(i(Φ cos²(Θ))) R    (19)
S = sin(Θ) e^(i(−Φ sin²(Θ))) L − cos(Θ) e^(i(Φ cos²(Θ))) R    (20)
It is understood that Θ and Φ can vary with time and frequency, corresponding with the characteristics of a targeted source. The mid and side signals of equations 19 and 20 may be termed “target mid and side signals.” Therefore M and S may be understood as signals comprised of separate signal components corresponding to the various values of Θ and Φ for each time segment and frequency band 111, 112. The weighting of the phase difference parameter Φ for the left and right audio signals L, R with sin²(Θ) and cos²(Θ) is merely exemplary, and other weighting functions based on Θ are possible; for instance, sin²(Θ) and cos²(Θ) may be replaced with other functions of Θ. Similarly, the model of equations 17 and 18, which uses weighting of the phase difference parameter Φ for the left and right audio signals L, R with sin²(Θx) and cos²(Θx), is also merely exemplary.
In equation 19 the target mid audio signal M is based on a weighted sum of the left and right audio signals L, R, wherein the weight of the left audio signal is cos(Θ) e^(i(−Φ sin²(Θ))) and the weight of the right audio signal is sin(Θ) e^(i(Φ cos²(Θ))).
Similarly, in equation 20 the target side audio signal S is based on a weighted difference between the left and right audio signals L, R, wherein the weight of the left audio signal is sin(Θ) e^(i(−Φ sin²(Θ))) and the weight of the right audio signal is cos(Θ) e^(i(Φ cos²(Θ))).
The weights of the left and right audio signals L, R in the weighted sum and weighted difference of equations 19 and 20 used to extract the mid and side audio signals M, S are complex valued and comprise a real valued magnitude factor, e.g. cos(Θ), and a complex valued phase factor, e.g. e^(i(−Φ sin²(Θ))).
For instance, the real valued magnitude factor may be based on Θ such that the left or right audio signal with a greater magnitude is provided with a greater weight in the weighted sum. Moreover, the complex valued factor may be based on Φ such that a greater phase difference Φ means that there is a greater difference between the respective phases of the weights in the weighted sum. In some implementations, the complex valued factor of each weight in the weighted sum is based on both Φ and Θ, e.g. one of the weights is cos(Θ) e^(i(−Φ sin²(Θ))).
Analogously, the real valued magnitude factor in the weighted difference (for the target side signal extraction) may be based on the target panning parameter Θ such that the left or right audio signal with a greater magnitude is provided with a smaller weight in the weighted difference. Moreover, the complex valued factor may be based on Φ such that a greater phase difference Φ means that there is a greater difference between the respective phases of the weights in the weighted difference. In some implementations, the complex valued factor of each weight in the weighted difference is based on both Φ and Θ, e.g. one of the weights is sin(Θ) e^(i(−Φ sin²(Θ))).
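The complex valued weighting of this and the preceding paragraphs can be illustrated on a single time-frequency tile. The sketch below assumes one self-consistent choice of phase-factor signs (the left weight conjugating the left channel's modeled phase offset); all names and values are illustrative, not taken from any actual implementation.

```python
# Sketch of complex-weighted target mid/side extraction on one
# time-frequency tile, under an assumed sign convention for the
# phase factors of the mixing model.
import numpy as np

def target_mid_side(L, R, theta, phi):
    """Extract target mid/side tiles from complex STFT tiles L, R."""
    wl = np.exp(-1j * phi * np.sin(theta) ** 2)   # left phase factor
    wr = np.exp(+1j * phi * np.cos(theta) ** 2)   # right phase factor
    mid = np.cos(theta) * wl * L + np.sin(theta) * wr * R
    side = np.sin(theta) * wl * L - np.cos(theta) * wr * R
    return mid, side

# A source mixed with panning theta and interchannel phase difference phi
# is captured entirely by the mid signal and cancelled in the side signal.
theta, phi, psi = 0.4, 0.6, 1.1          # assumed target/source parameters
Sx = 1.5 * np.exp(1j * psi)
L = np.cos(theta) * Sx * np.exp(1j * phi * np.sin(theta) ** 2)
R = np.sin(theta) * Sx * np.exp(-1j * phi * np.cos(theta) ** 2)
mid, side = target_mid_side(L, R, theta, phi)
assert np.isclose(mid, Sx) and abs(side) < 1e-12
```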
In some implementations, only one of the target panning and phase difference parameter Θ, Φ is obtained for at least one time segment and frequency band 111, 112, wherein the other one is assigned a default value. For instance, if only the target panning parameter Θ is obtained the target phase difference parameter Φ is assumed to be Φ = 0, and if only the target phase difference parameter Φ is obtained the target panning parameter Θ is assumed to be Θ = π/4.
These default values assume an audio source centered in the stereo audio signal with no interchannel phase difference, however other default values are possible.
The extraction device 12 obtains the target mid audio signal M by combining partial mid signal components or representations M(t, n) of each frequency band in each time segment. Similarly, the extraction device 12 obtains the target side audio signal S by combining partial side signal representations S(t, n) of each frequency band in each time segment. A partial mid signal component or representation M(t, n) is created for each frequency band 1 … B of a time segment t, and by combining all the partial mid signal representations M(t, n) for a time segment t, a target mid audio signal time segment M(t) is created, wherein a sequence of target mid audio signal time segments M(t) forms the target mid audio signal M. The target side audio signal S is formed in an analogous manner. Combining the partial mid signal representations M(t, n) for frequency bands 1 … B may comprise processing the partial mid signal representations M(t, n) with a synthesis filterbank.
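The combination of partial representations can be sketched as follows. A real implementation would use a synthesis filterbank as noted above; plain per-segment summation and concatenation are assumed here purely for illustration, and the data layout is an assumption of this sketch.

```python
# Sketch of forming a target signal from partial band representations
# M(t, n): combine the B band parts of each time segment, then
# concatenate the segments. Plain summation stands in for a
# synthesis filterbank in this illustration.
import numpy as np

def combine_partials(partials):
    """partials[t][n]: time-domain part for segment t, band n."""
    segments = [np.sum(bands, axis=0) for bands in partials]   # combine bands
    return np.concatenate(segments)                            # join segments

# Two segments, two bands each; complementary bands sum back
# to the full-band segment content.
low = np.array([1.0, 1.0, 1.0, 1.0])
high = np.array([0.5, -0.5, 0.5, -0.5])
out = combine_partials([[low, high], [low, high]])
assert out.shape == (8,)
assert np.allclose(out[:4], low + high)
```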
With further reference to the flow chart in
At step S3 a partial mid signal representation M(t, n) is extracted by the mid/side extractor 12 (e.g. in accordance with equation 19 in the above), wherein the partial mid signal representation is based on a weighted sum of the left and right audio signals L, R. The weight of each of the left and right audio signals L, R is based on at least one of the target panning parameter Θ and the target phase difference parameter Φ of each frequency band 111, 112 and time segment. Analogously, a partial side signal representation S(t, n) is extracted at step S31 (e.g. in accordance with equation 20 in the above), wherein the partial side signal representation is based on a weighted difference of the left and right audio signals L, R.
In some implementations, the target side audio signal S is not computed, and the extractor unit 12 is configured to extract exclusively the target mid audio signal M. Alternatively, the target side audio signal S is computed but scaled down at step S51 by a small value, or zero, prior to reconstruction of the left and right audio signals L, R at step S6 (e.g. using the reconstruction equations 21 and 22 in the below). As the target mid audio signal M in some cases is expected to capture substantially all energy of the target audio source, the target side audio signal S may comprise mainly undesirable noise or background audio which can be omitted or attenuated.
Following steps S3 and S31 the method goes to steps S4 and S41, comprising forming target mid and side audio signals M, S by combining the partial mid and side signal representations M(t, n), S(t, n) into target mid and side audio signals M, S respectively. The method continues with steps S5 and S51, comprising performing processing of the target mid and/or side audio signal M, S; examples of such processing involve attenuating the target side audio signal S or providing the target mid audio signal M to a mono source separator as will be described in connection to
Lastly, the method goes to step S6 comprising reconstructing the left and right audio signals L, R from the target mid and side audio signals M, S. The reconstruction of the left and right audio signals L, R from the target mid and side audio signals M, S is performed by a synthesis arrangement as will be described more in detail in connection to
As an example of the operation of the analysis arrangement 10, an audio source of the left and right audio signals L, R with time varying panning is considered. The audio source is panned, under the constant power law, at a constant rate from fully right to fully left as time t advances from 0 to 10 seconds. For basic mid and side audio signals extracted for this audio source, e.g. in accordance with equations 1 and 2 in the above, the mid audio signal M will capture substantially all energy of the audio source at t=5 and substantially no energy at t=0 and t=10, whereas the side audio signal S will capture substantially all the energy of the audio source at t=0 and t=10 and substantially no energy at t=5. The target mid audio signal M, extracted with varying panning parameters and/or phase difference parameters Θ, Φ for each time segment and frequency band, will contain substantially all of the audio source energy for any t∈[0, 10], whereas the target side audio signal S will contain substantially no energy of the time varying audio source.
The constant power law ensures that the audio signal power (which is proportional to the sum of the squares of the amplitudes) of the left and right audio signals L, R remains constant for all panning angles. For instance, the audio signal amplitudes of the left and right audio signals are scaled with a scaling factor which depends on the panning angle, to ensure that the sum of the squares of the left and right audio signals is equal for all panning angles. In contrast to the linear panning law, which keeps the sum of the signal amplitudes constant and thereby allows the total signal power to vary with the panning angle, the constant power panning law ensures that the perceived audio signal power is constant for all panning angles. The constant power law is e.g. described in “Loudness Concepts & Pan Laws” by Anders Øland and Roger Dannenberg, Introduction to Computer Music, Carnegie Mellon University.
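The constant power law can be illustrated with panning gains cos(p) and sin(p), whose squared sum is one for every panning angle; this particular gain parameterization is an assumption of the sketch.

```python
# Sketch of the constant-power panning law: gains cos(p), sin(p) keep
# the summed power gl^2 + gr^2 constant for every panning angle p.
import numpy as np

def constant_power_gains(p):
    """p in [0, pi/2]: 0 = fully left, pi/2 = fully right."""
    return np.cos(p), np.sin(p)

for p in np.linspace(0.0, np.pi / 2, 11):
    gl, gr = constant_power_gains(p)
    assert np.isclose(gl ** 2 + gr ** 2, 1.0)       # power constant

# Linear panning gains (1-a, a) keep the amplitude sum at 1 instead;
# at center (a = 0.5) the summed power drops to 0.5, not 1.
assert np.isclose(0.5 ** 2 + 0.5 ** 2, 0.5)
```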
This demonstrates how the target mid and side audio signals M, S are capable of targeting and separating audio source(s) that vary in time and frequency in a stereo audio signal. Additionally, if a second audio source of a separate frequency is moved between the left and right audio signals L, R at a rate different from the first audio source (or is stationary, e.g. at a center panning), the target mid audio signal M targets the first and second audio sources in individual frequency bands, meaning that substantially all energy from both audio sources will be present in the target mid audio signal M despite the audio sources being of different frequency while also being shifted between the left and right audio signals L, R at different rates.
In
The target phase difference parameter Φ may be calculated, e.g. by the parameter extractor 14, as the typical or dominant phase difference between the left and right audio signals L, R in each time segment and frequency band 111, 112. For instance, in the time-frequency domain (e.g. in the STFT domain), detected phase differences may be calculated for each STFT tile as ϕ=Arg(R/L), and by analyzing the distribution of ϕ, a value of Φ may be estimated as a dominant or typical value for the time segment and frequency band. Similarly, detected panning parameters θ may be calculated as θ=arctan(|R|/|L|) for each STFT tile, and by analyzing the distribution thereof, a value of Θ may be estimated as a dominant or typical value for the time segment and frequency band, wherein L and R represent the left and right audio signals L, R in the time segment and frequency band 111, 112. Typical or dominant values estimated from a distribution may be the median, mean, mode, a selected percentile, or the minimum or maximum value.
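The per-tile detection and dominant-value estimation described above can be sketched as follows. The weighted-median choice of "dominant" value and the synthetic test signal are assumptions for illustration; Arg(R/L) is computed as angle(R·conj(L)), which is mathematically equivalent but avoids dividing by small values of L.

```python
import numpy as np

def detect_pan_phase(L, R):
    """Per-STFT-tile detected panning angle and inter-channel phase difference."""
    theta = np.arctan2(np.abs(R), np.abs(L))   # arctan(|R|/|L|), in [0, pi/2]
    phi = np.angle(R * np.conj(L))             # Arg(R/L) without an explicit division
    return theta, phi

def dominant(values, weights=None):
    """One 'typical' value per time segment / band: here, the weighted median."""
    order = np.argsort(values)
    v = values[order]
    w = np.ones_like(v) if weights is None else np.asarray(weights)[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cdf, 0.5)]

# synthetic band: a source panned at theta = 0.3 rad with a 0.4 rad phase offset
rng = np.random.default_rng(1)
s = rng.standard_normal(256) + 1j * rng.standard_normal(256)
L = np.cos(0.3) * s
R = np.sin(0.3) * np.exp(1j * 0.4) * s
theta, phi = detect_pan_phase(L, R)
Theta, Phi = dominant(theta), dominant(phi)
```

For this noiseless synthetic band every tile detects the same θ and ϕ, so the estimated Θ and Φ recover the panning angle and phase offset exactly; with a real mixture the distribution would spread and the chosen statistic (median, mode, percentile, etc.) matters.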
In “Dialog Enhancement via Spatio-Level Filtering and Classification” (Convention Paper 10427 of the 149th Convention of the Audio Engineering Society, by Master, A. et al.), a method is proposed for obtaining indicators of the average value and spread of the panning and interchannel phase difference, labelled thetaMiddle, thetaWidth, phiMiddle and phiWidth. The panning parameter Θ used by the analysis arrangement 10 could be based on the thetaMiddle proposed in this convention paper and, similarly, the target phase difference parameter Φ used by the analysis arrangement 10 could be based on the phiMiddle of the so-called “Shift and Squeeze” or “S&S” parameters proposed in this paper.
For each time segment and frequency band within it, the audio processing system may create a 51-bin histogram of θ weighted by the squared magnitude U². The system does the same for the detected phase difference parameter ϕ, as well as for a version of ϕ ranging from 0 to 2π, called ϕ2; these histograms, however, each use 102 bins. The histograms are each smoothed over their given dimension and across time segments. For the smoothed θ histogram, the system detects the target panning parameter as the highest peak, called thetaMiddle, and also the width around this peak necessary to capture 40% of the energy in the histogram, called thetaWidth. It does the same for ϕ and ϕ2, recording phiMiddle, phi2Middle, phiWidth and phi2Width, but requiring 80% energy capture for the widths. The system records final values for phiMiddle and phiWidth based on which version had the higher concentration in phi space, as indicated by a smaller phiWidth value.
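The histogram procedure can be sketched as below. This is an illustrative approximation: the moving-average smoothing, the grow-toward-more-energy width expansion, and the synthetic data are assumptions, and the convention paper's exact smoothing and bin handling may differ.

```python
import numpy as np

def peak_and_width(values, energy, n_bins, lo, hi, capture=0.4, smooth=5):
    """Energy-weighted histogram peak (the 'Middle' value) and the width
    around that peak needed to capture the given fraction of histogram energy."""
    hist, edges = np.histogram(values, bins=n_bins, range=(lo, hi), weights=energy)
    kernel = np.ones(smooth) / smooth
    hist = np.convolve(hist, kernel, mode="same")   # smooth over the histogram dimension
    peak = int(np.argmax(hist))
    total = hist.sum()
    left = right = peak
    while hist[left:right + 1].sum() < capture * total:
        # grow the window toward whichever neighbouring bin holds more energy
        grow_left = left > 0 and (right == n_bins - 1 or hist[left - 1] >= hist[right + 1])
        if grow_left:
            left -= 1
        else:
            right += 1
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[peak], centers[right] - centers[left]

# e.g. detected panning angles theta weighted by squared magnitudes U^2,
# clustered around 0.8 rad, in a 51-bin histogram over [0, pi/2]
rng = np.random.default_rng(2)
theta = np.clip(rng.normal(0.8, 0.05, 2000), 0.0, np.pi / 2)
U2 = rng.uniform(0.5, 1.5, 2000)
thetaMiddle, thetaWidth = peak_and_width(theta, U2, 51, 0.0, np.pi / 2, capture=0.4)
```

For the ϕ and ϕ2 histograms the same routine would be called with 102 bins and `capture=0.8`, and the (phiMiddle, phiWidth) pair with the smaller width would be kept.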
For instance, the parameter extractor 14 in
With reference to
As indicated in
The duration of each time segment may correspond to 10 ms to 1 s of the stereo audio signal. In some implementations each time segment corresponds to a plurality of frames of the stereo audio signal, such as 10 frames, wherein each frame e.g. represents 10 to 200 ms of the stereo audio signal, such as 50 to 100 ms of the stereo audio signal. Experiments have shown that using 10 overlapping frames to represent one time segment, wherein each frame is 50 to 100 ms long with 75% overlap, is a good trade-off between rapid response, parameter stability, parameter reliability and computational expense for typical dialog sources. However, time segments of longer or shorter duration are also considered, specifically for audio sources other than dialog, such as music.
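The relationship between frame length, overlap, and segment duration stated above can be made concrete with a small helper. The specific 64 ms frame length in the example is an assumption chosen to fall inside the 50 to 100 ms range mentioned in the text.

```python
def segment_duration(frame_ms, n_frames, overlap=0.75):
    """Span of n_frames overlapping frames: (n_frames - 1) hops plus one frame."""
    hop_ms = frame_ms * (1.0 - overlap)
    return (n_frames - 1) * hop_ms + frame_ms

# 10 frames of 64 ms with 75% overlap (16 ms hop): 9 * 16 + 64 = 208 ms
print(segment_duration(64, 10))  # -> 208.0
```

So 10 overlapping frames of 50 to 100 ms at 75% overlap yield time segments of roughly 160 to 325 ms, consistent with the 10 ms to 1 s range given for the segment duration.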
With further reference to
Reconstruction of Left and Right Audio Signals from Target Mid and Side Audio Signals
for the target mid and side audio signals M, S extracted with equations 20 and 21. The target panning parameter Θ and/or target phase difference parameter Φ obtained by the synthesis arrangement is here equal to the target panning parameter Θ and/or phase difference parameter Φ determined or obtained by the analysis arrangement. Alternatively, the target panning parameter Θ and/or phase difference parameter Φ used by the synthesis arrangement are alternative parameters ΘALT, ΦALT and/or at least one of Θ and Φ is unspecified and replaced with a default value for at least one time segment and frequency band.
As seen in equations 22 and 23, the reconstruction of the left and right audio signals L, R is based on a weighted sum and weighted difference of the target mid and side audio signals M, S. However, these equations may be modified to downweight or eliminate the contribution from the side signal. For instance, the left and right audio signals L, R may be reconstructed from only the target mid audio signal M with mid weighting factors of cos(Θ)e^(−iΦ sin²(Θ)) and sin(Θ)e^(iΦ cos²(Θ)) for the left and right audio signals L, R respectively.
Similarly, the left and right audio signals L, R may be reconstructed by also taking the target side audio signal S into account. In such cases the left audio signal L is based on a weighted sum of the target mid audio signal M and target side audio signal S, wherein the weight applied to the target mid audio signal M is cos(Θ)e^(−iΦ sin²(Θ)) and the weight applied to the target side audio signal S is the corresponding side weighting factor of equation 22; the right audio signal R is reconstructed analogously with the weights of equation 23.
and ΦALT=ΦCEN=0 for all time segments and frequency bands to create a center-panned stereo audio signal comprising a centered left and right audio signals LCEN, RCEN with no inter-channel phase difference.
The alternative mid and side weighting factors may be defined as the mid and side weighting factors described in equations 22 and 23, with Θ, Φ replaced by the alternative counterparts ΘALT, ΦALT obtained for each time segment and frequency band. In some implementations, the alternative left and right audio signals LALT, RALT are created using only the target mid audio signal M and the alternative mid weighting factors (being e.g. the coefficients of the target mid audio signal M in equations 22 and 23 with ΘALT, ΦALT replacing Θ, Φ). Alternatively, the alternative left and right audio signals LALT, RALT are created using both the target mid and target side audio signals M, S and the alternative mid and side weighting factors (being e.g. the coefficients of the target mid audio signal M and target side audio signal S in equations 22 and 23 respectively, with ΘALT, ΦALT replacing Θ, Φ).
In the following,
The target mid audio signal M is provided to a mono source separation system 30 configured to separate an audio source in the target mid audio signal M and produce a source separated mid audio signal Msep. For instance, the mono source separation system 30 produces a source separated mid audio signal Msep with enhanced intelligibility of at least one audio source present in the target mid audio signal M.
The source separated mid audio signal Msep is provided together with the target side audio signal S to a mid/side processor 40 to produce processed mid and side audio signals M′, S′. The mid/side processor 40 may apply a gain to the source separated mid audio signal Msep and/or attenuate the target side audio signal S. In some cases, the mid/side processor 40 sets the target side audio signal S to zero. The processed mid and side audio signals M′, S′ are then provided to a synthesis arrangement 20 comprising a left and right audio signal reconstruction unit 22 which reconstructs processed left and right audio signals L′, R′ based on the processed mid and side audio signals M′, S′ and the panning and/or phase difference parameters Θ, Φ obtained by the analysis arrangement 10.
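The mid/side processing step can be sketched as a simple gain stage. The function name and the particular gain values (+6 dB on the separated mid, side fully muted) are illustrative assumptions, not values prescribed by this disclosure.

```python
import numpy as np

def mid_side_process(M_sep, S, mid_gain_db=6.0, side_gain_db=-np.inf):
    """Boost the source-separated mid signal and attenuate (or zero) the target side.

    side_gain_db = -inf mutes the side signal entirely, as in the case where
    the mid/side processor sets the target side audio signal to zero."""
    g_mid = 10.0 ** (mid_gain_db / 20.0)
    g_side = 0.0 if np.isinf(side_gain_db) else 10.0 ** (side_gain_db / 20.0)
    return g_mid * M_sep, g_side * S

# example: +6 dB on the separated mid, side attenuated by 12 dB
M_sep = np.array([0.1, -0.2, 0.3])
S = np.array([0.05, 0.05, -0.05])
M_p, S_p = mid_side_process(M_sep, S, mid_gain_db=6.0, side_gain_db=-12.0)
```

Relative to this simple per-sample gain, a practical mid/side processor could of course apply time- and frequency-dependent gains per segment and band.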
In some implementations, the source separated mid audio signal Msep is provided directly to the synthesis arrangement 20, which reconstructs processed left and right audio signals L′, R′ based on at least the source separated mid audio signal Msep (and optionally the target side audio signal S) and the target panning and/or phase difference parameters Θ, Φ.
Alternatively, the mono source separation system 30 is omitted and the target mid audio signal M and target side audio signal S are provided directly to the mid/side processor 40, which extracts the processed mid and side audio signals M′, S′ from the target mid and side audio signals M, S.
The analysis arrangement 10a obtains arbitrary left and right audio signals L, R and the target panning and/or phase difference parameters Θ, Φ for each time segment and frequency band and extracts a target mid audio signal M and a target side audio signal S. The target mid and side audio signals M, S are provided to the synthesis arrangement 20a, which also obtains a set of alternative centered panning and/or phase difference parameters ΘCEN, ΦCEN indicating a center-panned stereo audio signal. Accordingly, the analysis arrangement 10a and synthesis arrangement 20a create a center-panned stereo signal comprising centered left and right audio signals LCEN, RCEN from an arbitrary original stereo signal.
The center-panned alternative left and right audio signals LCEN, RCEN are provided to a stereo processing system 50 configured to perform stereo source separation on center-panned stereo sources and output processed centered left and right audio signals L′CEN, R′CEN. The processed centered left and right audio signals L′CEN, R′CEN feature e.g. enhanced intelligibility of at least one audio source present in the centered left and right audio signals LCEN, RCEN.
The processed centered left and right audio signals L′CEN, R′CEN are then provided to a second analysis arrangement 10b which extracts processed mid and side audio signals M′, S′ using the processed centered left and right audio signals L′CEN, R′CEN and the centered panning and/or phase difference parameters ΘCEN, ΦCEN. Lastly, a second synthesis arrangement 20b utilizes the original panning and/or phase difference parameters Θ, Φ obtained or determined by the first analysis arrangement 10a to reconstruct the processed left and right audio signals L′, R′.
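The analysis-synthesize-analyze-synthesize chain above can be sketched with a simplified real-valued rotation that omits the phase difference parameter Φ and the per-band processing. The rotation form, the center angle π/4 (center panning under the arctan(|R|/|L|) convention), and the example source angle are assumptions for the sketch, not the exact equations 20 to 23 of this disclosure.

```python
import numpy as np

THETA_CEN = np.pi / 4  # center panning under the arctan(|R|/|L|) convention

def analyze(L, R, theta):
    # rotate left/right into mid/side using the panning angle (phase term omitted)
    M = np.cos(theta) * L + np.sin(theta) * R
    S = -np.sin(theta) * L + np.cos(theta) * R
    return M, S

def synthesize(M, S, theta):
    # inverse rotation from mid/side back to left/right
    L = np.cos(theta) * M - np.sin(theta) * S
    R = np.sin(theta) * M + np.cos(theta) * S
    return L, R

# a source hard-panned toward the right (theta0 = 70 degrees)
theta0 = np.deg2rad(70.0)
rng = np.random.default_rng(3)
s = rng.standard_normal(512)
L, R = np.cos(theta0) * s, np.sin(theta0) * s

M, S = analyze(L, R, theta0)                 # first analysis: original parameters
L_cen, R_cen = synthesize(M, S, THETA_CEN)   # first synthesis: centered parameters
# ... L_cen, R_cen would be handed to the stereo processing system 50 here ...
M2, S2 = analyze(L_cen, R_cen, THETA_CEN)    # second analysis: centered parameters
L2, R2 = synthesize(M2, S2, theta0)          # second synthesis: original parameters
```

In this simplified model the intermediate signals L_cen, R_cen carry the source with equal amplitude in both channels (center-panned), and with no processing applied in between, the final synthesis with the original parameters recovers the original left and right signals.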
It is noted that the centered panning and/or phase difference parameters ΘCEN, ΦCEN indicating center panning with no interchannel phase difference are only one of many possible values of the alternative panning and/or phase difference parameters ΘALT, ΦALT. For instance, the centered alternative parameters ΘCEN, ΦCEN may be replaced with arbitrary alternative parameters ΘALT, ΦALT which e.g. indicate a strictly right- or left-panned stereo audio signal with or without a non-zero phase difference.
It is further noted that the target mid and side audio signals M, S fed between the first analysis arrangement 10a and first synthesis arrangement 20a may be subject to mono signal separation processing and/or mid and side processing as described in connection to
In some implementations, the alternative (e.g. centered) left and right audio signals LCEN, RCEN are created by the first synthesis arrangement 20a using only the target mid audio signal M. Analogously, the second analysis arrangement 10b may be configured to extract only the processed target mid audio signal M′.
In the implementation shown in
The stereo softmask estimator 60 is configured to determine or obtain the panning and/or phase difference parameters Θ, Φ and provide these parameters to the analysis arrangement 10. Alternatively, the panning and/or phase difference parameters Θ, Φ are obtained from elsewhere by the analysis arrangement 10 or determined by the analysis arrangement 10 based on the left and right audio signals L, R.
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the disclosure discussions utilizing terms such as “processing”, “computing”, “calculating”, “determining”, “analyzing” or the like, refer to the action and/or processes of a computer hardware or computing system, or similar electronic computing devices, that manipulate and/or transform data represented as physical, such as electronic, quantities into other data similarly represented as physical quantities.
It should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention. Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as a method or combination of elements of a method that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or element of a method forms a means for carrying out the method or element of a method. Note that when the method includes several elements, e.g., several steps, no ordering of such elements is implied, unless specifically stated. Furthermore, an element described herein of an apparatus embodiment is an example of a means for carrying out the function performed by the element for the purpose of carrying out the invention. In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description. The person skilled in the art realizes that the present invention by no means is limited to the preferred embodiments described above. On the contrary, many modifications and variations are possible within the scope of the appended claims. For example, the stereo processing system in
Various features and aspects will be appreciated from the following enumerated exemplary embodiments (“EEEs”):
Number | Date | Country | Kind
---|---|---|---
22183794.1 | Aug 2022 | EP | regional
This application claims priority of the following priority applications: U.S. provisional application 63/318,226, filed 9 Mar. 2022, U.S. provisional application 63/423,786, filed 8 Nov. 2022, and European Patent application no. 22183794.1, filed 8 Jul. 2022.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2023/063717 | 3/3/2023 | WO |
Number | Date | Country
---|---|---
63318226 | Mar 2022 | US
63423786 | Nov 2022 | US