1. Field of the Invention
The present invention relates to the field of audio and signal processing, and, more particularly, to eliminating an unwanted signal from a mixture of a desired signal and an unwanted signal.
2. Description of the Related Art
A voice sample can be a mixture of a desired signal and an unwanted signal. For example, the desired signal may be a voice, and the unwanted signal may be background music. If the background music is of a sufficient auditory level in relation to the auditory level of the voice, the desired signal may be masked by the background music such that the desired signal cannot be clearly understood. Therefore, it would be advantageous to eliminate or reduce the unwanted signal from the recording such that the desired signal can be more clearly understood.
Classical techniques for eliminating an unwanted signal include the Widrow-Hoff techniques, which are prone to certain errors. They are sensitive to errors in the phase estimates of a filter and an unwanted signal. They are also unreliable if a side signal and a mixture are not aligned properly.
In one aspect of the present invention, a method for eliminating or reducing an unwanted signal from a recorded mixture of a desired signal and an unwanted signal given an original recording of the unwanted signal is provided. The method includes aligning the recorded mixture and the original recording; computing a time-frequency representation of the recorded mixture to create a time-frequency recorded mixture; computing a time-frequency representation of the redefined original recording to create a time-frequency redefined original recording; determining a segment of time when only the redefined original recording is present in the recorded mixture; computing a value α(ω); generating a time-frequency mask using the value α(ω), the time-frequency recorded mixture and the time-frequency redefined original recording; applying the time-frequency mask on the recorded mixture to compute a time-frequency desired signal; and inverting the time-frequency desired signal to create a desired signal.
In another aspect of the present invention, a machine-readable medium having instructions stored thereon for execution by a processor to perform a method for eliminating or reducing an unwanted signal from a recorded mixture of a desired signal and an unwanted signal given an original recording of the unwanted signal is provided. The medium contains instructions for aligning the recorded mixture and the original recording; computing a time-frequency representation of the recorded mixture to create a time-frequency recorded mixture; computing a time-frequency representation of the redefined original recording to create a time-frequency redefined original recording; determining a segment of time when only the redefined original recording is present in the recorded mixture; computing a value α(ω); generating a time-frequency mask using the value α(ω), the time-frequency recorded mixture and the time-frequency redefined original recording; applying the time-frequency mask on the recorded mixture to compute a time-frequency desired signal; and inverting the time-frequency desired signal to create a desired signal.
In yet another embodiment of the present invention, a method for eliminating or reducing an unwanted signal from a recorded mixture of a desired signal and an unwanted signal given an original recording of the unwanted signal is provided. The method includes aligning the recorded mixture and the original recording; computing a time-scale representation of the recorded mixture to create a time-scale recorded mixture; computing a time-scale representation of the redefined original recording to create a time-scale redefined original recording; determining a segment of time when only the redefined original recording is present in the recorded mixture; computing a value α(ω); generating a time-scale mask using the value α(ω), the time-scale recorded mixture and the time-scale redefined original recording; applying the time-scale mask on the recorded mixture to compute a time-scale desired signal; and inverting the time-scale desired signal to create a desired signal.
The invention may be understood by reference to the following description taken in conjunction with the accompanying drawings, in which like reference numerals identify like elements, and in which:
Illustrative embodiments of the invention are described below. In the interest of clarity, not all features of an actual implementation are described in this specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
It is to be understood that the systems and methods described herein may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. In particular, at least a portion of the present invention is preferably implemented as an application comprising program instructions that are tangibly embodied on one or more program storage devices (e.g., hard disk, magnetic floppy disk, RAM, ROM, CD ROM, etc.) and executable by any device or machine comprising suitable architecture, such as a general purpose digital computer having a processor, memory, and input/output interfaces. It is to be further understood that, because some of the constituent system components and process steps depicted in the accompanying Figures are preferably implemented in software, the connections between system modules (or the logic flow of method steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations of the present invention.
A method is presented for eliminating an unwanted signal (e.g., background music, interference, etc.) from a mixture of a desired signal and the unwanted signal via time-frequency masking. Given a mixture of the desired signal and the unwanted signal, the goal of the present invention is to eliminate or at least reduce the effects of the unwanted signal to obtain an estimate of the desired signal. For example, although not so limited, the desired signal can be voice and the unwanted signal could be music. The goal, therefore, would be to eliminate or at least reduce the music from the mixture.
The method requires a side information signal, which is a signal whose instantaneous spectral powers are related to those of the unwanted signal. Such a signal is often available. For example, in the scenario where the unwanted signal is music from a digital recording (e.g., a compact disc) or an analog recording (e.g., a cassette tape), the original digital or analog recording can serve as the side information signal.
The method comprises three general steps, which are further elaborated throughout the present disclosure. First, the mixture and the side information signal are roughly aligned so that sounds in each occur at approximately the same time. Second, an estimate of the relationship (i.e., spectral weights) between the instantaneous spectral powers of the side information signal and its presence in the mixture is computed using a section of the mixture which contains little to no contribution from the desired signal but a relatively large contribution from the unwanted signal. Third, a time-frequency mask is created by comparing the weighted instantaneous spectral powers of the side information signal to the instantaneous spectral powers of the mixture. Time-frequency points which are likely dominated by the unwanted signal are suppressed to remove the unwanted signal from the mixture. The result is a clearer desired signal.
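The first step (rough alignment) can be sketched as follows. This is a minimal illustration, not the alignment procedure of the disclosure: it assumes the misalignment is a pure integer delay bounded by a known maximum, and it picks the delay that maximizes the correlation between the two signals. The names `estimate_lag`, `align`, and `max_lag` are hypothetical.

```python
import numpy as np

def estimate_lag(mixture, side, max_lag):
    """Return the integer lag that best aligns `side` with `mixture`.

    A positive lag means the side signal must be delayed by that many
    samples. Assumption: the offset is a pure delay within +/- max_lag.
    """
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = mixture[lag:], side[:len(side) - lag]
        else:
            a, b = mixture[:lag], side[-lag:]
        n = min(len(a), len(b))
        score = float(np.dot(a[:n], b[:n]))  # correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def align(mixture, side, max_lag):
    """Shift `side` by the estimated lag and pad or trim to len(mixture)."""
    lag = estimate_lag(mixture, side, max_lag)
    if lag >= 0:
        shifted = np.concatenate([np.zeros(lag), side])
    else:
        shifted = side[-lag:]
    out = np.zeros(len(mixture))
    n = min(len(mixture), len(shifted))
    out[:n] = shifted[:n]
    return out, lag
```

Because the later steps compare magnitudes rather than phases, this kind of coarse, sample-level alignment is intended only to bring corresponding sounds into the same analysis windows.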
Consider a recording of a mixture of a desired signal, s(t), and an unwanted signal, r(t),
x(t)=s(t)+r(t).
Although the present invention is not so limited, it is assumed solely for discussion purposes that the desired signal is voice and the unwanted signal is music. It is further assumed that the music signal in the recording was played on a stereo or the like, and that the original recording (i.e., the side information signal) is available, for example in the form of a cassette tape or compact disc. The original recording can be referred to as r0(t). The unwanted signal r(t) and original recording version r0(t) are clearly related, although in general r(t)≠r0(t) because r(t) has been altered by the recording process, as is known to those skilled in the art. That is, r(t) is a filtered version of r0(t) and this transforming filter is unknown. The goal of the present invention is to estimate s(t) given x(t) and r0(t).
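The signal model above can be illustrated with synthetic data. In this sketch the unknown transforming filter is stood in for by a short FIR filter `h`, which is purely an assumption for illustration; the function name `simulate_mixture` is likewise hypothetical.

```python
import numpy as np

def simulate_mixture(s, r0, h):
    """Build x(t) = s(t) + r(t), where r is r0 filtered by h.

    Assumption: the recording process is modeled as a short FIR filter `h`;
    in practice this filter is unknown.
    """
    r = np.convolve(r0, h)[:len(r0)]  # filtered version of the original
    return s + r, r

rng = np.random.default_rng(0)
s = rng.standard_normal(1000)    # stand-in for the desired signal
r0 = rng.standard_normal(1000)   # stand-in for the original recording
x, r = simulate_mixture(s, r0, np.array([0.8, 0.3, -0.1]))
```

Here `r` differs from `r0` sample by sample, yet the two remain related through `h`, which is the situation the method is designed for.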
The mixing in the time-frequency domain can be expressed using the windowed Fourier transform. The windowed Fourier transform of x is defined,
which is referred to as x̂(t,ω). The mixture in the time-frequency domain is expressed,
x̂(t,ω)=ŝ(t,ω)+r̂(t,ω).
It is assumed that a filter process can be modeled as r̂(t,ω)=h(ω)r̂0(t,ω), such that mixing is,
x̂(t,ω)=ŝ(t,ω)+h(ω)r̂0(t,ω).
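For sampled signals, the time-frequency representation can be sketched as a windowed DFT. The window type (Hann), window length, and hop size below are assumptions chosen for illustration and are not specified by the disclosure. The check at the end confirms that the transform is linear, so that the mixture transforms to the sum of its components.

```python
import numpy as np

def stft(x, win_len=256, hop=128):
    """Windowed DFT of x: rows are time frames, columns are frequency bins."""
    window = np.hanning(win_len)  # assumed window; any taper could be used
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

rng = np.random.default_rng(0)
s = rng.standard_normal(2048)
r = rng.standard_normal(2048)
# Linearity: the transform of the mixture is the sum of the transforms.
mixture_matches = np.allclose(stft(s + r), stft(s) + stft(r))
```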
A time-frequency mask, m(t,ω), is created such that the mask preserves most of the power of the desired source,
∥m(t,ω)ŝ(t,ω)∥²/∥ŝ(t,ω)∥² ≈ 1,
and results in a high output signal-to-interference ratio,
∥m(t,ω)ŝ(t,ω)∥² >> ∥m(t,ω)r̂(t,ω)∥².
For such a mask, converting m(t,ω)x̂(t,ω) back into the time domain yields an estimate of the desired signal, s(t). Thus, the goal of estimating s(t) can be achieved by determining an appropriate time-frequency mask m(t,ω).
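The masking and inversion can be sketched as follows for sampled signals. This is an illustration under stated assumptions, not the mask of the disclosure: it uses a binary mask that keeps a time-frequency point only when the mixture magnitude exceeds the weighted side-signal magnitude by a factor `margin` (a hypothetical parameter), a Hann-windowed DFT with assumed sizes `WIN` and `HOP`, and a weighted overlap-add inverse.

```python
import numpy as np

WIN, HOP = 256, 128  # assumed analysis window length and hop size

def stft(x):
    w = np.hanning(WIN)
    n_frames = 1 + (len(x) - WIN) // HOP
    frames = np.stack([x[i * HOP:i * HOP + WIN] * w for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, length):
    """Weighted overlap-add inverse of `stft`."""
    w = np.hanning(WIN)
    y = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(np.fft.irfft(X, n=WIN, axis=1)):
        y[i * HOP:i * HOP + WIN] += frame * w
        norm[i * HOP:i * HOP + WIN] += w ** 2
    covered = norm > 1e-8  # samples not reached by any frame stay zero
    y[covered] /= norm[covered]
    return y

def suppress(x, r0, alpha, margin=2.0):
    """Zero the points where the weighted side signal dominates the mixture.

    `alpha` is a scalar or a per-bin array of spectral weights; `margin`
    is an assumed threshold factor.
    """
    X, R0 = stft(x), stft(r0)
    mask = (np.abs(X) > margin * alpha * np.abs(R0)).astype(float)
    return istft(mask * X, len(x))
```

For example, if the desired and unwanted signals are tones at well-separated frequencies, `suppress(s + r0, r0, alpha=1.0)` returns a signal close to `s` away from the edges of the recording.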
In one embodiment, the method described herein can be performed with the following steps:
where α is set to maximize intelligibility. Although not so limited, a default value can be α=2.
An alternate embodiment of the method described herein will now be presented. Referring now to
As shown herein, α(ω)=|h(ω)|. Referring now to
may be inverted. The result of computing the inverted equation is inverting s into the time domain. Referring now to
Although the embodiments illustrated herein show continuous time signals, it is understood that the present invention can be applied to sampled signals. In discrete time, the windowed Fourier transform would be a windowed DFT (discrete Fourier transform) and the estimates of the filter |h(ω)| would be finite sums over discrete time points for each frequency center. In another embodiment, the windowed Fourier transform can be replaced by a wavelet transform, which is a time-scale representation defined by:
The present invention differs from classical Widrow-Hoff techniques. By its design, the Widrow-Hoff algorithm estimates h(ω) and then uses the estimate to subtract a filtered-by-h version of the original recording r0 from x: x−h*r0. In contrast, the method described herein needs only the modulus of h(ω), i.e., |h(ω)|, which, as previously stated, is denoted by α(ω). Accordingly, the present invention does not estimate the phase but is based on instantaneous time-frequency magnitude estimates. As a result, the present invention is more robust to alignment errors than Widrow-Hoff techniques.
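A magnitude-only estimate of α(ω) can be sketched as follows. The estimator form is an assumption made for illustration: for each frequency bin, it takes the square root of the ratio of summed squared magnitudes over a segment in which only the unwanted signal is present. The function names and the window parameters are hypothetical.

```python
import numpy as np

def stft_power(x, win_len=256, hop=128):
    """Squared magnitudes of a Hann-windowed DFT (frames x bins)."""
    w = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * w
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def estimate_alpha(x_segment, r0_segment, eps=1e-12):
    """Estimate alpha(w) = |h(w)| per bin from a voice-free segment.

    Assumed estimator: ratio of summed powers over time, per frequency.
    """
    num = stft_power(x_segment).sum(axis=0)
    den = stft_power(r0_segment).sum(axis=0)
    return np.sqrt(num / np.maximum(den, eps))

rng = np.random.default_rng(0)
r0 = rng.standard_normal(4096)
alpha_plain = estimate_alpha(0.5 * r0, r0)     # gain of 0.5, no phase change
alpha_flipped = estimate_alpha(-0.5 * r0, r0)  # same gain, phase inverted
```

Both calls yield the same weights, which illustrates why an estimate built only on magnitudes is unaffected by a phase error that would defeat direct subtraction.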
In an alternate embodiment of the present invention, time-varying filter estimates (i.e., adaptive updates to α(ω)) may be implemented. This would require a manual segmentation of the data. More specifically, the data (i.e., the two recordings x and r0) are split into segments of a particular time interval (e.g., five minutes). The method described herein is applied to each segment. In yet another embodiment of the present invention, the value of α(ω) may be set to 1.
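The segment-wise processing can be sketched as follows. For brevity, this illustration estimates a single broadband gain per segment rather than a full α(ω), which is a simplifying assumption. The names `split_segments` and `broadband_gain` are hypothetical, and a five-minute segment at a sample rate `fs` would correspond to `seg_len = 5 * 60 * fs`.

```python
import numpy as np

def split_segments(x, r0, seg_len):
    """Yield (start, x_segment, r0_segment) covering both recordings."""
    n = min(len(x), len(r0))
    for start in range(0, n, seg_len):
        stop = min(start + seg_len, n)
        yield start, x[start:stop], r0[start:stop]

def broadband_gain(x_seg, r0_seg, eps=1e-12):
    """Assumed simplification: one gain per segment instead of alpha(w)."""
    return float(np.sqrt(np.sum(x_seg ** 2) /
                         max(float(np.sum(r0_seg ** 2)), eps)))

rng = np.random.default_rng(0)
r0 = rng.standard_normal(2000)
# The gain changes halfway through, mimicking a time-varying filter.
x = np.concatenate([0.5 * r0[:1000], 2.0 * r0[1000:]])
gains = [broadband_gain(xs, rs) for _, xs, rs in split_segments(x, r0, 1000)]
```

Each segment receives its own estimate, so a change in the recording conditions partway through is tracked at the granularity of the chosen segment length.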
In an alternate embodiment of the present invention, the original recording r0(t) is recorded in the same environment/set-up as the recorded mixture x(t). For example, this can be done by using the same recording device for recording the mixture (e.g., cassette tape recorder) and the same playing device for playing the unwanted signal (e.g., a CD player). The recording device and the playing device would be placed in approximately the same physical location in a room of similar geometric structure and materials. The recording device records the original recording r0(t) being played by the playing device. The original recording r0(t) is used to compute an estimate of |r̂(t,ω)|. That is, the original recording r0(t) would serve the role of α(ω)r̂(t,ω) in the time-frequency mask generation.
In an alternate embodiment of the present invention, the following time-frequency mask may be used:
m(t,ω)=1{α(ω)|r̂
where β is set to maximize intelligibility of the output signal. A default choice of β can be determined from statistics of α(ω)r̂(t,ω) and x̂(t,ω).
The particular embodiments disclosed above are illustrative only, as the invention may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. Furthermore, no limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope and spirit of the invention. Accordingly, the protection sought herein is as set forth in the claims below.
Number | Date | Country | |
---|---|---|---|
20040136544 A1 | Jul 2004 | US |