SOURCE SEPARATION AND REMIXING IN SIGNAL PROCESSING

Information

  • Patent Application
  • Publication Number
    20250046328
  • Date Filed
    October 26, 2022
  • Date Published
    February 06, 2025
Abstract
The present disclosure relates to a method and audio processing system (1) for performing source separation. The method comprises obtaining (S1) an audio signal (Sin) including a mixture of speech content and noise content, and determining (S2a, S2b, S2c), from the audio signal, speech content (formula A), stationary noise content (formula C) and non-speech content (formula B). The stationary noise content (formula C) is a true subset of the non-speech content (formula B), and the method further comprises determining (S3), based on a difference between the stationary noise content (formula C) and the non-speech content (formula B), a non-stationary noise content (formula D), obtaining (S5) a set of weighting factors and forming (S6) a processed audio signal based on a combination of the speech content (formula A), the stationary noise content (formula C), and the non-stationary noise content (formula D) weighted with their respective weighting factor.
Description
TECHNICAL FIELD OF THE INVENTION

The present invention relates to a method and audio processing system for source separation and remixing.


BACKGROUND OF THE INVENTION

Recorded audio signals may comprise a representation of one or more audio sources in addition to a noise component. Especially for User Generated Content (UGC), many individual audio sources are typically picked up in addition to a noise component (such as white noise) when recording audio.


Consider e.g. a user recording the audio track of a video, recording a podcast or making a phone call using a headset or smartphone from the sidewalk of a busy street or in a forest during windy conditions. The recorded audio signal from the busy street could for instance, in addition to the voice of the user, include the voices of other nearby pedestrians, the ringtone of a nearby pedestrian's cellphone, the sound of passing cars or busses, sounds from a nearby construction site, the sound of a siren from an emergency vehicle and the noise component. Similarly, the recorded audio signal from the forest could for instance include the voice of the user, birdsong, the sound of an airplane passing above, the sound of the wind rattling the leaves and noise.


The recorded audio signal will comprise audio from all of these recorded sound sources, which makes a desired audio signal, e.g. the voice of the user recording a video or making a phone call, less intelligible. To this end, neural network models for speech separation have been proposed which are capable of receiving an audio signal comprising recorded speech alongside other audio sources and noise as an input, and outputting either a processed audio signal with enhanced speech intelligibility or a speech isolation filter (often referred to as a “mask”) for suppressing the non-speech audio components of the audio signal. Accordingly, by using neural network models the intelligibility of speech present in audio signals can be enhanced, allowing users to record audio signals at many locations.


In other situations, especially for Professionally Generated Content (PGC) such as the recording of an audio track for a movie, all audio sources, or at least some additional audio sources in addition to the recorded voice, may be of interest. For instance, for a movie audio track which is recorded in a forest during windy conditions, the sound of a voice, the sound of the rattling leaves and birdsong are desired audio signal components, whereas the sound of an airplane passing above is an undesired audio signal component. Accordingly, a neural network for speech separation may be used to enhance the intelligibility of the voice, whereby individually recorded audio signals containing only birdsong and only the sound of rattling leaves are mixed with the intelligibility-enhanced speech to achieve a desired mix of audio sources for the movie audio track. The final mix then has enhanced speech intelligibility but also comprises birdsong and the sound of rattling leaves, and not the sound of a passing airplane, which provides a desirable and believable ambience effect.


GENERAL DISCLOSURE OF THE INVENTION

A drawback of the prior solutions is that, while many neural network models perform well in terms of removing noise components, each model is trained to remove a specific, predetermined type of noise. Due to different definitions of noise, a single neural network model will perform well if the definition of noise used to train the model overlaps with the undesired noise which is to be removed. However, as soon as the trained model is applied to remove noise which is defined differently from the noise definition used during training, the noise suppression performance decreases.


For instance, the trained speech separation model may be aggressive and trained to treat all audio signal components which are not speech as noise. Using such a speech separation model on e.g. a movie audio track where speech, birdsong and the sound of leaves rattling are all desired audio signals will suppress the birdsong and the sound of the leaves rattling to isolate only the speech. On the other hand, using a less aggressive speech separation model, which e.g. is trained to predict and remove only the stationary background noise, will suppress only the stationary background noise and not e.g. the unwanted sound of an airplane momentarily passing above (which is not an example of stationary background noise).


Thus, it is a purpose of the present disclosure to provide an enhanced method for audio processing which alleviates at least some of the drawbacks of the above-mentioned existing solutions.


A first aspect of the present invention relates to a method of processing audio for source separation, the method comprising obtaining an audio signal including a mixture of speech content and noise content, determining speech content from the audio signal, determining stationary noise content from the audio signal, and determining non-speech content from the audio signal, wherein the stationary noise content is a true subset of the non-speech content. The method further comprises determining, based on a difference between the stationary noise content and the non-speech content, a non-stationary noise content, obtaining a set of weighting factors comprising a weighting factor corresponding to each of the speech content, the stationary noise content, and the non-stationary noise content respectively, and forming a processed audio signal based on a combination of the speech content, the stationary noise content, and the non-stationary noise content weighted with the respective weighting factor.


With stationary noise content it is meant noise content which remains constant over time and which does not carry any interpretable information. White noise or thermal noise are both examples of stationary noise. Further examples of stationary noise are pink noise, Gaussian noise, any noise which e.g. is introduced by an audio amplifier and any noise with a time-independent distribution.


Non-speech content may be defined as the difference between a clean speech audio signal with added disturbances (such as stationary noise or birdsong) and the clean speech audio signal itself (such as a speech signal recorded in an anechoic chamber with any stationary noise removed). That is, non-speech content comprises stationary noise but also other types of non-stationary noise such as birdsong or the sound of rain.


The first aspect of the invention is at least partially based on the understanding that by extracting the non-stationary noise as the difference between the non-speech content and the stationary noise content, two independent noise content types are obtained in addition to the independent speech content. This facilitates remixing, as the relative magnitude of the three content types can be adjusted by selecting a desired set of weighting coefficients. For example, by adjusting the three weighting coefficients the stationary noise content is omitted entirely, the non-stationary noise is attenuated but not omitted entirely, and the speech content is amplified, which results in a processed audio signal with enhanced speech intelligibility while also providing some amount of ambience (as at least a portion of the non-stationary noise content is kept).


In some implementations, determining the stationary noise content comprises providing the audio signal to a stationary noise isolator model trained to predict a stationary noise mask for removing stationary noise content from the audio signal and determining the stationary noise content based on the stationary noise mask and the audio signal.


Thus, an accurate trained model (e.g. implemented with a neural network) may be used to determine the stationary noise content given a representation of an audio signal. Stationary noise content may be defined precisely, and large amounts of training data are readily available or may be recorded or created synthetically, which means the stationary noise isolator model can be trained to be very accurate.


Similarly, in some implementations determining the non-speech content comprises providing the audio signal to a speech isolator model trained to predict a noise mask for removing non-speech content from the audio signal; and determining non-speech content based on the noise mask and the audio signal.


Separating speech from arbitrary audio signals may be performed accurately with a model (e.g. implemented with a neural network) trained to predict a mask for separating speech content, given a representation of an audio signal. Additionally, the same mask used to extract the speech content may also be used to extract non-speech content, meaning that the same trained model may be used to determine both the speech content and the non-speech content.


While it is difficult to train a model to separate between different types of noise, such as stationary noise content and non-stationary noise content, some implementations of the first aspect of the present invention utilize trained models adapted for separation of more distinctly different types of audio content, such as speech and stationary noise, followed by a manipulation of the separated audio content to more accurately separate the different types of noise. The manipulation comprises determining the difference between the stationary noise content and the non-speech content.


In some implementations, the method further comprises bandpass filtering the non-stationary noise content with a bandpass filter configured to isolate a noise object in the non-stationary noise.


That is, while the non-stationary noise may comprise audio content associated with a plurality of non-stationary noise objects, the application of a suitable bandpass filter will isolate at least one desired noise object. A benefit of applying the bandpass filter to the non-stationary noise content is that the filter will not let through any speech content or stationary noise content, as this is not present in the non-stationary noise content.


In some implementations, the bandpass filter has been obtained by analyzing an example audio signal wherein the method further comprises collecting an example audio signal, the example audio signal comprising at least one example of a noise object, determining the frequency distribution of the example audio signal and defining the bandpass filter based on the frequency distribution of the example audio signal.


To this end, the frequency distribution of any arbitrary non-stationary noise object(s) may be determined and used to generate a bandpass filter for filtering the non-stationary noise.


According to a second aspect of the invention there is provided an audio processing system, the audio processing system comprising an audio content separation unit, the audio content separation unit being configured to obtain an audio signal, the audio signal including a mixture of speech content and noise content, and determine, from the audio signal, speech content, stationary noise content, and non-speech content, wherein the stationary noise content is a true subset of the non-speech content. The audio content separation unit is further configured to determine, based on a difference between the stationary noise content and the non-speech content, a non-stationary noise content, and the audio processing system further comprises a mixing unit configured to: obtain a set of weighting factors, comprising a weighting factor corresponding to each of the speech content, the stationary noise content, and the non-stationary noise content respectively, and form a processed audio signal based on a combination of the speech content, the stationary noise content, and the non-stationary noise content weighted with the respective weighting factor.


According to a third aspect of the invention there is provided a non-transitory computer-readable medium storing instructions that, upon execution by one or more processors, cause the one or more processors to perform the method according to the first aspect of the invention.





BRIEF DESCRIPTION OF THE DRAWINGS

Aspects of the present invention will be described in more detail with reference to the appended drawings, showing currently preferred embodiments.



FIG. 1a-b illustrate an audio signal being separated into non-speech content, speech content, stationary noise content and residual content according to some implementations.



FIG. 2 illustrates different types of non-speech content which the audio processing system according to some implementations isolates from the audio signal.



FIG. 3a-c are block diagrams illustrating different audio processing systems for source separation according to some implementations.



FIG. 4 is a flowchart describing a method according to some implementations.



FIG. 5 is a block diagram illustrating an audio processing system according to some implementations, with a speech isolator model for separating at least two different types of speech content.



FIG. 6a-c show different alternatives of audio processing systems with a classifier and selector according to some implementations.



FIG. 7 shows an exemplary setup for training a stationary noise isolator model and a speech isolator model according to some implementations.





DETAILED DESCRIPTION OF CURRENTLY PREFERRED EMBODIMENTS

Systems and methods disclosed in the present application may be implemented as software, firmware, hardware or a combination thereof. In a hardware implementation, the division of tasks does not necessarily correspond to the division into physical units: to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation.


The computer hardware may for example be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that computer hardware. Further, the present disclosure shall relate to any collection of computer hardware that individually or jointly execute instructions to perform any one or more of the concepts discussed herein.


Certain or all components may be implemented by one or more processors that accept computer-readable (also called machine-readable) code containing a set of instructions that when executed by one or more of the processors carry out at least one of the methods described herein. Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken are included. Thus, one example is a typical processing system (i.e. a computer hardware) that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system further may include a memory subsystem including a hard drive, SSD, RAM and/or ROM. A bus subsystem may be included for communicating between the components. The software may reside in the memory subsystem and/or within the processor during execution thereof by the computer system.


The one or more processors may operate as a standalone device or may be connected, e.g., networked to other processor(s). Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.


The software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, physical (non-transitory) storage media in various forms, such as EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Further, it is well known to the skilled person that communication media (transitory) typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.



FIG. 1a depicts schematically an audio signal Sin. The audio signal Sin is a mixture of a desired source s and noise n, wherein the desired source s e.g. is speech content. The audio signal Sin may be a mono audio signal, a stereo audio signal or even a multi-channel audio signal with more than two channels (e.g. the audio signal is 5.1 or 7.1.2 audio signal).


The audio signal Sin, which comprises a mixture of speech and noise content, may be referred to as x(k) in the time domain where k is the time sample index. Thus, x(k) may be expressed as










x[k] = s[k] + n[k]        (1)







in the time domain. By transforming the time domain representation in equation 1 to the spectral domain it is derived that










Xm,f = Sm,f + Nm,f        (2)







where X, S, N denote the time-frequency (T-F) representations of the audio signal mixture x(k), source s, and the noise n while the subscripts m and f denote the time frame index and frequency bin index respectively.


The audio signal Sin may be provided to a trained model which has been trained to output a mask M1, M2 for suppressing a certain type of noise, wherein the mask M1, M2 is typically defined as the magnitude ratio between the desired speech Sm,f and the audio signal mixture Xm,f for each time frame and frequency bin. That is, the mask M is defined as










M(m,f) = |Sm,f| / |Xm,f|.        (3)







Depending on the type and training of the mask predicting model, the mask M1, M2 may suppress different types of noise. While FIG. 1a depicts a portion of the audio signal Sin being separated by the mask M1, M2, this is merely a simple illustrative example and should not be interpreted as describing e.g. a single time and frequency frame. It is clear from equation 3 that the mask M1, M2 comprises a plurality of mask values, one for each time and frequency bin, each value in general being a real number between zero and one describing the extent to which that time and frequency bin should be suppressed.
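
The following is a minimal, hedged sketch (not code from the disclosure) of how a time-frequency mask of this kind acts on an audio mixture. It uses numpy and scipy and computes an "oracle" magnitude-ratio mask from a known clean source purely for illustration; in the disclosed system the masks M1, M2 would instead be predicted by the trained models.

    # Sketch only: oracle magnitude-ratio mask (equation 3) applied to a mixture.
    import numpy as np
    from scipy.signal import stft, istft

    fs = 16000
    t = np.arange(fs) / fs
    s = 0.5 * np.sin(2 * np.pi * 440 * t)        # stand-in for the desired source s[k]
    n = 0.1 * np.random.randn(fs)                # stand-in for the noise n[k]
    x = s + n                                    # mixture, x[k] = s[k] + n[k] (equation 1)

    f, frames, X = stft(x, fs=fs, nperseg=512)   # X_{m,f} (equation 2)
    _, _, S = stft(s, fs=fs, nperseg=512)

    M = np.clip(np.abs(S) / (np.abs(X) + 1e-8), 0.0, 1.0)   # mask values in [0, 1]
    S_hat = M * X                                # masked mixture
    _, s_hat = istft(S_hat, fs=fs, nperseg=512)  # back to the time domain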


With further reference to FIG. 1b, an implementation is shown wherein the audio signal Sin is provided to a first trained model 11 trained to output a first mask M1 which suppresses all audio components of the audio signal Sin which are non-speech N̂1. Applying mask M1 to the audio signal Sin leaves only what is considered by the trained model to be speech Ŝ1. This first trained model 11 may be used to perform aggressive speech intelligibility enhancement as all sounds not considered to be speech N̂1 are removed by the mask M1 and, while this is suitable in some cases, this type of speech intelligibility enhancement is unsuitable in others. In the audio track of a video, for instance, where characters are speaking on a busy street, the aggressive speech intelligibility enhancement will remove any traffic sounds from the street which are important for context and immersion.


The second trained model 12 is trained to output a mask M2 for suppressing only the stationary noise content N̂2 of the audio signal Sin and leave all audio content which is not stationary noise content, referred to as the residual content Ŝ2, unaffected. Applying the mask M2 to the audio signal Sin effectively removes stationary noise, which remains constant over time (i.e. noise with a probability distribution which is constant over time), while other types of noise which are potentially undesired (e.g. the sound of a nearby car revving its engine) are unaffected.


By using these two trained models simultaneously, namely the first model 11 (a speech isolator model trained to output a first mask M1 for suppressing non-speech N̂1) and the second model 12 (a stationary noise isolator model trained to output a second mask M2 for suppressing stationary noise N̂2), four partial representations of the audio signal Sin may be obtained. The estimated speech content Ŝ1 and non-speech content N̂1 (i.e. noise such as birdsong and stationary noise) of the first model 11 are obtained as











Ŝ1 = X × M1        (4)

N̂1 = X × (1 - M1) = X - Ŝ1        (5)







and, similarly, the estimated residual content Ŝ2 (i.e. all content but the stationary noise content) and stationary noise content N̂2 of the second model 12 are obtained as











Ŝ2 = X × M2        (6)

N̂2 = X × (1 - M2) = X - Ŝ2.        (7)







The output audio signal, Sout, can now be determined by combining Ŝ1, Ŝ2, N̂1 and N̂2 from equations 4, 5, 6 and 7 as:










Sout = α1 × Ŝ1 + β1 × Ŝ2 + γ1 × N̂1 + μ1 × N̂2        (8)







where α1, β1, γ1, μ1 are weighting factors for each of the speech content Ŝ1, the residual content Ŝ2, the non-speech content N̂1 and the stationary noise content N̂2 respectively. Alternatively, the output audio signal Sout from equation 8 can be rewritten in terms of the input audio signal mix X, the speech content Ŝ1 and the residual content Ŝ2 as










Sout = c1 × Ŝ1 + c2 × Ŝ2 + c3 × X        (9)







wherein c1, c2, c3 are an alternative set of weighting factors. It is understood that the same output audio signal Sout may be acquired with both equations 8 and 9, which means that there exists a mapping between the weighting factors α1, β1, γ1, μ1 and the weighting factors c1, c2, c3. However, as will now be described, the representation from equation 8 has some properties which can be exploited.
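
As a hedged illustration of that mapping (it is not written out explicitly in the disclosure, but it follows from equations 5 and 7, where N̂1 = X - Ŝ1 and N̂2 = X - Ŝ2), substituting those expressions into equation 8 gives

    Sout = (α1 - γ1) × Ŝ1 + (β1 - μ1) × Ŝ2 + (γ1 + μ1) × X

so one consistent choice of the alternative weighting factors in equation 9 is c1 = α1 - γ1, c2 = β1 - μ1 and c3 = γ1 + μ1.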


The above audio signal components Ŝ1, Ŝ2, N̂1 and N̂2 from equation 8 are not independent, as e.g. the speech content Ŝ1 may be comprised partially or wholly in the residual content Ŝ2, which means that it may not be possible to achieve a desired mix of the components from equation 8. To this end, the non-speech content N̂1 and the stationary noise content N̂2 are used to define a new type of noise content, referred to as the non-stationary noise content N̂NS or the object noise content, which is defined as











N̂NS = N̂1 - N̂2        (10)







and the stationary noise content N̂2 is renamed N̂S, meaning that











N̂S = N̂2.        (11)







The stationary noise content N̂S and the non-stationary noise content N̂NS are independent parts of the audio signal Sin (as opposed to N̂1 and N̂2, which are dependent), wherein the stationary noise content N̂S captures e.g. white noise and the non-stationary noise content N̂NS captures all content which is neither stationary noise content nor speech content. Examples of non-stationary noise N̂NS include birdsong, the sound of rattling leaves, the sound of cars, airplanes, helicopters and sirens, the sound of gusts of wind, and the sound of rain or thunder. Each of these examples, in addition to other examples not mentioned, forms a respective noise object N̂OBJ,1, N̂OBJ,2, wherein each noise object N̂OBJ,1, N̂OBJ,2 is a true subset of the non-stationary noise content N̂NS and associated with a certain type of audio content or audio content with a certain audio source (e.g. a machine, animal or vehicle).


Accordingly, the audio signal components Ŝ1, Ŝ2, N̂S and N̂NS are combined in a manner similar to equation 8, as










Sout = α2 × Ŝ1 + β2 × Ŝ2 + γ2 × N̂S + μ2 × N̂NS        (12)







wherein α2, β2, γ2, μ2 are weighting factors and γ2 and μ2 will influence the extent to which the stationary noise N̂S and non-stationary noise N̂NS are introduced into the output audio signal Sout. For instance, if μ2 is high, the non-stationary noise content such as the noise objects N̂OBJ,1, N̂OBJ,2 will be emphasized in the processed audio signal Sout, and if γ2 is set to zero, the stationary noise is omitted entirely, whereby the balance between α2, β2, and μ2 will influence the relative volume of the non-stationary noise with respect to the speech Ŝ1 and the residual Ŝ2.


It is noted that the output signal Sout as calculated with equation 12 using Ŝ1, Ŝ2, N̂S, N̂NS may alternatively be expressed in terms of Ŝ1, Ŝ2, N̂1, N̂2 from equation 8 or in terms of Ŝ1, Ŝ2, X from equation 9. Accordingly, there exists a mapping between all three sets of weighting coefficients, namely the weighting coefficients α2, β2, γ2, μ2, the weighting coefficients α1, β1, γ1, μ1 and the weighting coefficients c1, c2, c3. However, the representation from equation 12 has the benefit of featuring three independent content types (if Ŝ2 is omitted), which facilitates more accurate remixing of the output audio signal Sout.


In some implementations, β1 or β2 is set to zero or the residual content Ŝ2 is omitted from equations 8 and 12, as Ŝ2 will involve some overlap with both the speech content Ŝ1 and the non-speech content N̂1 as predicted by the first trained model 11.


With reference to FIG. 2, the different types of non-speech content N̂1 are illustrated schematically. As seen, the non-speech content N̂1 comprises stationary noise content N̂2, wherein the stationary noise content in turn comprises different forms of stationary noise content, such as white noise Nw. The difference between the stationary noise content N̂2=N̂S and the non-speech audio content N̂1 defines the non-stationary noise content N̂NS, which in turn comprises one or more noise objects N̂OBJ,1, N̂OBJ,2 which are neither speech nor stationary noise content (e.g. birdsong).
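
The decomposition and remix of equations 4 to 12 can be summarized in a short, hedged sketch (assumed Python/numpy code, not an implementation from the disclosure). The masks M1 and M2 are taken as already predicted for the STFT X of the input, so the model calls are omitted, and the default weighting factors are arbitrary example values:

    import numpy as np

    def remix(X, M1, M2, alpha2=1.2, beta2=0.0, gamma2=0.0, mu2=0.5):
        """Combine the separated content types with weighting factors (cf. equation 12)."""
        S1_hat = X * M1              # speech content (eq. 4)
        N1_hat = X * (1.0 - M1)      # non-speech content (eq. 5)
        S2_hat = X * M2              # residual content (eq. 6)
        N2_hat = X * (1.0 - M2)      # stationary noise content (eq. 7)
        N_ns = N1_hat - N2_hat       # non-stationary noise content (eq. 10)
        N_s = N2_hat                 # stationary noise content, renamed (eq. 11)
        # Example weighting: boost speech, drop stationary noise, keep some ambience.
        return alpha2 * S1_hat + beta2 * S2_hat + gamma2 * N_s + mu2 * N_ns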



FIG. 3a depicts a block diagram of an audio processing system 1, and with further reference to the flow chart of FIG. 4, a method for performing audio processing for source separation according to some implementations will now be described in detail.


At step S1 an audio signal comprising a mix of speech content and noise content is obtained and provided to an audio separation unit 10. The audio separation unit 10 comprises a speech isolator model 11 trained to predict a mask M1 for separating the speech content Ŝ1 from the non-speech content N̂1 in the audio signal. By applying the mask M1 to the audio signal, e.g. in accordance with equations 4 and 5 in the above, the speech content Ŝ1 and non-speech content N̂1 are determined at step S2a and step S2c respectively.


Analogously, the audio signal is provided to the stationary noise isolator model 12 trained to predict a mask M2 for separating the residual audio content Ŝ2 from the stationary noise content N̂2. By applying the mask M2 to the audio signal, e.g. in accordance with equation 7 in the above, at least the stationary noise content N̂2 is determined at step S2b.


At step S3 the non-stationary noise content N̂NS is determined by the audio separation unit 10 as the difference between the non-speech content N̂1 predicted by the speech isolator model 11 and the stationary noise N̂2 as predicted by the stationary noise isolator model 12. Alternatively, the audio separation unit 10 outputs the speech content Ŝ1, the non-speech content N̂1 and the stationary noise content N̂2, whereby the non-stationary noise content N̂NS is determined by an auxiliary computation unit.


The method may then go to step S5 which comprises obtaining at least one weighting factor for each of the speech content Ŝ1, the stationary noise content N̂2=N̂S and the non-stationary noise content N̂NS. The weighting factors are e.g. predetermined or set by a user/mixing engineer to obtain a desired mix of the independent speech content Ŝ1, stationary noise content N̂2=N̂S and non-stationary noise content N̂NS in the output audio signal. Additionally, as will be described below, a selector may select or suggest a set of weighting coefficients based on the detected noise objects present in the audio signal.


At step S6 the speech content Ŝ1, the stationary noise content N̂2=N̂S and the non-stationary noise content N̂NS are combined by the mixer unit 14 with their respective weighting factor to form the processed audio signal, e.g. in accordance with equation 12 in the above. That is, the different independent content types of the audio signal are remixed to form a processed output audio signal.


Optionally, as seen in the exemplary implementation in FIG. 3b, both the stationary noise content N̂2 and the residual content Ŝ2 are determined at step S2b, e.g. by using equations 6 and 7 in the above, whereby both the stationary noise content N̂2 and the residual content Ŝ2 are used in the combination at the mixer unit 14 with a respective weighting factor.



FIG. 3c shows another optional implementation, wherein the non-stationary noise N̂NS is processed with a bandpass filter 13 at step S3 prior to being fed to the mixer unit 14. Additionally, the filtered non-stationary noise may be smoothed with a smoothing kernel or smoothing filter (not shown) prior to being fed to the mixer unit 14. The implementation in FIG. 3c may e.g. be combined with other implementations, such as the implementation shown in FIG. 3b. Moreover, it is envisaged that both the non-stationary noise N̂NS and the non-stationary noise processed with the filter 13 may be provided to the mixing unit 14 as illustrated in FIG. 6a.


The filter 13 may in turn be determined by collecting an example audio signal, the example audio signal comprising at least one example of a (non-stationary) target noise object such as birdsong, or a group of target noise objects such as traffic sounds, and determining the frequency distribution of the example audio signal. The frequency distribution of the example audio signal will reveal the energy distribution of the audio signal, whereby a suitable bandpass filter 13 may be defined with a passband which allows at least a predetermined portion of the example audio signal to pass through. For instance, the bandpass filter 13 is defined to be as narrow as possible while still featuring a passband which allows at least 50%, and preferably at least 70%, and most preferably at least 90% of the energy of the test signal to pass through. That is, the bandpass filter 13 will attenuate noise objects different from the target noise object(s).


To obtain a more accurate bandpass filter 13, the example audio signal should comprise a clean example of the target noise object or group of noise objects. To this end, the example audio signal may be manually cleaned to remove audio components or noise which are not examples of the target noise object(s), or cleaned with a reliable automatic process. Additionally, a longer example audio signal, with more/longer examples of the target noise object(s), is preferred to avoid averaging errors. For instance, the example audio signal comprises at least one hour, and preferably at least five hours and most preferably at least ten hours of noise object audio content.


As an illustrative example, the target noise object is birdsong, whereby an example audio signal with ten hours of clean birdsong is obtained and the frequency distribution is determined. The frequency distribution reveals that most of the example signal energy is contained between 3 kHz and 7 kHz, whereby a bandpass filter 13 with a passband between 3 kHz and 7 kHz, and stopbands starting at 1 kHz and 9 kHz respectively, is defined to separate the birdsong from other noise objects present in the non-stationary noise N̂NS.
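
A hedged sketch of this filter design step is given below (Python/scipy). The at-least-90%-of-the-energy criterion and the 3-7 kHz birdsong band come from the text above; the symmetric trimming of the low- and high-frequency tails is an assumed strategy for choosing the passband edges, not a method prescribed by the disclosure.

    import numpy as np
    from scipy.signal import welch, butter, sosfilt

    def design_object_bandpass(example, fs, keep=0.9, order=4):
        """Derive a bandpass filter from an example recording of a target noise object."""
        f, psd = welch(example, fs=fs, nperseg=4096)       # frequency/energy distribution
        csum = np.cumsum(psd / psd.sum())
        lo = f[np.searchsorted(csum, (1.0 - keep) / 2)]    # lower passband edge
        hi = f[np.searchsorted(csum, 1.0 - (1.0 - keep) / 2)]
        lo = max(lo, 1.0)                                  # keep edges inside (0, fs/2)
        hi = min(hi, fs / 2 - 1.0)
        return butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")

    # Applying the filter to the time-domain non-stationary noise content:
    # filtered = sosfilt(sos, n_ns)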



FIG. 5 depicts an audio processing system 1 identical to the audio processing system described in connection to FIG. 3a aside from the presence of a different type of speech isolator model 11′. The speech isolator model 11′ in FIG. 5 is trained to obtain an audio signal and predict at least two masks so as to isolate at least two different types of speech present in the audio signal. In the implementation shown, the speech isolator model 11′ predicts three masks to separate speech without reverberation, which is called dry speech, Ŝd, dry speech with early reverberation Ŝe, and dry speech with early and late reverberation Ŝl. The different speech types Ŝd, Ŝe, Ŝl are provided to the mixing unit 14 and added to the stationary noise content N̂S and the non-stationary noise content N̂NS with a respective weighting factor for each of the speech types. Accordingly, equation 12 (with or without the residual content Ŝ2), which describes the formation of the output audio signal Sout in the mixing unit 14, may be modified by replacing the speech content Ŝ1 with Ŝtot wherein











Ŝtot = α1 × Ŝd + α2 × Ŝe + α3 × Ŝl        (13)







and wherein α1, α2, and α3 are weighting factors for each of the dry speech Ŝd, the dry speech with early reverberation Ŝe, and the dry speech with early and late reverberation Ŝl.


Thus, by e.g. setting α2 and α3 to small values relative to α1, the dry speech will be emphasized in the output audio signal Sout, and by setting α1 and α3 to small values relative to α2, the dry speech with early reverberation will be emphasized in the output audio signal Sout.


With late reverberation it is meant speech reverberation with a reverberation time which exceeds a predetermined threshold and with early reverberation it is meant speech reverberation with a time constant below the predetermined threshold.


The speech isolator model 11′ may comprise one trained model for each of the different speech types Ŝd, Ŝe, Ŝl, or the speech isolator may comprise a single isolator model 11′ trained to predict one mask for separating each of the different types of speech Ŝd, Ŝe, Ŝl.


While the implementation of the audio processing system 1 in FIG. 5 extracts speech types which differ in terms of reverberation, it is envisaged that speech types which differ in other ways may be used as an alternative to, or in addition to, the speech types Ŝd, Ŝe, Ŝl with different reverberation properties. For instance, the speech isolator model 11′ may be configured (trained) to separate at least two types of speech which differ in at least one of: the gender of the voice uttering the speech, the age of the voice uttering the speech, and the language of the speech.



FIGS. 6a, 6b and 6c each illustrates a block diagram of an audio processing system 1 comprising a classifier 15 according to some implementations which now will be described in more detail.


In FIG. 6a the classifier 15 receives the audio signal and the classifier 15 is trained to predict the presence of at least one noise object in the audio signal. The classifier 15 may further be trained to predict the presence of at least one noise object in the audio signal, wherein the at least one noise object is at least one noise object of a predetermined set of noise objects. For example, the classifier 15 may be trained to predict the presence of at least one of birdsong, traffic sounds, wind sounds, rain sounds, thunder sounds, siren sounds, airplane sounds, helicopter sounds and machine sounds (such as the sound of a washing machine, drill, or lawnmower) in the audio signal. Based on the at least one noise object which is predicted to be present in the audio signal, the selector 16 selects filter data 172a, 172b, 172c associated with the predicted noise object and applies a filter 13′ as described by the selected filter data 172a, 172b, 172c to the non-stationary noise. For example, the classifier 15 predicts that birdsong is present in the audio signal, whereby a birdsong filter 13′ is selected by the selector 16 to be applied to the non-stationary noise content N̂NS.


To this end, the classifier 15 may be a neural network trained to predict the presence of at least one noise object given a representation of an audio signal. It is envisaged that the neural network predicts a likelihood of the audio signal comprising one or more predetermined noise objects, wherein the noise object associated with the greatest likelihood is the predicted noise object.


The selector 16 may retrieve the filter 13′ from a database 171 of different sets of filter data 172a, 172b, 172c wherein each set of filter data is associated with a noise object and describes a filter 13′ to be applied. For instance, for each noise object present in the predetermined set of noise objects which are possible outputs of the classifier 15 there is a corresponding set of filter data 172a, 172b, 172c in the database 171. Additionally, as seen in FIG. 6a the non-stationary noise N̂NS may be provided to the mixing unit 14 in addition to the filtered non-stationary noise, whereby each of the non-stationary noise N̂NS and the filtered non-stationary noise is provided with a respective weighting factor allowing the relative signal strength of the non-stationary noise N̂NS relative to the filtered non-stationary noise to be modified as desired (e.g. by a user or mixing engineer).


In the exemplary embodiment shown in FIG. 6a the classifier 15 predicts birdsong as one noise object which is present in the audio signal and provides an indication of birdsong to the selector 16. The selector 16 accesses the database 171 and finds that filter data 172b describes a filter 13′ associated with birdsong (e.g. the filter with a passband between 3 kHz and 7 kHz mentioned in the above), whereby the selector 16 selects filter data 172b and enables the birdsong filter 13′ to be applied to the non-stationary noise content N̂NS.
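
A minimal sketch of this selector logic follows, under the assumption that the classifier returns a label from the predetermined set and that the filter data 172a, 172b, 172c is stored as per-label passband edges. The dictionary below and its numeric values, other than the 3-7 kHz birdsong band, are illustrative assumptions only.

    from scipy.signal import butter

    FILTER_DB = {                       # stand-in for the database 171
        "birdsong": (3000.0, 7000.0),   # passband from the birdsong example above
        "traffic": (50.0, 2000.0),      # assumed values, for illustration only
    }

    def select_filter(predicted_object, fs, order=4):
        """Look up the filter data for the predicted noise object and build the filter."""
        lo, hi = FILTER_DB[predicted_object]
        return butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")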



FIG. 6b depicts another audio processing system 1 comprising a classifier 15 according to some implementations. The classifier 15 predicts the presence of at least one noise object (e.g. the presence of at least one noise object of a predetermined set of noise objects) and provides the predicted noise object(s) to a selector 16. The selector 16 accesses a database 173 of trained noise object isolation models 174a, 174b, 174c and selects at least one trained noise object isolation model 174a trained to predict a mask for isolating the at least one predicted noise object N̂OBJ,1. The predicted mask of the selected noise object isolation model 174a is applied to the audio signal to obtain the noise object N̂OBJ,1. The noise object N̂OBJ,1 is in turn provided to the mixing unit 14 and combined with the non-stationary noise N̂NS, stationary noise N̂S and speech content Ŝ1, wherein each content type is provided with a respective weighting factor. Thus, the user or mixing engineer may set the weighting factors as desired and e.g. suppress the stationary and non-stationary noise N̂S, N̂NS and amplify only the noise object N̂OBJ,1 of the non-stationary noise and the speech content Ŝ1.


While the audio processing system 1 in FIG. 6a and FIG. 6b uses a classifier 15 and selector 16 to select appropriate filter data 172a, 172b, 172c or noise object isolator model 174a, 174b, 174c, it is envisaged that the classifier 15 and selector 16 may select more than one, such as two or more, filters or noise object isolator models if two or more noise objects are detected to be present in the audio signal by the classifier 15. Moreover, the filter or noise object isolator models may be associated with a group of noise objects rather than just a single noise object. For instance, there may be a trained nature object isolator model or nature filter which is selected when the classifier 15 detects at least one of birdsong, the sound of rattling leaves or the sound of rain.


In connection to FIGS. 6a and 6b in the above it is explained how the classifier 15 and selector 16 are used to dynamically, and based on the content of the audio signal, change the filter 13′ to be applied to the non-stationary noise or which object noise isolator model 174a, 174b, 174c to use. Accordingly, the number of audio content types which are provided to the mixing unit 14 may change depending on the contents of the audio signal, whereby the user or mixing engineer may select a desired relative signal strength for each of the components by selecting the weighting factors manually. However, as shown in FIG. 6c, the weighting factors may be determined automatically, e.g. selected by the selector 16 from a database 175 of weighting factor sets 176a, 176b, 176c based on which noise object(s) the classifier 15 predicts to be present in the audio signal. Each set 176a, 176b, 176c of weighting factors in the database comprises a value for at least each one of α2, γ2 and μ2.


For instance, if the classifier 15 predicts the presence of birdsong, the selector 16 may select a set of weighting factors 176c which suppresses the stationary noise, amplifies the non-stationary noise and amplifies the speech content, as birdsong is considered to not disturb the speech intelligibility while adding a pleasant ambiance. On the other hand, if the classifier 15 predicts the presence of wind sounds, the selector 16 may select a different set of weighting factors 176a which suppresses the stationary noise and the non-stationary noise (which includes the wind sound) while amplifying the speech content, as wind sounds are considered an unwanted disturbance.
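
A hedged sketch of this selection step is given below; the numeric weighting factors are assumptions chosen to mirror the birdsong and wind examples above, not values given in the disclosure.

    WEIGHT_SETS = {                    # stand-in for the database 175
        "birdsong": {"alpha2": 1.5, "gamma2": 0.2, "mu2": 1.2},   # keep the ambience
        "wind":     {"alpha2": 1.5, "gamma2": 0.1, "mu2": 0.1},   # suppress the disturbance
    }
    DEFAULT_SET = {"alpha2": 1.0, "gamma2": 0.5, "mu2": 0.5}

    def select_weights(predicted_object):
        """Return the weighting factor set associated with the predicted noise object."""
        return WEIGHT_SETS.get(predicted_object, DEFAULT_SET)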


In this manner, the selector 16 automatically selects a suitable weighting factor set 176a, 176b, 176c for all audio signals according to a predetermined set of rules, wherein a user or mixing engineer optionally provides some preferences to modify the rules. The preferences e.g. indicate a desire to suppress some noise objects more than others (e.g. suppress all manmade noise objects such as machine sounds and traffic sounds but keep all nature sounds such as birdsong, rain sound and thunder sound). Alternatively or additionally, the preferences e.g. indicate a desire to enhance speech intelligibility at the cost of less ambience, wherein any reverberation and stationary noise is omitted entirely and any noise object is attenuated.


In some implementations (not shown) the classifier 15 may receive the non-stationary noise content N̂NS (instead of the entire audio signal), which has been extracted using the output of the stationary noise isolator model 12 and the speech isolator model 11. As the noise objects will be in the non-stationary noise content N̂NS, the classifier 15 can still correctly predict the presence of at least one noise object, while the classification can be made more accurate since the non-stationary noise N̂NS contains only a true subset of the audio signal content.



FIG. 7 illustrates how the stationary noise isolator model 12 and the speech isolator model 11 may be trained to predict a corresponding mask M1, M2. Training data in the form of speech is obtained from a speech database 179, wherein the speech database 179 comprises clean speech audio signals corresponding to a multitude of different speakers, languages and signal bitrates. Similarly, noise training data is obtained from a noise database 177, wherein the noise comprises a plurality of non-speech sounds such as stationary noise of different types (e.g. white noise) and non-stationary noise of different types (such as rain sound or the sound of a barking dog). The training speech and noise data is combined in a mixer and provided to each of the stationary noise isolator model 12 and the speech isolator model 11 for training.


During training the internal weights and/or parameters of the isolation models 11, 12 are adjusted so as to predict a mask M1 which accurately isolates the speech and a mask M2 which accurately isolates the stationary noise. To accomplish this, the resulting audio signal after applying mask M1 is compared to a ground truth signal comprising the clean speech from the speech database 179, and the resulting audio signal after applying mask M2 is compared to a ground truth signal comprising only the stationary noise added from the noise database 177. By changing the internal weights and/or parameters of the isolation models 11, 12 so as to minimize discrepancies between the audio signal with the respective mask applied and the ground truth signal, the models 11, 12 will gradually learn to predict masks M1, M2 for accurate speech separation and stationary noise separation.
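
A hedged sketch of one such training step is shown below (PyTorch). The network architecture and the spectrogram-domain mean-squared-error loss are illustrative assumptions; the disclosure does not specify the model structure or the loss function.

    import torch
    import torch.nn as nn

    class MaskNet(nn.Module):
        """Toy spectrogram-to-mask network standing in for model 11 or 12."""
        def __init__(self, n_bins=257):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_bins, 256), nn.ReLU(),
                nn.Linear(256, n_bins), nn.Sigmoid(),   # mask values in [0, 1]
            )

        def forward(self, mag_mix):                     # (frames, bins) magnitudes
            return self.net(mag_mix)

    def train_step(model, optimizer, mag_mix, mag_target):
        # mag_mix: |X| of the mixed training signal; mag_target: magnitude of the
        # ground truth the masked signal should match (e.g. the clean speech for
        # the speech isolator model).
        mask = model(mag_mix)
        loss = torch.mean((mask * mag_mix - mag_target) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # usage (illustrative): model = MaskNet(); opt = torch.optim.Adam(model.parameters(), lr=1e-3)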


The one or more noise object isolator models 174a, 174b, 174c of the database 173 described in connection to FIG. 6b may be obtained by a similar training setup. However, for a noise object isolator model 174a, 174b, 174c the ground truth signal will be a clean signal representing the noise object (such as the above mentioned example audio signal) and the training signal is the clean signal representing the noise object mixed with at least one of other noise objects, speech and stationary noise.


Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the disclosure discussions utilizing terms such as “processing”, “computing”, “calculating”, “determining”, “analyzing” or the like, refer to the action and/or processes of a computer hardware or computing system, or similar electronic computing devices, that manipulate and/or transform data represented as physical, such as electronic, quantities into other data similarly represented as physical quantities.


It should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention. Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.


Furthermore, some of the embodiments are described herein as a method or combination of elements of a method that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with instructions for carrying out such a method or element of a method forms a means for carrying out the method or element of a method. Note that when the method includes several elements, e.g., several steps, no ordering of such elements is implied, unless specifically stated. Furthermore, an element described herein of an apparatus embodiment is an example of a means for carrying out the function performed by the element for the purpose of carrying out the embodiments of the invention. In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.


The person skilled in the art realizes that the aspects of the invention are by no means limited to the embodiments described above. On the contrary, many modifications and variations are possible within the scope of the appended claims. For example, while the classifier 15 and selector 16 of the implementations depicted in FIGS. 6a, 6b, 6c are used to select filter data 172a, 172b, 172c, a noise object separator model 174a, 174b, 174c or a set of weighting factors 176a, 176b, 176c, it is envisaged that the classifier and selector may select two or all three of a filter(s), a noise object separator model(s) or a set of weighting factors simultaneously. For instance, while a noise object separator model 174a, 174b, 174c may be sufficient to separate a noise object, a filter 13′ may be used to further enhance the quality of the isolation of the noise object.


Various aspects of the present invention may be appreciated from the following enumerated example embodiments (EEEs):

    • EEE1. A method of processing audio, the method comprising:
      • receiving an audio signal including a mixture of speech content and noise content;
      • determining, from the noise content, background noise and object noise;
      • enhancing the speech content to generate speech enhanced audio, wherein enhancing the speech content comprises applying one or more first gains to the speech content, one or more second gains to the background noise, and one or more third gains to the object noise; and
    • providing the speech enhanced audio to a downstream device.
    • EEE2. The method of EEE 1, wherein determining the background noise and object noise comprises combining and remixing a type one noise and a type two noise, the type one noise and type two noise being defined in a noise database and each corresponding to a respective model for generating a respective mask for enhancing speech under a respective type of noise.
    • EEE3. The method of EEE 2, wherein the background noise corresponds to the type one noise, and the object noise corresponds to a difference between the type one noise and the type two noise.
    • EEE4. The method of EEE 2 or 3, wherein at least one of the one or more second gains or the one or more third gains are different from gains corresponding to the type one noise and type two noise as prescribed in the respective models.
    • EEE5. A method of processing audio, comprising:
      • receiving audio mixtures; and
      • separating and remixing the audio mixtures based on particular types of sources.
    • EEE6. The method of EEE 5, where the types of sources include at least one of noise or instrumental sound.
    • EEE7. The method of EEE 5 or 6, comprising:
      • solving issues of overlap between types of sources by giving a definition of a type wherein difference information between types is used for remixing.
    • EEE8. The method of any of EEEs 5 to 7, comprising performing post-processing, including extending from the particular types of sources to other types of sources.
    • EEE9. The method of any of EEEs 5 to 8, comprising:
      • combining classifiers of the types of sources to indicate a new source type; and
      • performing separation and mixing using the new source type.
    • EEE10. A system comprising:
      • one or more processors; and
      • a non-transitory computer-readable medium storing instructions that, upon execution by the one or more processors, cause the one or more processors to perform operations of any of EEEs 1-9.
    • EEE11. A non-transitory computer-readable medium storing instructions that, upon execution by one or more processors, cause the one or more processors to perform operations of any of EEEs 1-9.

Claims
  • 1. A method of processing audio for source separation, the method comprising: obtaining an audio signal including a mixture of speech content and noise content; determining, from the audio signal, speech content; determining, from the audio signal, stationary noise content; determining, from the audio signal, non-speech content, wherein the stationary noise content is a true subset of the non-speech content; determining, based on a difference between the stationary noise content and the non-speech content, a non-stationary noise content; obtaining a set of weighting factors, the set comprising a weighting factor corresponding to each of said speech content, said stationary noise content, and said non-stationary noise content respectively; and forming a processed audio signal based on a combination of the speech content, the stationary noise content, and the non-stationary noise content weighted with the respective weighting factor.
  • 2. The method according to claim 1, wherein determining the stationary noise content comprises: providing the audio signal to a stationary noise isolator model trained to predict a stationary noise mask for removing stationary noise content from the audio signal; and determining the stationary noise content based on the stationary noise mask and the audio signal.
  • 3. The method according to claim 1, wherein determining the non-speech content comprises: providing the audio signal to a speech isolator model trained to predict a noise mask for removing non-speech content from the audio signal; and determining non-speech content based on the noise mask and the audio signal.
  • 4. The method according to claim 1, further comprising: bandpass filtering the non-stationary noise content with a bandpass filter configured to isolate a noise object in the non-stationary noise content.
  • 5. The method according to claim 4, further comprising: bandpass filtering the non-stationary noise content with at least two different bandpass filters, each bandpass filter being configured to isolate a different noise object in the non-stationary noise.
  • 6. The method according to claim 4, further comprising: providing the audio signal to a noise object classifier model, the classifier model being trained to output a prediction of a noise object present in the audio signal; providing a plurality of bandpass filters, each configured to isolate a different noise object in the non-stationary noise; and selecting the bandpass filter associated with the predicted noise object.
  • 7. The method according to claim 4, wherein each bandpass filter has been obtained by: collecting an example audio signal, the example audio signal comprising at least one example of a noise object; determining the frequency distribution of the example audio signal; and defining the bandpass filter based on the frequency distribution of the example audio signal.
  • 8. The method according to claim 4, further comprising smoothing the filtered non-stationary noise with a smoothing filter.
  • 9. The method according to claim 1, wherein the weighting factors indicate boosting the non-stationary noise content with respect to the stationary noise content.
  • 10. The method according to claim 1, further comprising: providing at least two sets of weighting factors, each set of weighting factors being associated with a respective audio source type; providing the audio signal to a classifier model, trained to output a prediction of a noise object present in the audio signal; and wherein obtaining a set of weighting factors comprises: selecting a set of said at least two sets, the selected set being associated with the predicted noise object.
  • 11. The method according to claim 1, further comprising: determining, based on the audio signal, at least one noise object, the noise object forming a true subset of the non-stationary noise content; and wherein the set of weighting factors further comprises a noise object weighting factor for each noise object, and wherein said combination is further based on the noise object weighted with the noise object weighting factor.
  • 12. The method according to claim 11, wherein determining at least one noise object comprises: providing the audio signal to an object isolation model trained to predict a mask for separating the noise object from the audio signal; and determining the noise object based on the audio signal and the mask for separating the noise object from the audio signal.
  • 13. The method according to claim 11, further comprising: providing a plurality of trained object isolation models, each model trained to predict a mask for separating a different noise object from an audio signal; providing the audio signal to a classifier model, trained to output a predicted noise object present in the audio signal; selecting, from said plurality of trained object isolation models, the trained object isolation model associated with the predicted noise object; and providing the audio signal to the selected object isolation model to predict a mask for separating the predicted noise object from the audio signal.
  • 14. An audio processing system, the audio processing system comprising: an audio content separation unit, the audio content separation unit being configured to: obtain an audio signal, the audio signal including a mixture of speech content and noise content, determine, from the audio signal, speech content, determine, from the audio signal, stationary noise content, determine, from the audio signal, non-speech content, wherein the stationary noise content is a true subset of the non-speech content, and determine, based on a difference between the stationary noise content and the non-speech content, a non-stationary noise content, the audio processing system further comprising a mixing unit configured to: obtain a set of weighting factors, the set comprising a weighting factor corresponding to each of said speech content, said stationary noise content, and said non-stationary noise content respectively, and form a processed audio signal based on a combination of the speech content, the stationary noise content, and the non-stationary noise content weighted with the respective weighting factor.
  • 15. A non-transitory computer-readable medium storing instructions that, upon execution by one or more processors, cause the one or more processors to perform the method of claim 1.
Priority Claims (2)
Number Date Country Kind
PCT/CN2021/131462 Nov 2021 WO international
22171560.0 May 2022 EP regional
CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority of the following priority applications: International application PCT/CN2021/131462 (reference: D21131WO), filed 18-11-2021, U.S. provisional application 63/288,996 (reference: D21131USP1), filed 13-12-2021, U.S. provisional application 63/336,824 (reference: D21131USP2), filed 29-4-2022, and EP patent application 22171560.0, filed 4-5-2022, each of which is hereby incorporated by reference in its entirety.

PCT Information
Filing Document Filing Date Country Kind
PCT/US2022/047830 10/26/2022 WO
Provisional Applications (2)
Number Date Country
63288996 Dec 2021 US
63336824 Apr 2022 US