The present application is based on PCT filing PCT/EP2020/051618, filed Jan. 23, 2020, which claims priority to EP 19153334.8, filed Jan. 23, 2019, the entire contents of each of which are incorporated herein by reference.
The present disclosure generally pertains to the field of audio processing, in particular to devices, methods and computer programs for source separation and mixing.
A large amount of audio content is available, for example in the form of compact disks (CD), tapes or audio data files which can be downloaded from the internet, but also in the form of sound tracks of videos, e.g. stored on a digital video disk or the like. Typically, audio content is already mixed, e.g. for a mono or stereo setting, without keeping the original audio source signals from the original audio sources which have been used for production of the audio content. However, there exist situations or applications where a remixing of the audio content is envisaged.
Although there generally exist techniques for mixing audio content, it is generally desirable to improve devices and methods for mixing of audio content.
According to a first aspect, the disclosure provides an electronic device comprising circuitry configured to perform source separation based on a received audio input to obtain a separated source, to perform onset detection on the separated source to obtain an onset detection signal, and to mix the received audio input with the separated source based on the onset detection signal to obtain an enhanced separated source.
According to a second aspect, the disclosure provides a method comprising: performing source separation based on a received audio input to obtain a separated source; performing onset detection on the separated source to obtain an onset detection signal; and mixing the received audio input with the separated source based on the onset detection signal to obtain an enhanced separated source.
According to a third aspect, the disclosure provides a computer program comprising instructions, the instructions when executed on a processor causing the processor to perform source separation based on a received audio input to obtain a separated source, to perform onset detection on the separated source to obtain an onset detection signal, and to mix the received audio input with the separated source based on the onset detection signal to obtain an enhanced separated source.
Further aspects are set forth in the dependent claims, the following description and the drawings.
Embodiments are explained by way of example with respect to the accompanying drawings, in which:
Before a detailed description of the embodiments under reference of
The embodiments disclose an electronic device comprising circuitry configured to perform source separation based on a received audio input to obtain a separated source, to perform onset detection on the separated source to obtain an onset detection signal, and to mix the received audio input with the separated source based on the onset detection signal to obtain an enhanced separated source.
The circuitry of the electronic device may include a processor, which may for example be a CPU, a memory (RAM, ROM or the like) and/or storage, interfaces, etc. Circuitry may comprise or may be connected with input means (mouse, keyboard, camera, etc.), output means (a display (e.g. liquid crystal, (organic) light emitting diode, etc.), loudspeakers, etc.), a (wireless) interface, etc., as is generally known for electronic devices (computers, smartphones, etc.). Moreover, circuitry may comprise or may be connected with sensors for sensing still images or video image data (image sensor, camera sensor, video sensor, etc.), for sensing environmental parameters (e.g. radar, humidity, light, temperature), etc.
In audio source separation, an input signal comprising a number of sources (e.g. instruments, voices, or the like) is decomposed into separations. Audio source separation may be unsupervised (called “blind source separation”, BSS) or partly supervised. “Blind” means that the blind source separation does not necessarily have information about the original sources. For example, it may not necessarily know how many sources the original signal contained or which sound information of the input signal belongs to which original source. The aim of blind source separation is to decompose the original signal into separations without knowing the separations beforehand. A blind source separation unit may use any of the blind source separation techniques known to the skilled person. In (blind) source separation, source signals may be searched that are minimally correlated or maximally independent in a probabilistic or information-theoretic sense, or structural constraints on the audio source signals may be found on the basis of a non-negative matrix factorization. Methods for performing (blind) source separation are known to the skilled person and are based on, for example, principal component analysis, singular value decomposition, (in)dependent component analysis, non-negative matrix factorization, artificial neural networks, etc.
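By way of illustration only, the non-negative matrix factorization approach mentioned above may be sketched as follows. This is a minimal, self-contained example operating on a toy magnitude spectrogram; the function name, the fixed random seed, the number of iterations, and the Wiener-style masking step are illustrative assumptions, not the implementation of the embodiments.

```python
import numpy as np

def nmf_separate(V, n_components=2, n_iter=200, eps=1e-9):
    """Factor a non-negative magnitude spectrogram V (freq x time) into
    spectral templates W and activations H via multiplicative updates,
    then reconstruct one source estimate per component by masking the
    mixture with each component's share of the model (Wiener-style)."""
    rng = np.random.default_rng(0)  # fixed seed: illustrative choice
    F, T = V.shape
    W = rng.random((F, n_components)) + eps
    H = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        # standard multiplicative updates for the Frobenius cost
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    WH = W @ H + eps
    # each estimate: mixture weighted by its component's share of the model
    return [V * (np.outer(W[:, k], H[k]) / WH) for k in range(n_components)]

# Toy mixture of two "sources" with disjoint spectral support
V = np.zeros((4, 6))
V[0, ::2] = 1.0   # source A: low band, every other frame
V[3, 1::2] = 1.0  # source B: high band, alternate frames
est = nmf_separate(V, n_components=2)
```

Because the masks of all components sum to (at most) one in every time-frequency bin, the source estimates never exceed the mixture and are exactly zero wherever the mixture is silent.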
Although some embodiments use blind source separation for generating the separated audio source signals, the present disclosure is not limited to embodiments where no further information is used for the separation of the audio source signals, but in some embodiments, further information is used for generation of separated audio source signals. Such further information can be, for example, information about the mixing process, information about the type of audio sources included in the input audio content, information about a spatial position of audio sources included in the input audio content, etc.
The input signal can be an audio signal of any type. It can be in the form of analog signals or digital signals, it can originate from a compact disk, digital video disk, or the like, and it can be a data file, such as a wave file, mp3-file or the like; the present disclosure is not limited to a specific format of the input audio content. An input audio content may for example be a stereo audio signal having a first channel input audio signal and a second channel input audio signal, without the present disclosure being limited to input audio contents with two audio channels. In other embodiments, the input audio content may include any number of channels, such as a remixing of a 5.1 audio signal or the like. The input signal may comprise one or more source signals. In particular, the input signal may comprise several audio sources. An audio source can be any entity which produces sound waves, for example music instruments, voice, vocals, artificially generated sound, e.g. originating from a synthesizer, etc.
The input audio content may represent or include mixed audio sources, which means that the sound information is not separately available for all audio sources of the input audio content, but that the sound information for different audio sources, e.g., at least partially overlaps or is mixed.
The separations produced by blind source separation from the input signal may for example comprise a vocals separation, a bass separation, a drums separation and an “other” separation. In the vocals separation all sounds belonging to human voices might be included, in the bass separation all sounds below a predefined threshold frequency might be included, in the drums separation all sounds belonging to the drums in a song/piece of music might be included, and in the other separation all remaining sounds might be included. Source separation obtained by a Music Source Separation (MSS) system may result in artefacts such as interference, crosstalk or noise.
Onset detection may for example be a time-domain manipulation which is performed on a separated source selected from the source separation to obtain an onset detection signal. An onset may refer to the beginning of a musical note or other sound. It is related to (but different from) the concept of a transient: all musical notes have an onset, but they do not necessarily include an initial transient.
Onset detection is an active research area. For example, the MIREX annual competition features an Audio Onset Detection contest. Approaches to onset detection may operate in the time domain, frequency domain, phase domain, or complex domain, and may include looking for increases in spectral energy, changes in spectral energy distribution (spectral flux) or phase, changes in detected pitch (e.g. using a polyphonic pitch detection algorithm), spectral patterns recognizable by machine learning techniques such as neural networks, or the like. Simpler techniques also exist; for example, detecting increases in time-domain amplitude, which may however lead to an unsatisfactorily high number of false positives or false negatives.
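A minimal sketch of the spectral-flux approach mentioned above is given below. Frame length, hop size, threshold and the peak-picking rule are illustrative assumptions; practical detectors typically add adaptive thresholding and smoothing.

```python
import numpy as np

def spectral_flux_onsets(x, frame=256, hop=128, threshold=0.5):
    """Detect onsets via spectral flux: frame the signal, take magnitude
    spectra, sum the positive (half-wave rectified) bin-wise increases
    between consecutive frames, and report frames where the flux exceeds
    a fixed fraction of its maximum."""
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    mags = np.array([np.abs(np.fft.rfft(window * x[i*hop : i*hop+frame]))
                     for i in range(n_frames)])
    diff = np.diff(mags, axis=0)
    flux = np.maximum(diff, 0.0).sum(axis=1)  # only energy increases count
    if flux.max() <= 0:
        return np.array([], dtype=int)
    # +1: flux[j] compares frame j and j+1, the onset belongs to frame j+1
    return np.flatnonzero(flux > threshold * flux.max()) + 1

# Toy signal: silence, then a 440 Hz tone starting at sample 2048
sr = 8000
x = np.zeros(sr)
x[2048:] = np.sin(2 * np.pi * 440 * np.arange(sr - 2048) / sr)
onsets = spectral_flux_onsets(x)  # frame indices near 2048 / hop = 16
```

Note how a steady tone produces almost no flux once it has started: only the energy increase at the onset crosses the threshold, which is the property that distinguishes spectral flux from plain energy tracking.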
The onset detection signal may indicate the attack phase of a sound (e.g. bass, hi-hat, snare), here the drums. As the analysis of the separated source may need some time, the onset detection may detect the onset later than it actually occurs. That is, there may be an expected latency Δt of the onset detection signal. The expected time delay Δt may be a known, predefined parameter, which may be set in the latency compensation as a predefined parameter.
The circuitry may be configured to mix the audio signal with the separated source based on the onset detection signal to obtain an enhanced separated source. The mixing may operate on one of the separated sources (here vocals, bass, drums and other), e.g. the drums separation, to produce an enhanced separated source. Performing mixing based on the onset detection may enhance the separated source.
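The onset-controlled mixing described above may be sketched as follows. The function name, the gate-style use of the onset detection signal and the default gain values are illustrative assumptions; the embodiments leave the exact gain law to the gain generator described further below.

```python
import numpy as np

def mix_enhanced(separated, original, onset_gate, g_sep=1.0, g_orig=0.5):
    """Around detected onsets (onset_gate == 1), blend a share of the
    original mixture back into the separated source to restore attack
    energy that source separation may have smeared; elsewhere the
    separated source passes through unchanged."""
    g_dnn = np.where(onset_gate > 0, g_sep, 1.0)       # gain on separation
    g_original = np.where(onset_gate > 0, g_orig, 0.0)  # gain on mixture
    return g_dnn * separated + g_original * original

sep = np.array([1.0, 1.0, 1.0, 1.0])     # separated source (e.g. drums)
orig = np.array([2.0, 2.0, 2.0, 2.0])    # original audio input
gate = np.array([0, 1, 1, 0])            # onset detection signal
out = mix_enhanced(sep, orig, gate)      # -> [1.0, 2.0, 2.0, 1.0]
```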
In some embodiments the circuitry may be further configured to perform latency compensation based on the received audio input to obtain a latency compensated audio signal and to perform latency compensation on the separated source to obtain a latency compensated separated source.
In some embodiments the mixing of the audio signal with the separated source based on the onset detection signal may comprise mixing the latency compensated audio signal with the latency compensated separated source.
In some embodiments the circuitry may be further configured to generate a gain gDNN to be applied to the latency compensated separated source based on the onset detection signal and to generate a gain gOriginal to be applied to the latency compensated audio signal based on the onset detection signal.
In some embodiments the circuitry may be further configured to generate a gain modified latency compensated separated source based on the latency compensated separated source and to generate a gain modified latency compensated audio signal based on the latency compensated audio signal.
In some embodiments performing latency compensation on the separated source may comprise delaying the separated source by an expected latency in the onset detection.
In some embodiments performing latency compensation on the received audio input may comprise delaying the received audio input by an expected latency in the onset detection.
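The latency compensation of the two preceding embodiments amounts to a simple delay line, which may be sketched as follows (function name and the zero-padding/trimming convention are illustrative assumptions).

```python
import numpy as np

def latency_compensate(signal, delay_samples):
    """Delay a signal by the expected onset-detection latency so that the
    (late) onset detection signal lines up with the audio it was computed
    from. Zeros are prepended; the tail is trimmed to keep the length."""
    if delay_samples <= 0:
        return signal.copy()
    return np.concatenate([np.zeros(delay_samples), signal[:-delay_samples]])

# Example: an expected latency Δt of 3 samples
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = latency_compensate(x, 3)  # -> [0, 0, 0, 1, 2]
```

In a real-time system the same effect is achieved with a ring buffer of Δt samples; the batch form above is the simplest way to state the alignment.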
In some embodiments the circuitry may be further configured to perform an envelope enhancement on the latency compensated separated source to obtain an envelope enhanced separated source. This envelope enhancement may for example be any kind of gain envelope generator with attack, sustain and release parameters as known from the state of the art.
In some embodiments the mixing of the audio signal with the separated source may comprise mixing the latency compensated audio signal with the envelope enhanced separated source.
In some embodiments the circuitry may be further configured to perform averaging on the latency compensated audio signal to obtain an average audio signal.
In some embodiments the circuitry may be further configured to perform a rhythm analysis on the average audio signal to obtain a rhythm analysis result.
In some embodiments the circuitry may be further configured to perform dynamic equalization on the latency compensated audio signal and on the rhythm analysis result to obtain a dynamic equalized audio signal.
In some embodiments the mixing of the audio signal with the separated source comprises mixing the dynamic equalized audio signal with the latency compensated separated source.
The embodiments also disclose a method comprising: performing source separation based on a received audio input to obtain a separated source; performing onset detection on the separated source to obtain an onset detection signal; and mixing the received audio input with the separated source based on the onset detection signal to obtain an enhanced separated source.
According to a further aspect, the disclosure provides a computer program comprising instructions, the instructions when executed on a processor causing the processor to perform source separation based on a received audio input to obtain a separated source, to perform onset detection on the separated source to obtain an onset detection signal, and to mix the received audio input with the separated source based on the onset detection signal to obtain an enhanced separated source.
Embodiments are now described by reference to the drawings.
First, source separation (also called “demixing”) is performed which decomposes a source audio signal 1 comprising multiple channels i and audio from multiple audio sources Source 1, Source 2, . . . Source K (e.g. instruments, voice, etc.) into “separations”, here into source estimates 2a-2d for each channel i, wherein K is an integer number and denotes the number of audio sources. In the embodiment here, the source audio signal 1 is a stereo signal having two channels i=1 and i=2. As the separation of the audio source signal may be imperfect, for example due to the mixing of the audio sources, a residual signal 3 (r(n)) is generated in addition to the separated audio source signals 2a-2d. The residual signal may for example represent a difference between the input audio content and the sum of all separated audio source signals. The audio signal emitted by each audio source is represented in the input audio content 1 by its respective recorded sound waves. For input audio content having more than one audio channel, such as stereo or surround sound input audio content, spatial information for the audio sources is typically also included or represented by the input audio content, e.g. by the proportion of the audio source signal included in the different audio channels. The separation of the input audio content 1 into separated audio source signals 2a-2d and a residual 3 is performed on the basis of blind source separation or other techniques which are able to separate audio sources.
In a second step, the separations 2a-2d and the possible residual 3 are remixed and rendered to a new loudspeaker signal 4, here a signal comprising five channels 4a-4e, namely a 5.0 channel system. On the basis of the separated audio source signals and the residual signal, an output audio content is generated by mixing the separated audio source signals and the residual signal on the basis of spatial information. The output audio content is exemplarily illustrated and denoted with reference number 4 in
In the following, the number of audio channels of the input audio content is referred to as Min and the number of audio channels of the output audio content is referred to as Mout. As the input audio content 1 in the example of
The separated source obtained during source separation 201, here the drums separation, is also transmitted to the latency compensation 203. At the latency compensation 203, the drums separation is delayed by the expected latency Δt of the onset detection signal to generate a latency compensated drums separation. This has the effect that the latency Δt of the onset detection signal is compensated by a respective delay of the drums separation. Simultaneously with the source separation 201, the audio input is transmitted to the latency compensation 205. At the latency compensation 205, the audio input is delayed by the expected latency Δt of the onset detection signal to generate a latency compensated audio signal. This has the effect that the latency Δt of the onset detection signal is compensated by a respective delay of the audio input.
The gain generator 204 is configured to generate a gain gDNN to be applied to the latency compensated separated source and a gain gOriginal to be applied on the latency compensated audio signal based on the onset detection signal. The function of the gain generator 204 will be described in more detail in
The present invention is not limited to this example. The source separation 201 could also output other separated sources, e.g. a vocals separation, a bass separation, an other separation, or the like. Although in
In the middle part of
In the lower part of
Based on these gains gDNN and gOriginal the amplifiers and the mixer (206, 207, and 208 in
The length of the attack phase t0 to t1, the sustain phase t1 to t2, and the release phase t2 to t3 is set by the skilled person as a predefined parameter according to the specific requirements of the instrument at issue.
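A minimal sketch of such a gain generator follows. The piecewise-linear attack/sustain/release shape applied to gOriginal, unity gain for gDNN, and all phase lengths are illustrative assumptions chosen to match the t0..t3 phases described above; the exact gain law is set by the skilled person.

```python
import numpy as np

def onset_gains(onset_frame, n_frames, attack=4, sustain=8, release=4, g_max=1.0):
    """Piecewise-linear gain envelope triggered at an onset: ramp up over
    the attack phase (t0..t1), hold during the sustain phase (t1..t2),
    ramp down over the release phase (t2..t3). Returns (g_original, g_dnn):
    the original mix is faded in around the onset while the separated
    source keeps unity gain (an assumed complementary law, for
    illustration only)."""
    g_original = np.zeros(n_frames)
    t0 = onset_frame
    t1, t2, t3 = t0 + attack, t0 + attack + sustain, t0 + attack + sustain + release
    for t in range(t0, min(t3, n_frames)):
        if t < t1:                              # attack: linear ramp up
            g_original[t] = g_max * (t - t0) / attack
        elif t < t2:                            # sustain: hold
            g_original[t] = g_max
        else:                                   # release: linear ramp down
            g_original[t] = g_max * (t3 - t) / release
    g_dnn = np.ones(n_frames)                   # separated source at unity
    return g_original, g_dnn

g_orig, g_dnn = onset_gains(onset_frame=2, n_frames=24)
```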
The separated source obtained during source separation 201, here the drums separation, is also transmitted to the latency compensation 203. At the latency compensation 203, the drums separation is delayed by the expected latency Δt of the onset detection signal to generate a latency compensated drums separation. This has the effect that the latency Δt of the onset detection signal is compensated by a respective delay of the drums separation. The latency compensated drums separation obtained during latency compensation 203 is transmitted to the envelope enhancement 209. At the envelope enhancement 209, the latency compensated separated source, here the drums separation, is further enhanced based on the onset detection signal obtained from the onset detection 202 to generate an envelope enhanced separated source, here an envelope enhanced drums separation. The envelope enhancement 209 further enhances the attack of e.g. the drums separation and further enhances the energy of the onset by applying envelope enhancement to the drums output (original DNN output). This envelope enhancement 209 can for example be any kind of gain envelope generator with attack, sustain and release parameters as known from the state of the art.
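The envelope enhancement 209 may be sketched as follows. The gain values, phase lengths and the multiplicative boost law are illustrative assumptions; any gain envelope generator with attack, sustain and release parameters can serve here.

```python
import numpy as np

def envelope_enhance(sep, onset_frames, attack=2, sustain=4, release=2, boost=2.0):
    """Apply a gain envelope (attack/sustain/release) around each detected
    onset to emphasise the attack energy of the separated source. Gains
    from overlapping envelopes are merged by taking the maximum."""
    gain = np.ones(len(sep))
    for t0 in onset_frames:
        t1 = t0 + attack
        t2 = t1 + sustain
        t3 = t2 + release
        for t in range(t0, min(t3, len(sep))):
            if t < t1:      # attack: ramp gain from 1 up to boost
                g = 1.0 + (boost - 1.0) * (t - t0) / attack
            elif t < t2:    # sustain: hold the boost
                g = boost
            else:           # release: ramp back down to 1
                g = 1.0 + (boost - 1.0) * (t3 - t) / release
            gain[t] = max(gain[t], g)
    return sep * gain

out = envelope_enhance(np.ones(16), onset_frames=[4])
```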
Simultaneously with the source separation 201, the audio input is transmitted to the latency compensation 205. At the latency compensation 205, the audio input is delayed by the expected latency Δt of the onset detection signal to generate a latency compensated audio signal. This has the effect that the latency Δt of the onset detection signal is compensated by a respective delay of the audio input.
The gain generator 204 is configured to generate a gain gDNN to be applied to the envelope enhanced separated source and a gain gOriginal to be applied to the latency compensated audio signal based on the onset detection signal. The function of the gain generator 204 is described in more detail in
The present invention is not limited to this example. The source separation 201 could also output other separated sources, e.g. a vocals separation, a bass separation, an other separation, or the like. Although in
The separated source obtained during source separation 201, here the drums separation, is also transmitted to the latency compensation 203. At the latency compensation 203, the drums separation is delayed by the expected latency Δt of the onset detection signal to generate a latency compensated drums separation. This has the effect that the latency Δt of the onset detection signal is compensated by a respective delay of the drums separation. Simultaneously with the source separation 201, the audio input is transmitted to the latency compensation 205. At the latency compensation 205, the audio input is delayed by the expected latency Δt of the onset detection signal to generate a latency compensated audio signal. This has the effect that the latency Δt of the onset detection signal is compensated by a respective delay of the audio input. The latency compensated audio signal is transmitted to the averaging 210. At the averaging 210, the latency compensated audio signal is analyzed to produce an averaging parameter. The averaging 210 is configured to perform averaging on the latency compensated audio signal to obtain the averaging parameter. The averaging parameter is obtained by averaging several beats of the latency compensated audio signal to get a more stable frequency spectrum of the latency compensation 205 (mix buffer). The process of the averaging 210 will be described in more detail in
The latency compensated audio signal, obtained during latency compensation 205, is also transmitted to the dynamic equalization 211. At the dynamic equalization 211, the latency compensated audio signal is dynamically equalized based on the averaging parameter, calculated during averaging 210, to obtain a dynamic equalized audio signal.
The gain generator 204 is configured to generate a gain gDNN to be applied to the latency compensated separated source and a gain gOriginal to be applied to the dynamic equalized audio signal based on the onset detection signal. The function of the gain generator 204 is described in more detail in
The present invention is not limited to this example. The source separation 201 could also output other separated sources, e.g. a vocals separation, a bass separation, an other separation, or the like. Although in
The averaging 210 (see
Based on the rhythm analysis result, the dynamic equalization (211 in
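The averaging 210 and the derivation of equalization gains from the averaged spectrum may be sketched as follows. The frame length, the use of beat-aligned frames from a rhythm analysis, and the flattening gain law with its clipping limits are hypothetical choices for illustration.

```python
import numpy as np

def beat_averaged_spectrum(x, beat_starts, frame=256):
    """Average the magnitude spectra of several beat-aligned frames to
    obtain a more stable estimate of the mix's spectral shape (the
    averaging over several beats of the mix buffer described above)."""
    specs = [np.abs(np.fft.rfft(x[b:b + frame]))
             for b in beat_starts if b + frame <= len(x)]
    return np.mean(specs, axis=0)

def eq_gains_from_average(avg_spec, target=None, floor=0.25, ceil=4.0):
    """Derive per-bin equalization gains that pull the averaged spectrum
    toward a flat target level, clipped to a safe range (a hypothetical
    gain law standing in for the dynamic equalization 211)."""
    target = avg_spec.mean() if target is None else target
    return np.clip(target / (avg_spec + 1e-9), floor, ceil)

x = np.sin(2 * np.pi * 0.05 * np.arange(1024))   # toy mix signal
avg = beat_averaged_spectrum(x, beat_starts=[0, 256, 512])
gains = eq_gains_from_average(avg)
```

Averaging over several beats suppresses frame-to-frame fluctuations, so the derived gains track the stable spectral shape of the mix rather than individual transients.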
The electronic system 1200 further comprises a data storage 1202 and a data memory 1203 (here a RAM). The data memory 1203 is arranged to temporarily store or cache data or computer instructions for processing by the processor 1201. The data storage 1202 is arranged as a long-term storage, e.g. for recording sensor data obtained from the microphone array 1210 and provided to or retrieved from the CNN unit 1220. The data storage 1202 may also store audio data that represents audio messages, which the public announcement system may transport to people moving in the predefined space.
It should be noted that the description above is only an example configuration. Alternative configurations may be implemented with additional or other sensors, storage devices, interfaces, or the like.
It should be recognized that the embodiments describe methods with an exemplary ordering of method steps. The specific ordering of method steps is, however, given for illustrative purposes only and should not be construed as binding.
It should also be noted that the division of the electronic system of
All units and entities described in this specification and claimed in the appended claims can, if not stated otherwise, be implemented as integrated circuit logic, for example, on a chip, and functionality provided by such units and entities can, if not stated otherwise, be implemented by software.
In so far as the embodiments of the disclosure described above are implemented, at least in part, using software-controlled data processing apparatus, it will be appreciated that a computer program providing such software control and a transmission, storage or other medium by which such a computer program is provided are envisaged as aspects of the present disclosure.
Note that the present technology can also be configured as described below.
Foreign Application Priority Data

Number | Date | Country | Kind
---|---|---|---
19153334 | Jan. 2019 | EP | regional

PCT Information

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/EP2020/051618 | Jan. 23, 2020 | WO |

Publishing Document | Publishing Date | Country | Kind
---|---|---|---
WO2020/152264 | Jul. 30, 2020 | WO | A

References Cited — U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---
20120294459 | Chapman et al. | Nov. 2012 | A1
20140297012 | Kobayashi | Oct. 2014 | A1
20160329061 | Heber et al. | Nov. 2016 | A1
20180047372 | Scallie et al. | Feb. 2018 | A1
20180088899 | Gillespie et al. | Mar. 2018 | A1
20180176706 | Cardinaux | Jun. 2018 | A1

References Cited — Foreign Patent Documents

Number | Date | Country
---|---|---
2015150066 | Oct. 2015 | WO

References Cited — Other Publications

International Search Report and Written Opinion dated Feb. 21, 2020, received for PCT Application PCT/EP2020/051618, filed on Jan. 23, 2020, 10 pages.

Gillet et al., “Extraction and Remixing of Drum Tracks From Polyphonic Music Signals”, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 16-19, 2005, pp. 315-318.

Dittmar, “Source Separation and Restoration of Drum Sounds in Music Recordings”, Jun. 14, 2018, pp. 1-181.

Related U.S. Publication

Number | Date | Country
---|---|---
20220076687 A1 | Mar. 2022 | US