The present disclosure generally pertains to the field of audio processing, e.g. in music and broadcast production, distribution and transmission.
A digital audio workstation (DAW) is an electronic device or software application for recording, editing and producing audio files such as musical pieces, speech or sound effects. DAWs typically provide a user interface that allows the user to record, edit and mix multiple recordings and tracks into a final produced piece.
Music production involves the processes of recording, mixing and mastering. A computer-based DAW typically allows for multitrack recording of audio and provides controls for playing, recording and editing audio tracks.
Modern computer-based DAWs support software plug-ins, each having its own functionality, which can expand the sound processing capabilities of the DAW. There exist, for example, software plug-ins for equalization, limiting, and compression. There also exist software plug-ins which provide audio effects such as reverb and echo, and software plug-ins which provide sound sources to a DAW, such as virtual instruments and samplers.
Digital audio processing may involve loudness evaluation, in particular short-term loudness evaluation (=envelope evaluation). The European Broadcasting Union (EBU), see reference [EBU 2011], has studied the needs of audio signal levels in production, distribution and transmission of broadcast programs.
There is a general need for providing better computer-implemented aid to a user in the process of recording, mixing and monitoring.
According to a first aspect the disclosure provides a method comprising determining an envelope of an audio file based on a double-windowing analysis of the audio file.
According to a further aspect the disclosure provides an electronic device comprising circuitry configured to determine an envelope of an audio file based on a double-windowing analysis of the audio file.
According to a further aspect the disclosure provides a computer program comprising instructions, which when executed on a processor cause the processor to determine an envelope of an audio file based on a double-windowing analysis of the audio file.
Further aspects are set forth in the dependent claims, the following description and the drawings.
Embodiments are explained by way of example with respect to the accompanying drawings, in which:
The following embodiments relate to a level and/or loudness evaluation framework, in particular to finding windowed (momentary or short-term) level and/or loudness values from an audio file.
The embodiments disclose a method comprising determining an envelope of an audio file based on a double-windowing analysis of the audio file.
A double-windowing analysis may comprise windowing the audio file to obtain a sequence of windows containing audio, and windowing each window of the sequence of windows to obtain, for each window, a respective sequence of sub-windows.
Windowing each window of the sequence of windows into sub-windows may result in a loudness curve, each value of the loudness curve being obtained from a respective window.
Windowing each window of the sequence of windows into sub-windows may result in a level curve, each value of the level curve being obtained from a respective window.
The methods as described above may for example be integrated into a windowed loudness evaluation of a file.
The methods as described above may for example be integrated into an envelope follower.
The methods as described above may for example be applied in an automatic audio mixing framework.
The methods may be computer-implemented methods. For example, the methods may be implemented as a software application, a digital audio workstation (DAW) software application, or the like. The methods may also be implemented as a software plug-in, e.g. for use in a digital audio workstation software.
The methods may for example be implemented in an electronic device comprising circuitry configured to perform the methods described above and below in more detail. The electronic device may for example be a computer, a desktop computer, a workstation, a digital audio workstation (DAW), or the like. The electronic device may also be a laptop, a tablet computer, a smartphone or the like. Circuitry of the electronic device may include one or more processors, one or more microprocessors, dedicated circuits, logic circuits, a memory (RAM, ROM, or the like), a storage, output means (display, e.g. liquid crystal, (organic) light emitting diode, etc.), loud speaker, an interface (e.g. touch screen, a wireless interface such as Bluetooth, infrared, audio interface, etc.), etc.
The European Broadcasting Union (EBU) provides specifications for the windowed loudness of audio content [EBU 2011]. The measure of the windowed loudness, or envelope, includes the windowing of psychoacoustically weighted audio, followed by the evaluation of the root-mean-square (RMS) power of the audio in each window.
The process of evaluating the RMS power of windowed audio can also be performed on unweighted audio, in which case the evaluation is the evaluation of windowed power instead of windowed loudness.
In cases where there exist transitions between low- and high-level audio inside windows, the windowing and evaluation of the RMS power of the audio in each window leads to errors in the estimation of both loudness and power.
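The size of this error can be illustrated with a small numerical sketch. The window contents below are hypothetical (half silence, half constant amplitude) and are chosen only to make the arithmetic easy to follow; the helper names `rms` and `to_db` are not taken from the disclosure.

```python
import math


def rms(samples):
    """Root-mean-square value of a sample sequence (0.0 for an empty one)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def to_db(value, floor=1e-12):
    """Express a linear RMS value in dB; `floor` avoids log10(0) (assumption)."""
    return 20.0 * math.log10(max(value, floor))


# Hypothetical window: the first half is silent, the second half has amplitude 1.0.
window = [0.0] * 512 + [1.0] * 512
active = window[512:]

window_db = to_db(rms(window))  # level measured over the whole window
active_db = to_db(rms(active))  # level of the non-silent part only
error_db = active_db - window_db
```

In this constructed case the whole-window measurement underestimates the level of the audible part by about 3 dB, because the silent half dilutes the mean power.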
Using shorter windows will result in the attenuation of the aforementioned artifacts. However, [EBU 2011] specifies window lengths. Particular measures such as “momentary loudness” and “short-term loudness” are performed with fixed window lengths.
The process of determining the envelope of input audio using double-windowing as described below in more detail uses the following method to attenuate the artifacts while not changing the window length.
In both
The signal loudness is measured using a standard window length. Each window's content is itself windowed into sub-windows. In a first and second possible implementation, the low-level sub-windows are discarded from the evaluation of the envelope. In a third possible implementation, the influence of the low-level sub-windows over the evaluation of loudness is minimized using a weighted mean.
At 500, an input audio 50 is windowed, resulting in a sequence of windows 51 containing audio.
Let $\vec{A}$ be the input audio.
Let the nth window be written $\vec{W}(n)$.
Let $N_{\text{window}}$ be the length of each window. Let $h_{\text{window}}$ be the hop size, with $h_{\text{window}} < N_{\text{window}}$.
Typical values are $N_{\text{window}} = 0.1 \times f_s$ samples and $h_{\text{window}} = 0.05 \times f_s$ samples, where $f_s$ is the sampling rate.
The nth window $\vec{W}(n)$ contains the audio samples $\vec{A}[1+((n-1) \times h_{\text{window}})]$ to $\vec{A}[N_{\text{window}}+((n-1) \times h_{\text{window}})]$.
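The windowing at 500 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `split_into_windows`, the 1000 Hz sampling rate and the choice to drop a trailing incomplete window are all assumptions. The code uses 0-based indices, whereas the text above counts samples and windows from 1.

```python
def split_into_windows(audio, n_window, h_window):
    """Return the full-length windows of `audio`, taken every `h_window` samples.

    Window k (0-based) covers audio[k * h_window : k * h_window + n_window],
    which matches the 1-based index range given in the text for n = k + 1.
    A trailing incomplete window is dropped (assumption, not stated in the text).
    """
    if n_window <= 0 or h_window <= 0:
        raise ValueError("window length and hop size must be positive")
    windows = []
    start = 0
    while start + n_window <= len(audio):
        windows.append(audio[start:start + n_window])
        start += h_window
    return windows


fs = 1000                    # hypothetical sampling rate in Hz
audio = list(range(fs))      # one second of dummy samples: 0, 1, 2, ...
n_window = round(0.1 * fs)   # 100 samples
h_window = round(0.05 * fs)  # 50 samples, i.e. 50 % overlap
windows = split_into_windows(audio, n_window, h_window)
```

With these values, consecutive windows overlap by half their length, so each sample away from the edges contributes to two windows.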
At 501, each weighted window (51) is itself windowed, resulting in a sequence of windows containing sub-windows, the sub-windows containing audio.
Let $\omega(n, \iota)$ be the ιth sub-window of the nth window.
Let $N_{\text{sub}}$ be the length of each sub-window. Let $h_{\text{sub}}$ be the hop size, with $h_{\text{sub}} < N_{\text{sub}}$.
Typical values are $N_{\text{sub}} = N_{\text{window}}/16$ and $h_{\text{sub}} = 0.5 \times N_{\text{sub}}$.
The ιth sub-window $\omega(n, \iota)$ contains the values $\vec{A}[1+((n-1) \times h_{\text{window}})+((\iota-1) \times h_{\text{sub}})]$ to $\vec{A}[N_{\text{sub}}+((n-1) \times h_{\text{window}})+((\iota-1) \times h_{\text{sub}})]$.
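The second windowing stage at 501 can be sketched in the same way. The function name `split_into_sub_windows` and the window of 160 dummy samples are assumptions made for illustration; the code again uses 0-based indices and drops a trailing incomplete sub-window.

```python
def split_into_sub_windows(window, n_sub, h_sub):
    """Return the full-length sub-windows of one window, taken every `h_sub` samples."""
    if n_sub <= 0 or h_sub <= 0:
        raise ValueError("sub-window length and hop size must be positive")
    sub_windows = []
    start = 0
    while start + n_sub <= len(window):
        sub_windows.append(window[start:start + n_sub])
        start += h_sub
    return sub_windows


window = list(range(160))  # hypothetical window of 160 dummy samples
n_sub = len(window) // 16  # sub-window length: window length / 16 = 10
h_sub = n_sub // 2         # hop size: half a sub-window = 5
sub_windows = split_into_sub_windows(window, n_sub, h_sub)
```

Because the hop is half the sub-window length, this example yields 31 overlapping sub-windows rather than 16 disjoint ones.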
At 502, the RMS power of the content of each sub-window is evaluated.
At 503, the sub-windows for which the RMS power is below a manually set threshold 53 are discarded, resulting in a sequence 54 of windows containing a subset of the sub-windows in 52. The threshold 53 may be the loudness of the background noise in the signal (see 75 in
At 504, the sub-windows from each window are concatenated into audio windows 56 that contain only audio from sub-windows whose RMS power is greater than the threshold 53.
Let $\chi[n, \iota]$ be the RMS power of the sub-window $\omega(n, \iota)$.
A window $\vec{W}_{\text{partial}}(n)$ is defined as the concatenation of the sub-windows $\omega(n, \iota)$ for which the RMS power $\chi[n, \iota]$ is greater than a threshold $T$.
At 505, the RMS power over the audio in each window 56 is evaluated, resulting in the envelope 59.
Each element $\vec{L}[n]$ of the envelope $\vec{L}$ is defined as the RMS power of the corresponding $\vec{W}_{\text{partial}}(n)$.
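Steps 500 to 505 can be put together in a short sketch. This is an illustration under stated assumptions rather than the disclosed implementation: the names `rms`, `frames` and `envelope_concat` are hypothetical, incomplete trailing frames are dropped, and a window whose sub-windows are all discarded is given the value 0.0, a case the text above does not specify.

```python
import math


def rms(samples):
    """Root-mean-square value of a sample sequence (0.0 for an empty one)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def frames(signal, length, hop):
    """Split `signal` into full-length frames taken every `hop` samples."""
    return [signal[s:s + length] for s in range(0, len(signal) - length + 1, hop)]


def envelope_concat(audio, n_window, h_window, n_sub, h_sub, threshold):
    """Envelope on a linear scale: RMS over the concatenated kept sub-windows."""
    envelope = []
    for window in frames(audio, n_window, h_window):                  # step 500
        sub_windows = frames(window, n_sub, h_sub)                    # step 501
        kept = [sw for sw in sub_windows if rms(sw) > threshold]      # steps 502, 503
        partial = [x for sw in kept for x in sw]                      # step 504
        envelope.append(rms(partial))                                 # step 505
    return envelope


# Hypothetical test signal: silence followed by a constant level of 0.5.
audio = [0.0] * 64 + [0.5] * 64
plain = rms(audio)  # single-window RMS, diluted by the silent half
envelope = envelope_concat(audio, 128, 64, 8, 4, threshold=0.01)
```

For this signal the plain RMS is about 0.354, while the double-windowed value is about 0.492, much closer to the 0.5 level of the audible part. The small remaining gap comes from the one kept sub-window that straddles the transition.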
At 500, the input audio 50 is windowed, resulting in a sequence of windows 51 containing audio.
Let $\vec{A}$ be the input audio.
Let the nth window be written $\vec{W}(n)$.
Let $N_{\text{window}}$ be the length of each window. Let $h_{\text{window}}$ be the hop size, with $h_{\text{window}} < N_{\text{window}}$.
Typical values are $N_{\text{window}} = 0.1 \times f_s$ samples and $h_{\text{window}} = 0.05 \times f_s$ samples, where $f_s$ is the sampling rate.
The nth window $\vec{W}(n)$ contains the audio samples $\vec{A}[1+((n-1) \times h_{\text{window}})]$ to $\vec{A}[N_{\text{window}}+((n-1) \times h_{\text{window}})]$.
At 501, each weighted window 51 is itself windowed, resulting in a sequence of windows containing sub-windows, the sub-windows containing audio.
Let $\omega(n, \iota)$ be the ιth sub-window of the nth window.
Let $N_{\text{sub}}$ be the length of each sub-window. Let $h_{\text{sub}}$ be the hop size, with $h_{\text{sub}} < N_{\text{sub}}$.
Typical values are $N_{\text{sub}} = N_{\text{window}}/16$ and $h_{\text{sub}} = 0.5 \times N_{\text{sub}}$.
The ιth sub-window $\omega(n, \iota)$ contains the values $\vec{A}[1+((n-1) \times h_{\text{window}})+((\iota-1) \times h_{\text{sub}})]$ to $\vec{A}[N_{\text{sub}}+((n-1) \times h_{\text{window}})+((\iota-1) \times h_{\text{sub}})]$.
At 502, the RMS power of the content of each sub-window is evaluated.
At 503, the sub-windows for which the RMS power is below a manually set threshold 53 are discarded, resulting in a sequence 54 of windows containing a subset of the sub-windows in 52. The threshold may be the loudness of the background noise in the signal.
At 506, the RMS power values are expressed on a linear scale, and for each window, the mean of the RMS power values of the remaining sub-windows is evaluated.
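Let $\chi[n, \iota]$ be the RMS power of the sub-window $\omega(n, \iota)$, expressed on a linear scale.
The envelope $\vec{L}[n]$ is evaluated as the mean of the RMS power values of the remaining sub-windows, i.e. with $S_n$ denoting the set of indices ι of the sub-windows of the nth window that are not discarded at 503:
$$\vec{L}[n] = \frac{1}{|S_n|} \sum_{\iota \in S_n} \chi[n, \iota]$$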
At 507, the envelope is expressed in the logarithmic domain.
At 509, $\vec{L}$ is expressed in the logarithmic domain, with $\vec{L}$ being set to $20 \times \log_{10}(\vec{L})$.
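The variant described above, which averages the RMS values of the remaining sub-windows instead of concatenating their audio, can be sketched as follows. The names `rms`, `frames` and `envelope_mean_db` are hypothetical, and the `floor_db` value returned for a window with no remaining sub-window is an assumption, since the text does not specify that case.

```python
import math


def rms(samples):
    """Root-mean-square value of a sample sequence (0.0 for an empty one)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def frames(signal, length, hop):
    """Split `signal` into full-length frames taken every `hop` samples."""
    return [signal[s:s + length] for s in range(0, len(signal) - length + 1, hop)]


def envelope_mean_db(audio, n_window, h_window, n_sub, h_sub, threshold,
                     floor_db=-120.0):
    """Envelope in dB: mean of the sub-window RMS values above `threshold`."""
    envelope = []
    for window in frames(audio, n_window, h_window):
        chi = [rms(sw) for sw in frames(window, n_sub, h_sub)]  # linear scale
        kept = [c for c in chi if c > threshold]                # discard low levels
        if kept:
            mean = sum(kept) / len(kept)
            envelope.append(20.0 * math.log10(mean))            # logarithmic domain
        else:
            envelope.append(floor_db)  # assumption: placeholder for an empty window
    return envelope


# Hypothetical test signal: silence followed by a constant level of 0.5.
audio = [0.0] * 64 + [0.5] * 64
envelope_db = envelope_mean_db(audio, 128, 64, 8, 4, threshold=0.01)
```

For this signal the result is close to the roughly -6.02 dB level of the audible part, whereas a plain RMS over the whole window would give about -9.03 dB.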
The implementation according to
At 500, the input audio 50 is windowed, resulting in a sequence of windows 51 containing audio.
Let $\vec{A}$ be the input audio.
Let the nth window be written $\vec{W}(n)$.
Let $N_{\text{window}}$ be the length of each window. Let $h_{\text{window}}$ be the hop size, with $h_{\text{window}} < N_{\text{window}}$.
Typical values are $N_{\text{window}} = 0.1 \times f_s$ samples and $h_{\text{window}} = 0.05 \times f_s$ samples, where $f_s$ is the sampling rate.
The nth window $\vec{W}(n)$ contains the audio samples $\vec{A}[1+((n-1) \times h_{\text{window}})]$ to $\vec{A}[N_{\text{window}}+((n-1) \times h_{\text{window}})]$.
At 501, each weighted window 51 is itself windowed, resulting in a sequence of windows containing sub-windows, the sub-windows containing audio.
Let $\omega(n, \iota)$ be the ιth sub-window of the nth window.
Let $N_{\text{sub}}$ be the length of each sub-window. Let $h_{\text{sub}}$ be the hop size, with $h_{\text{sub}} < N_{\text{sub}}$.
Typical values are $N_{\text{sub}} = N_{\text{window}}/16$ and $h_{\text{sub}} = 0.5 \times N_{\text{sub}}$.
The ιth sub-window $\omega(n, \iota)$ contains the values $\vec{A}[1+((n-1) \times h_{\text{window}})+((\iota-1) \times h_{\text{sub}})]$ to $\vec{A}[N_{\text{sub}}+((n-1) \times h_{\text{window}})+((\iota-1) \times h_{\text{sub}})]$.
At 502, the RMS power of the content of each sub-window is evaluated.
At 508, each value of the envelope is evaluated as the weighted mean of the RMS values of the sub-windows in the window $\vec{W}(n)$, with the RMS values being themselves the coefficients.
Let $\chi[n, \iota]$ be the RMS power of the sub-window $\omega(n, \iota)$, expressed on a linear scale.
For each n, $\vec{L}[n]$ is set to
$$\vec{L}[n] = \frac{\sum_{\iota} \chi[n, \iota]^2}{\sum_{\iota} \chi[n, \iota]}$$
At 509, $\vec{L}$ is expressed in the logarithmic domain, with $\vec{L}$ being set to $20 \times \log_{10}(\vec{L})$.
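The weighted-mean variant at 508 and 509 can be sketched as follows. Using each RMS value as its own coefficient gives the ratio of the sum of squared RMS values to the sum of RMS values, so no threshold is needed. The names `rms`, `frames` and `envelope_weighted_db` are hypothetical, and the `floor_db` value used when all sub-windows are silent is an assumption.

```python
import math


def rms(samples):
    """Root-mean-square value of a sample sequence (0.0 for an empty one)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def frames(signal, length, hop):
    """Split `signal` into full-length frames taken every `hop` samples."""
    return [signal[s:s + length] for s in range(0, len(signal) - length + 1, hop)]


def envelope_weighted_db(audio, n_window, h_window, n_sub, h_sub, floor_db=-120.0):
    """Envelope in dB: mean of the sub-window RMS values, weighted by themselves."""
    envelope = []
    for window in frames(audio, n_window, h_window):
        chi = [rms(sw) for sw in frames(window, n_sub, h_sub)]  # linear scale
        total = sum(chi)
        if total > 0.0:
            weighted = sum(c * c for c in chi) / total          # step 508
            envelope.append(20.0 * math.log10(weighted))        # step 509
        else:
            envelope.append(floor_db)  # assumption: all sub-windows are silent
    return envelope


# Hypothetical test signal: silence followed by a constant level of 0.5.
audio = [0.0] * 64 + [0.5] * 64
envelope_db = envelope_weighted_db(audio, 128, 64, 8, 4)
```

Silent sub-windows have weight zero and therefore drop out of the result on their own, which is why this variant needs no manually set threshold.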
Exemplary standard window lengths range from $2^{14}$ to $2^{17}$ samples at 44 kHz, divided into 16 sub-windows.
An envelope follower is an algorithm that conforms the envelope of an audio file, the source envelope, to a target envelope, possibly of a target audio file, resulting in a new audio file.
At 700, the envelope (short-term level) is extracted from a source audio file 70, resulting in the envelope of the source 72.
At 701, the envelope (short-term level) is extracted from a target audio file 71, resulting in the envelope of the target 73.
Both envelopes are expressed on a logarithmic scale.
At 702, the source envelope 72 is subtracted from the target envelope 73, resulting in the gains 74 that are to be applied to the source 70 so that its envelope conforms to the target envelope 73.
In practice, the gains should not be applied when the source contains background noise only, as this would result in the background noise being as loud as the signal. At 707, the background noise level 75 is compared to the source envelope. At 704, the gains are applied if the source envelope 72 is greater than the background noise level 75.
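The gain computation of the envelope follower can be sketched as below. The function name `follower_gains_db` and the example values are hypothetical. Both envelopes are assumed to be in dB and sampled at the same instants, and a gain of 0 dB (no change) is assumed wherever the gains are not applied.

```python
def follower_gains_db(source_env_db, target_env_db, noise_level_db):
    """Per-window gains in dB that bring the source envelope to the target.

    The gain is target minus source (both on a logarithmic scale). Where the
    source envelope does not exceed the background noise level, the gain is
    left at 0 dB (assumption: "not applied" means unity gain).
    """
    if len(source_env_db) != len(target_env_db):
        raise ValueError("envelopes must have the same number of values")
    gains = []
    for source, target in zip(source_env_db, target_env_db):
        if source > noise_level_db:
            gains.append(target - source)
        else:
            gains.append(0.0)
    return gains


# Hypothetical envelopes: the middle source value lies below the noise level.
source_env_db = [-20.0, -60.0, -10.0]
target_env_db = [-12.0, -12.0, -12.0]
gains_db = follower_gains_db(source_env_db, target_env_db, noise_level_db=-50.0)
```

Here the first window is raised by 8 dB, the last is lowered by 2 dB, and the noise-only window in the middle is left untouched instead of being boosted by 48 dB.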
The correctness of the evaluated envelopes has consequences on the performance of the envelope follower.
According to the example of
Then, at 802, the sum of track A′ and track B is determined in a conventional way to obtain output track O.
If a project contains multiple tracks, then the above described process can be iteratively applied to some or all of the audio tracks of the project so as to balance the level of the tracks in an automated way. If for example a project contains three tracks, track A, track B and track C, then in a first step the loudness of track A can be adapted to the loudness of track B to obtain a modified version of track A, denoted track A′, and track A′ and track B can be summed in a conventional way to obtain a track O as described with regard to
In all three implementations, the output signal loudness is a curve whose abscissa is the series of anchor time values, and whose ordinate is the series of measured loudness.
Steps 500, 501, 502, and 503 of this embodiment are identical to the embodiment of
At 504, the sub-windows from each window are concatenated into audio windows 56 that contain only audio from sub-windows whose RMS power is greater than the threshold 53.
At 505, the RMS power 58 (loudness) over the audio in each window 56, noted $\vec{L}[n]$, is evaluated. Each element $\vec{L}[n]$ of the envelope $\vec{L}$ is defined as the RMS power of the corresponding $\vec{W}_{\text{partial}}(n)$.
At 506, an anchor time 57 is evaluated for each nth window 54. The anchor time 57 is the mean position of the remaining sub-windows inside the respective window. It is noted $\vec{t}(n)$. This anchor time 57 is evaluated as follows. First, an anchor time $\tau(n, \iota)$ for each sub-window $\omega(n, \iota)$ is defined as being the middle position of the samples in each $\omega(n, \iota)$. The anchor time $\vec{t}(n)$ is defined as the mean of the anchor times of the sub-windows remaining in $\vec{W}_{\text{partial}}(n)$. If all sub-windows are discarded, then the anchor time $\vec{t}(n)$ is defined as the middle position of the window $\vec{W}(n)$.
At 510 and 512, the output envelope is defined as the loudness sequence, i.e. the values $\vec{L}[n]$, set at the respective times $\vec{t}(n)$. The anchor times $\vec{t}(n)$ (57) constitute the abscissa of the output envelope 59, and the loudness values $\vec{L}[n]$ (58) constitute the ordinate of the output envelope 59.
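The anchor-time evaluation can be sketched as follows. The names `sub_window_centers` and `anchor_time` are hypothetical, positions are expressed as 0-based sample indices (an assumption, the text does not fix a unit), and the example uses non-overlapping sub-windows only to keep the numbers simple.

```python
def sub_window_centers(window_start, n_sub, h_sub, count):
    """Middle sample position of each sub-window, as absolute 0-based indices."""
    return [window_start + i * h_sub + (n_sub - 1) / 2 for i in range(count)]


def anchor_time(window_start, n_window, n_sub, h_sub, chi, threshold):
    """Mean middle position of the sub-windows whose RMS exceeds `threshold`.

    `chi` holds one RMS value per sub-window of the window. If every
    sub-window is discarded, the middle of the window is returned instead.
    """
    centers = sub_window_centers(window_start, n_sub, h_sub, len(chi))
    kept = [c for c, level in zip(centers, chi) if level > threshold]
    if not kept:
        return window_start + (n_window - 1) / 2
    return sum(kept) / len(kept)


# Hypothetical window of 16 samples with four sub-windows of 4 samples each.
t_active = anchor_time(0, 16, 4, 4, [0.0, 0.0, 0.5, 0.5], threshold=0.01)
t_silent = anchor_time(0, 16, 4, 4, [0.0, 0.0, 0.0, 0.0], threshold=0.01)
```

When only the second half of the window carries signal, the anchor moves from the window middle (7.5) to the middle of the active part (11.5), so the measured value is placed where the audio actually is.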
Steps 500, 501, 502, and 503 of this embodiment are identical to the embodiment of
At 506, the RMS power values are expressed on a linear scale, and for each window, the mean of the RMS power values of the remaining sub-windows is evaluated.
Let $\chi[n, \iota]$ be the RMS power of the sub-window $\omega(n, \iota)$, expressed on a linear scale. The ordinate for the envelope, denoted as $\vec{L}[n]$, is evaluated for each n as the mean over the set $S_n$ of indices ι of the sub-windows of the nth window that are not discarded:
$$\vec{L}[n] = \frac{1}{|S_n|} \sum_{\iota \in S_n} \chi[n, \iota]$$
At 507, the loudness sequence $\vec{L}[n]$ (58) is expressed in the logarithmic domain, with $\vec{L}[n]$ being set to $20 \times \log_{10}(\vec{L}[n])$.
At 511, an anchor time (57) for each nth window (51) is evaluated in a similar way as in the embodiment of
At 510 and 512, the output envelope is defined as the values $\vec{L}[n]$ set at times $\vec{t}(n)$. $\vec{t}(n)$ (57) is the abscissa of the output envelope 59, and $\vec{L}[n]$ (58) is the ordinate of the output envelope 59.
At 508, each value of the envelope is evaluated as the weighted mean of the RMS values of the sub-windows in the window $\vec{W}(n)$, with the RMS values being themselves the coefficients.
Let $\chi[n, \iota]$ be the RMS power of the sub-window $\omega(n, \iota)$, expressed on a linear scale.
For each n, $\vec{L}[n]$ is set to
$$\vec{L}[n] = \frac{\sum_{\iota} \chi[n, \iota]^2}{\sum_{\iota} \chi[n, \iota]}$$
At 509, $\vec{L}[n]$ (58) is expressed in the logarithmic domain, with $\vec{L}[n]$ being set to $20 \times \log_{10}(\vec{L}[n])$.
At 511, an anchor time 57 for each nth window (51) is evaluated as a weighting position of the sub-windows. As in the embodiments of
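The anchor time $\vec{t}(n)$ is evaluated as the mean of the sub-window anchor times $\tau(n, \iota)$, weighted by the RMS values $\chi[n, \iota]$:
$$\vec{t}(n) = \frac{\sum_{\iota} (\chi \circ \tau)[n, \iota]}{\sum_{\iota} \chi[n, \iota]}$$
where "$\circ$" is the term-by-term or Hadamard product. If the RMS of all sub-windows is zero, i.e. if for a given n, $\sum_{\iota} \chi[n, \iota] = 0$, then the anchor time $\vec{t}(n)$ is defined as the middle position of the window $\vec{W}(n)$.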
At 510 and 512, the output envelope is defined as the values $\vec{L}[n]$ set at times $\vec{t}(n)$. The anchor times $\vec{t}(n)$ (57) constitute the abscissa of the output envelope 59, and the loudness values $\vec{L}[n]$ (58) constitute the ordinate of the output envelope 59.
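The weighted anchor time of this embodiment can be sketched as follows. The function name `weighted_anchor_time` is hypothetical, positions are expressed as 0-based sample indices (an assumption), and the example uses non-overlapping sub-windows only to keep the numbers simple.

```python
def weighted_anchor_time(window_start, n_window, n_sub, h_sub, chi):
    """Mean sub-window position weighted by the linear RMS values in `chi`.

    `chi` holds one RMS value per sub-window of the window. If all RMS
    values are zero, the middle position of the window is returned.
    """
    centers = [window_start + i * h_sub + (n_sub - 1) / 2 for i in range(len(chi))]
    total = sum(chi)
    if total == 0:
        return window_start + (n_window - 1) / 2
    # Sum of the term-by-term products chi * tau, divided by the sum of chi.
    return sum(level * center for level, center in zip(chi, centers)) / total


# Hypothetical window of 16 samples with four sub-windows of 4 samples each.
t_weighted = weighted_anchor_time(0, 16, 4, 4, [0.0, 0.0, 0.5, 1.0])
t_silent = weighted_anchor_time(0, 16, 4, 4, [0.0, 0.0, 0.0, 0.0])
```

In the example the louder last sub-window pulls the anchor toward its own position (about 12.17 rather than the unweighted 11.5), without any threshold being involved.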
In the following, an embodiment of an electronic device 130 is described under reference of
Embodiments which use software, firmware, programs, plugins or the like for performing the processes as described herein can be installed on computer 930, which is then configured to be suitable for the embodiment.
The computer 930 has a CPU 931 (Central Processing Unit), which can execute various types of procedures and methods as described herein, for example, in accordance with programs stored in a read-only memory (ROM) 932, stored in a storage 937 and loaded into a random access memory (RAM) 933, stored on a medium 940, which can be inserted in a respective drive 939, etc.
The CPU 931, the ROM 932 and the RAM 933 are connected with a bus 941, which in turn is connected to an input/output interface 934. The number of CPUs, memories and storages is only exemplary, and the skilled person will appreciate that the computer 930 can be adapted and configured accordingly for meeting specific requirements which arise when it functions as a base station, and user equipment.
At the input/output interface 934, several components are connected: an input 935, an output 936, the storage 937, a communication interface 938 and the drive 939, into which a medium 940 (compact disc, digital video disc, compact flash memory, or the like) can be inserted.
The input 935 can be a pointer device (mouse, graphics tablet, or the like), a keyboard, a microphone, a camera, a touchscreen, etc.
The output 936 can have a display (liquid crystal display, cathode ray tube display, light-emitting diode display, etc.), loudspeakers, etc.
The storage 937 can have a hard disk, a solid state drive and the like.
The communication interface 938 can be adapted to communicate, for example, via a local area network (LAN), wireless local area network (WLAN), mobile telecommunications system (GSM, UMTS, LTE, etc.), Bluetooth, infrared, etc.
It should be noted that the description above only pertains to an example configuration of computer 930. Alternative configurations may be implemented with additional or other sensors, storage devices, interfaces or the like. For example, the communication interface 938 may support other radio access technologies than the mentioned WLAN, GSM, UMTS and LTE.
The methods as described herein are also implemented in some embodiments as a computer program causing a computer and/or a processor and/or a circuitry to perform the method, when being carried out on the computer and/or processor and/or circuitry. In some embodiments, also a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor/circuitry, such as the processor/circuitry described above, causes the methods described herein to be performed.
It should be recognized that the embodiments describe methods with an exemplary ordering of method steps. The specific ordering of method steps is, however, given for illustrative purposes only and should not be construed as binding.
It should also be noted that the division of the control or circuitry of
All units and entities described in this specification and claimed in the appended claims can, if not stated otherwise, be implemented as integrated circuit logic, for example on a chip, and functionality provided by such units and entities can, if not stated otherwise, be implemented by software.
In so far as the embodiments of the disclosure described above are implemented, at least in part, using software-controlled data processing apparatus, it will be appreciated that a computer program providing such software control and a transmission, storage or other medium by which such a computer program is provided are envisaged as aspects of the present disclosure.
Note that the present technology can also be configured as described below:
(1) A method comprising determining an envelope (59, 63, 64) of an audio file (50) based on a double-windowing analysis (603) of the audio file.
(2) The method of (1), wherein the double-windowing analysis (603) comprises windowing (500) the source audio file to obtain a sequence of windows (51) containing audio, and windowing (501) each window of the sequence of windows (51) to obtain, for each window (51), a respective sequence of sub-windows (52).
(3) The method of (2), wherein determining the envelope (59, 63, 64) from the sequence of windows (51) comprises discarding sub-windows (52) whose loudness is below a threshold (53) and evaluating the loudness of each window (51) over the remaining audio.
(4) The method of (2), wherein determining the envelope (56) from the sequence of windows (51) comprises determining, for each window, the weighted mean of the loudness of the sub-windows (52) in each window (51), where the coefficients are the loudness values of the sub-windows (52).
(5) The method of any one of (1) to (4), further comprising determining a loudness curve (59) from the source audio file (31).
(6) The method of any one of (1) to (5), wherein the method is applied in an envelope evaluation framework.
(7) The method of any one of (1) to (6), wherein the method is applied in an envelope follower framework.
(8) The method of any one of (1) to (7), wherein the method is applied in an automatic audio mixing framework.
(9) The method of any one of (1) to (7), further comprising determining anchor times (57) for each window, and determining an output envelope (91) as a sequence of loudness values (58) set at respective anchor times (57).
(10) The method of (9) in which an anchor time (57) is evaluated as the mean position of sub-windows.
(11) The method of (9) in which an anchor time (57) is evaluated as a weighted mean position of sub-windows, with the weights being the loudness of the sub-windows, expressed on a linear scale.
(12) An electronic device comprising circuitry configured to determine an envelope (59, 63, 64) of an audio file (50) based on a double-windowing analysis (603) of the audio file.
(13) A computer program comprising instructions, which when executed on a processor cause the processor to determine an envelope of an audio file based on a double-windowing analysis of the audio file.
(14) A computer-readable medium storing instructions, which when executed on a processor cause the processor to determine an envelope of an audio file based on a double-windowing analysis of the audio file.
(15) An electronic device comprising circuitry configured to perform the method of any one of (1) to (11).
(16) A computer program comprising instructions, which when executed on a processor cause the processor to perform the method of any one of (1) to (11).
(17) A computer-readable medium storing instructions, which when executed on a processor cause the processor to perform the method of any one of (1) to (11).
[EBU 2011] EBU-TECH 3341, “Loudness metering: ‘EBU mode’ metering to supplement loudness normalisation in accordance with EBU R 128.”, EBU/UER, August 2011.
Number | Date | Country | Kind |
---|---|---|---|
17195346.6 | Oct 2017 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2018/077228 | 10/5/2018 | WO | 00 |