The present invention relates to a method and device for introducing human interactions in audio sequences.
Post-processing has become an integral part of professional music production. A song, e.g. a pop or rock song or a film score, is typically assembled from a multitude of different audio tracks representing musical instruments, vocals or software instruments. In audio engineering, tracks are often combined even though the musicians have not actually played together. A listener may eventually recognize this.
It is therefore an object of the present invention to provide a method and a device for combining audio tracks, where the result sounds like a simultaneous recording of the individual tracks, even if they were recorded separately.
This object is achieved by a method and a device according to the independent claims. Advantageous embodiments are defined in the dependent claims.
According to the invention, determining these characteristics of scale-free (fractal) musical coupling in human play can be used to imitate the generic interaction between two musicians in arbitrary audio tracks, comprising, in particular, electronically generated rhythms.
More particularly, the interbeat intervals exhibit long-range correlations (LRC) when one or more audio tracks are modified and the interbeat intervals exhibit long-range cross-correlations (LRCC) when two or more audio tracks are modified.
A time series contains LRC if its power spectral density (PSD) asymptotically decays as a power law, $p(f) \sim 1/f^{\beta}$, for small frequencies f and 0<β<2. The limits β=0 (β=2) indicate white noise (Brownian motion), while −2<β<0 indicates anti-correlations. In the literature, different normalizations for the power spectral frequency f can be found, which can be converted into one another. Here, f is measured in units of the Nyquist frequency ($f_{\mathrm{Nyquist}}$=½ Hz), which is half the sampling rate of the time series.
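By way of non-limiting illustration, the exponent β may be estimated from the slope of the periodogram on a log-log scale at low frequencies. The following is a minimal sketch under stated assumptions, not the procedure used by the inventors: the function names `psd_exponent`, `white_noise` and `cumulative_sum`, the plain periodogram estimator and the choice to fit only the lowest quarter of the frequencies (`low_fraction`) are all hypothetical. The unit in which f is measured does not affect the fitted slope.

```python
import cmath
import math
import random


def psd_exponent(series, low_fraction=0.25):
    """Estimate beta in p(f) ~ 1/f^beta from the low-frequency periodogram."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    # Assumption: only the lowest `low_fraction` of frequencies up to Nyquist is fitted.
    n_freq = max(2, int(low_fraction * (n // 2)))
    log_f, log_p = [], []
    for k in range(1, n_freq + 1):
        # Discrete Fourier coefficient at k/n cycles per sample.
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        log_f.append(math.log(k / n))
        log_p.append(math.log(abs(coeff) ** 2 / n))
    # Ordinary least-squares slope of log power against log frequency.
    mean_f = sum(log_f) / n_freq
    mean_p = sum(log_p) / n_freq
    slope = (sum((a - mean_f) * (b - mean_p) for a, b in zip(log_f, log_p))
             / sum((a - mean_f) ** 2 for a in log_f))
    return -slope


def white_noise(n, seed=0):
    """Gaussian white noise, for which beta is expected to be close to 0."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def cumulative_sum(series):
    """Running sum; applied to white noise it yields Brownian motion (beta near 2)."""
    out, acc = [], 0.0
    for v in series:
        acc += v
        out.append(acc)
    return out
```

Applied to `white_noise(1024)` the estimate is expected to lie near 0, and applied to `cumulative_sum(white_noise(1024))` near 2, in line with the two limits named above. The estimate fluctuates from realization to realization, so it should be read as indicative only.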
Long-range cross-correlations (LRCC) between two sequences of interbeat intervals, i.e. two non-stationary time series, exist if the detrended covariance $F_{\mathrm{DCCA}}(s)$ defined below asymptotically follows a power law $F_{\mathrm{DCCA}}(s) \sim s^{\delta}$ with 0.5<δ<1.5. In contrast, δ=0.5 indicates absence of LRCC.
The presence of such cross-correlations may be measured using a variant of detrended cross-correlation analysis (DCCA) [Podobnik B, Stanley H (2008), Detrended Cross-Correlation Analysis: A New Method for Analyzing Two Nonstationary Time Series. Phys. Rev. Lett. 100:084102]. Global detrending with a polynomial of degree k may be added as an initial step prior to DCCA, which has been shown to be crucial in analyzing slowly varying non-stationary signals [Podobnik B, et al. (2009), Quantifying cross-correlations using local and global detrending approaches. Eur. Phys. J. B 71:243-250.]. In fact, global detrending proved to be a crucial step in calculating the DCCA exponent of the non-stationary time series of interbeat intervals analyzed by the inventors. Without global detrending, much larger DCCA exponents are obtained, i.e., spurious LRCC are detected that reflect global trends.
Given two time series $X_n$, $X'_n$, where n=1 . . . N, the DCCA method including prior global detrending thus consists of the following steps:
(1) Global detrending: fitting a polynomial of degree k to each of $X_n$ and $X'_n$ and subtracting it, where typically k=1 . . . 5. One may use k=3. It should be carefully checked that the obtained DCCA scaling exponents do not change significantly with k.
(2) Integrating the time series: $R_n=\sum_{i=1}^{n} X_i$ and $R'_n=\sum_{i=1}^{n} X'_i$.
(3) Dividing the series into windows of size s and computing least-squares fits $\tilde{R}_n$ and $\tilde{R}'_n$ for both time series in each window.
(4) Calculating the detrended covariance
where Ns is the number of windows of size s.
For fractal scaling, $F_{\mathrm{DCCA}}(s) \propto s^{\delta}$ with 0.5<δ<1.5. Absence of LRCC is indicated by δ=0.5. Another indicator of absence of LRCC is that the detrended covariance $F_{\mathrm{DCCA}}(s)$ changes sign and fluctuates around zero as a function of the time scale s [Podobnik B, et al. (2009), Quantifying cross-correlations using local and global detrending approaches, Eur. Phys. J. B 71:243-250].
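The DCCA steps with prior global detrending may be sketched as follows. This is an illustrative sketch only. The formula for the detrended covariance is not reproduced in the text, so the normalization used here (the sum of residual products over all windows divided by the number of windows times s), the use of non-overlapping windows and the linear local fit are assumptions, as are the function names `polyfit_residuals` and `dcca_covariance`.

```python
def polyfit_residuals(values, degree):
    """Return values minus their least-squares polynomial fit.

    Requires len(values) > degree and len(values) >= 2.
    """
    n = len(values)
    # Map sample indices to [-1, 1] to keep the normal equations well conditioned.
    xs = [2.0 * i / (n - 1) - 1.0 for i in range(n)]
    m = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    aug = [[sum(x ** (r + c) for x in xs) for c in range(m)]
           + [sum(v * x ** r for x, v in zip(xs, values))]
           for r in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    coef = [aug[r][m] / aug[r][r] for r in range(m)]
    return [v - sum(c * x ** p for p, c in enumerate(coef))
            for x, v in zip(xs, values)]


def dcca_covariance(x, y, s, k=3):
    """Detrended covariance of x and y at window size s (assumed normalization)."""
    # Step (1): global detrending with a polynomial of degree k.
    x = polyfit_residuals(x, k)
    y = polyfit_residuals(y, k)
    # Step (2): integrate both series (running sums).
    rx, ry, acc_x, acc_y = [], [], 0.0, 0.0
    for a, b in zip(x, y):
        acc_x += a
        acc_y += b
        rx.append(acc_x)
        ry.append(acc_y)
    # Step (3): non-overlapping windows of size s with a linear fit in each.
    n_windows = len(rx) // s
    total = 0.0
    for w in range(n_windows):
        res_x = polyfit_residuals(rx[w * s:(w + 1) * s], 1)
        res_y = polyfit_residuals(ry[w * s:(w + 1) * s], 1)
        # Step (4): accumulate the product of the local residuals.
        total += sum(p * q for p, q in zip(res_x, res_y))
    return total / (n_windows * s)
```

Evaluating `dcca_covariance` over a range of window sizes s and fitting the slope of its logarithm against log s would give an estimate of δ. A covariance that changes sign across s cannot be fitted this way, which corresponds to the second indicator of absent LRCC mentioned above.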
The invention may be embodied in a computer-implemented method or a device for combining a first and a second audio track, in a software plugin product, e.g. for a digital audio workstation (DAW) that, when executed, implements a method according to the invention, in an audio signal, comprising one or more audio tracks obtained by a method according to the invention and/or in a medium storing an audio signal according to the invention.
These and other aspects and advantages of the present invention are described more thoroughly in the following detailed description of embodiments of the invention and with reference to the drawing in which
The procedure to introduce human-like musical coupling in two audio tracks A and B is demonstrated using an instrumental version of the song ‘Billie Jean’ by Michael Jackson. The song Billie Jean was chosen because drum and bass tracks consist of a simple rhythmic and melodic pattern that is repeated continuously throughout the entire song. This leads to a steady beat in drum and bass, which is well suited to demonstrate their generic mutual interaction. For simplicity, all instruments were merged into two tracks: track A includes all drum and keyboard sounds, while track B includes the bass.
In step 110, the interbeat intervals of the first and the second audio track are determined. The interbeat intervals of tracks A and B read $I_{A,t}=X_t+T$ and $I_{B,t}=Y_t+T$, where T is the average interbeat interval given by the tempo (here, T=256 ms, which corresponds to 234 beats per minute in eighth notes). If the audio tracks are MIDI files, this may be done based on the ‘note on’ messages. Otherwise, known suitable beat detection procedures may be used.
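As a simple illustration of step 110, the interbeat intervals and their deviations from the mean interval may be derived from a list of beat onset times, for instance the time stamps of MIDI ‘note on’ messages. The sketch below assumes that the onsets are already available in milliseconds and that T is estimated as the mean interval rather than taken from a known tempo; the function names are hypothetical.

```python
def interbeat_intervals(onsets_ms):
    """Differences between consecutive beat onsets, in milliseconds."""
    return [later - earlier for earlier, later in zip(onsets_ms, onsets_ms[1:])]


def deviations_from_mean(intervals_ms):
    """Split intervals into the mean interval T and deviations X_t = I_t - T."""
    mean_interval = sum(intervals_ms) / len(intervals_ms)
    return mean_interval, [i - mean_interval for i in intervals_ms]


def beats_per_minute(mean_interval_ms):
    """Tempo corresponding to a mean interbeat interval given in milliseconds."""
    return 60000.0 / mean_interval_ms
```

For a mean interval of 256 ms, `beats_per_minute(256.0)` evaluates to 234.375, consistent with the tempo of roughly 234 beats per minute stated above.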
If the time series Xt and Yt are long-range cross-correlated, a musical coupling between drum and bass tracks is obtained.
In step 120, the interbeat intervals of at least one of the first audio track A and the second audio track B are modified. Small deviations are added to the interbeat intervals in order to modify a long-range cross-correlation (LRCC) between the interbeat intervals of the first and the second audio track. More particularly, the interbeat intervals are modified in order to induce LRCC between the interbeat intervals of the two audio tracks with a power law exponent, also called DCCA exponent δ, which measures the strength of the LRCC. For δ=0.5, there are no LRCC, while the strength of the LRCC increases with δ.
More than two audio tracks can be modified by having each additional track respond to the average of all other tracks' deviations.
In particular, musical coupling between Xt and Yt is introduced using a two-component Autoregressive Fractionally Integrated Moving Average (ARFIMA) process (Eq. (2)) with δ=0.9, which generates two time series $x_{1,2}$ that exhibit LRCC [Podobnik B, Stanley H (2008), Detrended Cross-Correlation Analysis: A New Method for Analyzing Two Nonstationary Time Series. Phys. Rev. Lett. 100:084102; Podobnik B, Wang D, Horvatić D, Grosse I, Stanley H E (2010), Time-lag cross-correlations in collective phenomena, Europhys. Lett. 90:68001].
The process is defined by
with Hurst exponents $0.5<\alpha_A,\alpha_B<1$, weights $w_n(d)=d\,\Gamma(n-d)/(\Gamma(1-d)\,\Gamma(n+1))$, Gaussian white noise $\xi_{t,A}$ and $\xi_{t,B}$, and gamma function Γ. The coupling constant W ranges from 0.5 (maximum coupling between $x_t$ and $y_t$) to 1 (no coupling). It has been shown analytically that the cross-correlation exponent is given by $\delta=(\alpha_A+\alpha_B)/2$.
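A two-component ARFIMA process of this kind may be sketched as follows. Since the defining equation is not reproduced in the text, the recursion used here is an assumption modeled on the form given by Podobnik and Stanley: each series is driven by a weighted sum over its own past and the past of the other series, mixed by W and 1−W. The identification d=α−0.5, the truncation of the infinite sum to `memory` terms, and all function names are likewise assumptions made for illustration.

```python
import math
import random


def arfima_weights(d, n_terms):
    """Weights w_n(d) = d*Gamma(n-d) / (Gamma(1-d)*Gamma(n+1)) for n = 1..n_terms."""
    return [d * math.exp(math.lgamma(n - d) - math.lgamma(1.0 - d)
                         - math.lgamma(n + 1.0))
            for n in range(1, n_terms + 1)]


def coupled_arfima(length, alpha_a, alpha_b, coupling, memory=200, seed=None):
    """Generate two coupled series x, y (assumed recursion, see lead-in)."""
    rng = random.Random(seed)
    # Assumption: the fractional parameter is d = alpha - 0.5.
    w_a = arfima_weights(alpha_a - 0.5, memory)
    w_b = arfima_weights(alpha_b - 0.5, memory)
    x, y = [], []
    for t in range(length):
        sum_x = sum_y = 0.0
        # Assumption: the infinite sum over the past is truncated to `memory` terms.
        for n in range(1, min(t, memory) + 1):
            x_past, y_past = x[t - n], y[t - n]
            sum_x += w_a[n - 1] * (coupling * x_past + (1.0 - coupling) * y_past)
            sum_y += w_b[n - 1] * ((1.0 - coupling) * x_past + coupling * y_past)
        x.append(sum_x + rng.gauss(0.0, 1.0))
        y.append(sum_y + rng.gauss(0.0, 1.0))
    return x, y


def rescale(series, target_std):
    """Center a series and scale it to a target standard deviation."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
    return [(v - mean) * target_std / std for v in series]
```

With `alpha_a = alpha_b = 0.9` the relation δ=(αA+αB)/2 gives δ=0.9. The generated series could then be rescaled to a target standard deviation in milliseconds with `rescale` before being added to the interbeat intervals.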
The standard deviation chosen for Xt and Yt was 10 ms. The time series of deviations Xt and Yt for musical coupling are shown in
Introducing LRC in audio tracks is referred to as “humanizing”. For separately humanized sequences (i.e., without adding cross-correlations between the sequences), however, absence of LRCC is to be expected. Indeed, when humanizing the time series of interbeat intervals separately (e.g., with an exponent β=0.9), the detrended covariance of Xt and Yt oscillates around zero, i.e., no LRCC are found.
All other characteristics, such as pitch, timbre and loudness remain unchanged.
In step 130, the combined audio tracks are stored in a non-volatile, computer-readable medium.
Processes other than the ARFIMA process that generate LRCC can also be used to induce musical coupling. More particularly, when two subjects A and B are synchronizing a rhythm, each person attempts to (partly) compensate for the deviations $d_n=t_{A,n}-t_{B,n}$ perceived between the two n'th beats when generating the (n+1)'th beat. This is reflected by the following model, referred to as the Mutually Interacting Complex Systems (MICS) model:
$$I_{A,n}=\sigma_A C_{A,n}+T+\xi_{A,n}-\xi_{A,n-1}-W_A d_{n-1}$$
$$I_{B,n}=\sigma_B C_{B,n}+T+\xi_{B,n}-\xi_{B,n-1}+W_B d_{n-1} \qquad (1)$$
where $C_{A,n}$ and $C_{B,n}$ are Gaussian distributed $1/f^{\beta}$ noise time series with exponents $0<\beta_A,\beta_B<2$, $\xi_{A,n}$ and $\xi_{B,n}$ are Gaussian white noise and T is the mean beat interval. We set $d_0=0$. The model assumes that the generation of temporal intervals is composed of three parts: (i) an internal clock with $1/f^{\beta}$ noise errors, (ii) a motor program with white noise errors associated with moving a finger or limb, referred to in
The deviations $d_n$ which the musicians perceive and adapt to can be written as a sum over all previous interbeat intervals,
$$d_n=t_{A,n}-t_{B,n}=\sum_{i=1}^{n}\left(I_{A,i}-I_{B,i}\right),$$
thus involving all previous elements of the time series of IBIs of both musicians. Therefore, this model reflects that scale-free coupling of the two subjects emerges mainly through the adaptation to deviations between their beats.
The coupling strengths $0<W_A,W_B<2$ describe the rate of compensation of a deviation in the generation of the next beat. In the limit $W_A=W_B=0$ and $\beta_A=\beta_B=1$ the second model reduces to the model introduced by Gilden et al., in the following called the Gilden model [Gilden D L, Thornton T, Mallon M W (1995), 1/f noise in human cognition, Science 267:1837-1839]. The MICS model diverges for $W_A+W_B\geq 2$, i.e., when subjects are over-compensating.
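The recursion of the MICS model may be illustrated by the following sketch, which iterates the two interval equations and updates the deviation as the previous deviation plus the difference of the two new intervals. It is given under stated assumptions: the clock noise series (the $1/f^{\beta}$ terms) and the motor noise series are supplied by the caller rather than generated here, the motor noise before the first beat is taken to be zero, and the function name `simulate_mics` and its parameter names are hypothetical.

```python
def simulate_mics(clock_a, clock_b, motor_a, motor_b, mean_interval,
                  sigma_a=1.0, sigma_b=1.0, w_a=0.5, w_b=0.5):
    """Iterate the MICS interval equations for two subjects A and B.

    clock_a, clock_b: clock noise series C_A, C_B (caller-supplied).
    motor_a, motor_b: white motor noise series xi_A, xi_B (caller-supplied).
    Returns the interval lists I_A, I_B and the deviations d_0..d_N.
    """
    intervals_a, intervals_b = [], []
    deviations = [0.0]  # d_0 = 0, as stated in the text
    # Assumption: the motor noise preceding the first beat is zero.
    prev_motor_a = prev_motor_b = 0.0
    for n in range(len(clock_a)):
        d_prev = deviations[-1]
        # Subject A shortens the next interval when running late (d_prev > 0).
        i_a = (sigma_a * clock_a[n] + mean_interval
               + motor_a[n] - prev_motor_a - w_a * d_prev)
        # Subject B lengthens the next interval by the mirrored correction.
        i_b = (sigma_b * clock_b[n] + mean_interval
               + motor_b[n] - prev_motor_b + w_b * d_prev)
        intervals_a.append(i_a)
        intervals_b.append(i_b)
        # d_n = t_A,n - t_B,n grows by the difference of the new intervals.
        deviations.append(d_prev + i_a - i_b)
        prev_motor_a, prev_motor_b = motor_a[n], motor_b[n]
    return intervals_a, intervals_b, deviations
```

In this recursion the deviation obeys d_n = (1 − W_A − W_B) d_{n−1} plus noise terms, which makes the stated divergence for W_A+W_B≥2 plausible: the factor then has magnitude of at least one and deviations are no longer damped.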
A possible extension of the second model is to consider variable coupling strengths W=W(dn). Since larger deviations are likely to be perceived more distinctly, one possible scenario is to introduce couplings W that increase with dn. For example, W may increase when large deviations such as glitches are perceived.
The experimental setup comprises a keyboard 310 connected to speakers 320 and a recorder 330 for recording notes played by test subjects 1 and 2 on the keyboard 310. Preferably, the keyboard 310 has a MIDI interface and the recorder 330 records MIDI messages.
The performances were recorded at the Harvard University Studio for Electroacoustic Composition (See Supporting Information for details) on a Studiologic SL 880 keyboard yielding 57 time series of Musical Instrument Digital Interface (MIDI) recordings. However, the results presented here apply not only to MIDI but also to acoustic recordings.
Each recording typically lasted 6-8 minutes and contained approx. 1000 beats per subject. The temporal occurrences $t_1, \ldots, t_n$ of the beats were extracted from the MIDI recordings and the interbeat intervals read $I_n=t_n-t_{n-1}$ with $t_0=0$. The subjects were asked to press a key with their index finger according to the following. Task type (Ia): Two subjects played beats in synchrony with one finger each. (Ib) ‘Sequential recordings’ were made, where subject B synchronized with prior recorded beats of subject A. Sequential recordings are widely used in professional studio recordings, where typically the drummer is recorded first, followed by layers of other instruments. Task type (II): One subject played beats in synchrony with one finger from each hand. Task type (III): One subject played beats with one finger (‘finger tapping’). Finger tapping of single subjects is well studied in the literature [Repp B H, Su Y H (2013), Sensorimotor synchronization: A review of recent research, (2006-2012). Psychon B Rev 20:403-452.] and serves as a baseline, whereas our focus is on synchronization between subjects. In addition to periodic tapping, a 4/4 rhythm {1, 2.5, 3, 4}, where the second beat is replaced by an offbeat, was used in tasks (I-III).
A comparison of the MICS model (
The MICS model predicts emergence of LRCC (
Evidence for LRCC between IA and IB on time scales up to the total recording time is reported in
In contrast, when a single subject is synchronizing his left and right hands (task (II)), no significant LRCC were observed, suggesting that the interaction of two complex systems is a necessary prerequisite for rhythmic binding.
The inventor identified two distinct regions in the PSD of the interbeat intervals separated by a vertex of the curve at a characteristic frequency fc≈0.1 fNyquist (see
e) shows that the MICS model reproduces both regions and fc for interacting complex systems. The two subjects potentially perceive the deviations dn=tA,n−tB,n between their beats. The DFA exponent α=0.72 for the time series dn indicates long-range correlations in the deviations (averaging over the entire data set one finds
In the present data set, exponents were found to be in a broad range 0.5<λ<1.5; hence, the analysis suggests coupling audio tracks using LRCC with a power law exponent 0.5<λ<1.5. However, even larger exponents λ>1.5 are found when no global detrending of the interbeat intervals is used or in cases when the nonstationarity of the time series is not easily removed by global detrending.
There is a fundamental difference between settings where individuals are provided with a metronome click (e.g., over headphones) while playing and where no metronome is present (also referred to as self-paced play) that manifests in the PSD of the interbeat intervals.
For self-paced play of musical rhythms, the PSD of the interbeat intervals exhibits two distinct regions [Hennig H, et al. (2011), The Nature and Perception of Fluctuations in Human Musical Rhythms, PLoS ONE 6:e26457]. Long-range correlations are found asymptotically for small frequencies in the PSD. This region relates to correlations over long time scales of up to several minutes (as long as the subject does not frequently lose rhythm). On the other hand, for high frequencies in the PSD anti-correlations are found.
In contrast, a different situation is observed in presence of a metronome: For play of both complex musical rhythms [Hennig H, Fleischmann R, Geisel T (2012), Musical rhythms: The science of being slightly off, Physics Today 65:64-65.] and finger tapping [Repp B H, Su Y H (2013), Sensorimotor synchronization: A review of recent research, (2006-2012). Psychon B Rev 20:403-452.], long-range correlations were found in the time series of deviations of the beats from the metronome clicks. Below, the difference between the deviations and the interbeat intervals in the PSD will be quantified. The deviations from the metronome clicks are defined as en=tn−Mn, where tn is the temporal occurrence (e.g., the onset) of the n'th beat, Mn=nT is the temporal occurrence of the n'th metronome click and T is the time period between two consecutive metronome clicks. The interbeat intervals read
$$I_n=t_n-t_{n-1}=e_n-e_{n-1}+T.$$
Hence, the interbeat intervals are the derivative of the deviations (except for a constant). In the following, a relation is derived between the PSD exponents of $e_n$ and $I_n$. Consider a time series $x_n$ whose PSD asymptotically decays as a power law $1/f^{\beta}$ with exponent β, and let the time series $\dot{x}_n=x_n-x_{n-1}$ denote the derivative of $x_n$. Then it can be shown analytically that the PSD of the derivative time series $\dot{x}_n$ asymptotically follows a power law with exponent β−2 [Beran, J, Statistics for long-memory processes, Chapman&Hall/CRC 1994]. Applying this general result to the present case, one finds
$$\beta(I_n)=\beta(e_n)-2.$$
As a consequence, when $e_n$ exhibits long-range correlations with exponent $0<\beta(e_n)<2$, the derivative $I_n$ exhibits long-range anti-correlations with $-2<\beta(I_n)<0$.
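The shift of the exponent by 2 can be made plausible by a short calculation, given here only as a sketch for small frequencies. The symbols $S_x$ and $S_{\dot{x}}$ for the PSDs of $x_n$ and $\dot{x}_n$, and the angular frequency ω, are introduced for illustration and do not appear elsewhere in this description. Taking differences acts as a linear filter, so the PSD is multiplied by the squared magnitude of the filter's frequency response:

```latex
\begin{aligned}
\dot{x}_n &= x_n - x_{n-1}
\;\Longrightarrow\;
S_{\dot{x}}(\omega) = \left|1 - e^{-i\omega}\right|^{2} S_x(\omega)
= 4\sin^{2}\!\left(\tfrac{\omega}{2}\right) S_x(\omega)
\approx \omega^{2}\, S_x(\omega) \quad (\omega \to 0), \\
S_x(\omega) &\propto \omega^{-\beta}
\;\Longrightarrow\;
S_{\dot{x}}(\omega) \propto \omega^{-(\beta-2)} .
\end{aligned}
```

The additive constant T in the interbeat intervals only contributes at zero frequency and therefore does not alter the exponent.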
When subjects are synchronizing beats with a metronome, the time series of deviations exhibits long-range correlations with PSD exponents reported in the range $\beta(e_n)=[0.2;\,1.3]$ [Hennig H, Fleischmann R, Geisel T (2012), Musical rhythms: The science of being slightly off, Physics Today 65:64-65.]. Hence, one may expect the PSD exponents for the time series of interbeat intervals in the range $\beta(I_n)=\beta(e_n)-2=[-1.8;\,-0.7]$. Thus, the interbeat intervals are long-range anti-correlated for settings where a metronome is present. Humanizing a time series of deviations $e_n$ with an exponent 0<β<2 thus is equivalent to humanizing the interbeat intervals $I_n$ with −2<β<0. In contrast, for self-paced play as found by the inventor (i.e., in absence of a metronome), the interbeat intervals are long-range correlated on time scales of up to several minutes.
Different audio tracks are represented as channels 1 and 2. For each channel, the standard deviation of the timing error may be set. In addition, the spectral exponent β of the timing error may be set for each channel. Further, the motor error standard deviation may also be adjusted for each channel. Finally, the user may also set the coupling strength W for each channel. Given these data, the software device calculates an offset. More than two channels can be modified by having each additional channel respond to the average of all other channels' deviations.
Once the relevant parameters are set, the plug-in combines the audio tracks according to the previously described method.