Method and Device for Introducing Human Interactions in Audio Sequences

Information

  • Patent Application
  • Publication Number
    20150364123
  • Date Filed
    June 13, 2014
  • Date Published
    December 17, 2015
Abstract
A method for combining a first and a second audio track includes modifying at least one of the two audio tracks and storing the first and the second audio track in a non-volatile medium, characterized in that the interbeat intervals of the modified first and the second audio track exhibit long-range cross-correlations (LRCC).
Description

The present invention relates to a method and device for introducing human interactions in audio sequences.


Post-processing has become an integral part of professional music production. A song, e.g. a pop or rock song or a film score, is typically assembled from a multitude of different audio tracks representing musical instruments, vocals or software instruments. In audio engineering, tracks are often combined even though the musicians have not actually played together. A listener may eventually recognize this.


It is therefore an object of the present invention to provide a method and a device for combining audio tracks, where the result sounds like a simultaneous recording of the individual tracks, even if they were recorded separately.


SUMMARY OF THE INVENTION

This object is achieved by a method and a device according to the independent claims. Advantageous embodiments are defined in the dependent claims.


According to the invention, determining these characteristics of scale-free (fractal) musical coupling in human play can be used to imitate the generic interaction between two musicians in arbitrary audio tracks, comprising, in particular, electronically generated rhythms.


More particularly, the interbeat intervals exhibit long-range correlations (LRC) when one or more audio tracks are modified and the interbeat intervals exhibit long-range cross-correlations (LRCC) when two or more audio tracks are modified.


A time series contains LRC if its power spectral density (PSD) asymptotically decays as a power law, $p(f) \sim 1/f^{\beta}$, for small frequencies f and 0<β<2. The limits β=0 and β=2 indicate white noise and Brownian motion, respectively, while −2<β<0 indicates anti-correlations. In the literature, different normalizations for the frequency f can be found, which can be converted into one another. Here, f is measured in units of the Nyquist frequency ($f_{Nyquist}$=½ Hz), which is half the sampling rate of the time series.


Long-range cross-correlations (LRCC) between two sequences of interbeat intervals, i.e. two non-stationary time series, exist if the detrended covariance $F_{DCCA}(s)$ defined below asymptotically follows a power law $F_{DCCA}(s) \sim s^{\delta}$ with 0.5<δ<1.5. In contrast, δ=0.5 indicates absence of LRCC.


The presence of such cross-correlations may be measured using a variant of detrended cross-correlation analysis (DCCA) [Podobnik B, Stanley H (2008), Detrended Cross-Correlation Analysis: A New Method for Analyzing Two Nonstationary Time Series. Phys. Rev. Lett. 100:084102]. Global detrending with a polynomial of degree k may be added as an initial step prior to DCCA, which has been shown to be crucial in analyzing slowly varying non-stationary signals [Podobnik B, et al. (2009), Quantifying cross-correlations using local and global detrending approaches. Eur. Phys. J. B 71:243-250]. In fact, global detrending proved to be a crucial step in calculating the DCCA exponent of the non-stationary time series of interbeat intervals analyzed by the inventors. Without global detrending, much larger DCCA exponents are obtained, i.e., spurious LRCC are detected that reflect global trends.


Given two time series $X_n$ and $X'_n$, where n=1, ..., N, the DCCA method including prior global detrending thus consists of the following steps:


(1) Global detrending: fitting a polynomial of degree k to each of $X_n$ and $X'_n$ and subtracting the fit, where typically k=1, ..., 5. One may use k=3. It should be carefully checked that the obtained DCCA scaling exponents do not change significantly with k.


(2) Integrating the time series: $R_n = \sum_{i=1}^{n} X_i$ and $R'_n = \sum_{i=1}^{n} X'_i$.


(3) Dividing the series into windows of size s and computing least-squares fits $\tilde{R}_n$ and $\tilde{R}'_n$ for both time series in each window.


(4) Calculating the detrended covariance









$$F_{DCCA}(s) = \frac{1}{N_s - 1} \sum_{k=1}^{N_s} \left(R_k - \tilde{R}_k\right)\left(R'_k - \tilde{R}'_k\right),$$




where Ns is the number of windows of size s.
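
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the inventors' implementation: the local fit in each window is taken to be linear, the windows are non-overlapping, and each window contributes the mean product of its residuals before the 1/(N_s − 1) prefactor is applied. The helper names `polyfit_residuals`, `integrate` and `dcca_covariance` are hypothetical.

```python
import random

def polyfit_residuals(y, degree):
    """Return y minus its least-squares polynomial fit of the given degree."""
    n, m = len(y), degree + 1
    if n < m:
        raise ValueError("not enough points for this polynomial degree")
    # Map the index to [-1, 1] to keep the normal equations well conditioned.
    xs = [2.0 * i / (n - 1) - 1.0 if n > 1 else 0.0 for i in range(n)]
    a = [[sum(x ** (p + q) for x in xs) for q in range(m)] for p in range(m)]
    b = [sum(yi * x ** p for x, yi in zip(xs, y)) for p in range(m)]
    # Solve the normal equations by Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(a[r][c] * coef[c] for c in range(r + 1, m))
        coef[r] = (b[r] - s) / a[r][r]
    return [yi - sum(cf * x ** p for p, cf in enumerate(coef))
            for x, yi in zip(xs, y)]

def integrate(x):
    """Cumulative sum R_n = X_1 + ... + X_n."""
    total, out = 0.0, []
    for v in x:
        total += v
        out.append(total)
    return out

def dcca_covariance(x, y, s, k=3):
    """Detrended covariance for window size s with global detrending of degree k."""
    # Step 1: global detrending of both series.
    x = polyfit_residuals(x, k)
    y = polyfit_residuals(y, k)
    # Step 2: integration.
    rx, ry = integrate(x), integrate(y)
    # Step 3: non-overlapping windows with a local linear fit (assumption).
    n_windows = len(rx) // s
    if n_windows < 2:
        raise ValueError("need at least two windows")
    total = 0.0
    for w in range(n_windows):
        ex = polyfit_residuals(rx[w * s:(w + 1) * s], 1)
        ey = polyfit_residuals(ry[w * s:(w + 1) * s], 1)
        total += sum(p * q for p, q in zip(ex, ey)) / s
    # Step 4: combine the windows with the 1/(N_s - 1) prefactor.
    return total / (n_windows - 1)

rng = random.Random(0)
a_series = [rng.gauss(0.0, 1.0) for _ in range(400)]
print(dcca_covariance(a_series, a_series, s=20))  # a series with itself: positive
```

Evaluating `dcca_covariance` over a range of window sizes s gives the curve whose scaling is examined next.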


For fractal scaling, $F_{DCCA}(s) \propto s^{\delta}$ with 0.5<δ<1.5. Absence of LRCC is indicated by δ=0.5. Another indicator of absence of LRCC is that the detrended covariance $F_{DCCA}(s)$ changes sign and fluctuates around zero as a function of the time scale s [Podobnik B, et al. (2009), Quantifying cross-correlations using local and global detrending approaches, Eur. Phys. J. B 71:243-250].
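
As an illustration of how the exponent δ might be read off, the following sketch fits a straight line to log F versus log s. It assumes that covariance values for several window sizes are already available; the function name `dcca_exponent` is hypothetical. The function returns `None` whenever a covariance value is non-positive, since no log-log slope is defined there; a covariance that fluctuates around zero falls into this case.

```python
import math

def dcca_exponent(scales, covariances):
    """Least-squares slope of log F versus log s, or None if any F is non-positive."""
    if any(f <= 0 for f in covariances):
        return None
    lx = [math.log(s) for s in scales]
    ly = [math.log(f) for f in covariances]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

scales = [8, 16, 32, 64, 128]
synthetic = [3.0 * s ** 0.9 for s in scales]  # exact power law with exponent 0.9
print(dcca_exponent(scales, synthetic))
```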


The invention may be embodied in a computer-implemented method or a device for combining a first and a second audio track, in a software plugin product, e.g. for a digital audio workstation (DAW) that, when executed, implements a method according to the invention, in an audio signal, comprising one or more audio tracks obtained by a method according to the invention and/or in a medium storing an audio signal according to the invention.





BRIEF DESCRIPTION OF THE FIGURES

These and other aspects and advantages of the present invention are described more thoroughly in the following detailed description of embodiments of the invention and with reference to the drawings, in which:



FIG. 1 shows a flowchart of a method according to an embodiment of the invention.



FIG. 2 shows an example of two coupled time series generated with the two-component ARFIMA process.



FIG. 3 shows a diagram of an experimental setup for analyzing combinations of audio tracks played by a human subject.



FIG. 4 shows a representative example of the findings from a recording of two professional musicians A and B playing periodic beats in synchrony (task type (Ia)).



FIG. 5 shows (a) evidence of scale-free cross-correlations in the MICS model and (b) the PSD of the interbeat intervals in the MICS model.



FIG. 6 shows an illustration of the PSD of the interbeat intervals when humans are playing or synchronizing rhythms (a) without and (b) with a metronome.



FIG. 7 shows a user interface 700 of a software implemented human interaction device based on the MICS model.





DETAILED DESCRIPTION


FIG. 1 shows a flowchart of a method according to an embodiment of the invention. The method receives a first audio track A and a second audio track B as inputs.


The procedure to introduce human-like musical coupling in two audio tracks A and B is demonstrated using an instrumental version of the song ‘Billie Jean’ by Michael Jackson. The song Billie Jean was chosen because drum and bass tracks consist of a simple rhythmic and melodic pattern that is repeated continuously throughout the entire song. This leads to a steady beat in drum and bass, which is well suited to demonstrate their generic mutual interaction. For simplicity, all instruments were merged into two tracks: track A includes all drum and keyboard sounds, while track B includes the bass.


In step 110, the interbeat intervals of the first and the second audio track are determined. The interbeat intervals of tracks A and B read $I_{A,t} = X_t + T$ and $I_{B,t} = Y_t + T$, where $X_t$ and $Y_t$ are the deviations and T is the average interbeat interval given by the tempo (here, T=256 ms, which corresponds to about 234 beats per minute counted in eighth notes). If the audio tracks are MIDI files, this may be done based on the ‘note on’ messages. Otherwise, known suitable beat detection procedures may be used.
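
A minimal sketch of step 110 for the MIDI case, assuming the ‘note on’ times have already been extracted as a list of onsets in milliseconds. The onset values and function names below are hypothetical and serve only to show the arithmetic.

```python
def interbeat_intervals(onsets_ms):
    """Differences between consecutive beat onsets, in milliseconds."""
    return [b - a for a, b in zip(onsets_ms, onsets_ms[1:])]

def deviations_from_tempo(intervals_ms, mean_interval_ms):
    """Deviation series X_t = I_t - T for a given average interval T."""
    return [i - mean_interval_ms for i in intervals_ms]

def beats_per_minute(mean_interval_ms):
    """Tempo corresponding to an average interbeat interval."""
    return 60000.0 / mean_interval_ms

onsets = [0.0, 250.0, 512.0, 770.0, 1024.0]   # hypothetical note-on times
intervals = interbeat_intervals(onsets)        # [250.0, 262.0, 258.0, 254.0]
T = sum(intervals) / len(intervals)            # 256.0
print(deviations_from_tempo(intervals, T))     # [-6.0, 6.0, 2.0, -2.0]
print(beats_per_minute(256.0))                 # 234.375
```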


If the time series Xt and Yt are long-range cross-correlated, a musical coupling between drum and bass tracks is obtained.


In step 120, the interbeat intervals of at least one of the first audio track A and the second audio track B are modified. Small deviations are added to the interbeat intervals in order to modify a long-range cross-correlation (LRCC) between the interbeat intervals of the first and the second audio track. More particularly, the interbeat intervals are modified in order to induce LRCC between the interbeat intervals of the two audio tracks with a power law exponent, also called DCCA exponent δ, which measures the strength of the LRCC. For δ=0.5, there are no LRCC, while the strength of the LRCC increases with δ.


More than two audio tracks can be modified by having each additional track responding to the average of all other tracks' deviations.


In particular, musical coupling between $X_t$ and $Y_t$ is introduced using a two-component Autoregressive Fractionally Integrated Moving Average (ARFIMA) process (2) with δ=0.9 that generates two time series $x_{1,2}$ which exhibit LRCC [Podobnik B, Stanley H (2008), Detrended Cross-Correlation Analysis: A New Method for Analyzing Two Nonstationary Time Series. Phys. Rev. Lett. 100:084102; Podobnik B, Wang D, Horvatić D, Grosse I, Stanley H E (2010), Time-lag cross-correlations in collective phenomena, Europhys. Lett. 90:68001].


The process is defined by







$$X_t = \sum_{n=1}^{\infty} w_n(\alpha_A - 0.5)\, x_{t-n}$$

$$Y_t = \sum_{n=1}^{\infty} w_n(\alpha_B - 0.5)\, y_{t-n}$$

$$x_t = \left[W X_t + (1 - W)\, Y_t\right] + \xi_{t,A}$$

$$y_t = \left[(1 - W)\, X_t + W Y_t\right] + \xi_{t,B}$$







with Hurst exponents $0.5 < \alpha_{A,B} < 1$, weights $w_n(d) = d\,\Gamma(n-d) / (\Gamma(1-d)\,\Gamma(n+1))$, Gaussian white noise $\xi_{t,A}$ and $\xi_{t,B}$, and gamma function Γ. The coupling constant W ranges from 0.5 (maximum coupling between $x_t$ and $y_t$) to 1 (no coupling). It has been shown analytically that the cross-correlation exponent is given by $\delta = (\alpha_A + \alpha_B)/2$.
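
The process can be sketched as follows. This is an illustration under assumptions that the text does not specify: the sum over past values is truncated to a finite `memory`, values before the start of the series are taken as zero, and the noise has unit variance. The weights are computed with the recursion w_1(d) = d, w_{n+1}(d) = w_n(d)·(n − d)/(n + 1), which follows from the stated formula and Γ(z+1) = zΓ(z). The function names and the `rescale` helper are hypothetical.

```python
import random

def arfima_weights(d, n_terms):
    """Weights w_n(d) = d*Gamma(n-d) / (Gamma(1-d)*Gamma(n+1)) for n = 1..n_terms."""
    weights = [d]  # w_1(d) = d
    for n in range(1, n_terms):
        weights.append(weights[-1] * (n - d) / (n + 1))
    return weights

def two_component_arfima(n, alpha_a, alpha_b, coupling, memory=200, seed=0):
    """Generate the coupled series x_t and y_t (truncated-memory sketch)."""
    rng = random.Random(seed)
    wa = arfima_weights(alpha_a - 0.5, memory)
    wb = arfima_weights(alpha_b - 0.5, memory)
    x, y = [], []
    for t in range(n):
        # Weighted sums over the available history (empty at t = 0).
        big_x = sum(w * x[t - k] for k, w in enumerate(wa[:t], start=1))
        big_y = sum(w * y[t - k] for k, w in enumerate(wb[:t], start=1))
        x.append(coupling * big_x + (1 - coupling) * big_y + rng.gauss(0.0, 1.0))
        y.append((1 - coupling) * big_x + coupling * big_y + rng.gauss(0.0, 1.0))
    return x, y

def rescale(series, target_std):
    """Center a series and scale it to a target standard deviation."""
    mean = sum(series) / len(series)
    std = (sum((v - mean) ** 2 for v in series) / len(series)) ** 0.5
    return [(v - mean) * target_std / std for v in series]

x, y = two_component_arfima(1120, alpha_a=0.9, alpha_b=0.9, coupling=0.5)
x_ms, y_ms = rescale(x, 10.0), rescale(y, 10.0)  # deviations in milliseconds
print(len(x_ms), len(y_ms))
```

Whether a generated pair actually exhibits the expected exponent is best checked with a DCCA analysis of the output, since finite length and the truncation both affect the measured value.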


The standard deviation chosen for Xt and Yt was 10 ms. The time series of deviations Xt and Yt for musical coupling are shown in FIG. 2. The measured DCCA exponent reads δ=0.93 (in agreement with the analytical value 0.9 within margins of error) showing LRCC.


Introducing LRC in audio tracks is referred to as “humanizing”. For separately humanized sequences (i.e., without adding cross-correlations between the sequences), however, absence of LRCC is to be expected. Indeed, when humanizing the time series of interbeat intervals separately (e.g., with an exponent β=0.9), the detrended covariance of $X_t$ and $Y_t$ oscillates around zero, i.e., no LRCC are found.


All other characteristics, such as pitch, timbre and loudness remain unchanged.


In step 130, the combined audio tracks are stored in a non-volatile, computer-readable medium.



FIG. 2 shows an example of two coupled time series generated with the two-component ARFIMA process. The deviations from their respective positions (e.g., given by a metronome) are shown in the drum track (upper blue curve, offset by 50 ms for clarity) and bass track (lower black curve) to introduce musical coupling. When an instrument is silent on a beat, the corresponding deviation is skipped. The time series, each of length N=1120, were generated with a two-component ARFIMA process with Hurst exponents $\alpha_A = \alpha_B = 0.9$ and coupling constant W=0.5. The bottom of FIG. 2 shows an excerpt of the first four bars of the song Billie Jean by Michael Jackson. Because there is a drum sound on every beat, all 1120 deviations are added to the drum track, whereas in the first two bars the bass pauses.
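
The handling of silent beats can be illustrated with a short sketch. It assumes that each deviation is an offset in milliseconds from a nominal grid position and that one deviation is assigned to every grid beat, whether played or not; the function name and all values are hypothetical.

```python
def apply_deviations(grid_ms, played, deviations_ms):
    """Shift each played beat by its deviation; silent beats produce no onset."""
    shifted = []
    for position, is_played, deviation in zip(grid_ms, played, deviations_ms):
        if is_played:
            shifted.append(position + deviation)
        # A silent beat contributes no onset, so its deviation is skipped.
    return shifted

T = 256.0
grid = [i * T for i in range(6)]
bass_played = [False, False, True, True, False, True]
bass_dev = [4.0, -3.0, 2.5, -1.0, 0.5, 6.0]           # hypothetical deviations in ms
print(apply_deviations(grid, bass_played, bass_dev))  # [514.5, 767.0, 1286.0]
```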


Processes other than the ARFIMA process that generate LRCC can also be used to induce musical coupling. More particularly, when two subjects A and B are synchronizing a rhythm, each person attempts to (partly) compensate for the deviation $d_n = t_{A,n} - t_{B,n}$ perceived between the two n'th beats when generating the (n+1)'th beat. This is reflected by the following model, referred to as the Mutually Interacting Complex Systems (MICS) model:






$$I_{A,n} = \sigma_A C_{A,n} + T + \xi_{A,n} - \xi_{A,n-1} - W_A d_{n-1}$$

$$I_{B,n} = \sigma_B C_{B,n} + T + \xi_{B,n} - \xi_{B,n-1} + W_B d_{n-1} \qquad (1)$$


where $C_{A,n}$ and $C_{B,n}$ are Gaussian distributed $1/f^{\beta}$ noise time series with exponents $0 < \beta_{A,B} < 2$, $\xi_{A,n}$ and $\xi_{B,n}$ are Gaussian white noise and T is the mean beat interval. We set $d_0 = 0$. The model assumes that the generation of temporal intervals is composed of three parts: (i) an internal clock with $1/f^{\beta}$ noise errors, (ii) a motor program with white noise errors associated with moving a finger or limb, referred to in FIG. 7 as the motor error, and (iii) a coupling term between the subjects with coupling strengths $W_A$ and $W_B$.


The deviations dn which the musicians perceive and adapt to can be written as a sum over all previous interbeat intervals







$$d_n = t_{A,n} - t_{B,n} = \sum_{j=1}^{n} \left(I_{A,j} - I_{B,j}\right)$$







thus involving all previous elements of the time series of IBIs of both musicians. Therefore, this model reflects that scale-free coupling of the two subjects emerges mainly through the adaptation to deviations between their beats.


The coupling strengths $0 < W_{A,B} < 2$ describe the rate of compensation of a deviation in the generation of the next beat. In the limit $W_A = W_B = 0$ and $\beta_A = \beta_B = 1$, the MICS model reduces to the model introduced by Gilden et al., in the following called the Gilden model [Gilden D L, Thornton T, Mallon M W (1995), 1/f noise in human cognition, Science 267:1837-1839]. The MICS model diverges for $W_A + W_B \geq 2$, i.e., when subjects are over-compensating.
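
The recursion in (1) can be iterated directly, as in the following sketch. The clock and motor noise series are passed in by the caller rather than generated here, and the motor noise before the first beat is assumed to be zero; the function name `mics` and its argument names are hypothetical. Subtracting the two lines of (1) shows that, without noise, each deviation is the previous one multiplied by (1 − W_A − W_B), so a single perturbation decays when the magnitude of this factor is below one and no longer decays once W_A + W_B reaches 2.

```python
def mics(clock_a, clock_b, motor_a, motor_b, mean_interval,
         sigma_a=1.0, sigma_b=1.0, w_a=0.5, w_b=0.5):
    """Iterate the MICS recursion; returns intervals of A, of B, and deviations d_n."""
    ia, ib, d = [], [], []
    d_prev = 0.0             # d_0 = 0
    xa_prev = xb_prev = 0.0  # assumption: no motor noise before the first beat
    for ca, cb, xa, xb in zip(clock_a, clock_b, motor_a, motor_b):
        a = sigma_a * ca + mean_interval + xa - xa_prev - w_a * d_prev
        b = sigma_b * cb + mean_interval + xb - xb_prev + w_b * d_prev
        d_prev += a - b      # d_n = d_{n-1} + I_A,n - I_B,n
        xa_prev, xb_prev = xa, xb
        ia.append(a)
        ib.append(b)
        d.append(d_prev)
    return ia, ib, d

n = 6
kick = [1.0] + [0.0] * (n - 1)  # one clock error of subject A on the first beat
zero = [0.0] * n
_, _, d_stable = mics(kick, zero, zero, zero, 256.0, w_a=0.5, w_b=0.25)
_, _, d_unstable = mics(kick, zero, zero, zero, 256.0, w_a=1.5, w_b=1.0)
print(d_stable[:3])    # [1.0, 0.25, 0.0625]
print(d_unstable[:3])  # [1.0, -1.5, 2.25]
```

To simulate realistic play, `clock_a` and `clock_b` would be replaced by 1/f^β noise series and `motor_a`, `motor_b` by white noise.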


A possible extension of the MICS model is to consider variable coupling strengths $W = W(d_n)$. Since larger deviations are likely to be perceived more distinctly, one possible scenario is to introduce couplings W that increase with $d_n$. For example, W may increase when large deviations such as glitches are perceived.



FIG. 3 shows a diagram of an experimental setup for analyzing combinations of audio tracks played by a human subject.


The experimental setup comprises a keyboard 310 connected to speakers 320 and a recorder 330 for recording notes played by test subjects 1 and 2 on the keyboard 310. Preferably, the keyboard 310 has a midi interface and the recording device 330 records midi messages.


The performances were recorded at the Harvard University Studio for Electroacoustic Composition (see Supporting Information for details) on a Studiologic SL 880 keyboard, yielding 57 time series of Musical Instrument Digital Interface (MIDI) recordings. However, the results presented here apply not only to MIDI but also to acoustic recordings.


Each recording typically lasted 6-8 minutes and contained approx. 1000 beats per subject. The temporal occurrences $t_1, \ldots, t_n$ of the beats were extracted from the MIDI recordings and the interbeat intervals read $I_n = t_n - t_{n-1}$ with $t_0 = 0$. The subjects were asked to press a key with their index finger according to the following tasks. Task type (Ia): Two subjects played beats in synchrony with one finger each. Task type (Ib): ‘Sequential recordings’ were made, where subject B synchronized with previously recorded beats of subject A. Sequential recordings are widely used in professional studio recordings, where typically the drummer is recorded first, followed by layers of other instruments. Task type (II): One subject played beats in synchrony with one finger from each hand. Task type (III): One subject played beats with one finger (‘finger tapping’). Finger tapping of single subjects is well studied in the literature [Repp B H, Su Y H (2013), Sensorimotor synchronization: A review of recent research (2006-2012). Psychon B Rev 20:403-452] and serves as a baseline, whereas our focus is on synchronization between subjects. In addition to periodic tapping, a 4/4 rhythm {1, 2.5, 3, 4}, where the second beat is replaced by an offbeat, was used in tasks (I-III).



FIG. 4 shows a representative example of the findings from a recording of two professional musicians A and B playing periodic beats in synchrony (task type (Ia)). FIG. 4: (top) Two professional musicians A and B synchronizing their beats: comparison of experiments (a-c) with the MICS model (d-f). (a) The IBIs of 1134 beats of A (black curve) and B (blue curve, offset by 0.1 s for clarity) exhibit slowly varying trends and a tempo increase from 133 to 182 beats per minute. (b,e) The PSD of the time series $I_A$, $I_B$ shows LRC asymptotically for small f and anti-correlations for large f, separated by a vertex of the curve at $f \approx 0.1\, f_{Nyquist}$ [7]. (c) Evidence of LRCC between $I_A$ and $I_B$; the DCCA exponent is δ=0.69. (d-f) The MICS model for $\beta_A = \beta_B = 0.85$, N=1133 predicts δ=0.74, in excellent agreement with the experimental data. A global trend extracted from (a) was added to the curves in (d) for illustration.


A comparison of the MICS model (FIG. 4, right panel) with the experiments (left panel) shows excellent agreement. The vertex at the characteristic frequency fc in the PSD is reproduced by the MICS model (cf. FIG. 4 (b,e)).


The MICS model predicts emergence of LRCC (FIG. 5(a)). The MICS model also predicts that, asymptotically, the DFA scaling exponents $\alpha_{A,B}$ of the interbeat intervals are determined by the ‘clock’ with the strongest persistence: $\alpha_A = \alpha_B = [\max(\beta_A, \beta_B) + 1]/2$. This result is valid for long time series of length $N \geq 10^5$, see FIG. 5(b). Surprisingly, even when turning off, say, clock A (i.e., $\beta_A = 0$), the long-time behavior of both $I_A$ and $I_B$ is asymptotically given by the exponent of the long-range correlated clock B (and vice versa) for large N. Thus, the musician with the higher scaling exponent determines the partner's long-term memory in the IBIs. However, in experiments the exponents can differ significantly in shorter time series of length N≈1000, which can be seen by comparing the PSD exponents in FIGS. 4(e) and 5(b).
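
As a worked illustration of the asymptotic relation stated above, the following lines evaluate it for the parameter values used in FIG. 5 and for the case where clock A is turned off. The function name is hypothetical.

```python
def predicted_dfa_exponent(beta_a, beta_b):
    """Asymptotic DFA exponent of both subjects: (max(beta_A, beta_B) + 1) / 2."""
    return (max(beta_a, beta_b) + 1.0) / 2.0

both_clocks = predicted_dfa_exponent(0.85, 0.85)  # both clocks at beta = 0.85
clock_a_off = predicted_dfa_exponent(0.0, 0.85)   # clock A turned off
print(both_clocks, clock_a_off)                   # the two values coincide
```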



FIG. 5 shows: (a) Evidence of scale-free cross-correlations in the MICS model. (b) The PSD of $I_A$ (and $I_B$) shows two regions: LRC asymptotically for small f with exponent $\beta(I_A) = 0.86 \approx \max(\beta_A, \beta_B)$ and anti-correlations for large f. Other parameters (a-b): $N = 2^{17}$, $\beta_A = \beta_B = 0.85$, coupling $W_A = W_B = 0.5$, and $\sigma_A = \sigma_B = 6$.


Evidence for LRCC between IA and IB on time scales up to the total recording time is reported in FIG. 4(c) with DCCA exponent δ=0.69±0.05. The two subjects are rhythmically bound together on a time scale up to several minutes and the generation of the next beat of one subject depends on all previous beat intervals of both subjects in a scale-free manner. LRCC were found in all performances of both laypeople and professionals, when two subjects were synchronizing simple rhythms. Thus, rhythmic interaction can be seen as a scale-free process.


In contrast, when a single subject is synchronizing his left and right hands (task type (II)), no significant LRCC were observed, suggesting that the interaction of two complex systems is a necessary prerequisite for rhythmic binding.


The inventor identified two distinct regions in the PSD of the interbeat intervals separated by a vertex of the curve at a characteristic frequency $f_c \approx 0.1\, f_{Nyquist}$ (see FIG. 4(b)): (i) The small frequency region asymptotically exhibits long-range correlations. This region covers long periods of time up to the total recording time. (ii) The high frequency region exhibits short-range anti-correlations. This region translates to short time scales. These two regions were first described for finger tapping of single subjects without a metronome [Gilden D L, Thornton T, Mallon M W (1995), 1/f noise in human cognition, Science 267:1837-1839]. Because these two regions are observed in the entire data set (i.e., in all 57 recorded time series across all tasks), this suggests that these regions are persistent when musicians interact.



FIG. 4(e) shows that the MICS model reproduces both regions and $f_c$ for interacting complex systems. The two subjects potentially perceive the deviations $d_n = t_{A,n} - t_{B,n}$ between their beats. The DFA exponent α=0.72 for the time series $d_n$ indicates long-range correlations in the deviations (averaging over the entire data set one finds α=0.73±0.11).


In the present data set, exponents were found to be in a broad range 0.5<λ<1.5; hence the analysis suggests coupling audio tracks using LRCC with a power law exponent 0.5<λ<1.5. However, even larger exponents λ>1.5 are found when no global detrending of the interbeat intervals is used or when the nonstationarity of the time series is not easily removed by global detrending.


There is a fundamental difference between settings where individuals are provided with a metronome click (e.g., over headphones) while playing and where no metronome is present (also referred to as self-paced play) that manifests in the PSD of the interbeat intervals.



FIG. 6 is an illustration of the PSD of the interbeat intervals when humans are playing or synchronizing rhythms (a) without and (b) with a metronome. (a) Illustration of the case where rhythms are played in absence of a metronome: The PSD of the interbeat intervals exhibits long-range correlations (asymptotically for low frequencies with PSD exponent β=1.01) and anti-correlations for high frequencies. The characteristic frequency separating the two regions is observed at 0.1 fNyquist. The time series of interbeat intervals was calculated with the Gilden model for β=1.0 and relative strength of clock noise over motor noise σ=0.5, i.e. for rather dominant motor noise (which only manifests on short time scales, but does not affect the long-term behavior) [Gilden D L, Thornton T, Mallon M W (1995), 1/f noise in human cognition, Science 267:1837-1839]. (b) Illustration of the case where rhythms are played while synchronizing beats with a metronome: The PSD of the interbeat intervals exhibits long-range anti-correlations.


For self-paced play of musical rhythms, the PSD of the interbeat intervals exhibits two distinct regions [Hennig H, et al. (2011), The Nature and Perception of Fluctuations in Human Musical Rhythms, PLoS ONE 6:e26457]. Long-range correlations are found asymptotically for small frequencies in the PSD. This region relates to correlations over long time scales of up to several minutes (as long as the subject does not frequently lose rhythm). On the other hand, for high frequencies in the PSD anti-correlations are found.


In contrast, a different situation is observed in presence of a metronome: For play of both complex musical rhythms [Hennig H, Fleischmann R, Geisel T (2012), Musical rhythms: The science of being slightly off, Physics Today 65:64-65.] and finger tapping [Repp B H, Su Y H (2013), Sensorimotor synchronization: A review of recent research, (2006-2012). Psychon B Rev 20:403-452.], long-range correlations were found in the time series of deviations of the beats from the metronome clicks. Below, the difference between the deviations and the interbeat intervals in the PSD will be quantified. The deviations from the metronome clicks are defined as en=tn−Mn, where tn is the temporal occurrence (e.g., the onset) of the n'th beat, Mn=nT is the temporal occurrence of the n'th metronome click and T is the time period between two consecutive metronome clicks. The interbeat intervals read






$$I_n = t_n - t_{n-1} = e_n - e_{n-1} + T.$$


Hence, the interbeat intervals are the derivative of the deviations (except for a constant). In the following, a relation is derived between the PSD exponents of $e_n$ and $I_n$. Consider a time series $x_n$ whose PSD asymptotically decays as a power law $1/f^{\beta}$ with exponent β. Let the time series $\dot{x}_n = x_n - x_{n-1}$ denote the derivative of $x_n$. Then it can be shown analytically that the PSD of the derivative time series $\dot{x}_n$ asymptotically follows a power law with exponent β−2 [Beran, J, Statistics for long-memory processes, Chapman&Hall/CRC 1994]. Applying this general result to the present case, one finds





$$\beta(I_n) = \beta(e_n) - 2$$


As a consequence, when $e_n$ exhibits long-range correlations with exponent $0 < \beta(e_n) < 2$, the derivative $I_n$ exhibits long-range anti-correlations with $-2 < \beta(I_n) < 0$.
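
The shift of the exponent by 2 can be made plausible numerically. Taking differences x_n − x_{n−1} multiplies the PSD at frequency f (here in cycles per beat) by the gain 4 sin²(πf), which behaves like f² for small f, so a 1/f^β spectrum becomes a 1/f^(β−2) spectrum. The sketch below evaluates this gain and also computes interbeat intervals from a set of hypothetical deviations; the function names are illustrative.

```python
import math

def difference_gain(f):
    """Power gain 4*sin(pi*f)^2 of the differencing operation, f in cycles per beat."""
    return 4.0 * math.sin(math.pi * f) ** 2

def intervals_from_deviations(e, period):
    """Interbeat intervals I_n = e_n - e_{n-1} + T for consecutive deviations."""
    return [b - a + period for a, b in zip(e, e[1:])]

for f in (0.001, 0.002, 0.004):
    # Doubling f roughly quadruples the gain, i.e. the gain scales as f^2.
    print(f, difference_gain(f))

print(intervals_from_deviations([0.0, 5.0, -3.0, 1.0], 500.0))  # [505.0, 492.0, 504.0]
```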


When subjects are synchronizing beats with a metronome, the time series of deviations exhibits long-range correlations with PSD exponents reported in the range $\beta(e_n)$=[0.2; 1.3] [Hennig H, Fleischmann R, Geisel T (2012), Musical rhythms: The science of being slightly off, Physics Today 65:64-65]. Hence, one may expect the PSD exponents for the time series of interbeat intervals to lie in the range $\beta(I_n) = \beta(e_n) - 2$=[−1.8; −0.7]. Thus, the interbeat intervals are long-range anti-correlated for settings where a metronome is present. Humanizing a time series of deviations $e_n$ with an exponent 0<β<2 is thus equivalent to humanizing the interbeat intervals $I_n$ with −2<β<0. In contrast, for self-paced play as found by the inventor (i.e., in absence of a metronome), the interbeat intervals are long-range correlated on time scales of up to several minutes.



FIG. 7 shows a user interface 700 of a software-implemented human interaction device based on the MICS model. The human interaction device is a software module or plug-in that may be plugged into a digital audio workstation comprising a computer, a sound card or audio interface, an input device or digital audio editor. For example, a user-friendly device can be created for Ableton's audio software “Live” using the application programming interface “Max for Live”.


Different audio tracks are represented as channels 1 and 2. For each channel, the standard deviation of the timing error may be set. In addition, the spectral exponent (β) of the timing error may be set for each channel. Further, the motor error standard deviation may also be adjusted for each channel. Finally, the user may also set the coupling strength W for each channel. Given these data, the software device calculates an offset. More than two channels can be modified by having each additional channel respond to the average of all other channels' deviations.


Once the relevant parameters are set, the plug-in combines the audio tracks according to the previously described method.

Claims
  • 1. A method for combining a first and a second audio track, comprising the steps: modifying at least one of the two audio tracks; and storing the first and the second audio track in a non-volatile medium; characterized in that the interbeat intervals of the modified first and the second audio track exhibit long-range cross-correlations (LRCC).
  • 2. The method according to claim 1, wherein the detrended covariance of the interbeat intervals of the first and the second audio track exhibit a power law.
  • 3. The method according to claim 1, wherein small deviations are added to the interbeat intervals of at least one of the two audio tracks.
  • 4. The method according to claim 2, wherein small deviations are added to the interbeat intervals of at least one of the two audio tracks.
  • 5. The method according to claim 2, wherein the detrended cross-correlation exponent (δ) is chosen such that 0.5<δ<5.
  • 6. The method according to claim 5, where 0.5<δ<1.5.
  • 7. The method according to claim 1, wherein the first and the second audio track are recorded sequentially.
  • 8. The method according to claim 1, wherein one of the first and the second audio tracks is the recording of a software instrument.
  • 9. The method of claim 1, wherein at least one of the first and the second audio tracks is the recording of a human musician.
  • 10. The method of claim 1, wherein one of the audio tracks is a drum track.
  • 11. A method for humanizing an audio track, comprising the steps: modifying the audio track; and storing the audio track in a non-volatile medium; characterised in that the interbeat intervals of the modified audio track exhibit long-range correlations (LRC).
  • 12. The method according to claim 11, wherein the exponent (β) of the power spectral density
  • 13. The method of claim 12, wherein 0<β<2.
  • 14. The method according to claim 11, where anti-correlations with exponent −10<β<0 are found for high frequencies in the power spectral density
  • 15. A device for combining a first and a second audio track, comprising a modifying module for modifying at least one of the two audio tracks; and a storage module for storing the first and the second audio track in a non-volatile medium; characterized in that the interbeat intervals of the modified audio track exhibit long-range correlations (LRC) and/or the interbeat intervals of the modified first and the second audio track exhibit long-range cross-correlations (LRCC).