The present disclosure relates to data processing providing real-time musical synchrony between a human musician and pre-recorded music data providing an accompaniment to the human musician.
The goal is to grasp the musical intentions of the performer and map them onto those of the pre-recorded accompaniment so as to achieve an acceptable musical behavior.
Some known systems deal with the question of real-time musical synchrony between a musician and accompaniment.
Document D1: Christopher Raphael (2010): “Music Plus One and Machine Learning”, in Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21-28, relates to learning systems in which the intention of the musician is predicted from models trained on actual performances of the same performer. Besides the issue of data availability for training, the synchronization here depends on high-level musical parameters (such as musicological data) rather than on probabilistic parameters of an event. Moreover, statistical or probabilistic predictions fail to capture the extreme variability of performances between sessions (even for the same performer). Furthermore, this approach relies on synchronizing musician events with computer actions. Computer actions do not model high-level musical parameters and are thus impractical.
Document D2: Roger B. Dannenberg (1997): “Abstract time warping of compound events and signals”, in Computer Music Journal, 61-70, takes the basic assumption that the musician's tempo is continuous and kept between two events, resulting in a piecewise-linear prediction of the music position used for synchronization. In any real-world setup, tempo discontinuity is a fact that leads such approximations to fail. Moreover, this approach takes into account only the musician's Time-Map and neglects the pre-recorded accompaniment Time-Map (assuming it to be fixed), thus missing important high-level musical knowledge.
Document D3: Arshia Cont, José Echeveste, Jean-Louis Giavitto, and Florent Jacquemard (2012): “Correct Automatic Accompaniment Despite Machine Listening or Human Errors in Antescofo”, in Proceedings of the International Computer Music Conference (ICMC), Ljubljana (Slovenia), incorporates the notion of Anticipation with a cognitive model of the brain to estimate the musician's time-map. In order to incorporate high-level musical knowledge for accompaniment synchronization, they introduce two types of synchronization. Tight Synchronization is used to ensure that certain key positions are tightly synchronized.
While appropriate, their solution introduces discontinuities in the resulting Time-Map. Such discontinuities are to be avoided when synchronizing continuous audio or video streams.
Smooth Synchronization attempts to produce a resulting continuous Time-Map by assuming that the resulting accompaniment tempo is equal to that of the musician and predicting its position using that value.
Despite this appropriate tempo detection, real-time tempo estimation is prone to error and can lead to unpredictable discontinuities. Furthermore, the coexistence of the two strategies in the same session introduces further discontinuities in the resulting time-map.
Document D4: Dawen Liang, Guangyu Xia, and Roger B. Dannenberg (2011): “A framework for coordination and synchronization of media”, in Proceedings of the International Conference on New Interfaces for Musical Expression (p. 167-172), proposes a compromise between sporadic synchronization, such as Tight above, and tempo-only synchronization, such as Loose, in order to dynamically synchronize time-maps with the goal of converging to the reference accompaniment time-map. A constant window spanning a musical duration w into the future is used so as to force the accompaniment to compensate for deviations at time t such that it converges at t+w. This leads to continuous curves that are piecewise linear in the musical position output.
This strategy, however, has two drawbacks:
The present disclosure aims to improve the situation.
To that end, a method is proposed for synchronizing a pre-recorded music accompaniment to a music playing of a user,
said user's music playing being captured by at least one microphone delivering an input acoustic signal feeding a processing unit,
said processing unit comprising a memory for storing data of the pre-recorded music accompaniment and providing an output acoustic signal based on said pre-recorded music accompaniment data to feed at least one loudspeaker playing the music accompaniment for said user,
wherein said processing unit:
where x is a temporal variable, $tempo is the tempo determined in the user's music playing, and w is a duration of compensation of said lag diff.
A notion of “Time-Map” can therefore be used to model the musical intentions coming from a human musician as compared to pre-recorded accompaniments. A time-map is a function that maps physical time t to musical time p (in beats).
In a non-real-time (or offline) setup, and under the strong assumption that the tempo estimation from the device is correct, a time-map position p is obtained by integrating this tempo (in beats per unit of physical time) from time 0 to t.
However, when the musician does not follow the tempo set in the music score, the estimated tempo in the current playing of the accompaniment needs to be adapted in a near future defined by the compensation duration w, and the use of the synchronization function F ensures that convergence to the current user's tempo is reached after that compensation duration.
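As a purely illustrative sketch of the offline relation mentioned above (the sampling step, the function name and the numeric values are assumptions introduced for the example and are not part of the present disclosure), the musical position can be obtained by numerically integrating the estimated tempo over physical time:

```python
# Illustrative sketch of an offline time-map: the musical position p(t) is
# obtained by integrating the (assumed correct) tempo curve over physical
# time.  The sampling period and the tempo profile are hypothetical.

def time_map(tempo_curve, dt):
    """Return cumulative musical positions (in beats) for tempo samples given in beats/s."""
    positions, p = [], 0.0
    for tempo in tempo_curve:
        p += tempo * dt          # p(t) = integral of the tempo from 0 to t
        positions.append(p)
    return positions

# Example: 120 BPM (2 beats/s) held for 3 seconds, sampled every 10 ms -> 6 beats.
print(time_map([2.0] * 300, dt=0.01)[-1])
```

In the real-time setting addressed here, such a direct integration cannot be relied upon, which is precisely why the synchronization function F and the compensation duration w are introduced.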
In an embodiment, said music accompaniment data defines a music score, and variable x is a temporal value corresponding to a duration of a variable number of beats of said music score.
In an embodiment, said compensation duration w has a duration of at least one beat on a music score defined by said music accompaniment data.
In an embodiment, said compensation duration w is a chosen parameter.
Preferably, it can be set to one beat duration, but possibly more, according to a user's choice that can be entered, for example, through an input of said processing unit.
In an embodiment where the accompaniment data defines a music score, a position pos of the musician playing on said score is forecast by a linear relation defined as pos(x)=$tempo*x, where x is a number of music beats counted on said music score, and, if a lag diff is detected, the synchronization function F(x) is then used so as to define a number of beats xdiff corresponding to said lag diff such that:
F(xdiff)−pos(xdiff)=diff.
In this embodiment, a prediction is determined on the basis of said synchronization function F(x), until a next beat xdiff+w by applying a transformation function A(t), given by:
A(t)=F(t−t0+xdiff)+p
where p is a current position of the musician playing on the music score at current time t0.
In an embodiment where said accompaniment data defines a music score, the processing unit further estimates a future position of the musician playing on said music score at a future synchronization time tsync, and determines a tempo (reference e2 of
In this embodiment and when the transformation function A(t) is used, the tempo of the music accompaniment to apply to the output acoustic signal is determined as the derivative of A(t) at current time t0:
tempo=A′(t0)=F′(xdiff)
(which is known analytically).
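The following sketch illustrates, with hypothetical values, how the quantities introduced above can be computed numerically: xdiff is obtained by solving F(x)−$tempo*x=diff (here with a simple bisection), the transformation A(t)=F(t−t0+xdiff)+p is built, and the accompaniment tempo is recovered as A′(t0)=F′(xdiff). The function F is passed as a parameter; the placeholder curve used in the demonstration is not the quadratic synchronization function defined in the present disclosure, and all names and numbers are assumptions for illustration only.

```python
# Minimal numerical sketch of the quantities introduced above (not the
# disclosure's implementation): xdiff solves F(x) - $tempo*x = diff,
# A(t) = F(t - t0 + xdiff) + p is the transformation, and the accompaniment
# tempo is recovered as A'(t0) = F'(xdiff).  The placeholder curve
# F_placeholder and all numeric values are hypothetical.

from typing import Callable


def solve_xdiff(F: Callable[[float], float], tempo: float, diff: float,
                lo: float, hi: float, tol: float = 1e-9) -> float:
    """Bisection on g(x) = F(x) - tempo*x - diff over [lo, hi]."""
    def g(x: float) -> float:
        return F(x) - tempo * x - diff
    g_lo = g(lo)
    if g_lo * g(hi) > 0:
        raise ValueError("no sign change on [lo, hi]: adjust the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g_lo * g(mid) <= 0:
            hi = mid
        else:
            lo, g_lo = mid, g(mid)
    return 0.5 * (lo + hi)


def make_A(F: Callable[[float], float], xdiff: float, t0: float, p: float) -> Callable[[float], float]:
    """Transformation A(t) = F(t - t0 + xdiff) + p."""
    return lambda t: F(t - t0 + xdiff) + p


def derivative(f: Callable[[float], float], x: float, h: float = 1e-6) -> float:
    """Central finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    tempo, diff, w = 2.0, 0.5, 1.0   # beats/s, beats, beats (hypothetical values)
    t0, p = 10.0, 16.0               # current time (s) and musician position (beats), hypothetical

    # Placeholder curve, NOT the quadratic synchronization function of the disclosure.
    def F_placeholder(x: float) -> float:
        return tempo * x + 2.0 * diff * (x / w)

    xdiff = solve_xdiff(F_placeholder, tempo, diff, lo=1e-6, hi=w)
    A = make_A(F_placeholder, xdiff, t0, p)
    new_tempo = derivative(A, t0)    # equals F'(xdiff)
    print(f"xdiff = {xdiff:.4f} beats, accompaniment tempo = {new_tempo:.4f} beats/s")
```

The finite difference is only used to keep the sketch generic; as noted above, F′(xdiff) is known analytically.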
In an embodiment, the determination of said musical events in said input acoustic signal comprises:
In fact, the assignment of musical events can be done on the music score, for example on the solo part, and thus be determined by it rather than by the “accompaniment” itself. These can be in a symbolic music notation format, such as MIDI typically. Therefore, the wording “stored data of an accompaniment music score” is to be interpreted broadly and may encompass the situation where such data further comprise a music score of a solo track which is not the accompaniment itself.
More generally, an association of the music score events is performed in the pre-recorded accompaniment (time-map).
The present disclosure aims also at a device for synchronizing a pre-recorded music accompaniment to a music playing of a user, comprising a processing unit to perform the method presented above.
It aims also at a computer program comprising instructions which, when the program is executed by a processing unit, cause the processing unit to carry out the method presented above.
It aims also at a computer-readable medium comprising instructions which, when executed by a processing unit, cause the processing unit to carry out the method.
Therefore, to achieve real-time synchrony between musician and pre-recorded accompaniment, the present disclosure addresses specifically the following drawbacks in the state-of-the-art:
More details and advantages of embodiments are given in the detailed specification hereafter and appear in the annexed drawings where:
The present disclosure proposes to solve the problem of synchronizing a pre-recorded accompaniment to a musician in real-time. To this aim, a device DIS (as shown in the example of
The device DIS comprises, in an embodiment, at least:
The memory MEM can store, inter alia, instructions data of a computer program according to the present disclosure.
Furthermore, music accompaniment data are stored in the processing unit (for example in the memory MEM). Music accompaniment data are therefore read by the processor PROC so as to drive the output interface OUT to feed at least one loudspeaker SPK (a baffle or an earphone) with an output acoustic signal based on the pre-recorded music accompaniment data.
The device DIS further comprises a Machine Listening Module MLM which can include independent hardware (as shown with dashed lines in
A user US can hear the accompaniment music played by the loudspeaker SPK and can play a music instrument along with the accompaniment music, thus emitting a sound captured by a microphone MIC connected to the input interface INP. The microphone MIC can be incorporated in the user's instrument (such as in an electric guitar) or separate from it (for recording voice or acoustic instruments). The captured sound data are then processed by the machine listening module MLM and, more generally, by the processing unit PU.
More particularly, the captured sound data are processed so as to identify a delay or an advance of the music played by the user compared to the accompaniment music, and to then adapt the playing speed of the accompaniment music to the user's playing. For example, the tempo of the accompaniment music can be adapted accordingly. The time difference detected by the module MLM between the accompaniment music and the music played by the user is hereafter called the “lag” at current time t and noted diff.
More particularly, musician events can be detected in real-time by the machine listening module MLM, which then outputs tuples of musical events and tempo data pertaining to the real-time detection of such events from a music score. This embodiment can be similar, for example, to the one disclosed in Cont (2010). In the embodiment where the machine listening module MLM has hardware separate from the processing unit PU, the module MLM is exchangeable and can thus be any module that provides “events” and, optionally, the tempo, in real-time, on a given music score, by listening to a musician playing.
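As an illustration only, the output of such a listening module can be represented as follows (the class and field names are hypothetical and do not reflect any actual interface of the module MLM):

```python
# Hypothetical representation of the machine-listening output: at each
# detection, a pair of (event position on the score, estimated tempo).
# The class and field names are placeholders, not an actual MLM interface.

from typing import NamedTuple

class ListeningOutput(NamedTuple):
    position_beats: float   # detected event position on the music score, in beats
    tempo: float            # musician tempo estimated in real-time

example = ListeningOutput(position_beats=16.0, tempo=2.0)  # 2 beats/s = 120 BPM
```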
As indicated above, the machine listening module MLM operates preferably “in real-time”, ideally with a lag of less than 15 milliseconds, which corresponds to a perceptual threshold (ability to react to an event) in most of the current usual listening algorithms.
Thanks to the pre-recorded accompaniment music data on the one hand, and to tempo recognition in the musician's playing on the other hand, the processing unit PU performs a dynamic synchronization. At each real-time instance t, it (the PU) takes as input its own previous predictions at a previous time t−ε, and the incoming event and tempo from machine listening. The resulting output is an accompaniment time-map that contains predictions at time t.
The synchronization is dynamic and adaptive thanks to prediction outputs at time t, based on a dynamically computed lag-dependent window (hereafter noted w). A dynamic synchronization strategy is introduced, whose value is mathematically guaranteed to converge at a later time t_sync. The synchronization anticipation horizon t_sync itself depends on the lag computed at time t with regard to the previous instance and on feedback from the environment.
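A possible skeleton of this dynamic loop is sketched below; the class, the method names and the stored state are placeholders chosen for illustration and are not part of the present disclosure:

```python
# Hypothetical skeleton of the dynamic synchronization loop: at each detected
# event, the lag with respect to the accompaniment is measured and a new
# prediction map is produced from the previous one.  All names are placeholders.

class DynamicSynchronizer:
    def __init__(self, accompaniment_position_at, window_beats=1.0):
        # accompaniment_position_at: callable t -> current accompaniment
        # position in beats (stands in for the pre-recorded time-map).
        self.accompaniment_position_at = accompaniment_position_at
        self.window_beats = window_beats      # stiffness parameter w, in beats
        self.prediction_map = None            # predictions computed at t - epsilon

    def on_event(self, t, event_position_beats, musician_tempo):
        # Lag diff between the detected musician position and the accompaniment.
        diff = event_position_beats - self.accompaniment_position_at(t)
        # Build the new prediction map (F, xdiff and A(t) are sketched above).
        self.prediction_map = self.compute_prediction_map(
            t, event_position_beats, musician_tempo, diff, self.window_beats)
        return self.prediction_map

    def compute_prediction_map(self, t, position, tempo, diff, w):
        # Placeholder: the synchronization strategy itself is described in the text.
        raise NotImplementedError
```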
The results of the adaptive synchronization strategy are to be consistent (same setup leads to same synchronization prediction). The adaptive synchronization strategy should also adapt to an interactive context.
The device DIS takes as live input the musician's events and tempo, and outputs predictions for a pre-recorded accompaniment, having both the pre-recorded accompaniment and the music score at its disposal prior to launch. The role of the device DIS is to employ the musician's Time-Map (as a result of live input) and construct a corresponding Synchronization Time-Map dynamically.
Instead of relying on a constant window length (as in the state of the art), the parameter w is interpreted here as a stiffness parameter. Typically, w can correspond to a fixed number of beats of the score (for example one beat, corresponding to a quarter note of a 4/4 measure). Its current time value tv can be given at the real tempo of the accompaniment (tv=w*real tempo), which however does not necessarily correspond to the current musician tempo. The prediction window length w is determined dynamically (as detailed below with reference to
In an embodiment, a synchronization function F is introduced, whose role is to help construct the synchronization time-map and to compensate for the lag diff in an ideal setup where the tempo is supposed to be, over a short time-frame, a constant value. Given the musician's position p (on a music score) and the musician's tempo, noted hereafter “$tempo”, at time t, F is a quadratic function that joins Time-Map points (0, 1) to (w, w*$tempo) and whose derivative is equal to parameter $tempo. The lag at time t between the musician's real-time musical position on the music score and that of the accompaniment track on the same score (both in beats) is denoted diff. Therefore, parameter diff reflects exactly the difference between the position on the music score (in beats) of the detected musician's event in real-time and the position on the music score (in beats) of the accompaniment music that is to be synchronized.
It is shown here that the synchronization function F can be expressed as follows:
and, if diff=0, F(x) simply becomes F(x)=$tempo*x, where $tempo is the real tempo value provided by the module MLM and w is a prediction window corresponding to the time taken to compensate for the lag diff until a next adjustment of the music accompaniment to the musician's playing.
It is furthermore shown that, for any event detected at time t, with an accompaniment lag of diff beats ahead, there is a single solution xdiff of the equation F(x)−$tempo*x=diff. This unique solution defines the adaptive context on which predictions are computed and re-defines the portion of the accompaniment map from xdiff as:
A(t)=F(t−t0+xdiff)+p
A detailed explanation of the adaptation function A(t) is given hereafter.
By construction, the synchronizing accompaniment Time-Map converges in position and tempo at time t_sync=t+w−xdiff to the musician Time-Map. This mathematical construction ensures continuity of tempo until a synchronization time t_sync.
A simple explanation of
A(t)=F(t−t0+xdiff)+p.
where p is the current position of the musician playing on the score at current time t0. Then A(t) can be computed to give the right position that the musician playing should have at a future time tsync. Until this synchronization time tsync at least, the tempo of the accompaniment is adapted. It corresponds to a new slope e2 (oblique dashed line of
ctempo=A′(t0)=F′(xdiff)
which is known analytically.
Referring now to
On the basis of that time lag and a chosen duration w (typically a duration of a chosen number of beats in the music score), the synchronization function F(x) can be determined in step S4. Then, in step S5, xdiff can be obtained as the sole solution of F(xdiff)−$tempo*xdiff=diff.
The determination of xdiff then makes it possible to use the transformation function A(t), which is determined in step S6, so as to shift from the synchronization map to the musician time-map, as explained above while referring to
Qualitatively, this embodiment contributes to reaching the following advantages:
Moreover, high-level musical knowledge can be integrated into the synchronization mechanism in the form of Time-Maps. To this end, predictions are extended to non-linear curves on Time-Maps. This extension allows formalisms for integrating musical expressivity such as accelerandi and fermata (i.e. with an adaptive tempo) and other common expressive musical specifications of a performer's timing. This addition also enables the possibility of automatically learning such parameters from existing data.
Additional latencies are usually imposed by hardware implementations and network communications. Compensating for this latency in an interactive setup cannot be reduced to a simple translation of the reading head (as seen in over-the-air audio/video streaming synchronization). The value of such latency can vary from 100 milliseconds to 1 second, which is far beyond the acceptable psychoacoustic limits of the human ear. The synchronization strategy optionally takes this value as input and anticipates all output predictions based on the interactive context. As a result, and for relatively small values of latency (in the mid-range of 300 ms, corresponding to most Bluetooth and AirMedia streaming formats), it is not necessary for the user to adjust the lag prior to performance. The general approach, expressed here in “musical time” as opposed to “physical time”, allows automatic adjustment of such a parameter.
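As a minimal sketch of this idea (the function name, the conversion and the numeric values are assumptions for the example, not the disclosure's exact mechanism), a known output latency can be converted into musical time at the current tempo and added to the predicted positions:

```python
# Hypothetical illustration of anticipating output predictions by a known
# hardware/network latency, expressed in musical time rather than physical
# time.  The conversion and the 300 ms figure are assumptions for the example.

def anticipate_position(predicted_position_beats, tempo_beats_per_s, latency_s):
    """Shift a predicted musical position forward by the latency, in beats."""
    return predicted_position_beats + tempo_beats_per_s * latency_s

# Example: 2 beats/s (120 BPM) and a 300 ms streaming latency -> 0.6 beat ahead.
print(anticipate_position(16.0, 2.0, 0.3))   # 16.6
```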
More generally, this disclosure is not limited to the detailed features presented above as examples of embodiments; it encompasses further embodiments.
Typically, the wordings related to “playing the accompaniment” on a “loudspeaker” and the notion of “pre-recorded music accompaniment” are to be interpreted broadly. In fact, the method applies to any “continuous” media, including for example audio and video. Indeed, video+audio content can be synchronized as well using the same method as presented above. Typically, the aforesaid “loudspeakers” can be replaced by an Audio-Video projection and video frames can thus be interpolated as presented above simply based on the position output of prediction for synchronization.