The present invention relates to processing audio signals, in particular to automatically combining two successive audio signals or streams via a transitional audio signal or stream.
In a traditional music-based radio station, or when a DJ comperes a set at a club or event, messages of various types are usually interspersed between tracks. The messages may, for example, include: identification (e.g. artist and track names) or comment on the previous or next track; station identifiers or jingles; news; weather forecasts; advertisements; or just general chat. Such messages increase listeners' engagement with the radio station or DJ and provide useful information.
More recently, music streaming services offer large numbers of algorithmically generated “stations” or playlists of tracks selected according to some criteria, such as era, genre or artist. Listeners can readily select a station that suits their taste and/or mood from the wide variety available. However, such algorithmic stations and playlists do not include messages between tracks but rather play one track to the end and immediately start the next. Algorithmic stations and playlists can therefore lack the engagement of a human-curated radio station.
U.S. Pat. No. 6,192,340B1 discloses a method in which informational items obtained from an information provider are interleaved into a sequence of musical items. The informational items, e.g. stock quotes, are received as text and converted to audio by a voice synthesizer. Parameters of the audio informational items, such as the voice to be used for the synthesis, speed and volume, are set by user preference. Although the method of U.S. Pat. No. 6,192,340B1 has great flexibility to cater to a user's preferences for music and information sources, the resulting output can be artificial and disjointed.
It is an aim of the invention to provide an improved method of automatically combining audio signals and informational messages in a way that is more appealing to a listener, in particular by improving the transitions between musical items and informational items.
According to an embodiment of the invention, there is provided a method for automatically generating an audio signal, the method comprising: receiving a source audio signal; analyzing the source audio signal to identify a musical characteristic thereof; obtaining a supplemental audio signal based on the identified musical characteristic; and combining the source audio signal and the supplemental audio signal to form an extended audio signal.
Therefore, embodiments of the invention can provide an audio processing system for a computer based audio streaming service that automatically generates a transitional audio signal based on factors such as the general context of the listener as well as the musical mood, musical intensity, musical genre, musical key, musical melody, musical tempo, musical metadata and/or sentiment of the lyrics of an associated audio signal. Matching can be based on either or both of the preceding and succeeding audio signals.
The present invention will be described further below with reference to exemplary embodiments and the accompanying drawings, in which:
In the various figures, like parts are identified by like references.
The basic function of an embodiment of the invention is to automatically generate an extended audio signal by combining a source audio signal with a supplemental audio signal, for example to provide a customized transition from one source audio signal to another. This is illustrated in
The transitional audio signal 2 may contain one or more of: music; a jingle; a personalized message; a public service announcement; a news report; a weather report; a station ident; information about the preceding/succeeding audio signal (such as track or artist name); or a notification generated by the operating system or an app of a device which is playing the combined audio signal. It is not essential that the transitional audio signal 2 includes any vocal element.
In an embodiment of the invention, the transitional audio signal 2 is generated based on high-level and low-level audio features extracted from either or both of the preceding and succeeding audio signals and optionally the context of the listener. The context of the listener can include factors such as: the user's location; the user's current activity; the current weather; the user's current emotional state; and/or an entry in an electronic calendar. Contextual information can be acquired from the computer device that the user may be operating. The generated transitional audio signal can be prepared in advance or generated on the fly, allowing time for audio feature extraction, audio analysis, server computation etc.
The purpose of the transitional audio signal is to allow a smooth and seamless transition from one audio signal into another, where the preceding and succeeding audio signals can simply fade in or fade out from the transitional audio signal. Desirably, the content of the transitional audio signal is generated so as to be as non-invasive as possible, but it is also possible to provide a transitional audio signal that contrasts with the preceding and succeeding signals. In an embodiment, the transitional audio signal contains a musical element which matches a musical characteristic—such as at least one of: mood, intensity, genre, key, melody, tempo, metadata and/or sentiment of the lyrics—of the preceding audio signal and/or the succeeding audio signal. How this is achieved is described further below.
In an embodiment, the transitional audio signal contains a vocal element, e.g. a spoken voice or sung vocal, with the intention of providing a specific message which also matches at least one of the musical mood, musical intensity, musical genre, musical key, musical melody, musical tempo, musical metadata and/or sentiment of the lyrics of the preceding audio signal and/or the succeeding audio signal. If the transitional audio signal is to contain a vocal element such as a sung vocal or spoken voice, then this will determine the length of the transitional audio section. The transitional audio signal is desirably longer than the vocal element by a predetermined time or proportion. The generation of the vocal element is described further below.
It is to be noted that a match of a musical characteristic does not have to be exact and in particular if the preceding and succeeding audio signals differ in a musical characteristic, the transitional audio signal can have a musical characteristic that is between the musical characteristic of the preceding and succeeding audio signals so as to smooth the transition.
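By way of illustration only, an intermediate value of a numeric musical characteristic such as tempo could be obtained by linear interpolation between the values of the preceding and succeeding audio signals. The function name `blend_tempo` and the `position` parameter are assumptions introduced for this sketch and are not taken from the described embodiments.

```python
def blend_tempo(preceding_bpm: float, succeeding_bpm: float,
                position: float = 0.5) -> float:
    """Return a tempo lying between the two source tempi.

    position = 0.0 gives the preceding tempo, 1.0 the succeeding tempo,
    and 0.5 the midpoint. Linear interpolation is an assumption of this
    sketch; other smoothing rules could equally be used.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie between 0.0 and 1.0")
    return preceding_bpm + position * (succeeding_bpm - preceding_bpm)


if __name__ == "__main__":
    # A 100 bpm track followed by a 140 bpm track.
    print(blend_tempo(100.0, 140.0))
```

The same approach could be applied to any characteristic that is expressed as a number, for example an intensity score.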
Various different procedures can be used to generate a musical element for the transitional audio signal. In a first procedure, the preceding audio signal and/or the succeeding audio signal are analyzed to identify at least one musical characteristic, e.g. the musical mood, musical intensity, musical genre, musical key, musical melody, musical tempo, musical metadata and/or sentiment of the lyrics thereof. In an embodiment, analysis of the audio signal does not require reference to any metadata. The identified characteristics are used to select a musical element from a database of pre-recorded music. The selection can also be based on the context of the listener at the relevant time.
In a second procedure to generate a musical element for the transitional audio signal, a suitable musical section from either the preceding audio signal or the succeeding audio signal is extracted. A procedure for selection of a suitable section of an audio signal is described below. The extracted musical section is looped until the next audio signal is meant to start.
In the third procedure to generate a musical element for the transitional audio signal, first either the preceding audio signal or the succeeding audio signal is analyzed to identify at least one musical characteristic, e.g. musical mood, musical intensity, musical genre, musical key, musical melody, musical tempo, musical metadata and/or sentiment of the lyrics. The identified musical characteristic(s) are then used to generate music using samplers and/or synthesizers to match either the preceding audio signal or the succeeding audio signal.
The procedure used to generate the transitional audio signal can be predetermined, selected by the user of the apparatus or chosen automatically. If the selection of the procedure for generation of the transitional audio signal is automated, this can be done by a process of elimination, as shown in
The first step is to check S21 whether there is a relevant musical transitional audio signal stored in the database. If there is not, then the second procedure is attempted. If the second procedure is unable to find a suitable section of audio to loop, then the third procedure is attempted. If the third procedure fails, then the preceding audio signal is simply crossfaded into the succeeding audio signal. Other orders in which to attempt the procedures can be used and may be subject to user preferences.
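The process of elimination described above can be sketched as follows. This is a minimal illustration under stated assumptions: audio is represented as a plain list of float samples, each procedure is a callable returning either audio or `None` on failure, and the names `choose_transition` and `linear_crossfade` are hypothetical. The linear gain ramp is one possible crossfade shape, not one specified by the embodiments.

```python
from typing import Callable, List, Optional, Sequence

Audio = List[float]  # assumption: mono audio as a list of samples


def linear_crossfade(a: Audio, b: Audio, overlap: int) -> Audio:
    """Overlap the tail of a with the head of b using linear gain ramps."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 1 and the shorter length")
    mixed = []
    for i in range(overlap):
        gain_in = (i + 1) / (overlap + 1)  # rises towards 1 across the overlap
        mixed.append(a[len(a) - overlap + i] * (1.0 - gain_in) + b[i] * gain_in)
    return a[:-overlap] + mixed + b[overlap:]


def choose_transition(procedures: Sequence[Callable[[], Optional[Audio]]],
                      fallback: Callable[[], Audio]) -> Audio:
    """Try each procedure in order; use the fallback if all of them fail."""
    for procedure in procedures:
        result = procedure()
        if result is not None:
            return result
    return fallback()


if __name__ == "__main__":
    preceding, succeeding = [1.0] * 4, [0.0] * 4
    output = choose_transition(
        [lambda: None, lambda: None, lambda: None],  # all three procedures fail
        lambda: linear_crossfade(preceding, succeeding, overlap=2),
    )
    print(len(output))
```

Because the procedures are passed as an ordered sequence, a different order of attempts, for example one set by user preference, only requires reordering the list.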
In an embodiment of the invention, to extract one or more musical characteristics, such as musical mood, musical intensity, musical genre, musical key, musical melody and/or musical tempo, low and high level audio features are extracted from an audio signal. This is illustrated in
These common audio features can also be used in combination to describe the genre and mood of a piece of music, where the features can be used to discriminate between pieces of music based on instrumentation, rhythmic patterns and pitch distributions [Ref. 2].
Furthermore, these audio features can easily be extracted from audio signals using open source feature extraction libraries, such as Essentia, MIR Toolbox or LibXtract [Ref. 3]. To determine how close a match two audio signals are, simple calculations such as the Euclidean distance or the cosine distance between the audio feature vectors that represent each audio signal can be used. In an embodiment, any lyrics an audio signal may contain are also analyzed by performing sentiment analysis, which helps in determining the mood of a piece of music. Analysis can be based on lyrics as recorded in a database or from speech recognition as described in [Ref. 13]. Sentiment analysis can be based on Arousal and Valence features which are obtained from a weighted sum of Arousal and Valence values of individual words in the lyrics. Arousal and Valence values for words are obtained from available dictionaries. More details can be found in [Ref. 4].
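The distance calculations and the lyric sentiment features mentioned above can be illustrated with the following sketch. It assumes that feature vectors are equal-length lists of numbers and that the word dictionary maps each word to an (arousal, valence) pair. All function names, the example lexicon values and the normalisation of the weighted sum by the total weight are assumptions made for illustration only.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def closest_match(target: Sequence[float],
                  candidates: Dict[str, Sequence[float]],
                  distance=euclidean_distance) -> str:
    """Return the name of the candidate whose features are nearest the target."""
    return min(candidates, key=lambda name: distance(target, candidates[name]))


def lyric_sentiment(words: Sequence[str],
                    lexicon: Dict[str, Tuple[float, float]],
                    weights: Optional[Dict[str, float]] = None
                    ) -> Optional[Tuple[float, float]]:
    """Weighted (arousal, valence) over the words found in the lexicon.

    Words missing from the lexicon are ignored. The sum is divided by the
    total weight, which is an assumption of this sketch.
    """
    arousal = valence = total = 0.0
    for word in words:
        key = word.lower()
        if key not in lexicon:
            continue
        w = weights.get(key, 1.0) if weights else 1.0
        arousal += w * lexicon[key][0]
        valence += w * lexicon[key][1]
        total += w
    return None if total == 0.0 else (arousal / total, valence / total)


if __name__ == "__main__":
    library = {"calm_pad": [0.2, 0.1], "dance_loop": [0.9, 0.8]}
    print(closest_match([0.8, 0.7], library))
    print(lyric_sentiment(["Happy", "storm"],
                          {"happy": (0.6, 0.9), "storm": (0.8, 0.2)}))
```

In practice the feature vectors would come from an extraction library and may need scaling so that no single feature dominates the Euclidean distance.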
Thus, the overall method of an embodiment of the invention is illustrated in
In a further embodiment, the transitional audio signal matching technique is extended. This is illustrated in
In another embodiment of the invention, illustrated in
If a vocal element is to be used in the transitional audio signal, then the segment that best fits the time length of the message is selected. Alternatively, the segment of audio that is the quietest overall can be selected. The volume of a segment can be measured using RMS or a weighted mean-square measure as described in [Ref. 8]. If there is to be no vocal element, then the last identified segment of the preceding audio signal or the first identified segment of the succeeding audio signal is to be used. The transitional audio signal is then constructed S62 by combining a vocal element 2a with a musical element 2b obtained by repeating the extracted section 1d a suitable number of times to match the length of the vocal element 2a.
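The segment selection and looping described above can be sketched as follows. The sketch uses a plain RMS measure rather than the weighted measure of [Ref. 8], represents each segment as a list of samples, and rounds the repeat count up so that the looped section is at least as long as the vocal element. The function names and the optional `margin_seconds` parameter are assumptions introduced for illustration.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    if not samples:
        raise ValueError("cannot measure an empty segment")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def quietest_segment(segments: List[Sequence[float]]) -> int:
    """Index of the segment with the lowest RMS level."""
    return min(range(len(segments)), key=lambda i: rms(segments[i]))


def best_fit_segment(durations: Sequence[float], message_seconds: float) -> int:
    """Index of the segment whose duration is closest to the message length."""
    return min(range(len(durations)),
               key=lambda i: abs(durations[i] - message_seconds))


def loop_count(section_seconds: float, vocal_seconds: float,
               margin_seconds: float = 0.0) -> int:
    """Number of repeats needed to cover the vocal element plus a margin."""
    if section_seconds <= 0:
        raise ValueError("section length must be positive")
    return max(1, math.ceil((vocal_seconds + margin_seconds) / section_seconds))


if __name__ == "__main__":
    segments = [[0.5, -0.5], [0.1, -0.1], [0.9, 0.9]]
    print(quietest_segment(segments))          # index of the quietest segment
    print(best_fit_segment([8.0, 12.0, 20.0], 11.0))
    print(loop_count(4.0, 10.0))               # 4 s section under a 10 s vocal
```

Rounding up means the final repeat may overrun the vocal element slightly, which leaves room for the succeeding audio signal to fade in.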
An embodiment of the invention in which the music for the transitional audio signal is generated is shown in
The MIDI notes for the melody, chords and the beats, along with information such as musical genre, musical key and any metadata related to the preceding or succeeding audio signal, are used by a music generation engine to create S76 the music for the transitional audio signal.
The music generation engine that is used to generate transitional audio signals takes a number of inputs, for example musical key, musical melody, beat structure and musical genre. It also takes as an input the desired level of musical complexity, which determines how similar the generated music is to either the preceding or succeeding audio signal. The level of complexity may be obtained S75 from a user preference or may be predetermined. In an embodiment, levels of complexity from 1 to 10 are used as described below. More, fewer and/or different approaches can also be employed.
Level 1: the key, chord and tempo information are used to play just the root chord of the preceding or succeeding audio signal using a sampled instrument, e.g. a piano. The beat structure and tempo of either the preceding or succeeding audio signal are then used to generate a similar beat using a sampler or synthesizer.
Level 2: Similar to level 1, but the sampled instrument, e.g. piano, is replaced with an instrument that is similar to the chord playing instrument in either the preceding or succeeding audio signal. The beat may remain the same as level 1, but the structure of how the root chord is being played is slightly varied.
Level 3: Similar to level 2, but now a synthesized or sampled bass instrument is added based on the transcribed melody.
Level 4: Similar to level 3, but the chord progression with respect to the key of the song is randomized, without imitating the chord progression in either the preceding or succeeding audio signal. A gap may now be added to the beat in order to indicate a section change (fill).
Level 5: Similar to level 4, but the beat is shuffled or a clap added on every second beat to give it some variation.
Level 6: Similar to level 5, but another instrument that has a similar timbre to some of the instrumentation in either the preceding or succeeding audio signal is added. The melody of the new instrument is similar to the melody of the main instrument in the preceding or succeeding audio signal.
Level 7: Similar to level 6, but the automatically generated chord progression is changed to be more similar to the chord progression in either the preceding or succeeding audio signal.
Level 8: Similar to level 7, but now the chord progression mimics exactly the chord progression in either the preceding or succeeding audio signal and/or the drum fill mimics that of either the preceding or succeeding audio signal.
Level 9: Similar to level 8, but the beat and instrumentation are both identical to those of either the preceding or succeeding audio signal.
Level 10: at this level there is maximum complexity. The instrumentation, melody, chord progression and beat structure mimic the preceding or succeeding audio signal as closely as possible.
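As a purely illustrative sketch of the simplest case, Level 1, the following shows how a root chord and a plain beat grid might be derived from a detected key and tempo. It assumes the key is given as a pitch-class name plus a mode, that the chord is a simple triad, and that MIDI note 60 corresponds to middle C. The function names and these conventions are assumptions and are not prescribed by the embodiments.

```python
from typing import List

# Semitone offsets of each pitch class above C.
PITCH_CLASSES = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                 "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}


def root_chord(key: str, mode: str = "major", base_note: int = 60) -> List[int]:
    """MIDI notes of the root triad of the given key.

    base_note is the MIDI number used for C (60 is middle C).
    """
    if key not in PITCH_CLASSES:
        raise ValueError(f"unknown key: {key}")
    if mode not in ("major", "minor"):
        raise ValueError("mode must be 'major' or 'minor'")
    root = base_note + PITCH_CLASSES[key]
    third = 4 if mode == "major" else 3  # major or minor third above the root
    return [root, root + third, root + 7]


def beat_times(tempo_bpm: float, beats: int) -> List[float]:
    """Onset times in seconds for a run of evenly spaced beats."""
    if tempo_bpm <= 0:
        raise ValueError("tempo must be positive")
    seconds_per_beat = 60.0 / tempo_bpm
    return [i * seconds_per_beat for i in range(beats)]


if __name__ == "__main__":
    print(root_chord("C", "major"))   # C, E and G above middle C
    print(root_chord("A", "minor"))
    print(beat_times(120.0, 4))
```

The resulting notes and onset times could then be passed to a sampler or synthesizer. Higher levels would add instruments, bass lines and progressions on top of this starting point.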
A further embodiment of the invention is configured to insert a transitional audio signal into an audio signal, e.g. one that is of a considerable length such as a DJ mix 10, as shown in
In the second instance, where the vocal is to be synthesized, a message such as a news report or information about the background music will be fed to a text-to-speech (TTS) algorithm in order to vocalize the message S96. Various TTS algorithms are known and are available as on-line services. An approach that is particularly suitable is a network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms as described in [Ref. 11].
The synthesized vocal in the transitional audio signal may also be configured to imitate the vocalist in either the preceding audio signal or the succeeding audio signal by using a model that is based on features produced by a parametric vocoder that separates the influence of pitch and timbre as described in [Ref. 12]. Alternatively, the style and tone of voice can be configured by the user of the apparatus or else determined using a style library, where the style library configures the voice based on such inputs as musical genre, etc. The speed of delivery of the synthesized vocal can be controlled, for example to fit the message to a desired duration.
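One simple way of fitting a synthesized message to a desired duration is to derive a speaking-rate multiplier from the duration at normal speed. The sketch below assumes that the synthesizer accepts such a multiplier and that the rate should be kept within limits so that the speech remains natural. The function name and the limit values of 0.8 and 1.25 are hypothetical.

```python
def speech_rate_factor(natural_seconds: float, target_seconds: float,
                       min_rate: float = 0.8, max_rate: float = 1.25) -> float:
    """Rate multiplier that makes the message last roughly target_seconds.

    A factor above 1.0 means faster delivery. The result is clamped to
    [min_rate, max_rate]; these limits are illustrative assumptions.
    """
    if natural_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    rate = natural_seconds / target_seconds
    return max(min_rate, min(max_rate, rate))


if __name__ == "__main__":
    # A message lasting 12 s at normal speed, to be fitted into 10 s.
    print(speech_rate_factor(12.0, 10.0))
```

If the required rate falls outside the limits, the message would not fit exactly, and the length of the transitional audio signal could be adjusted instead.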
The invention has been described above in relation to specific embodiments; however, the reader will appreciate that the invention is not so limited and can be embodied in different ways. For example, the invention can be implemented on a general-purpose computer but can also be implemented in whole or in part in application-specific integrated circuits. The invention can be implemented on a standalone computer, e.g. a personal computer or workstation, a mobile phone or a tablet, or in a client-server environment as a hosted application. Multiple computers can be used to perform different steps of the method rather than all steps being carried out on a single computer. A computer program embodying the invention can be a standalone software program, an update or extension to an existing program, or a callable function in a function library. A computer program embodying the invention can be stored in a non-transitory computer readable storage medium such as an optical disk or magnetic disk or non-volatile memory.
Outputs of a method of the invention can be broadcast or streamed in any convenient format, played on any convenient audio device or stored in electronic form in any convenient file structure (e.g. mp3, WAV, an executable file, etc.). If the output of the invention is provided in the form of a stream or playlist, the transitional audio signal can be presented as a track of its own or combined into either of the preceding and succeeding tracks. The source audio signals and the transitional audio signals can be provided from separate sources (e.g. servers) and a remotely generated transitional audio signal can be combined with locally stored source audio streams. If the output of the invention is provided in the form of a stream or playlist, then if a user fast-forwards or skips, reproduction may advance to the start, end or an intermediate position of the transitional audio signal. In an embodiment, if the user fast-forwards or skips this is taken into account in generation of the transitional audio signal, for example by omitting information of the preceding track and providing only an introduction of the succeeding track. Other actions performed by the user in relation to the playback device can also be taken into account.
The invention should not be limited except by the appended claims.
The following documents are hereby incorporated by reference in their entirety.
[Ref. 1] Kim, Youngmoo E., et al. “Music emotion recognition: A state of the art review.” Proc. ISMIR. 2010.
[Ref. 2] Wang, Zhe, Jingbo Xia, and Bin Luo. “The Analysis and Comparison of Vital Acoustic Features in Content-Based Classification of Music Genre.” Information Technology and Applications (ITA), 2013 International Conference on. IEEE, 2013.
[Ref. 3] Moffat, David, David Ronan, and Joshua D. Reiss. “An evaluation of audio feature extraction toolboxes.” International Conference on Digital Audio Effects (DAFx), 2016.
[Ref. 4] Jamdar, Adit, et al. “Emotion analysis of songs based on lyrical and audio features.” arXiv preprint arXiv:1506.05012 (2015).
[Ref. 5] Mauch, Matthias, Katy C. Noland, and Simon Dixon. “Using Musical Structure to Enhance Automatic Chord Transcription.” ISMIR. 2009.
[Ref. 6] Scholz, Florian, Igor Vatolkin, and Gunter Rudolph. “Singing Voice Detection across Different Music Genres.” Audio Engineering Society Conference: 2017 AES International Conference on Semantic Audio. Audio Engineering Society, 2017.
[Ref. 7] Yela, Delia Fano, et al. “On the Importance of Temporal Context in Proximity Kernels: A Vocal Separation Case Study.” Audio Engineering Society Conference: 2017 AES International Conference on Semantic Audio. Audio Engineering Society, 2017.
[Ref. 8] ITU-R. “Recommendation ITU-R BS.1770-2: Algorithms to measure audio programme loudness and true-peak audio level.” International Telecommunication Union, Geneva, 2011.
[Ref. 9] Salamon, Justin, et al. “Melody extraction from polyphonic music signals: Approaches, applications, and challenges.” IEEE Signal Processing Magazine 31.2 (2014): 118-134.
[Ref. 10] Vogl, Richard, et al. “Drum transcription via joint beat and drum modeling using convolutional recurrent neural networks.” Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, CN. 2018.
[Ref. 11] Shen, Jonathan, et al. “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.” arXiv preprint arXiv:1712.05884 (2017).
[Ref. 12] Blaauw, Merlijn, and Jordi Bonada. “A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs.” Applied Sciences 7.12 (2017): 1313.
[Ref. 13] McVicar, Matt, Daniel P W Ellis, and Masataka Goto. “Leveraging repetition for improved automatic lyric transcription in popular music.” Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.
Number | Date | Country | Kind |
---|---|---|---|
1803072.6 | Feb 2018 | GB | national |
The present application claims foreign priority to GB patent application number 1803072.6 filed 26 Feb. 2018, which document is hereby incorporated by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/GB2019/050524 | 2/26/2019 | WO | 00 |