The present disclosure generally pertains to the field of audio processing, and in particular, to devices, methods and computer programs for audio transposition.
There is a lot of audio content available, for example in the form of compact disks (CD), tapes, or audio data files which can be downloaded from the internet, but also in the form of soundtracks of videos, e.g. stored on a digital video disk or the like.
When a music player is playing a song from an existing music database, the listener may want to sing along. Naturally, the listener's voice adds to the original artist's voice present in the recording and potentially interferes with it, which may hinder or skew the listener's own interpretation of the song. Karaoke systems therefore provide a playback of a song, in the musical key of the original recording, for a karaoke singer to sing along with. This can force the karaoke singer to reach a pitch range that is beyond his capabilities, i.e. too high or too low. The resulting high singing effort means that the karaoke singer may not be able to sustain long singing sessions or could even damage his vocal cords. It may also force the karaoke singer to adapt his pitch to reduce his effort and protect his vocal cords, so that the overall quality of the performance suffers.
Although there generally exist techniques for audio transposition, it is generally desirable to improve methods and apparatus for transposition of audio content.
According to a first aspect the disclosure provides an electronic device comprising circuitry configured to separate by audio source separation a first audio input signal into a first vocal signal and an accompaniment, and to transpose an audio output signal by a transposition value based on a pitch ratio, wherein the pitch ratio is based on comparing a first pitch range of the first vocal signal and a second pitch range of a second vocal signal.
According to a second aspect the disclosure provides a method comprising: separating by audio source separation a first audio input signal into a first vocal signal and an accompaniment, and transposing an audio output signal by a transposition value based on a pitch ratio, wherein the pitch ratio is based on comparing a first pitch range of the first vocal signal and a second pitch range of a second vocal signal.
Further aspects are set forth in the dependent claims, the following description and the drawings.
Embodiments are explained by way of example with respect to the accompanying drawings, in which:
Before a detailed description of the embodiments is given under reference of the drawings, some general explanations are made.
The embodiments disclose an electronic device comprising circuitry configured to separate by audio source separation a first audio input signal into a first vocal signal and an accompaniment, and to transpose an audio output signal by a transposition value based on a pitch ratio, wherein the pitch ratio is based on comparing a first pitch range of the first vocal signal and a second pitch range of a second vocal signal.
The electronic device may for example be any music or movie reproduction device such as a karaoke box, a smartphone, a PC, a TV, a synthesizer, a mixing console or the like.
The circuitry of the electronic device may include a processor (for example a CPU), a memory (RAM, ROM or the like), storage, interfaces, etc. The circuitry may comprise or may be connected with input means (mouse, keyboard, camera, etc.), output means (a display (e.g. liquid crystal, (organic) light emitting diode, etc.), loudspeakers, etc.), a (wireless) interface, etc., as is generally known for electronic devices (computers, smartphones, etc.). Moreover, the circuitry may comprise or may be connected with sensors for sensing still images or video image data (image sensor, camera sensor, video sensor, etc.).
The input signal can be an audio signal of any type. It can be an analog signal or a digital signal, it can originate from a compact disk, digital video disk, or the like, and it can be a data file, such as a wave file, an mp3 file or the like; the present disclosure is not limited to a specific format of the input audio content. An input audio content may for example be a stereo audio signal having a first channel input audio signal and a second channel input audio signal, without the present disclosure being limited to input audio contents with two audio channels. In other embodiments, the input audio content may include any number of channels, for example a 5.1 audio signal to be remixed, or the like.
The input signal may comprise one or more source signals. In particular, the input signal may comprise several audio sources. An audio source can be any entity which produces sound waves, for example music instruments, voice, vocals, or artificially generated sound, e.g. originating from a synthesizer.
The input audio content may represent or include mixed audio sources, which means that the sound information is not separately available for all audio sources of the input audio content, but that the sound information of different audio sources at least partially overlaps or is mixed.
The accompaniment may be a residual signal that results from separating the vocal signal from the audio input signal. For example, the audio input signal may be a piece of music that comprises vocals, guitar, keyboard and drums, and the accompaniment signal may be a signal comprising the guitar, the keyboard and the drums as the residual after separating the vocals from the audio input signal.
Transposition may be the changing of the pitch of the tones of a piece of music by a certain interval, or the shifting of an entire piece of music into a different key according to the interval.
A pitch ratio may be a ratio between two pitches. Transposition by a pitch ratio may mean shifting the pitch of the tones of a piece of music by the ratio between two pitches, or shifting an entire piece of music into a different key according to the number of semitones defined by the ratio between two pitches.
Blind source separation (BSS), also known as blind signal separation, is the separation of a set of source signals from a set of mixed signals. One application of blind source separation is the separation of music into the individual instrument tracks such that an upmixing or remixing of the original content is possible.
In the following, the terms remixing, upmixing, and downmixing can refer to the overall process of generating output audio content on the basis of separated audio source signals originating from mixed input audio content, while the term “mixing” can refer to the mixing of the separated audio source signals. Hence the “mixing” of the separated audio source signals can result in a “remixing”, “upmixing” or “downmixing” of the mixed audio sources of the input audio content.
In audio source separation, an input signal comprising a number of sources (e.g. instruments, voices, or the like) is decomposed into separations. Audio source separation may be unsupervised (called “blind source separation”, BSS) or partly supervised. “Blind” means that the blind source separation does not necessarily have information about the original sources. For example, it may not necessarily know how many sources the original signal contained or which sound information of the input signal belongs to which original source. The aim of blind source separation is to decompose the original signal into separations without knowing the separations beforehand. A blind source separation unit may use any of the blind source separation techniques known to the skilled person. In (blind) source separation, source signals may be searched that are minimally correlated or maximally independent in a probabilistic or information-theoretic sense, or structural constraints on the audio source signals may be found on the basis of a non-negative matrix factorization. Methods for performing (blind) source separation are known to the skilled person and are based on, for example, principal components analysis, singular value decomposition, (in)dependent component analysis, non-negative matrix factorization, artificial neural networks, etc.
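For illustration, the non-negative matrix factorization mentioned above may be sketched as follows. This is a minimal numpy sketch operating on a magnitude spectrogram; the function name, the parameters and the Frobenius-norm multiplicative update rule are illustrative assumptions, not the separation method of the embodiments:

```python
import numpy as np

def nmf(V, n_components, n_iter=200, eps=1e-10):
    """Factor a non-negative matrix V (freq x time) into W (freq x components)
    and H (components x time) with multiplicative updates that reduce the
    Frobenius reconstruction error ||V - WH||."""
    rng = np.random.default_rng(0)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_time)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update spectral bases
    return W, H

# Usage sketch: V is the magnitude spectrogram of the mixture; component k
# yields a source estimate via a Wiener-like mask:
#   V_k = np.outer(W[:, k], H[k, :]) / (W @ H + 1e-10) * V
```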
Although some embodiments use blind source separation for generating the separated audio source signals, the present disclosure is not limited to embodiments where no further information is used for the separation of the audio source signals; in some embodiments, further information is used for the generation of separated audio source signals. Such further information can be, for example, information about the mixing process, information about the type of audio sources included in the input audio content, information about a spatial position of audio sources included in the input audio content, etc.
The circuitry may be configured to perform the remixing or upmixing based on the at least one filtered separated source and on the other separated sources obtained by the blind source separation to obtain the remixed or upmixed signal. The remixing or upmixing process may remix or upmix the separated sources, here “vocals” and “accompaniment”, to produce a remixed or upmixed signal, which may be sent to the loudspeaker system. The remixing or upmixing process may further perform lyrics replacement on one or more of the separated sources to produce a remixed or upmixed signal, which may be sent to one or more of the output channels of the loudspeaker system.
According to some embodiments, the circuitry may be further configured to determine the first pitch range of the first vocal signal based on a first pitch analysis result of the first vocal signal and the second pitch range of the second vocal signal based on a second pitch analysis result of the second vocal signal.
According to some embodiments, the accompaniment may comprise all parts of the audio input signal except for the first vocal signal.
According to some embodiments, the audio output signal may be the accompaniment.
According to some embodiments, the audio output signal may be the audio input signal.
According to some embodiments, the audio output signal may be a mixture of the accompaniment and the first vocal signal.
According to some embodiments, the circuitry may be further configured to separate the accompaniment into a plurality of instruments.
According to some embodiments, a second audio input signal may be separated into the second vocal signal and a remaining signal.
According to some embodiments, the circuitry may be further configured to determine a singing effort based on the second vocal signal, wherein the transposition value is based on the singing effort and the pitch ratio.
According to some embodiments, the singing effort may be based on the second pitch analysis result of the second vocal signal and the second pitch range of the second vocal signal.
According to some embodiments, the circuitry may be further configured to determine the singing effort based on a jitter value and/or a RAP value and/or a shimmer value and/or an APQ value and/or a Noise-to-Harmonic Ratio and/or a soft phonation index.
According to some embodiments, the circuitry may be further configured to transpose the audio output signal based on a pitch ratio, such that the transposition value corresponds to an integer multiple of a semitone.
The transposition value may be rounded up (to the ceiling) or rounded down (to the floor) to the next integer multiple of a semitone. The accompaniment may therefore be transposed by an integer multiple of a semitone.
According to some embodiments, the circuitry may comprise a microphone configured to capture the second vocal signal.
According to some embodiments, the circuitry may be further configured to capture the first audio input signal from a real audio recording.
A real audio recording may be any recording of music that is recorded, for example, with a microphone, as opposed to computer-generated sound. A real audio recording may be stored in a suitable audio file format like WAV, MP3, AAC, WMA, AIFF, etc. That means the audio input may be actual audio, i.e. un-prepared raw audio from, for example, a commercial performance of a song.
The embodiments disclose a method comprising separating by audio source separation a first audio input signal into a first vocal signal and an accompaniment, and transposing an audio output signal by a transposition value based on a pitch ratio, wherein the pitch ratio is based on comparing a first pitch range of the first vocal signal and a second pitch range of a second vocal signal.
The embodiments disclose a computer program comprising instructions, the instructions when executed on a processor causing the processor to perform the method comprising separating by audio source separation a first audio input signal into a first vocal signal and an accompaniment, and transposing an audio output signal by a transposition value based on a pitch ratio, wherein the pitch ratio is based on comparing a first pitch range of the first vocal signal and a second pitch range of a second vocal signal.
Embodiments are now described by reference to the drawings.
The audio source separation is performed on the audio input signal x(n) in real time. The audio input signal x(n) is, for example, a song to which the karaoke singing should be performed and which comprises the original vocals and the accompaniment. The audio input signal x(n) may be processed on-line through a vocal separation algorithm to extract and potentially remove the original vocals from the playback sound, or the audio input signal x(n) may be processed in advance, for example when the audio input signal x(n) is stored in a music library. In the case of in-advance processing, the pitch analysis and the pitch range estimation may also be performed in advance. For in-advance processing, each of the songs in a karaoke song database needs to be analyzed for its pitch range.
There exist karaoke boxes in which a manual transposition is possible. However, most karaoke singers (also called karaoke users) do not know whether the pitch range is adequate to their capabilities, and therefore an automatic on-line transposition of the accompaniment sAcc(n) is of great advantage.
In one embodiment the audio input x(n) is a MIDI file (see more details in the description below).
In another embodiment the audio input x(n) is an audio recording, for example a WAV file, an MP3 file, an AAC file, a WMA file, an AIFF file, etc. That means the audio input x(n) is actual audio, i.e. un-prepared raw audio from, for example, a commercial performance of a song. The karaoke material does not require any manual preparation, can be processed fully automatically and on-line, and provides good quality and high realism, so in this embodiment no pre-prepared audio/MIDI material is needed.
To analyze pitch range and singing effort (see below), the separated vocal signals are processed as described in the following.
Although the pitch analysis unit and the transposer unit are functionally separated in the embodiments described here, they may also be implemented within common circuitry.
Further advantages of the karaoke system described above are that the low-delay processing of the vocal/instrument separation allows for an on-line pitch analysis and transposition, and that the vocal separation allows for an accurate analysis of the vocal pitch range and determination of the singing effort. Further, since the vocal/instrument separation processes real audio, the karaoke system is not limited to MIDI karaoke songs and the music is therefore much more realistic. Still further, the vocal/instrument separation enables improved transposition quality of real audio recordings.
As the separation of the audio source signal may be imperfect, for example due to the mixing of the audio sources, a residual signal 3 (r(n)) is generated in addition to the separated audio source signals 2a-2d. The residual signal may for example represent a difference between the input audio content and the sum of all separated audio source signals. The audio signal emitted by each audio source is represented in the input audio content 1 by its respective recorded sound waves. For input audio content having more than one audio channel, such as stereo or surround sound input audio content, spatial information for the audio sources is typically also included or represented by the input audio content, e.g. by the proportion of the audio source signal included in the different audio channels. The separation of the input audio content 1 into separated audio source signals 2a-2d and a residual 3 is performed on the basis of blind source separation or other techniques which are able to separate audio sources.
In a second step, the separations 2a-2d and the possible residual 3 are remixed and rendered to a new loudspeaker signal 4, here a signal comprising five channels 4a-4e, namely a 5.0 channel system. On the basis of the separated audio source signals and the residual signal, an output audio content is generated by mixing the separated audio source signals and the residual signal on the basis of spatial information. The output audio content is exemplarily illustrated and denoted with reference number 4.
The audio input x(n) and the audio input y(n) can be separated by the method described above.
Another method to remove the accompaniment from the audio input y(n) is, for example, a crosstalk cancellation method, where a reference of the accompaniment is subtracted in-phase from the microphone signal, for example by using adaptive filtering.
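A minimal sketch of such an adaptive in-phase subtraction, assuming a normalized least-mean-squares (NLMS) filter; the filter length, step size and function name are illustrative assumptions:

```python
import numpy as np

def nlms_cancel(mic, ref, n_taps=256, mu=0.5, eps=1e-8):
    """Adaptively filter the accompaniment reference `ref` and subtract it
    from the microphone signal `mic`; the error signal is an estimate of
    the vocals."""
    w = np.zeros(n_taps)               # adaptive FIR weights
    buf = np.zeros(n_taps)             # most recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        y = w @ buf                    # estimated accompaniment leakage
        e = mic[n] - y                 # error = vocals estimate
        w += mu * e * buf / (buf @ buf + eps)  # NLMS weight update
        out[n] = e
    return out
```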
Another method to separate the audio input y(n) can be utilized if a mastering recording for the audio input y(n) is available, i.e. in-detail knowledge about how the audio input y(n) (a song) was mastered. In this case the stems are mixed again without the vocals, and the vocals are mixed again without the accompaniment. In this process a much larger number of stems is used during mastering, e.g. layered vocals, multi-microphone takes, effects being applied, etc.
At the signal framing 301, a windowed frame, such as the framed vocals Sn(i), can be obtained by

Sn(i) = s(n+i) h(i)
where s(n+i) represents the discretized audio signal (i representing the sample number and thus time) shifted by n samples, and h(i) is a framing function around time n (respectively sample n), for example the Hamming function, which is well known to the skilled person.
At the FFT spectrum analysis 302, each framed vocals signal is converted into a respective short-term power spectrum. The short-term power spectrum S(ω), also known as the power spectral density, is obtained by the discrete Fourier transform as

|Sω(n)|² = (1/N) |Σ_{i=0…N−1} Sn(i) e^(−jωi)|²

where Sn(i) is the signal in the windowed frame, such as the framed vocals Sn(i) as defined above, ω are the frequencies in the frequency domain, |Sω(n)| are the components of the short-term power spectrum S(ω), and N is the number of samples in a windowed frame, e.g. in each framed vocals signal.
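For illustration, the framing and short-term power spectrum computation described by the two equations above may look as follows; the frame length and the choice of the Hamming window are illustrative:

```python
import numpy as np

def framed_power_spectrum(s, n, N=2048):
    """Hamming-windowed frame Sn(i) = s(n+i) h(i) and its short-term power
    spectrum (periodogram), following the equations above."""
    h = np.hamming(N)
    frame = s[n:n + N] * h                 # Sn(i) = s(n+i) h(i)
    spectrum = np.fft.rfft(frame)
    power = (np.abs(spectrum) ** 2) / N    # |Sw(n)|^2 / N
    freqs = np.fft.rfftfreq(N)             # normalized frequencies (cycles/sample)
    return freqs, power
```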
The pitch measure analysis 303 may for example be implemented as described in the published paper by Der-Jenq Liu and Chin-Teng Lin, “Fundamental frequency estimation based on the joint time-frequency analysis of harmonic spectral structure”, IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 609-621, September 2001:
A pitch measure RP(ωf) is obtained for each fundamental-frequency candidate ωf from the power spectral density Sω(n) of the frame window Sn by

RP(ωf) = RE(ωf) RI(ωf)
where RE(ωf) is the energy measure of a fundamental-frequency candidate ωf, and RI(ωf) is the impulse measure of a fundamental-frequency candidate ωf.
The energy measure RE(ωf) of a fundamental-frequency candidate ωf is given by

RE(ωf) = (1/E) Σ_{l=1…K(ωf)} hin(lωf)

where K(ωf) is the number of the harmonics of the fundamental frequency candidate ωf, hin(lωf) is the inner energy related to a harmonic lωf of the fundamental frequency candidate ωf, and E is the total energy, where E = ∫0∞ Sω(n) dω. The inner energy hin(lωf) is the area under the curve of the spectrum bounded by an inner window of length win around the harmonic, and the total energy is the total area under the curve of the spectrum.
The impulse measure RI(ωf) of a fundamental-frequency candidate ωf is given by

RI(ωf) = (1/K(ωf)) Σ_{l=1…K(ωf)} hin(lωf) / hout(lωf)

where ωf is the fundamental frequency candidate, K(ωf) is the number of the harmonics of the fundamental frequency candidate ωf, hin(lωf) is the inner energy related to a harmonic lωf, and hout(lωf) is the outer energy related to the harmonic lωf. The outer energy hout(lωf) is the area under the curve of the spectrum bounded by an outer window of length wout around the harmonic.
The pitch analysis result ω̂f(n) for frame window Sn is obtained by

ω̂f(n) = argmax_{ωf} RP(ωf)

where ω̂f(n) is the fundamental frequency for window Sn, and RP(ωf) is the pitch measure for fundamental frequency candidate ωf obtained by the pitch measure analysis 303, as described above.
The fundamental frequency ω̂f(n) at sample n is the pitch measurement result that indicates the pitch of the vocals at sample n in the vocals signal s(n).
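For illustration, the pitch measure RP(ωf) = RE(ωf) RI(ωf) and its maximization may be sketched as follows; the inner/outer window half-widths win and wout (in spectral bins) and the candidate grid are illustrative assumptions:

```python
import numpy as np

def pitch_measure(power, freqs, candidates, win=2, wout=10):
    """Evaluate RP = RE * RI for each fundamental-frequency candidate and
    return the argmax, following the energy/impulse measures above."""
    E = power.sum()                               # total spectral energy
    best_wf, best_rp = None, -np.inf
    for wf in candidates:
        harmonics = np.arange(wf, freqs[-1], wf)  # l * wf up to Nyquist
        if len(harmonics) == 0:
            continue
        re, ri = 0.0, 0.0
        for h in harmonics:
            k = int(np.argmin(np.abs(freqs - h)))          # bin nearest harmonic
            h_in = power[max(0, k - win):k + win + 1].sum()    # inner energy
            h_out = power[max(0, k - wout):k + wout + 1].sum() # outer energy
            re += h_in
            ri += h_in / (h_out + 1e-12)
        rp = (re / E) * (ri / len(harmonics))
        if rp > best_rp:                          # argmax of RP(wf)
            best_rp, best_wf = rp, wf
    return best_wf, best_rp
```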
Still further, a low pass filter (LP) 304 is applied to the pitch measurement result ω̂f(n) to obtain a pitch analysis result ωf(n) 305.
The low pass filter 304 can be a causal discrete-time low-pass Finite Impulse Response (FIR) filter of order M given by

ωf(n) = Σ_{i=0…M} ai ω̂f(n−i)

where ai is the value of the impulse response at the ith instant for 0 ≤ i ≤ M. In this causal discrete-time FIR filter of order M, each value of the output sequence is a weighted sum of the most recent input values.
The filter parameters M and ai can be selected according to a design choice of the skilled person, for example a0 = 1 for normalization purposes. The parameter M can for example be chosen on a time scale of up to 1 s.
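A minimal sketch of such a causal FIR low-pass, assuming the simplest design choice of equal taps (a moving average):

```python
import numpy as np

def smooth_pitch(pitch_track, M=50):
    """Causal FIR low-pass of order M applied to the raw pitch estimates;
    here all taps a_i = 1/(M+1), i.e. a plain moving average."""
    a = np.ones(M + 1) / (M + 1)
    # truncating the full convolution to the input length keeps the filter causal
    return np.convolve(pitch_track, a)[:len(pitch_track)]
```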
A pitch analysis process as described above can be carried out on the separated vocal signals.
Other methods for pitch analysis and estimation of monophonic signals can be used instead of, or in addition to, the method described above.
Still further, other, more advanced methods for pitch analysis and estimation can be used instead of, or in addition to, the method described above.
For a robust pitch determination, pitch tracking is needed (to avoid pitch-doubling errors and to perform voiced/unvoiced detection), which is often done by applying dynamic programming to the pitch F0 candidates, as described in any of the methods given above. A pitch tracking method is described in “An integrated pitch tracking algorithm for speech systems”, B. Secrest and G. Doddington, published in ICASSP '83, IEEE International Conference on Acoustics, Speech, and Signal Processing, Boston, Mass., USA, 1983, pp. 1352-1355, doi: 10.1109/ICASSP.1983.1172016.
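For illustration, such a dynamic-programming pitch tracking over per-frame F0 candidates may be sketched as follows; the log-frequency transition cost and the penalty weight are illustrative assumptions, not the method of the cited paper:

```python
import numpy as np

def track_pitch(candidate_freqs, candidate_scores, jump_penalty=2.0):
    """Viterbi-style tracking: maximize summed pitch-measure scores minus a
    penalty on large (e.g. octave) jumps between consecutive frames.
    candidate_freqs/candidate_scores: lists of 1-D arrays, one per frame."""
    n_frames = len(candidate_freqs)
    best = [np.asarray(candidate_scores[0], dtype=float)]
    back = [np.zeros(len(candidate_freqs[0]), dtype=int)]
    for t in range(1, n_frames):
        f_prev, f_cur = candidate_freqs[t - 1], candidate_freqs[t]
        # transition cost: absolute log2-frequency jump between frames
        trans = jump_penalty * np.abs(np.log2(f_cur[None, :] / f_prev[:, None]))
        total = best[-1][:, None] - trans            # shape (prev, cur)
        back.append(np.argmax(total, axis=0))
        best.append(candidate_scores[t] + np.max(total, axis=0))
    # backtrack the best path
    path = [int(np.argmax(best[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return np.array([candidate_freqs[t][k] for t, k in enumerate(path)])
```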
Still further, the pitch analysis and the (key) transposition are better if the vocals and the accompaniment are separated.
The pitch range determination process as described above can be carried out based on the pitch analysis result ωf,original(n) of the original vocals soriginal(n) and on the pitch analysis result ωf,user(n) of the user vocals suser(n).
The pitch range determination process of the pitch determiner as described above may be carried out on-line, for example during a live karaoke performance.
In another embodiment the pitch range determination process of the pitch determiner 15 as described above may be carried out on an in-advance stored audio input x(n), for example a stored song of a karaoke system whose pitch range is to be determined. In this case the upper limit max_ωf of the pitch range Rω(n) = [min_ωf(n), max_ωf(n)] is determined by setting

max_ωf = max{ωf(n) | n = 1 … N}

wherein max is the maximum function and N is the number of all samples of the stored audio input x(n), and the lower limit min_ωf of the pitch range Rω(n) = [min_ωf(n), max_ωf(n)] is determined by setting

min_ωf = min{ωf(n) | n = 1 … N}

wherein min is the minimum function.
In yet another embodiment the pitch range determination process of the pitch determiner 15 as described above may be carried out on an in-advance stored audio input y(n), which is for example a stored karaoke performance of a user on a number of previous songs, from which a pitch range and singing effort profile (see below) can be compiled. In this case the pitch range Rω(n) = [min_ωf(n), max_ωf(n)] can be determined as described in the previous paragraph.
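For illustration, the pitch range determination over a stored pitch track, and a running variant for the on-line case, may be sketched as follows; marking unvoiced frames with 0 is an illustrative assumption:

```python
import numpy as np

def pitch_range(pitch_track):
    """Pitch range R_w = [min wf(n), max wf(n)] over a stored pitch track."""
    voiced = pitch_track[pitch_track > 0]   # assumption: 0 marks unvoiced frames
    return voiced.min(), voiced.max()

def pitch_range_online(current_range, wf_n):
    """Running update for on-line use: widen the range sample by sample."""
    lo, hi = current_range
    return min(lo, wf_n), max(hi, wf_n)
```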
Pitch Range Comparison
The pitch range comparison process of the pitch range comparator 16 as described above is carried out for every sample n of the user's vocals suser(n). That means that, while a user performs karaoke, the pitch ratio Pω(n) can be adapted at every sample n. The final pitch ratio Pω(N) over all samples n = 1 … N after a user finishes a karaoke performance can be stored in a database, for example the storage 1202, and be linked to the user.
The pitch ratio Pω(n) is a value relative to the original vocal pitch range average avg_ωf,original(n) and centered around 1, so that it can be seen as a kind of “transposition factor” which should be applied to the original vocal pitch frequency ωf,original(n).
As described above, like the pitch analysis result ωf(n) from the pitch analyzer 14 and the pitch range Rω(n) from the pitch range determiner 15, the pitch ratio Pω(n) can be determined on-line for every sample n from an audio input y(n), for example from a live karaoke performance of a user, and from an audio input x(n), for example from a chosen song to which a karaoke performance should be performed.
If a pitch range Rω,user(N) of a user is known in advance (i.e. before the karaoke performance of a song which yields an audio input y(n)), for example from another song that was performed by the user and is stored in the storage 1202, the pitch ratio Pω(N) may be determined based on the in-advance known range Rω,user of the user and the in-advance known range Rω,original(N) of the original vocals.
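The exact comparison of the two pitch ranges is left open at this point; one plausible realization, assuming the pitch ratio is taken as the ratio of the geometric centers of both ranges, is:

```python
import numpy as np

def pitch_ratio(range_user, range_original):
    """Pitch ratio from comparing two pitch ranges (lo, hi). Assumption: the
    ratio of the geometric centers of the user's and the original vocal range
    is used, so a value of 1 means both ranges are centered alike."""
    center_user = np.sqrt(range_user[0] * range_user[1])
    center_orig = np.sqrt(range_original[0] * range_original[1])
    return center_user / center_orig
```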
In the realm of music and musical transposition it is often stated by how many semitones or full tones a piece of music is transposed. Since an octave comprises 12 semitones and an octave corresponds to a pitch ratio Pω(n) = 2, a transposition up by a semitone corresponds to a pitch ratio Pω(n) = 2^(1/12) ≈ 1.0595, and a transposition down by a semitone corresponds to a pitch ratio Pω(n) = (1/2)^(1/12) ≈ 0.9439. In this way, the pitch ratio Pω(n) and a semitone transposition specification can easily be converted into each other. Therefore, in another embodiment the pitch ratio Pω(n) may be rounded up (to the ceiling) or down (to the floor) to the next semitone such that the pitch ratio Pω(n) always corresponds to a transposition by an integer multiple of a semitone.
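This conversion and rounding may, for illustration, be sketched as follows:

```python
import math

def ratio_to_semitones(p):
    """Semitone equivalent of a pitch ratio: p = 2**(k/12) <=> k = 12*log2(p)."""
    return 12.0 * math.log2(p)

def round_ratio_to_semitone(p, mode="nearest"):
    """Round a pitch ratio so it corresponds to an integer number of semitones."""
    k = ratio_to_semitones(p)
    k = {"nearest": round, "ceil": math.ceil, "floor": math.floor}[mode](k)
    return 2.0 ** (k / 12.0)

# Example: one semitone up is 2**(1/12) ~ 1.0595, one semitone down ~ 0.9439.
```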
As described above, the goal is, during a karaoke performance of a user on a song, to transpose the accompaniment sAcc(n) of the song such that the user can more easily match his voice to the accompaniment sAcc(n). The “transposition factor” by which the accompaniment sAcc(n) should be transposed is determined as described above.
In this embodiment the audio output signal x*(n) is equal to the accompaniment sAcc(n). In general, the same process as described above can also be applied to other choices of the audio output signal.
The time-scale modification phase vocoder and the resampling are described in more detail, for example, in the paper “New phase-vocoder techniques for pitch-shifting, harmonizing and other exotic effects”, published in Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999, or in the papers mentioned therein. Still further, an improved phase vocoder is explained in more detail, for example, in the paper “Improved Phase Vocoder Time-Scale Modification of Audio” by Jean Laroche and Mark Dolson, published in IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, May 1999.
In case the transposition value transpose_val(n) is smaller than 1, the time-scale modification and resampling steps 73 and 74 described above are performed analogously and result in a downward transposition.
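For illustration, the two-step transposition (time-scale modification by a phase vocoder followed by resampling) may be sketched as follows, here using librosa's phase vocoder; the STFT parameters and the function name are illustrative, not the transposer of the embodiments:

```python
import scipy.signal
import librosa

def transpose_accompaniment(y, ratio, n_fft=2048, hop=512):
    """Pitch-shift y by `ratio` (>1 shifts up): time-stretch by 1/ratio with a
    phase vocoder, then resample back to the original length, so the duration
    is preserved while all frequencies are scaled by `ratio`."""
    D = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    D_stretched = librosa.phase_vocoder(D, rate=1.0 / ratio, hop_length=hop)
    y_stretched = librosa.istft(D_stretched, hop_length=hop)
    return scipy.signal.resample(y_stretched, len(y))
```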
As described above, just as the pitch ratio Pω(n) can be determined on-line for every sample n, the transposed accompaniment sAcc*(n) can be determined on-line for every sample n depending on the current transposition value transpose_val(n) (which can also be viewed as a transposition key) and can then be applied to the whole song in real time.
If the pitch ratio Pω(N) (and thereby the transposition value transpose_val(n)) for a chosen karaoke song and a specified user is known in advance, as described above, the transposed accompaniment sAcc*(n) may also be determined in advance.
As described above, the accompaniment sAcc(n) is output by the music source separation (MSS) 12.
As described above, the pitch ratio Pω(n) can also be stated in semitones or full tones, and exactly the same is true for the transposition value transpose_val(n).
Still further, in another embodiment, the audio input signal x(n) may be available as a MIDI (Musical Instrument Digital Interface) file, and therefore the accompaniment sAcc(n), or the single tracks of the accompaniment, may be available as a MIDI file as well. In this case the transposition of the MIDI file accompaniment sAcc(n) can be achieved by standard MIDI commands such as a transposition filter. That means that in this case the transposition is performed by simply transposing the key of the MIDI track by the desired transposition value transpose_val(n) prior to the instrument synthesis.
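For illustration, such a MIDI transposition may be sketched as follows using the mido library; sparing the percussion channel is an illustrative design choice, since percussion note numbers select drum sounds rather than pitches:

```python
import mido

def transpose_midi(path_in, path_out, semitones, drum_channel=9):
    """Shift every note in a MIDI file by `semitones`, leaving the drum
    channel untouched."""
    mid = mido.MidiFile(path_in)
    for track in mid.tracks:
        for msg in track:
            if msg.type in ("note_on", "note_off") and msg.channel != drum_channel:
                # clamp to the valid MIDI note range 0..127
                msg.note = max(0, min(127, msg.note + semitones))
    mid.save(path_out)
```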
Therefore, the above-described transposer is able to process any type of recording (synthesized MIDI, third-party cover, or commercially released recordings), wherein the transposition quality may be improved by the high separation quality and by the pitch analysis and transposition value determination.
The karaoke system can further estimate a singing effort of a karaoke user. The singing effort indicates whether a karaoke user has great effort to reach the pitch range of the original song, i.e. whether the karaoke user must make a high effort to sing as high or as low as the original song. If an amateur karaoke user sings beyond his natural capabilities for a longer period of time, the user will not be able to sustain long singing sessions and could damage his vocal cords, and the quality of the performance will be bad.
There are different characteristic parameters which can be deduced from an analysis of the user vocals suser(n) and/or the user's pitch analysis result ωf,user(n) and which indicate a high singing effort. These characteristic parameters are, for example, a jitter value, a RAP (relative average perturbation) value, a shimmer value, an APQ (amplitude perturbation quotient) value, a Noise-to-Harmonic Ratio (NHR) and a soft phonation index.
A more in-depth analysis of the above-mentioned parameters and of ways to measure and detect them based on the user vocals suser(n) and/or the user's pitch analysis result ωf,user(n) is described in the scientific paper “Vocal Folds Disorder Detection using Pattern Recognition Methods”, J. Wang and C. Jo, published in 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Lyon, 2007, pp. 3253-3256, doi: 10.1109/IEMBS.2007.4353023.
Most of the above parameters are related to the vocal cords. Some of them are related to expressiveness while singing as well, like jitter (vibrato), but exhibiting progressively chaotic vocal cord behavior through the karaoke singing session might be an indicator of developing short-term vocal cord issues like swelling. The NHR value could also be used to detect aphonia. The karaoke system can monitor the parameters described above and their variations over a karaoke session of a user and determine the singing effort and a possible vocal cord damage (for example through progressive degradation of singing quality).
In the embodiment above the singing effort E(n) is a “binarized” value of the jitter value jitter_val(n), i.e. a flag is set when the jitter value is above a threshold and is not set when it is below the threshold. In another embodiment the singing effort E(n) can be a quantitative value, for example a value that is directly proportional to the jitter value jitter_val(n).
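For illustration, a jitter value and the binarized singing effort flag may be computed as follows; the jitter definition used here (relative period perturbation) and the threshold value are illustrative assumptions:

```python
import numpy as np

def jitter(periods):
    """Relative jitter: mean absolute difference of consecutive pitch periods
    divided by the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def singing_effort_flag(pitch_track, threshold=0.01):
    """Binarized singing effort E(n): flag set when jitter exceeds a threshold
    (illustrative value)."""
    periods = 1.0 / pitch_track[pitch_track > 0]   # pitch periods from f0
    return 1 if jitter(periods) > threshold else 0
```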
In yet another embodiment any of the other characteristic parameters described above can be used instead of, or in addition to, the jitter in order to determine a first and a second singing effort value as described above.
In yet another embodiment the singing effort E(n) can be a quantitative value, for example a value that is directly proportional to any linear or nonlinear combination of the characteristic parameters described above.
In another embodiment the karaoke system can propose to stop or pause singing to prevent more severe vocal cord problems. More details on how to recognize pathological speech, which can also be utilized to detect a high singing effort, are described, for example, in “A system for automatic recognition of pathological speech” by Alireza Dibazar and Shrikanth Narayanan, published in Proceedings of the Asilomar Conference on Signals, Systems and Computers, November 2002. In this paper, standard MFCC and pitch features are used for the classification of several speech-production-related pathologies.
Once the singing effort determiner 22 has determined the singing effort value E and a pitch ratio Pω, a transposition value transpose_val can be determined.
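The exact combination rule is not fixed at this point; one plausible heuristic, sketched under the assumption that a detected high effort should push the transposition one further semitone in the direction already indicated by the pitch ratio, is:

```python
import math

def transposition_value(pitch_ratio, effort_flag, extra_semitones=1):
    """Heuristic combination (assumption): convert the pitch ratio to
    semitones, add an extra semitone in the same direction when a high
    singing effort was detected, and round to an integer semitone."""
    k = 12.0 * math.log2(pitch_ratio)          # shift in semitones
    if effort_flag:
        k += extra_semitones if k >= 0 else -extra_semitones
    return 2.0 ** (round(k) / 12.0)            # integer-semitone ratio
```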
In yet another embodiment the accompaniment sAcc(n) can be separated into melodic/harmonic tracks and percussion tracks, and the same single-track (single-instrument) transposition as described above can be applied. If the accompaniment sAcc(n) is separated into more than one track (instrument), the transposition process of the transposer 17 is applied to each of the separated tracks individually, and the individually transposed tracks are afterwards summed up into a stereo recording to obtain the complete transposed accompaniment sAcc*(n).
It should be noted that the description above is only an example configuration. Alternative configurations may be implemented with additional or other sensors, storage devices, interfaces, or the like.
It should be recognized that the embodiments describe methods with an exemplary ordering of method steps. The specific ordering of method steps is, however, given for illustrative purposes only and should not be construed as binding.
It should also be noted that the division of the electronic device into units is only made for illustration purposes and that the present disclosure is not limited to any specific division of functions in specific units.
All units and entities described in this specification and claimed in the appended claims can, if not stated otherwise, be implemented as integrated circuit logic, for example, on a chip, and functionality provided by such units and entities can, if not stated otherwise, be implemented by software.
In so far as the embodiments of the disclosure described above are implemented, at least in part, using software-controlled data processing apparatus, it will be appreciated that a computer program providing such software control and a transmission, storage or other medium by which such a computer program is provided are envisaged as aspects of the present disclosure.
Note that the present technology can also be configured as described below.
(1) An electronic device comprising circuitry configured to separate by audio source separation a first audio input signal (x(n)) into a first vocal signal (soriginal(n)) and an accompaniment (sAcc(n); sA1(n), sA2(n), sA3(n)), and to transpose an audio output signal (x*(n)) by a transposition value (transpose_val(n)) based on a pitch ratio (Pω(n)), wherein the pitch ratio (Pω(n)) is based on comparing a first pitch range (Rω,original(n)) of the first vocal signal (soriginal(n)) and a second pitch range (Rω,user(n)) of a second vocal signal (suser(n)).
(2) The electronic device of (1), wherein the circuitry is further configured to determine the first pitch range (Rω,original(n)) of the first vocal signal (soriginal(n)) based on a first pitch analysis result (ωf,original(n)) of the first vocal signal (soriginal(n)) and the second pitch range (Rω,user(n)) of the second vocal signal (suser(n)) based on a second pitch analysis result (ωf,user(n)) of the second vocal signal (suser(n)).
(3) The electronic device of (1) or (2), wherein the circuitry is further configured to determine the first pitch analysis result (ωf,original(n)) based on the first vocal signal (soriginal (n)) and the second pitch analysis result (ωf,user(n)) based on the second vocal signal (suser (n)).
(4) The electronic device of anyone of (1) to (3), wherein the accompaniment (sAcc(n); sA1(n), sA2(n), sA3(n)) comprises all parts of the audio input signal (x(n)) except for the first vocal signal (soriginal(n)).
(5) The electronic device of anyone of (1) to (4), wherein the audio output signal (x*(n)) is the accompaniment (sAcc(n)).
(6) The electronic device of anyone of (1) to (5), wherein the audio output signal (x*(n)) is the audio input signal (x(n)).
(7) The electronic device of anyone of (1) to (6), wherein the audio output signal (x*(n)) is a mixture of the accompaniment (sAcc(n)) and the first vocal signal (soriginal(n)).
(8) The electronic device of anyone of (1) to (7), wherein the circuitry is further configured to separate the accompaniment (sAcc(n)) into a plurality of instruments (sA1(n); sA2(n); sA3(n)).
(9) The electronic device of anyone of (1) to (8), wherein the circuitry is further configured to separate a second audio input signal (y(n)) by audio source separation.
(10) The electronic device of (9), wherein the second audio input signal (y(n)) is separated into the second vocal signal (suser (n)) and a remaining signal.
(11) The electronic device of anyone of (1) to (10), wherein the circuitry is further configured to determine a singing effort (E(n)) based on the second vocal signal (suser(n)), wherein the transposition value (transpose_val(n)) is based on the singing effort (E(n)) and the pitch ratio (Pω(n)).
(12) The electronic device of (11), wherein the singing effort (E(n)) is based on the second pitch analysis result (ωf,user(n)) of the second vocal signal (suser(n)) and the second pitch range (Rω,user(n)) of the second vocal signal (suser(n)).
(13) The electronic device of (11) or (12), wherein the circuitry is further configured to determine the singing effort (E(n)) based on a jitter value (jitter_val(n)) and/or a RAP value and/or a shimmer value and/or an APQ value and/or a Noise-to-Harmonic Ratio and/or a soft phonation index.
(14) The electronic device of anyone of (1) to (13), wherein the circuitry is configured to transpose the audio output signal (x*(n)) based on a pitch ratio (Pω(n)), such that the transposition value (transpose_val(n)) corresponds to an integer multiple of a semitone.
(15) The electronic device of anyone of (1) to (14), wherein the circuitry comprises a microphone configured to capture the second vocal signal (suser(n)).
(16) The electronic device of anyone of (1) to (15), wherein the circuitry is configured to capture the first audio input signal (x(n)) from a real audio recording.
(17) A method comprising:
separating by audio source separation a first audio input signal (x(n)) into a first vocal signal (soriginal(n)) and an accompaniment (sAcc(n); sA1(n), sA2(n), sA3(n)), and transposing an audio output signal (x*(n)) by a transposition value (transpose_val(n)) based on a pitch ratio (Pω(n)), wherein the pitch ratio (Pω(n)) is based on comparing a first pitch range (Rω,original(n)) of the first vocal signal (soriginal(n)) and a second pitch range (Rω,user(n)) of a second vocal signal (suser(n)).
(18) A computer program comprising instructions, the instructions when executed on a processor causing the processor to perform the method of (17).