The present invention relates to karaoke events. More specifically, the present invention is concerned with a method and system for scoring a singing voice.
More specifically, in accordance with the present invention, there is provided a method for scoring a singer, comprising defining a reference melody from a reference song, recording a singer's rendering of the reference song, defining a melody of the singer's rendering of the reference song, comparing the melody of the singer's rendering of the reference song with the reference melody; and scoring the singer's rendering of the reference song.
There is further provided a system for scoring a singer, comprising a processing module determining notes duration and pitch of a melody of a reference song and notes duration and pitch of a melody of the singer's rendering of the reference song; and a scoring processing module comparing the notes duration and the pitch of the melody of the reference song with the notes duration and the pitch of the melody of the singer's rendering of the reference song.
Other objects, advantages and features of the present invention will become more apparent upon reading of the following non-restrictive description of specific embodiments thereof, given by way of example only with reference to the accompanying drawings.
In the appended drawings:
A singing voice, such as a karaoke user's performance, is recorded. From the recorded file of the user's rendering of the song, the notes, i.e. the sung melody, are compared with the notes, i.e. the melody, of a reference file of the corresponding song. The comparison is based on an analysis of blocks of samples of sung notes, i.e. of an a cappella voice, and on a detection of the energy envelope of the notes, taking into account the pitch and duration of the notes. The results of the comparison give an assessment of the karaoke user's performance in terms of pitch and note duration, expressed as a score.
The system generally comprises a reference processing module 100 (see
The reference processing module 100 generates a set R of N parameters, defined as:
R={r0,r1,r2, . . . rN}
The set R defines the melody (notes) of a reference song. It serves as a reference when assessing the quality of the song as sung by a karaoke user.
The scoring processing module 400 determines, from the set R of N reference parameters, a set S of M parameters, corresponding to the quality of the melody as sung by the karaoke user, defined as:
S={s0, s1, . . . , sM}
A number of components are used to define a song, including, for example, the melody (notes) of the song, the background music, and the lyrics. A MusicXML type-file 110 may be used to transfer these components; others may be used, such as MIDI karaoke for example.
The components used to obtain the parameters of the reference set R defined hereinabove are essentially the lyrics and the melody, i.e. the notes to sing and their durations. The background music is processed so as to single out the voice. This processing comprises building a mono channel by adding the music usually emitted by the left channel and the right channel of a stereo loudspeaker or of an earphone, for example, then transmitting the mono channel unchanged to the left channel of the earphone and transmitting the mono channel, inverted, to the right channel. The signals of the two channels are thus identical save for their phase, which is inverted between the left and right channels. The analysis then proceeds on the mono signal obtained by adding the sounds received from the right channel and from the left channel, which theoretically cancels the background music accompanying the voice. This pre-processing minimizes the sound of the background music at signal reception. In practice, the minimization is not total, but it is usually sufficient to simplify the analysis in real time, which can thus avoid using algorithms for recognizing a voice in a polyphonic signal.
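The phase-inversion principle described above can be illustrated with a minimal sketch. It assumes an idealized model in which the microphone picks up the voice plus an equal amount of leakage from each earphone; the function names, the `leak` factor and the 0.5 scaling of the mono mix are illustrative assumptions, not part of the described system.

```python
import numpy as np

def earphone_feed(left, right):
    """Build the mono mix and return (left_out, right_out).

    The right output is the mono mix with inverted phase.
    The 0.5 scaling is an assumption made to avoid clipping.
    """
    mono = 0.5 * (np.asarray(left, dtype=float) + np.asarray(right, dtype=float))
    return mono, -mono

def microphone_signal(voice, left_out, right_out, leak=0.3):
    """Toy model: voice plus equal leakage from both earphones (hypothetical)."""
    return np.asarray(voice, dtype=float) + leak * (left_out + right_out)

t = np.linspace(0.0, 1.0, 8000, endpoint=False)
music_left = np.sin(2 * np.pi * 440.0 * t)
music_right = np.sin(2 * np.pi * 660.0 * t)
voice = 0.5 * np.sin(2 * np.pi * 220.0 * t)

left_out, right_out = earphone_feed(music_left, music_right)
captured = microphone_signal(voice, left_out, right_out)
```

In this idealized model the two earphone signals cancel exactly, so `captured` equals `voice`. With real acoustics the leakage from the two sides differs, which is why the cancellation is only partial in practice.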
Similarly, the minimization of background music may be performed by restoring a mono channel after the recording of the performance sung (275,
The reference 110 is received by a music synthesis unit 130, which operates either by a synthetic method or by a vocal reference method. In the synthetic method, the musical notes of the song are generated from data in the MusicXML file. In the vocal reference method, the voice of a reference singer is recorded, the reference singer singing over music synthesized from data in the MusicXML file. The music synthesis unit 130 outputs a sampled signal, in which the reference melody is represented by:
$$X_A = \{x_0, x_1, \ldots, x_{a-1}\}$$
where a is the total number of samples and XA is the set of all samples. This set is divided into blocks defined as:
$$X = \{x_0, x_1, \ldots, x_{b-1}\}$$
where b is the number of samples in the block X. As a result:
$$X_A = \{x_0, x_1, \ldots, x_{a-1}\} = \{X_0, X_1, \ldots, X_B\}$$
where B=a/b is the number of blocks.
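The division of the sampled signal into blocks can be sketched as follows. The function name is hypothetical, and dropping a trailing partial block when a is not a multiple of b is an assumption, since the text does not say how such a remainder is handled.

```python
import numpy as np

def split_into_blocks(samples, b):
    """Split a 1-D signal into consecutive blocks of b samples.

    Assumption: a trailing partial block is discarded.
    """
    samples = np.asarray(samples, dtype=float)
    n_blocks = len(samples) // b
    return samples[: n_blocks * b].reshape(n_blocks, b)

# Ten samples with b = 4 give two full blocks; the last two samples are dropped.
blocks = split_into_blocks(np.arange(10), 4)
```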
While a continuous Fourier transform is achieved in a range [−∞, +∞], a discrete Fourier transform is achieved on a block of N samples, i.e. in a range [0, N−1]. The discrete Fourier transform emulates an infinite number of blocks by repeating the range [0, N−1] infinitely. However, interfering frequencies occur at the borders of the blocks, which may be reduced by applying a weighting window, such as, for example, a Hanning window, which acts on the samples as follows (see 140 in
where pn is the weight of sample n of the block, N is the number of samples in the block, yn is the value of sample n of the block prior to weighting, and xn is the value of the weighted sample n of the block.
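A minimal sketch of such a weighting is given below. It assumes the common symmetric form of the Hanning window, pn = 0.5(1 − cos(2πn/(N−1))), which may differ in detail from the relation used in the embodiment; the function names are illustrative.

```python
import numpy as np

def hann_weights(N):
    """Weights p_n of a symmetric Hanning window of N samples (N > 1)."""
    n = np.arange(N)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (N - 1)))

def apply_window(y):
    """Return the weighted samples x_n = p_n * y_n for one block."""
    y = np.asarray(y, dtype=float)
    return hann_weights(len(y)) * y

# The weights taper to zero at both borders of the block.
weighted = apply_window(np.ones(5))
```

The taper toward zero at the block borders is what reduces the interfering frequencies introduced by the implicit periodic repetition of the block.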
Considering the sample values x0, x1, . . . , xn−1 from the weighting window (140), a discrete Fourier transform (150) is defined by:
Or, in a matrix notation:
The discrete Fourier transform has a fast version which allows a very efficient processing of the above relations by a computer. A fast Fourier transform is based on symmetries that appear in the matrix notation, whatever the value of n.
According to a property of the Fourier transforms, when the values xk are real numbers, which happens to be the case here, only the first half of the n coefficients need be processed since the second part relates to the complex conjugate values of the first half.
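The matrix form of the discrete Fourier transform and the conjugate symmetry for real inputs can be checked with a short sketch. It uses the standard definition of the transform and NumPy's FFT routines; the sample values are arbitrary.

```python
import numpy as np

def dft(x):
    """Discrete Fourier transform computed as a matrix-vector product."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = np.arange(n)
    W = np.exp(-2j * np.pi * np.outer(k, k) / n)  # n-by-n DFT matrix
    return W @ x

x = np.array([1.0, 2.0, 0.0, -1.0, 0.5, 3.0, -2.0, 1.5])
full = dft(x)           # all n coefficients
fast = np.fft.fft(x)    # same result through the fast algorithm
half = np.fft.rfft(x)   # only the first n/2 + 1 coefficients, for real input
```

For real input the coefficients of the second half are the complex conjugates of those of the first half, so `half` carries all of the information.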
A pitch detector (160) is used for determining the frequency of the reference note, as follows:
p=max(fd, fd+1, . . . , fu−1, fu)
where d is the index of the minimal frequency of the search, u is the index of the maximal frequency of the search, and p is the index corresponding to the maximum of the frequency spectrum.
The optimal values of the frequency range [d, u] ideally correspond to the lowest and the highest frequencies of the song respectively. Whenever these lowest and the highest frequencies of the song are unknown, a frequency range corresponding to the dynamic frequency range of a number of songs may be used.
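The search for the spectral maximum within the index range [d, u] can be sketched as follows. The sampling frequency, block size, test tone and search bounds are illustrative values only.

```python
import numpy as np

def detect_peak_index(magnitudes, d, u):
    """Return the index p of the largest spectral magnitude within [d, u]."""
    magnitudes = np.asarray(magnitudes, dtype=float)
    return d + int(np.argmax(magnitudes[d : u + 1]))

E = 8000   # sampling frequency in Hz (illustrative)
b = 512    # samples per block (illustrative)
t = np.arange(b) / E
block = np.hanning(b) * np.sin(2 * np.pi * 250.0 * t)   # windowed 250 Hz tone
spectrum = np.abs(np.fft.rfft(block))

p = detect_peak_index(spectrum, d=5, u=100)
peak_frequency = p * E / b
```

Restricting the search to [d, u] keeps low-frequency noise and out-of-range harmonics from being picked as the note.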
The comparison between the reference and the song as sung by the karaoke user is performed on a psycho-auditory basis, corresponding to what the human ear perceives. On such a basis, a logarithmic scale is used for the frequency representation. However, a logarithmic scale tends to under-represent lower frequencies compared to higher frequencies, which greatly reduces the ability to assess the real frequency, i.e. the musical note as sung by the karaoke user. In order to overcome this shortcoming, the following relation is applied:
where p is the index of the maximum frequency, and pe is the index of the estimated maximum.
This relation represents the position in frequency index of the center of gravity C of the area defined by
where E is the sampling frequency, b is the number of samples in a block, and M0=8.17579891564 Hz, i.e. the frequency of the first MIDI note, noted MIDI 0.
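As an illustration of how E, b and M0 relate a frequency index to a note, the sketch below converts an index to a frequency and then to a MIDI note value on the equal-tempered scale. The equal-tempered conversion is an assumption introduced here; it is not claimed to be the relation of the embodiment, and the function names are hypothetical.

```python
import math

M0 = 8.17579891564  # Hz, frequency of MIDI note 0

def index_to_frequency(p, E, b):
    """Frequency in Hz of spectral index p, for sampling frequency E and block size b."""
    return p * E / b

def frequency_to_midi(f):
    """MIDI note value (possibly fractional), assuming equal temperament."""
    return 12.0 * math.log2(f / M0)

f = index_to_frequency(16, 8000, 512)   # 250 Hz
note = frequency_to_midi(440.0)         # close to 69, the A above middle C
```

A fractional index, such as an estimated index pe, can be passed to `index_to_frequency` in the same way.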
Each block provides an estimated index of the position of the maximum. In the case of an audio reference, the spectral energy of the maximum peak is thus stored.
The sampled signal, in which the reference melody is represented, generated by the music synthesis unit 130, is also transmitted to a peak detector 180. Two cases arise, depending on the type of the reference.
For an XML, KAR or MIDI reference, the peak detection consists of detecting the presence or absence of a note of the melody: a maximum energy is considered when a note of the melody is present and a null energy is considered in the absence of a note.
For an audio reference, detection of a peak corresponds to a sudden rise of the energy level in the input signal. The peak detector (180) may operate in the manner of an analog AM demodulation detector, adapted as follows:
$$X_{|A|} = \{|x_0|, |x_1|, \ldots, |x_{a-1}|\}$$
where |y| is the absolute value of y. Detection is done by a thresholding defined by:
$$X_P = \{p_0, p_1, \ldots, p_{a-1}\}$$
where $p_i = (|x_i| > T)$ for $i = 0, 1, \ldots, a-1$, and T is the minimum threshold for detection of an energy peak.
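The rectification and thresholding steps can be sketched as follows. The function name, the sample values and the threshold are illustrative.

```python
import numpy as np

def detect_energy(samples, T):
    """Rectify the signal and flag the samples whose magnitude exceeds T."""
    rectified = np.abs(np.asarray(samples, dtype=float))
    return rectified, rectified > T

rectified, mask = detect_energy([0.0, -0.6, 0.2, 0.9, -0.1], T=0.5)
```

Here `rectified` corresponds to the set of absolute values and `mask` to the set of thresholded values.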
With respect to note duration, in the case of an XML, KAR or MIDI reference, the duration of the note, i.e. the length of time the note is sustained, corresponds to a duration indicated in the reference XML or KAR file.
In the case of an audio reference,
The duration of a note is estimated using this envelope. In general, the envelope actually corresponds to a plurality of notes. The duration estimated using this envelope makes it possible to assess a singer's capacity to sustain notes without getting out of breath, so there is no need to discriminate between notes.
In
Moreover, in
In (200), a pair vector (t, l) is created for the whole song. Time t is represented as samples where t0 is the first sample and l is the length in number of samples of the envelope.
The client application receives the set of all envelopes of the reference file, described by vector Er:
$$E_r = \{(t_0, l_0), (t_1, l_1), \ldots, (t_m, l_m)\}$$
where m is the number of envelopes, i.e. the dimension of the vector.
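One possible way of building such a vector of (t, l) pairs is sketched below. It assumes that an envelope is simply a run of consecutive samples flagged as active; the actual envelope determination of the embodiment may include smoothing or hold times that are not modelled here, and the function name is hypothetical.

```python
def mask_to_envelopes(mask):
    """Turn a per-sample activity mask into a list of (t, l) pairs.

    t is the index of the first sample of an envelope and
    l is its length in number of samples.
    """
    envelopes = []
    start = None
    for i, active in enumerate(mask):
        if active and start is None:
            start = i                              # an envelope opens
        elif not active and start is not None:
            envelopes.append((start, i - start))   # the envelope closes
            start = None
    if start is not None:                          # envelope running to the end
        envelopes.append((start, len(mask) - start))
    return envelopes

pairs = mask_to_envelopes([0, 1, 1, 0, 0, 1, 1, 1])
```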
Thus, the processing module 100 generates a set R of N parameters, defining the melody (notes) of a song, in terms of pitch and duration (i.e. time envelope). It serves as a reference when assessing the quality of the song as sung by a karaoke user.
Turning now to
The karaoke user, typically wearing earphones for the background music, performs in front of a microphone for the recording of his/her rendering of the song. At the microphone, an “a cappella” performance without musical accompaniment is collected 275, as described hereinabove in relation to
$$p_2 = \max(f_d, f_{d+1}, \ldots, f_i, f_j, \ldots, f_{u-1}, f_u)$$
where:
log⁻¹ refers to either e^x or 10^x. The logarithm type is not fixed in the above relations: it may be a natural (Napierian) or a base-10 logarithm, and the relations are independent of that choice.
Each block provides two estimated indexes of the position of the maximum. The spectral energy of the peaks is then stored, for pitch comparison (262, 264). The characteristics are represented by 6 vectors defined as follows:
VR={me
ER={e0,e1, . . . , eb}
V1={me
E1={e1,0,e1,1, . . . , e1,b}
V2={me
E2={e2,0,e2,1, . . . , e2,b}
where VR is a vector of the values of the reference notes for each block; ER is the frequency energy of the reference note; V1 is a vector of estimated note values for each block; E1 is the frequency energy of the note of the maximum peak; V2 is a vector of estimated note values (second peak) for each block; and E2 is the frequency energy of the note of the second maximum peak.
The comparison between the reference notes and the karaoke user's notes (264) yields the following relation:
where i is the block index; j is the harmonic comparison index; and l is the index of the octave of search about the reference note.
The comparison relation takes into account harmonics of musical scales. Modulo 12 corresponds to a same note in a different musical octave. This modulo allows taking into account the register of the karaoke singer. For example, a woman's voice is naturally one octave higher than a man's voice. The function
applies to all values of the set of harmonic comparison indexes. As a result, a single value Ci,l is generated. It is to be noted that the computation of comparisons Ci,l is performed only if the frequency energy is sufficient, i.e. above sc. If VR
Two characteristics are derived from the values Ci,l, as follows:
In cases of KAR or MusicXML references, the tests for the reference energy are useless since the reference is entirely synthesized. The karaoke user has no indication of how loudly to sing. As a result, the value sc is uncalibrated. In order to overcome this situation, a calibration is performed to adjust the value of the threshold sc as follows: determining the average energy mp of the blocks of the karaoke user's file in the presence of a note in the reference file; determining the average energy ma of the blocks of the karaoke user's file in the absence of a note in the reference file; determining the average energy mq of the note of the blocks of the karaoke user's file in the presence of a note in the reference file; and determining the average energy mb of the note of the blocks of the karaoke user's file in the absence of a note in the reference file. Thresholds are obtained as follows:
In cases of audio signals, the value sc may be manually determined upon launching the program.
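The averages used by the calibration can be sketched as follows for the pair mp and ma; the same helper applies to mq and mb when given per-block note energies. The function name and the sample values are illustrative, and the way the averages are combined into thresholds is not shown here.

```python
import numpy as np

def calibration_averages(user_energy, note_present):
    """Average user energy over blocks with and without a reference note.

    Assumption: both groups of blocks are non-empty.
    """
    user_energy = np.asarray(user_energy, dtype=float)
    note_present = np.asarray(note_present, dtype=bool)
    m_present = float(user_energy[note_present].mean())
    m_absent = float(user_energy[~note_present].mean())
    return m_present, m_absent

mp, ma = calibration_averages([4.0, 6.0, 1.0, 1.0], [True, True, False, False])
```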
As described hereinabove, this signal is also processed, through a peak detector (280) (see 180 for the reference signal,
$$E_C = \{(t_0, l_0), (t_1, l_1), \ldots, (t_n, l_n)\}$$
where n is the number of envelopes, i.e. the dimension of the vector.
The note duration is determined as described hereinabove in relation to 190, 200 in
$$E_r = \{(t_0, l_0), (t_1, l_1), \ldots, (t_m, l_m)\}$$
and
$$E_C = \{(\mathit{tt}_0, \mathit{ll}_0), (\mathit{tt}_1, \mathit{ll}_1), \ldots, (\mathit{tt}_n, \mathit{ll}_n)\}.$$
A first characteristic compares the total duration of the envelopes:
A second characteristic compares envelopes, by determining whether a sample, at time t, is found simultaneously in one envelope of Er and in one envelope of EC. Such samples are grouped in F′2. Thus:
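The quantities underlying the first two characteristics can be sketched as follows: the total duration of a set of envelopes, and the number of samples found simultaneously in an envelope of the reference and in an envelope of the user. The function names are hypothetical, and the normalization that turns these quantities into characteristics is not shown.

```python
def total_duration(envelopes):
    """Sum of the lengths l over all (t, l) envelopes."""
    return sum(length for _, length in envelopes)

def covered_samples(envelopes):
    """Set of sample indices covered by at least one envelope."""
    covered = set()
    for start, length in envelopes:
        covered.update(range(start, start + length))
    return covered

def overlap_samples(reference, user):
    """Number of samples covered by both the reference and the user envelopes."""
    return len(covered_samples(reference) & covered_samples(user))

E_r = [(0, 10), (20, 5)]     # reference envelopes (illustrative)
E_C = [(5, 10), (22, 10)]    # user envelopes (illustrative)
common = overlap_samples(E_r, E_C)
```

In this example the samples 5 to 9 and 22 to 24 are common to both sets, giving 8 overlapping samples.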
A third characteristic compares the energy envelopes by blocks. In this case, the energy of a note in a block is considered, rather than the envelope of the signal. This procedure eliminates background noise that triggers spurious detection of notes and envelopes: for such noise the energy of the signal is weak, which reveals the false detections. For each block, sub-parameters are determined as follows:
With F′3 the number of blocks where the energy of the note is above a threshold Tf both in the reference and in the client signals, F″3 the number of blocks where the energy of the note is above the threshold Tf only in the reference signal, F′″3 the number of blocks where the energy of the note is above the threshold Tf only in the client signal, the third characteristic is then given by:
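The three block counts defined above can be sketched as follows. The function name and the sample energies are illustrative, and the combination of the counts into the third characteristic is not shown.

```python
import numpy as np

def block_agreement_counts(ref_energy, user_energy, Tf):
    """Count blocks whose note energy exceeds Tf in both, reference only, or user only."""
    ref_on = np.asarray(ref_energy, dtype=float) > Tf
    user_on = np.asarray(user_energy, dtype=float) > Tf
    both = int(np.sum(ref_on & user_on))
    ref_only = int(np.sum(ref_on & ~user_on))
    user_only = int(np.sum(~ref_on & user_on))
    return both, ref_only, user_only

counts = block_agreement_counts([1, 1, 0, 0, 1], [1, 0, 1, 0, 1], Tf=0.5)
```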
Moreover, F3 will be set to zero when
The final score (300) is given by S=F3*c6, where:
d1 and d5 are derived from Ci,1 and Ci,5 respectively. The values Ci,l are obtained by finding the minimum error between two notes, and their formulas use the absolute value. d1 and d5 are obtained without taking the absolute value of the minimum, because negative and positive values are weighted differently in order to take psycho-auditory characteristics into account. Indeed, it has been noted that a note sounds more out of tune when sung too low than when sung too high. Thus d1 and d5 are obtained as follows:
where Csigni,j is the sign of the minimum of Ci,j, and pd is a weighting factor for negative values, here fixed to 2.
Thus:
where b is the number of blocks.
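The asymmetric weighting of negative and positive errors can be illustrated with the sketch below. It assumes that the magnitude of a negative error is multiplied by pd and that the weighted errors are averaged over the blocks; both are assumptions made for illustration, and the function names are hypothetical.

```python
def weighted_error(c_abs, c_sign, pd=2.0):
    """Weight an error magnitude more heavily when its sign is negative (note sung too low)."""
    return pd * c_abs if c_sign < 0 else c_abs

def mean_weighted_error(errors, signs, pd=2.0):
    """Average of the weighted errors over all blocks (assumed aggregation)."""
    weighted = [weighted_error(e, s, pd) for e, s in zip(errors, signs)]
    return sum(weighted) / len(weighted)

flat = weighted_error(1.5, -1)    # sung too low: penalty doubled
sharp = weighted_error(1.5, +1)   # sung too high: penalty unchanged
average = mean_weighted_error([1.0, 1.0], [-1, +1])
```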
The score is sent to an API and a server, for example.
The present method comprises processing a reference song, as either an “a cappella” voice or a digital file such as MIDI, MusicXML for example, modifying the audio references to the user so as to single out the voice by inverting a mono channel in one of the transmission channels of the accompanying music, detecting the notes one by one, analysing the signals and scoring.
As persons skilled in the art will appreciate, the present method and system allow assessing the quality of the reference sung notes and of the notes sung by the user, by using an estimation of the frequency of the sung notes. The comparison includes comparing signal envelopes and pitch. The pitch analysis is simplified since the voice is singled out from the background music during recording.
The scope of the claims should not be limited by the embodiments set forth in the examples, but should be given the broadest interpretation consistent with the description as a whole.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CA2013/050721 | 9/20/2013 | WO | 00 |
Number | Date | Country |
---|---|---|
61704804 | Sep 2012 | US |