The present invention relates generally to entertainment systems, and relates more particularly to karaoke systems.
Karaoke systems have become an increasingly popular means of entertainment at parties and other social events. However, cost constraints limit the quality and capabilities of conventional private-use karaoke systems. For example, it is very difficult for conventional private-use karaoke systems to obtain original musical tracks for user performances (e.g., as opposed to musical tracks that are re-recorded by a karaoke system manufacturer and performed by anonymous artists in the same key as the original musical track). This limits the selection of music available to karaoke users. Furthermore, the selections that are available are often modified versions of the original works.
Moreover, many karaoke users would benefit from a system that provides a score or assessment of the user's performance, e.g., in comparison to the originally recorded track. However, presently available karaoke systems do not include this capability.
Thus, there is a need in the art for a method and apparatus for adapting original musical tracks for karaoke use.
In one embodiment, the present invention is a method and apparatus for adapting original musical tracks for karaoke use. In one embodiment, an original musical track is separated into vocal elements and non-vocal elements. The vocal elements are aligned with corresponding text transcriptions (e.g., text-based lyrics), and the aligned text-based lyrics are then displayed to a user while the non-vocal elements are simultaneously played in a manner that is synchronous with the display of the lyrics.
The teaching of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
The present invention relates to karaoke systems, including karaoke systems that may be implemented for private or home use (e.g., at private parties or other social gatherings). The method and apparatus of the present invention may be implemented to transform virtually any computing device (including a desktop computer, a laptop computer, a cellular telephone, a personal digital assistant (PDA), a wristwatch, a portable music player, a car stereo, a hi-fi/entertainment center, a television, a gaming console, a dedicated karaoke device, a digital video recorder (DVR), or a cable or satellite set-top box, among others) into a karaoke system capable of adapting original musical tracks for karaoke use. Moreover, the method and apparatus of the present invention may be implemented to “score” a user's performance based on a comparison to the original musical track.
In optional step 106 (illustrated in phantom), the method 100 separates the original musical track into two portions: a first portion containing the original musical track's vocal elements and a second portion containing the original musical track's non-vocal elements. In one embodiment, step 106 is performed using any one or more known techniques for extracting vocals from stereo music files.
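By way of illustration, one simple known technique for separating vocal elements from a stereo recording is a mid/side split, which exploits the common practice of panning lead vocals to the center of the stereo image. The following Python sketch (function and variable names are illustrative, not part of the claimed method) demonstrates the idea; more sophisticated source-separation techniques may be substituted in step 106.

```python
import numpy as np

def split_center_channel(stereo):
    """Naive vocal/instrumental split for an (n, 2) stereo array.

    Assumes vocals are panned dead center: subtracting the channels
    cancels center-panned content (yielding an instrumental estimate),
    while the mid signal (L + R) / 2 reinforces it (a vocal-heavy
    estimate).  This is a sketch, not a production separator.
    """
    left, right = stereo[:, 0], stereo[:, 1]
    non_vocal = (left - right) / 2.0   # center-panned vocals cancel out
    vocal_est = (left + right) / 2.0   # center content is reinforced
    return vocal_est, non_vocal
```

Note that the mid signal still contains any center-panned instruments, which is why the alignment of step 108 may alternatively be run on the intact track.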
In step 108, the method 100 aligns the vocal elements of the original musical track with one or more text versions of the corresponding lyrics. In one embodiment, the text-based lyrics are input by the user. In another embodiment, the text-based lyrics are retrieved locally or remotely (e.g., from a local file or from the Internet). In one embodiment, this alignment step 108 is performed using the intact original musical track. In another embodiment, this alignment step 108 is performed using only vocal elements that have been separated from non-vocal elements of the original musical track (e.g., in accordance with optional step 106).
The method 200 is initialized at step 202 and proceeds to step 204, where the method 200 retrieves a plurality of text-based versions of the lyrics that correspond to the vocal elements of the original musical track. These text-based versions of the lyrics may be retrieved, for example, from multiple Internet web sites. In one embodiment, step 204 involves the selection of a predefined number of text-based versions of the lyrics from a given set of text-based versions.
In step 206, the method 200 normalizes and/or filters the retrieved versions of the text-based lyrics in order to canonicalize spellings and automatically correct obvious transcription errors. The method 200 then proceeds to optional step 208 (illustrated in phantom) and cuts waveforms of the vocal elements to approximately span the retrieved versions of the lyrics.
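The normalization and error correction of step 206 can be illustrated with a minimal Python sketch that canonicalizes spelling by lowercasing and stripping punctuation, and then corrects isolated transcription errors by a line-by-line majority vote across the retrieved versions (the functions shown are hypothetical and assume the versions are already roughly line-aligned):

```python
import re
from collections import Counter

def normalize(line):
    """Canonicalize one lyric line: lowercase, drop punctuation,
    collapse whitespace."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)  # keep letters and apostrophes
    return " ".join(line.split())

def consensus_lyrics(versions):
    """Correct obvious transcription errors by majority vote over
    the normalized lines of several retrieved lyric versions."""
    normed = [[normalize(l) for l in v] for v in versions]
    n_lines = min(len(v) for v in normed)
    return [Counter(v[i] for v in normed).most_common(1)[0][0]
            for i in range(n_lines)]
```

A version containing a typo such as “helo world” is outvoted when two other versions agree on “hello world”.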
In step 210, the method 200 forcibly aligns the waveforms of the vocal elements to the normalized and filtered text-based lyrics. In one embodiment, this forcible alignment is performed with partial flexibility. That is, portions of the waveforms and portions of the text-based lyrics may be skipped in order to avoid failure of the alignment process.
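The “partial flexibility” of step 210 can be sketched as a dynamic program that aligns two token sequences while permitting tokens on either side to be skipped at a fixed cost, rather than failing outright on a mismatch. The sketch below operates on abstract tokens (an assumption; a real implementation would align acoustic frames against phone models), and the cost values are illustrative:

```python
def flexible_align(ref, hyp, skip_cost=1.0):
    """Minimal DP alignment that may skip tokens on either side,
    mirroring the partial flexibility described above.  Returns the
    total cost of the best alignment (0.0 = perfect match)."""
    n, m = len(ref), len(hyp)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * skip_cost
    for j in range(1, m + 1):
        D[0][j] = j * skip_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if ref[i - 1] == hyp[j - 1] else 2 * skip_cost
            D[i][j] = min(D[i - 1][j - 1] + match,
                          D[i - 1][j] + skip_cost,   # skip a lyric token
                          D[i][j - 1] + skip_cost)   # skip an audio token
    return D[n][m]
```

Because skipping carries only a finite cost, a missing or spurious token degrades the score instead of causing the alignment process to fail.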
In step 212, pauses in the aligned output of step 210 are identified and reduced. In one embodiment, pauses are reduced by iteratively cutting the waveforms at increasingly shorter pauses until substantially all of the waveforms are of manageable lengths (e.g., approximately thirty seconds or less).
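The iterative pause-cutting of step 212 can be sketched as follows, assuming pauses are represented as hypothetical (time, duration) tuples; the method cuts at the longest pauses first and stops as soon as every segment is within the length limit:

```python
def segment_at_pauses(total_len, pauses, max_len=30.0):
    """Cut a waveform of total_len seconds at increasingly shorter
    pauses until every segment is at most max_len seconds long.
    `pauses` is a list of (time, duration) tuples; this representation
    is assumed for illustration."""
    cuts = [0.0, total_len]
    for t, _dur in sorted(pauses, key=lambda p: -p[1]):  # longest first
        segments = list(zip(cuts, cuts[1:]))
        if all(b - a <= max_len for a, b in segments):
            break                     # all segments already manageable
        cuts = sorted(set(cuts) | {t})
    return list(zip(sorted(cuts), sorted(cuts)[1:]))
```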
In step 214, the method 200 generates lattices for flexible alignment and then flexibly aligns all of the waveforms using the generated flexible alignment lattices. In one embodiment, flexible alignment lattices are generated for each version of the text-based lyrics that is used in the method 200. In one embodiment, a flexible alignment lattice for a version of the text-based lyrics is generated by processing the version of the text-based lyrics to generate a hypothesis search graph having the following properties: (1) every word is optional; (2) every word is preceded by either an optional “garbage word” or a disfluency (e.g., “um”, “uh”, “hmm”, etc.); and (3) every word is followed by an optional pause of variable length. In one embodiment, the pause is modeled using a pause phone that is trained on background noise.
By making every word in the hypothesis search graph optional, arbitrary amounts of the text-based lyrics can be skipped while still entertaining the possibility of resynchronizing with the waveforms at a later point. By preceding every word in the hypothesis search graph with either a “garbage” word or a disfluency, some of the words that might be omitted from the transcription of the lyrics may be able to be recovered, and out-of-vocabulary words (e.g., words not recognized by an implemented speech recognition system) may be aligned. By following every word in the hypothesis search graph with an optional pause, background noise may be more easily identified and distinguished from the speech to be recognized.
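The three properties of the hypothesis search graph can be sketched as a toy lattice builder in Python. Nodes are integers, edges are (source, destination, label) tuples, a label of None denotes an epsilon (skip) arc, and the `<garbage>` and `<pause>` tokens are illustrative placeholders for the trained garbage and pause models:

```python
def build_flexible_lattice(words, disfluencies=("um", "uh", "hmm")):
    """Build a toy hypothesis graph in which (1) every word is
    optional, (2) every word may be preceded by a garbage word or a
    disfluency, and (3) every word may be followed by a pause of
    variable length (modeled here as a self-loop)."""
    edges, node = [], 0
    for w in words:
        pre, post = node, node + 1
        edges.append((pre, post, "<garbage>"))       # optional garbage word
        for d in disfluencies:
            edges.append((pre, post, d))             # optional disfluency
        edges.append((pre, post, None))              # or no filler at all
        edges.append((post, post + 1, w))            # the word itself
        edges.append((post, post + 1, None))         # the word is optional
        edges.append((post + 1, post + 1, "<pause>"))  # variable-length pause
        node = post + 1
    return edges, node  # edge list and final node id
```

In a real system the lattice would be compiled for a speech recognizer's decoder, and the pause self-loop would invoke a pause phone trained on background noise, as described above.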
The method 200 then proceeds to step 216 and uses the flexible alignment results from step 214 to verify and/or correct the text-based versions of the lyrics. The method 200 terminates in step 218.
Referring back to
In step 112, the method 100 plays the portion of the original musical track containing the non-vocal (e.g., music) elements while simultaneously displaying the corresponding lyrics for the vocal elements (e.g., in text form) in a substantially synchronous manner. In one embodiment, display of the lyrics includes displaying synchronized lyric/word emphasis using the alignment information obtained in step 108. For example, the display may include an indicator that tells a user precisely when and/or for how long the displayed words and/or syllables should be sung or for how long certain notes should be held (e.g., such as a “follow the bouncing ball” indicator).
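Using the alignment information of step 108, the synchronized word emphasis described above reduces to a lookup of which word spans the current playback time. A minimal sketch, assuming the alignment is a time-sorted list of (start, end, word) tuples:

```python
import bisect

def word_at_time(alignment, t):
    """Return the word to emphasize at playback time t, or None
    during an instrumental gap.  `alignment` is assumed to be a
    list of (start, end, word) tuples sorted by start time, as
    produced by the alignment step."""
    starts = [a[0] for a in alignment]
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and alignment[i][0] <= t < alignment[i][1]:
        return alignment[i][2]
    return None
```

A display loop would call this function on each refresh and highlight the returned word (e.g., to drive a “follow the bouncing ball” indicator).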
In one embodiment, the method 100 proceeds to optional step 114 (illustrated in phantom), where the method 100 calculates and displays a score assessing the user's performance (e.g., singing along to the original musical track elements played and displayed in step 112). In one embodiment, calculation of a user's performance score includes comparing one or more parameters of the user's performance to corresponding parameters of the original musical track. In one embodiment, these parameters include timing (e.g., comparing duration patterns using time-mediated alignment of the user's vocals with the vocal elements of the original musical track), pitch, vocal clarity, and pronunciation.
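As one illustration of pitch-based scoring, the user's fundamental-frequency (f0) contour may be compared frame-by-frame against that of the original vocal track, counting a frame as correct when the pitch falls within a tolerance measured in semitones. The sketch below assumes f0 contours are already extracted (with 0 denoting unvoiced frames); the scoring rule is illustrative:

```python
import math

def pitch_score(user_f0, ref_f0, tol_semitones=1.0):
    """Score pitch accuracy as the fraction of frames in which the
    user's f0 is within tol_semitones of the reference f0.  Frames
    where either contour is unvoiced (f0 == 0) are skipped."""
    voiced = [(u, r) for u, r in zip(user_f0, ref_f0) if u > 0 and r > 0]
    if not voiced:
        return 0.0
    hits = sum(abs(12 * math.log2(u / r)) <= tol_semitones
               for u, r in voiced)
    return hits / len(voiced)
```

Analogous comparisons over duration patterns (via time-mediated alignment), clarity, and pronunciation may be combined into the overall score.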
In another embodiment, the method 100 calculates a word and sentence pronunciation score from a word-by-word pronunciation match comparing the user's lyrics as uttered/sung against a native speaker model or against the vocal elements of the original musical track. In one embodiment, scoring of a user's performance based on pronunciation may be executed in accordance with any of the methods described in commonly assigned U.S. Pat. No. 6,055,498 (issued Apr. 25, 2000 to Neumeyer et al.) and U.S. Pat. No. 6,226,611 (issued May 1, 2001 to Neumeyer et al.).
In another embodiment, the method 100 may incorporate cepstral information in step 114 in order to provide the user with an indication of a known singer whose performance the user's performance most closely resembles (e.g., “You sound like Madonna”).
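One simple way to realize such a comparison is nearest-neighbor matching between a cepstral summary of the user's performance and stored summaries for known singers. The sketch below assumes each performance has already been reduced to a mean cepstral feature vector by a front end not shown here; the data structures are illustrative:

```python
import math

def closest_singer(user_vec, singer_models):
    """Return the known singer whose mean cepstral vector is nearest
    (by Euclidean distance) to the user's.  `singer_models` maps a
    singer name to a feature vector; the features are assumed to come
    from a cepstral (e.g., MFCC) front end not shown here."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(singer_models, key=lambda s: dist(user_vec, singer_models[s]))
```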
In one embodiment, the score provided to the user in step 114 is a single metric representing an overall assessment of the user's performance (e.g., a cumulative or aggregated assessment of one or more of the parameters discussed above). In another embodiment, the calculated score breaks the user's performance into segments and assesses these segments individually (e.g., “In the first segment your pitch was perfect, but in the nth segment your pitch deviated from the original musical track”).
In one embodiment, scoring in accordance with step 114 is provided after a user completes his or her performance. However, in an alternative embodiment, scoring in accordance with step 114 is provided in real time, e.g., as the user performs. Real-time feedback enables a user to adjust his or her performance in order to attempt to achieve a desired score or result.
The method 100 terminates in step 116.
The method 100 thus may be implemented to transform virtually any existing computing device into a karaoke system capable of adapting original musical tracks for karaoke use. Moreover, the method 100 may be implemented to “score” a user's performance based on a comparison to the original musical track. Thus, the method 100 enables an existing computing device to perform advanced karaoke functions without the need to purchase additional hardware or dedicated machinery.
Those skilled in the art will appreciate that although the present invention has been described within the exemplary context of a karaoke application, the methods of the present invention may also be implemented for use in conjunction with any application that requires the synchronized broadcast of an audio or video signal with text transcription (e.g., closed captioning).
Alternatively, the karaoke adaptation module 305 can be represented by one or more software applications (such as shareware, or even a combination of software and hardware, e.g., using Application Specific Integrated Circuits (ASICs)), where the software is loaded from a storage medium (e.g., I/O devices 306) and operated by the processor 302 in the memory 304 of the general purpose computing device 300. Thus, in one embodiment, the karaoke adaptation module 305 for adapting original musical tracks described herein with reference to the preceding Figures can be stored on a computer readable medium or carrier (e.g., RAM, magnetic or optical drive or diskette, and the like).
Thus, the present invention represents a significant advancement in the field of karaoke. A method and apparatus are provided that allow a user to transform virtually any computing device into a karaoke machine. Moreover, the method and apparatus of the present invention allow a user to transform virtually any original music track into a track that is usable for karaoke purposes (e.g., comprising displayable lyrics synchronized with a playable musical track). The present invention therefore enhances the karaoke capabilities of an existing computing device without the need to purchase additional hardware or dedicated machinery.
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.