1. Field of the Invention
Implementations described herein relate generally to parsing of electronic media and, more particularly, to the deconstructing of an electronic media stream into human recognizable portions.
2. Description of Related Art
Existing techniques for parsing audio streams are either frequency-based or word-based. Frequency-based techniques interpret an audio stream based on a series of concurrent waveforms representing vibration frequencies that produce sound. This waveform analysis can be considered longitudinal in the sense that each second of audio will have multiple frequencies. Word-based techniques interpret an audio stream like spoken word commands in which an attempt is made to automatically distinguish lyrics as streams of text.
Neither technique is sufficient to adequately distinguish an electronic media stream into human recognizable portions.
According to one aspect, a method may include training a model to identify portions of electronic media streams based on attributes of the electronic media streams; inputting an electronic media stream into the model; and identifying, by the model, portions of the electronic media stream.
According to another aspect, a method may include training a model to identify human recognizable labels for portions of electronic media streams based on at least one of attributes of the electronic media streams, feature information associated with the electronic media streams, or information regarding other portions within the electronic media streams; identifying portions of an electronic media stream; inputting the electronic media stream and information regarding the identified portions into the model; and determining, by the model, human recognizable labels for the identified portions.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, explain the invention. In the drawings,
The following detailed description of the invention refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements. Also, the following detailed description does not limit the invention.
As used herein, “electronic media” may refer to different forms of audio and video information, such as radio, sound recordings, television, video recording, and streaming Internet content. The description to follow will describe electronic media in terms of audio information, such as an audio stream or file. It should be understood that the description may equally apply to other forms of electronic media, such as video streams or files.
Once the portions of the audio stream have been identified, a label may be associated with each of the portions. For example, a portion at the beginning of the audio stream may be labeled the intro, a portion that generally includes sound within the vocal frequency that may include the same or similar chord progression with slightly different lyrics as another portion may be labeled the verse, a portion that repeats with generally the same lyrics may be labeled the chorus, a portion that occurs somewhere within the audio stream other than the beginning or end with possibly different vocal and/or instrumental frequencies than the verses or chorus may be labeled the bridge, and a portion at the end of the audio stream that may trail off of the last chorus may be the outro.
The labels may be stored with their associated audio stream as metadata. The labels may be useful in a number of ways. For example, the labels may be used for intelligently selecting audio clips, intelligent skipping, searching the audio stream, metadata prediction, and clustering. Intelligently selecting audio clips might identify a portion of the audio stream, such as the chorus, to serve as a representation of the audio stream. Intelligent skipping might provide a better user experience when the user is listening to the audio stream by permitting the user to skip forward (or backward) to the beginning of the next (or previous) portion.
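By way of illustration only, intelligent skipping might be sketched as follows, assuming the break points are available as a sorted list of portion start times in seconds. The function names `skip_forward` and `skip_backward` and the list representation are hypothetical and are not part of the described implementation.

```python
from bisect import bisect_right


def skip_forward(break_points, position):
    """Return the start time of the next portion, or None if there is none.

    break_points: sorted portion start times in seconds (assumed format).
    position: current playback position in seconds.
    """
    i = bisect_right(break_points, position)
    return break_points[i] if i < len(break_points) else None


def skip_backward(break_points, position):
    """Return the start time of the previous portion, or None if there is none."""
    # Index of the portion that contains the current position.
    current = bisect_right(break_points, position) - 1
    return break_points[current - 1] if current >= 1 else None
```

In this sketch, a listener at 0:25 in a stream whose portions start at 0:00, 0:18, and 0:38 would skip forward to 0:38 or backward to 0:00.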
Searching the audio stream may permit the entire portion of the audio stream that contains the searched-for term to be played instead of just the actual occurrence of the searched-for term, which may improve the user's search experience. Metadata prediction may use the labels to predict metadata, such as the genre, associated with the audio stream. For example, certain signatures (e.g., arrangements of the different portions) may be suggestive of certain genres. Clustering may be valuable in identifying similar songs for suggestion to a user. For example, audio streams with similar signatures may be identified as related and associated with a same cluster.
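As a minimal sketch of signature-based clustering, the following groups audio streams whose label sequences are identical. Treating "similar" as "identical" is a simplifying assumption made here for brevity, and the function name `cluster_by_signature` and the dictionary input format are hypothetical.

```python
from collections import defaultdict


def cluster_by_signature(streams):
    """Group stream names by their signature (ordered tuple of portion labels).

    streams: dict mapping a stream name to its list of portion labels
             (assumed format). Exact signature equality stands in for the
             looser notion of similarity described in the text.
    """
    clusters = defaultdict(list)
    for name, labels in streams.items():
        clusters[tuple(labels)].append(name)
    return dict(clusters)
```

A looser similarity measure could be substituted for the exact tuple comparison without changing the overall structure.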
Processor 320 may include a processor, microprocessor, or processing logic that may interpret and execute instructions. Main memory 330 may include a random access memory (RAM) or another type of dynamic storage device that may store information and instructions for execution by processor 320. ROM 340 may include a ROM device or another type of static storage device that may store static information and instructions for use by processor 320. Storage device 350 may include a magnetic and/or optical recording medium and its corresponding drive.
Input device 360 may include a mechanism that permits an operator to input information to device 300, such as a keyboard, a mouse, a pen, voice recognition and/or biometric mechanisms, etc. Output device 370 may include a mechanism that outputs information to the operator, including a display, a printer, a speaker, etc. Communication interface 380 may include any transceiver-like mechanism that enables device 300 to communicate with other devices and/or systems.
As will be described in detail below, audio deconstructor 210, consistent with the principles of the invention, may perform certain audio processing-related operations. Audio deconstructor 210 may perform these operations in response to processor 320 executing software instructions contained in a computer-readable medium, such as memory 330. A computer-readable medium may be defined as a physical or logical memory device and/or carrier wave.
The software instructions may be read into memory 330 from another computer-readable medium, such as data storage device 350, or from another device via communication interface 380. The software instructions contained in memory 330 may cause processor 320 to perform processes that will be described later. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement processes consistent with the principles of the invention. Thus, implementations consistent with the principles of the invention are not limited to any specific combination of hardware circuitry and software.
Label identifier 420 may receive the break point identifiers from portion identifier 410 and determine a label for each of the portions. In one implementation, label identifier 420 may be based on a model that uses a machine learning, statistical, or probabilistic technique to predict a label for each of the portions of the audio stream, which is described in more detail below. The input to the model may include the audio stream with its break point identifiers (which identify the portions of the audio stream) and the output of the model may include the identified portions of the audio stream with their associated labels.
As described above, portion identifier 410 and/or label identifier 420 may be based on models.
As shown in
Portion Model
The training set for the portion model might include human training data and/or audio data. Human operators who are well versed in music might identify the break points between portions of a number of audio streams. For example, human operators might listen to a number of music files or streams and identify the break points among the intro, verse, chorus, bridge, and/or outro. The audio data might include a number of audio streams for which human training data is provided.
Trainer 510 may analyze attributes associated with the audio data and the human training data to form a set of rules for identifying break points between portions of other audio streams. The rules may be used to form the portion model.
Audio data attributes that may be analyzed by trainer 510 might include volume, intensity, patterns, and/or other characteristics of the audio stream that might signify a break point. For example, trainer 510 might determine that a change in volume within an audio stream is an indicator of a break point.
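One way a change in volume might be turned into a candidate break point is sketched below. The sketch assumes the audio is available as a list of numeric samples, measures volume as root-mean-square (RMS) level over fixed-size frames, and flags a frame boundary when the level changes by at least a chosen ratio. The frame size, the ratio, and the function names are illustrative assumptions; a trained portion model would weigh such a signal rather than apply a fixed rule.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def volume_change_candidates(samples, frame_size, ratio=2.0):
    """Return sample offsets where the RMS level changes by at least `ratio`.

    samples: sequence of numeric audio samples (assumed format).
    frame_size: number of samples per non-overlapping frame.
    ratio: hypothetical threshold on louder/quieter level between
           adjacent frames.
    """
    levels = [
        rms(samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    candidates = []
    for k in range(1, len(levels)):
        quieter, louder = sorted((levels[k - 1], levels[k]))
        # Silence followed by sound (or the reverse) always counts as a change.
        if louder > 0 and (quieter == 0 or louder / quieter >= ratio):
            candidates.append(k * frame_size)
    return candidates
```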
Additionally, or alternatively, trainer 510 might determine that a change in level (intensity) for one or more frequency ranges is an indicator of a break point. An audio stream may include multiple frequency ranges associated with, for example, the human vocal frequency range and one or more frequency ranges associated with the instrumental frequencies (e.g., a bass frequency, a treble frequency, and/or one or more mid-range frequencies). Trainer 510 may analyze changes in a single frequency range or correlate changes in multiple frequency ranges as an indicator of a break point.
Additionally, or alternatively, trainer 510 might determine that a change in pattern (e.g., beat pattern) is an indicator of a break point. For example, trainer 510 may analyze a window around each instance (e.g., time point) in the audio stream (e.g., ten seconds prior to and ten seconds after the instance) to compare the beats per second in each frequency range within the window. A change in the beats per second within one or more of the frequency ranges might indicate a break point. In one implementation, trainer 510 may correlate changes in the beats per second for all frequency ranges as an indicator of a break point.
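The window comparison described above might be sketched as follows for a single frequency range. The sketch assumes that beat onset times (in seconds) have already been detected for that range and are sorted; beat detection itself is outside the sketch, and the function names are hypothetical.

```python
from bisect import bisect_left


def beats_per_second(beat_times, start, end):
    """Beats per second within the half-open interval [start, end)."""
    count = bisect_left(beat_times, end) - bisect_left(beat_times, start)
    return count / (end - start)


def tempo_change(beat_times, instant, window=10.0):
    """Absolute difference in beats per second before and after `instant`.

    beat_times: sorted beat onset times in seconds for one frequency
                range (assumed to come from a separate beat detector).
    window: seconds examined on each side of the instant; ten seconds
            follows the example in the text.
    """
    before = beats_per_second(beat_times, instant - window, instant)
    after = beats_per_second(beat_times, instant, instant + window)
    return abs(after - before)
```

The same function could be applied to each frequency range in turn, with the per-range results correlated as described above.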
Trainer 510 may generate rules for the portion model based on one or more of the audio data attributes, such as those identified above. Any of several well known techniques may be used to generate the model, such as logistic regression, boosted decision trees, random forests, support vector machines, perceptrons, and winnow learners. The portion model may determine the probability that an instance in an audio stream is the beginning (or end) of a portion based on one or more audio data attributes associated with the audio stream:
The portion model may generate a “score,” which may include a probability output and/or an output value, for each instance in the audio stream that reflects the probability that the instance is a break point. The highest scores (or scores above a threshold) may be determined to be actual break points in the audio stream. Break point identifiers (e.g., time codes) may be stored for each of the instances that are determined to be break points. Pairs of identifiers (e.g., a time code and the subsequent or preceding time code) may signify the different portions in the audio stream.
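The selection of break points from scores, and the pairing of consecutive break point identifiers into portions, might be sketched as follows. The sketch assumes each instance is represented as a (time code, score) pair with scores between 0 and 1; the threshold value and function names are illustrative assumptions.

```python
def select_break_points(scores, threshold=0.5):
    """Return sorted time codes whose score meets the threshold.

    scores: iterable of (time_code, score) pairs (assumed format).
    threshold: hypothetical cut-off above which an instance is treated
               as an actual break point.
    """
    return sorted(t for t, s in scores if s >= threshold)


def portions_from_break_points(break_points):
    """Pair each break point with the subsequent one to delimit portions."""
    return list(zip(break_points, break_points[1:]))
```

For example, scores of 0.9, 0.8, and 0.95 at time codes 0, 18, and 38 seconds would yield the portions (0, 18) and (18, 38), while lower-scoring instances in between would be ignored.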
The output of the portion model may include break point identifiers (e.g., time codes) relating to the beginning and end of each portion of the audio stream.
Label Model
The training set for the label model might include human training data, audio data, and/or audio feature information (not shown in
The audio feature information might include additional information that may assist in labeling the portions. For example, the audio feature information might include information regarding common portion labels (e.g., intro, verse, chorus, bridge, and/or outro). Additionally, or alternatively, the audio feature information might include information regarding common formats of audio streams (e.g., AABA format, verse-chorus format, etc.). Additionally, or alternatively, the audio feature information might include information regarding common genres of audio streams (e.g., rock, jazz, classical, etc.). The format and genre information, when available, might suggest a signature (e.g., arrangement of the different portions) for the audio streams. A common signature for audio streams belonging to the rock genre, for example, may include the chorus appearing once, followed by the bridge, and then followed by the chorus twice consecutively.
Trainer 510 may analyze attributes associated with the audio streams, the portions identified by the break points, the audio feature information, and the human training data to form a set of rules for labeling portions of other audio streams. The rules may be used to form the label model.
Some of the rules that may be generated for the label model might include:
Trainer 510 may form the label model using any of several well known techniques, such as logistic regression, boosted decision trees, random forests, support vector machines, perceptrons, and winnow learners. The label model may determine the probability that a particular label is associated with a portion in an audio stream based on one or more attributes, audio feature information, and/or information regarding other portions associated with the audio stream:
The label model may generate a “score,” which may include a probability output and/or an output value, for a label that reflects the probability that the label is associated with a particular portion. The highest scores (or scores above a threshold) may be determined to be actual labels for the portions of the audio stream.
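Choosing a label from the scores might be sketched as follows, assuming the label model has already produced, for each portion, a mapping from candidate label to score. The data layout, the optional threshold, and the function name `pick_labels` are hypothetical.

```python
def pick_labels(label_scores, threshold=0.0):
    """Assign each portion its highest-scoring label.

    label_scores: list of (portion, {label: score}) pairs, where portion
                  is a (start, end) pair of time codes (assumed format).
    threshold: hypothetical minimum score; a portion whose best score is
               below it is left unlabeled (None).
    """
    result = []
    for portion, scores in label_scores:
        label, score = max(scores.items(), key=lambda item: item[1])
        result.append((portion, label if score >= threshold else None))
    return result
```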
The output of the label model may include information regarding the portions (e.g., break point identifiers) and their associated labels. This information may be stored as metadata for the audio stream.
The audio stream may be processed to identify portions of the audio stream (block 620). In one implementation, the audio stream may be input into a portion model that is trained to identify the different portions of the audio stream with high probability. For example, the portion model may identify the break points between the different portions of the audio stream based on the attributes associated with the audio stream. The break points may identify where the different portions start and end.
Human recognizable labels may be identified for each of the identified portions (block 630). In one implementation, the audio stream, information regarding the break points, and possibly audio feature information (e.g., genre, format, etc.) may be input into a label model that is trained to identify labels for the different portions of the audio stream with high probability. For example, the label model may analyze the instrumental and vocal frequencies associated with the different portions and relationships between the different portions. Portions that repeat identically might be indicative of the chorus. Portions that contain similar instrumental frequencies but different vocal frequencies might be indicative of verses. A portion that contains different instrumental and vocal frequencies from both the chorus and the verses and occurs at neither the beginning nor the end of the audio stream might be indicative of the bridge. A portion that occurs at the beginning of the audio stream might be indicative of the intro. A portion that occurs at the end of the audio stream might be indicative of the outro.
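The indications listed above might be approximated by hand-written rules, as in the following sketch. It is a stand-in for what a trained label model might learn, not the model itself. It assumes each portion has already been reduced to two comparable fingerprints, one for its instrumental content and one for its vocal content; how such fingerprints are computed is not addressed, and all names are hypothetical.

```python
from collections import Counter


def heuristic_labels(portions):
    """Label portions using repetition and position.

    portions: list of dicts with hashable 'instrumental' and 'vocal'
              fingerprints (assumed format), in playback order.
    """
    whole = Counter((p["instrumental"], p["vocal"]) for p in portions)
    instrumental = Counter(p["instrumental"] for p in portions)
    last = len(portions) - 1
    labels = []
    for i, p in enumerate(portions):
        if whole[(p["instrumental"], p["vocal"])] > 1:
            labels.append("chorus")   # repeats identically
        elif instrumental[p["instrumental"]] > 1:
            labels.append("verse")    # same instruments, different vocals
        elif i == 0:
            labels.append("intro")    # unique portion at the beginning
        elif i == last:
            labels.append("outro")    # unique portion at the end
        else:
            labels.append("bridge")   # unique portion in the middle
    return labels
```

Repetition is checked before position in this sketch so that a stream that opens directly with a verse is not mislabeled as having an intro.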
When information regarding common formats is available, the label model may use the information to improve its identification of labels. For example, the label model may determine whether the audio stream has a signature that appears to match one of the common formats and use the signature associated with a matching common format to assist in the identification of labels for the audio stream. When information regarding genre is available, the label model may use the information to improve its identification of labels. For example, the label model may identify a signature associated with the genre corresponding to the audio stream to assist in the identification of labels for the audio stream.
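Matching a stream's signature against common formats might be sketched as follows, using the sequence-similarity ratio from Python's standard `difflib` module as an assumed measure of how closely two arrangements agree. The format table shown is illustrative only (the label sequences chosen for the AABA and verse-chorus formats are assumptions), and the function name `closest_format` is hypothetical.

```python
from difflib import SequenceMatcher

# Illustrative format table; the label sequences are assumptions.
COMMON_FORMATS = {
    "AABA": ["verse", "verse", "bridge", "verse"],
    "verse-chorus": ["verse", "chorus", "verse", "chorus", "verse", "chorus"],
}


def closest_format(labels, formats=COMMON_FORMATS):
    """Return (format_name, similarity) for the best-matching format.

    labels: the stream's tentative signature as a list of portion labels.
    similarity: difflib's ratio, between 0.0 and 1.0.
    """
    def similarity(item):
        return SequenceMatcher(None, labels, item[1]).ratio()

    name, sequence = max(formats.items(), key=similarity)
    return name, SequenceMatcher(None, labels, sequence).ratio()
```

A label model could then use the signature of the best-matching format as an additional input when a tentative label is uncertain.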
Once labels have been identified for each of the portions of the audio stream, the audio stream may be stored with its break points and labels stored as metadata associated with the audio stream. The audio stream and its metadata may then be used for various purposes, some of which have been described above.
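Storing the break points and labels as metadata might look like the following sketch, which serializes them as JSON. The field names (`portions`, `start`, `end`, `label`) and the choice of JSON are assumptions made for illustration; the description does not prescribe a metadata format.

```python
import json


def build_metadata(break_points, labels):
    """Combine break points and labels into a metadata dictionary.

    break_points: sorted time codes in seconds, including the start of
                  the first portion and the end of the last portion.
    labels: one label per portion, so len(labels) == len(break_points) - 1.
    """
    if len(break_points) != len(labels) + 1:
        raise ValueError("expected one more break point than labels")
    spans = zip(break_points, break_points[1:])
    return {
        "portions": [
            {"start": start, "end": end, "label": label}
            for (start, end), label in zip(spans, labels)
        ]
    }


def to_json(metadata):
    """Serialize the metadata for storage alongside the audio stream."""
    return json.dumps(metadata)
```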
The audio deconstructor may identify labels for the portions of the song based on the attributes associated with the song, information regarding the break points, and possibly audio feature information (e.g., genre, format, etc.). For example, the audio deconstructor may analyze the instrumental and vocal frequencies associated with the different portions and relationships between the different portions. As shown in
The audio deconstructor may output the break points and the labels as metadata associated with the song. In this case, the metadata might indicate that the song begins with verse 1 that occurs until 0:18, followed by the chorus that occurs between 0:18 and 0:38, followed by verse 2 that occurs between 0:38 and 0:58, followed by the chorus that occurs between 0:58 and 1:18, followed by verse 3 that occurs between 1:18 and 1:38, and finally followed by the chorus after 1:38 until the end of the song, as shown in
Implementations consistent with the principles of the invention may generate one or more models that may be used to identify portions of an electronic media stream and/or identify labels for the identified portions.
The foregoing description of preferred embodiments of the invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention.
For example, while a series of acts has been described with regard to
Techniques for deconstructing an electronic media stream have been described above. In addition, or as an alternative, to these techniques, it may be beneficial to detect individual instruments in the electronic media stream. The frequency ranges associated with the instruments may be determined and mapped against expected introduction of the instruments in well known arrangements. If a match with a well known arrangement is found, then information regarding its portions and labels may be used to facilitate identification of the portions and/or labels for the electronic media stream.
While the preceding description focused on deconstructing audio streams, the description may equally apply to deconstruction of other forms of media, such as video streams. For example, the description may be useful for deconstructing music videos and/or other types of video streams based, for example, on the tempo of, or chords present in, their background music.
Moreover, the term “stream” has been used in the description above. The term is intended to mean any form of data whether embodied in a carrier wave or stored as a file in memory.
It will be apparent to one of ordinary skill in the art that aspects of the invention, as described above, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. The actual software code or specialized control hardware used to implement aspects consistent with the principles of the invention is not limiting of the invention. Thus, the operation and behavior of the aspects were described without reference to the specific software code—it being understood that one of ordinary skill in the art would be able to design software and control hardware to implement the aspects based on the description herein.
No element, act, or instruction used in the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. Further, the phrase “based on” is intended to mean “based, at least in part, on” unless explicitly stated otherwise.
This application is a Continuation of U.S. application Ser. No. 11/289,527 filed Nov. 30, 2005, the entire disclosure of which is incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
Parent | 11289527 | Nov 2005 | US |
Child | 12652367 | US |