1. Field of the Invention
The present invention generally relates to retrieving information from a musical selection. More specifically, the present invention relates to identifying the compositional structure of a musical selection thereby allowing for musical search, recommendation, and social co-creation efforts.
2. Description of the Related Art
Music formats have evolved since the introduction of the phonograph in the late 1800s. The phonograph gave way to the gramophone, which in turn led to vinyl records that remain popular today. Vinyl was followed by the 8-track tape, the compact cassette, compact discs, and eventually mini-discs and MP3s. The change in music formats has been especially dramatic over the last twenty years, with a variety of download, music locker, subscription, and streaming services having come to market.
Technology has unquestionably driven these format changes. This is especially true with respect to the most recent wave of digital content. But the same technologies that have spearheaded the drastic evolution of musical format and delivery remain woefully deficient with respect to knowing what is actually in a musical selection.
Identifying information about music is relatively simple. Data concerning lyricists, instrumentalists, producers, labels, and studios is readily available to the listening public. But this information is nothing more than metadata: data about music. Knowledge of that information is unlikely to contribute in any meaningful way to an understanding of what makes for an enjoyable listening experience.
For example, a listener may not necessarily like a particular music track simply because it was written or produced by the same artist. Consider the English rock band “Radiohead” and its lead singer Thom Yorke. Thom Yorke also has a solo musical endeavor known as “Atoms for Peace.” Simply because a listener enjoys “Radiohead” does not automatically equate to an enjoyment of “Atoms for Peace” even though the two musical acts share a lead singer.
A listener is more likely to enjoy a particular musical track because of the intangible creative contributions that a particular musician, lyricist, or producer makes to the music. For example: in what key is a particular song written? At what tempo is the song performed? Does the song use a particular instrument or instrumentation? Is the music written in a particular genre? What is the harmonic structure of a particular musical selection?
These nuanced questions concern the fundamental makeup of music at a compositional level. The answers to these questions might help explain why the same listener might enjoy a particular musical track by the aforementioned band “Radiohead” while at the same time enjoying tracks by a dance pop artist such as Britney Spears. But even so-called industry leaders in digital music have no ability to identify the compositional elements of a piece of music.
For example, the online music service Pandora takes songs one-by-one and rates them according to various non-compositional metrics. Pandora then recommends songs with similar ratings to users who have shown a proclivity for songs with those ratings. The Echo Nest, which is now a part of Spotify, identifies high-spending users and records data related to plays and skips by those users in order to build taste profiles. The Echo Nest/Spotify then makes recommendations to other users having similar profiles. Both services, and many others like them, lack the nuanced attention to (and subsequent identification of) details concerning musical contours, labeling, and compositional DNA. Existing services and methodologies simply look at musical content as singular jumbles of sound and rely upon the aforementioned musical track metadata.
There is a need in the art for identifying and retrieving the compositional elements of a musical selection.
A first claimed embodiment of the present invention is a method for musical information retrieval. The method includes receiving a musical contribution, extracting musical information, and encoding the extracted musical information in a symbolic abstraction layer.
Embodiments of the present invention allow for identifying and retrieving the compositional elements of a music selection—music information retrieval (MIR). Through the use of machine learning and data science, hyper-customized user experiences may be created. By applying MIR to machine learning metrics, users can discover and enjoy new music from new artists and content producers. Similarly, record labels can market and sell music more accurately and effectively. MIR can also contribute to a new scale of music production that is built on an understanding of why a listener actually wants the music that they do rather than marketing a musical concept or artist without real regard for the performed content.
In this context, audio is received to allow for the retrieval and extraction of musical information. Information corresponding to a melody, such as pitch, duration, velocity, volume, onsets and offsets, beat, and timbre, is extracted. A similar retrieval of musical information occurs in the context of rhythmic taps whereby beats and a variety of onsets are identified. This musical information may then be used to identify particular musical tastes and search for content that corresponds to identified musical tastes. Similar processes may be utilized to aid in the generation of collaborative social co-creations of musical content.
For example, hardware device 100 may be utilized to implement musical information retrieval. Hardware device 100 might also be used for composition and production. Composition, production, and rendering may occur on a separate hardware device 100 or could be implemented as a part of a single hardware device 100. Composition, production, and rendering may be individually or collectively software driven, part of an application specific hardware design implementation, or a combination of the two.
Hardware device 100 as illustrated in FIG. 1 includes a processor 110 and memory 120 as well as mass storage 130, portable storage 140, one or more input devices 150, a display system 170, and peripherals 180.
The aforementioned components of FIG. 1 may be communicatively coupled to one another by way of one or more buses or other data pathways.
Mass storage 130 may be implemented as tape libraries, RAID systems, hard disk drives, solid-state drives, magnetic tape drives, optical disk drives, and magneto-optical disc drives. Mass storage 130 is non-volatile in nature such that it does not lose its contents should power be discontinued. Mass storage 130 is non-transitory although the data and information maintained in mass storage 130 may be received or transmitted utilizing various transitory methodologies. Information and data maintained in mass storage 130 may be utilized by processor 110 or generated as a result of a processing operation by processor 110. Mass storage 130 may store various software components necessary for implementing one or more embodiments of the present invention by allowing for the loading of various modules, instructions, or other data components into memory 120.
Portable storage 140 is inclusive of any non-volatile storage device that may be introduced to and removed from hardware device 100. Such introduction may occur through one or more communications ports, including but not limited to serial, USB, FireWire, Thunderbolt, or Lightning. While portable storage 140 serves a similar purpose as mass storage 130, mass storage device 130 is envisioned as being a permanent or near-permanent component of the device 100 and not intended for regular removal. Like mass storage device 130, portable storage device 140 may allow for the introduction of various modules, instructions, or other data components into memory 120.
Input devices 150 provide one or more portions of a user interface and are inclusive of keyboards, pointing devices such as a mouse, a trackball, stylus, or other directional control mechanism, including but not limited to touch screens. Various virtual reality or augmented reality devices may likewise serve as input device 150. Input devices may be communicatively coupled to the hardware device 100 utilizing one or more of the exemplary communications ports described above in the context of portable storage 140.
Display system 170 is any output device for presentation of information in visual or occasionally tactile form (e.g., for those with visual impairments). Display devices include but are not limited to plasma display panels (PDPs), liquid crystal displays (LCDs), and organic light-emitting diode displays (OLEDs). Other display systems 170 may include surface-conduction electron-emitter displays (SEDs), laser TV, carbon nanotube displays, quantum dot displays, and interferometric modulator displays (IMODs). Display system 170 may likewise encompass virtual or augmented reality devices as well as touch screens that might similarly allow for input and/or output as described above.
Peripherals 180 are inclusive of the universe of computer support devices that might otherwise add additional functionality to hardware device 100 and not otherwise specifically addressed above. For example, peripheral device 180 may include a modem, wireless router, or other network interface controller. Other types of peripherals 180 might include webcams, image scanners, or microphones, although a microphone might in some instances be considered an input device.
The system infrastructure 200 of FIG. 2 includes a front end application 210, an API server 220, one or more messaging servers 230, a database 240, a composition server 250, a production server 260, and an autoscaler 290.
The front end application 210 provides an interface to allow users to introduce musical contributions. Such contributions may occur on a mobile device as might be common amongst amateur or non-professional content creators. Contributions may also be provided at a professional workstation or server system executing an enterprise version of the application 210. The front end application 210 connects to the API server 220 over a communication network that may be public, proprietary, or a combination of the foregoing. Said network may be wired, wireless, or a combination of the foregoing.
The API server 220 is a standard hypertext transfer protocol (HTTP) server that can handle API requests from the front end application 210. The API server 220 listens for and responds to requests from the front end application 210, including but not limited to musical contributions. Upon receipt of a contribution, a job or “ticket” is created that is passed to the messaging servers 230.
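By way of a non-limiting illustration, the following Python sketch shows how such an endpoint might accept a musical contribution and create a ticket. The Flask framework, route, field names, and the enqueue_ticket helper are assumptions for illustration rather than part of the disclosed system.

```python
# Illustrative sketch of an API endpoint that accepts a musical contribution
# and creates a processing "ticket"; Flask, the route, and all field names
# are assumptions, not part of the disclosed system.
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/contributions", methods=["POST"])
def receive_contribution():
    audio = request.files["audio"]          # uploaded melodic or rhythmic audio
    user_id = request.form.get("user_id")   # contributor identifier

    ticket = {
        "ticket_id": str(uuid.uuid4()),
        "user_id": user_id,
        "filename": audio.filename,
        "task": "extract_musical_information",
    }
    enqueue_ticket(ticket)                  # hand off to the messaging server (see the sketch below)
    return jsonify({"ticket_id": ticket["ticket_id"]}), 202
```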
Messaging server 230 is an advanced message queuing protocol (AMQP) message broker that allows for communication between the various back-end components of the system infrastructure via message queues. Multiple messaging servers may be run using an autoscaler 290 to ensure messages are handled with minimized delay.
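A minimal sketch of how a ticket might be published to such a queue using the pika AMQP client follows; the broker address and queue name are illustrative assumptions.

```python
# Illustrative sketch of publishing a ticket to an AMQP queue with pika;
# broker host and queue name are assumptions.
import json

import pika

def enqueue_ticket(ticket: dict, queue: str = "composition_tickets") -> None:
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=queue, durable=True)        # queue survives broker restarts
    channel.basic_publish(
        exchange="",
        routing_key=queue,
        body=json.dumps(ticket),
        properties=pika.BasicProperties(delivery_mode=2),   # persistent message
    )
    connection.close()
```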
Database 240 provides storage for system infrastructure 200. Database 240 maintains instances of musical contributions from various users. Musical contributions may be stored on web accessible storage services such as Amazon AWS Simple Storage Service (AWS S3), with database 240 storing web accessible addresses to sound and other data files corresponding to those musical contributions. Database 240 may also maintain user information, including but not limited to user profiles, data associated with those profiles (such as user tastes, search preferences, and recommendations), information concerning genres, compositional grammar rules and styles as might be used by composition server 250, and instrumentation information as might be utilized by production server 260.
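The following sketch illustrates one way a contribution's sound file might be placed on AWS S3 with only its web accessible address retained for database 240; the bucket name and key layout are assumptions.

```python
# Sketch of storing a contribution's sound file on AWS S3 and keeping only the
# web-accessible address for the database; bucket and key naming are assumptions.
import boto3

def store_contribution(local_path: str, user_id: str, bucket: str = "musical-contributions") -> str:
    key = f"{user_id}/{local_path.rsplit('/', 1)[-1]}"
    boto3.client("s3").upload_file(local_path, bucket, key)
    url = f"https://{bucket}.s3.amazonaws.com/{key}"
    # The database row would reference this URL rather than the raw audio bytes.
    return url
```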
Composition server 250 “listens” for tickets that are queued by messaging server 230 and maintained by database 240 and that reflect the need for execution of the composition and production processes. Composition server 250 maintains a composition module that is executed to generate a musical blueprint in the context of a given musical genre for rendering to sound data by the production server 260. The composition server 250 will then create rendering tickets on the messaging server 230. The production server 260 retrieves tickets for rendering and the score or blueprint as generated through the execution of the composition module and applies instrumentation to the same. The end result of the composition process is maintained in database 240.
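A minimal sketch of a worker that listens for queued tickets, in the manner described for composition server 250, follows; the queue name and callback behavior are illustrative assumptions.

```python
# Illustrative sketch of a worker that "listens" for queued tickets, as the
# composition server does; queue name and handler contents are assumptions.
import json

import pika

def handle_ticket(channel, method, properties, body):
    ticket = json.loads(body)
    # ... execute the composition module against the stored contribution here ...
    channel.basic_ack(delivery_tag=method.delivery_tag)   # acknowledge completed work

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="composition_tickets", durable=True)
channel.basic_consume(queue="composition_tickets", on_message_callback=handle_ticket)
channel.start_consuming()
```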
System infrastructure 200 of FIG. 2 may be utilized to implement the melodic musical information retrieval method illustrated in FIG. 3. At step 310, a melodic musical contribution such as a hummed or sung melody is received by way of a microphone or other audio receiving device.
The microphone or audio receiving device may be integrated with or coupled to a hardware device like that illustrated in FIG. 1.
If necessary, the application executes in step 320 to provide for the transmission of information to a computing device like hardware device 100 of FIG. 1.
Upon receipt of the melodic musical contribution, the hardware device 100 or a mobile device with similar processing capabilities executes extraction software at step 330. Execution of the extraction or composition software extracts various elements of musical information from the melodic utterance. This information might include, but is not limited to, pitch, duration, velocity, volume, onsets and offsets, beat, and timbre. The extracted information is encoded into a symbolic data layer at step 340.
Musical information is extracted from the melodic musical utterance in step 330 to allow the computation of various audio features that are subsequently or concurrently encoded in step 340. Extraction may occur through the use of certain commercially available extraction tools like the Melodia Vamp plug-in. Melodia estimates the pitch of the melody in a polyphonic or monophonic musical contribution. An algorithm estimates the fundamental frequency of the contribution by estimating when the melody is and is not present (i.e., voicing detection) and the pitch of the melody when it is determined to in fact be present.
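By way of illustration, melody extraction of this kind might be performed with the Essentia Python bindings, which wrap the same Melodia algorithm; the file name and frame and hop sizes below are assumptions.

```python
# Illustrative melody extraction using Essentia's implementation of the Melodia
# algorithm; file name, sample rate, and frame/hop sizes are assumptions.
import essentia.standard as es

audio = es.MonoLoader(filename="contribution.wav", sampleRate=44100)()
audio = es.EqualLoudness()(audio)                       # recommended pre-filter for Melodia

melodia = es.PredominantPitchMelodia(frameSize=2048, hopSize=128)
pitch_hz, pitch_confidence = melodia(audio)             # per-frame f0 estimate and confidence

# Frames returned with zero pitch are treated as "melody not present" (voicing detection).
voiced = pitch_hz > 0
```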
The accuracy or confidence measure of any pitch determination, especially when multiple pitch candidates are present, may alternatively or further be adjudged through the use of YIN. YIN is an algorithm that estimates fundamental frequency and is based on various auto-correlation methodologies. YIN utilizes a signal model that may be extended to handle various forms of aperiodicity.
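A minimal sketch of YIN-style fundamental frequency estimation with per-frame voicing probabilities follows, here using the probabilistic YIN implementation in the librosa library as a stand-in; the frequency range is an assumption.

```python
# Sketch of YIN-style f0 estimation with per-frame voicing probabilities, using
# librosa's probabilistic YIN as a stand-in; the frequency range is an assumption.
import librosa

y, sr = librosa.load("contribution.wav", sr=None)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C6"),
    sr=sr,
)
# voiced_prob can serve as the confidence measure for each frame's pitch candidate.
```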
Music information retrieval and extraction may also involve the use of the Essentia open source library. Essentia is a library of reusable algorithms that implement audio input/output functionality, standard digital signal processing blocks, statistical characterization of data, and large sets of spectral, temporal, tonal, and high-level music descriptors. Essentia may also be used to compute high-level descriptions of music through generation of classification models.
Extraction of musical information from the melodic signal in step 330 may occur in the context of uniform 12 millisecond frames. While other frame lengths may be utilized in the extraction process at step 330, the use of uniform frames allows for quantization of a sequence of features along with the aforementioned fundamental frequency and confidence values. In parallel with the quantization is the computation of loudness and beat values. Individual notes may also be identified by extracting patterns in the music via Markov chains. The note information and beat detection may then be realigned as necessary to translate notes and timing information into both absolute time and musical time.
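A minimal sketch of the Markov-chain idea, counting transitions between successive quantized pitches and normalizing them into probabilities, might look as follows; the pitch sequence shown is hypothetical.

```python
# Minimal sketch of modeling note patterns with a first-order Markov chain;
# the quantized pitch sequence is a hypothetical product of the extraction step.
from collections import defaultdict

pitches = [60, 62, 64, 62, 60, 62, 64, 64]          # quantized MIDI pitches (assumed)

counts = defaultdict(lambda: defaultdict(int))
for current, following in zip(pitches, pitches[1:]):
    counts[current][following] += 1                  # count each observed transition

# Normalize counts into transition probabilities per source pitch.
transition_model = {
    pitch: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
    for pitch, nexts in counts.items()
}
```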
Absolute time is that time affected by tempo. For example, certain events may occur sooner or later dependent upon the speed or pace of a given piece of music. A particular note value (such as a quarter note) is specified as the beat and the amount of time between successive beats is a specified fraction of a minute (e.g., 120 beats per minute). Musical time is that time identified by a measure and a beat. For example, measure two, beat two. Absolute time in comparison to musical time can be reflected as seconds versus metered bars and beats.
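The translation between the two time bases can be illustrated with a short sketch; the tempo and meter values are assumptions.

```python
# Sketch of translating between absolute time (seconds) and musical time
# (measure, beat) for a fixed tempo and meter; values are illustrative.
def to_musical_time(seconds: float, bpm: float = 120.0, beats_per_measure: int = 4):
    total_beats = seconds * bpm / 60.0
    measure = int(total_beats // beats_per_measure) + 1    # measures counted from 1
    beat = (total_beats % beats_per_measure) + 1           # beats counted from 1
    return measure, beat

def to_absolute_time(measure: int, beat: float, bpm: float = 120.0, beats_per_measure: int = 4):
    total_beats = (measure - 1) * beats_per_measure + (beat - 1)
    return total_beats * 60.0 / bpm

# Example: at 120 beats per minute in 4/4, measure two, beat two begins 2.5 seconds in.
```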
The foregoing extracted musical information is reflected as a tuple, an ordered list of elements in which an n-tuple represents a sequence of n elements (n being a non-negative integer), as used in relation to the semantic web. Tuples are usually written by listing elements within parentheses and separated by commas (e.g., (2, 7, 4, 1, 7)). The tuples are static in size, with the same number of properties per note. Tuples are then migrated into the symbolic layer at step 340.
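By way of illustration, a sequence of such static-size note tuples might be represented as follows; the particular field ordering is an assumption.

```python
# Sketch of the static-size, per-note tuples described above; the field ordering
# (onset in beats, duration in beats, MIDI pitch, velocity, confidence) is assumed.
notes = [
    (0.0, 1.0, 60, 96, 0.91),
    (1.0, 0.5, 62, 90, 0.88),
    (1.5, 0.5, 64, 92, 0.85),
]

# Every tuple carries the same number of properties per note, so the sequence can
# be migrated into the symbolic layer as a uniform structure.
```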
The symbolic layer into which extracted musical information is encoded allows for the flexible representation of audio information as it transitions from the audible analog domain to the digital data domain. In this regard, the symbolic layer pragmatically operates as sheet music. While MIDI-like in nature, the symbolic layer of the presently disclosed invention is not limited to or dependent upon MIDI (Musical Instrument Digital Interface). MIDI is a technical standard allowing for electronic musical instruments and computing devices to communicate with one another. MIDI uses event messages to specify notation, pitch, and velocity; control parameters corresponding to volume and vibrato; and clock signals that synchronize tempo. The symbolic layer of the present invention operates in a fashion similar to MIDI; the symbolic layer represents music as machine-readable information.
Through use of this symbolic layer, other software modules and processing routines are able to utilize retrieved musical information for the purpose of applying compositional rules, instrumentation, and ultimately rendering of content for playback in the case of social co-creation of music. Such further utilization or processing takes place at step 350 and will vary depending on the particular intent as to the future use of any musical contribution. Music content may ultimately be passed as an actual MIDI file. For the purposes of using musical information retrieval to generate a subsequent composition process, the abstract symbolic layer is passed rather than a production file.
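A minimal sketch of passing content as an actual MIDI file, using the mido library as an illustrative tool, follows; the note list, tempo, and tick resolution are assumptions.

```python
# Sketch of rendering symbolic-layer notes to an actual MIDI file with mido;
# the note list, tempo, and tick resolution are assumptions.
import mido

notes = [(1.0, 60, 96), (0.5, 62, 90), (0.5, 64, 92)]    # (duration in beats, MIDI pitch, velocity)

midi = mido.MidiFile(ticks_per_beat=480)
track = mido.MidiTrack()
midi.tracks.append(track)
track.append(mido.MetaMessage("set_tempo", tempo=mido.bpm2tempo(120)))

for duration_beats, pitch, velocity in notes:             # consecutive, non-overlapping notes
    track.append(mido.Message("note_on", note=pitch, velocity=velocity, time=0))
    track.append(mido.Message("note_off", note=pitch, velocity=0,
                              time=int(duration_beats * midi.ticks_per_beat)))

midi.save("contribution.mid")
```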
A similar process applies to rhythmic musical contributions, such as a series of taps, which may likewise be received by way of a microphone or other audio receiving device as illustrated in FIG. 4. Upon receipt of the rhythmic musical contribution, hardware device 100 executes extraction or composition software at step 430 to extract various musical data features. This information might include, but is not limited to, high frequency content, spectral flux, and spectral difference. The extracted information is encoded into the symbolic layer at step 440; extraction of this information may take place through the use of the Essentia library as described above. Extracted information may be made available for further use at step 450. Such further uses may be similar to, in some instances identical to, or performed in conjunction with those described with respect to step 350 of FIG. 3.
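By way of illustration, frame-wise onset detection of this kind might be computed with the Essentia library as follows; the frame sizes, detection methods, and weights are assumptions.

```python
# Sketch of frame-wise onset detection with Essentia, combining HFC- and
# flux-based detection functions; frame sizes and weights are assumptions.
import essentia
import essentia.standard as es

audio = es.MonoLoader(filename="taps.wav", sampleRate=44100)()

windowing = es.Windowing(type="hann")
fft = es.FFT()
c2p = es.CartesianToPolar()
od_hfc = es.OnsetDetection(method="hfc")
od_flux = es.OnsetDetection(method="flux")

pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(windowing(frame)))
    pool.add("odf.hfc", od_hfc(magnitude, phase))
    pool.add("odf.flux", od_flux(magnitude, phase))

# Peak-pick the weighted detection functions into onset times (in seconds).
onsets = es.Onsets()(essentia.array([pool["odf.hfc"], pool["odf.flux"]]), [1, 1])
```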
High frequency content is a measure taken across a signal spectrum such as a short-term Fourier transform. This measure can be used to characterize the amount of high-frequency content in a signal by adding the magnitudes of the spectral bins while multiplying each magnitude by the bin position proportional to frequency as follows:

HFC = \sum_{k=0}^{N-1} k \, |X(k)|^{2}

where X(k) is a discrete spectrum with N unique points. Through the extraction of high frequency content, musical information concerning onset detection may be extracted.
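A minimal numpy sketch of this measure, computed per frame of a magnitude spectrogram, follows; the array layout is an assumption.

```python
# Sketch of the high frequency content measure over STFT frames; the magnitude
# spectrogram S is assumed to have shape (bins, frames).
import numpy as np

def high_frequency_content(S: np.ndarray) -> np.ndarray:
    k = np.arange(S.shape[0])                               # bin index, proportional to frequency
    return np.sum(k[:, None] * np.abs(S) ** 2, axis=0)      # one HFC value per frame
```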
Spectral flux is a measure of change in the power spectrum of a signal as calculated by comparing the power spectrum of one frame against the frame immediately prior. Spectral flux can be used to determine the timbre of an audio signal. Spectral flux may also be used for onset detection.
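A corresponding sketch of spectral flux follows, using a half-wave rectified magnitude difference as one common variant; the array layout is again an assumption.

```python
# Sketch of spectral flux: the half-wave rectified change in the spectrum from
# each frame to the one immediately prior; S is assumed to have shape (bins, frames).
import numpy as np

def spectral_flux(S: np.ndarray) -> np.ndarray:
    diff = np.diff(np.abs(S), axis=1)      # change relative to the immediately prior frame
    diff = np.maximum(diff, 0.0)           # count only increases in energy (useful for onsets)
    return diff.sum(axis=0)                # one flux value per frame transition
```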
Spectral differencing is a methodology for detecting downbeats in musical audio given a sequence of beat times. A robust downbeat extractor is useful in the context of music information retrieval. Downbeat extraction through spectral differencing allows for rhythmic pattern analysis for genre classification, the indication of likely temporal boundaries for structural audio segmentation, and otherwise improves the robustness of beat tracking.
Music information retrieval related to high frequency content, spectral flux, and spectral difference is used to answer a simple question: “is there a tap or some other rhythmic downbeat present?” If music information extraction indicates the answer to be yes, an examination of the types of sounds—or tap polyphony—that generated a given tap or downbeat is undertaken. For example, a tap or downbeat might be grouped into one of several sound classes such as a tap on a table, a tap on a chair, a tap on the human body, and so forth. Information related to duration or pitch is of lesser or no value. Information concerning onset, class, velocity, and loudness may be encoded into a tuple that is, in turn, integrated into the symbolic layer.
In a further embodiment of the present invention, a de-noising operation may take place using source separation algorithms. By executing and applying such an algorithm, random characteristics that do not match the overall input may be identified and removed from the audio sample. For example, a musical contribution might be interrupted by a ringing doorbell or a buzz saw. These anomalies would present as inconsistent with onsets in the case of a rhythmic tap or a fundamental frequency (or at least a confident one) in the case of a melodic contribution. Source separation might also be utilized to identify and differentiate between various contributors, humming modes or styles, as well as singing. Source separation might, in this regard, be used to refine note extraction and identify multiple melodic streams.
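By way of illustration only, a simple harmonic-percussive separation (here via librosa) can stand in for such a source separation step; it is not the specific algorithm of the present disclosure.

```python
# Illustrative stand-in only: harmonic-percussive separation used as a simple
# source-separation de-noising step; not the specific algorithm of the disclosure.
import librosa

y, sr = librosa.load("contribution.wav", sr=None)
harmonic, percussive = librosa.effects.hpss(y)

# For a melodic contribution, the sustained harmonic component retains the hummed
# or sung melody, which helps separate it from broadband or transient interference;
# for a rhythmic contribution the percussive component serves the analogous role.
```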
Another embodiment might utilize evaluation scripts to aid in learning and training of a musical information retrieval package. Users could manually annotate musical contributions such that the script may score the accuracy of characterization of various elements of musical information including but not limited to frequency and notation accuracy, tempo, and identification of onsets or downbeats.
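Such a script might, for example, score onset accuracy against manual annotations using the mir_eval library; the file names and tolerance window below are assumptions.

```python
# Sketch of an evaluation script scoring onset detection against manual
# annotations with mir_eval; file names and tolerance window are assumptions.
import mir_eval
import numpy as np

reference = np.loadtxt("annotated_onsets.txt")     # manually annotated onset times (seconds)
estimated = np.loadtxt("extracted_onsets.txt")     # onsets produced by the MIR package

f_measure, precision, recall = mir_eval.onset.f_measure(reference, estimated, window=0.05)
print(f"onset F-measure: {f_measure:.3f} (precision {precision:.3f}, recall {recall:.3f})")
```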
The foregoing detailed description has been presented for purposes of illustration and description. The foregoing description is not intended to be exhaustive or to limit the present invention to the precise form disclosed. Many modifications and variations of the present invention are possible in light of the above description. The embodiments described were chosen in order to best explain the principles of the invention and its practical application to allow others of ordinary skill in the art to best make and use the same. The specific scope of the invention shall be limited by the claims appended hereto.
The present application is a continuation-in-part and claims the priority benefit of U.S. patent application Ser. No. 14/920,846 filed Oct. 22, 2015, which claims the priority benefit of U.S. provisional application No. 62/067,012 filed Oct. 22, 2014; the present application is also a continuation-in-part and claims the priority benefit of U.S. patent application Ser. No. 14/931,740 filed Nov. 3, 2015, which claims the priority benefit of U.S. provisional application No. 62/074,542 filed Nov. 3, 2014; the present application claims the priority benefit of U.S. provisional application No. 62/075,176 filed Nov. 4, 2014. The disclosure of each of the aforementioned applications is incorporated herein by reference.