A mashup is a creative work that is typically created by blending elements from two or more sources. In the context of music, a mashup is generally created by combining the vocal track from one song with the instrumental track from another song, occasionally juxtaposing additional elements or changing the key or tempo. While mashups are a popular form of music creation, they require specialized knowledge of music composition that makes creating them very difficult for most people. For example, to successfully create a mashup one must be able to analyze the key, beat, and structure of a song, know how to separate the vocal and instrumental components, and then mix these components from different songs using the right effects and equalizers.
It is with respect to these and other general considerations that embodiments have been described. Also, although relatively specific problems have been discussed, it should be understood that the embodiments described herein should not be limited to solving the specific problems identified in the background.
Aspects of the present disclosure generally relate to methods, systems, and media for combining audio tracks.
In one aspect, a computer-implemented method for combining audio tracks is provided. A first audio track and a second audio track are received. The first audio track is separated into a vocal component and one or more accompaniment components. The second audio track is separated into a vocal component and one or more accompaniment components. A structure of the first audio track and a structure of the second audio track are determined. The first audio track and the second audio track are aligned based on the determined structures of the tracks. The vocal component of the first audio track is stretched to match a tempo of the second audio track. The stretched vocal component of the first audio track is added to the one or more accompaniment components of the second audio track.
In another aspect, a system for combining audio tracks is provided. The system comprises at least one processor and a memory storing instructions that, when executed by the at least one processor, cause the system to perform a set of operations, the set of operations including: receiving a first audio track and a second audio track; separating the first audio track into a vocal component and one or more accompaniment components; separating the second audio track into a vocal component and one or more accompaniment components; determining a structure of the first audio track and a structure of the second audio track; aligning the first audio track and the second audio track based on the determined structures of the tracks; stretching the vocal component of the first audio track to match a tempo of the second audio track; and adding the stretched vocal component of the first audio track to the one or more accompaniment components of the second audio track.
In yet another aspect, a non-transient computer-readable storage medium is provided. The non-transient computer-readable storage medium comprises instructions that, when executed by one or more processors, cause the one or more processors to: receive a first audio track and a second audio track; separate the first audio track into a vocal component and one or more accompaniment components; separate the second audio track into a vocal component and one or more accompaniment components; determine a structure of the first audio track and a structure of the second audio track; align the first audio track and the second audio track based on the determined structures of the tracks; stretch the vocal component of the first audio track to match a tempo of the second audio track; and add the stretched vocal component of the first audio track to the one or more accompaniment components of the second audio track.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Non-limiting and non-exhaustive examples are described with reference to the following Figures.
In the following detailed description, references are made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments or examples. These aspects may be combined, other aspects may be utilized, and structural changes may be made without departing from the present disclosure. Embodiments may be practiced as methods, systems, or devices. Accordingly, embodiments may take the form of a hardware implementation, an entirely software implementation, or an implementation combining software and hardware aspects. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims and their equivalents.
The present disclosure describes various examples of a computing device having an audio processor configured to create a new musical track that is a mashup of different, pre-existing audio tracks, such as, for example, musical tracks. In some examples, the audio processor can process and utilize a variety of information types. For example, the audio processor may be configured to process various types of audio signals or tracks, such as mixed original audio signals that include both a vocal component and an accompaniment (e.g., background instrumental) component, where the vocal component includes vocal content and the accompaniment component includes instrumental content (e.g., such as musical instrument content). In one example, the audio processor can separate each audio track into the different sources or components of audio, including, for example, a vocal component and one or more accompaniment components. Such accompaniment components of an audio track may include, for example, drums, bass, and the like.
In some examples, the audio processor can use song or track segmentation information and/or segment label information in the process of creating a mashup. For example, the audio processor can identify music theory labels for audio tracks. Non-overlapping segments within the audio tracks are labeled beforehand with suitable music theory labels. In some examples, the music theory labels correspond to music theory structures, such as introduction (“intro”), verse, chorus, bridge, outro, or other suitable labels. In other examples, the music theory labels correspond to non-structural music theory elements, such as vibrato, harmonics, chords, etc. In still other examples, the music theory labels correspond to key signature changes, tempo changes, etc. In some examples, the audio processor identifies music theory labels for segments that overlap, such as labels for key signatures, tempo changes, and structures (e.g., intro, verse, chorus).
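By way of a non-limiting illustration, the structural labels described above could be represented in code as a simple enumeration; the Python representation below is an assumption for readability, not a requirement of this disclosure.

    # A minimal sketch of the structural music theory labels discussed
    # above; the enumeration itself is illustrative, not mandated.
    from enum import Enum

    class MusicTheoryLabel(Enum):
        INTRO = "intro"
        VERSE = "verse"
        CHORUS = "chorus"
        BRIDGE = "bridge"
        OUTRO = "outro"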
In at least one embodiment, the system for combining audio tracks allows a user to select (e.g., input, designate, etc.) any two songs and the system will automatically create and output a mashup of the two songs. The system may also enable a user to play an interactive role in the mashup creation process, in an embodiment. In one example, the system may generate a visualization of the songs selected by the user, display the visualization via a user interface, and permit the user to make selections and/or adjustments to various characteristics of the songs during the process of creating the mashup. In this manner, the system allows users to create customized mashups of audio tracks.
This and many further embodiments for a computing device are described herein. For instance,
The computing device 110 may be any type of computing device, including a smartphone, mobile computer or mobile computing device (e.g., a Microsoft® Surface® device, a laptop computer, a notebook computer, a tablet computer such as an Apple iPad™, a netbook, etc.), or a stationary computing device such as a desktop computer or PC (personal computer). The computing device 110 may be configured to communicate with a social media platform, cloud processing provider, software as a service provider, or other suitable entity, for example, using social media software and a suitable communication network. The computing device 110 may be configured to execute one or more software applications (or “applications”) and/or services and/or manage hardware resources (e.g., processors, memory, etc.), which may be utilized by users of the computing device 110.
Computing device 110 comprises an audio processor 111, in an embodiment. In the example shown in
The source processor 112 is configured to separate an audio track into different sources or components of audio that make up the track. For example, the source processor 112 may receive an audio track and separate the audio track into a vocal component and one or more accompaniment components such as drums, bass, and various other instrumental accompaniments.
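As a hedged sketch, the separation performed by the source processor 112 could be realized with an off-the-shelf source-separation library such as the open-source Spleeter package; the library choice and file names below are assumptions, not elements of this disclosure.

    # A sketch of source separation using Spleeter (assumed library).
    # The '4stems' model yields vocals, drums, bass, and other accompaniment.
    from spleeter.separator import Separator

    separator = Separator('spleeter:4stems')
    # Writes vocals.wav, drums.wav, bass.wav, and other.wav under output/song_a/
    separator.separate_to_file('song_a.mp3', 'output/')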
The boundary processor 114 is configured to generate segment boundary identifications within audio portions. For example, the boundary processor 114 may receive audio portions and identify boundaries within the audio portions that correspond to changes in a music theory label. Generally, the boundaries identify non-overlapping segments within a song or excerpt having a particular music theory label. As an example, an audio portion with a duration of 24 seconds may begin with a four-second intro, followed by an eight-second verse, then a ten-second chorus, and a two-second verse (e.g., a first part of a verse). In this example, the boundary processor 114 may generate segment boundary identifications at 4 seconds, 12 seconds, and 22 seconds. In some examples, the boundary processor 114 communicates with a neural network model or other suitable model to identify the boundaries within an audio track.
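Using the 24-second example above, the boundary processor's output might be represented as follows; the tuple form is an illustrative assumption.

    # The 24-second example expressed as (start, end, label) tuples in
    # seconds; the representation is illustrative, not mandated.
    segments = [
        (0.0, 4.0, "intro"),
        (4.0, 12.0, "verse"),
        (12.0, 22.0, "chorus"),
        (22.0, 24.0, "verse"),
    ]
    # Segment boundary identifications fall where the label changes.
    boundaries = [end for (_start, end, _label) in segments[:-1]]  # [4.0, 12.0, 22.0]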
The segment processor 116 is configured to generate music theory label identifications for audio portions. In various examples, the music theory label identifications may be selected from a plurality of music theory labels. In some examples, at least some of the plurality of music theory labels denote a structural element of music. Examples of music theory labels may include introduction (“intro”), verse, chorus, bridge, instrumental (e.g., guitar solo or bass solo), outro, silence, or other suitable labels. In some examples, the segment processor 116 identifies a probability that a particular audio portion, or a section or timestamp within the particular audio portion, corresponds to a particular music theory label from the plurality of music theory labels. In other examples, the segment processor 116 identifies a most likely music theory label for the particular audio portion (or the section or timestamp within the particular audio portion). In still other examples, the segment processor 116 identifies start and stop times within the audio portion for when the music theory labels are active. In some examples, the segment processor 116 communicates with a neural network model or other suitable model to generate the music theory label identifications.
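As a toy illustration of the probabilistic output described above (all values invented):

    # Invented per-label probabilities for one audio portion, illustrating
    # both the probabilistic output and the most-likely-label output.
    label_probs = {"intro": 0.05, "verse": 0.20, "chorus": 0.70,
                   "bridge": 0.03, "outro": 0.02}
    most_likely = max(label_probs, key=label_probs.get)  # "chorus"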
The beat processor 118 is configured to analyze the beat of an audio track and detect beat and downbeat timestamps within the audio track.
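One possible realization of the beat processor 118 uses the beat tracker in the librosa library; the library is an assumed implementation choice, and downbeat detection would require an additional model that this sketch omits.

    # A sketch of beat detection with librosa (assumed library); it
    # estimates tempo and beat timestamps but not downbeats.
    import librosa

    y, sr = librosa.load("song_b.mp3")
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    print(f"Estimated tempo: {float(tempo):.1f} BPM")
    print(f"First beat timestamps (s): {beat_times[:4]}")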
Data store 120 may include one or more of any type of storage mechanism, including a magnetic disc (e.g., in a hard disk drive), an optical disc (e.g., in an optical disk drive), a magnetic tape (e.g., in a tape drive), a memory device such as a RAM device, a ROM device, etc., and/or any other suitable type of storage medium. The data store 120 may store source audio 130 (e.g., audio tracks for user selection), for example. In some examples, the data store 120 provides the source audio 130 to the audio processor 111 for analysis and mashup. In some examples, one or more data stores 120 may be co-located (e.g., housed in one or more nearby buildings with associated components such as backup power supplies, redundant data communications, environmental controls, etc.) to form a datacenter, or may be arranged in other manners. Accordingly, in an embodiment, one or more of data stores 120 may be a datacenter in a distributed collection of datacenters.
Source audio 130 includes a plurality of audio tracks, such as songs, portions or excerpts from songs, etc. As used herein, an audio track may be a single song that contains several individual tracks, such as a guitar track, a drum track, a vocals track, etc., or may include only one track that is a single instrument or input, or a mixed track having multiple sub-tracks. Generally, the plurality of audio tracks within the source audio 130 are labeled with music theory labels for non-overlapping segments within the audio tracks. In some examples, different groups of audio tracks within the source audio 130 may be labeled with different music theory labels. For example, one group of audio tracks may use five labels (e.g., intro, verse, pre-chorus, chorus, outro), while another group uses seven labels (e.g., silence, intro, verse, refrain, bridge, instrumental, outro). Some groups may allow for segment sub-types (e.g., verse A, verse B) or compound labels (e.g., instrumental chorus). In some examples, the audio processor 111 is configured to convert labels among audio tracks from the different groups to use a same plurality of music theory labels.
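The label conversion mentioned above might reduce to a simple lookup, as in the following sketch; the label names and mappings are hypothetical.

    # A toy normalization table mapping one group's vocabulary onto a
    # shared set of music theory labels (hypothetical label names).
    CANONICAL = {"refrain": "chorus"}

    def normalize(label: str) -> str:
        # Labels without an entry are assumed already in the shared set.
        return CANONICAL.get(label, label)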
Network 140 may comprise one or more networks such as local area networks (LANs), wide area networks (WANs), personal area networks (PANs), enterprise networks, the Internet, or any combination thereof, and may include one or more of wired and/or wireless portions. Computing device 110 and data store 120 may include at least one wired or wireless network interface that enables communication with each other (or an intermediate device, such as a Web server or database server) via network 140. Examples of such a network interface include but are not limited to an IEEE 802.11 wireless LAN (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, or a near field communication (NFC) interface.
In some examples, the audio processor 111 may take the received audio tracks (e.g., song A 204A and song B 204B) and perform various analyses on the audio tracks, including, for example, source separation 206, structure analysis 208, and beat detection 210. In one example, the audio processor 111 may perform these analyses by employing one or more music information retrieval algorithms. Such music information retrieval algorithms may be implemented, for example, by one or more of the source processor 112, the boundary processor 114, the segment processor 116, and the beat processor 118 of the audio processor 111. Each of source separation 206, structure analysis 208, and beat detection 210 is further illustrated in
In source separation 206, the source audio 204 received by the audio processor 111 is analyzed and separated into different audio components that make up each of song A 204A and song B 204B, in an embodiment. In one example, each of song A 204A and song B 204B may be analyzed by the source processor 112 to separate the vocal components of the songs from the accompaniment components of the songs.
Using the outputs from the source separation 206 and the structure analysis 208, chorus extraction 212 may be performed.
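A minimal sketch of chorus extraction 212, assuming structure data in the (start, end, label) form used earlier and a source-separated stem as a NumPy array:

    # Chorus extraction 212 as a sketch: slice out the spans labeled
    # "chorus" from a source-separated stem and concatenate them.
    import numpy as np

    def extract_chorus(stem: np.ndarray, sr: int, segments) -> np.ndarray:
        parts = [stem[int(start * sr):int(end * sr)]
                 for (start, end, label) in segments if label == "chorus"]
        return np.concatenate(parts) if parts else np.zeros(0, dtype=stem.dtype)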
In one embodiment, once the structure and beat of the audio tracks are analyzed in the structure analysis 208 and beat detection 210, respectively, an audio stretch 214 may be applied to the vocal component of one of the audio tracks so that the vocal component matches the tempo of the other audio track. For example, the vocal component of song A 204A may undergo audio stretching 214 to match the tempo of song B 204B, where the tempo of song B 204B may be determined (e.g., estimated) based on data about the beat of song B 204B generated from the beat detection 210.
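As a hedged sketch, the audio stretch 214 could use librosa's phase-vocoder time stretch, with the stretch rate derived from the two tempo estimates; the helper name and signature are assumptions.

    # One way to realize the audio stretch 214 (assumed implementation).
    import librosa

    def stretch_vocal_to_tempo(vocal_a, tempo_a: float, tempo_b: float):
        # rate > 1 speeds audio up, so matching song A's vocal to song B's
        # tempo uses rate = tempo_b / tempo_a.
        return librosa.effects.time_stretch(vocal_a, rate=tempo_b / tempo_a)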
Following the audio stretching 214, the stretched vocal component of one of the audio tracks (e.g., song A 204A) may be combined with the one or more accompaniment components of the other audio track (e.g., song B 204B) during audio mixing 216.
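The audio mixing 216 might then reduce to a stem sum such as the following sketch; the gain and peak-normalization choices are illustrative assumptions.

    # A minimal sketch of audio mixing 216: sum song A's stretched vocal
    # with song B's accompaniment stems, then peak-normalize if needed.
    import numpy as np

    def mix_stems(vocal: np.ndarray, accompaniments, vocal_gain: float = 1.0):
        n = max(len(vocal), *(len(stem) for stem in accompaniments))
        out = np.zeros(n, dtype=np.float32)
        out[:len(vocal)] += vocal_gain * vocal
        for stem in accompaniments:
            out[:len(stem)] += stem
        peak = float(np.max(np.abs(out)))
        return out / peak if peak > 1.0 else out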
As shown in the example data flow 300, audio data is both the input and the output of the source separation 206. For example, the source separation 206 is performed on the source audio 204 to generate source-separated audio 302, which may include song A source-separated audio 304 and song B source-separated audio 310. In the example illustrated, song A source-separated audio 304 includes a vocal component 306 and at least three accompaniment components 308, namely, a drum component 308A, a bass component 308B, and one or more other instrumental components 308C. The song B source-separated audio 310 also includes a vocal component 312 and at least three accompaniment components 314, which may be a drum component 314A, a bass component 314B, and one or more other instrumental components 314C.
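For illustration, the source-separated audio 302 could be held in a simple container such as the following; this representation is assumed, not prescribed.

    # An illustrative container for one song's source-separated components
    # (vocals 306/312, drums 308A/314A, bass 308B/314B, other 308C/314C).
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class SourceSeparatedAudio:
        vocals: np.ndarray
        drums: np.ndarray
        bass: np.ndarray
        other: np.ndarray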
As shown in the example data flow 400, the output of the structure analysis 208 is data about the structure of the audio tracks. For example, the structure analysis 208 is performed on the source audio 204 to generate structure data 402, which may include song A structure data 404 and song B structure data 406. In one embodiment, the audio processor 111 (e.g., the boundary processor 114 and/or the segment processor 116) is configured to receive the source audio 204 and generate music theory label identifications and segment boundary identifications. For example, the boundary processor 114 may be configured to generate segment boundary identifications within audio portions of each of song A 204A and song B 204B, and the segment processor 116 may be configured to generate music theory label identifications for segments identified by the segment boundary identifications, in an embodiment. In the example shown in
As shown in the example data flow 500, the output of the beat detection 210 is data about the beat of the audio tracks. For example, the beat detection 210 is performed on the source audio 204 to generate beat data 502, which may include song A beat data 504 and song B beat data 506.
In the example visualizations 700A and 700B, the vocal component of song A is visualized by sections 704A, 704B, 704C, and 704D, and beats 706, while the accompaniment component of song B is visualized by sections 708A, 708B, 708C, and 708D, and beats 710. In an example scenario, if a user wishes to align section 704C of the song A vocal component with section 708A of the song B accompaniment component, the user may interact (e.g., via a graphical user interface) with the visualization 700A by dragging the song B accompaniment component so that those two sections are aligned, as shown in the visualization 700B.
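The drag interaction above amounts to applying a time offset to song B's accompaniment; a toy helper, using the (start, end, label) tuple form from the earlier sketches, might be:

    # The offset (in seconds) applied to song B so one of its sections
    # (e.g., 708A) starts when a chosen song A section (e.g., 704C) does.
    def alignment_offset(section_a, section_b) -> float:
        return section_a[0] - section_b[0]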
Method 800 begins with step 802. At step 802, a first audio track and a second audio track are received. The first and second audio tracks may correspond to song A 204A and song B 204B in
At step 804, the first audio track may be separated into a vocal component and one or more accompaniment components. In one example, the one or more accompaniment components may include a drum component, a bass component, and one or more other instrumental components of the first audio track.
At step 806, the second audio track may be separated into a vocal component and one or more accompaniment components. In one example, the one or more accompaniment components may include a drum component, a bass component, and one or more other instrumental components of the second audio track.
At step 808, a structure of the first audio track and a structure of the second audio track may be determined. In some examples, step 808 may include identifying segments within the first audio track and segments within the second audio track, and identifying music theory labels for the identified segments within the first audio track and for the identified segments within the second audio track.
At step 810, the first audio track and the second audio track may be aligned based on the determined structures. In one example, the first audio track and the second audio track may be aligned based on the identified segments and music theory labels for the first audio track and the second audio track (which may be identified at step 808).
At step 812, the vocal component of the first audio track may be stretched to match a tempo of the second audio track. In one example, the stretching at step 812 includes detecting beat and downbeat timestamps for the first audio track and for the second audio track, and estimating the tempo of the second audio track based on the detected beat and downbeat timestamps for the second audio track.
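For instance, once beat timestamps are detected, the tempo estimate could be taken from the median inter-beat interval; this heuristic is an assumption for illustration.

    # Estimating tempo (BPM) from detected beat timestamps via the median
    # inter-beat interval (requires at least two beats).
    import numpy as np

    def tempo_from_beats(beat_times: np.ndarray) -> float:
        intervals = np.diff(beat_times)
        return 60.0 / float(np.median(intervals))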
At step 814, the stretched vocal component of the first audio track may be added to the one or more accompaniment components of the second audio track.
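Putting the steps together, the following hedged end-to-end sketch uses librosa and soundfile (assumed libraries); it presumes the separated stems from steps 804 and 806 were already written to disk, for example by the separator sketched earlier, and it omits the structure-based alignment of step 810 for brevity.

    # A hedged end-to-end sketch of method 800 operating on pre-separated stems.
    import librosa
    import numpy as np
    import soundfile as sf

    def simple_mashup(vocal_a_path: str, accomp_b_path: str,
                      out_path: str = "mashup.wav") -> None:
        # Step 802 (receive): load song A's vocal stem and song B's accompaniment.
        vocal_a, sr = librosa.load(vocal_a_path, sr=None, mono=True)
        accomp_b, _ = librosa.load(accomp_b_path, sr=sr, mono=True)

        # Step 812: beat-track each input, then stretch A's vocal to B's tempo.
        tempo_a, _ = librosa.beat.beat_track(y=vocal_a, sr=sr)
        tempo_b, _ = librosa.beat.beat_track(y=accomp_b, sr=sr)
        stretched = librosa.effects.time_stretch(
            vocal_a, rate=float(tempo_b) / float(tempo_a))

        # Step 814: add the stretched vocal to B's accompaniment and write out.
        n = min(len(stretched), len(accomp_b))
        out = stretched[:n] + accomp_b[:n]
        peak = float(np.max(np.abs(out)))
        sf.write(out_path, out / peak if peak > 1.0 else out, sr)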
The operating system 905, for example, may be suitable for controlling the operation of the computing device 900. Furthermore, embodiments of the disclosure may be practiced in conjunction with a graphics library, other operating systems, or any other application program and are not limited to any particular application or system. This basic configuration is illustrated in
As stated above, a number of program modules and data files may be stored in the system memory 904. While executing on the processing unit 902, the program modules 906 (e.g., audio track mashup application 920) may perform processes including, but not limited to, the aspects, as described herein. Other program modules that may be used in accordance with aspects of the present disclosure, and in particular for combining audio tracks, may include source processor 921, boundary processor 922, segment processor 923, and beat processor 924.
Furthermore, embodiments of the disclosure may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors. For example, embodiments of the disclosure may be practiced via a system-on-a-chip (SOC) where each or many of the components illustrated in
The computing device 900 may also have one or more input device(s) 912 such as a keyboard, a mouse, a pen, a sound or voice input device, a touch or swipe input device, etc. The output device(s) 914 such as a display, speakers, a printer, etc. may also be included. The aforementioned devices are examples and others may be used. The computing device 900 may include one or more communication connections 916 allowing communications with other computing devices 950. Examples of suitable communication connections 916 include, but are not limited to, radio frequency (RF) transmitter, receiver, and/or transceiver circuitry; universal serial bus (USB), parallel, and/or serial ports.
The term computer readable media as used herein may include computer storage media. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, or program modules. The system memory 904, the removable storage device 909, and the non-removable storage device 910 are all computer storage media examples (e.g., memory storage). Computer storage media may include RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture which can be used to store information and which can be accessed by the computing device 900. Any such computer storage media may be part of the computing device 900. Computer storage media does not include a carrier wave or other propagated or modulated data signal.
Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.
One or more application programs 1166 may be loaded into the memory 1162 and run on or in association with the operating system 1164. Examples of the application programs include phone dialer programs, e-mail programs, personal information management (PIM) programs, word processing programs, spreadsheet programs, Internet browser programs, messaging programs, and so forth. The system 1102 also includes a non-volatile storage area 1168 within the memory 1162. The non-volatile storage area 1168 may be used to store persistent information that should not be lost if the system 1102 is powered down. The application programs 1166 may use and store information in the non-volatile storage area 1168, such as email or other messages used by an email application, and the like. A synchronization application (not shown) also resides on the system 1102 and is programmed to interact with a corresponding synchronization application resident on a host computer to keep the information stored in the non-volatile storage area 1168 synchronized with corresponding information stored at the host computer.
The system 1102 has a power supply 1170, which may be implemented as one or more batteries. The power supply 1170 may further include an external power source, such as an AC adapter or a powered docking cradle that supplements or recharges the batteries.
The system 1102 may also include a radio interface layer 1172 that performs the function of transmitting and receiving radio frequency communications. The radio interface layer 1172 facilitates wireless connectivity between the system 1102 and the “outside world,” via a communications carrier or service provider. Transmissions to and from the radio interface layer 1172 are conducted under control of the operating system 1164. In other words, communications received by the radio interface layer 1172 may be disseminated to the application programs 1166 via the operating system 1164, and vice versa.
The visual indicator 1120 may be used to provide visual notifications, and/or an audio interface 1174 may be used for producing audible notifications via an audio transducer (e.g., audio transducer 1025 illustrated in
A mobile computing device 1000 implementing the system 1102 may have additional features or functionality. For example, the mobile computing device 1000 may also include additional data storage devices (removable and/or non-removable) such as magnetic disks, optical disks, or tape. Such additional storage is illustrated in
Data/information generated or captured by the mobile computing device 1000 and stored via the system 1102 may be stored locally on the mobile computing device 1000, as described above, or the data may be stored on any number of storage media that may be accessed by the device via the radio interface layer 1172 or via a wired connection between the mobile computing device 1000 and a separate computing device associated with the mobile computing device 1000, for example, a server computer in a distributed computing network, such as the Internet. As should be appreciated, such data/information may be accessed by the mobile computing device 1000 via the radio interface layer 1172 or via a distributed computing network. Similarly, such data/information may be readily transferred between computing devices for storage and use according to well-known data/information transfer and storage means, including electronic mail and collaborative data/information sharing systems.
As should be appreciated,
The description and illustration of one or more aspects provided in this application are not intended to limit or restrict the scope of the disclosure as claimed in any way. The aspects, examples, and details provided in this application are considered sufficient to convey possession and enable others to make and use the best mode of the claimed disclosure. The claimed disclosure should not be construed as being limited to any aspect, example, or detail provided in this application. Regardless of whether shown and described in combination or separately, the various features (both structural and methodological) are intended to be selectively included or omitted to produce an embodiment with a particular set of features. Having been provided with the description and illustration of the present application, one skilled in the art may envision variations, modifications, and alternate aspects falling within the spirit of the broader aspects of the general inventive concept embodied in this application that do not depart from the broader scope of the claimed disclosure.