Neural networks are machine learning models that employ one or more layers of models to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
A system includes a computer including a processor and a memory. The memory includes instructions such that the processor is programmed to receive an audio input representing a percussion performed by a user and classify, at a trained neural network, the audio input as a particular musical type.
In other features, the trained neural network maps audio input sequences representing the percussion to target musical instrument sequences.
In other features, the trained neural network comprises a convolutional neural network.
In other features, the convolutional neural network comprises at least one dropout layer.
In other features, a number of inputs within an input layer of the convolutional neural network is equal to a number of recorded sounds corresponding to the audio input.
In other features, the number of inputs comprises at least two, wherein a first input of the at least two inputs corresponds to audio input representing a kick and a second input of the at least two inputs corresponds to audio input representing a snare.
In other features, the processor is further programmed to receive the audio input from a microphone.
In other features, the processor is further programmed to perform audio envelope detection on the audio input prior to classification of the audio input.
In other features, the trained neural network is configured to transform the audio input into corresponding images, wherein the trained neural network is configured to classify the corresponding images into a particular musical type.
In other features, the trained neural network is configured to use Mel-Frequency Cepstral Coefficient (MFCC) feature extraction layers to transform the audio input into the corresponding images.
A method is disclosed that includes receiving an audio input representing a percussion performed by a user and classifying, at a trained neural network, the audio input as a particular musical type.
In other features, the trained neural network maps audio input sequences representing the percussion to target musical instrument sequences.
In other features, the trained neural network comprises a convolutional neural network.
In other features, the convolutional neural network comprises at least one dropout layer.
In other features, a number of inputs within an input layer of the convolutional neural network is equal to a number of recorded sounds corresponding to the audio input.
In other features, the number of inputs comprises at least two, wherein a first input of the at least two inputs corresponds to audio input representing a kick and a second input of the at least two inputs corresponds to audio input representing a snare.
In other features, the method further includes receiving the audio input from a microphone.
In other features, the method further includes performing audio envelope detection on the audio input prior to classification of the audio input.
In other features, the method further includes transforming the audio input into corresponding images, wherein the trained neural network is configured to classify the corresponding images into a particular musical type.
In other features, the trained neural network is configured to use Mel-Frequency Cepstral Coefficient (MFCC) feature extraction layers to transform the audio input into the corresponding images.
Further areas of applicability will become apparent from the description provided herein. It should be understood that the description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
Vocal percussion, e.g., beatboxing, can involve mimicking one or more musical instruments, such as a drum machine, via a user's mouth, lips, tongue, and/or voice. The vocal percussion may correspond to a particular type of drum sound, such as output corresponding to a bass drum, output corresponding to a snare drum, output corresponding to a hi-hat, output corresponding to a closed hat, or the like.
The trained neural network 120 is a system that can be trained end-to-end to map audio input sequences to target musical instrument sequences. The input sequence can be a sequence of vectors that each represent a different frame of audio data (e.g., representing 25 milliseconds of audio, or another amount of audio). Each input vector can indicate audio input for the corresponding time period of an audio segment. The output can be a prediction representing a classification of the corresponding audio input. In an example implementation, the trained neural network 120 may generate a prediction representing a classification of the vocal percussion. For example, the trained neural network 120 may receive audio input representing a vocal percussion and may output a prediction indicating that the audio input is classified as corresponding to a snare drum audio sound, a bass drum audio sound, or the like.
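By way of non-limiting illustration, the framing of an audio segment into a sequence of input vectors might be sketched as follows. The function name `frame_audio`, the 16 kHz sample rate, and the choice of non-overlapping frames are assumptions made for this sketch only; the description above specifies only that each vector may represent a frame of audio such as 25 milliseconds.

```python
def frame_audio(samples, sample_rate, frame_ms=25.0):
    """Split a mono signal into consecutive, non-overlapping frames.

    Each returned frame is one input vector covering `frame_ms`
    milliseconds of audio. Non-overlapping frames are an assumption.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Drop any trailing partial frame so every vector has the same length.
    num_frames = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len])
            for i in range(num_frames)]


# One second of silence at an assumed 16 kHz sample rate.
frames = frame_audio([0.0] * 16000, sample_rate=16000)
```

At the assumed sample rate, each 25 millisecond frame holds 400 samples, so one second of audio yields 40 input vectors.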
In some example implementations, the user device 160 may provide the audio input to the computing system 110 for classification. In these implementations, the computing system 110 maintains the neural network 120 and provides the classification to the user device 160 via the network 170. In other example implementations, the trained neural network 120 may reside in the memory of the user device 160 such that the audio input classification may be accomplished locally.
Based on the audio classification, a processor of the user device 160 and/or the computing system 110 may generate digital data representing musical notes and/or musical patterns. In an example implementation, the digital data may comprise Musical Instrument Digital Interface (MIDI) data. The digital data can be provided to a musical instrument 180, and the musical instrument 180 can produce audio corresponding to the digital data. For example, the trained neural network 120 generates an audio classification for audio input provided by a user. Based on the audio classification, digital data representing the musical notes and/or musical patterns corresponding to the audio classification is generated, and the musical instrument 180 generates audio that corresponds to the digital data. In an example implementation, the audio input may comprise percussive vocalized sounds, audio input representing percussion performed with a user's appendages, e.g., hands or feet, musical notes produced by a user through humming or singing, and/or output note velocity. The corresponding generated audio may comprise audio signals representing percussive sounds, i.e., drum sounds, musical notes, and/or audio representing a musical note velocity.
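By way of non-limiting illustration, the generation of MIDI data from an audio classification might be sketched as follows. The label strings, the function name `classification_to_midi`, and the default velocity are hypothetical. The note numbers and the use of channel 10 follow the General MIDI percussion convention, which is an assumption here because the description above does not specify a mapping.

```python
# General MIDI percussion key numbers; the label strings are hypothetical.
GM_DRUM_NOTES = {"kick": 36, "snare": 38, "closed_hat": 42, "open_hat": 46}

NOTE_ON = 0x90    # status nibble for a note-on message
NOTE_OFF = 0x80   # status nibble for a note-off message
DRUM_CHANNEL = 9  # General MIDI reserves channel 10 (index 9) for percussion


def classification_to_midi(label, velocity=100):
    """Return (note_on, note_off) MIDI messages for a classified sound."""
    if not 1 <= velocity <= 127:
        raise ValueError("velocity must be in the range 1..127")
    note = GM_DRUM_NOTES[label]
    note_on = bytes([NOTE_ON | DRUM_CHANNEL, note, velocity])
    note_off = bytes([NOTE_OFF | DRUM_CHANNEL, note, 0])
    return note_on, note_off


snare_on, snare_off = classification_to_midi("snare")
```

The resulting three-byte messages could then be provided to an instrument that accepts MIDI input; how the bytes are transported is outside the scope of this sketch.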
The nodes 202 are sometimes referred to as artificial neurons 202, because they are designed to emulate biological, e.g., human, neurons. A set of inputs (represented by the arrows) to each neuron 202 are each multiplied by respective weights. The weighted inputs can then be summed in an input function to provide, possibly adjusted by a bias, a net input. The net input can then be provided to an activation function, which in turn provides a connected neuron 205 an output. The activation function can be a variety of suitable functions, typically selected based on empirical analysis. As illustrated by the arrows in
The DNN 200 can accept audio as input and generate one or more outputs, or predictions, based on the input. As discussed below, the predictions can comprise classifications of the audio input. The DNN 200 can be trained with ground truth data, i.e., data about a real-world condition or state. For example, the DNN 200 can be trained with ground truth data or updated with additional data by a processor of the computing system 110. Weights can be initialized by using a Gaussian distribution, for example, and a bias for each node 202 can be set to zero. Training the DNN 200 can include updating weights and biases via suitable techniques such as back-propagation with optimizations. Ground truth data can include, but is not limited to, classification labels corresponding to particular audio input.
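By way of non-limiting illustration, a single node with Gaussian-initialized weights, a zero bias, and one gradient-based update might be sketched as follows. The sigmoid activation, the cross-entropy loss, the learning rate, the standard deviation, and all function names are assumptions for this sketch; the description above leaves the activation function and the optimization technique open.

```python
import math
import random


def init_neuron(num_inputs, std=0.1, seed=0):
    """Draw weights from a Gaussian distribution and set the bias to zero."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, std) for _ in range(num_inputs)]
    return weights, 0.0


def neuron_output(weights, bias, inputs):
    """Weighted sum of inputs plus bias, passed through a sigmoid."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))


def train_step(weights, bias, inputs, target, lr=0.5):
    """One gradient-descent update for a binary cross-entropy loss."""
    out = neuron_output(weights, bias, inputs)
    error = out - target  # gradient of the loss w.r.t. the net input
    new_weights = [w - lr * error * x for w, x in zip(weights, inputs)]
    return new_weights, bias - lr * error


# Fit one labeled example (ground truth label 1.0) for 200 updates.
weights, bias = init_neuron(2)
initial_bias = bias
example, label = [1.0, 0.5], 1.0
before = neuron_output(weights, bias, example)
for _ in range(200):
    weights, bias = train_step(weights, bias, example, label)
after = neuron_output(weights, bias, example)
```

In a multi-layer network the same error term would be propagated backward through each layer; this sketch shows only the update for a single node.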
The convolution layers 310 may include one or more convolutional filters, which are applied to the audio input data 340 to generate an output 335. While
The convolutional neural network 300 may also include one or more fully connected layers 325 (FC1 and FC2). The convolutional neural network 300 may further include a logistic regression (LR) layer 330 and one or more dropout layers (DL) 333. Between each layer 310, 315, 320, 325, 330, 333 of the convolutional neural network 300 are weights that can be updated. The output of each of the layers (e.g., 310, 315, 320, 325, 330, 333) may serve as an input of a succeeding one of the layers (e.g., 310, 315, 320, 325, 330, 333) in the convolutional neural network 300 to learn object detections from the audio input data provided at the first of the convolution blocks 305A. The output 335 of the convolutional neural network 300 can represent a classification prediction based on the audio input data 340. The classification prediction can be defined as a probability of an audio input corresponding to a particular percussion sound, e.g., drum sound, kick sound, snare sound, etc.
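By way of non-limiting illustration, the application of a convolutional filter and the conversion of class scores into a probability per percussion sound might be sketched as follows. The one-dimensional filter, the softmax normalization, the class labels, and the score values are assumptions for this sketch; the description above names a logistic regression layer without specifying its formula.

```python
import math


def conv1d(signal, kernel):
    """Valid-mode 1-D cross-correlation, as commonly used in CNN layers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def softmax(scores):
    """Normalize raw class scores into probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# A hypothetical edge-detecting filter applied to a short signal.
feature_map = conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0])

# Hypothetical class scores produced by the final layers of a network.
labels = ["kick", "snare", "hi-hat"]
probabilities = softmax([2.0, 0.5, 0.5])
predicted = labels[probabilities.index(max(probabilities))]
```

The label with the highest probability can be taken as the classification prediction.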
In an example implementation, a trained DNN 200 can comprise at least one input layer, at least six hidden layers, and at least one output layer. During operation, the number of inputs in the input layer can be equal to the number of recorded sounds. For example, if a user records a kick, a snare, and an exp, the number of inputs comprises three. In an example implementation, the architecture of the DNN 200 can comprise an input layer, a first fully connected layer, an activation layer, i.e., tanh layer, a first dropout layer, a second fully connected layer, a second dropout layer, and a sigmoid output layer. A dropout rate for each of the dropout layers may be set to 0.25.
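By way of non-limiting illustration, a forward pass through the example architecture (input layer, first fully connected layer, tanh layer, first dropout layer, second fully connected layer, second dropout layer, sigmoid output layer) might be sketched as follows. The hidden width of eight, the three outputs, the inverted-dropout scaling, and all names are assumptions for this sketch; only the layer order and the dropout rate of 0.25 come from the description above.

```python
import math
import random

DROPOUT_RATE = 0.25  # rate stated for each dropout layer


def init_params(num_inputs, hidden, num_outputs, rng, std=0.1):
    """Gaussian weights and zero biases for two fully connected layers."""
    def gauss(rows, cols):
        return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]
    return {"w1": gauss(hidden, num_inputs), "b1": [0.0] * hidden,
            "w2": gauss(num_outputs, hidden), "b2": [0.0] * num_outputs}


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def dropout(values, rate, rng, training):
    """Inverted dropout (an assumption): active only during training."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def forward(x, params, rng, training=False):
    h = dense(x, params["w1"], params["b1"])             # first FC layer
    h = [math.tanh(v) for v in h]                        # tanh layer
    h = dropout(h, DROPOUT_RATE, rng, training)          # first dropout
    h = dense(h, params["w2"], params["b2"])             # second FC layer
    h = dropout(h, DROPOUT_RATE, rng, training)          # second dropout
    return [1.0 / (1.0 + math.exp(-v)) for v in h]       # sigmoid output


# Three inputs, matching the example of three recorded sounds.
rng = random.Random(0)
params = init_params(num_inputs=3, hidden=8, num_outputs=3, rng=rng)
scores = forward([1.0, 0.0, 0.0], params, rng)
```

With `training=False` the dropout layers pass values through unchanged, so repeated inference calls on the same input return the same scores.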
After training, the DNN 200 may be used by the user device 160 and/or the computing system 110 to classify audio inputs as shown in
At block 515, the trained neural network 200 classifies the audio input. For example, the audio input may be classified as a particular vocal percussion type. The vocal percussion type may be a particular type of drum sound, such as output corresponding to a bass drum, output corresponding to a snare drum, output corresponding to a hi-hat, output corresponding to a closed hat, or the like. At block 520, a musical instrument 180, such as an electronic musical instrument, generates audio output based on the classified audio. For example, the computing system 110 and/or the user device 160 may generate digital data corresponding to the audio classification, and the digital data can be provided to the musical instrument 180 for audio output generation. The process 500 then ends.
In one or more implementations, one or more suitable audio preprocessing techniques may be applied to the audio input. For example, a processor of the computing system 110 and/or the user device 160 may perform audio envelope detection, which may be a magnitude of the audio signal computed by a Hilbert function (see
where Thr represents the threshold parameter, SIGNAL represents the smoothed audio signal at the ith iteration, and n represents the number of audio signal samples. Referring to
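By way of non-limiting illustration, audio envelope detection using the magnitude of the analytic signal obtained by a Hilbert transform might be sketched as follows. The direct discrete Fourier transform is used only to keep the sketch free of external libraries and is practical for short signals only; the function names are assumptions, and the threshold computation referenced above is not reproduced here.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct O(n^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def envelope(samples):
    """Magnitude of the analytic signal of a real-valued sequence."""
    n = len(samples)
    if n == 0:
        return []
    spectrum = dft(samples)
    # Keep DC (and Nyquist when n is even), double the positive
    # frequencies, and zero the negative frequencies.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        for k in range(1, n // 2):
            gain[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    analytic = dft([s * g for s, g in zip(spectrum, gain)], inverse=True)
    return [abs(z) for z in analytic]


# A tone with amplitude 0.5 spanning exactly four cycles in 64 samples.
tone = [0.5 * math.cos(2.0 * math.pi * 4 * t / 64) for t in range(64)]
tone_envelope = envelope(tone)
```

For a steady tone the envelope is flat at the tone's amplitude; for a percussive sound it rises at the onset and decays afterward, which is what makes it suitable for comparison against a threshold.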
In some implementations, the neural network 200 may use one or more Mel-Frequency Cepstral Coefficient (MFCC) feature extraction layers which transform the audio signals, e.g., audio input, into images such that the neural network 200 can perform image classification.
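By way of non-limiting illustration, the transformation of framed audio into a two-dimensional array of MFCC values, which can be treated as an image, might be sketched as follows. This is a simplified textbook MFCC computation (power spectrum, triangular mel filterbank, logarithm, discrete cosine transform) and not necessarily the computation performed by the feature extraction layers described above. The filter count, the coefficient count, the sample rate, and all function names are assumptions.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power of each non-negative frequency bin (direct DFT)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 / n
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters with centres evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bins = [int((n_fft + 1) * hz / sample_rate) for hz in edges]
    bank = []
    for i in range(1, num_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):       # rising slope
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):      # falling slope
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank


def mfcc_image(frames, sample_rate, num_filters=8, num_coeffs=5):
    """Return one row of MFCCs per frame; all frames share one length."""
    bank = mel_filterbank(num_filters, len(frames[0]), sample_rate)
    image = []
    for frame in frames:
        power = power_spectrum(frame)
        # A small constant avoids taking the logarithm of zero.
        log_energy = [math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
                      for row in bank]
        # DCT-II of the log filterbank energies.
        image.append([sum(e * math.cos(math.pi * j * (i + 0.5) / num_filters)
                          for i, e in enumerate(log_energy))
                      for j in range(num_coeffs)])
    return image


# Three identical 64-sample frames of a 1 kHz tone at an assumed 8 kHz rate.
frame = [math.sin(2.0 * math.pi * 1000.0 * t / 8000.0) for t in range(64)]
image = mfcc_image([frame, frame, frame], sample_rate=8000)
```

Each row of `image` corresponds to one audio frame and each column to one cepstral coefficient, so the array can be supplied to an image classifier as a single-channel image.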
In general, the computing systems and/or devices described may employ any of a number of computer operating systems, including, but by no means limited to, AppLink/Smart Device Link middleware, the Microsoft Automotive® operating system, the Microsoft Windows® operating system, the Unix operating system (e.g., the Solaris® operating system distributed by Oracle Corporation of Redwood Shores, California), the AIX UNIX operating system distributed by International Business Machines of Armonk, New York, the Linux operating system, the Mac OSX and iOS operating systems distributed by Apple Inc. of Cupertino, California, the BlackBerry OS distributed by Blackberry, Ltd. of Waterloo, Canada, and the Android operating system developed by Google, Inc. and the Open Handset Alliance, or the QNX® CAR Platform for Infotainment offered by QNX Software Systems. Examples of computing devices include, without limitation, an on-board vehicle computer, a computer workstation, a server, a desktop, notebook, laptop, or handheld computer, or some other computing system and/or device.
Computers and computing devices generally include computer-executable instructions, where the instructions may be executable by one or more computing devices such as those listed above. Computer executable instructions may be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies, including, without limitation, and either alone or in combination, Java™, C, C++, Matlab, Simulink, Stateflow, Visual Basic, Java Script, Perl, HTML, etc. Some of these applications may be compiled and executed on a virtual machine, such as the Java Virtual Machine, the Dalvik virtual machine, or the like. In general, a processor (e.g., a microprocessor) receives instructions, e.g., from a memory, a computer readable medium, etc., and executes these instructions, thereby performing one or more processes, including one or more of the processes described herein. Such instructions and other data may be stored and transmitted using a variety of computer readable media. A file in a computing device is generally a collection of data stored on a computer readable medium, such as a storage medium, a random-access memory, etc.
Memory may include a computer-readable medium (also referred to as a processor-readable medium) that includes any non-transitory (e.g., tangible) medium that participates in providing data (e.g., instructions) that may be read by a computer (e.g., by a processor of a computer). Such a medium may take many forms, including, but not limited to, non-volatile media and volatile media. Non-volatile media may include, for example, optical or magnetic disks and other persistent memory. Volatile media may include, for example, dynamic random-access memory (DRAM), which typically constitutes a main memory. Such instructions may be transmitted by one or more transmission media, including coaxial cables, copper wire and fiber optics, including the wires that comprise a system bus coupled to a processor of an ECU. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EEPROM, any other memory chip or cartridge, or any other medium from which a computer can read.
Databases, data repositories or other data stores described herein may include various kinds of mechanisms for storing, accessing, and retrieving various kinds of data, including a hierarchical database, a set of files in a file system, an application database in a proprietary format, a relational database management system (RDBMS), etc. Each such data store is generally included within a computing device employing a computer operating system such as one of those mentioned above, and is accessed via a network in any one or more of a variety of manners. A file system may be accessible from a computer operating system, and may include files stored in various formats. An RDBMS generally employs the Structured Query Language (SQL) in addition to a language for creating, storing, editing, and executing stored procedures, such as the PL/SQL language.
In some examples, system elements may be implemented as computer-readable instructions (e.g., software) on one or more computing devices (e.g., servers, personal computers, etc.), stored on computer readable media associated therewith (e.g., disks, memories, etc.). A computer program product may comprise such instructions stored on computer readable media for carrying out the functions described herein.
With regard to the media, processes, systems, methods, heuristics, etc. described herein, it should be understood that, although the steps of such processes, etc. have been described as occurring according to a certain ordered sequence, such processes may be practiced with the described steps performed in an order other than the order described herein. It further should be understood that certain steps may be performed simultaneously, that other steps may be added, or that certain steps described herein may be omitted. In other words, the descriptions of processes herein are provided for the purpose of illustrating certain embodiments, and should in no way be construed so as to limit the claims.
Accordingly, it is to be understood that the above description is intended to be illustrative and not restrictive. Many embodiments and applications other than the examples provided would be apparent to those of skill in the art upon reading the above description. The scope of the invention should be determined, not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. It is anticipated and intended that future developments will occur in the arts discussed herein, and that the disclosed systems and methods will be incorporated into such future embodiments. In sum, it should be understood that the invention is capable of modification and variation and is limited only by the following claims.
All terms used in the claims are intended to be given their plain and ordinary meanings as understood by those skilled in the art unless an explicit indication to the contrary is made herein. In particular, use of the singular articles such as “a,” “the,” “said,” etc. should be read to recite one or more of the indicated elements unless a claim recites an explicit limitation to the contrary.
Number | Date | Country | |
---|---|---|---|
63138558 | Jan 2021 | US |