This disclosure relates to machine learning, particularly to machine learning methods and systems based on transformer architecture.
In the field of machine learning, transformers as disclosed in A. Vaswani, et al., “Attention is all you need,” 31st Conference on Neural Information Processing Systems, 2017 (dated Dec. 6, 2017) are used in fields such as natural language processing and computer vision. In a more recent development, a transformer-in-transformer (TNT) architecture has been proposed by K. Han, et al., “Transformer in transformer,” arXiv preprint arXiv:2103.00112, 2021 (dated Jul. 5, 2021), in which local and global information are modeled such that sentence position encoding can maintain the global spatial information, while word position encoding is used for preserving the local relative position. However, such multilevel transformer architecture in the field of music information retrieval such as audio data recognition has yet to be proposed or developed. As such, further development is required in this field with regards to transformers for audio data recognition.
Devices, systems and methods related to causing an apparatus to generate music information of the audio data using a transformer-based neural network model with a multilevel transformer for audio analysis, using a spectral transformer and a temporal transformer, are disclosed herein. For example, the apparatus, or methods implemented using the apparatus, may include at least one processor and at least one memory including computer program code for one or more programs, the memory and the computer program code being configured to, with the processor, cause the apparatus to train a transformer-based neural network model. The apparatus may be configured to train the multilevel transformer.
In some examples, the apparatus includes at least one processor and a non-transitory computer-readable medium storing therein computer program code including instructions for one or more programs that, when executed by the processor, cause the processor to perform the following steps: obtain audio data; generate a time-frequency representation of the audio data to be applied as input for a transformer-based neural network model, the transformer-based neural network model comprising a transformer-in-transformer module which includes a spectral transformer and a temporal transformer; determine spectral embeddings and first temporal embeddings of the audio data based on the time-frequency representation of the audio data, the spectral embeddings including a first frequency class token (FCT); determine each vector of a second FCT by passing each vector of the first FCT in the spectral embeddings through the spectral transformer; determine second temporal embeddings by adding a linear projection of the second FCT to the first temporal embeddings; determine third temporal embeddings by passing the second temporal embeddings through the temporal transformer; and generate music information of the audio data based on the third temporal embeddings.
In some examples, the spectral embeddings are determined by generating the first FCT to include at least one spectral feature from a frequency bin and frequency positional encodings (FPE) to include at least one frequency position of the first FCT. In some examples, each of the spectral transformer and the temporal transformer comprises a plurality of encoder layers, each encoder layer comprising a multi-head self-attention module, a feed-forward network module, and a layer normalization module. In some examples, each of the spectral transformer and the temporal transformer comprises a plurality of decoder layers configured to receive an output from one of the encoder layers, each decoder layer comprising a multi-head self-attention module, a feed-forward network module, a layer normalization module, and an encoder-decoder attention module.
In some examples, the spectral embeddings are matrices with matrix dimensions that are determined based on a number of frequency bins and a number of channels employed by the transformer-in-transformer module, and a number of the spectral embeddings is determined by a number of time-steps employed by the transformer-in-transformer module. In some examples, the temporal embeddings are vectors having a vector length determined by a number of features employed by the transformer-in-transformer module, and a number of the temporal embeddings is determined by a number of time-steps employed by the transformer-in-transformer module.
In some examples, the transformer-based neural network model comprises a plurality of transformer-in-transformer modules in a stacked configuration such that the temporal embedding is updated through each of the plurality of transformer-in-transformer modules. In some examples, the spectral transformer and the temporal transformer are arranged hierarchically such that the spectral transformer is configured to generate local music information of the audio data and the temporal transformer is configured to generate global music information of the audio data.
According to another implementation, a method implemented by at least one processor is disclosed, where the method includes the steps of: obtaining audio data; generating a time-frequency representation of the audio data to be applied as input for a transformer-based neural network model, the transformer-based neural network model comprising a transformer-in-transformer module which includes a spectral transformer and a temporal transformer; determining spectral embeddings and first temporal embeddings of the audio data based on the time-frequency representation of the audio data, the spectral embeddings including a first frequency class token (FCT); determining each vector of a second FCT by passing each vector of the first FCT in the spectral embeddings through the spectral transformer; determining second temporal embeddings by adding a linear projection of the second FCT to the first temporal embeddings; determining third temporal embeddings by passing the second temporal embeddings through the temporal transformer; and generating music information of the audio data based on the third temporal embeddings.
In some examples, the method also includes the step of determining the spectral embeddings by generating the first FCT to include at least one spectral feature from a frequency bin and generating frequency positional encodings (FPE) to include at least one frequency position of the first FCT. In some examples, each of the spectral transformer and the temporal transformer comprises a plurality of encoder layers, each encoder layer comprising a multi-head self-attention module, a feed-forward network module, and a layer normalization module. In some examples, each of the spectral transformer and the temporal transformer comprises a plurality of decoder layers configured to receive an output from one of the encoder layers, each decoder layer comprising a multi-head self-attention module, a feed-forward network module, a layer normalization module, and an encoder-decoder attention module.
In some examples, the spectral embeddings are matrices with matrix dimensions that are determined based on a number of frequency bins and a number of channels employed by the transformer-in-transformer module, and a number of the spectral embeddings is determined by a number of time-steps employed by the transformer-in-transformer module. In some examples, the temporal embeddings are vectors having a vector length determined by a number of features employed by the transformer-in-transformer module, and a number of the temporal embeddings is determined by a number of time-steps employed by the transformer-in-transformer module.
In some examples, the transformer-based neural network model comprises a plurality of transformer-in-transformer modules in a stacked configuration such that the temporal embedding is updated through each of the plurality of transformer-in-transformer modules. In some examples, the spectral transformer and the temporal transformer are arranged hierarchically such that the spectral transformer is configured to generate local music information of the audio data and the temporal transformer is configured to generate global music information of the audio data.
The implementations will be more readily understood in view of the following description when accompanied by the below figures, wherein like reference numerals represent like elements, and wherein:
Briefly, systems and methods include a transformer-in-transformer (TNT) architecture which implements a spectral transformer that extracts frequency-related features into a frequency class token (FCT) for each frame of audio data. The FCT is linearly projected and added to temporal embeddings, which aggregate useful information from the FCT. The TNT architecture also implements a temporal transformer which processes the temporal embeddings to exchange information across the time (temporal) axis. This architecture of implementing a spectral transformer and a temporal transformer is referred to herein as spectral-temporal TNT. A plurality of such TNT blocks may be stacked to build the spectral-temporal TNT model architecture, which learns a representation for audio data such as music signals in order to perform music information retrieval (MIR) tasks including, but not limited to, music tagging, vocal melody extraction, and chord recognition.
In MIR analysis, the time axis is represented as an axis of sequence, and the frequency axis is represented as an axis of feature. Referring to
A positional encoding block 108 is any suitable module which is capable of adding positional information to the input time-frequency representation after it is processed through the convolution block 106. The specifics of how the positional information is added are explained with regard to
With regard to
A frequency positional embedding (FPE, also represented as E_ϕ) is a learnable matrix which is used to apply frequency positional encoding to the representation and is generated by an FPE generation block 202. The FPE matrix is denoted by E_ϕ ∈ ℝ^{(F′+1)×K′}. An element-wise adder 206 implements element-wise addition with S″_t and E_ϕ, the result of which is denoted as Ŝ_t = S″_t ⊕ E_ϕ (where ⊕ denotes the element-wise addition). The combined three-dimensional matrix for all time-steps t, i.e. Ŝ having the dimensions T′, F′+1, and K′, is the output of the positional encoding block 108. In the resulting representation matrix Ŝ, the FCT vectors therein are collectively denoted by Ĉ = [ĉ_1, ĉ_2, . . . , ĉ_{T′}], which allows the representation matrix Ŝ to carry information such as pitch and timbre of the audio data to the following attention layers. For example, a pitch in the signal can lead to high energy at a specific frequency bin, and the positional encoding makes each of the FCT vectors aware of the frequency position.
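The concatenation and element-wise addition performed by the positional encoding block can be illustrated with a minimal, dependency-free Python sketch for a single time-step. The function name `add_fct_and_fpe`, the toy sizes, and the zero-valued starting FCT are assumptions made for illustration only; the disclosure does not fix them, and in practice the FPE values would be learned rather than hand-written.

```python
def add_fct_and_fpe(s_t, fct, fpe):
    """Build S''_t by prepending the FCT vector, then add E_phi element-wise.

    s_t: F' x K' nested list, one time-step of the convolved representation.
    fct: length-K' list, the frequency class token for this time-step.
    fpe: (F'+1) x K' nested list, the frequency positional embedding E_phi.
    """
    stacked = [list(fct)] + [list(row) for row in s_t]  # S''_t, FCT in bin 0
    if len(stacked) != len(fpe) or any(
        len(row) != len(pe_row) for row, pe_row in zip(stacked, fpe)
    ):
        raise ValueError("fpe must have shape (F'+1) x K'")
    # Element-wise addition: S_hat_t = S''_t (+) E_phi
    return [
        [a + b for a, b in zip(row, pe_row)]
        for row, pe_row in zip(stacked, fpe)
    ]


# Toy sizes (hypothetical): F' = 2 frequency bins, K' = 3 channels.
s_t = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fct = [0.0, 0.0, 0.0]  # zero starting value is an assumption, not from the text
fpe = [[0.5, 0.5, 0.5], [0.25, 0.25, 0.25], [0.125, 0.125, 0.125]]
s_hat_t = add_fct_and_fpe(s_t, fct, fpe)
```

Repeating this for every time-step t and stacking the results would give the three-dimensional matrix Ŝ with dimensions T′, F′+1, and K′.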
With regard to the data flow of the temporal embeddings 400, E^l is used to denote the temporal embedding matrix which is a combination of individual temporal embedding vectors at layer l, such that E^l = [e^l_1, e^l_2, . . . , e^l_{T′}], where e^l_t ∈ ℝ^{1×D}; that is, each e^l_t is a temporal embedding vector at time t of dimension D, and D is the number of features. E^l is a learnable temporal embedding matrix which is randomly initialized as E^0 ∈ ℝ^{T′×D} prior to entering the first spectral-temporal TNT block. As the temporal embedding matrix passes through each subsequent layer, the learnable matrix E^l is gradually improved.
In the following examples, the FCT vectors are located in the first frequency bin of the spectral embedding matrix, i.e. Ŝ^l. The initial Ŝ^l matrix (or Ŝ^0), which enters the first spectral-temporal TNT block, is the output obtained from the positional encoding block 108, previously denoted as S in
For example, each of the temporal embedding vectors, that is, e^{l-1}_1, e^{l-1}_2, . . . , e^{l-1}_{T′}, of the learnable matrix E^{l-1} is passed through the linear projection layer 404, which transforms the vectors from having the dimension of D to having the dimension of K′. This enables the projected vectors of dimension K′ to be added, using the adder 406, with the first frequency bin of the spectral embedding matrix Ŝ^{l-1}, which is where the FCT vectors are located. The result of adding the projected vectors to the spectral embedding matrix is denoted as Š^{l-1}. The resulting matrix Š^{l-1} is inputted into the spectral transformer encoder 408 which outputs the matrix Ŝ^l, which can be used as the input spectral embedding for the next layer.
The output matrix Ŝ^l is then passed through the linear projection layer 410, which transforms each of the FCT vectors of the output matrix Ŝ^l, that is, the vectors located in the first frequency bin of the output spectral embedding matrix Ŝ^l, changing the dimension from K′ to D. The linearly projected FCT vectors are then added with the temporal embedding vectors e^{l-1}_1, e^{l-1}_2, . . . , e^{l-1}_{T′} using the adder 412. The added vectors (ě^{l-1}_1, ě^{l-1}_2, . . . , ě^{l-1}_{T′}) are inputted into the temporal transformer encoder 414 to obtain the matrix E^l, which can be used as the input temporal embedding for the next layer.
The MHSA module 508 is an extension of the self-attention such that the three inputs Q, K, and V are split along their feature dimension into h heads, and then multiple self-attentions are performed in parallel, each self-attention being performed on one of the heads. The outputs of the heads are then concatenated and linearly projected into the final output. The FFN module 510 has two linear layers with a Gaussian Error Linear Unit (GELU) activation function therebetween. In some examples, the pre-norm residual units are also implemented to stabilize the training of the model.
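The head splitting, parallel attention, concatenation, and output projection can be sketched in plain Python as follows. This is an illustrative sketch only: it assumes the scaled dot-product form of self-attention from the Vaswani et al. paper cited in the background, omits bias terms, and uses hypothetical names (`multi_head_self_attention`, `w_q`, `w_k`, `w_v`, `w_o`) and identity projection weights that are not taken from the disclosure.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as nested lists (rows of a, columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def self_attention(q, k, v):
    """Scaled dot-product attention (assumed form): softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = matmul(q, list(zip(*k)))  # Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, h):
    """Split Q, K, V along the feature axis into h heads, attend, concat, project."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    if len(q[0]) % h != 0:
        raise ValueError("feature dimension must be divisible by h")
    d_head = len(q[0]) // h
    heads = []
    for i in range(h):
        lo, hi = i * d_head, (i + 1) * d_head
        heads.append(
            self_attention(
                [r[lo:hi] for r in q], [r[lo:hi] for r in k], [r[lo:hi] for r in v]
            )
        )
    # Concatenate the per-head outputs for each token, then project.
    concat = [sum((head[n] for head in heads), []) for n in range(len(x))]
    return matmul(concat, w_o)


# Toy input (hypothetical): 3 tokens, 4 features, identity projections, h = 2.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
out = multi_head_self_attention(x, identity, identity, identity, identity, h=2)
```

With identity projections each output row is a weighted average of the input rows within each head's feature slice, which makes the toy case easy to check by hand; learned projection matrices would replace the identity matrices in a trained model.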
Generally, the transformer encoder 500 operates such that X^l = Enc(X^{l-1}), where the Enc(⋅) operation is performed as follows. In a first portion of the encoder 500, the temporal embedding matrix or vector X^{l-1} is passed through the layer normalization module 506 and subsequently through the multi-head self-attention module 508. The resulting matrix or vector from the multi-head self-attention module 508 is added to the original matrix or vector X^{l-1}, where the result thereof can be denoted as X′^{l-1}. In the next portion of the encoder 500, the resulting matrix or vector X′^{l-1} is passed through the layer normalization module 506 and subsequently through the feed-forward network module 510, after which the resulting matrix or vector from the feed-forward network module 510 is added to the matrix or vector X′^{l-1}, and the final result is outputted in the form of vector or matrix X^l to be inputted into the next transformer layer.
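The two residual portions of the Enc(⋅) operation can be written compactly as X′ = X + MHSA(LN(X)) followed by X_out = X′ + FFN(LN(X′)). The sketch below shows that wiring in plain Python. The attention and feed-forward sublayers are passed in as callables, and the stand-ins used in the demonstration (an identity function and an element-wise GELU) are hypothetical placeholders rather than the modules 508 and 510 themselves; the layer normalization shown also omits any learnable gain and bias, which is an assumption.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalise one vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mat_add(a, b):
    """Element-wise sum of two equally shaped nested lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def encoder_layer(x, mhsa, ffn):
    """Pre-norm residual layer: X' = X + MHSA(LN(X)); X_out = X' + FFN(LN(X'))."""
    x_prime = mat_add(x, mhsa([layer_norm(row) for row in x]))
    return mat_add(x_prime, ffn([layer_norm(row) for row in x_prime]))


# Stand-in sublayers (hypothetical): identity "attention", element-wise GELU "FFN".
x = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
x_next = encoder_layer(
    x,
    mhsa=lambda m: [list(row) for row in m],
    ffn=lambda m: [[gelu(v) for v in row] for row in m],
)
```

Because each sublayer output is added to its own input, a sublayer that outputs zeros leaves the embedding unchanged, which is one common explanation for why pre-norm residual units tend to stabilize training.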
In some examples, multiple spectral-temporal TNT blocks 110 are stacked to form a spectral-temporal TNT module. For example, there may be three TNT blocks 110 in one such TNT module. The module may start with inputting the initial spectral embedding matrix Ŝ^0 and the initial temporal embedding matrix E^0 for the first TNT block. For each TNT block, as shown in
In the first step, each of the FCT vectors ĉ^{l-1}_t in Ŝ^{l-1} is updated by adding the linear projection of the associated temporal embedding vector e^{l-1}_t using the linear projection layer 404. This operation is represented by č^{l-1}_t = ĉ^{l-1}_t ⊕ Linear(e^{l-1}_t), where č^{l-1}_t is the updated FCT vector from the previous FCT vector ĉ^{l-1}_t, and the Linear(⋅) operation represents a shared linear layer, i.e. the linear projection layer 404.
In the second step, the spectral embedding matrix Š^{l-1}, which includes the updated FCT vectors č^{l-1}_t ranging from t=1 to t=T′ at the first frequency bin or the first row, is passed through the spectral transformer encoder 408, defined as Ŝ^l = SpecEnc(Š^{l-1}).
In the third step, each of the FCT vectors ĉ^l_t in Ŝ^l is linearly projected and added back to the corresponding temporal embedding vector e^{l-1}_t such that ě^{l-1}_t = e^{l-1}_t ⊕ Linear(ĉ^l_t), where ě^{l-1}_t denotes the updated temporal embedding vectors located in an updated temporal embedding matrix Ě^{l-1}.
Lastly, in the fourth step, the updated temporal embedding matrix Ě^{l-1}, instead of the sum of the temporal embedding matrix E^{l-1} and the spectral embedding matrix Ŝ^{l-1}, is subsequently updated using the temporal transformer encoder 414, represented by the TempEnc(⋅) function, such that E^l = TempEnc(Ě^{l-1}). This operation assists in building up the relationship along the time axis and is therefore beneficial in improving performance of the transformer-based neural network model by reducing the number of parameters. Moreover, the temporal transformer does not require access to the information of every frequency bin, but rather only the important frequency bins that are attended by the FCT vectors, within each spectral embedding matrix.
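The four steps can be traced end to end with a small, dependency-free Python sketch. It is a sketch under stated assumptions, not the disclosed implementation: the names `tnt_block`, `w_down` (standing in for linear projection layer 404, D to K′) and `w_up` (standing in for linear projection layer 410, K′ to D) are hypothetical, bias terms are omitted, the spectral encoder is assumed to be applied to each time-step independently, and both encoders are replaced by identity functions so that the arithmetic can be followed by hand.

```python
def linear(vec, weight):
    """Project vec (length n_in) with weight (n_out x n_in); bias omitted."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def vec_add(a, b):
    """Element-wise sum of two equally long vectors."""
    return [x + y for x, y in zip(a, b)]


def tnt_block(s_hat, e, w_down, w_up, spec_enc, temp_enc):
    """One spectral-temporal TNT block.

    s_hat: T' matrices of shape (F'+1) x K', with the FCT vector in row 0.
    e:     T' temporal embedding vectors of length D.
    """
    # Step 1: update each FCT with the projected temporal embedding (D -> K').
    s_check = [
        [vec_add(s_t[0], linear(e_t, w_down))] + [list(r) for r in s_t[1:]]
        for s_t, e_t in zip(s_hat, e)
    ]
    # Step 2: spectral encoder, applied per time-step (assumption).
    s_next = [spec_enc(s_t) for s_t in s_check]
    # Step 3: project each updated FCT (K' -> D) and add it to e_t.
    e_check = [vec_add(e_t, linear(s_t[0], w_up)) for e_t, s_t in zip(e, s_next)]
    # Step 4: temporal encoder over the whole sequence of embeddings.
    return s_next, temp_enc(e_check)


# Toy sizes (hypothetical): T' = 2, F' = 1, K' = 2, D = 3; identity encoders.
s_hat = [[[1.0, 0.0], [2.0, 2.0]], [[0.0, 1.0], [3.0, 3.0]]]
e = [[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # K' x D
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # D x K'
s_next, e_next = tnt_block(s_hat, e, w_down, w_up, lambda m: m, lambda m: m)
```

Only row 0 of each spectral matrix ever reaches the temporal path, which reflects the point made above that the temporal transformer sees the FCT vectors rather than every frequency bin.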
The output block 112 receives the final output of the TNT blocks 110, denoted as E^3, which is the temporal embedding matrix from the third TNT block, which is the final TNT block in the TNT module. Although the number three (3) is depicted, it is to be understood that there may be any suitable number of TNT blocks, such as more or fewer than three TNT blocks, depending on the amount of data that is to be learned.
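Stacking an arbitrary number of TNT blocks amounts to threading the pair of spectral and temporal embeddings through each block in turn and keeping the final temporal embedding. The following sketch illustrates only that control flow; `run_tnt_module` and `toy_block` are hypothetical names, and the toy block is a placeholder that adds one to every temporal value so the effect of three passes is easy to see.

```python
def run_tnt_module(s_hat_0, e_0, blocks):
    """Thread (S_hat, E) through the blocks in order and return the final E."""
    s_hat, e = s_hat_0, e_0
    for block in blocks:
        s_hat, e = block(s_hat, e)
    return e


def toy_block(s_hat, e):
    """Hypothetical stand-in for one TNT block: keeps S_hat, adds 1 to E."""
    return s_hat, [[v + 1.0 for v in e_t] for e_t in e]


# Three stacked blocks, as in the depicted example; any count works.
e_3 = run_tnt_module(s_hat_0=[[[0.0]]], e_0=[[0.0, 0.0]], blocks=[toy_block] * 3)
```

Changing the length of the `blocks` list is the only modification needed to use more or fewer than three blocks.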
Different outputs may be required from the output block 112 depending on the tasks that are to be performed using such output. For example, in frame-wise prediction tasks such as vocal melody extraction and chord recognition, each temporal embedding vector e^3_t is fed into a shared fully-connected layer with a sigmoid or softmax function for the final output. As another example, in song-level prediction tasks such as music tagging, the output block 112 initiates a temporal class token vector ε^l, where l=0, that is concatenated at the front end of E^l to form another matrix Ê^l such that Ê^l = [ε^l, e^l_1, e^l_2, . . . , e^l_{T′}]. Note that the temporal class token vector ε^l does not have an associated FCT vector in the spectral embedding matrix because the temporal class token vector ε^l operates to aggregate the temporal embedding vectors along the time axis. Lastly, the ε^3 vector, representing the temporal class token vector after the third TNT block, is fed to a fully-connected layer, followed by a sigmoid layer, to obtain the probability output.
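The two output styles can be illustrated with a short Python sketch: a shared fully-connected layer with a sigmoid applied to every temporal embedding vector for frame-wise tasks, and the prepending of a temporal class token for song-level tasks. The function names, toy weights, and zero-valued class token are assumptions for illustration; the disclosure does not specify how the class token is initialized, and trained weights would replace the hand-written ones.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real value to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def frame_wise_output(e, weight, bias):
    """Apply one shared fully-connected layer plus sigmoid to every e_t.

    e: T' vectors of length D; weight: n_out x D; bias: length n_out.
    """
    return [
        [
            sigmoid(sum(w * x for w, x in zip(row, e_t)) + b)
            for row, b in zip(weight, bias)
        ]
        for e_t in e
    ]


def prepend_class_token(e, token):
    """Form E_hat = [epsilon, e_1, ..., e_T'] by placing the class token first."""
    return [list(token)] + [list(e_t) for e_t in e]


# Toy sizes (hypothetical): T' = 2 frames, D = 2 features, 1 output class.
e = [[0.0, 0.0], [1.0, -1.0]]
frame_probs = frame_wise_output(e, weight=[[1.0, 1.0]], bias=[0.0])
e_hat = prepend_class_token(e, token=[0.0, 0.0])  # zero token is an assumption
```

For a song-level task, the first row of the matrix produced by `prepend_class_token` would, after passing through the temporal encoders, be the vector fed to the final fully-connected and sigmoid layers.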
In the processor 604, there are modules capable of performing each of the blocks 102, 104, 106, 108, 110, and 112 as previously disclosed. The modules may be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium, such as the memory unit 606, for execution by the processor 604. Furthermore, in each spectral TNT block 110, there are a spectral transformer block 608, temporal transformer block 610, and linear projection block 612, such that a plurality of spectral TNT blocks 110 may include a plurality of individually operable spectral transformers 608, temporal transformers 610, and linear projection blocks 612, to achieve the multilevel transformer architecture disclosed herein.
The decoder 700 of the spectral transformer block 608 and the decoder 702 of the temporal transformer block 610 also have similar component blocks, namely the multi-head self-attention block 508, the feed-forward network block 510, the layer normalization block 506, and an encoder-decoder attention block 704 which helps the decoder 700 or 702 focus on the appropriate matrices that are outputted from each encoder.
In step 808, the processor determines each vector of a second FCT by passing each vector of the first FCT in the spectral embeddings through the spectral transformer. In step 810, the processor determines second temporal embeddings by adding a linear projection of the second FCT to the first temporal embeddings. In step 812, the processor determines third temporal embeddings by passing the second temporal embeddings through the temporal transformer. In step 814, the processor generates music information of the audio data based on the third temporal embeddings.
The method 800, in some examples, may pertain to the dataflow within a single spectral TNT block, and it should be understood that the TNT-based neural network model may have multiple such TNT blocks that are functionally coupled or stacked together, for example serially such that the output from the first TNT block is used as an input for the subsequent TNT block, in order to improve the efficiency and efficacy of training the model based on the training data set in the database.
In some examples, each of the spectral transformer and the temporal transformer includes a plurality of encoder layers, each encoder layer including a multi-head self-attention module, a feed-forward network module, and a layer normalization module. Each of the spectral transformer and the temporal transformer may include a plurality of decoder layers configured to receive an output matrix from one of the encoder layers, each decoder layer including a multi-head self-attention module, a feed-forward network module, a layer normalization module, and an encoder-decoder attention module.
Additional steps may be implemented in the method 800 as disclosed herein. For example, the processor may determine the dimensions of the spectral embedding matrices based on a number of frequency bins and a number of channels employed by the multilevel transformer, and further determine a number of the spectral embedding matrices based on a number of time-steps employed by the multilevel transformer. For example, the processor may determine a vector length of the temporal embedding vectors based on a number of features employed by the multilevel transformer, and further determine a number of the temporal embedding vectors based on a number of time-steps employed by the multilevel transformer. The spectral transformer and the temporal transformer may be arranged hierarchically such that the spectral (lower-level) transformer learns the local information of the audio data and the temporal (higher-level) transformer learns the global information of the audio data.
In some examples, a positional encoding block is operatively coupled with the multilevel transformer such that a concatenator of the positional encoding block concatenates the FCT vectors with a convoluted time-frequency representation of the audio data, and an element-wise adder of the positional encoding block adds the FPE matrices to the convoluted time-frequency representation of the audio data.
There are numerous advantages in implementing such method or processing device to train a transformer-based neural network model via the use of the multilevel transformer. For example, the multilevel transformer is capable of learning the representation for audio data such as music or vocal signals and demonstrating improved performance in music tagging, vocal melody extraction, and chord recognition. In some examples, the multilevel transformer is capable of learning a more effective model using smaller datasets due to the multilevel transformer being configured such that only the important local information is passed to the temporal transformer through FCTs, which largely reduces the dimensionality of the data flow compared to the other transformer-based models for learning audio data, as known in the art. The reduction in data flow dimensionality facilitates more efficient machine learning due to reduced workload.
Although features and elements are described above in particular combinations, each feature or element can be used alone without the other features and elements or in various combinations with or without other features and elements. The methods provided may be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors may be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer-readable medium). The results of such processing may be mask works that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the examples.
The methods or flow charts provided herein may be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of non-transitory computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).
In the preceding detailed description of the various examples, reference has been made to the accompanying drawings which form a part thereof, and in which is shown by way of illustration specific preferred examples in which the invention may be practiced. These examples are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other examples may be utilized, and that logical, mechanical and electrical changes may be made without departing from the scope of the invention. To avoid detail not necessary to enable those skilled in the art to practice the invention, the description may omit certain information known to those skilled in the art. Furthermore, many other varied examples that incorporate the teachings of the disclosure may be easily constructed by those skilled in the art. Accordingly, the present invention is not intended to be limited to the specific form set forth herein, but on the contrary, it is intended to cover such alternatives, modifications, and equivalents, as can be reasonably included within the scope of the invention. The preceding detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims. The above detailed description of the embodiments and the examples described therein have been presented for the purposes of illustration and description only and not by limitation. For example, the operations described are done in any suitable order or manner. It is therefore contemplated that the present invention covers any and all modifications, variations or equivalents that fall within the scope of the basic underlying principles disclosed above and claimed herein.
Number | Name | Date | Kind |
---|---|---|---|
8046214 | Mehrotra | Oct 2011 | B2 |
20080312758 | Koishida | Dec 2008 | A1 |
Number | Date | Country |
---|---|---|
2009524108 | Jun 2009 | JP |
2021101665 | May 2021 | WO |
Entry |
---|
An apparatus comprising: at least one processor and a non-transitory computer-readable medium storing therein computer program code including instructions for one or more programs that, when executed by the processor, cause the processor to: Dec. 2, 2019 (Year: 2019). |
K. G. Gopalan, D. S. Benincasa and S. J. Wenndt, “Data embedding in audio signals,” 2001 IEEE Aerospace Conference Proceedings (Cat. No. 01TH8542), Big Sky, MT, USA, 2001, pp. 2713-2720 vol. 6, doi: 10.1109/AERO.2001.931292. (Year: 2001). |
International Search Report dated Jun. 11, 2023 in International Application No. PCT/SG2022/050704. |
Han K. et al., “Transformer in Transformer,” 35th Conference on Neural Information Processing Systems, Jul. 5, 2021, pp. 1-12 [Retrieved on May 23, 2023]. |
Zadeh A. et al., “WildMix Dataset and Spectro-Temporal Transformer Model for Monoaural Audio Source Separation,” Nov. 21, 2019, pp. 1-11 [Retrieved on May 23, 2023]. |
Number | Date | Country | |
---|---|---|---|
20230124006 A1 | Apr 2023 | US |