In some instances, neural networks may be employed to synthesize audio of natural sounds, e.g., musical instruments, singing voices, and speech. Further, some audio synthesis implementations have begun to utilize neural networks that leverage differentiable digital signal processors (DDSPs) to synthesize audio of natural sounds in an offline context via batch processing. However, real-time synthesis using a neural network and DDSP has not been realizable, as the subcomponents employed when combining a neural network with DDSP have proven inoperable in the real-time context. For example, the real-time buffer of the device and the frame size of the neural network may differ, which can significantly limit the utility and/or accuracy of the neural network. Further, the computations required to use a neural network and DDSP together are processor intensive and memory intensive, thereby restricting the types of devices capable of implementing a synthesis technique that uses a neural network and DDSP. Further, some of the computations performed when using a neural network with DDSP introduce a latency that makes real-time use infeasible.
The following presents a simplified summary of one or more implementations of the present disclosure in order to provide a basic understanding of such implementations. This summary is not an extensive overview of all contemplated implementations, and is intended to neither identify key or critical elements of all implementations nor delineate the scope of any or all implementations. Its sole purpose is to present some concepts of one or more implementations of the present disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect, a method may include generating a frame by sampling audio input in increments equal to a buffer size of a host device until a threshold corresponding to a frame size used to train a machine learning model is reached; extracting, from the frame, amplitude information, pitch information, and pitch status information; determining, by the machine learning model, control information for audio reproduction based on the amplitude information, the pitch information, and the pitch status information, the control information including pitch control information and noise magnitude control information; generating filtered noise information by inverting the noise magnitude control information using an overlap and add technique; generating, based on the pitch control information, additive harmonic information by combining a plurality of scaled wavetables; and rendering audio output based on the filtered noise information and the additive harmonic information.
In another aspect, a device may include an audio capture device; a speaker; a memory storing instructions; and at least one processor coupled with the memory and configured to execute the instructions to: capture audio input via the audio capture device; generate a frame by sampling the audio input in increments equal to a buffer size of a host device until a threshold corresponding to a frame size used to train a machine learning model is reached; extract, from the frame, amplitude information, pitch information, and pitch status information; determine, by the machine learning model, control information for audio reproduction based on the amplitude information, the pitch information, and the pitch status information, the control information including pitch control information and noise magnitude control information; filter the noise magnitude control information using an overlap and add technique to generate filtered noise information; generate, based on the pitch control information, additive harmonic information by combining a plurality of scaled wavetables; render audio output based on the filtered noise information and the additive harmonic information; and reproduce the audio output via the speaker.
In another aspect, an example computer-readable medium (e.g., a non-transitory computer-readable medium) storing instructions for performing the methods described herein and an example apparatus including means for performing operations of the methods described herein are also disclosed.
Additional advantages and novel features relating to implementations of the present disclosure will be set forth in part in the description that follows, and in part will become more apparent to those skilled in the art upon examination of the following or upon learning by practice thereof.
The Detailed Description is set forth with reference to the accompanying figures, in which the left-most digit of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in the same or different figures indicates similar or identical items or features.
The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations and is not intended to represent the only configurations in which the concepts described herein may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it will be apparent to those skilled in the art that these concepts may be practiced without these specific details. In some instances, well-known components are shown in block diagram form in order to avoid obscuring such concepts.
In order to synthesize realistic-sounding audio of natural sounds, engineers have sought to employ neural audio synthesis with DDSPs. However, the current combination has proven infeasible for use in the real-time context. For example, the subcomponents employed when combining a neural network with DDSP have proven inoperable in the real-time context. As another example, the computations required to use a neural network and DDSP together are processor intensive and memory intensive, thereby restricting the types of devices capable of implementing a synthesis technique that uses a neural network and DDSP. As yet another example, some of the computations performed when using a neural network with DDSP introduce a latency that makes real-time use infeasible.
This disclosure describes techniques for real-time and low latency synthesis of audio using neural networks and differentiable digital signal processors. Aspects of the present disclosure synthesize realistic-sounding audio of natural sounds, e.g., musical instruments, singing voices, and speech. In particular, aspects of the present disclosure employ a machine learning model to extract control signals that are provided to a series of signal processors implementing additive synthesis, wavetable synthesis, and/or filtered noise synthesis. Further, aspects of the present disclosure employ novel techniques for subcomponent compatibility, latency compensation, and additive synthesis to improve audio synthesis accuracy, reduce the resources required to perform audio synthesis, and meet real-time context requirements. As a result, the present disclosure may be used to transform a musical performance using a first instrument into a musical performance using another instrument or sound, provide more realistic sounding instrument synthesis, synthesize one or more notes of an instrument based on one or more samples of other notes of the instrument, and summarize the behavior and sound of a musical instrument.
Further, in some aspects, the synthesis module 100 may be configured to generate a frame by sampling the audio input 108 in increments equal to a buffer size of the device 101 until a threshold corresponding to a frame size used to train the machine learning model 104 is reached.
The feature detector 102 may be configured to detect feature information 112(1)-(n). In some aspects, the feature information 112 may include amplitude information, pitch information, and pitch status information of each frame generated by the synthesis module 100 from the audio input 108.
The ML model 104 may be configured to determine control information 114(1)-(n) based on the feature information 112(1)-(n) of the frames generated by the synthesis module 100. In some examples, the ML model 104 may include a neural network or another type of machine learning model. In some aspects, a "neural network" may refer to a mathematical structure that takes an object as input and produces another object as output through a set of linear and non-linear operations called layers. Such structures may have parameters that may be tuned through a learning phase to produce a particular output and are, for instance, used for audio synthesis. In addition, the ML model 104 may be a model capable of being used on a plurality of different devices having differing processing and memory capabilities. Some examples of neural networks include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks, and other forms of neural networks. For example, in some aspects, the ML model 104 may include a recurrent neural network with at least one recurrent layer. Further, the ML model 104 may be trained using various training or learning techniques, e.g., backward propagation of errors. For instance, the ML model 104 may be trained to determine the control information 114. In some aspects, a loss function may be backpropagated through the ML model 104 to update one or more parameters of the ML model 104 (e.g., based on a gradient of the loss function). Various loss functions may be used, such as mean squared error, likelihood loss, cross entropy loss, hinge loss, etc. In some aspects, the loss comprises a spectral loss determined between two waveforms. Further, gradient descent techniques may be used to iteratively update the parameters over a number of training iterations.
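By way of a non-limiting illustration, the sketch below shows one way a spectral loss between a target waveform and a synthesized waveform could be computed. It is written in NumPy for readability and assumes a multi-scale formulation; the FFT sizes and the combination of linear and log-magnitude terms are illustrative assumptions rather than the disclosed training procedure. During actual training, the same computation would be expressed in an automatic-differentiation framework so that the loss can be backpropagated through the ML model 104.

```python
import numpy as np


def stft_magnitude(x, fft_size, hop):
    """Magnitude spectrogram via a simple Hann-windowed STFT."""
    window = np.hanning(fft_size)
    frames = []
    for start in range(0, len(x) - fft_size + 1, hop):
        frames.append(np.abs(np.fft.rfft(x[start:start + fft_size] * window)))
    return np.stack(frames)


def multi_scale_spectral_loss(target, synthesized, fft_sizes=(2048, 1024, 512, 256)):
    """L1 distance between magnitude spectrograms at several resolutions."""
    loss = 0.0
    for fft_size in fft_sizes:
        hop = fft_size // 4
        s_t = stft_magnitude(target, fft_size, hop)
        s_y = stft_magnitude(synthesized, fft_size, hop)
        # Linear and log-magnitude terms, a common choice for spectral losses.
        loss += np.mean(np.abs(s_t - s_y))
        loss += np.mean(np.abs(np.log(s_t + 1e-6) - np.log(s_y + 1e-6)))
    return loss
```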
Additionally, in some aspects, the ML model 104 may be configured to process the control information 114 based on pitch status information before providing the control information 114 to the synthesis processor 106. For instance, rendering the audio output 110 based on a frame lacking pitch may cause chirping artifacts. Accordingly, to reduce chirping artifacts within the audio output 110, the ML model 104 may zero the harmonic distribution of the control information 114 based on the pitch status information indicating that the current frame does not have a pitch.
Additionally, the synthesis processor 106 may be configured to render the audio output 110 based on the control information 114(1)-(n). For example, the synthesis processor 106 may be configured to generate a noise audio component using an overlap and add technique, generate a harmonic audio component from a plurality of scaled wavetables using the pitch control information, and render the audio output 110 based on the noise audio component and the harmonic audio component.
Additionally, the decoder 412 may be configured to generate control information (e.g., the harmonic distribution 430, harmonic amplitude 432, and noise magnitude information 434) based on the fundamental frequency 426 and the amplitude 428. In some aspects, the decoder 412 maps the fundamental frequency 426 and the amplitude 428 to control parameters for the synthesizers of the synthesis processor 106. In particular, the decoder 412 may comprise a neural network which receives the fundamental frequency 426 and the amplitude 428 as inputs, and generates control inputs (e.g., the harmonic distribution 430, the harmonic amplitude 432, and the noise magnitude information 434) for the DDSP element(s) of the synthesis processor 106.
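As a non-limiting illustration of the decoder's interface, the sketch below maps per-frame (fundamental frequency, amplitude) features to raw control outputs with a single randomly initialized hidden layer. The layer width, the 60-harmonic distribution, and the 65 noise bands are hypothetical values chosen for the example; a trained decoder (e.g., a recurrent network as noted above) would replace the random weights, and the raw outputs would still be formatted by the nonlinearity described next.

```python
import numpy as np

rng = np.random.default_rng(0)

N_HARMONICS = 60    # illustrative assumption
N_NOISE_BANDS = 65  # illustrative assumption
HIDDEN = 256        # illustrative assumption

# Randomly initialized weights stand in for a trained decoder.
W_in = rng.normal(scale=0.1, size=(2, HIDDEN))
W_harm = rng.normal(scale=0.1, size=(HIDDEN, N_HARMONICS))
W_amp = rng.normal(scale=0.1, size=(HIDDEN, 1))
W_noise = rng.normal(scale=0.1, size=(HIDDEN, N_NOISE_BANDS))


def decode(f0_scaled, amplitude):
    """Map per-frame (f0, amplitude) features to raw synthesizer controls."""
    h = np.tanh(np.array([f0_scaled, amplitude]) @ W_in)
    harmonic_distribution = h @ W_harm   # one raw weight per harmonic
    harmonic_amplitude = h @ W_amp       # overall harmonic level
    noise_magnitudes = h @ W_noise       # per-band filtered-noise magnitudes
    return harmonic_distribution, harmonic_amplitude, noise_magnitudes
```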
Further, the exponential sigmoid module 418 may be configured to format the control information (e.g., harmonic distribution 430, harmonic amplitude 432, and noise magnitude information 434 via the biasing module 414) as non-negative by applying a sigmoid nonlinearity.
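For reference, open-source DDSP implementations commonly realize this formatting step as a scaled sigmoid of roughly the following form; the specific constants below follow that open-source convention and are assumptions here rather than a statement of the disclosed implementation.

```python
import numpy as np


def exp_sigmoid(x, exponent=10.0, max_value=2.0, threshold=1e-7):
    """Scaled sigmoid nonlinearity that keeps control signals non-negative.

    The constants follow a convention used in open-source DDSP code and are
    assumptions; any sigmoid-based squashing to a strictly positive range
    serves the same purpose.
    """
    return max_value * (1.0 / (1.0 + np.exp(-x))) ** np.log(exponent) + threshold
```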
The windowing module 420 may be configured to receive the harmonic distribution 430 and the fundamental frequency in Hz 436, and upsample the harmonic distribution 430 with overlapping Hamming window envelopes with predefined values (e.g., frame size of 128 and hop size of 64) based on the fundamental frequency in Hz 436.
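A minimal sketch of such window-based upsampling is shown below, assuming the controls arrive at frame rate, using the example frame size of 128 and hop size of 64, and normalizing by the summed window so that the overlapping envelopes reproduce the original control values.

```python
import numpy as np


def upsample_with_windows(controls, frame_size=128, hop_size=64):
    """Upsample frame-rate controls to audio rate with overlapping Hamming windows.

    controls: array of shape (n_frames, n_channels) of per-frame values.
    Returns an array of shape (n_frames * hop_size, n_channels).
    """
    n_frames, n_channels = controls.shape
    window = np.hamming(frame_size)
    out_len = n_frames * hop_size
    out = np.zeros((out_len + frame_size, n_channels))
    norm = np.zeros(out_len + frame_size)
    for i in range(n_frames):
        start = i * hop_size
        # Each frame contributes a windowed copy of its control values.
        out[start:start + frame_size] += window[:, None] * controls[i]
        norm[start:start + frame_size] += window
    out = out / np.maximum(norm, 1e-8)[:, None]
    return out[:out_len]
```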
Further, in some aspects, the device 101 may display visual data corresponding to the control information. For example, in some aspects, the device 101 may include a graphical user interface that displays the pitch status information 306, the harmonic distribution 430, the harmonic amplitude 432, the noise magnitude information 434, and/or the fundamental frequency in Hz 436. Further, the control information 114 may be presented in a thread-safe manner that does not negatively impact the synthesis module's determination of the audio output and/or introduce audio artifacts. For example, in some aspects, double buffering of the harmonic distribution may be employed to allow the harmonic distribution to be safely displayed in a GUI thread.
In some examples, the user input 706 may include a linear control that allows the user to compress or expand the amplitude about a target threshold. Further, a ratio may define how strongly the amplitude is compressed towards (or expanded away from) the threshold. For example, ratios greater than 1:1 (e.g., 2:1) pull the signal towards the threshold, ratios lower than 1:1 (e.g., 0.5:1) push the signal away from the threshold, and a ratio of exactly 1:1 has no effect, regardless of the threshold.
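A minimal sketch of this behavior, assuming the amplitude control signal is expressed in decibels and ignoring the knee behavior described further below, is shown here; the default threshold value is illustrative only.

```python
import numpy as np


def compress_amplitude_db(amp_db, threshold_db=-24.0, ratio=2.0):
    """Pull (or push) an amplitude control signal relative to a threshold.

    ratio > 1 compresses toward the threshold, ratio < 1 expands away from
    it, and ratio == 1 leaves the signal unchanged, matching the behavior
    described above. The dB formulation and default threshold are assumptions.
    """
    return threshold_db + (np.asarray(amp_db) - threshold_db) / ratio
```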
In some examples, the user input 706 may be employed as parameters for transient shaping of the amplitude control signal. Further, the user input 706 for transient shaping may include an attack input which controls the strength of transient attacks. Positive percentages for the attack input may increase the loudness of transients, negative percentages for the attack input may reduce the loudness of transients, and a level of 0% may have no effect. The user input 706 for transient shaping may also include a sustain input that controls the strength of the signal between transients. Positive percentages for the sustain input may increase the perceived sustain, negative percentages for the sustain input may reduce the perceived sustain, and a level of 0% may have no effect. In addition, the user input 706 for transient shaping may also include a time input representing a time characteristic. Shorter times may result in sharper attacks while longer times may result in longer attacks.
In some examples, the user input may further include a knee input defining the interaction between a threshold and a ratio during transient shaping of the amplitude control signal. In some aspects, the threshold may represent an expected amplitude transfer curve threshold, while the ratio may represent an expected amplitude transfer curve ratio. In addition, the user input may include an amplitude transfer curve knee width.
Wavetable synthesis is well-suited to real-time synthesis of periodic and quasi-periodic signals. In many instances, real-world objects that generate sound exhibit physics that are well described by harmonic oscillations (e.g., vibrating strings, membranes, hollow pipes, and human vocal cords). By using lookup tables composed of single-period waveforms, wavetable synthesis can be as general as additive synthesis whilst requiring less real-time computation. Accordingly, the wavetable synthesizer 806 provides speed and processing benefits over traditional methods that require additive synthesis over numerous sinusoids, which cannot be performed in real time. Further, in some aspects, the wavetable synthesizer 806 may employ a double buffer to store and index the scaled wavetables generated from the audio input 108, thereby providing storage benefits in addition to the computational benefits.
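As a non-limiting illustration of the lookup-table approach, the sketch below renders audio by advancing a phase through a single-period wavetable with linear interpolation; the 48 kHz sample rate is an example value.

```python
import numpy as np


def wavetable_oscillator(wavetable, f0_hz, sample_rate=48000):
    """Render audio by indexing a single-period wavetable.

    wavetable: 1-D array holding one period of the waveform.
    f0_hz: array of per-sample fundamental frequencies in Hz.
    The phase advances by f0 / sample_rate per sample and the table is read
    with linear interpolation between neighboring entries.
    """
    table_len = len(wavetable)
    phase = np.cumsum(np.asarray(f0_hz) / sample_rate) % 1.0
    position = phase * table_len
    idx0 = np.floor(position).astype(int) % table_len
    idx1 = (idx0 + 1) % table_len
    frac = position - np.floor(position)
    return (1.0 - frac) * wavetable[idx0] + frac * wavetable[idx1]
```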
In some aspects, the wavetable synthesizer 806 may be further configured to apply frequency-dependent antialiasing to a wavetable. For example, the synthesis processor 106 may be configured to apply frequency-dependent antialiasing to the wavetable based on the pitch of the current frame as represented by the smooth fundamental frequency in Hz 814. Further, the frequency-dependent antialiasing may be applied to the scaled wavetable prior to storing the scaled wavetable within the double buffer.
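One simple form of such frequency-dependent antialiasing, shown below as a sketch, zeroes the amplitude of any harmonic whose frequency would exceed the Nyquist limit at the current pitch before the wavetable is built; the sample rate is an example value, and the disclosed implementation may use a different antialiasing strategy.

```python
import numpy as np


def antialias_harmonics(harmonic_distribution, f0_hz, sample_rate=48000):
    """Zero harmonics that would alias at the current pitch.

    Harmonic k of a tone at f0_hz lies at k * f0_hz; any harmonic at or above
    the Nyquist frequency (sample_rate / 2) is removed from the distribution.
    """
    n_harmonics = len(harmonic_distribution)
    harmonic_numbers = np.arange(1, n_harmonics + 1)
    audible = harmonic_numbers * f0_hz < sample_rate / 2.0
    return np.where(audible, harmonic_distribution, 0.0)
```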
Further, the mix control 808 may be configured to independently increase or decrease the volumes of the noise audio component 812 and the harmonic audio component 816, respectively. In some aspects, the mix control 808 may modify the volume of the noise audio component 812 and/or the harmonic audio component 816 in a real-time safe manner based on user input, thereby reducing and/or limiting audio artifacts. In addition, the mix control 808 may be configured to apply a smoothing gain when modifying the noise audio component 812 and/or the harmonic audio component 816 to prevent audio artifacts.
Additionally, the mix control 808 may provide the noise audio component 812 and the harmonic audio component 816 to the latency compensation module 810 to be aligned. For example, the noise synthesizer 802 may introduce delay that may be corrected by the latency compensation module. In particular, in some aspects, the latency compensation module 810 may shift the noise audio component 812 and/or the harmonic audio component 816 so that the noise audio component 812 and/or the harmonic audio component 816 are properly aligned, and combine the noise audio component 812 and the harmonic audio component 816 to form the audio output 110. As described herein, in some examples, the latency compensation module 810 may combine the noise audio component 812 and the harmonic audio component 816 to form an audio output 110 associated with an instrument differing from the instrument that produced the audio input 108. In some other examples, the latency compensation module 810 may combine the noise audio component 812 and the harmonic audio component 816 to form an audio output 110 of one or more notes of an instrument based on one or more samples of other notes of the instrument captured within the audio input 108.
The processes described below illustrate an example method 1300 for synthesizing audio in accordance with aspects of the present disclosure.
At block 1302, the method 1300 may include generating a frame by sampling audio input in increments equal to a buffer size of a host device until a threshold corresponding to a frame size used to train a machine learning model is reached. For example, the ML model 104 may be configured with a frame size equaling 480 samples, and the I/O buffer size of the device 101 may be 128 samples. As a result, the synthesis module 100 may sample the audio input 108 within the buffers 204 of the device, generate a frame including the data from the 1st sample of the first buffer 204(1) to the 96th sample of the fourth buffer 204(4) (i.e., 3 × 128 + 96 = 480 samples), and provide the frame to the feature detector 102. Further, the synthesis module 100 may repeat the frame generation step in real-time as the audio input is received by the device 101.
Accordingly, the device 101, the computing device 1400, and/or the processor 1401 executing the synthesis module 100 may provide means for generating a frame by sampling audio input in increments equal to a buffer size of a host device until a threshold corresponding to a frame size used to train a machine learning model is reached.
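By way of a non-limiting illustration, the sketch below accumulates host-sized buffers until a full model frame is available, using the example sizes above (480-sample frames, 128-sample buffers). Whether successive frames overlap is an implementation choice; this sketch emits non-overlapping frames.

```python
import numpy as np

FRAME_SIZE = 480   # frame size used to train the model (example above)
BUFFER_SIZE = 128  # host I/O buffer size (example above)


class FrameAccumulator:
    """Collect host-sized audio buffers until a full model frame is available."""

    def __init__(self):
        self._pending = np.zeros(0, dtype=np.float32)

    def push(self, audio_buffer):
        """Append one host buffer; return any complete frames now available."""
        self._pending = np.concatenate([self._pending, audio_buffer])
        frames = []
        while len(self._pending) >= FRAME_SIZE:
            frames.append(self._pending[:FRAME_SIZE].copy())
            self._pending = self._pending[FRAME_SIZE:]
        return frames
```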
At block 1304, the method 1300 may include extracting, from the frame, amplitude information, pitch information, and pitch status information. For example, the feature detector 102 may be configured to detect the feature information 112. In some aspects, the pitch detector 302 of the feature detector 102 may be configured to determine pitch status information 306 (is_pitched) and pitch information 308 (midi_pitch), and the amplitude detector 304 of the feature detector 102 may be configured to determine amplitude information 310 (amp_ratio). Further, the downsampler 402 may be configured to downsample the feature information 112 before the feature information 112 is provided to the ML model 104. In some aspects, the feature information may be downsampled to align with a specified interval of the ML model 104 for predicting the control information. As an example, if the sample rate of the device 101 is equal to 48,000 Hz and the ML model is trained with 250 frames per second, the downsampler 402 may provide every 192nd sample to the next subsystem, as 48,000 divided by 250 equals 192.
Accordingly, the device 101, the computing device 1400, and/or the processor 1401 executing the feature detector 102, the pitch detector 302, the amplitude detector 304, and/or the downsampler 402 may provide means for extracting, from the frame, amplitude information, pitch information, and pitch status information.
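A minimal sketch of the decimation step described in the example above (48,000 Hz sample rate, 250 model frames per second, so every 192nd value is kept) is shown below; the actual downsampler 402 may interpolate or average rather than simply decimating.

```python
import numpy as np

SAMPLE_RATE = 48000
MODEL_FRAME_RATE = 250                        # frames per second used in training
DECIMATION = SAMPLE_RATE // MODEL_FRAME_RATE  # 48000 / 250 = 192


def downsample_features(per_sample_features):
    """Keep every 192nd feature value so the stream matches the model's
    training frame rate (a sketch of the downsampling step)."""
    return np.asarray(per_sample_features)[::DECIMATION]
```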
At block 1306, the method 1300 may include determining, by the machine learning model, control information for audio reproduction based on the amplitude information, the pitch information, and the pitch status information, the control information including pitch control information and noise magnitude control information. For example, the ML model 104 may receive the feature information 112(1) from the downsampler 402, and generate corresponding control information 114(1) based on the amplitude information, the pitch information, and the pitch status information detected by the feature detector 102. In some aspects, the control information 114(1) may include the pitch status information 306, the fundamental frequency in Hz 436, the harmonic distribution 430, the harmonic amplitude 432, and the noise magnitude information 434. Further, the control information 114(1) may provide independent control over pitch and loudness during synthesis.
Accordingly, the device 101, the computing device 1400, and/or the processor 1401 executing the ML model 104 may provide means for determining, by the machine learning model, control information for audio reproduction based on the amplitude information, the pitch information, and the pitch status information, the control information including pitch control information and noise magnitude control information.
At block 1308, the method 1300 may include generating filtered noise information by inverting the noise magnitude control information using an overlap and add technique. For example, the noise synthesizer 802 may receive the noise magnitude information 434 and generate the noise audio component 812 of the audio output based on the noise magnitude information 434. In addition, the noise synthesizer 802 may employ an overlap and add technique to generate the noise audio component 812 (i.e., the filtered noise information) at a size equal to the buffer size of device 101. In some aspects, the noise synthesizer 802 may perform the overlap and add technique via a circular buffer.
Accordingly, the device 101, the computing device 1400, and/or the processor 1401 executing the synthesis processor 106 and/or the noise synthesizer 802 may provide means for generating filtered noise information by inverting the noise magnitude control information using an overlap and add technique.
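A non-limiting sketch of this step is shown below: each frame of noise magnitudes is treated as a zero-phase magnitude response and inverted with an inverse real FFT into an impulse response that filters white noise, and the convolution tail that extends past the current block is carried into the next block, a simple stand-in for the circular-buffer overlap-and-add described above.

```python
import numpy as np


class FilteredNoiseSynth:
    """Sketch of filtered-noise synthesis with overlap-add of filter tails."""

    def __init__(self, block_size=128, seed=0):
        self.block_size = block_size
        self.rng = np.random.default_rng(seed)
        self.tail = np.zeros(0)

    def process(self, noise_magnitudes):
        """Produce one block of filtered noise from a frame of noise magnitudes."""
        # Invert the magnitude response into a symmetric FIR impulse response.
        # Centering the response introduces a fixed delay, one source of the
        # latency that the latency compensation stage accounts for.
        ir = np.fft.fftshift(np.fft.irfft(noise_magnitudes))
        noise = self.rng.uniform(-1.0, 1.0, size=self.block_size)
        filtered = np.convolve(noise, ir, mode="full")
        # Overlap-add: fold in the tail left over from the previous block.
        filtered[:len(self.tail)] += self.tail
        out, self.tail = filtered[:self.block_size], filtered[self.block_size:]
        return out
```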
At block 1310, the method 1300 may include generating, based on the pitch control information, additive harmonic information by combining a plurality of scaled wavetables. For example, the pitch smoother 804 may be configured to receive the pitch status information 306 and the fundamental frequency in Hz 436, and generate a smooth fundamental frequency in Hz 814. Further, the pitch smoother 804 may provide the smooth fundamental frequency in Hz 814 to the wavetable synthesizer 806. Upon receipt of the smooth fundamental frequency in Hz 814, the harmonic distribution 430, and the harmonic amplitude 432, the wavetable synthesizer 806 may be configured to convert, via a fast Fourier transform (FFT), the harmonic distribution 430 into a first dynamic wavetable, scale the first wavetable to generate a first scaled wavetable by multiplying the first wavetable by the harmonic amplitude 432, and linearly crossfade the first scaled wavetable with a second scaled wavetable associated with the frame preceding the current frame and a third scaled wavetable associated with the frame succeeding the current frame to generate the harmonic audio component 816 (i.e., the additive harmonic information).
Accordingly, the device 101, the computing device 1400, and/or the processor 1401 executing the synthesis processor 106 and/or the wavetable synthesizer 806 may provide means for generating, based on the pitch control information, additive harmonic information by combining a plurality of scaled wavetables.
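By way of a non-limiting illustration, the sketch below builds one scaled single-cycle wavetable from a harmonic distribution via an inverse real FFT and linearly crossfades audio rendered from consecutive wavetables; the 512-sample table size is an illustrative assumption.

```python
import numpy as np


def wavetable_from_harmonics(harmonic_distribution, harmonic_amplitude, table_size=512):
    """Build one scaled single-cycle wavetable from a harmonic distribution.

    The harmonic amplitudes are placed into the positive-frequency bins of a
    length-table_size spectrum (bin k corresponds to harmonic k of the single
    cycle), inverted with an IRFFT, and scaled by the overall harmonic amplitude.
    """
    spectrum = np.zeros(table_size // 2 + 1, dtype=complex)
    n = min(len(harmonic_distribution), len(spectrum) - 1)
    spectrum[1:n + 1] = harmonic_distribution[:n]
    wavetable = np.fft.irfft(spectrum, n=table_size)
    return harmonic_amplitude * wavetable


def crossfade(previous_block, current_block):
    """Linearly crossfade equal-length audio blocks rendered from consecutive
    scaled wavetables, as described for the preceding and current frames."""
    fade = np.linspace(0.0, 1.0, len(current_block))
    return (1.0 - fade) * previous_block + fade * current_block
```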
At block 1312, the method 1300 may include rendering the audio output based on the filtered noise information and the additive harmonic information. For example, the latency compensation module 810 may shift the noise audio component 812 and/or the harmonic audio component 816 so that the noise audio component 812 and/or the harmonic audio component 816 are properly aligned, and combine the noise audio component 812 and the harmonic audio component 816 to form the audio output 110. Once the audio output 110 is rendered, the audio output 110 may be reproduced via a speaker. As described herein, in some examples, the latency compensation module 810 may combine the noise audio component 812 and the harmonic audio component 816 to form an audio output 110 associated with an instrument differing from the instrument that produced the audio input 108. In some other examples, the latency compensation module 810 may combine the noise audio component 812 and the harmonic audio component 816 to form an audio output 110 of one or more notes of an instrument based on one or more samples of other notes of the instrument captured within the audio input 108.
In some examples, the latency compensation module 810 may receive the noise audio component 812 and/or the harmonic audio component 816 from the noise synthesizer 802 and the wavetable synthesizer 806 via the mix control 808. Further, in some aspects, the mix control 808 may modify the volume of the noise audio component 812 and/or the harmonic audio component 816 in a real-time safe manner based on user input.
Accordingly, the device 101, the computing device 1400, and/or the processor 1401 executing the synthesis processor 106 and/or the latency compensation module 810 may provide means for rendering the audio output based on the filtered noise information and the additive harmonic information.
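As a non-limiting illustration, the sketch below delays the harmonic path by a fixed number of samples so that it lines up with a noise path that incurs processing delay, and then sums the two components. The delay value is hypothetical, and a streaming implementation would maintain a persistent delay line across blocks rather than operating on isolated blocks as shown.

```python
import numpy as np


def align_and_mix(noise_component, harmonic_component, noise_delay_samples=64):
    """Compensate a known processing delay before summing the two components.

    noise_delay_samples is a hypothetical value standing in for the delay
    introduced by the noise synthesis path; the harmonic path is delayed by
    the same amount so that both components line up before they are combined
    into the audio output. Both inputs are assumed to be equal-length blocks.
    """
    delayed_harmonic = np.concatenate(
        [np.zeros(noise_delay_samples), harmonic_component]
    )[:len(harmonic_component)]
    return noise_component + delayed_harmonic
```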
While the operations are described as being implemented by one or more computing devices, in other examples various systems of computing devices may be employed. For instance, a system of multiple devices may be used to perform any of the operations noted above in conjunction with each other.
As depicted, the system/device 1400 includes a processor 1401 which is capable of performing various processes according to a program stored in a read-only memory (ROM) 1402 or a program loaded from a storage unit 1408 to a random-access memory (RAM) 1403. The RAM 1403 also stores data required by the processor 1401 to perform the various processes, as needed. The processor 1401, the ROM 1402, and the RAM 1403 are connected to one another via a bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.
The processor 1401 may be of any type suitable to the local technical network and may include one or more of the following: general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), graphics processing units (GPUs), co-processors, and processors based on multicore processor architecture, as non-limiting examples. The system/device 1400 may have multiple processors, such as an application-specific integrated circuit chip that is slaved in time to a clock which synchronizes the main processor.
A plurality of components in the system/device 1400 are connected to the I/O interface 1405, including an input unit 1406, such as a keyboard, a mouse, a microphone (e.g., an audio capture device for capturing the audio input 108), or the like; an output unit 1407 including a display, such as a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, and a speaker or the like (e.g., a speaker for reproducing the audio output 110); the storage unit 1408, such as a disk, an optical disk, and the like; and a communication unit 1409, such as a network card, a modem, a wireless transceiver, or the like. The communication unit 1409 allows the system/device 1400 to exchange information/data with other devices via a communication network, such as the Internet, various telecommunication networks, and/or the like.
The methods and processes described above, such as the method 1300, can also be performed by the processor 1401. In some embodiments, the method 1300 can be implemented as a computer software program or a computer program product tangibly included in a computer-readable medium, e.g., the storage unit 1408. In some embodiments, the computer program can be partially or fully loaded onto and/or installed on the system/device 1400 via the ROM 1402 and/or the communication unit 1409. The computer program includes computer-executable instructions that are executed by the associated processor 1401. When the computer program is loaded to the RAM 1403 and executed by the processor 1401, one or more acts of the method 1300 described above can be implemented. Alternatively, the processor 1401 can be configured in any other suitable manner (e.g., by means of firmware) to execute the method 1300 in other embodiments.
In closing, although the various embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended representations is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed subject matter.
| Number | Name | Date | Kind |
|---|---|---|---|
| 6963833 | Singhal | Nov 2005 | B1 |
| 20010023396 | Gersho | Sep 2001 | A1 |
| 20150142456 | Lowe | May 2015 | A1 |
| 20220013132 | Engel et al. | Jan 2022 | A1 |

| Number | Date | Country |
|---|---|---|
| WO-2019139430 | Jul 2019 | WO |

| Entry |
|---|
| Engel et al., "DDSP: Differentiable Digital Signal Processing," International Conference on Learning Representations 2020, Jan. 14, 2020, 19 pages. |
| International Search Report in PCT/SG2023/050315, mailed Oct. 24, 2023, 3 pages. |
| Shan et al., "Differentiable Wavetable Synthesis," ICASSP 2022, Feb. 13, 2022, 6 pages. |

| Number | Date | Country |
|---|---|---|
| 20230377591 A1 | Nov 2023 | US |