The present invention relates to a generation system of synthesized sound in musical instruments, in particular a church organ. More specifically, the invention relates to a system for parameterizing a physical model that is used to generate a synthesized sound.
A physical model is a mathematical representation of a natural process or phenomenon. In the present invention, the modeling is applied to an organ pipe, thus obtaining a faithful physical representation of a musical instrument. Such a methodology makes it possible to obtain a musical instrument capable of reproducing not only the sound, but also the associated sound generation process.
U.S. Pat. No. 7,442,869, in the name of the same Applicant, discloses a reference physical model for a church organ.
However, it must be considered that a physical model is not strictly tied to the generation of sounds and to use in musical instruments; it can also be a mathematical representation of any real-world system.
The parameterization methods of physical models according to the prior art are mostly heuristic, and the sound quality largely depends on the musical taste and experience of the Sound Designer. As a result, the character and the composition of the sounds are peculiar to the individual Sound Designer. Moreover, since parameterization is carried out manually, the sounds take long realization periods on average.
Several methods for the parameterization of physical models are known in literature, such as in the following documents:
However, these documents disclose algorithms that refer to given physical models or to some parameters of the physical models.
Publications on the use of neural networks are known, such as: Leonardo Gabrielli, Stefano Tomassetti, Carlo Zinato, and Stefano Squartini. Introducing deep machine learning for parameter estimation in physical modeling. In Digital Audio Effects (DAFX), 2017. Such a document discloses an end-to-end approach (using Convolutional Neural Networks) that embeds the extraction of acoustic features, learned by the neural network, in the layers of the neural network itself. However, such a system has the drawback that it is not suitable for use in a musical instrument.
The purpose of the present invention is to eliminate the drawbacks of the prior art by disclosing a generation system of synthesized sound in musical instruments that can be extended to multiple physical models and is independent of the intrinsic structure of the physical model used in its validation.
Another purpose is to disclose such a system that allows for the development and use of objective acoustic metrics and iterative heuristic optimization processes capable of exactly parameterizing the selected physical model according to a reference sound.
These purposes are achieved according to the invention with the characteristics of the independent claim 1.
Advantageous embodiments of the invention appear from the dependent claims.
The generation system of synthesized sound in musical instruments according to the invention is defined in claim 1.
Additional features of the invention will become apparent from the detailed description below, which refers to a merely illustrative, non-limiting embodiment, as illustrated in the appended figures, wherein:
With reference to the Figures, the generation system of synthesized sound in musical instruments according to the invention is described, which is generally indicated with reference numeral (100).
The system (100) allows for estimating the parameters that control a physical model of a musical instrument. Specifically, the system (100) is applied to a model of a church organ, but can generally be used for multiple types of physical models.
The raw audio signal (SIN) is analyzed by the system (100) inside the computing device (103). The system (100) extracts the final parameters (Pi) for the reconstruction of the synthesized signal (SOUT). Said final parameters (Pi) are stored in a storage (104) that is controlled by a user control (105). The final parameters (Pi) are transmitted to a sound generator (106) that is controlled by a musical keyboard (107) of the organ. According to the received parameters, the sound generator (106) generates the synthesized audio signal (SOUT) sent to a loudspeaker (108) that emits the sound.
The sound generator (106) is an electronic device capable of reproducing a sound that is very similar to the one detected by the microphone (101) according to the parameters obtained from the system (100). A sound generator is disclosed in U.S. Pat. No. 7,442,869.
First Stage (1)
The first stage (1) comprises extraction means (10) that extract some features (F) from the raw signal (SIN) and a set of neural networks (11) that estimate parameters from said features (F).
The features (F) have been selected based on the organ sound, creating a non-ordinary, differentiated set of features composed of multiple coefficients relating to different aspects of the raw signal (SIN) to be parameterized.
The coefficients (F4) are extracted through analysis of the envelope of the raw audio signal (SIN), i.e. using an envelope detector according to the techniques of the prior art.
Five coefficients are extracted for every part of the signal that is analyzed, such as:
Moreover, aleatory and/or non-periodic components (F5) are extracted from the signal. The aleatory and/or non-periodic components (F5) are six coefficients that provide indicative information on the noise. The extraction of these components can also be performed through a set of comb and notch filters that remove the harmonic part of the raw signal (SIN). The useful information extracted can be: the RMS value of the aleatory component, its duty cycle (defined as noise duty cycle), the zero crossing rate, the zero crossing standard deviation and the envelope coefficients (attack and sustain).
The time waveform relative to the aleatory part is shown in 201. The analysis of Ton and Toff, wherein the noise manifests its granularity characteristics, is performed through two guard thresholds (203, 204) according to the techniques of the prior art. Such an analysis makes it possible to observe a square waveform with variable Duty-Cycle, shown in 202. It must be noted that the square wave (202) does not correspond to a real waveform present in the sound; it is a conceptual representation for the analysis of the intermittence and granularity features of the noise, which is performed using the Duty-Cycle feature of said square wave.
Since the noise of the organ is amplitude modulated, there will be a phase within each period wherein the noise is practically null, which is defined as Toff (205).
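By way of non-limiting illustration, such a Duty-Cycle analysis could be sketched as follows (Python; the guard threshold values and the 5 ms smoothing window are hypothetical choices, not prescribed by the invention):

    import numpy as np

    def noise_duty_cycle(noise, fs, thr_on=0.10, thr_off=0.05):
        # Envelope of the aleatory component: rectification plus smoothing
        win = max(1, int(0.005 * fs))            # hypothetical 5 ms window
        env = np.convolve(np.abs(noise), np.ones(win) / win, mode='same')
        env /= env.max() + 1e-12
        # Two guard thresholds turn the envelope into a square wave
        # with variable Duty-Cycle (conceptual waveform 202)
        gate = np.empty(env.size, dtype=bool)
        state = False
        for i, e in enumerate(env):
            if not state and e > thr_on:
                state = True                     # start of a Ton phase
            elif state and e < thr_off:
                state = False                    # start of a Toff phase (205)
            gate[i] = state
        # Duty-Cycle: fraction of time the noise is active, Ton/(Ton+Toff)
        return gate.mean()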
The four coefficients that characterize the noise are:
After the features (F) have been extracted from the raw signal (SIN), the parameters are estimated from said features by a set of neural networks (11) that operate in parallel on the same sound to be parameterized, each network estimating slightly different parameters because of the small differences between the networks.
Every neural network takes the features (F) as input and provides a complete set of parameters (P*1, . . . P*M) suitable for being sent to a physical model to generate a sound.
The neural networks can be of any type known in the prior art that accepts pre-processed input features (Multi-Layer Perceptron, Recurrent Neural Networks, etc.).
The number of neural networks (11) can vary, generating multiple evaluations of the same features by different networks. The evaluations will differ in acoustic accuracy, which requires the second stage (2) to select the best physical model. All evaluations are made on the entire set of features; the acoustic accuracy is assessed by the second stage (2), which selects the set of parameters estimated by the best-performing neural network.
Although the following description specifically refers to a type of Multi-Layer Perceptron (MLP) network, the invention also extends to different types of neural networks. In an MLP network, every layer is composed of neurons.
With reference to the structure of a single neuron, the output of the k-th neuron is computed as:

uk = wk1·x1 + wk2·x2 + . . . + wkm·xm

yk = φ(uk + bk)

wherein φ is the activation function and:
x1, x2, . . . xm are the inputs, which in the case of the first stage are the features (F) extracted from the raw signal (SIN);
wk1, wk2, . . . wkm are the weights of each input;
uk is the linear combination of the inputs with the weights;
bk is the bias;
yk is the output of the neuron.
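By way of non-limiting illustration, the computation performed by a single neuron could be sketched as follows (Python; the hyperbolic tangent is a hypothetical choice of activation function, the invention not being limited to a specific one):

    import numpy as np

    def neuron(x, w, b, phi=np.tanh):
        # uk: linear combination of the inputs with the weights
        u = np.dot(w, x)
        # yk: output of the neuron, after adding the bias bk and
        # applying the activation function
        return phi(u + b)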
The choice of MLP is motivated by its training simplicity and by the speed that can be reached at inference time. These characteristics are necessary given the parallel use of a rather large number of neural networks. Another fundamental characteristic is the possibility of handcrafting the features, i.e. the audio characteristics, which makes it possible to exploit knowledge of the sounds to be evaluated.
It must be considered that, with an MLP neural network, the extraction of the features (F) is made ad hoc with DSP algorithms, achieving better performance than an end-to-end neural network.
The MLP network is trained by using an error minimization algorithm according to the prior-art error backpropagation technique. In view of the above, the coefficients of each neuron (the weights) are iteratively modified until the optimum condition is found, i.e. the condition that yields the lowest error on the dataset used during the training step.
The error used is the Mean Squared Error, calculated on the coefficients of the physical model normalized in the range [−1; 1]. The network parameters (number of layers, number of neurons per layer) were explored with a random search in the ranges given in table 1.
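By way of non-limiting illustration, such a random search could be sketched as follows (Python; the ranges shown are hypothetical placeholders for those of table 1, and train_mlp stands for any prior-art training routine returning the validation Mean Squared Error):

    import random

    # Hypothetical ranges standing in for those of table 1
    N_LAYERS = (1, 4)
    N_NEURONS = (8, 256)

    def random_search(train_mlp, n_trials=50):
        best_cfg, best_mse = None, float('inf')
        for _ in range(n_trials):
            cfg = {'layers': random.randint(*N_LAYERS),
                   'neurons': random.randint(*N_NEURONS)}
            mse = train_mlp(cfg)   # trains an MLP, returns validation MSE
            if mse < best_mse:
                best_cfg, best_mse = cfg, mse
        return best_cfg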
The training of the neural network is made according to the following steps:
Forward Propagation
1. Forward propagation and output generation yk
2. Cost function calculation: E = ½ Σ ‖y − y′‖²
3. Error backpropagation to generate the delta to be applied in order to update the weights for each training epoch
Weight Update
1. The error gradient is calculated relative to the weights
2. The weights are updated as follows:

w ← w − η·∂E/∂w

where η is the learning rate.
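By way of non-limiting illustration, one epoch of this training procedure could be sketched as follows for a single tanh layer (Python; a full MLP repeats the backward step for every layer):

    import numpy as np

    def train_epoch(X, Y, W, b, eta=0.01):
        # 1. Forward propagation and generation of the outputs yk
        Y_pred = np.tanh(X @ W + b)
        # 2. Cost function E = 1/2 * sum(||y - y'||^2)
        E = 0.5 * np.sum((Y_pred - Y) ** 2)
        # 3. Error backpropagation: gradient of E relative to the weights
        delta = (Y_pred - Y) * (1.0 - Y_pred ** 2)   # derivative of tanh
        grad_W, grad_b = X.T @ delta, delta.sum(axis=0)
        # Weight update: w <- w - eta * dE/dw, eta being the learning rate
        W -= eta * grad_W
        b -= eta * grad_b
        return E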
A dataset of audio examples must be provided for learning. Each audio example is associated with a set of parameters of the physical model that are necessary to generate the audio example. Therefore, the neural network (11) learns how to associate the features of the sounds with the parameters that are necessary to generate them.
These sound-parameter pairs are obtained by generating sounds through the physical model: input parameters are provided and the sounds associated with them are obtained.
Second Stage (2)
The second stage (2) comprises construction means (20) of the physical model that use the parameters (P*1, . . . P*M) evaluated by the neural networks to build physical models (M1, . . . MM). Otherwise said, the number of physical models that are built is equal to the number of neural networks used.
Each physical model (M1, . . . MM) emits a sound (S1, . . . SM) that is compared with a target sound (ST) by metric evaluation means (21). An acoustic distance (d1, . . . dM) between the two sounds is obtained at the output of each metric evaluation means (21). All acoustic distances (d1, . . . dM) are compared by the selection means (22), which select the index (i) of the lowest distance in order to select the parameters (P*i) of the physical model (Mi) with the lowest acoustic distance from the target sound (ST). The selection means (22) comprise an algorithm based on an iteration that individually examines the acoustic distances (d1, . . . dM) generated by the metric evaluation means, in such a way as to find the index (i) of the lowest distance and select the parameters corresponding to said index.
The metric evaluation means (21) are a device used to measure the distance between two sounds: the lower the distance, the more similar the two sounds. The metric evaluation means (21) use two harmonic metrics and one metric for the analysis of the temporal envelopes, but this criterion can be extended to all types of usable metrics.
The acoustic metrics make it possible to objectively evaluate the similarity of two spectra. Variants of the Harmonic Mean Squared Error (HMSE) concept are used: the MSE is calculated on the peaks of the FFT of the sound (S1, . . . SM) generated by the physical model compared with the target sound (ST), in such a way as to evaluate the distance (d1, . . . dM) between homologous harmonics (the first harmonic of the target sound is compared with the first harmonic of the sound generated by the physical model, and so on).
Two comparison methods are possible.
In the first comparison method, the distances between two homologous harmonics are all weighted in the same way.
In the second comparison method, a higher weight is given to the harmonic differences whose corresponding harmonics in the target signal have a higher amplitude. A basic element of psychoacoustics is used, according to which the harmonics of the spectrum with higher amplitude are perceived as more important. Consequently, the difference between homologous harmonics is multiplied by the amplitude of the same harmonic in the target sound. In this way, if the amplitude of the i-th harmonic in the target sound is extremely low, the importance of the evaluation error on that harmonic in the evaluated signal is reduced. Therefore, in this second comparison method, the importance of the error made on harmonics that already had a low psychoacoustic importance in the raw signal (SIN), because of their reduced intensity, is limited.
Other spectral metrics of the prior art, such as RSD and LSD, can also be employed.
In order to evaluate the temporal features, a metric based on the envelope of the waveform of the raw input signal (SIN) is calculated. The difference in square modulus between the envelope of the evaluated signal and that of the target is used.
The following harmonic distance metrics are used:

HMSEL = (1/L) Σl=1..L (S(l) − ST(l))²

HMSELW = (1/L) Σl=1..L ST(l)·(S(l) − ST(l))²

wherein the subscript L is the number of harmonics taken into consideration, whereas the superscript W identifies the Weighted HMSE variant, and S(l) and ST(l) are the amplitudes of the l-th harmonic of the evaluated sound and of the target sound, respectively.

The envelope metric accumulates the difference in square modulus between the envelope of the evaluated signal and that of the target over the attack transient, wherein:

Ts is the end of the attack transient,

H is the Hilbert transform of the signal, which is used to extract the envelope, whereas

s is the signal over time and

S is the modulus of the DFT of the signal over time.
For the harmonic distance metrics, H (relative to the entire spectrum), H10 and H10W (relative to the first ten harmonics) were used.
For the envelope metrics, ED, E1 and E2 were used, where the number refers to the harmonic on which the envelope difference is calculated. The overall metric is a weighted sum of the individual metrics, with weights established by the human operator who runs the process.
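By way of non-limiting illustration, the two HMSE variants could be sketched as follows (Python; S and S_T are assumed to be vectors of homologous harmonic amplitudes already extracted from the FFT peaks):

    import numpy as np

    def hmse(S, S_T, L=10):
        # Unweighted variant: every homologous harmonic pair weighs equally
        d = S[:L] - S_T[:L]
        return np.mean(d ** 2)

    def hmse_weighted(S, S_T, L=10):
        # Weighted variant: each difference is multiplied by the amplitude
        # of the same harmonic in the target sound, so that errors on
        # low-amplitude (psychoacoustically less important) harmonics
        # contribute less to the distance
        d = S[:L] - S_T[:L]
        return np.mean(S_T[:L] * d ** 2)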
The second stage (2) can be implemented by means of an algorithm that comprises the following steps:
1. Selection of first evaluated parameters (P*1) for the generation of a first physical model (M1) and calculation of a first distance (d1) between the sound (S1) of the first physical model and a target sound (ST).
2. Selection of second evaluated parameters (P*2) for the generation of a second physical model (M2) and calculation of a second distance (d2) between the sound (S2) of the second physical model and a target sound (ST).
3. The parameters of the second physical model are selected if the second distance (d2) is lower than the first distance (d1); otherwise, the parameters of the second physical model are discarded.
4. Steps 2 and 3 are repeated until all evaluated parameters of all physical models generated by the first stage (1) have been examined.
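By way of non-limiting illustration, this selection loop could be sketched as follows (Python; generate_sound and distance are hypothetical stand-ins for the physical model and for the metric evaluation means (21)):

    def select_best(parameter_sets, target_sound, generate_sound, distance):
        # Scans the parameter sets P*1 ... P*M evaluated by the M networks
        # and keeps the one whose sound is closest to the target sound ST
        best_params, best_d = None, float('inf')
        for params in parameter_sets:
            d = distance(generate_sound(params), target_sound)
            if d < best_d:
                best_params, best_d = params, d
        return best_params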
Third Stage (3)

The third stage (3) comprises a memory (30) that stores the parameters (P*i) selected by the second stage (2) and physical model creation means (31) suitable for building a physical model (Mi) according to the parameters (P*i) selected by the second stage (2) and coming from the memory (30). The physical model (Mi) of the third stage emits a sound (Si), which is compared with a target sound (ST) by metric evaluation means (32) that are identical to the metric evaluation means (21) of the second stage (2). The metric evaluation means (32) of the third stage calculate the distance (di) between the sound (Si) of the physical model and the target sound (ST). Such a distance (di) is sent to selection means (33) suitable for finding the minimum distance among the input distances.
The third stage (3) also comprises perturbation means (34) suitable for modifying the parameters stored in the memory (30) in such a way as to generate perturbed parameters (P′i) that are sent to said physical model creation means (31), which create physical models with the perturbed parameters. The metric evaluation means (32) then find the distances between the sounds generated by the physical models with the perturbed parameters and the target sound. The selection means (33) select the minimum distance among the received distances.
The third stage (3) provides for a step-by-step search that explores the parameters of the physical model randomly, perturbing the parameters of the physical model and generating the corresponding sounds.
A fairly high number of perturbation passes is necessary, because not all parameters of a set will be perturbed at each iteration. The objective is to minimize the value of the metrics used by perturbing the parameters, discarding the worse parameter sets and keeping only the best parameter set.
The third stage (3) can be implemented by providing:
An algorithm can be implemented for the operation of the third stage (3). Such an algorithm works on a normalized range [−1; 1] of the parameters and comprises the following steps:
1. Generation of a sound (Si) relative to the parameters (P*i) of iteration 0 (i.e. the parameters from the second stage (2))
2. Calculation of a first distance of the sound (Si) from a target sound (ST)
3. Perturbation of the parameters (P*i) in such a way to obtain perturbed parameters (P′i)
4. Generation of a sound from the new set of perturbed parameters (P′i)
5. Calculation of a second distance of the sound generated by the perturbed parameters (P′i) from the target sound.
6. In case of a distance reduction, i.e. if the second distance is lower than the first distance, the previous parameter set is discarded; otherwise, it is maintained.
7. Steps 3 to 6 are repeated until the end of the process, which will terminate when one of the following events occurs:
The free parameters of the algorithm are as follows:
The calculation of the new parameters is made according to the equation:
θi = μdi · (θb ∘ [r ∘ g])
where:
r is a random vector with values in [0; 1] of the same dimension as θb,
g is a random perturbation vector that follows a Gaussian distribution and has the same dimension as θb.
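By way of non-limiting illustration, the perturbation search could be sketched as follows (Python; generate_sound and distance are hypothetical stand-ins as above, the values of the perturbation magnitude mu and of the perturbation probability p are hypothetical, and the additive update used here is only one plausible reading of the equation above):

    import numpy as np

    def perturbation_search(theta, target, generate_sound, distance,
                            mu=0.05, p=0.3, n_iter=1000):
        # theta: current best parameter set, normalized in [-1; 1]
        best_d = distance(generate_sound(theta), target)
        for _ in range(n_iter):
            r = np.random.rand(theta.size) < p   # which parameters to perturb
            g = np.random.randn(theta.size)      # Gaussian perturbation vector
            candidate = np.clip(theta + mu * r * g, -1.0, 1.0)
            d = distance(generate_sound(candidate), target)
            if d < best_d:                       # keep only the best set
                theta, best_d = candidate, d
        return theta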
Priority application: IT 102018000008080, filed August 2018.
PCT filing: PCT/EP2019/069339, filed Jul. 18, 2019 (WO).