This application claims the priority benefit of Taiwan application serial no. 113101271, filed on Jan. 11, 2024. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The disclosure relates to a sound signal processing technology, and particularly relates to a processing method and a processing apparatus of a sound signal.
Most hearing-impaired people cannot clearly receive certain high-frequency sound signals, but they can clearly hear low-frequency sound signals. If only an equalizer is used to amplify the high-frequency speech signals to alleviate the aforementioned problem, many unpleasant sounds may occur (for example, roaring, noise amplification, excessive ear pressure, etc.).
Research points out that if high-frequency breath sounds in speech (for example, the "s" in "study") are transferred to specific low-frequency signals (for example, a "z"-like sound), listeners can distinguish their semantic meaning based on experience and learning. The high-frequency breath sounds in normal speech are at about 6 kilohertz (kHz). Depending on the condition of ear damage of the hearing-impaired person, the high-frequency breath sounds can be down-converted by one-half or one-quarter to generate a sound signal corresponding to 3 kHz or 1.5 kHz.
The disclosure provides a processing method and a processing apparatus of a sound signal, which can provide a frequency shift effect of the sound signal.
A processing method of a sound signal according to an embodiment of the disclosure includes (but is not limited to) the following steps. A plurality of mel-frequency cepstrum coefficients (MFCCs) are extracted from a sound signal to be processed, which includes the following. A power corresponding to a plurality of mel-frequencies of the sound signal to be processed is obtained through a plurality of band-pass filters, in which each band-pass filter corresponds to a mel frequency, and the mel-frequencies corresponding to the band-pass filters are different. A first frequency among the mel-frequencies is mapped to a second frequency among the mel-frequencies, and the power corresponding to the second frequency is replaced with the power corresponding to the first frequency, in which the second frequency is lower than the first frequency. The power corresponding to the mel-frequencies is used to generate the mel-frequency cepstrum coefficients. A synthesized sound signal is generated using the mel-frequency cepstrum coefficients of the sound signal to be processed, in which the sound signal to be processed and the synthesized sound signal are configured to be played by a speaker.
A processing apparatus of a sound signal according to an embodiment of the disclosure includes a storage and a processor. The storage is configured to store a program code. The processor is coupled to the storage. The processor is configured to load the program code to: extract a plurality of mel-frequency cepstrum coefficients from a sound signal to be processed, and generate a synthesized sound signal using the mel-frequency cepstrum coefficients of the sound signal to be processed, in which the sound signal to be processed and the synthesized sound signal are configured to be played by a speaker. The processor is configured to: obtain a power corresponding to a plurality of mel-frequencies of the sound signal to be processed through a plurality of band-pass filters, in which each band-pass filter corresponds to a mel frequency, and the mel-frequencies corresponding to the band-pass filters are different; map a first frequency among the mel-frequencies to a second frequency among the mel-frequencies, and replace the power corresponding to the second frequency with the power corresponding to the first frequency, in which the second frequency is lower than the first frequency; and use the power corresponding to the mel-frequencies to generate the mel-frequency cepstrum coefficients.
Based on the above, the processing method and the processing apparatus of the sound signal according to the embodiments of the disclosure can replace the power of a lower mel-frequency with the power of a higher mel-frequency, and thereby generate frequency-shifted mel-frequency cepstrum coefficients and a corresponding synthesized sound signal. The synthesized sound signal can be played to the hearing-impaired for listening while retaining the sound features.
In order to make the above-mentioned features and advantages of the disclosure clearer and easier to understand, the following embodiments are given and described in detail with the accompanying drawings as follows.
The speaker 110 may be a transducer or other electronic component that converts electrical sound signals into sound. One or more speakers 110 may also form an audio system group. In an embodiment, the speaker 110 is configured to play sound signals.
The storage 120 can be any form of fixed or movable random access memory (RAM), a read only memory (ROM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or a similar element. In an embodiment, the storage 120 is configured to store program codes, software modules, configurations, data (e.g., sound signals, coefficients, algorithm parameters, etc.), or files, and embodiments thereof will be described in detail later.
The processor 130 is coupled to the speaker 110 and the storage 120. The processor 130 may be a central processing unit (CPU), a graphic processing unit (GPU), or other programmable general-purpose or special-purpose microprocessors, a digital signal processor (DSP), a programmable controller, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network accelerator, or other similar elements, or a combination thereof. In an embodiment, the processor 130 is configured to execute all or part of the operations of the processing apparatus 100 and can load and execute each program code, software module, file, and data stored in the storage 120. In some embodiments, the functions of the processor 130 may be implemented through software or a chip.
In the following, the method described in the embodiments of the disclosure is explained with reference to each component and module in the processing apparatus 100. Each process of the method can be adjusted according to the implementation situation and is not limited thereto.
In an embodiment, the sound signal to be processed may belong to or only include a signal in a specific frequency band. For example, the frequency band corresponding to breath sounds in speech is 4 kilohertz (kHz) to 8 kHz. However, the frequency band corresponding to the sound signal to be processed may still be adjusted according to the actual application scenario.
In an embodiment, the sound signal to be processed is a signal generated by noise suppression, gain amplification, filtering, sound source acquisition, or other sound signal processing.
In an embodiment, the sound signal to be processed is configured to be played by the speaker 110. That is, the speaker 110 can play the sound signal to be processed.
On the other hand, the mel-frequency cepstrum coefficients are coefficients constituting the mel-frequency cepstrum. The mel-frequency cepstrum is formed by the cepstrum of the audio segment, and the frequency bands thereof are divided according to the mel scale (each frequency band corresponds to a mel-frequency). An approximate mathematical conversion can be made between the mel scale and the linear frequency scale hertz (Hz). For example, the formula to convert x hertz to y mel (frequency) is:

y = 2595·log10(1 + x/700)  (1)
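The hertz-to-mel conversion can be sketched as a short Python helper; this is an illustrative sketch using the common 2595·log10(1 + f/700) form of the mel formula, and the function names are ours, not part of the embodiment:

```python
import math

def hz_to_mel(f_hz):
    # Convert a frequency in hertz to the mel scale
    # (common 2595 * log10(1 + f/700) form).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse conversion: mel back to hertz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

For instance, 1000 Hz maps to roughly 1000 mel, and the mapping compresses high frequencies, which reflects the ear's coarser resolution there.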
One of the characteristics of mel-frequency cepstrum coefficients is that the sound features are close to the hearing characteristics of the human ear. For example, the mel scale represents the human ear's perception of equidistant changes in pitch. Therefore, in some application scenarios, the mel-frequency cepstrum coefficients can be applied to a speech recognition function, but are not limited thereto.
Specifically, in pre-emphasis (step S311), the processor 130 passes the sound signal to be processed SB1 through a high-pass filter (SB1(n)−a·SB1(n−1), where a is a coefficient from 0.9 to 1) to eliminate the effects caused by the vocal cords and lips during phonation, and compensate for the high-frequency part of the speech signal that is suppressed by the pronunciation system.
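The pre-emphasis step can be sketched as follows; this is a minimal illustration assuming the filter y[n] = x[n] − a·x[n−1] with a default of a = 0.97 (the function name and default value are our assumptions):

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    # High-pass pre-emphasis: y[n] = x[n] - a * x[n-1],
    # with a typically chosen between 0.9 and 1.0.
    signal = np.asarray(signal, dtype=float)
    # The first sample has no predecessor, so it is kept as-is.
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```

A flat (DC) input is almost entirely suppressed after the first sample, which is exactly the high-pass behavior the step relies on.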
In frame blocking (step S312), a set of i sampling points is defined as a frame, and the processor 130 can divide the sound signal to be processed SB1 into a plurality of frames (or sound frames) of i sampling points each.
In windowing (step S313), the processor 130 multiplies each frame of the sound signal to be processed SB1 by a Hamming window to increase the continuity of the left end and the right end of the frame.
Next, the processor 130 extracts the mel-frequency cepstrum coefficients (step S320) of the preprocessed (step S310) sound signal to be processed SB1.
Specifically, in the fast Fourier transform (FFT) (step S321), the processor 130 converts each frame of the sound signal to be processed SB1 from the time domain into the frequency domain to obtain the energy distribution of each frame on the frequency spectrum (e.g., power or energy on the frequency spectrum). In another embodiment, a discrete Fourier transform or another time domain to frequency domain transform may be used.
In the filtering processing (step S322), the processor 130 filters the spectral energy obtained in step S321 through a plurality of (triangular or cosine window) band-pass filters to obtain the mel scale energy spectrum. Each band-pass filter corresponds to a mel frequency in units of the mel scale, and the mel-frequencies corresponding to the band-pass filters are different. Therefore, the mel scale energy spectrum represents the power/energy corresponding to a plurality of mel-frequencies.
For example,
The formula for the energy spectrum transform is:

Yt[m] = Σk Wm[k]·|Xt[k]|², m = 1, …, M  (2)

where Wm[k] is the weight of the band-pass filter BPF with the bank number m, Xt[k] is the spectrum of the frame t obtained in step S321, and Yt[m] is the power of the bank number m corresponding to the mel-frequency on the mel scale energy spectrum. The power Yt[m] is one of Yt[1] to Yt[M].
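The filter-bank operation amounts to a weighted sum over the power spectrum, one sum per bank. The following is a minimal illustration in which the triangular filter weights are supplied as a precomputed (M × K) matrix; the function name is ours:

```python
import numpy as np

def mel_filterbank_power(power_spectrum, weights):
    # Apply M band-pass filters to one frame's power spectrum.
    #   power_spectrum: shape (K,), the values |X_t[k]|^2 for one frame
    #   weights:        shape (M, K), the filter weights W_m[k]
    # Returns Y_t[m] for m = 1..M (one filter-bank energy per bank).
    return weights @ np.asarray(power_spectrum, dtype=float)
```

In practice the rows of `weights` would be triangular (or cosine) windows spaced uniformly on the mel scale; here they are left abstract so the formula-to-code mapping stays one line.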
In the logarithm operation (step S323), the processor 130 takes the logarithm of the power at each mel-frequency of the mel scale energy spectrum to obtain the log energy (logarithmic power).
In the discrete cosine operation (step S324), the processor 130 brings the log energy corresponding to the plurality of mel-frequencies into a discrete cosine transform (DCT) to obtain an L-order mel scale cepstrum coefficient, and L is, for example, 12 (but is not limited thereto).
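Steps S323 and S324 can be sketched as a logarithm followed by a DCT-II; this is a minimal illustration keeping L = 12 coefficients, with the function name and the unnormalized DCT form being our assumptions:

```python
import numpy as np

def mfcc_from_filterbank(mel_powers, n_coeffs=12):
    # Step S323: take the logarithm of the filter-bank powers.
    log_e = np.log(np.asarray(mel_powers, dtype=float))
    m = len(log_e)
    n = np.arange(m)
    # Step S324: DCT-II, keeping coefficients l = 1..n_coeffs:
    #   c[l] = sum_n log_e[n] * cos(pi * l * (n + 0.5) / m)
    return np.array([np.sum(log_e * np.cos(np.pi * l * (n + 0.5) / m))
                     for l in range(1, n_coeffs + 1)])
```

A flat filter-bank spectrum yields all-zero cepstral coefficients (for l ≥ 1), which matches the intuition that the cepstrum captures deviations of the log spectrum from flatness.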
In dynamic feature extraction (step S325), the processor 130 superimposes the frame energy obtained by the pre-processing in step S310 with the cepstrum coefficient obtained by the discrete cosine operation to obtain the mel-frequency cepstrum coefficient YMelFS.
Referring to
The processor 130 maps a first frequency among the plurality of mel-frequencies to a second frequency among the mel-frequencies and replaces the power corresponding to the second frequency with the power corresponding to the first frequency (step S212). Specifically, the second frequency is lower than the first frequency. That is, the first frequency is mapped to a lower second frequency, and the power of the second frequency is directly replaced by the power of the first frequency. Similarly, other frequencies among the mel-frequencies (for example, a third frequency different from the first frequency) can also be mapped to a lower frequency (for example, a fourth frequency different from the second frequency), and accordingly, the power of a lower frequency may be directly replaced by the power of a higher frequency. The above "frequency mapping" and "power replacement" are collectively referred to as "frequency shift" hereinafter.
In an embodiment, the processor 130 may displace the first frequency to the second frequency according to the displacement amount. The unit of the displacement amount corresponds to the mel scale. Taking
In an embodiment, the frequency shift can be expressed as:

YFS[j] = Yt[j+ΔFS], if F1 ≤ f ≤ F2
YFS[j] = Yt[j], if f < F1
YFS[j] = Yt[j] or 0, if f > F2  (3)

where YFS[j] is the frequency-shifted power of the mel-frequency with the bank number j, Yt[j] is the power of the bank number j corresponding to the mel-frequency (as shown in formula (2)), ΔFS is the displacement amount, f is the frequency (in units of hertz) transformed by formula (1) from the mel-frequency with the bank number j, and F1 and F2 are the lower limit and the upper limit of the target frequency band respectively. The target frequency band refers to the frequency band to which the frequency shift is to be applied, for example, 2 kHz (corresponding to F1) to 4 kHz (corresponding to F2) corresponding to high-frequency breath sounds in speech, but is not limited thereto. That is to say, the frequency shift is performed on the power corresponding to the bank numbers of the target frequency band. For frequencies below the target frequency band, the power corresponding to the bank number remains unchanged (i.e., not frequency-shifted), and for frequencies above the target frequency band, the power corresponding to the bank number remains unchanged (i.e., not frequency-shifted) or is reset to zero (i.e., filtered out).
For example, the displacement ΔFS is 3. Assuming that the first frequency is located in the target frequency band, and the bank number corresponding to the first frequency is M, then the bank number corresponding to the second frequency is M−3. Therefore, YFS[M−3]=Yt[(M−3)+3]=Yt[M].
Similarly, the processor 130 may displace the third frequency among the plurality of mel-frequencies to the fourth frequency among the mel-frequencies according to the (same) displacement amount, and the fourth frequency is lower than the third frequency. For example, assuming that the bank number corresponding to the third frequency is M−1, then the bank number corresponding to the fourth frequency is M−4. Therefore, YFS[M−4]=Yt[(M−4)+3]=Yt[M−1].
In another embodiment, the displacement amounts of the plurality of mel-frequencies may be different. For example, the displacement amount of the first frequency is 3, and the displacement amount of the third frequency is 4, but is not limited thereto.
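The frequency shift of step S212 can be sketched as follows; this is a minimal illustration assuming a single displacement amount for all banks in the target band, with the vacated band powers reset to zero (one of the two options mentioned above; the function name and the zeroing choice are ours):

```python
import numpy as np

def shift_mel_powers(Y, shift, lo, hi):
    # Shift filter-bank powers down by `shift` banks.
    #   Y:      shape (M,), powers Y_t[m]
    #   lo, hi: bank-number range of the target frequency band
    # Each bank m in [lo, hi] has its power moved to bank m - shift;
    # banks below the band keep their power unless overwritten, and the
    # original band positions are reset to zero (the "filtered out"
    # option). Assumes shift <= lo so no index goes negative.
    Y = np.asarray(Y, dtype=float)
    out = Y.copy()
    out[lo:hi + 1] = 0.0             # clear the original band positions
    for m in range(lo, hi + 1):      # each bank in the target band...
        out[m - shift] = Y[m]        # ...replaces the power `shift` banks lower
    return out
```

With shift = 3 and bank M inside the band, this reproduces the example above: the power at bank M reappears at bank M−3.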
Referring to
Next, the processor 130 generates a synthesized sound signal using the plurality of mel-frequency cepstrum coefficients of the sound signal to be processed (step S220). Specifically, mel-frequency cepstrum coefficients are a type of sound feature. The processor 130 can synthesize the sound signal according to the sound feature.
In an embodiment, the processor 130 may output the synthesized sound signal by inputting the plurality of mel-frequency cepstrum coefficients of the sound signal to be processed to a mel generative adversarial network (MelGAN).
By inputting a basic sound signal z and a mel-frequency cepstrum coefficient s to the generator GS, an estimated sound signal G(s,z) (e.g., a frequency-shifted sound signal) can be output/generated. The basic sound signal z can be a white noise signal, a brown noise signal, a pink noise signal, or another sound signal. The mel-frequency cepstrum coefficient s is the mel-frequency cepstrum coefficient of a down-converted sound signal x. The down-converted sound signal x is generated by performing down-conversion processing on the training sound signal. For example, in patent TW 1557729, the frequency is reduced to one-quarter or one-half of the sampled speech signal (for example, the training sound signal). Alternatively, the frequency of the training sound signal can be reduced through other down-conversion algorithms. The training sound signal is a speech signal, for example, a speech signal generated by recording, reception, or software editing.
On the other hand, in the training of the mel generative adversarial network, by inputting the down-converted sound signal x and the estimated sound signal G(s,z) to the discriminator DS, the discriminator DS can use the down-converted sound signal x to determine the authenticity of the estimated sound signal G(s,z) generated by the generator GS. That is, the discriminator DS determines whether the estimated sound signal G(s,z) can be distinguished from the down-converted sound signal x. The generator GS and the discriminator DS constantly compete with each other, and the parameters of the generator GS are updated accordingly. The trained generator GS can be configured to receive a mel-frequency cepstrum coefficient y and output a frequency-shifted estimated sound signal G(y,z) accordingly. For example, by inputting the plurality of mel-frequency cepstrum coefficients of the sound signal to be processed to the trained generator GS, a synthesized sound signal (i.e., a frequency-shifted sound signal) can be output.
In other embodiments, the synthesized sound signal can be generated by other neural networks, which are trained to learn the correlation between the mel-frequency cepstrum coefficients and the synthesized sound signal.
In an embodiment, the synthesized sound signal is played by the speaker 110. That is, the speaker 110 can play the synthesized sound signal. Alternatively, the processing apparatus 100 can transmit the synthesized sound signal to other devices and play the synthesized sound signal through the other devices.
The processor 130 performs frequency-shifting processing on the sound signal to be processed SB1, and generates the mel-frequency cepstrum coefficients YMelFS corresponding to the frequency-shifted sound signal to be processed SB1 (step S620). For a detailed description of step S620, reference may be made to the foregoing description of step S210, which is not repeated here.
The processor 130 generates the synthesized sound signal SMelFS based on the mel-frequency cepstrum coefficients YMelFS (step S630). For a detailed description of step S630, reference may be made to the foregoing description of step S220, which is not repeated here.
The processor 130 combines the synthesized sound signal SMelFS and the bypass sound signal SB2 into an output sound signal Sout (step S640). For example, two sound signals are superimposed in the frequency domain or time domain. The output sound signal Sout is configured to be played by the speaker. That is, the speaker 110 can play the output sound signal Sout. Alternatively, the processing apparatus 100 can transmit the output sound signal Sout to other devices and play the output sound signal Sout through other devices.
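The combination of step S640 can be sketched as a time-domain superposition; this is a minimal illustration that zero-pads the shorter signal before adding (the function name is ours):

```python
import numpy as np

def combine_signals(synth, bypass):
    # Superimpose the synthesized signal and the bypass signal in the
    # time domain, padding the shorter one with zeros so that both
    # contribute over the full output length.
    synth = np.asarray(synth, dtype=float)
    bypass = np.asarray(bypass, dtype=float)
    out = np.zeros(max(len(synth), len(bypass)))
    out[:len(synth)] += synth
    out[:len(bypass)] += bypass
    return out
```

An equivalent superposition could also be done in the frequency domain, as the text notes; the time-domain sum is simply the most direct form.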
In an application scenario, the sound signal to be processed played in a video conference, call, or dialogue can be transformed into the output sound signal Sout, so that a hearing-impaired person can distinguish the complete semantic meaning and understand the speech even though the breath sounds in the speech have been frequency-shifted.
To sum up, in the processing method and the processing apparatus of the sound signal according to the embodiments of the disclosure, in the process of extracting the mel-frequency cepstrum coefficients, the power of the mel-frequency spectrum is frequency-shifted, and the synthesized sound signal can be generated according to the frequency-shifted mel-frequency cepstrum coefficients. Other frequency shift algorithms may stretch the sound signal during processing, but in the end only the length before stretching is retained and some features are discarded. In contrast, the embodiments of the disclosure can retain the complete sound features while still achieving the effect of frequency shift.
Although the disclosure has been described with reference to the embodiments above, the embodiments are not intended to limit the disclosure. Any person skilled in the art can make some changes and modifications without departing from the spirit and scope of the disclosure. Therefore, the scope of the disclosure will be defined in the appended claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 113101271 | Jan 2024 | TW | national |