This application claims priority to Korean Patent Application No. 10-2022-0022902, filed on Feb. 22, 2022, with the Korean Intellectual Property Office (KIPO), the entire contents of which are hereby incorporated by reference.
The present disclosure relates to a technology for compressing and restoring audio signals, and more particularly, to a deep neural network-based audio signal compression method and apparatus for compressing and restoring audio signals using a multilayer structure and a training method thereof.
The content described in this part simply provides background information on the present embodiment and does not constitute any conventional technology.
Among conventional audio compression techniques, lossy compression reduces the amount of information required to store a signal in a device or to transmit it for communication by removing some components of the audio signal. The conventional lossy compression-based method determines the signal components to be removed based on psychoacoustic knowledge such that the distortion caused by the loss is perceived as little as possible by general listeners. As a representative example, major audio codecs currently used in various multimedia services, such as MP3 and AAC, analyze the frequency components constituting a signal using traditional signal conversion techniques such as the modified discrete cosine transform (MDCT). They then determine the number of bits to allocate to each frequency component on the basis of a psychoacoustic model designed through various listening experiments, according to how strongly human listeners actually perceive the distortion that arises when fewer bits are assigned to that component. As a result, current commercial codecs have achieved quality that is almost indistinguishable from the original signal even at compression ratios of about 7 to 10 times.
In recent years, continuous efforts have been made to apply deep neural network-based deep learning technology, which is developing rapidly beyond traditional signal processing methods, to the compression of audio signals. In particular, research is under way to replace the signal conversion that precedes bit allocation, the input signal-dependent bit allocation, and the conversion of the converted signal back into a time-axis waveform with deep neural network modules, and to find the optimal conversion and quantization methods from various data.
However, such end-to-end deep neural network structures are still limited mainly to voice signals due to their structural limitations, and show limited performance for general acoustic signals that have wider frequency band components and more diverse patterns.
The present disclosure has been made in an effort to solve the above problem, and it is an object of the present disclosure to provide a method for improving the sound quality after restoration for the same amount of data by improving the encoding and decoding methods used in constructing an end-to-end deep neural network structure for compressing an acoustic signal.
It is another object of the present disclosure to provide a method capable of effectively performing compression even on a full-band signal having a wide bandwidth of 20 kHz or more.
It is still another object of the present disclosure to provide a training method for effective learning of a device capable of effectively performing compression even on a full-band signal having a wide bandwidth of 20 kHz or more.
According to a first exemplary embodiment of the present disclosure, a method being executed by a processor for compressing an audio signal in multiple layers may comprise: (a) restoring, in a highest layer, an input audio signal as a first signal; (b) restoring, in at least one intermediate layer, a signal obtained by subtracting an upsampled signal, which is obtained by upsampling the audio signal restored in the highest layer or an immediately previous intermediate layer, from the input audio signal as a second signal; and (c) restoring, in a lowest layer, a signal obtained by subtracting an upsampled signal, which is obtained by upsampling the audio signal restored in an intermediate layer immediately before the lowest layer, from the input audio signal as a third signal, wherein the first signal, the second signal, and the third signal are combined to output a final restoration audio signal, and the highest layer, the at least one intermediate layer, and the lowest layer each comprises an encoder, a quantizer, and a decoder.
The steps (a), (b), and (c) may each comprise: encoding, in the encoder, by downsampling the input signal; quantizing, in the quantizer, the encoded signal; and decoding, in the decoder, by upsampling the quantized signal.
The decoder, in the highest layer and the at least one intermediate layer, may have an upsampling ratio less than a downsampling ratio of the encoder.
The encoder and the decoder may be configured with a Convolutional Neural Network (CNN), and the quantizer may be configured with a vector quantizer trainable with a neural network.
The restored signal, in the at least one intermediate layer and the lowest layer, may have a sampling frequency for the corresponding layer that is greater than a sampling frequency of the restored signal in the previous layer.
The decoder of the at least one intermediate layer and the lowest layer may transmit an intermediate signal obtained inside a deep neural network structure of the decoder of a previous layer to the decoder of a subsequent layer.
The method may further comprise setting a number of bits to be allocated per layer.
According to a second exemplary embodiment of the present disclosure, an audio signal compression apparatus may comprise: a memory storing at least one instruction; and a processor executing the at least one instruction stored in the memory, wherein the at least one instruction enables the compression apparatus to perform (a) restoring, in a highest layer, an input audio signal as a first signal, (b) restoring, in at least one intermediate layer, a signal obtained by subtracting an upsampled signal, which is obtained by upsampling the audio signal restored in the highest layer or an immediately previous intermediate layer, from the input audio signal as a second signal, and (c) restoring, in a lowest layer, a signal obtained by subtracting an upsampled signal, which is obtained by upsampling the audio signal restored in an intermediate layer immediately before the lowest layer, from the input audio signal as a third signal, the first signal, the second signal, and the third signal being combined to output a final restoration audio signal and the highest layer, the at least one intermediate layer, and the lowest layer each comprising an encoder, a quantizer, and a decoder.
(a), (b), and (c) may each comprise: encoding, in the encoder, by downsampling the input signal; quantizing, in the quantizer, the encoded signal; and decoding, in the decoder, by upsampling the quantized signal.
The decoder, in the highest layer and the at least one intermediate layer, may have an upsampling ratio less than a downsampling ratio of the encoder.
The encoder and the decoder may be configured with a Convolutional Neural Network (CNN), and the quantizer may be configured with a vector quantizer trainable with a neural network.
The restored signal, in the at least one intermediate layer and the lowest layer, may have a sampling frequency for the corresponding layer that is greater than a sampling frequency of the restored signal in the previous layer.
The decoder of the at least one intermediate layer and the lowest layer may transmit an intermediate signal obtained inside a deep neural network structure of the decoder of a previous layer to the decoder of a subsequent layer.
The at least one instruction may enable the apparatus to further perform setting a number of bits to be allocated per layer.
According to a third exemplary embodiment of the present disclosure, a method being executed by a processor for training a neural network compressing an audio signal in multiple layers may comprise: (a) compressing and restoring, in each layer, an input signal; and (b) comparing and determining a signal restored in each layer and a guide signal of the corresponding layer, wherein the signal input, at step (a), to each of the layers remaining after excluding a highest layer is a signal obtained by removing an upsampled signal, which is obtained by upsampling a signal obtained by combining the signal restored in a previous layer and a guide signal of the previous layer at a predetermined ratio, from the input audio signal.
The multiple layers may each comprise an encoder, a quantizer, and a decoder.
The encoder and the decoder may be configured with a Convolutional Neural Network (CNN), and the quantizer may be configured with a vector quantizer trainable with a neural network.
The guide signal, at step (b), may comprise the guide signal of a lowest layer, which is the input audio signal, and the guide signals of the layers except the lowest layer, which are signals generated in the corresponding layers using a bandpass filter set to match the input audio signal to a frequency band of the corresponding layer.
Combining, at step (a), the signal restored in a previous layer and a guide signal of the previous layer at a predetermined ratio may comprise multiplying the restored signal of the preceding layer by α, multiplying the guide signal of the preceding layer by ‘1-α’, and combining the two signals.
α may be set to 0 in an initial stage of learning and gradually increased to 1.
According to the present disclosure, it is possible to compress a signal having a wide bandwidth more efficiently by adopting a structure combining multiple layers handling different frequency bands instead of using a single coding layer covering all frequency bands of broadband and full-band signals and by inducing learning to separate roles between layers in the design of a deep neural network-based end-to-end compression model.
In particular, the present disclosure makes it possible to expect higher restoration performance because the subsequent layer of the present disclosure compensates for errors occurring in the previous layer in addition to inheriting the existing subband coding method.
In addition, because the structural improvement of the present disclosure is independent of the design method of the encoder, decoder, and quantizer constituting each layer, it can be expected to offer even higher design utility as each module is improved with the future development of deep learning technology.
Exemplary embodiments of the present disclosure are disclosed herein. However, specific structural and functional details disclosed herein are merely representative for purposes of describing exemplary embodiments of the present disclosure. Thus, exemplary embodiments of the present disclosure may be embodied in many alternate forms and should not be construed as limited to exemplary embodiments of the present disclosure set forth herein.
Accordingly, while the present disclosure is capable of various modifications and alternative forms, specific exemplary embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit the present disclosure to the particular forms disclosed, but on the contrary, the present disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure. Like numbers refer to like elements throughout the description of the figures.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
It will be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected” or “directly coupled” to another element, there are no intervening elements present. Other words used to describe the relationship between elements should be interpreted in a like fashion (i.e., “between” versus “directly between,” “adjacent” versus “directly adjacent,” etc.).
The terminology used herein is for the purpose of describing particular exemplary embodiments only and is not intended to be limiting of the present disclosure. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Hereinafter, exemplary embodiments of the present disclosure will be described in greater detail with reference to the accompanying drawings. In order to facilitate general understanding in describing the present disclosure, the same components in the drawings are denoted with the same reference signs, and repeated description thereof will be omitted.
With reference to
Here, because the individual coding layers 1000, 2000, and 3000 include respective encoders 1010, 2010, and 3010, respective quantizers 1020, 2020, and 3020, and respective decoders 1030, 2030, and 3030, the coding layers 1000, 2000, and 3000 may perform compression and decompression in different ways.
The encoders 1010, 2010, and 3010 may each generate a converted signal by performing conversion through a deep neural network on an input audio signal. The encoders 1010, 2010, and 3010 may be implemented as convolutional neural network (CNN) layers having residual connections that convert the input audio signal into a form suitable for vector quantization, which will be described later. The encoders 1010, 2010, and 3010 may downsample the input signal by a specific ratio.
The quantizers 1020, 2020, and 3020 may perform quantization by allocating the most appropriate code to the converted signal. The quantizers 1020, 2020, and 3020 may reduce the number of bits required to express an embedding vector by selecting a code vector closest to the converted signal using a vector quantized-variational autoencoder (VQ-VAE). The quantizers 1020, 2020, and 3020 may learn the VQ codebook using the VQ-related loss function in the VQ-VAE, and each layer may have its own VQ codebook.
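The code-vector selection described above can be illustrated with a minimal sketch. It covers only the nearest-neighbor lookup and the fixed-length cost of one index; the VQ-VAE training of the codebook is not shown. The function names, the 4-entry codebook, and the 2-dimensional embedding vectors are hypothetical values chosen for illustration and do not come from the disclosure.

```python
import math

def nearest_code(vector, codebook):
    """Return the index of the code vector closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def quantize(frames, codebook):
    """Map each embedding vector to a codebook index; only the indices need to be stored or sent."""
    return [nearest_code(frame, codebook) for frame in frames]

def dequantize(indices, codebook):
    """Look the code vectors back up on the decoder side."""
    return [codebook[i] for i in indices]

def bits_per_index(codebook):
    """Fixed-length cost of one index, before any entropy coding."""
    return math.ceil(math.log2(len(codebook)))

# Hypothetical codebook and encoder outputs (assumptions, not values from the disclosure).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]
indices = quantize(frames, codebook)
```

Because each layer may have its own codebook, a real system would presumably hold one such table per coding layer.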
The decoders 1030, 2030, and 3030 may pass the code transmitted from the quantizers 1020, 2020, and 3020 through a deep neural network to finally restore a time-series audio signal. The decoders 1030, 2030, and 3030 may be implemented as convolutional neural network (CNN) layers having residual connections. The decoders 1030, 2030, and 3030 may perform upsampling of the signals received from the quantizers 1020, 2020, and 3020 by a specific ratio.
Here, in all the layers except the lowest layer 3000, i.e., the highest and intermediate layers 1000 and 2000, the upsampling ratios of the decoders 1030 and 2030 of the corresponding layers may be smaller than the downsampling ratios of the corresponding encoders 1010 and 2010. That is, signals restored in the corresponding layers 1000 and 2000 may have a lower sampling frequency than the input audio signal. Through this, the present disclosure can limit the bandwidth of the frequency that can be restored in each layer.
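The effect of choosing a decoder upsampling ratio smaller than the encoder downsampling ratio can be checked with simple arithmetic. The sketch below assumes a 48 kHz input and hypothetical ratios of 64:16, 64:32, and 64:64. None of these numbers appear in the disclosure, and the restorable bandwidth is taken to be the Nyquist limit of the restored signal.

```python
from fractions import Fraction

def restored_sample_rate(input_rate_hz, down_ratio, up_ratio):
    """Sampling frequency of a layer's restored signal."""
    return Fraction(input_rate_hz) * up_ratio / down_ratio

def max_bandwidth_hz(sample_rate_hz):
    """Nyquist limit: the widest band the restored signal can carry."""
    return sample_rate_hz / 2

# (encoder downsampling ratio, decoder upsampling ratio) per layer; hypothetical values.
layers = {
    "highest layer 1000": (64, 16),
    "intermediate layer 2000": (64, 32),
    "lowest layer 3000": (64, 64),
}
input_rate = 48000  # assumed input sampling frequency in Hz

for name, (down, up) in layers.items():
    rate = restored_sample_rate(input_rate, down, up)
    print(f"{name}: {rate} Hz restored, band limit {max_bandwidth_hz(rate)} Hz")
```

Under these assumed ratios, only the lowest layer restores a signal at the full input sampling frequency, which matches the ordering of sampling frequencies described for the layers.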
In addition, the encoders 1010, 2010, and 3010, the quantizers 1020, 2020, and 3020, and the decoders 1030, 2030, and 3030 are all configured with deep neural networks and trained through backpropagation of the calculated error between the restored signal and the input audio signal according to existing deep learning training methods.
In this way, in the present disclosure, each of the plurality of coding layers can handle only some frequency bands, rather than one specific coding layer converting and compressing all frequency bands of the input signal. It is also possible, in the present disclosure, to induce each layer, in the network learning process, to learn the characteristics of the signal in its own frequency band and the signal conversion and quantization methods suited to that signal.
In addition, in order to improve the restoration performance in the decoders 1030, 2030, and 3030, the intermediate signals obtained inside the deep neural network structure constituting the decoders 1030 and 2030 of the immediately previous layers may be provided to the decoders 2030 and 3030 of the corresponding layers. Here, the intermediate signals can be combined using various methods such as simply summing the intermediate signals of both sides or adding any deep neural network structure between them. However, the combining method should allow backpropagation of the loss function for network learning.
As long as the conditions for the ratio of upsampling and downsampling are satisfied in configuring each of the coding layers 1000, 2000, and 3000, the detailed internal structures of the encoders 1010, 2010, and 3010, the quantizers 1020, 2020, and 3020, and the decoders 1030, 2030, and 3030 are not bound by any special constraints, and the configuration of the entire system proposed in the present disclosure is not restricted by the detailed structure of the internal modules. Representative components constituting the encoders 1010, 2010, and 3010 and the decoders 1030, 2030, and 3030 include a convolutional neural network (CNN), a multilayer perceptron (MLP), and various non-linear functions, and these components may be combined in any order to constitute each of the encoders 1010, 2010, and 3010, the quantizers 1020, 2020, and 3020, and the decoders 1030, 2030, and 3030.
With reference to
In addition, the finally restored audio signal may be determined by combining the signal restored at the lowest layer 3000 and the signals restored at the previous layers 1000 and 2000, i.e., by summing the restoration results of all the layers 1000, 2000, and 3000.
For example, the signal restored at the highest layer 1000 may be referred to as the first signal, the signal restored at the intermediate layer 2000 as the second signal, and the signal restored at the lowest layer 3000 as the third signal. The highest layer 1000 may generate the first signal by receiving the input audio signal. The intermediate layer 2000 may generate the second signal by receiving the signal obtained by removing the signal obtained by upsampling the first signal from the input audio signal. The lowest layer 3000 may generate the third signal by receiving the signal obtained by removing the signal obtained by upsampling the second signal from the input audio signal. The first signal, the second signal, and the third signal may be combined to output the final restoration audio signal. Here, the first signal and the second signal may be upsampled at a predetermined ratio for synchronization.
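The data flow between layers can be sketched as follows. This is only an illustration under stated assumptions: each coding layer is replaced by a toy stand-in (block averaging for the encoder, rounding for the quantizer, sample repetition for upsampling) instead of a trained neural network, and each layer is assumed to code the residual left by the running sum of the earlier layers' upsampled outputs, which is one reading of the description. All names and numeric settings are hypothetical.

```python
def toy_layer(signal, factor, step):
    """Stand-in for encoder -> quantizer -> decoder of one coding layer.

    Averages blocks of `factor` samples (downsampling) and rounds each mean to a
    multiple of `step` (quantization). The output stays at the reduced rate.
    """
    out = []
    for i in range(0, len(signal), factor):
        block = signal[i:i + factor]
        mean = sum(block) / len(block)
        out.append(round(mean / step) * step)
    return out

def upsample_hold(signal, factor, length):
    """Sample-repetition upsampler that brings a layer output back to the input rate."""
    return [v for v in signal for _ in range(factor)][:length]

def compress_restore(x, layers):
    """Run the layers from lowest to highest sampling frequency.

    Assumption: each layer receives the input minus the running sum of the
    upsampled outputs of the earlier layers; the final signal is that running sum.
    """
    running = [0.0] * len(x)
    layer_outputs = []
    for factor, step in layers:
        residual = [a - b for a, b in zip(x, running)]
        restored = toy_layer(residual, factor, step)
        upsampled = upsample_hold(restored, factor, len(x))
        layer_outputs.append(upsampled)
        running = [r + u for r, u in zip(running, upsampled)]
    return running, layer_outputs

# Hypothetical input and (rate reduction, quantization step) per layer.
x = [0.0, 0.3, 0.9, 1.2, 0.8, 0.1, -0.4, -0.9]
layers = [(4, 0.5), (2, 0.25), (1, 0.125)]
final, (first, second, third) = compress_restore(x, layers)
```

In this toy setting the last layer works at the input rate, so the final error is bounded by half of its quantization step. This mirrors, in simplified form, the way a subsequent layer compensates for errors left by the previous layers.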
Since there is a difference in sampling frequency between coding layers, it is necessary to perform upsampling or downsampling by an appropriate ratio for frequency matching. In the present disclosure, it is possible to connect layers in an order from a low sampling frequency to a high sampling frequency in order for the plurality of layers to perform compression more effectively. In this case, the sampling frequencies of the signals delivered to a subsequent layer may be synchronized through upsampling as shown in
However, when the network is configured as described above, both the input and the output of the intermediate coding layer 2000 depend on the results of the previous coding layer. Thus, it is likely that the parameters of the deep neural network constituting each layer are not properly learned, i.e., that learning is not performed smoothly in the early stages of training. In addition, there is a need for criteria that divide roles between layers more clearly, because the difference in downsampling or upsampling ratio is the only factor that differentiates the layers.
With reference to
The guide signal may be the input audio signal to which bandpass filters 1300 and 2300 are applied in accordance with the maximum frequency bandwidth that the signal restored at each of the layers 1000, 2000, and 3000 can have. As described above, the decoders 1030 and 2030 of the coding layers 1000 and 2000 other than the lowest layer 3000 are configured to perform upsampling at a lower ratio than the downsampling ratio of the corresponding encoders 1010 and 2010, such that the intermediate restoration signal has a low sampling frequency and a narrow bandwidth. Therefore, the guide signal needs to contain only as much of the signal as falls within the bandwidth that each of the layers 1000, 2000, and 3000 can restore, which can be achieved by using the bandpass filters 1300 and 2300 that pass only the frequency band of the input audio signal within the bandwidth limit of the corresponding layer. For example, the guide signal of the lowest layer 3000 may be the input audio signal itself, and the guide signals of the highest layer 1000 and the intermediate layer 2000 may be the signals obtained by passing the input audio signal through the bandpass filters 1300 and 2300 adjusted to the maximum frequency bandwidth of the respective layers. However, the bandpass filters 1300 and 2300 should be designed such that the frequency band handled by any intermediate layer covers the frequency bands handled by all previous layers, in consideration of the fact that the intermediate restoration signal of the previous layer is excluded from the compression process of the subsequent layer.
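One way to generate such guide signals is sketched below. Because the band of each layer must cover the bands of all previous layers, the sketch assumes each filter passes everything from 0 Hz up to the layer's band limit, and realizes it as a Hamming-windowed sinc FIR filter. The filter design, the tap count, and the function names are illustrative assumptions; the disclosure does not specify how the bandpass filters 1300 and 2300 are designed.

```python
import math

def lowpass_taps(cutoff_hz, sample_rate_hz, num_taps=101):
    """Hamming-windowed sinc FIR taps, normalized to unit gain at 0 Hz (num_taps must be odd)."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]

def fir_same(signal, taps):
    """Zero-padded FIR filtering whose output is time-aligned with the input."""
    mid = (len(taps) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            j = i + mid - k
            if 0 <= j < len(signal):
                acc += h * signal[j]
        out.append(acc)
    return out

def guide_signals(x, sample_rate_hz, band_limits_hz):
    """Return one guide signal per layer, ordered from the highest layer to the lowest.

    A band limit of None marks the lowest layer, whose guide is the input itself.
    """
    guides = []
    for limit in band_limits_hz:
        if limit is None:
            guides.append(list(x))
        else:
            guides.append(fir_same(x, lowpass_taps(limit, sample_rate_hz)))
    return guides
```

As a usage example, `guide_signals(x, 48000, [6000, 12000, None])` would produce guides for three layers under the assumed band limits of 6 kHz and 12 kHz. Samples near the start and end of each filtered guide are affected by the zero padding.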
In addition, in order to compensate for the poor restoration performance of the decoders 1030, 2030 and 3030 at the beginning of training, the guide signal can be transmitted to the encoders 1010, 2010, and 3010 and the decoders 1030, 2030 and 3030 of the subsequent coding layers instead of the intermediate restoration signal obtained through the decoders 1030, 2030, and 3030 in an arbitrary intermediate layer 2000. This may be implemented by setting the value of ‘α’ in
After the independent training for each layer using the guide signal converges, the value of ‘α’ may be gradually increased to induce training such that each layer reflects the restoration result of the preceding layer. In this process, the intermediate restoration signal and the guide signal are multiplied by ‘α’ and ‘1-α’, respectively, and then summed, and the summed signal may be transmitted to a subsequent layer through an upsampling module.
In the final training stage, training may be induced such that all coding layers can compensate for the restoration errors of previous layers by fixing the value of ‘α’ to 1. Here, the criteria and time for changing the value of ‘α’ may be arbitrarily determined according to the judgment of the designer based on the degree of convergence of the loss function.
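The mixing of the intermediate restoration signal and the guide signal, together with one possible way of moving ‘α’ from 0 to 1, can be sketched as follows. The linear, step-count-based schedule and its parameter names are assumptions made for illustration; as noted above, the actual criteria and timing are left to the designer, for example based on the convergence of the loss function.

```python
def mix_for_next_layer(restored, guide, alpha):
    """Blend alpha * restored + (1 - alpha) * guide, sample by sample."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * r + (1.0 - alpha) * g for r, g in zip(restored, guide)]

def alpha_schedule(step, warmup_steps, ramp_steps):
    """Hypothetical schedule: 0 during warm-up, a linear ramp, then fixed at 1."""
    if step < warmup_steps:
        return 0.0
    if step >= warmup_steps + ramp_steps:
        return 1.0
    return (step - warmup_steps) / ramp_steps

# Early in training only the guide signal is passed on; at the end only the restoration.
restored = [1.0, 2.0]
guide = [3.0, 4.0]
early = mix_for_next_layer(restored, guide, alpha_schedule(0, 100, 200))
halfway = mix_for_next_layer(restored, guide, alpha_schedule(200, 100, 200))
late = mix_for_next_layer(restored, guide, alpha_schedule(300, 100, 200))
```

With ‘α’ at 0 the subsequent layer sees an ideal band-limited input regardless of how poorly the previous decoder performs, which is the stated motivation for starting training this way.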
With reference to
With reference to
In addition to the MUSHRA Test, the Virtual Speech Quality Objective Listener (ViSQOL) and Signal to Distortion Ratio (SDR) values calculated for more objective evaluation are shown in Table 1.
With reference to Table 1, the results are similar to those of the MUSHRA Test that is a subjective evaluation index. A model created with multiple layers shows better performance than a model created with a single layer.
Table 2 shows the results of analyzing the bit rate for each step after training is completed.
The bit rate of each layer was calculated based on the ratio of entropy and code vector of the codebook in consideration of entropy coding. With reference to Table 2, the bit rate of the third stage, which is designed to convert the highest frequency band, was calculated to be the largest. In view of psychoacoustic knowledge, it is possible to allocate very few bits to the high frequency component, which typically has low energy and is relatively less important for human perception. That is, coding efficiency can be further improved by adjusting the bits allocated to each layer based on existing psychoacoustic knowledge.
That is, the bit allocation to each layer may be autonomously performed within the network through training, or the bits to be allocated to each layer may be set manually. For example, when the low band is important, it may be possible to allocate more bits to the highest layer 1000 and fewer bits to the intermediate and lowest layers 2000 and 3000. Also, when the intermediate band is important, it may be possible to allocate more bits to the intermediate layer 2000 and fewer bits to the highest and lowest layers 1000 and 3000. The number of bits allocated to a layer may be set by a user based on the field and characteristics of the application to which the present disclosure is applied and on existing psychoacoustic knowledge.
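The exact formula behind the bit-rate analysis is not given here. A common estimate, assumed in the sketch below, multiplies the empirical entropy of the code indices used by a layer by the number of codes that layer emits per second. The function names, the index sequence, and the code rate are hypothetical and are not measured values.

```python
import math
from collections import Counter

def code_entropy_bits(indices):
    """Empirical entropy, in bits per code, of a sequence of codebook indices."""
    counts = Counter(indices)
    total = len(indices)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def layer_bit_rate(indices, codes_per_second):
    """Estimated bit rate of one layer after ideal entropy coding."""
    return code_entropy_bits(indices) * codes_per_second

# Hypothetical usage pattern of a 4-entry codebook: index 0 is used half of the time.
indices = [0, 0, 1, 1, 2, 3, 0, 0]
entropy = code_entropy_bits(indices)      # 1.75 bits per code
rate = layer_bit_rate(indices, 750)       # assumed 750 codes per second
```

A skewed usage pattern such as this one costs fewer bits than the 2 bits per code that a fixed-length index into a 4-entry codebook would need.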
The operations of the method according to the exemplary embodiment of the present disclosure can be implemented as a computer readable program or code in a computer readable recording medium. The computer readable recording medium may include all kinds of recording apparatus for storing data which can be read by a computer system. Furthermore, the computer readable recording medium may store and execute programs or codes which can be distributed in computer systems connected through a network and read through computers in a distributed manner.
The computer readable recording medium may include a hardware apparatus which is specifically configured to store and execute a program command, such as a ROM, RAM or flash memory. The program command may include not only machine language codes created by a compiler, but also high-level language codes which can be executed by a computer using an interpreter.
Although some aspects of the present disclosure have been described in the context of the apparatus, the aspects may indicate the corresponding descriptions according to the method, and the blocks or apparatus may correspond to the steps of the method or the features of the steps. Similarly, the aspects described in the context of the method may be expressed as the features of the corresponding blocks or items or the corresponding apparatus. Some or all of the steps of the method may be executed by (or using) a hardware apparatus such as a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important steps of the method may be executed by such an apparatus.
In some exemplary embodiments, a programmable logic device such as a field-programmable gate array may be used to perform some or all of functions of the methods described herein. In some exemplary embodiments, the field-programmable gate array may be operated with a microprocessor to perform one of the methods described herein. In general, the methods are preferably performed by a certain hardware device.
The description of the disclosure is merely exemplary in nature and, thus, variations that do not depart from the substance of the disclosure are intended to be within the scope of the disclosure. Such variations are not to be regarded as a departure from the spirit and scope of the disclosure. Thus, it will be understood by those of ordinary skill in the art that various changes in form and details may be made without departing from the spirit and scope as defined by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
10-2022-0022902 | Feb 2022 | KR | national |
Number | Date | Country | |
---|---|---|---|
20230267940 A1 | Aug 2023 | US |