The present disclosure relates to audio processing technology and, more specifically, to an audio signal processing method, a device, and a storage medium for reducing signal delay.
In audio devices, signal delay during the processing of audio signals is undesirable, especially for applications with demanding real-time requirements such as hearing aid devices, where the total system delay from audio input to audio output is expected to be kept below 10 milliseconds and, in any case, not greater than 20 milliseconds; otherwise the signal delay may impair speech recognition. However, existing audio devices often struggle to meet these low-delay requirements.
Therefore, it is desired to provide an audio signal processing method for audio devices to solve the problem of high delay in the existing technology.
An objective of the present application is to provide an audio signal processing method for reducing signal delay.
In one aspect of the present application, an audio signal processing method is provided. The method comprises: providing an input audio signal comprising a plurality of input data frames offset from each other by a predetermined frame shift, each of the plurality of input data frames having a predetermined frame length; performing first windowing processing on the plurality of input data frames in sequence with a first window function, a start point and an end point of the first window function being aligned with the two ends of each input data frame respectively; wherein the first window function comprises a starting function portion in a starting region of the first window function, an ending function portion in an ending region of the first window function, and an intermediate function portion in an intermediate region of the first window function between the starting region and the ending region; and wherein the intermediate function portion has a first weighting factor, the starting function portion changes from 0 at the start point to the first weighting factor adjacent to the intermediate region, and the ending function portion changes from the first weighting factor adjacent to the intermediate region to 0 at the end point; performing predetermined signal processing on the input audio signal after the first windowing processing and generating an output audio signal, wherein the output audio signal comprises a plurality of output data frames corresponding to the plurality of input data frames of the input audio signal, and each output data frame has the predetermined frame length; performing second windowing processing on the plurality of output data frames in sequence with a second window function, a start point and an end point of the second window function being aligned with the two ends of each output data frame respectively; wherein the second window function comprises a suppression function portion in a suppression region of the second window function, an output function portion in an output region of the second window function, and a compensation function portion in a compensation region of the second window function between the suppression region and the output region, wherein the output region has a length equal to that of the ending region; wherein the suppression function portion starts from 0 at the start point and is configured to suppress the output audio signal, and the output function portion ends at 0 at the end point; wherein the compensation function portion is configured to provide signal weighting related to the output function portion and to compensate for a difference in signal weighting between the ending function portion and the first weighting factor, and the compensation function portion changes from the suppression function portion adjacent to the suppression region to the output function portion adjacent to the output region; and outputting the plurality of output data frames after the second windowing processing by superimposing the plurality of output data frames with the predetermined frame shift.
In other aspects of the present application, an audio signal processing device and a non-transitory computer storage medium are also provided.
The above is an overview of the application, which may be simplified, summarized and omitted in detail. Therefore, a person skilled in the art should realize that this part is only illustrative and is not intended to limit the scope of the application in any way. This summary is neither intended to determine the key features or essential features of the claimed subject matter, nor is it intended to be used as an auxiliary means to determine the scope of the claimed subject matter.
The above and other features of the contents of the present application will be more fully understood by the following specification and the appended claims in conjunction with the drawings. It will be understood that these drawings depict only several embodiments of the contents of the present application and should not be considered as limiting the scope of the contents of the present application. By using the drawings, the contents of the present application will be illustrated more clearly and in more detail.
In the following detailed description, reference is made to the drawings, which form a part hereof. In the drawings, similar reference numbers generally indicate similar components, unless the context indicates otherwise. The illustrative embodiments described in the detailed description, the drawings and the claims are not intended to be limiting. Other embodiments may be used, and other variations may be made, without departing from the spirit or scope of the subject matter of the present application. It will be understood that a variety of different configurations, substitutions, combinations and designs may be made, all of which clearly constitute a part of the contents of the present application.
Specifically, the audio sampling module is used to sample the original audio signal in analog form and generate corresponding audio data samples in digital format. Generally, the audio sampling module can sample the original audio signal at a predetermined sampling rate, e.g., 16 kHz, and can frame the audio data samples according to a predetermined frame length, e.g., 10 milliseconds, to generate a plurality of input data frames with the predetermined frame length. These successive input data frames constitute the input audio signal. Each input data frame may include a corresponding number of audio data samples. For example, each input data frame may have 160 audio data samples when the audio signal is sampled at a sampling rate of 16 kHz and the frame length is 10 milliseconds. It will be appreciated that in the preceding example the frame length is measured as a length of time, while in other cases the frame length may also be measured as a number of audio data samples, e.g., a frame length of 160 audio data samples or 256 audio data samples. It can be appreciated that the sampling rate of the audio data samples and the number of audio data samples per frame together correspond to the frame length measured as a length of time.
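The frame-size arithmetic described above can be sketched as follows; the variable names are illustrative, not part of the disclosure.

```python
# Illustrative check of the framing arithmetic described above:
# a 16 kHz sampling rate and a 10 ms frame length yield 160 audio
# data samples per input data frame.
sampling_rate_hz = 16_000
frame_length_ms = 10

samples_per_frame = sampling_rate_hz * frame_length_ms // 1000
print(samples_per_frame)  # 160
```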
The sampling of the original audio signal by the audio sampling module may introduce an audio sampling delay 101. For some existing audio devices, the audio sampling module may not proceed to sample the original audio signal and generate the next input data frame until the current input data frame is complete. This means that every two adjacent input data frames do not overlap with each other, so the audio sampling delay 101 introduced by the audio sampling module may be equal to the frame length of the input data frames. In addition, a hardware input delay 103 may be introduced during the audio sampling process, which depends on the analog-to-digital signal conversion and is typically 1 to 2 milliseconds. Afterwards, the input audio signal generated by the sampling module may be sent to the signal processing module, which processes the input audio signal based on a predetermined signal processing algorithm. The signal processing module may introduce an algorithmic processing delay 105, which is typically proportional to the frame length, for example 0.2 to 0.5 times the frame length. The output audio signal may have the same frame length as the input audio signal. For example, the output audio signal may include a plurality of output data frames that all have the predetermined frame length. The output audio signal may be sent to the audio playback module and be played back by the audio playback module for listening by a user of the audio device. During this process, the audio playback module may introduce a hardware output delay 107 and an audio playback delay 109. Like the hardware input delay 103, the hardware output delay 107 depends primarily on the digital-to-analog signal conversion, which is typically 1 to 2 milliseconds.
In the existing audio device, the audio playback module plays and processes the output audio signal in units of output data frames; that is, the audio playback module may play an output data frame only after receiving the entire output data frame. Thus, the audio playback delay 109 is also equal to the frame length of the output data frames. In general, the frame length of the data frames is at least 20 milliseconds to meet the requirements of subsequent spectrum analysis and processing.
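Summing the delays enumerated above gives a rough budget for the conventional pipeline. The sketch below uses the figures stated in the text (20 ms frames, 1 to 2 ms converter delays, algorithmic delay of 0.2 to 0.5 times the frame length); the exact converter and algorithmic values are illustrative midpoints of those ranges, not measurements.

```python
# Rough delay budget for the conventional (non-overlapping) pipeline.
frame_ms = 20.0

audio_sampling_delay = frame_ms          # delay 101: one full frame
hardware_input_delay = 1.5               # delay 103: A/D conversion, 1-2 ms
algorithmic_delay = 0.35 * frame_ms      # delay 105: 0.2-0.5x frame length
hardware_output_delay = 1.5              # delay 107: D/A conversion, 1-2 ms
audio_playback_delay = frame_ms          # delay 109: one full frame

total_ms = (audio_sampling_delay + hardware_input_delay + algorithmic_delay
            + hardware_output_delay + audio_playback_delay)
print(total_ms)  # 50.0
```

Even with optimistic hardware figures, the two frame-length terms dominate, which is why the total far exceeds the 10 to 20 ms target mentioned for hearing aid devices.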
It can be seen that the audio sampling delay 101 and the audio playback delay 109, which depend on the frame length of the data frames, have the most significant influence on the total signal delay during the audio signal processing by the existing audio device shown in
In order to solve the problem of high signal delay in existing audio devices, the methods of the embodiments of the present application intercept audio data samples in such a manner that some of the data samples are reused during the framing of audio sampling; that is, adjacent data frames may overlap with each other, with frame shifts between different data frames. Correspondingly, during audio playback, adjacent data frames are offset by the same frame shift. This reduces the scale of the audio sampling delay and the audio playback delay from the data frame length to the size of the frame shift, thus significantly reducing the total signal delay of the audio signal processing path. In addition, in the embodiments of the present application, windowing processing is performed on the data frames with a specially designed window function, which effectively preserves in the output audio signal the information of the original audio signal and thus enables playback of the output audio signal with a better reproduction of the original audio signal.
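The overlapped framing described above can be sketched as follows; the function name and sizes are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

# Sketch of overlapped framing: adjacent frames share samples and are
# offset by a frame shift (hop) smaller than the frame length, so a new
# frame becomes available every `hop` samples instead of every
# `frame_len` samples.
def frame_with_overlap(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Slice signal x into overlapping frames of length frame_len,
    offset from each other by hop samples."""
    n_frames = (len(x) - frame_len) // hop + 1
    return np.stack([x[k * hop : k * hop + frame_len] for k in range(n_frames)])

x = np.arange(16)
frames = frame_with_overlap(x, frame_len=8, hop=2)   # hop = frame_len / 4
# frames[0] = [0..7], frames[1] = [2..9], ...: each new frame is ready
# after only 2 new samples, which is what shrinks the sampling delay.
```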
As shown in
The audio device 200 also includes a first windowing module 203, which is used to sequentially perform the first windowing processing on a plurality of input data frames of the input audio signal using a first window function. Another advantage of using input data frames that overlap with each other by a frame shift is that a relatively stable signal can be obtained, which is advantageous for audio signals that require windowing processing. The windowing processing can reduce spectral leakage during the time domain-to-frequency domain and frequency domain-to-time domain conversions which are needed for frequency-domain signal processing.
As shown in
The audio device 200 also includes a second windowing module 211, which is used to sequentially perform the second windowing processing on the plurality of output data frames of the output audio signal using a second window function. The second windowing processing, together with the first windowing processing performed by the first windowing module 203, is described in further detail below in conjunction with examples.
After being processed by the second windowing module 211, the output audio signal may be sent to the audio playback module 213 and be played back by it to a user of the audio device 200 for listening. It can be understood that in the output audio signal, there is a predetermined frame shift between the start positions of two adjacent output data frames, and the size of the predetermined frame shift is less than the frame length. In some embodiments, each output data frame may include N segments, where N is an integer not less than 2, and the size of the frame shift may be equal to 1/N of the frame length. Since a new output data frame is provided to the audio playback module 213 after each interval equal to the frame shift, the audio playback delay is substantially reduced to a size same as the frame shift. For example, if the frame shift is 1/N of the frame length, the audio playback delay can be reduced to 1/N of the frame length.
As shown in
For the plurality of input data frames included in the input audio signal, the first windowing module may sequentially perform the first windowing processing using the first window function. Referring to
In the embodiment shown in
The values of the first window function 301 at the start point 301a and the end point 301b are both zero, which can effectively suppress spectrum leakage. The first weighting factor in the intermediate region 307 determines how much audio information can be retained in the input data frame after the first windowing processing. In some embodiments, the first weighting factor may be 1; that is, the audio information of each input data frame aligned with the intermediate region 307 is not attenuated during the first windowing processing. In some other embodiments, the first weighting factor may also take another value, e.g., a value ranging from 0.5 to 1. In practical applications, the intermediate region 307 may be made as long as possible. In the example shown in
As mentioned above, the starting function portion in the starting region 303 varies from 0 at the start point 301a to the first weighting factor (e.g., 1) at a position adjacent to the intermediate region 307, while the ending function portion in the ending region 305 varies from the first weighting factor (e.g., 1) at another position adjacent to the intermediate region 307 to 0 at the end point 301b. The starting function portion and the ending function portion may have profiles identical or similar to those of some existing window functions. In the embodiment shown in
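One possible first window function with the structure described above can be sketched as follows. The sine/cosine ramp shape is an assumption chosen for illustration; the text only requires the 0 → a → 0 profile with a flat intermediate portion.

```python
import numpy as np

# A sketch of a first window function: a starting ramp from 0 up to the
# first weighting factor a, a flat intermediate portion equal to a, and
# an ending ramp back down to 0.
def first_window(frame_len: int, ramp_len: int, a: float = 1.0) -> np.ndarray:
    m = np.arange(ramp_len) + 0.5
    rise = np.sin(np.pi * m / (2 * ramp_len))    # starting function portion
    fall = np.cos(np.pi * m / (2 * ramp_len))    # ending function portion
    middle = np.ones(frame_len - 2 * ramp_len)   # intermediate function portion
    return a * np.concatenate([rise, middle, fall])

w1 = first_window(frame_len=256, ramp_len=64, a=1.0)
# w1 starts near 0, holds the first weighting factor in the intermediate
# region, and decays back toward 0 at the end point.
```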
After the sequential windowing processing on the input data frames, these input data frames can be converted from the time domain to the frequency domain and then be processed in the frequency domain. After a further frequency domain-to-time domain conversion, the signal resulting from the frequency-domain signal processing forms an output audio signal having a plurality of output data frames. The second windowing module may perform the second windowing processing on these output data frames in sequence using a second window function. Referring to
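The conversion chain described above can be sketched per frame as follows. The FFT-based conversion and the identity gain are assumptions standing in for the predetermined signal processing, which the text does not specify.

```python
import numpy as np

# Minimal sketch of the per-frame conversion chain: time domain ->
# frequency domain -> (frequency-domain processing) -> time domain.
def process_frame(frame: np.ndarray) -> np.ndarray:
    spectrum = np.fft.rfft(frame)                # to the frequency domain
    spectrum *= 1.0                              # placeholder gain standing in
                                                 # for the signal processing
    return np.fft.irfft(spectrum, n=len(frame))  # back to the time domain

frame = np.random.default_rng(0).standard_normal(256)
out = process_frame(frame)
# With an identity gain, the round trip reproduces the windowed frame.
```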
The window corresponding to the second window function 311 may be divided into a suppression region 313 starting at a start point 311a, an output region 315 ending at a point 311b, and a compensation region 317 located between the suppression region 313 and the output region 315. The suppression region 313 has a suppression function portion for suppressing the data output in the output data frames aligned with this region. In some embodiments, the suppression function portion may be set equal to 0 over the entire length of the suppression region 313. In other words, data in the output data frame aligned with the suppression region 313 may not be sent to the audio playback module and thus is not played back to the user of the audio device after the second windowing processing. In some other embodiments, the suppression function portion may also have other function curves, which generally vary from substantially 0 at the start point 311a to a certain weighting value, for example a value less than 1. It will be understood that, since the suppression function portion is used to suppress data output, the length of the suppression region is generally complementary to the length of the portion of the output data frame expected to be output from the audio device. In the example shown in
The length of the output region 315 is equal to the length of the ending region 305 of the first window function 301, so that the processing of the output data frame by the second window function 311 in the output region 315 corresponds substantially to the processing of the input data frame by the first window function in the ending region 305. Accordingly, the second window function 311 has an output function portion located in the output region 315, which changes from the compensation function portion at a position adjacent to the compensation region 317 to 0 at the end point 311b. The second window function 311 also has a compensation function portion located in the compensation region 317, which is used to provide signal weighting associated with the output function portion and to compensate for a difference in signal weighting between the ending function portion and the first weighting factor; it changes from the suppression function portion at a position adjacent to the suppression region 313 to the output function portion at another position adjacent to the output region 315. For example, the compensation function portion is the quotient of the product of the ending function portion and the output function portion divided by the first weighting factor. In the case where the first weighting factor is equal to 1, the compensation function portion is simply the product of the ending function portion in the ending region 305 and the output function portion. Specifically, as shown in
It should be noted that, when being output, each segment of an output data frame may correspond to segments in the adjacent output data frames, and therefore these corresponding segments are superimposed during the superimposition operation and output together. For example, the third segment of the (i+2)-th output data frame corresponds to the fourth segment of the (i+1)-th output data frame in
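The two-window scheme and the superimposed output described above can be sketched end to end under illustrative assumptions: N = 4 equal segments per frame, a frame shift H equal to one segment, a first weighting factor a = 1, sine/cosine ramps, and identity signal processing. The compensation portion is built as the product of the ending and output portions divided by a, mirrored so that it lines up with the output portion of the previous frame during superimposition; the mirroring is an assumption of this sketch.

```python
import numpy as np

N, H = 4, 64
L = N * H                                    # frame length
a = 1.0                                      # first weighting factor

m = np.arange(H) + 0.5
rise = np.sin(np.pi * m / (2 * H))           # starting function portion
fall = np.cos(np.pi * m / (2 * H))           # ending function portion

# First window: starting ramp, flat intermediate portion, ending ramp.
w1 = a * np.concatenate([rise, np.ones(L - 2 * H), fall])

# Second window: suppression zeros, compensation portion, output portion.
out_portion = fall                           # output function portion
comp = (fall * out_portion)[::-1] / a        # compensation function portion
w2 = np.concatenate([np.zeros(L - 2 * H), comp, out_portion])

x = np.ones(20 * H)                          # constant test input
y = np.zeros_like(x)
for k in range((len(x) - L) // H + 1):
    t = k * H
    frame = x[t:t + L] * w1                  # first windowing
    # ... predetermined signal processing would run here (identity) ...
    y[t:t + L] += frame * w2                 # second windowing + superimposing
# In steady state, each output sample is covered by exactly two weighted
# segments (the output portion of one frame and the compensation portion
# of the next) whose combined weights sum to 1, reproducing the input.
```

With these ramps the output-portion weight is cos² and the aligned compensation weight is sin², so the superimposed segments sum exactly to the original signal, illustrating why the compensation portion must account for both the ending function portion and the first weighting factor.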
It will be appreciated that the reason why the superimposed output of the output data frames shown in
In the example shown in
Thus, assuming that the length of both the starting and ending regions is equal to L/N, where L is the length of an input data frame or an output data frame and N is a positive integer greater than 2, then the first window function w1(n) in
The second window function w2(n) in
Thus, the first window function w1′(n) in
The second window function w2′(n) in
It will be appreciated that
It should be noted that in the above embodiments of the present application, the input data frame and the output data frame both include N equal-length segments for purpose of description, and the frame shift between adjacent data frames is equal to the length of one segment. In some other embodiments, the input data frame and the output data frame may have the same or a different number of segments, for example the input data frame may have M segments and the output data frame may have N segments, where M and N are positive integers greater than 2, and M may be equal to N or not equal to N. In some embodiments, at least a portion of the M segments may have unequal lengths, and/or at least a portion of the N segments may have unequal lengths. Furthermore, the frame shifts between adjacent input data frames as well as adjacent output data frames should be equal to each other, which enables the processing of the output data frames using the compensation function portion of the second window function, in order to compensate for the difference in signal weighting between the ending function portion of the first window function and the first weighting factor. For example, the frame shift should be equal to the length of the last input segment out of the M segments of the input data frame, and should be equal to the length of the last output segment out of the N segments of the output data frame.
In some embodiments, the present application also provides a computer program product having a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium includes computer-executable instructions for performing the steps in the method embodiment as shown in
Embodiments of the invention can be implemented by hardware, software, or a combination of software and hardware. The hardware part may be implemented using dedicated logic; the software part may be stored in memory and executed by an appropriate instruction execution system, such as a microprocessor or specially designed hardware. A person skilled in the art will appreciate that the devices and methods described above can be implemented using computer-executable instructions and/or included in processor control code, such as code provided on a carrier medium such as a disk, CD or DVD-ROM, a programmable memory such as a read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The devices and their modules of the present invention can be implemented by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices, or by software executed by various types of processors, or by a combination of the above hardware circuits and software, such as firmware.
It should be noted that although several steps or modules of the audio signal processing method, device and storage medium are mentioned in the above detailed description, this division is only exemplary and not mandatory. In fact, according to embodiments of the present application, features and functions of two or more modules described above may be specified in a single module. Conversely, the features and functions of one module described above may be further divided to be specified by a plurality of modules.
Other variations to the disclosed embodiments can be understood and implemented by a person skilled in the art by studying the specification, the disclosure, the drawings, and the appended claims. In the claims, the word “comprising” does not exclude other elements or steps, and the words “one” and “a” do not exclude the plural. In the practical application of this application, a single part may perform the functions of more than one technical feature recited in the claims. Any reference number in the claims should not be construed as limiting the scope.
Number | Date | Country | Kind
---|---|---|---
202011072173.6 | Oct 2020 | CN | national
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/CN2021/122630 | 10/8/2021 | WO |