This application claims the benefit of Taiwan application Serial No. 100108830, filed Mar. 15, 2011, the subject matter of which is incorporated herein by reference.
1. Field of the Invention
The invention relates in general to an audio time stretch method and associated apparatus, and more particularly to a method for audio time stretch by utilizing audio data with low energy and associated apparatus.
2. Description of the Related Art
Internet real-time audio/video transmission techniques, e.g., Voice over Internet Protocol (VoIP), offer people immediate and realistic multimedia services, and are thus one of the most important research and development targets for information technology developers.
In Internet real-time audio/video transmission, a transmitting end samples, digitalizes and encodes audio to be transmitted into a plurality of digital audio data each corresponding to an amplitude sample of the audio. A certain number of audio data are packaged in an Internet packet, which is transmitted to a receiving end. Upon receiving the packet at the receiving end, the packet is de-packetized, decoded and demodulated to the original digital audio data. The digital audio data are digital-to-analog converted to restore the original analog audio data that are then played.
At the transmitting end, each audio data corresponds to a predetermined sampling time sequence. Therefore, at the receiving end, it is essential that the audio data be digital-to-analog converted according to the same sampling time sequence, so as to reconstruct the audio to be transmitted by the transmitting end. In order to perform digital-to-analog conversion according to the predetermined time sequence, the receiving end needs to provide the audio data to the digital-to-analog converting mechanism according to a specific time sequence. However, since the audio data are obtained from the packets, the quality of audio played at the receiving end is undesirably affected in the event that the time sequence of the packets transmitted to the receiving end is irregular.
The time sequence of packets transmitted in the Internet real-time audio/video transmission is in fact affected by various factors, e.g., jitter and clock drift. When the packets are transmitted via the Internet, the packets arrive at the receiving end after being routed through different paths due to Internet protocols, such that the packets do not arrive at the receiving end according to the time sequence based on which they are transmitted—such is referred to as “jitter”. Further, different reference clocks utilized by the transmitting end and the receiving end may also lead to differences in the packets transmitted. For example, suppose a packet length according to a predetermined protocol is 10 ms, the transmitting end transmits an audio packet every 10.01 ms, and the receiving end plays a packet every 9.99 ms. In a period during which 100 packets are transmitted, an acknowledgement time difference between the two ends reaches as high as 2 ms—such is referred to as “clock drift”.
At the receiving end, in order to provide audio data to the digital-to-analog conversion mechanism according to a predetermined time sequence, audio time stretch is required by the time sequence. When the receiving end fails to in time acquire the audio data from the packets, additional audio data needs to be inserted; in contrast, the receiving end removes/discards a certain amount of audio data when the packets provide more audio data than the receiving end can buffer.
However, inappropriate time stretch may degrade the quality of audio playback such that noticeable audio imperfections are observed by a listener at the receiving end.
The present invention discloses a method for audio time stretch comprises receiving a plurality of audio data, calculating an energy level according to amplitudes of the audio data, and selectively performing a waveform search for the audio data according to the energy level. Preferably, the waveform search is performed when the energy level is lower than a threshold. Preferably, a plurality of third audio data among the audio data are selected to be removed according to waveform similarities in the audio data. Upon identifying the removable audio data, a removable flag is set as an enable value. A plurality of fourth audio data among the audio data are selected as addible audio data according to waveform similarities. An addible flag is set as an enable value upon identifying the addible audio data.
When providing the audio data to a digital-to-analog conversion mechanism, an audio repository is checked. When the audio repository is greater than a water level and the removable flag matches the enable value, the removable audio data are removed from the audio data. Alternatively, when the audio repository is lower than the water level and the addible flag matches the enable value, the addible audio data are inserted into the audio data.
Preferably, the threshold is adjustable by a feedback mechanism. To process another plurality of second audio data after having outputted the above audio data, the threshold is updated according to the energy level of the above audio data. An energy level of the second audio data is compared with the updated threshold to selectively perform the waveform search.
The present invention further discloses an apparatus comprising an energy level module, a waveform search module, a determining module, a threshold module, a flag register and a buffer control module. The energy level module calculates a corresponding energy level according to amplitudes of a plurality of audio data. The determining module determines whether the waveform search module performs a waveform search among the audio data according to the energy level. Preferably, when an energy level of a predetermined amount of audio data is greater than a threshold, the waveform search module stops the waveform search among the predetermined amount of audio data. When the energy level is smaller than the threshold, the waveform search module performs the waveform search among the predetermined number of audio data, and identifies removable audio data and addible data from the predetermined amount of audio data according to waveform similarities. Further, a removable flag and an addible flag in the flag register are respectively set as an enable value.
The buffer control module checks an audio repository. When the audio repository is greater than a water level and the removable flag matches the enable value, the buffer control module removes the removable audio data from the predetermined number of audio data. Alternatively, when the audio repository is lower than the water level and the addible flag matches the enable value, the buffer control module inserts the addible audio data into the predetermined number of audio data.
The threshold module provides the threshold, and updates the energy level for a current audio data according to the energy level of a previous audio data.
The above and other aspects of the invention will become better understood with regard to the following detailed description of the preferred but non-limiting embodiments. The following description is made with reference to the accompanying drawings.
In Step 102, a plurality of audio data as input are received. For example, the plurality of audio data are provided by a de-packetizing/decoding /demodulating mechanism in the receiving end. For example, the plurality of audio data are obtained from a same packet, and are pulse code modulation (PCM) audio data.
In Step 104, a corresponding energy level B of the audio data is calculated according to amplitudes of the audio data. For example, the energy level B is calculated according to the RMS of the amplitudes of the audio data.
In Step 106, the energy level B is compared with a threshold A. Step 108 is performed when the energy level B is smaller than the threshold A, or else Step 114 is performed.
In Step 108, a waveform search is performed. For example, a first number of audio data as removable audio data and a second number of addible audio data are selected from the plurality of audio data. The removable audio data and the addible audio data may be the same or different, and the first number and the second number may be the same or different. Preferably, the waveform search may be performed according to the waveform similarity based synchronized overlap-add (WSOLA) algorithm or similar derived algorithms to identify the removable and addible audio data. Among the audio data, a set of audio data may serve as the removable audio data when the waveform of the set of audio data is similar to that of a neighboring set of audio data. When the set of audio data is removed from the audio data, a count of the audio data is decreased without changing a pitch to reduce a time period of the audio data. Based on similar principles, the addible audio data are identified to increase the count of the audio data without changing the pitch to lengthen the time period of the audio data.
In Step 110A, a position and/or start and end points of the removable audio data are tagged, and a flag removeFlag (i.e., the removable flag) is set as logic true (i.e., an enable value, indicated as True in
In Step 1106, the method proceeds to Step 114 when the flag removeFlag is logic true. Other additional processing steps (not shown) can be performed when the flag removeFlag is set as logic true. For example, parameters of the waveform search are modified to iterate the waveform search in Step 108, or the removable audio data are identified according to other principles.
In Step 112A, when the addible audio data are identified, a position and/or start and end points of the addible audio data are tagged, and another flag addFlag (i.e., the addible flag) is set as logic true.
In Step 112B, the method proceeds to Step 114 when the flag add Flap is logic true.
In Step 116, an audio repository is checked to determine whether a count of the audio data being buffered satisfies a time sequence of a digital-to-analog conversion mechanism. When the audio repository is normal, Step 122 is performed, and the flags removeFlag and addFlag are reset to logic false. In contrast, when the audio repository is abnormal and encounters overflow or underflow, Step 118 or 120 is performed according to statuses of the flags removeFlag and addFlag. For example, when the audio repository is greater than a predetermined water level and the flag removeFlag is logic true, Step 118 is performed; when the audio repository is lower than the water level and the the flag addFlag is logic true, Step 120 is performed. A repository greater than the water level indicates a count of the audio data is excessive so that a part of the audio data needs to be removed. When the flag removeFlag is logic true, it means that the removable audio data are identified from the original audio data by Step 110A, so as to perform Step 118. When the flag removeFlag is not logic true, other additional processing steps (not shown) may be performed. For example, the removable audio data are identified according to other principles. Further, a repository lower than the water level means the count of the audio data falls short so that the count of the audio data needs to be increased. When the flag addFlag is logic true, it indicates that the addible audio data from the original audio data are identified, and Step 120 is performed.
In Step 118, the removable audio data are selectively removed from the original audio data. For example, the removable audio data are selectively removed according to the tags set in Step 110A to reduce the time period of the audio data.
In Step 120, the addible audio data are inserted into the original audio data. For example, the addible audio data are inserted according to the tags in Step 112A to lengthen the time period of the audio data.
In Step 122, the audio data are outputted. For example, the audio data are outputted according to a digital-to-analog conversion mechanism (not shown) at the receiving end.
In Step 124, when providing the threshold A for the audio data, the threshold A may be updated according to one or more previous audio data (e.g., an energy level thereof). By appropriately adjusting the threshold A, a minimal overall energy level of the audio is reflected by the threshold A to correctly distinguish the voice intervals between the syllables. For example, when buffering the (n-1)th audio data, supposing a corresponding energy level B[n-1] is smaller than a current threshold A[n-1], a threshold A[n] smaller than the threshold A[n-1] is applied for the (n)th audio data. Conversely, supposing the energy level B[n-1] is greater than the threshold A[n-1], the threshold A[n] equal to the threshold A[n-1] is provided. However, in the event that the energy level of a continuous number of audio data is greater than the threshold A, the threshold A may be increased when updating the threshold A. It is known to a person skilled in the art that other approaches for dynamically adjusting the threshold A may be applied so that the threshold A is given adequate discernment.
It is observed from Step 106 that, the present invention utilizes a period having lower energy level and volume in the audio to perform audio time stretch, so that audio quality imperfections due to time stretch are masked by parts that are likely to stay unnoticed from a listener and thus reduce audio quality degradation resulted from time stretch.
The buffer control module 20 checks an audio repository. When the audio repository is greater than a water level and the flag removeFlag is logic true, the buffer control module 20 selectively removes the removable audio data from the audio data. In contrast, when the audio repository is lower than the water level and the flag addFlag is logic true, the buffer control module 20 selectively inserts the addible audio data into the audio data.
The threshold module 14 is capable of updating the threshold A for the current audio data according to one or more previous audio data (e.g., the energy level thereof). The apparatus 10 is implemented in the receiving end of Internet real-time audio/video transmission to receive digital audio data via a de-packetizing/decoding/demodulating mechanism (not shown) and output the buffered audio data to a digital-to-analog conversion mechanism (not shown). The apparatus 10 may be implemented by software, firmware and/or hardware.
In conclusion, in the present invention, audio time stretch is performed according to an energy level, and parts with lower energy level and volume are utilized to perform time stretch, so that effects due to time stretch are likely to stay unnoticed to a listener to effectively reduce audio quality degradation resulted from time stretch. Although the Internet real-time audio/video transmission is take as an example in the foregoing description, the present invention is applicable to various applications where audio time stretch is required. For example, the present invention may be applied to applications of language learning and conversions of speech to text to accelerate or delay a speech speed without changing a pitch.
While the invention has been described by way of example and in terms of the preferred embodiments, it is to be understood that the invention is not limited thereto. On the contrary, it is intended to cover various modifications and similar arrangements and procedures, and the scope of the appended claims therefore should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements and procedures.
Number | Date | Country | Kind |
---|---|---|---|
100108830 | Mar 2011 | TW | national |