In many audio applications, an audio signal may be modified or processed to achieve a desired characteristic or quality. One characteristic of an audio signal that is frequently modified is its speed. Sounds are usually recorded at the normal speed and frequency at which the source produces them. When the playback speed of the signal is changed, however, the frequency changes as well, which is heard as a change in pitch. For example, if a woman's voice is recorded at normal speed and then played back at a slower rate, it will resemble a man's voice, that is, a voice at a lower frequency. Similarly, if a man's voice is recorded at normal speed and then played back at a faster rate, it will resemble a woman's voice, that is, a voice at a higher frequency.
Some applications may require that an audio signal be played at a slower rate while maintaining the same frequency, i.e., keeping the pitch of the sound the same as when the signal is played back at normal speed.
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of ordinary skill in the art through comparison of such systems with the present invention as set forth in the remainder of the present application with reference to the drawings.
Presented herein are system(s) and method(s) for frequency domain audio speed up or slow down, while maintaining pitch.
In one embodiment, there is presented a method for changing the speed of an encoded audio signal. The method comprises receiving the encoded audio signal; retrieving frames from the encoded audio signal; transforming the frames of the audio signal into a frequency domain, wherein each of said frames is associated with a plurality of initial phases and a corresponding plurality of ending phases; and replacing the initial phases of at least one of the frames with the ending phases of another frame.
In another embodiment, there is presented a machine-readable storage. The machine-readable storage has stored thereon a computer program having at least one code section that changes the speed of an encoded audio signal. The at least one code section is executable by a machine, causing the machine to receive the encoded audio signal; retrieve frames from the encoded audio signal; transform the frames of the audio signal into a frequency domain, wherein each of said frames is associated with a plurality of initial phases and a corresponding plurality of ending phases; and replace the initial phases of at least one of the frames with the ending phases of another frame.
In another embodiment, there is presented a system that changes the speed of an encoded audio signal. The system comprises a first circuit, a second circuit, a third circuit, and a fourth circuit. The first circuit receives the encoded audio signal. The second circuit retrieves frames from the encoded audio signal. The third circuit transforms the frames of the audio signal into a frequency domain, wherein each of said frames is associated with a plurality of initial phases and a corresponding plurality of ending phases. The fourth circuit replaces the initial phases of at least one of the frames with the ending phases of another frame.
These and other features and advantages of the present invention may be appreciated from a review of the following detailed description of the present invention, along with the accompanying figures in which like reference numerals refer to like parts throughout.
The present invention relates generally to audio decoding. More specifically, this invention relates to decoding of audio signals to obtain an audio signal at a different speed while maintaining the same pitch as the original audio signal. Although aspects of the present invention are presented in terms of a generic audio signal, it should be understood that the present invention may be applied to many other types of systems.
The frames 213 (F0 . . . Fn) are then replicated or skipped at a rate consistent with the desired speed. For example, if the desired audio speed is half the original speed, then each frame is repeated, resulting in frames 212 (FR0 . . . FRm) of 1024 samples, where FR0=FR1=F0, FR2=FR3=F1, etc. If the desired audio speed is twice the original speed, then every other frame is skipped, resulting in frames 212 (FR0 . . . FRm) of 1024 samples, where FR0=F0, FR1=F2, FR2=F4, etc. The value of m depends on the desired speed. In the example where the desired audio speed is half the original speed, m=2n. If, for example, the desired audio speed is two-thirds of the original speed, then every other frame is repeated, so frames 213 (F0 . . . Fn) result in frames (FR0 . . . FRm), where FR0=F0, FR1=FR2=F1, FR3=F2, FR4=FR5=F3, etc., and m=3n/2. If, for example, the desired audio speed is 1.5 times the original speed, then every third frame is skipped. Accordingly, frames 213 (F0 . . . Fn) result in frames (FR0 . . . FRm), where FR0=F0, FR1=F1, FR2=F3, FR3=F4, FR4=F6, etc.
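The replicate-or-skip rule can be expressed as an index schedule that maps each output frame FRk to the decoded frame it copies. The following is a minimal sketch, assuming the speed is supplied as a ratio (for example Fraction(1, 2) for half speed); the function name frame_schedule is hypothetical. For half, double, and 1.5 times speed it reproduces the patterns listed above. For two-thirds speed it repeats F0, F2, ... rather than F1, F3, ..., which is an equally valid choice of which frame in each pair to repeat.

```python
from fractions import Fraction
from math import floor


def frame_schedule(num_frames, speed):
    """Return, for each output frame FRk, the index of the input frame it copies.

    Output frame k copies input frame floor(k * speed): a speed below 1
    repeats frames and a speed above 1 skips them. Fraction keeps the
    arithmetic exact, so rounding error never adds or drops a frame.
    """
    speed = Fraction(speed)
    if speed <= 0:
        raise ValueError("speed must be positive")
    schedule = []
    k = 0
    while True:
        src = floor(k * speed)
        if src >= num_frames:  # ran past the last decoded frame Fn
            break
        schedule.append(src)
        k += 1
    return schedule


# Half speed: FR0=FR1=F0, FR2=FR3=F1, ...
print(frame_schedule(4, Fraction(1, 2)))  # [0, 0, 1, 1, 2, 2, 3, 3]
# 1.5 times speed: every third frame (F2, F5, ...) is skipped
print(frame_schedule(7, Fraction(3, 2)))  # [0, 1, 3, 4, 6]
```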
A window function WF is then applied to frames 212 (FR0 . . . FRm) to “smooth out” the samples and ensure that the resulting signal does not have any artifacts that may result from repeating each frame. The window function results in the windowed frames 214 (WF0 . . . WFL) of 1024 samples. The window function WF can be one of many widely known and used window functions, or can be designed to accommodate the requirements of the system.
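The description leaves the window function WF open. The sketch below assumes a periodic Hann window, one widely used choice, and applies it to a 1024-sample frame. The names hann_window and apply_window are illustrative, not part of the described system.

```python
import math


def hann_window(length):
    """Periodic Hann window: 0 at the first sample, 1 at the centre."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def apply_window(frame, window):
    """Multiply a frame FRk sample by sample with the window, giving WFk."""
    if len(frame) != len(window):
        raise ValueError("frame and window lengths differ")
    return [s * w for s, w in zip(frame, window)]


FRAME_LEN = 1024  # samples per frame, as in the description
window = hann_window(FRAME_LEN)
windowed = apply_window([1.0] * FRAME_LEN, window)  # toy constant frame
print(windowed[0], windowed[FRAME_LEN // 2])  # 0.0 1.0
```

Tapering each frame toward zero at its edges is what limits the discontinuities that would otherwise appear where a repeated frame is joined to itself.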
The Discrete Fourier Transform (DFT) is then applied to the windowed frames 214. Application of the DFT to the windowed frames 214 results in frequency domain windowed samples 216. The frequency domain windowed samples 216 are generally a collection of amplitudes w(f0, f1, f2, . . . ) and initial phases Θ(f0, f1, f2, . . . ) corresponding to a plurality of frequencies. Accordingly, the frequency domain windowed samples 216 can be expressed as:
Each of the plurality of frequencies also corresponds to an ending phase Ψ(f0, f1, f2, . . . ). The ending phases Ψ(f0, f1, f2, . . . ) are the phases of the corresponding frequencies at the ending boundary of the frame F, and are generally a function of the initial phases Θ(f), the frequency f, and the length of time represented by the frame.
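The amplitudes, initial phases, and ending phases can be computed as sketched below. This is a minimal sketch, assuming each frequency component behaves as a steady sinusoid across the frame, so that its ending phase is Ψ(f) = Θ(f) + 2πfT, wrapped to [−π, π], for a frame lasting T seconds. The naive DFT and the names dft_polar and ending_phase are illustrative only.

```python
import cmath
import math


def dft_polar(frame):
    """Naive O(N^2) DFT of one windowed frame.

    Returns (amplitudes, initial_phases), one entry per frequency bin k.
    """
    n = len(frame)
    amplitudes, phases = [], []
    for k in range(n):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        amplitudes.append(abs(acc))
        phases.append(cmath.phase(acc))
    return amplitudes, phases


def ending_phase(theta, freq_hz, duration_s):
    """Phase of a steady sinusoid after duration_s seconds, wrapped to [-pi, pi]."""
    return math.remainder(theta + 2.0 * math.pi * freq_hz * duration_s, 2.0 * math.pi)


# Toy 8-sample frame holding one cycle of a cosine (bin 1, initial phase 0)
amps, thetas = dft_polar([math.cos(2 * math.pi * t / 8) for t in range(8)])

# A 1 kHz component over a 1024-sample frame at 48 kHz advances 21 1/3 cycles
print(round(ending_phase(0.0, 1000.0, 1024 / 48000), 4))  # 2.0944
```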
The initial phases Θ1(f0, f1, f2, . . . ) of frame F1 are replaced, frequency by frequency, with the ending phases Ψ0(f0, f1, f2, . . . ) of frame F0. Because the ending phases Ψ1(f0, f1, f2, . . . ) depend on the initial phases, replacing the initial phases Θ1(f0, f1, f2, . . . ) with the ending phases Ψ0(f0, f1, f2, . . . ) of frame F0 results in a new set of ending phases Ψ1′(f0, f1, f2, . . . ). The initial phases Θ2(f0, f1, f2, . . . ) of frame F2 are then replaced with the new ending phases Ψ1′(f0, f1, f2, . . . ) of frame F1, and so on for subsequent frames. The foregoing process results in a new set of frequency domain windowed samples 218 that can be expressed as:
The Inverse DFT (IDFT) is applied to the frequency domain windowed samples 218, resulting in windowed frames 220. The windowed frames 220 (WF0 . . . WFL) of 1024 samples are then run through a digital-to-analog converter (DAC) to get an analog signal 211. The analog signal 211 is a longer version of the analog input signal 111 of
At a next block 425, a window function WF is applied to the frames (FR0 . . . FRm) to “smooth out” the samples and ensure that the resulting signal does not have any artifacts that may result from repeating each frame. The window function results in the windowed frames (WF0 . . . WFL). The window function WF can be one of many widely known and used window functions, or can be designed to accommodate the design requirements of the system.
The Discrete Fourier Transform (DFT) is then applied (427) to the windowed frames 214. Application of the DFT to the windowed frames 214 results in frequency domain windowed samples 216. The frequency domain windowed samples 216 are generally a collection of amplitudes w(f0, f1, f2, . . . ) and initial phases Θ(f0, f1, f2, . . . ) corresponding to a plurality of frequencies. Accordingly, the frequency domain windowed samples 216 can be expressed as:
Each of the plurality of frequencies also corresponds to an ending phase Ψ(f0, f1, f2, . . . ). The ending phases Ψ(f0, f1, f2, . . . ) are the phases of the corresponding frequencies at the ending boundary of the frame F, and are generally a function of the initial phases Θ(f), the frequency f, and the length of time represented by the frame.
The initial phases Θ1(f0, f1, f2, . . . ) of frame F1 are replaced (429), frequency by frequency, with the ending phases Ψ0(f0, f1, f2, . . . ) of frame F0. Because the ending phases Ψ1(f0, f1, f2, . . . ) depend on the initial phases, replacing the initial phases Θ1(f0, f1, f2, . . . ) with the ending phases Ψ0(f0, f1, f2, . . . ) of frame F0 results in a new set of ending phases Ψ1′(f0, f1, f2, . . . ). The initial phases Θ2(f0, f1, f2, . . . ) of frame F2 are then replaced with the new ending phases Ψ1′(f0, f1, f2, . . . ) of frame F1, and so on for subsequent frames. The foregoing process results in a new set of frequency domain windowed samples 218 that can be expressed as:
The Inverse DFT (IDFT) is applied (431) to the frequency domain windowed samples 218, resulting in windowed frames 220. The windowed frames (WF0 . . . WFL) are then sent through the DAC at a next block 433 to produce the audio signal at the desired slower or faster speed. The pitch is the same as the original because the playback sampling rate is the same as that of the original signal.
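The DFT, phase replacement, and IDFT steps (blocks 427 to 431) can be sketched end to end for a list of windowed frames. This is a minimal sketch under stated assumptions, not the described decoder: each DFT bin k is treated as a sinusoid at the bin-centre frequency k·fs/N over a frame lasting N/fs seconds, and dft, idft, and propagate_phases are hypothetical helper names. Under that assumption the advance over one frame is a whole number of cycles, so Ψ equals Θ modulo 2π and the sketch effectively carries the previous frame's phases forward; a per-bin frequency estimate would change the ending phases.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) forward DFT."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(x))
            for k in range(n)]


def idft(spec):
    """Naive O(N^2) inverse DFT, keeping the real part of the result."""
    n = len(spec)
    return [sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spec)).real / n
            for t in range(n)]


def propagate_phases(frames, sample_rate):
    """Give each frame's bins the ending phases of the previous (modified) frame.

    Assumes bin k stands for frequency k * sample_rate / N and that a frame
    spans N / sample_rate seconds. Bins 1 .. ceil(N/2)-1 are rewritten and
    their mirror bins receive the complex conjugate, so the IDFT stays real;
    the DC bin (and the Nyquist bin, when N is even) is left unchanged.
    """
    n = len(frames[0])
    duration = n / sample_rate
    bins = range(1, (n + 1) // 2)
    output, prev_ending = [], None
    for frame in frames:
        spec = dft(frame)
        thetas = [cmath.phase(c) for c in spec]
        if prev_ending is not None:
            for k in bins:
                thetas[k] = prev_ending[k]  # theta_i(f) := psi'_(i-1)(f)
        # Ending phases follow from the (possibly replaced) initial phases.
        prev_ending = [math.remainder(th + 2.0 * math.pi * (k * sample_rate / n) * duration,
                                      2.0 * math.pi)
                       for k, th in enumerate(thetas)]
        new_spec = list(spec)
        for k in bins:
            new_spec[k] = cmath.rect(abs(spec[k]), thetas[k])  # keep amplitude w(f)
            new_spec[n - k] = new_spec[k].conjugate()
        output.append(idft(new_spec))
    return output


# Two 8-sample frames of the same tone (bin 1, i.e. 6 kHz at 48 kHz) with different phases
n = 8
f0 = [math.cos(2 * math.pi * t / n) for t in range(n)]
f1 = [math.cos(2 * math.pi * t / n + 1.0) for t in range(n)]
out = propagate_phases([f0, f1], 48000)
```

In this example the second frame's 1-radian phase offset is discarded, so its output matches the first frame, which is the intended effect of carrying phases across the frame boundary.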
Standards such as, for example, MPEG-1, Layer 3 (MPEG stands for Moving Picture Experts Group) have been devised for compressing audio signals. In certain embodiments of the present invention, the audio signal can be compressed in accordance with such standards.
The frames 103 (F0 . . . Fn) are then grouped into windows 105 (W0 . . . Wn), each of which comprises 2048 samples, or two frames, such as, for example, (Wx(0) . . . Wx(2047)) comprising frames (Fx(0) . . . Fx(1023)) and (Fx+1(0) . . . Fx+1(1023)). Each window 105 Wx overlaps the previous window 105 Wx−1 by 50%. Accordingly, the first 1024 samples of a window 105 Wx are the same as the last 1024 samples of the previous window 105 Wx−1. For example, W0=(W0(0) . . . W0(2047))=(F0(0) . . . F0(1023)) and (F1(0) . . . F1(1023)), and W1=(W1(0) . . . W1(2047))=(F1(0) . . . F1(1023)) and (F2(0) . . . F2(1023)). Hence, in the example, W0 and W1 both contain frame F1 (F1(0) . . . F1(1023)).
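The 50% overlap can be illustrated by pairing consecutive frames. This is a minimal sketch with frames held as Python lists; the name overlapping_windows is hypothetical, and no padding frame is added at the end of the signal, so n+1 frames yield n windows.

```python
def overlapping_windows(frames):
    """Group frames into windows Wx = Fx followed by Fx+1 (50% overlap).

    Consecutive windows share one frame, so the first half of W(x+1)
    equals the second half of Wx.
    """
    return [frames[x] + frames[x + 1] for x in range(len(frames) - 1)]


FRAME_LEN = 1024
frames = [[float(i)] * FRAME_LEN for i in range(4)]  # toy frames F0..F3
windows = overlapping_windows(frames)
print(len(windows), len(windows[0]))  # 3 2048
```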
A window function w(t) is then applied to each window 105 (W0 . . . Wn), resulting in sets (wW0 . . . wWn) of 2048 windowed samples 107 such as, for example, (wWx(0) . . . wWx(2047)). A modified discrete cosine transform (MDCT) is then applied to each set (wW0 . . . wWn) of windowed samples 107 (wWx(0) . . . wWx(2047)), resulting in sets (MDCT0 . . . MDCTn) of 1024 frequency coefficients 109 such as, for example, (MDCTx(0) . . . MDCTx(1023)).
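The windowing and MDCT step maps 2N windowed samples to N coefficients. The sketch below assumes a sine window, one of the windows commonly used with the MDCT, since the description does not specify w(t), and uses the textbook direct-form MDCT X[k] = Σ x[t]·cos(π/N·(t + 1/2 + N/2)·(k + 1/2)). It is an O(N²) illustration rather than a fast transform, and the names sine_window and mdct are illustrative.

```python
import math


def sine_window(length):
    """Sine window w(t) for a block of `length` (= 2N) samples."""
    return [math.sin(math.pi * (t + 0.5) / length) for t in range(length)]


def mdct(block):
    """Direct O(N^2) MDCT: 2N windowed samples -> N coefficients."""
    if len(block) % 2:
        raise ValueError("block length must be even")
    n = len(block) // 2
    return [sum(x * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t, x in enumerate(block))
            for k in range(n)]


# Small example (2N = 64) for speed; the description uses 2048 samples -> 1024 coefficients.
window = sine_window(64)
wx = [math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]  # toy window Wx
coeffs = mdct([w * x for w, x in zip(window, wx)])          # MDCTx
print(len(coeffs))  # 32
```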
The sets of frequency coefficients 109 (MDCT0 . . . MDCTn) are then quantized and coded for transmission, forming an audio elementary stream (AES). The AES can be multiplexed with other AESs. The multiplexed signal, known as the Audio Transport Stream (Audio TS), can then be stored and/or transported for playback on a playback device. The playback device can be either local to or remote from the encoder. Where the playback device is remotely located, the multiplexed signal is transported over a communication medium such as, for example, the Internet. The multiplexed signal can also be transported to a remote playback device using a storage medium such as, for example, a compact disk.
During playback, the Audio TS is demultiplexed, resulting in the constituent AES signals. The constituent AES signals are then decoded, yielding the audio signal. During playback, the speed of the signal may be decreased to produce the original audio at a slower speed.
An inverse window function wI(t) is then applied to each set (wW0 . . . wWn) of 2048 windowed samples 207, resulting in windows 205 (W0 . . . Wn), each of which comprises 2048 samples from two frames such as, for example, (Wx(0) . . . Wx(2047)) comprising frames (Fx(0) . . . Fx(1023)) and (Fx+1(0) . . . Fx+1(1023)) as illustrated in
The frames 203 (F0 . . . Fn) are then replicated or skipped at a rate consistent with the desired speed. For example, if the desired audio speed is half the original speed, then each frame is repeated, resulting in frames 202 (FR0 . . . FRm) of 1024 samples, where FR0=FR1=F0, FR2=FR3=F1, etc. If the desired audio speed is twice the original speed, then every other frame is skipped, resulting in frames 202 (FR0 . . . FRm) of 1024 samples, where FR0=F0, FR1=F2, FR2=F4, etc. The value of m depends on the desired speed. In the example where the desired audio speed is half the original speed, m=2n. If, for example, the desired audio speed is two-thirds of the original speed, then every other frame is repeated, so frames 203 (F0 . . . Fn) result in frames (FR0 . . . FRm), where FR0=F0, FR1=FR2=F1, FR3=F2, FR4=FR5=F3, etc., and m=3n/2. If, for example, the desired audio speed is 1.5 times the original speed, then every third frame is skipped. Accordingly, frames 203 (F0 . . . Fn) result in frames (FR0 . . . FRm), where FR0=F0, FR1=F1, FR2=F3, FR3=F4, FR4=F6, etc.
A window function WF is then applied to frames 202 (FR0 . . . FRm) to “smooth out” the samples and ensure that the resulting signal does not have any artifacts that may result from repeating each frame. The window function results in the windowed frames 204 (WF0 . . . WFL) of 1024 samples. The window function WF can be one of many widely known and used window functions, or can be designed to accommodate the requirements of the system.
The Discrete Fourier Transform (DFT) is then applied to the windowed frames 204. Application of the DFT to the windowed frames 204 results in frequency domain windowed samples 206. The frequency domain windowed samples 206 are generally a collection of amplitudes w(f0, f1, f2, . . . ) and initial phases Θ(f0, f1, f2, . . . ) corresponding to a plurality of frequencies. Accordingly, the frequency domain windowed samples 206 can be expressed as:
Each of the plurality of frequencies also corresponds to an ending phase Ψ(f0, f1, f2, . . . ). The ending phases Ψ(f0, f1, f2, . . . ) are the phases of the corresponding frequencies at the ending boundary of the frame F, and are generally a function of the initial phases Θ(f), the frequency f, and the length of time represented by the frame.
The initial phases Θ1(f0, f1, f2, . . . ) of frame F1 are replaced, frequency by frequency, with the ending phases Ψ0(f0, f1, f2, . . . ) of frame F0. Because the ending phases Ψ1(f0, f1, f2, . . . ) depend on the initial phases, replacing the initial phases Θ1(f0, f1, f2, . . . ) with the ending phases Ψ0(f0, f1, f2, . . . ) of frame F0 results in a new set of ending phases Ψ1′(f0, f1, f2, . . . ). The initial phases Θ2(f0, f1, f2, . . . ) of frame F2 are then replaced with the new ending phases Ψ1′(f0, f1, f2, . . . ) of frame F1, and so on for subsequent frames. The foregoing process results in a new set of frequency domain windowed samples 208 that can be expressed as:
The Inverse DFT (IDFT) is applied to the frequency domain windowed samples 208, resulting in windowed frames 210. The windowed frames 210 (WF0 . . . WFL) of 1024 samples are then run through a digital-to-analog converter (DAC) to get an analog signal 201. The analog signal 201 is a longer version of the analog input signal 101 of
At a next block 410, a window function WF is applied to the frames (FR0 . . . FRm) to “smooth out” the samples and ensure that the resulting signal does not have any artifacts that may result from repeating each frame. The window function results in the windowed frames (WF0 . . . WFL). The window function WF can be one of many widely known and used window functions, or can be designed to accommodate the requirements of the system.
The Discrete Fourier Transform (DFT) is then applied (411) to the windowed frames 214. Application of the DFT to the windowed frames 214 results in frequency domain windowed samples 216. The frequency domain windowed samples 216 are generally a collection of amplitudes w(f0, f1, f2, . . . ) and initial phases Θ(f0, f1, f2, . . . ) corresponding to a plurality of frequencies. Accordingly, the frequency domain windowed samples 216 can be expressed as:
Each of the plurality of frequencies also corresponds to an ending phase Ψ(f0, f1, f2, . . . ). The ending phases Ψ(f0, f1, f2, . . . ) are the phases of the corresponding frequencies at the ending boundary of the frame F, and are generally a function of the initial phases Θ(f), the frequency f, and the length of time represented by the frame.
The initial phases Θ1(f0, f1, f2, . . . ) of frame F1 are replaced (412), frequency by frequency, with the ending phases Ψ0(f0, f1, f2, . . . ) of frame F0. Because the ending phases Ψ1(f0, f1, f2, . . . ) depend on the initial phases, replacing the initial phases Θ1(f0, f1, f2, . . . ) with the ending phases Ψ0(f0, f1, f2, . . . ) of frame F0 results in a new set of ending phases Ψ1′(f0, f1, f2, . . . ). The initial phases Θ2(f0, f1, f2, . . . ) of frame F2 are then replaced with the new ending phases Ψ1′(f0, f1, f2, . . . ) of frame F1, and so on for subsequent frames. The foregoing process results in a new set of frequency domain windowed samples 218 that can be expressed as:
The Inverse DFT (IDFT) is applied (413) to the frequency domain windowed samples 218, resulting in windowed frames 220. The windowed frames (WF0 . . . WFL) are then sent through the DAC at a next block 414 to produce the audio signal at the desired slower or faster speed. The pitch is the same as the original because the playback sampling rate is the same as that of the original signal.
The sets of frequency coefficients 109 (MDCT0 . . . MDCTn) of
Additionally, tools including the mono/stereo 313, prediction 315, intensity stereo coupling 317, TNS 319, and filter bank 321 can apply further functions to the sets of frequency coefficients 109 (MDCT0 . . . MDCTn). The gain control 323 transforms the frequency coefficients 109 (MDCT0 . . . MDCTn) into a time-domain audio signal. The gain control 323 transforms the frequency coefficients 109 by applying the IMDCT, the inverse window function, and inverse window overlap as explained above in reference to
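The inverse path performed by the gain control 323 can be sketched as follows. This minimal sketch assumes a sine window for both analysis and synthesis and standard MDCT time-domain aliasing cancellation, in which the "inverse window" step is a second multiplication by the same window followed by overlap-add of adjacent halves. That interpretation, the 2/N scaling, and the names imdct and decode_frames are assumptions rather than details given in the description. A forward mdct is included only to generate test data.

```python
import math


def sine_window(length):
    """Sine window; satisfies w[t]^2 + w[t + N]^2 = 1 for length 2N."""
    return [math.sin(math.pi * (t + 0.5) / length) for t in range(length)]


def _basis(t, k, n):
    """Shared MDCT/IMDCT cosine kernel."""
    return math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))


def mdct(block):
    """Direct MDCT: 2N samples -> N coefficients (used here only to build test data)."""
    n = len(block) // 2
    return [sum(x * _basis(t, k, n) for t, x in enumerate(block)) for k in range(n)]


def imdct(coeffs):
    """Direct IMDCT: N coefficients -> 2N samples, scaled by 2/N for a sine window."""
    n = len(coeffs)
    return [2.0 / n * sum(c * _basis(t, k, n) for k, c in enumerate(coeffs))
            for t in range(2 * n)]


def decode_frames(coeff_sets):
    """IMDCT each set, apply the synthesis window, and overlap-add adjacent halves.

    Window x covers frames Fx and Fx+1, so Fx+1 is the sum of the second half
    of window x and the first half of window x+1; their aliasing terms cancel.
    """
    n = len(coeff_sets[0])
    window = sine_window(2 * n)
    blocks = [[w * y for w, y in zip(window, imdct(c))] for c in coeff_sets]
    return [[a + b for a, b in zip(blocks[x][n:], blocks[x + 1][:n])]
            for x in range(len(blocks) - 1)]


# Round trip with 8-sample frames: encode windows W0..W2 from F0..F3, then decode.
n = 8
frames = [[math.sin(0.3 * (i * n + t)) for t in range(n)] for i in range(4)]
win = sine_window(2 * n)
coeff_sets = [mdct([w * x for w, x in zip(win, frames[i] + frames[i + 1])]) for i in range(3)]
decoded = decode_frames(coeff_sets)  # recovers F1 and F2
```

The first and last frames are not recovered here because each needs a neighbouring window that this toy example does not encode.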
The output of the gain control 323, which is frames (F0 . . . Fn) such as, for example, frames 203 or frames 213, is then sent to the audio processing unit 325 for additional processing, playback, or storage. The audio processing unit 325 either receives an input from a user specifying the speed at which the audio signal should be played or uses a default value for the factor by which the audio signal is slowed at playback. The audio processing unit 325 then processes the audio signal according to the slow-playback factor by replicating the frames (F0 . . . Fn) at a rate consistent with the desired slow rate. For example, if the desired audio speed is half the original speed, then each frame is repeated, resulting in frames (FR0 . . . FRm) such as, for example, frames 202 or frames 212, of 1024 samples, where FR0=FR1=F0, FR2=FR3=F1, etc. The value of m depends on the desired slow rate. In the example where the desired audio speed is half the original speed, m=2n. If, for example, the desired audio speed is two-thirds of the original speed, then every other frame is repeated, so frames (F0 . . . Fn) result in frames (FR0 . . . FRm), where FR0=F0, FR1=FR2=F1, FR3=F2, FR4=FR5=F3, etc., and m=3n/2.
A window function WF is then applied to frames (FR0 . . . FRm) to “smooth out” the samples and ensure that the resulting signal does not have any artifacts that may result from repeating each frame. The window function results in the windowed frames (WF0 . . . WFL) such as, for example, frames 204 or frames 214, of 1024 samples. The window function WF can be one of many widely known and used window functions, or can be designed to accommodate the requirements of the system.
At this point the signal is still in digital form, so the output of the audio processing unit 325 is run through a DAC 327, which converts the digital signal to an analog audio signal to be played through a speaker 329.
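Before the DAC 327, decoded samples are typically held as fixed-point integers. The sketch below assumes floating-point samples nominally in [−1.0, 1.0] and a 16-bit signed DAC input; both the sample format and the helper name to_pcm16 are assumptions, since the description does not specify the DAC's input format.

```python
def to_pcm16(samples):
    """Scale float samples in [-1.0, 1.0] to signed 16-bit integers.

    Out-of-range values are clipped; Python's round() rounds halves to even.
    """
    out = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip overs instead of wrapping around
        out.append(round(s * 32767))
    return out


print(to_pcm16([0.0, 0.5, 1.0, -1.0, 1.7]))  # [0, 16384, 32767, -32767, 32767]
```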
In one embodiment of the present invention, the playback speed is predetermined in the design of the decoder. In another embodiment of the present invention, the playback speed is entered by a user of the decoder and varies accordingly.
The embodiments described herein may be implemented as a board level product, as a single chip, as an application specific integrated circuit (ASIC), or with varying levels of the decoder system integrated with other portions of the system as separate components. The degree of integration of the decoder system will primarily be determined by speed and cost considerations. Because of the sophisticated nature of modern processors, it is possible to utilize a commercially available processor, which may be implemented external to an ASIC implementation. Alternatively, if the processor is available as an ASIC core or logic block, then the commercially available processor can be implemented as part of an ASIC device wherein certain functions can be implemented in firmware.
While the present invention has been described with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the present invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the present invention without departing from its scope. Therefore, it is intended that the present invention not be limited to the particular embodiment disclosed, but that the present invention will include all embodiments falling within the scope of the appended claims.
This application is a continuation of U.S. application Ser. No. 10/803,416, filed Mar. 18, 2004, and is related to Manoj Kumar Singhal, et al. U.S. application Ser. No. 10/803,286 (Attorney Docket No. 15473US01) entitled “System and Method for Time Domain Audio Slow Down, While Maintaining Pitch” filed Mar. 18, 2004, the complete subject matter of which is hereby incorporated herein by reference, in its entirety. This application is also related to Manoj Kumar Singhal, et al. U.S. application Ser. No. 10/803,420 (Attorney Docket No. 15474US01) entitled “System and Method for Time Domain Audio Speed Up, While Maintaining Pitch” filed Mar. 18, 2004, the complete subject matter of which is hereby incorporated herein by reference, in its entirety.
Relation | Number | Date | Country
---|---|---|---
Parent | 10803416 | Mar 2004 | US
Child | 12268013 | | US