The subject matter of this application is related to Russian patent application no. TBD filed as attorney docket no. L09-0721RU1 on the same day as this application, the teachings of which are incorporated herein by reference in their entirety.
1. Field of the Invention
The present invention relates to signal processing, and, more specifically but not exclusively, to techniques for detecting music in an acoustical signal.
2. Description of the Related Art
Music detection techniques that differentiate music from other sounds such as speech and noise are used in a number of different applications. For example, music detection is used in sound encoding and decoding systems to select between two or more different encoding schemes based on the presence or absence of music. Signals containing speech, without music, may be encoded at lower bit rates (e.g., 8 kb/s) to minimize bandwidth without sacrificing quality of the signal. Signals containing music, on the other hand, typically require higher bit rates (e.g., >8 kb/s) to achieve the same level of quality as that of signals containing speech without music. To minimize bandwidth when speech is present without music, the encoding system may be selectively configured to encode the signal at a lower bit rate. When music is detected, the encoding system may be selectively configured to encode the signal at a higher bit rate to achieve a satisfactory level of quality. Further, in some implementations, the encoding system may be selectively configured to switch between two or more different encoding algorithms based on the presence or absence of music. A discussion of the use of music detection in sound encoding systems may be found, for example, in U.S. Pat. No. 6,697,776, the teachings of which are incorporated herein by reference in their entirety.
As another example, music detection techniques may be used in video handling and storage applications. A discussion of the use of music detection in video handling and storage applications may be found, for example, in Minami, et al., “Video Handling with Music and Speech Detection,” IEEE Multimedia, Vol. 5, Issue 3, pgs. 17-25, July-September 1998, the teachings of which are incorporated herein by reference in their entirety.
As yet another example, music detection techniques may be used in public switched telephone networks (PSTNs) to prevent echo cancellers from corrupting music signals. When a consumer speaks from a far end of the network, the speech may be reflected from a line hybrid at the near end, and an output signal containing echo may be returned from the near end of the network to the far end. Typically, the echo canceller will model the echo and cancel the echo by subtracting the modeled echo from the output signal.
If the consumer is speaking at the far end of the network while music-on-hold is playing from the near end of the network, then the echo and music are mixed producing a mixed output signal. However, rather than cancelling the echo, in some cases, the non-linear processing module of the echo canceller suppresses the echo by clipping the mixed output signal and replaces fragments of the mixed output signal with comfort noise. As a result of this improper and unexpected echo canceller operation, instead of music, the consumer may hear intervals of silence and noise while the consumer is speaking into the handset. In such a case, the consumer may assume that the line is broken and terminate the call.
To prevent this scenario from occurring, music detection techniques may be used to detect when music is present, and, when music is present, the non-linear processing module of the echo canceller may be switched off. As a result, echo will remain in the mixed output signal; however, the existence of echo will typically sound more natural than the clipped mixed output signal. A discussion of the use of music detection techniques in PSTN applications may be found, for example, in Avi Perry, “Fundamentals of Voice-Quality Engineering in Wireless Networks,” Cambridge University Press, 2006, the teachings of which are incorporated herein by reference in their entirety.
A number of different music detection techniques currently exist. In general, the existing techniques analyze tones in the received signal to determine whether or not music is present. Most, if not all, of these tone-based music detection techniques may be separated into two basic categories: (i) stochastic model-based techniques and (ii) deterministic model-based techniques. A discussion of stochastic model-based techniques may be found in, for example, Compure Company, “Music and Speech Detection System Based on Hidden Markov Models and Gaussian Mixture Models,” a Public White Paper, http://www.compure.com, the teachings of which are incorporated herein by reference in their entirety. A discussion of deterministic model-based techniques may be found, for example, in U.S. Pat. No. 7,130,795, the teachings of which are incorporated herein by reference in their entirety.
Stochastic model-based techniques, which include Hidden Markov models, Gaussian mixture models, and Bayesian rules, are relatively computationally complex, and as a result, are difficult to use in real-time applications like PSTN applications. Deterministic model-based techniques, which include threshold methods, are less computationally complex than stochastic model-based techniques, but typically have higher detection error rates. Music detection techniques are needed that are (i) not as computationally complex as stochastic model-based techniques, (ii) more accurate than deterministic model-based techniques, and (iii) capable of being used in real-time low-latency processing applications such as PSTN applications.
In one embodiment, the present invention is a processor-implemented method for processing audio signals to determine whether or not the audio signals correspond to music. According to the method, a plurality of tones are identified corresponding to long-duration spectral peaks in a received audio signal (e.g., Sin). A value is generated for a first metric based on number of the identified tones, and a value is generated for a second metric based on duration of the identified tones. A determination is made as to whether or not the received audio signal corresponds to music based on the first and second metric values.
In another embodiment, the present invention is an apparatus comprising a processor for processing audio signals to determine whether or not the audio signals correspond to music. The processor is adapted to identify a plurality of tones corresponding to long-duration spectral peaks in a received audio signal. The processor is further adapted to generate a value for a first metric based on number of the identified tones, and a value for a second metric based on duration of the identified tones. The processor is yet further adapted to determine whether or not the received audio signal corresponds to music based on the first and second metric values.
Other aspects, features, and advantages of the present invention will become more fully apparent from the following detailed description, the appended claims, and the accompanying drawings in which like reference numerals identify similar or identical elements.
Reference herein to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of other embodiments. The same applies to the term “implementation.”
Received signal Rin is routed to back end 108 through hybrid 106, which may be implemented as a two-wire-to-four-wire converter that separates the upper and lower channels. Back end 108, which is part of user equipment such as a telephone, may include, among other things, the speaker and microphone of the communications device. Signal Sgen generated at the back end 108 is routed through hybrid 106, where unwanted echo may be combined with signal Sgen to generate signal Sin that has diminished quality. Echo canceller 102 estimates echo in signal Sin based on received signal Rin and cancels the echo by subtracting the estimated echo from signal Sin to generate output signal Sout, which is provided to the far-end.
When music-on-hold is playing at near end 100 and the far-end user is speaking, the resulting signal Sin may comprise both music and echo. As described above in the background, in some conventional public switched telephone networks, rather than cancelling the echo, the non-linear processing module of the echo canceller suppresses the echo by clipping the mixed output signal and replaces the echoed sound fragments with comfort noise. To prevent this from occurring, the non-linear processing module of echo canceller 102 is stopped when music is detected by music detection module 104. Music detection module 104, as well as echo canceller 102 and hybrid 106, may be implemented as part of the user equipment or may be implemented in the network by the operator of the public switched telephone network.
In general, music detection module 104 detects the presence or absence of music in signal Sin by using spectral analysis to identify tones in signal Sin that are characteristic of music, as opposed to tones characteristic of speech or background noise. Tones that are characteristic of music are represented in the frequency domain by relatively sharp peaks. Typically, music contains a greater number of tones than speech, and those tones are generally longer in duration and more harmonic than tones in speech. Since music typically has more tones than speech and tones that have longer durations, music detection module 104 identifies portions of audio signals having a relatively large number of long-lasting tones as corresponding to music. The operation of music detection module 104 is discussed in further detail below in relation to
Music detection module 104 preferably receives signal Sin in digital format, represented as a time-domain sampled signal having a sampling frequency sufficient to represent telephone quality speech (i.e., a frequency≧8 kHz). Further, signal Sin is preferably received on a frame-by-frame basis with a constant frame size and a constant frame rate. Typical packet durations in PSTN are 5 ms, 10 ms, 15 ms, etc., and typical frame sizes for 8 kHz speech packets are 40 samples, 80 samples, 120 samples, etc. Music detection module 104 makes determinations as to whether music is or is not present on a frame-by-frame basis. If music is detected in a frame, then music detection module 104 outputs a value of one to echo canceller 102, instructing echo canceller 102 to not operate the non-linear processing module of echo canceller 102. If music is not detected, then music detection module 104 outputs a value of zero to echo canceller 102, instructing echo canceller 102 to operate the non-linear processing module to cancel echo. Note that, according to alternative embodiments, music detection module 104 may output a value of one when music is not detected and a value of zero when music is detected.
When speech and/or music is detected, voice activity detection generates an output value of one, and, when neither speech nor music is detected, voice activity detection generates an output value of zero. The output value is employed by the finite automaton processing of step 236 as discussed in relation to
When computational resources are available (as discussed below in relation to the FFT processing in step 218), received data frame Fn is also preprocessed (step 206) to increase the quality of music detection. Preprocessing may include, for example, high-pass filtering to remove the DC component of signal Sin and/or a pre-emphasis technique that emphasizes spectrum peaks so that the peaks are easier to detect.
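The preprocessing of step 206 can be illustrated with a minimal sketch. The filter structures and constants below are assumptions chosen for illustration only: a first-order DC-blocking filter with a pole at 0.95 and a conventional first-order pre-emphasis filter with coefficient 0.9 stand in for whatever high-pass and pre-emphasis techniques a given implementation uses. The function names are hypothetical.

```python
def remove_dc(samples, pole=0.95):
    """First-order DC blocker: y[n] = x[n] - x[n-1] + pole * y[n-1].

    ASSUMPTION: the filter form and the pole value are illustrative.
    """
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = x - prev_x + pole * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out


def pre_emphasize(samples, alpha=0.9):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    ASSUMPTION: alpha is an illustrative value, not taken from the text.
    """
    out = []
    prev_x = 0.0
    for x in samples:
        out.append(x - alpha * prev_x)
        prev_x = x
    return out


# A constant (pure DC) input decays toward zero after DC removal.
frame = [1.0] * 200
filtered = remove_dc(frame)
emphasized = pre_emphasize(filtered)
```

In this sketch the filter state is reset for every call; a frame-by-frame implementation would carry the previous input and output samples across frames.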
Returning to
In order to reduce the overall computational complexity of music detection module 104, it is preferred that an FFT frame comprise more than one received data frame Fn. For example, for a received data frame size equal to 40 samples, three consecutive received data frames may be accumulated to generate 120 accumulated samples, which are then padded (step 214) with eight samples, each having a value of zero, to generate an FFT frame having 128 samples. To ensure that three frames have been saved in the frame buffer (step 208), a determination is made in step 210 as to whether or not enough frames (e.g., 3) have been accumulated. For this discussion, assume that each FFT frame comprises three received data frames Fn. If enough frames have not been accumulated, then old tones are loaded (step 212) as discussed further below. Following step 212, processing continues to step 228, which is discussed below.
If enough frames have been accumulated (step 210), then a sufficient number of padding samples are appended to the accumulated frames (step 214). After the padding values have been appended to generate an FFT frame (e.g., 128 samples), a weighted windowing function (step 216) is applied to avoid spectral leakage that can result from performing FFT processing (step 218). Spectral leakage is an effect well known in the art where, in the spectral analysis of the signal, some energy appears to have “leaked” out of the original signal spectrum into other frequencies. To counter this effect, a suitable windowing function may be used, including a Hamming window function or other windowing function known in the art that mitigates the effects of spectral leakage, thereby increasing the quality of tone detection. According to alternative embodiments of the present invention, the windowing function of step 216 may be excluded to reduce computational resources or for other reasons.
The windowed FFT frame is applied to the FFT processing of step 218 to generate a frequency-domain signal, comprising 2K complex Fourier coefficients fftt[k], where the FFT frame index t=0, 1, 2, etc. The 2K complex Fourier coefficients fftt[k] correspond to an FFT spectrum, and each complex Fourier coefficient fftt[k] corresponds to a different frequency k in the spectrum, where k=0, . . . , 2K−1. Note that, if the FFT processing of step 218 is implemented using fixed-point arithmetic, then frame normalization (not shown) may be needed before performing the FFT processing in order to improve the numeric quality of fixed-point calculations.
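The accumulation, padding, windowing, and FFT processing of steps 208 to 218 can be sketched as follows for the example of three 40-sample data frames and a 128-sample FFT frame (so that 2K=128). This is a sketch under stated assumptions, not a definitive implementation: the function names are hypothetical, the window is applied across the whole padded frame, and a plain recursive radix-2 FFT in floating point is used in place of any optimized or fixed-point routine.

```python
import cmath
import math

FFT_SIZE = 128  # 2K samples per FFT frame, so K = 64


def build_fft_frame(data_frames, fft_size=FFT_SIZE):
    """Concatenate accumulated data frames and zero-pad to fft_size (step 214)."""
    samples = [s for frame in data_frames for s in frame]
    if len(samples) > fft_size:
        raise ValueError("accumulated samples exceed the FFT frame size")
    return samples + [0.0] * (fft_size - len(samples))


def hamming(n):
    """Hamming window coefficients (one possible window for step 216)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two (step 218)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# Three 40-sample frames of a 1 kHz tone sampled at 8 kHz.
tone = [math.sin(2.0 * math.pi * 1000.0 * i / 8000.0) for i in range(120)]
frames = [tone[0:40], tone[40:80], tone[80:120]]
fft_frame = build_fft_frame(frames)
windowed = [s * w for s, w in zip(fft_frame, hamming(len(fft_frame)))]
spectrum = fft(windowed)

# Strongest bin among k = 0..K; 1 kHz falls on bin 1000 / (8000 / 128) = 16.
peak_bin = max(range(FFT_SIZE // 2 + 1), key=lambda k: abs(spectrum[k]))
```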
Referring back to
As shown in line 1, code 500 is performed for each frequency k, where k=0, . . . , K. In line 2, the kth power spectrum coefficient aspt[k] for the current FFT frame t is generated by squaring the magnitude value absolute_value(fftt[k]) of the kth complex Fourier coefficient fftt[k]. In line 3, the smoothed power spectrum FFT coefficient FFTsmt[k] for the current frame t is generated based on the smoothed power spectrum FFT coefficient FFTsmt-1[k] for the previous frame (t−1), the smoothing coefficient FFT_gamma, and the power spectrum coefficient aspt[k] for the current frame t. The result of applying code 500 to a plurality of FFT frames t is a smoothed power spectrum.
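The two operations described above can be sketched as follows. The power spectrum step follows the description directly. The smoothing step is an assumption: the description names the inputs (the previous smoothed coefficient, FFT_gamma, and the current power spectrum coefficient) but the exact recursion is not reproduced here, so a first-order exponential smoother is used for illustration. The function names are hypothetical.

```python
def power_spectrum(fft_coeffs, K):
    """asp[k] = |fft[k]|^2 for k = 0..K."""
    return [abs(fft_coeffs[k]) ** 2 for k in range(K + 1)]


def smooth_power_spectrum(prev_smoothed, asp, fft_gamma):
    """One step of time-axis smoothing.

    ASSUMPTION: first-order exponential smoothing,
    FFTsm_t[k] = gamma * FFTsm_(t-1)[k] + (1 - gamma) * asp_t[k].
    """
    return [fft_gamma * p + (1.0 - fft_gamma) * a
            for p, a in zip(prev_smoothed, asp)]


K = 2
coeffs = [3 + 4j, 1 + 0j, 0 + 2j, 0j]   # 2K complex Fourier coefficients
asp = power_spectrum(coeffs, K)
smoothed = smooth_power_spectrum([0.0] * (K + 1), asp, fft_gamma=0.5)
```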
Returning to
If the value output by the voice activity detection of step 204 is equal to one, indicating that music and/or speech is present, then the determination of step 608 is made as to whether or not there is a local maximum at frequency k. This determination may be performed by comparing the value of smoothed power spectrum coefficient FFTsmt[k] corresponding to frequency k to the values of smoothed power spectrum coefficients FFTsmt[k−1] and FFTsmt[k+1] corresponding to frequencies k−1 and k+1. If the value of smoothed power spectrum coefficient FFTsmt[k] is not larger than the values of both smoothed power spectrum coefficients FFTsmt[k−1] and FFTsmt[k+1], then the smoothed power spectrum coefficient FFTsmt[k] does not correspond to a candidate musical tone. In this case, variable TONEt[k] is set to zero (step 610) and processing proceeds to step 622, which is described further below.
If, on the other hand, the value of the smoothed power spectrum coefficient FFTsmt[k] is larger than the values of both smoothed power spectrum coefficients FFTsmt[k−1] and FFTsmt[k+1], then a local maximum corresponds to frequency k. In this case, up to two sets of threshold conditions are considered (steps 612 and 616) to determine whether the identified local maximum is a sufficiently sharp peak. If either of these sets of conditions is satisfied, then variable TONEt[k] is set to one. Setting variable TONEt[k] to one indicates that the smoothed power spectrum coefficient FFTsmt[k] corresponds to a candidate musical tone.
The first set of conditions of step 612 comprises two conditions. First, smoothed power spectrum coefficient FFTsmt[k] is divided by smoothed power spectrum coefficient FFTsmt[k−1] and the resulting value is compared to a constant δ1. Second, smoothed power spectrum coefficient FFTsmt[k] is divided by smoothed power spectrum coefficient FFTsmt[k+1] and the resulting value is compared to constant δ1. Constant δ1 may be selected empirically and may depend on variables such as FFT frame size, the type of spectral smoothing used, the windowing function used, etc. In one implementation, constant δ1 was set equal to 3 dB (i.e., ~1.4 in linear scale). If both resulting values are greater than constant δ1, then the first set of conditions of step 612 is satisfied, and variable TONEt[k] is set to one (step 614). Processing then proceeds to step 622 discussed below. Note that the first set of conditions of step 612 may be implemented using fixed-point arithmetic without using division, since FFTsmt[k]/FFTsmt[k−1]>δ1 is equivalent to FFTsmt[k]−δ1×FFTsmt[k−1]>0 and FFTsmt[k]/FFTsmt[k+1]>δ1 is equivalent to FFTsmt[k]−δ1×FFTsmt[k+1]>0.
If either resulting value is not greater than constant δ1, then the first set of conditions of step 612 is not satisfied, and a determination is made (step 616) as to whether a second set of conditions is satisfied. The second set of conditions comprises three conditions. First, smoothed power spectrum coefficient FFTsmt[k] is divided by smoothed power spectrum coefficient FFTsmt[k−2] and the resulting value is compared to a constant δ2. Second, it is determined whether the current frequency index k has a value greater than one and less than K−1. Third, smoothed power spectrum coefficient FFTsmt[k] is divided by smoothed power spectrum coefficient FFTsmt[k+2] and the resulting value is compared to constant δ2. Similar to constant δ1, constant δ2 may be selected empirically and may depend on variables such as FFT frame size, the type of spectral smoothing used, the windowing function used, etc. In one implementation, constant δ2 was set equal to 12 dB (i.e., ~4 in linear scale). If both resulting values are greater than constant δ2 and 1<k<K−1, then the second set of conditions of step 616 is satisfied and variable TONEt[k] is set to one (step 618). Processing then proceeds to step 622 discussed below. Note that FFTsmt[k]/FFTsmt[k−2]>δ2 may be implemented using fixed-point arithmetic without using divisions because this comparison is equivalent to FFTsmt[k]−δ2×FFTsmt[k−2]>0. Similarly, FFTsmt[k]/FFTsmt[k+2]>δ2 may be implemented as FFTsmt[k]−δ2×FFTsmt[k+2]>0.
If any one of the conditions in the second set of conditions of step 616 is not satisfied, then variable TONEt[k] is set to zero (step 620). The determination of step 622 is made as to whether or not there are any more smoothed power spectrum coefficients FFTsmt[k] for the current FFT frame t to consider. If there are more smoothed power spectrum coefficients FFTsmt[k] to consider, then processing returns to step 602 to receive the next smoothed power spectrum coefficient FFTsmt[k]. If there are no more smoothed power spectrum coefficients FFTsmt[k] to consider for the current FFT frame t, then processing is stopped.
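The spectral-peak finding of steps 608 to 620 can be sketched as follows, using the division-free form of the threshold comparisons. The function name and the list-based layout are hypothetical. Two behaviors are assumptions of this sketch rather than statements of the method: all TONE flags are set to zero when the voice activity detection output is zero, and the edge bins k=0 and k=K, which lack two neighbors, are never marked as tones.

```python
DELTA1 = 1.4  # about 3 dB, using the linear-scale figure quoted in the text
DELTA2 = 4.0  # about 12 dB, using the linear-scale figure quoted in the text


def find_candidate_tones(fftsm, vad_active, delta1=DELTA1, delta2=DELTA2):
    """Return TONE[k] flags (0 or 1) for smoothed power spectrum fftsm[0..K]."""
    K = len(fftsm) - 1
    tone = [0] * (K + 1)
    if not vad_active:  # ASSUMPTION: no tones are flagged when VAD outputs zero
        return tone
    for k in range(1, K):  # ASSUMPTION: edge bins are skipped
        centre = fftsm[k]
        # Step 608: local maximum test against the immediate neighbours.
        if not (centre > fftsm[k - 1] and centre > fftsm[k + 1]):
            continue
        # Step 612: sharpness against bins k-1 and k+1, without division.
        if centre - delta1 * fftsm[k - 1] > 0 and centre - delta1 * fftsm[k + 1] > 0:
            tone[k] = 1
            continue
        # Step 616: sharpness against bins k-2 and k+2, only where both exist.
        if 1 < k < K - 1:
            if centre - delta2 * fftsm[k - 2] > 0 and centre - delta2 * fftsm[k + 2] > 0:
                tone[k] = 1
    return tone


# Bin 2 passes the first set of conditions; bin 6 fails it (5 is not 1.4 times 4)
# but passes the second set (5 exceeds 4 times the values two bins away).
example = [1.0, 1.0, 10.0, 1.0, 1.0, 1.0, 5.0, 4.0, 1.0, 1.0]
flags = find_candidate_tones(example, vad_active=True)
```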
Returning to
Note that steps 214 to 226 are performed only once for each FFT frame t (e.g., upon receiving every third data frame Fn). When the first and second data frames F1 and F2 are received, steps 214 to 226 are not performed. Rather, variables TONEt[k] for k=0, . . . , K are initialized to zero, and steps 228 to 238 are performed based on the initialized values. For all other data frames n that are received when variables TONEt[k] are not generated, the previously stored set of variables TONEt[k] are loaded (step 212) and used to update tone accumulators An[k] (step 228).
Since the first FFT frame t=1 does not exist until after the third data frame F3 is received, an initial set of variables TONE0[k] is set to zero. Upon receiving each of the first and second data frames F1 and F2, the initial set of variables TONE0[k] is loaded (step 212) and used to update the sets of tone accumulators A1[k] and A2[k] for the first two data frames (step 228). Upon receiving the third data frame F3, the set of variables TONE1[k] for the first FFT frame is generated and saved (steps 214-226). This first set of variables TONE1[k] is used to update the set of tone accumulators A3[k] corresponding to the third received data frame F3 (step 228). Since the second FFT frame t=2 does not exist until after the sixth data frame F6 is received, for the fourth and fifth received data frames F4 and F5, the first set of variables TONE1[k] is loaded (step 212) to update (step 228) the sets of tone accumulators A4[k] and A5[k] corresponding to the fourth and fifth received data frames F4 and F5. Upon receiving the sixth data frame F6, the set of variables TONE2[k] is generated for the second FFT frame. This second set of variables TONE2[k] is used to update (step 228) the sets of tone accumulators A6[k], A7[k], and A8[k] for the sixth, seventh, and eighth received data frames F6, F7, and F8.
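The schedule described above can be summarized with a short sketch that maps each received data frame n to the set of TONE variables used for its accumulator update. The function names are hypothetical, data frames are numbered from one as in the description, and three data frames per FFT frame are assumed.

```python
def tone_set_index(n, frames_per_fft=3):
    """Index t of the TONE set used to update accumulators for data frame n.

    t = 0 denotes the all-zero initial set TONE0.
    """
    return n // frames_per_fft


def fft_runs_on(n, frames_per_fft=3):
    """True when steps 214 to 226 run for data frame n."""
    return n % frames_per_fft == 0


# (data frame n, TONE set index t, FFT performed on this frame?)
schedule = [(n, tone_set_index(n), fft_runs_on(n)) for n in range(1, 9)]
```

For frames F1 to F8, the sketch yields TONE set indices 0, 0, 1, 1, 1, 2, 2, 2, which matches the sequence walked through above.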
Typically, the FFT processing of step 218 uses a relatively large amount of computational resources. To reduce computational resources when FFT processing is performed (e.g., upon receiving every third data frame Fn), the voice activity detection of step 204 and the frame preprocessing of step 206 are skipped. In such instances, the finite automaton processing of step 236 uses a fixed value of one in lieu of the output from the voice activity detection of step 204. When FFT processing is not performed (e.g., after receiving the first, second, fourth, fifth, seventh, eighth, and so on data frames), the voice activity detection of step 204 and the frame preprocessing of step 206 are performed.
According to alternative embodiments of the present invention, one of the voice activity detection of step 204 and the frame preprocessing of step 206 may be skipped when the FFT processing of step 218 is performed, rather than skipping both the voice activity detection and the frame preprocessing. According to further embodiments of the present invention, the voice activity detection and the frame preprocessing are performed at all times, even when the FFT processing is performed. According to yet further embodiments of the present invention, the voice activity detection and/or the frame preprocessing may be omitted from the processing performed in flow diagram 200 altogether. Simulations have shown that music detection works relatively well when voice activity detection and frame preprocessing are not employed; however, the quality of music detection increases (i.e., error rate and detection delay decrease) when voice activity detection and frame preprocessing are employed.
Returning to
Cn=sum(Wgt[k]×sign(Bn[k]), k=0, . . . , K)  (1)
Dn=sum(Wgt[k]×Bn[k], k=0, . . . , K)  (2)
where “sign” denotes the signum function that returns a value of positive one if the argument is positive, a value of negative one if the argument is negative, and a value of zero if the argument is equal to zero. Note that pseudocode 700 of
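Equations (1) and (2) can be evaluated with a few lines of code. In this minimal sketch, Bn[k] is treated as a per-frequency duration value, consistent with the description of Dn as a weighted sum of tone durations, and the weights and input values are hypothetical. The function names are illustrative only.

```python
def sign(x):
    """Signum: 1 for positive, -1 for negative, 0 for zero."""
    return (x > 0) - (x < 0)


def tone_metrics(b, wgt):
    """Return (C_n, D_n) per Equations (1) and (2) for B_n[k], Wgt[k], k = 0..K."""
    c_n = sum(w * sign(v) for w, v in zip(wgt, b))
    d_n = sum(w * v for w, v in zip(wgt, b))
    return c_n, d_n


# Hypothetical input: three tones with durations 5, 2, and 9, uniform weights.
b_n = [0, 5, 0, 2, 0, 9]
wgt = [1.0] * len(b_n)
c_n, d_n = tone_metrics(b_n, wgt)
```

With uniform weights, Cn reduces to the count of nonzero entries and Dn to their sum.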
Once the weighted number Cn of candidate musical tones and the weighted sum Dn of candidate musical tone durations are determined, the results are applied to the finite automaton processing of step 236 along with the decision from the voice activity detection of step 204 (i.e., 0 for noise and 1 for speech and/or music). Finite automaton processing, described in further detail in relation to
Transitions between these states are performed based on three rules: a soft-decision rule, a hard-decision rule, and a voice activity detection rule. The voice activity detection rule is merely the output of the voice activity detection of step 204 of
From graph 1000, the hard-decision rule may be derived by determining the pairs of Cn and Dn values (i.e., points in the Cartesian plane having coordinate axes of Cn and Dn depicted in
Referring back to
When state machine 900 is in pause-in-speech or -music state 908, state machine 900 will transition to (i) pause state 902 if the output of the voice activity detection switches back to a value of zero for the next received data frame, (ii) speech state 910 if the output of the voice activity detection remains equal to one for the next received data frame and the hard-decision rule is not satisfied (i.e., music is not detected in the next received data frame), or (iii) music state 916 if the output of the voice activity detection remains equal to one for the next received data frame and the hard-decision rule is satisfied (i.e., music is detected in the next received data frame).
When state machine 900 is in pause-in-speech state 904, state machine 900 will transition to (i) pause state 902 if the output of the voice activity detection is equal to zero or (ii) speech state 910 if the output of the voice activity detection is equal to one.
When state machine 900 is in speech state 910, state machine 900 will transition to (i) pause-in-speech state 904 if the voice activity detection outputs a value of zero or (ii) music-like speech state 912 if the hard-decision rule is satisfied (i.e., music is detected). State machine 900 will remain in speech state 910, as indicated by the arrow looping back into speech state 910, if the hard-decision rule is not satisfied (i.e., music is not detected).
When state machine 900 is in music-like speech state 912, state machine 900 will transition to (i) speech state 910 if the hard-decision rule is not satisfied (i.e., music is not detected) or (ii) music state 916 if the hard-decision rule is satisfied (i.e., music is detected).
When state machine 900 is in speech-like music state 914, state machine 900 will transition to (i) speech state 910 if the soft-decision rule is not satisfied, indicating that music is not present or (ii) music state 916 if the soft-decision rule is satisfied, indicating that music may be present.
When state machine 900 is in music state 916, state machine 900 will transition to (i) speech-like music state 914 if the soft-decision rule is not satisfied, indicating that music is not present or (ii) pause-in-music state 906 if the output of the voice activity detection has a value of zero. State machine 900 will remain in music state 916, as indicated by the arrow looping back into music state 916, if the soft-decision rule is satisfied, indicating that music may be present.
When state machine 900 is in pause-in-music state 906, state machine 900 will transition to (i) pause state 902 if the output of the voice activity detection has a value of zero or (ii) music state 916, if the output of the voice activity detection has a value of one.
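The transitions described above can be collected into a single transition function. This is a sketch under stated assumptions, not a definitive implementation of state machine 900: the state names and the function signature are hypothetical, transitions are immediate, and where the description lists two applicable rules without giving a priority (speech state 910 and music state 916), the voice activity detection output is checked first.

```python
PAUSE = "pause (902)"
PAUSE_IN_SPEECH = "pause-in-speech (904)"
PAUSE_IN_MUSIC = "pause-in-music (906)"
PAUSE_IN_SPEECH_OR_MUSIC = "pause-in-speech or -music (908)"
SPEECH = "speech (910)"
MUSIC_LIKE_SPEECH = "music-like speech (912)"
SPEECH_LIKE_MUSIC = "speech-like music (914)"
MUSIC = "music (916)"


def next_state(state, vad, hard, soft):
    """Next state given the VAD output (0 or 1) and the hard/soft rule results."""
    if state == PAUSE:
        return PAUSE_IN_SPEECH_OR_MUSIC if vad else PAUSE
    if state == PAUSE_IN_SPEECH_OR_MUSIC:
        if not vad:
            return PAUSE
        return MUSIC if hard else SPEECH
    if state == PAUSE_IN_SPEECH:
        return SPEECH if vad else PAUSE
    if state == SPEECH:
        if not vad:  # ASSUMPTION: VAD is checked before the hard-decision rule
            return PAUSE_IN_SPEECH
        return MUSIC_LIKE_SPEECH if hard else SPEECH
    if state == MUSIC_LIKE_SPEECH:
        return MUSIC if hard else SPEECH
    if state == SPEECH_LIKE_MUSIC:
        return MUSIC if soft else SPEECH
    if state == MUSIC:
        if not vad:  # ASSUMPTION: VAD is checked before the soft-decision rule
            return PAUSE_IN_MUSIC
        return MUSIC if soft else SPEECH_LIKE_MUSIC
    if state == PAUSE_IN_MUSIC:
        return MUSIC if vad else PAUSE
    raise ValueError("unknown state: " + str(state))


# Each tuple is (vad, hard, soft) for one received data frame.
inputs = [(1, True, True), (1, True, True), (1, False, True),
          (1, False, False), (1, False, False)]
trace = [PAUSE]
for vad, hard, soft in inputs:
    trace.append(next_state(trace[-1], vad, hard, soft))
```

In the example trace, the machine leaves the pause state when activity begins, enters the music state once the hard-decision rule is satisfied, stays there while the soft-decision rule holds, and falls back to speech through the speech-like music state when it no longer does.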
In some embodiments of the present invention, a transition from one state to another in state machine 900 occurs immediately after one of the rules is satisfied. For example, a transition from pause state 902 to pause-in-speech or -music state 908 occurs immediately after the output of the voice activity detection switches from a value of zero to a value of one.
According to alternative embodiments, in order to smooth the outputs of state machine 900, a transition from one state to another occurs only after one of the rules is satisfied for a specified number (>1) of consecutive frames. These embodiments may be implemented in many different ways using a plurality of hangover counters. For example, according to one embodiment, three hangover counters may be used, where each hangover counter corresponds to a different one of the three rules. As another example, each state may have its own set of one or more hangover counters.
The hangover counters may be implemented in many different ways. For example, a hangover counter may be incremented each time one of the rules is satisfied, and reset each time one of the rules is not satisfied. As another example, a hangover counter may be (i) incremented each time a relevant rule that is satisfied for the current frame is the same as in the previous data frame and (ii) reset to zero each time the relevant rule that is satisfied changes from the previous data frame. If the hangover counter becomes larger than a specified hangover threshold, then state machine 900 transitions from the current state to the next state. The hangover threshold may be determined empirically.
As an example of the operation of a hangover counter according to one embodiment, suppose that state machine 900 is in pause state 902, and the output of the voice activity detection switches from a value of zero, indicating that neither speech nor music is present in the previous data frame, to a value of one, indicating that speech or music is present in the current data frame. State machine 900 does not switch states immediately. Rather, a hangover counter is increased each time that the output of the voice activity detection remains equal to one. When the hangover counter exceeds the hangover threshold, state machine 900 transitions from pause state 902 to pause-in-speech or -music state 908. If the voice activity detection switches to zero before the hangover counter exceeds the hangover threshold, then the hangover counter is reset to zero.
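The hangover behavior in this example can be sketched as a small counter object. The class name, the method name, and the choice to clear the counter once the transition fires are assumptions made for illustration.

```python
class HangoverCounter:
    """Fires only after a rule holds for more than `threshold` consecutive frames."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0

    def update(self, rule_satisfied):
        """Process one frame; return True when the state transition should occur."""
        if rule_satisfied:
            self.count += 1
        else:
            self.count = 0  # rule broken before the threshold was exceeded
        if self.count > self.threshold:
            self.count = 0  # ASSUMPTION: counter is cleared after firing
            return True
        return False


# VAD output per frame; the zero in the third frame resets the counter.
counter = HangoverCounter(threshold=2)
vad_outputs = [1, 1, 0, 1, 1, 1]
fired = [counter.update(v == 1) for v in vad_outputs]
```

A threshold of zero makes the counter fire on the first frame in which the rule is satisfied, which corresponds to an immediate transition.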
According to further alternative embodiments, transitions from some states may be instantaneous and transitions between other states may be performed using hangover counters. For example, transitions from the intermediate states (i.e., pause-in-speech state 904, pause-in-speech or -music state 908, music-like speech state 912, speech-like music state 914, and pause-in-music state 906) may be performed using hangover counters, while transitions from pause state 902, speech state 910, and music state 916 may be instantaneous. Each different state can have its own unique hangover counter and hangover threshold value. Further, instantaneous transitions can be achieved by specifying a value of zero for the relevant hangover threshold.
Compared to stochastic model-based techniques, the present invention is less complex, allowing the present invention to be implemented in real-time low-latency processing. Compared to deterministic model-based techniques, the present invention has lower detection error rates. Thus, the present invention is a compromise between low computational complexity and high detection quality. Unlike other methods that use encoded speech features, and are thus limited to being used with a specific coder-decoder (CODEC), the present invention is more universal because it does not require any additional information other than the input signal.
The complexity of the processing performed in flow diagram 200 of
According to embodiments of the present invention in which frame preprocessing, voice activity detection, windowing, frame normalization, and time-axis smoothing are performed at all times, the total number of integer multiplications performed for music detection is approximately N+NVAD+(2K+1)+2K log2 K+2K+2(K+1)+2(K+1)+2K+2K=N+NVAD+12K+5+2K log2 K multiplications. Typical voice activity detection uses approximately 4×N multiplications per frame if exponential smoothing of the samples' energy is used. For a typical value of K=64 (i.e., 5 ms frame for 8 kHz signal) and N=40, the peak complexity is equal to about 0.35 million multiplications per second.
According to embodiments of the present invention in which frame preprocessing, voice activity detection, windowing, and time-axis smoothing are not performed, the total number of integer multiplications performed for music detection is approximately 2K log_2 K + 2K + 2(K+1) + 2(K+1) + 2K + 2K. For K=64, the peak complexity is equal to approximately 0.28 million multiplications per second. Note that these estimates do not account for the number of summations and subtractions, or for the processing time needed for memory read and write operations.
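The two complexity estimates can be checked with a short calculation. The sketch below evaluates the per-frame multiplication counts given above for K=64 and N=40, taking the voice activity detection cost as 4×N. The frame rate of 200 frames per second is an assumption inferred from a 5 ms frame at 8 kHz, and the function and variable names are illustrative only.

```python
import math


def core_multiplications(K: int) -> float:
    """Per-frame count when the optional steps are not performed."""
    return 2 * K * math.log2(K) + 2 * K + 2 * (K + 1) + 2 * (K + 1) + 2 * K + 2 * K


def full_multiplications(K: int, N: int, n_vad: int) -> float:
    """Per-frame count when all optional steps are performed."""
    return N + n_vad + (2 * K + 1) + core_multiplications(K)


K, N = 64, 40
n_vad = 4 * N                   # approximate VAD cost per frame, as stated in the text
frames_per_second = 8000 / N    # assumption: one 40-sample frame every 5 ms at 8 kHz

full_rate = full_multiplications(K, N, n_vad) * frames_per_second
reduced_rate = core_multiplications(K) * frames_per_second

print(f"all steps:     {full_rate / 1e6:.2f} million multiplications/s")
print(f"reduced steps: {reduced_rate / 1e6:.2f} million multiplications/s")
```

Under these assumptions the per-frame counts are 1741 and 1412 multiplications, which correspond to approximately 0.35 and 0.28 million multiplications per second, respectively.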
Although the present invention was described as accumulating three received data frames Fn to generate an FFT frame for FFT processing, the present invention is not so limited. The present invention may be implemented such that (i) fewer than three received data frames Fn are accumulated to generate an FFT frame, including as few as one received data frame Fn, or (ii) greater than three received data frames Fn are accumulated to generate an FFT frame. In embodiments in which an FFT frame comprises only one received data frame Fn, steps 210, 212, and 226 may be omitted, such that processing flows from step 208 directly to step 214 and steps 214 to 224 are performed for each received data frame Fn, and the set of variables TONEt[k] generated for each received data frame Fn is used immediately to update (step 228) tone accumulators An[k].
Further, although the spectral-peak finding of step 600 of
Even further, although state machine 900 was described as having eight states, the present invention is not so limited. According to alternative embodiments, state machines of the present invention may have more than or fewer than eight states. For example, according to some embodiments, the state machine could have six states, wherein pause-in-speech state 904 and pause-in-music state 906 are omitted. In such embodiments, speech state 910 and music state 916 would transition directly to pause state 902. In addition, as described above, hangover counters could be used to smooth the transitions to speech state 910 and music state 916.
Even yet further, although music detection modules of the present invention were described relative to their use with public switched telephone networks, the present invention is not so limited. The present invention may be used in suitable applications other than public switched telephone networks.
The present invention may be implemented as circuit-based processes, including possible implementation as a single integrated circuit (such as an ASIC or an FPGA), a multi-chip module, a single card, or a multi-card circuit pack. As would be apparent to one skilled in the art, various functions of circuit elements may also be implemented as processing blocks in a software program. Such software may be employed in, for example, a digital signal processor, micro-controller, general-purpose computer, or other processor.
The present invention can be embodied in the form of methods and apparatuses for practicing those methods. The present invention can also be embodied in the form of program code embodied in tangible media, such as magnetic recording media, optical recording media, solid state memory, floppy diskettes, CD-ROMs, hard drives, or any other non-transitory machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. The present invention can also be embodied in the form of program code, for example, stored in a non-transitory machine-readable storage medium including being loaded into and/or executed by a machine, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor or other processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits.
The present invention can also be embodied in the form of a bitstream or other sequence of signal values stored in a non-transitory recording medium generated using a method and/or an apparatus of the present invention.
Unless explicitly stated otherwise, each numerical value and range should be interpreted as being approximate as if the word “about” or “approximately” preceded the value or range.
It will be further understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated in order to explain the nature of this invention may be made by those skilled in the art without departing from the scope of the invention as expressed in the following claims.
The use of figure numbers and/or figure reference labels in the claims is intended to identify one or more possible embodiments of the claimed subject matter in order to facilitate the interpretation of the claims. Such use is not to be construed as necessarily limiting the scope of those claims to the embodiments shown in the corresponding figures.
It should be understood that the steps of the exemplary methods set forth herein are not necessarily required to be performed in the order described, and the order of the steps of such methods should be understood to be merely exemplary. For example, voice activity detection 204 in
Although the elements in the following method claims, if any, are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.
The embodiments covered by the claims in this application are limited to embodiments that (1) are enabled by this specification and (2) correspond to statutory subject matter. Non-enabled embodiments and embodiments that correspond to non-statutory subject matter are explicitly disclaimed even if they fall within the scope of the claims.
Number | Date | Country | Kind |
---|---|---|---|
2010152225 | Dec 2010 | RU | national |