The subject matter of this application is related to Russian patent application no. ______ filed as Attorney Docket number L09-0669RU1 on the same day as this application, the teachings of which are incorporated herein by reference in their entirety.
1. Field of the Invention
The present invention relates to signal processing, and, more specifically but not exclusively, to techniques for detecting music in an acoustical signal.
2. Description of the Related Art
Music detection techniques that differentiate music from other sounds such as speech and noise are used in a number of different applications. For example, music detection is used in sound encoding and decoding systems to select between two or more different encoding schemes based on the presence or absence of music. Signals containing speech, without music, may be encoded at lower bit rates (e.g., 8 kb/s) to minimize bandwidth without sacrificing quality of the signal. Signals containing music, on the other hand, typically require higher bit rates (e.g., >8 kb/s) to achieve the same level of quality as that of signals containing speech without music. To minimize bandwidth when speech is present without music, the encoding system may be selectively configured to encode the signal at a lower bit rate. When music is detected, the encoding system may be selectively configured to encode the signal at a higher bit rate to achieve a satisfactory level of quality. Further, in some implementations, the encoding system may be selectively configured to switch between two or more different encoding algorithms based on the presence or absence of music. A discussion of the use of music detection in sound encoding systems may be found, for example, in U.S. Pat. No. 6,697,776, the teachings of which are incorporated herein by reference in their entirety.
As another example, music detection techniques may be used in video handling and storage applications. A discussion of the use of music detection in video handling and storage applications may be found, for example, in Minami, et al., “Video Handling with Music and Speech Detection,” IEEE Multimedia, Vol. 5, Issue 3, pgs. 17-25, July-September 1998, the teachings of which are incorporated herein by reference in their entirety.
As yet another example, music detection techniques may be used in public switched telephone networks (PSTNs) to prevent echo cancellers from corrupting music signals. When a consumer speaks from a far end of the network, the speech may be reflected from a line hybrid at the near end, and an output signal containing echo may be returned from the near end of the network to the far end. Typically, the echo canceller will model the echo and cancel the echo by subtracting the modeled echo from the output signal.
If the consumer is speaking at the far end of the network while music-on-hold is playing from the near end of the network, then the echo and music are mixed producing a mixed output signal. However, rather than cancelling the echo, in some cases, the non-linear processing module of the echo canceller suppresses the echo by clipping the mixed output signal and replaces fragments of the mixed output signal with comfort noise. As a result of this improper and unexpected echo canceller operation, instead of music, the consumer may hear intervals of silence and noise while the consumer is speaking into the handset. In such a case, the consumer may assume that the line is broken and terminate the call.
To prevent this scenario from occurring, music detection techniques may be used to detect when music is present, and, when music is present, the non-linear processing module of the echo canceller may be switched off. As a result, echo will remain in the mixed output signal; however, the existence of echo will typically sound more natural than the clipped mixed output signal. A discussion of the use of music detection techniques in PSTN applications may be found, for example, in Avi Perry, “Fundamentals of Voice-Quality Engineering in Wireless Networks,” Cambridge University Press, 2006, the teachings of which are incorporated herein by reference in their entirety.
A number of different music detection techniques currently exist. In general, the existing techniques analyze tones in the received signal to determine whether or not music is present. Most, if not all, of these tone-based music detection techniques may be separated into two basic categories: (i) stochastic model-based techniques and (ii) deterministic model-based techniques. A discussion of stochastic model-based techniques may be found in, for example, Compure Company, “Music and Speech Detection System Based on Hidden Markov Models and Gaussian Mixture Models,” a Public White Paper, http://www.compure.com, the teachings of which are incorporated herein by reference in their entirety. A discussion of deterministic model-based techniques may be found, for example, in U.S. Pat. No. 7,130,795, the teachings of which are incorporated herein by reference in their entirety.
Stochastic model-based techniques, which include Hidden Markov models, Gaussian mixture models, and Bayesian rules, are relatively computationally complex, and as a result, are difficult to use in real-time applications like PSTN applications. Deterministic model-based techniques, which include threshold methods, are less computationally complex than stochastic model-based techniques, but typically have higher detection error rates. Music detection techniques are needed that are (i) not as computationally complex as stochastic model-based techniques, (ii) more accurate than deterministic model-based techniques, and (iii) capable of being used in real-time low-latency processing applications such as PSTN applications.
In one embodiment, the present invention is a processor-implemented method for processing audio signals to determine whether or not the audio signals correspond to music. According to the method, the processor characterizes whether pauses exist in a received audio signal. Further, the processor makes a pause-based determination of whether or not the received audio signal corresponds to music based on the characterization of whether pauses exist in the received audio signal.
In another embodiment, the present invention is an apparatus comprising a processor for processing audio signals to determine whether or not the audio signals correspond to music. The processor is adapted to characterize whether pauses exist in a received audio signal. Further, the processor is adapted to make a pause-based determination of whether or not the received audio signal corresponds to music based on the characterization of whether pauses exist in the received audio signal.
Other aspects, features, and advantages of the present invention will become more fully apparent from the following detailed description, the appended claims, and the accompanying drawings in which like reference numerals identify similar or identical elements.
Reference herein to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of other embodiments. The same applies to the term “implementation.”
Received signal Rin is routed to back end 108 through hybrid 106, which may be implemented as a two-wire-to-four-wire converter that separates the upper and lower channels. Back end 108, which is part of user equipment such as a telephone, may include, among other things, the speaker and microphone of the communications device. Signal Sgen generated at the back end 108 is routed through hybrid 106, where unwanted echo may be combined with signal Sgen to generate signal Sin that has diminished quality. Echo canceller 102 estimates echo in signal Sin based on received signal Rin and cancels the echo by subtracting the estimated echo from signal Sin to generate output signal Sout which is provided to the far-end.
When music-on-hold is playing at near end 100 and the far-end user is speaking, the resulting signal Sin may comprise both music and echo. As described above in the background, in some conventional public switched telephone networks, rather than cancelling the echo, the non-linear processing module of the echo canceller suppresses the echo by clipping the mixed output signal and replaces the echoed sound fragments with comfort noise. To prevent this from occurring, the non-linear processing module of echo canceller 102 is stopped when music is detected by music detection module 104. Music detection module 104, as well as echo canceller 102 and hybrid 106, may be implemented as part of the user equipment or may be implemented in the network by the operator of the public switched telephone network.
Music detection module 104 preferably receives signal Sin in digital format, represented as a time-domain sampled signal having a sampling frequency sufficient to represent telephone quality speech (i.e., a frequency≧8 kHz). Further, signal Sin is preferably received on a frame-by-frame basis with a constant frame size and a constant frame rate. Typical packet durations in PSTN are 5 ms, 10 ms, 15 ms, etc., and typical frame sizes for 8 kHz speech packets are 40 samples, 80 samples, 120 samples, etc. Music detection module 104 makes determinations as to whether music is or is not present on a frame-by-frame basis. If music is detected in a frame, then music detection module 104 outputs a value of one to echo canceller 102, instructing echo canceller 102 to not operate the non-linear processing module of echo canceller 102. If music is not detected, then music detection module 104 outputs a value of zero to echo canceller 102, instructing echo canceller 102 to operate the non-linear processing module to cancel echo. Note that, according to alternative embodiments, music detection module 104 may output a value of one when music is not detected and a value of zero when music is detected.
Pause-based music detection sub-module 204, described in further detail below in relation to
Note that a pause may last an entire frame, less than an entire frame, or multiple frames. Further, the beginning of a pause does not necessarily correspond to the beginning of a frame, and the end of a pause does not necessarily correspond to the end of a frame. Determining that the energy of a frame is less than the energy threshold indicates that the frame is a “pause frame” that either (i) contains one or more pauses or (ii) is part of a pause spanning multiple frames.
A sum of the number of pause frames is computed for a specified number Wtd of consecutive frames, where the specified number Wtd includes the current frame and the last (Wtd−1) frames. If the sum is equal to zero, indicating that there have been no pauses in the most recent Wtd frames, then it is presumed that the current frame contains music, and pause-based music detection sub-module 204 outputs a value of one for that frame. If, on the other hand, the sum is not equal to zero, then it is presumed that the current frame does not contain music (i.e., corresponds to a pause in speech or a long period of silence), and pause-based music detection sub-module 204 outputs a value of zero for that frame.
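The windowed decision described above can be sketched as follows. This is a minimal illustration only, not the implementation of sub-module 204: the function name `pause_based_decisions` is hypothetical, and the behavior before Wtd frames have been received (outputting zero) is an assumption that the text does not specify.

```python
from collections import deque


def pause_based_decisions(pause_flags, wtd):
    """Return one decision per frame: 1 (music presumed) or 0.

    pause_flags: sequence of 0/1 values, where 1 marks a pause frame.
    wtd: number of consecutive frames examined (current frame plus the last wtd-1).
    """
    window = deque(maxlen=wtd)  # flags of the most recent wtd frames
    decisions = []
    for flag in pause_flags:
        window.append(flag)
        # Assumption: until wtd frames have been seen, output 0 (no music decision).
        if len(window) == wtd and sum(window) == 0:
            decisions.append(1)  # no pause frames in the window
        else:
            decisions.append(0)  # at least one pause frame, or too little history
    return decisions
```

With a window of three frames, a single pause frame keeps the output at zero until that frame has left the window.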
The outputs of tone-based music detection sub-module 202 and pause-based music detection sub-module 204 are applied to Boolean “OR” logic 206, which performs logical disjunction on the outputs to generate the final decision as to whether or not music is present in the current frame. When tone-based music detection sub-module 202, pause-based music detection sub-module 204, or both tone-based music detection sub-module 202 and pause-based music detection sub-module 204 output a value of one, Boolean “OR” logic 206 outputs a value of one, indicating that music is present. When both tone-based music detection sub-module 202 and pause-based music detection sub-module 204 output a value of zero, Boolean “OR” logic 206 outputs a value of zero, indicating that music is not present.
where Fn[i] refers to the ith sample of received data frame Fn, and M is the number of samples in frame Fn.
In step 308, the calculated energy En for frame Fn is compared to the sum of (i) an energy threshold value Energy_Thr and (ii) an energy threshold offset value Δ, which is initialized to zero, to determine whether the frame Fn contains only background noise or contains sound due to speech or music in addition to background noise. If the calculated energy is less than the sum, then the frame is determined to contain only background noise. Otherwise, the frame is determined to contain sound due to music or speech in addition to the background noise. Energy threshold value Energy_Thr is adaptively updated in step 306, which may be performed before, after, or in parallel with step 304. Energy threshold value Energy_Thr is updated as described in further detail below in relation to
If calculated energy En for frame Fn is less than the sum of Energy_Thr and Δ (i.e., En<(Energy_Thr+Δ)), then a pause detection parameter an corresponding to frame Fn is set equal to one (i.e., an=1), indicating that frame Fn corresponds to a pause (i.e., may be part of a pause or contain a whole pause). If, on the other hand, calculated energy En is greater than or equal to the sum of Energy_Thr and Δ (i.e., En≧(Energy_Thr+Δ)), then pause detection parameter an is set equal to zero (i.e., an=0), indicating that frame Fn does not correspond to a pause (i.e., is not part of a pause and does not contain a whole pause).
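The comparison of step 308 can be illustrated with the short sketch below. The function names are hypothetical, and the energy measure is an assumption: Equation (1) is not reproduced in this text, so the sketch simply uses the sum of the squared samples of the frame.

```python
def frame_energy(frame):
    """Energy of one frame.

    Assumption: sum of squared samples; the exact form of Equation (1)
    is not reproduced in the text.
    """
    return sum(sample * sample for sample in frame)


def pause_parameter(energy, energy_thr, delta=0.0):
    """Return a_n: 1 if energy < energy_thr + delta (pause frame), else 0."""
    return 1 if energy < (energy_thr + delta) else 0
```

Note that a frame whose energy exactly equals Energy_Thr+Δ is not treated as a pause frame, consistent with the strict inequality above.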
In step 310, a sum Hist_Num_Pauses[n] of the number of frames in the Wtd most-recent frames that may be part of a pause is calculated as shown in Equation (2) below:
where ak is the pause detection parameter, k is the frame index, and k=n for the current frame Fn. The number Wtd of frames used in Equation (2) may be determined empirically. For example, in one implementation, Wtd was determined to be 100. Note that the total delay of pause-based music detection is greater than or equal to Wtd×M/Samples_Per_Sec, where the constant Samples_Per_Sec is the number of samples per second in the received signal Sin, which corresponds to the signal sampling frequency (e.g., 8 kHz).
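As a worked instance of the delay bound above, the sketch below evaluates Wtd×M/Samples_Per_Sec for the example values mentioned in this description (Wtd=100, M=40 samples per frame, 8 kHz sampling). The function name is hypothetical.

```python
def min_detection_delay_seconds(wtd, samples_per_frame, samples_per_sec):
    """Lower bound on the pause-based detection delay, in seconds."""
    return wtd * samples_per_frame / samples_per_sec


# Example values from the text: Wtd = 100 frames, M = 40 samples, 8000 samples/s.
delay = min_detection_delay_seconds(100, 40, 8000)  # 0.5 seconds
```

For these values the bound is 0.5 seconds; doubling the frame size to M=80 doubles the bound to one second.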
The sum Hist_Num_Pauses[n] is compared to zero (step 312). If Hist_Num_Pauses[n] is equal to zero, then the current frame Fn is presumed to contain music, and a value of one is output (step 314) to, for example, Boolean “OR” logic 206 of
After updating energy threshold offset value Δ, a determination is made in step 322 as to whether or not more frames Fn are available for music detection. If more frames Fn are available, then processing returns to step 302, and the next frame Fn is received. If more frames Fn are not available, then music detection is stopped.
To understand one implementation of processing that may be performed by energy threshold updating step 306, consider
Five dashed lines are shown on the histogram to highlight sound levels of interest that are used in determining the energy threshold value Energy_Thr. Dashed line 402 corresponds to the minimum sound level Lmin of the conversation, dashed line 406 corresponds to the median background noise level Lbkg_med of the conversation, dashed line 408 corresponds to the maximum background noise level Lbkg_max of the conversation, dashed line 410 corresponds to the mean background noise level Lmean, and dashed line 414 corresponds to the maximum sound level Lmax of the conversation. The relevance of these dashed lines is discussed in further detail below in relation to
Pseudocode 500 generates a number of bins j, where each bin j has a width of 1 dBm0. The bin levels j range from 1 to a parameter Min_dBm0_Level that is initialized to 90 in line 2 of pseudocode 500. Thus, the number of bins generated by pseudocode 500 is equal to Min_dBm0_Level (i.e., 90). Each bin j corresponds to a sound level −j dBm0 on the x-axis of the histogram. Further, each bin j corresponds to a bin level Level_Stat(j) that is initialized to zero in lines 3 to 5, where Level_Stat(j) represents the number of frames having a sound level −j dBm0 on the x-axis of the histogram. Thus, Level_Stat(1) is the level of bin j=1 corresponding to frames having the highest sound level (i.e., from 0 dBm0 to −1 dBm0), while Level_Stat(Min_dBm0_Level) is the level of bin j=Min_dBm0_Level corresponding to frames having the lowest sound level (i.e., from (−Min_dBm0_Level+1) dBm0 to −Min_dBm0_Level dBm0). In line 6, a counter Level_Stat_Counter, which is used to prevent numerical overflows in the histogram as described below, is initialized to zero.
In lines 7 to 25 of pseudocode 500 in
dBm0(x)=max(−90,6.02×log2(x/16020.0)) (3)
In lines 9 to 11, the bin levels Level_Stat(j) of the histogram are updated. For each sample Fn[i] of the received frame Fn, an absolute sound level value Level is determined in line 10. The bin level Level_Stat(j) corresponding to the absolute sound level value Level is then increased by one in line 11. Each bin level Level_Stat(j) is a counter for a bin j that is increased each time a sample Fn[i] of the signal Sin that has the corresponding absolute sound level Level is received. For example, suppose that a sample Fn[i] has a sound level of −40 dBm0. In that case, in line 10, the absolute sound level value Level is determined to be 40. In line 11, the bin level Level_Stat (j) corresponding to an absolute sound level value Level of 40 (i.e., Level_Stat (40)) is increased by one.
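The per-sample histogram update can be sketched as follows, using Equation (3) for the sound level. This is an illustrative sketch rather than a transcription of pseudocode 500: the function names are hypothetical, and the mapping from a sound level to a bin index (rounding −dBm0 up and clamping to the range 1 to 90) and the handling of zero-valued samples are assumptions chosen to match the bin ranges described above.

```python
import math

MIN_DBM0_LEVEL = 90  # number of 1-dB bins, as in the text


def dbm0(x):
    """Equation (3): level in dBm0 of a sample magnitude x, floored at -90."""
    if x <= 0:
        # Assumption: log2 is undefined at zero, so use the -90 dBm0 floor.
        return -float(MIN_DBM0_LEVEL)
    return max(-float(MIN_DBM0_LEVEL), 6.02 * math.log2(x / 16020.0))


def level_bin(sample):
    """Bin index j in 1..90 for one sample.

    Assumption: bin j covers (-(j-1)) dBm0 down to -j dBm0, so the index is
    ceil(-dBm0), clamped to the valid range.
    """
    level = math.ceil(-dbm0(abs(sample)))
    return min(MIN_DBM0_LEVEL, max(1, level))


def update_histogram(level_stat, frame):
    """Increase Level_Stat(j) by one per sample; index 0 of the list is unused."""
    for sample in frame:
        level_stat[level_bin(sample)] += 1
    return level_stat
```

A caller would keep `level_stat` as a list of 91 zero-initialized counters and, after each frame, increase a separate total counter by the number of samples M in the frame.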
After all of the samples Fn[i] of the received frame Fn have been used to update the bin levels Level_Stat(j), Level_Stat_Counter is increased by M as shown in line 13. Level_Stat_Counter tracks the sum of all bin levels Level_Stat(j) (i.e., the amount of processed statistics).
As more frames Fn are received, bin levels Level_Stat(j) become large. To prevent numerical overflows of bin levels Level_Stat(j), bin levels Level_Stat(j) are adjusted in lines 15 to 24 when Level_Stat_Counter becomes larger than Samples_Per_Second (i.e., the number of input signal samples received per second). As shown in lines 15 and 16, if Level_Stat_Counter is larger than Samples_Per_Second, then Level_Stat_Counter is reset to zero. The bin levels Level_Stat(j) are then compared to a value of 100, and the binary representation of each bin level Level_Stat(j) that is greater than 100 is shifted one bit to the right as shown in lines 17 to 20. Shifting a bin level Level_Stat(j) one bit to the right is equivalent to dividing the bin level Level_Stat(j) by a value of two. Note that, according to alternative embodiments of the present invention, all bin levels Level_Stat(j) may be divided by two. Upon considering each bin level Level_Stat(j), Level_Stat_Counter is updated to reflect the new Level_Stat(j) value as shown in line 22. Once all sound levels j have been considered, the value of Level_Stat_Counter is equal to the sum of all bin levels Level_Stat(j).
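The overflow-prevention step described above can be sketched as follows. The function name and the list-based representation (index 0 unused, bins 1 to 90) are assumptions made for illustration; the threshold of 100 and the one-bit right shift follow the description.

```python
def rescale_histogram(level_stat, level_stat_counter, samples_per_second):
    """Halve large bins once more than one second of samples has accumulated.

    level_stat: list of bin counters, modified in place (index 0 unused).
    Returns the updated value of the total counter.
    """
    if level_stat_counter <= samples_per_second:
        return level_stat_counter  # nothing to do yet

    level_stat_counter = 0  # reset, then rebuild from the adjusted bins
    for j in range(1, len(level_stat)):
        if level_stat[j] > 100:
            level_stat[j] >>= 1  # shift right by one bit (divide by two)
        level_stat_counter += level_stat[j]
    return level_stat_counter
```

Because only bins above 100 are halved, sparsely populated bins keep their counts, and the returned counter again equals the sum of all bin levels.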
In lines 25 to 40 of pseudocode 500 in
After generating Lmax and Lmin, a mean sound level Lmean is calculated as shown in line 28, and a sum cumsum of all bin levels Level_Stat(j) corresponding to bins −Lmean to −Lmin is calculated as shown in line 29. In lines 31 to 36, the sound level Lbkg_med corresponding to the median sound level between Lmin and Lmean is determined. This is accomplished by incrementally summing the bin levels Level_Stat(j) starting from sound level Lmin until the resulting sum cumsum2 is greater than half of cumsum. The median sound level Lbkg_med corresponds to dashed line 406 in exemplary histogram 400 in
In line 37, sound level Lbkg_max, which approximates the boundary between (i) energy levels that correspond to background noise only and (ii) energy levels that correspond to background noise in addition to other sounds, such as music and speech, is determined. Sound level Lbkg_max corresponds to dashed line 408 in exemplary histogram 400 in
Pause-based music detection sub-modules of the present invention are relatively low in complexity compared to tone-based music detection sub-modules. When implemented together with a tone-based music detection sub-module as shown in
According to alternative embodiments of the present invention, Boolean logic other than Boolean “OR” logic may be used with tone-based and pause-based music detection sub-modules. For example, if the tone-based music detection sub-module is prone to false positive music detection (i.e., determining that frames without music do have music), then Boolean “OR” logic may be replaced with Boolean “AND” logic. Boolean “AND” logic requires the outputs of both music detection sub-modules to be one before module 200 outputs a one.
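The two combining rules can be compared with the short sketch below. The function name and the `mode` parameter are hypothetical and are used only to illustrate the difference between the “OR” and “AND” alternatives.

```python
def combine_decisions(tone_decision, pause_decision, mode="or"):
    """Combine the two sub-module outputs (each 0 or 1) into a final decision."""
    if mode == "or":
        # Music is declared when either sub-module detects it.
        return int(bool(tone_decision) or bool(pause_decision))
    if mode == "and":
        # Music is declared only when both sub-modules agree.
        return int(bool(tone_decision) and bool(pause_decision))
    raise ValueError("mode must be 'or' or 'and'")
```

With “AND” logic, a false positive from the tone-based sub-module alone no longer produces a final output of one, at the cost of missing music that only one sub-module detects.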
According to further embodiments of the present invention, pause-based music detection sub-module 204, tone-based music detection sub-module 202, both the pause-based and tone-based music detection sub-modules, or music detection module 200 may require their processing to indicate the presence or absence of music for a specified number of consecutive frames, or for a specified percentage of frames during the previous specified number of frames (e.g., 80% of the last ten frames) before they output a one.
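One possible form of the percentage-based requirement is sketched below, using the example of 80% of the last ten frames. The function name, the default values, and the choice to output zero until a full window of frames is available are assumptions for illustration.

```python
from collections import deque


def smooth_decisions(raw_decisions, window=10, min_fraction=0.8):
    """Output 1 only when at least min_fraction of the last `window` raw decisions are 1."""
    recent = deque(maxlen=window)  # most recent raw per-frame decisions
    smoothed = []
    for decision in raw_decisions:
        recent.append(decision)
        # Assumption: output 0 until a full window of history is available.
        enough_history = len(recent) == window
        enough_votes = sum(recent) >= min_fraction * window
        smoothed.append(1 if enough_history and enough_votes else 0)
    return smoothed
```

Requiring a specified number of consecutive frames corresponds to the special case in which `min_fraction` is set to 1.0.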
According to yet further embodiments of the present invention, music detection module 104 of
Although the present invention was described relative to its use with public switched telephone networks, the present invention is not so limited. The present invention may be used in suitable applications other than public switched telephone networks.
Energy threshold updating step 306 and sound detection step 308 of
According to alternative embodiments of the present invention, the energy threshold value Energy_Thr updating in step 306 may be omitted to decrease computational complexity of flow diagram 300 or for other reasons. In such embodiments, energy threshold value Energy_Thr may be fixed to a predefined value that sufficiently estimates the noise level for most real-world scenarios.
The complexity of the processing performed in flow diagram 300 of
Hist_Num_Pauses[n]=Hist_Num_Pauses[n−1]+a[n]−a[n−Wtd] (4)
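The recursive update of Equation (4) can be sketched as follows: each new frame adds its own pause detection parameter and subtracts the parameter of the frame that leaves the window, so the cost per frame does not grow with Wtd. The function name is hypothetical, and treating pause parameters before the first frame as zero is an assumption.

```python
def running_pause_counts(pause_flags, wtd):
    """Return the pause count over the last wtd frames, for every frame n."""
    counts = []
    previous = 0  # count for frame n-1, taken as zero before the first frame
    for n, a_n in enumerate(pause_flags):
        # Assumption: a[n - wtd] is zero when n - wtd precedes the first frame.
        leaving = pause_flags[n - wtd] if n >= wtd else 0
        previous = previous + a_n - leaving
        counts.append(previous)
    return counts
```

For any input, the result equals the direct windowed sum of the pause detection parameters, while using only one addition and one subtraction per frame.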
The energy threshold updating step 306, as implemented in pseudocode 500 of
In embodiments of the present invention that use a fixed energy threshold Energy_Thr value, the complexity is approximately 2M+3 summations per frame Fn. For a typical frame size of M=40 (5 ms frame for 8 kHz signal), the complexity is approximately 16,600 summations per second. To implement the logarithmic operations in pseudocode 500, a look-up table may be used. In one implementation of a look-up table method described in scheme 2 of M. Zhang, et al., “Table-Driven Newton Scheme for High Precision Logarithmic Generation,” IEEE Proc.-Comput. Digital Tech., Vol. 141, #5, September 1994, the teachings of which are incorporated herein by reference in their entirety, logarithmic operations are performed using 7 multiplications and 2 summations. For a typical frame size of M=40, the complexity of pseudocode 500 is approximately 16,600 summations plus an additional 104,000 arithmetic operations per second. Thus, in embodiments of the present invention that update energy threshold Energy_Thr as shown in pseudocode 500 of
The present invention may be implemented as circuit-based processes, including possible implementation as a single integrated circuit (such as an ASIC or an FPGA), a multi-chip module, a single card, or a multi-card circuit pack. As would be apparent to one skilled in the art, various functions of circuit elements may also be implemented as processing blocks in a software program. Such software may be employed in, for example, a digital signal processor, micro-controller, or computer.
The present invention can be embodied in the form of methods and apparatuses for practicing those methods. The present invention can also be embodied in the form of program code embodied in tangible media, such as magnetic recording media, optical recording media, solid state memory, floppy diskettes, CD-ROMs, hard drives, or any other non-transitory machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. The present invention can also be embodied in the form of program code, for example, stored in a non-transitory machine-readable storage medium including being loaded into and/or executed by a machine, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits.
The present invention can also be embodied in the form of a bitstream or other sequence of signal values stored in a non-transitory recording medium generated using a method and/or an apparatus of the present invention.
Unless explicitly stated otherwise, each numerical value and range should be interpreted as being approximate as if the word “about” or “approximately” preceded the value or range.
It will be further understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated in order to explain the nature of this invention may be made by those skilled in the art without departing from the scope of the invention as expressed in the following claims.
The use of figure numbers and/or figure reference labels in the claims is intended to identify one or more possible embodiments of the claimed subject matter in order to facilitate the interpretation of the claims. Such use is not to be construed as necessarily limiting the scope of those claims to the embodiments shown in the corresponding figures.
It should be understood that the steps of the exemplary methods set forth herein are not necessarily required to be performed in the order described, and the order of the steps of such methods should be understood to be merely exemplary. Likewise, additional steps may be included in such methods, and certain steps may be omitted or combined, in methods consistent with various embodiments of the present invention.
Although the elements in the following method claims, if any, are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.
The embodiments covered by the claims in this application are limited to embodiments that (1) are enabled by this specification and (2) correspond to statutory subject matter. Non-enabled embodiments and embodiments that correspond to non-statutory subject matter are explicitly disclaimed even if they fall within the scope of the claims.
Number | Date | Country | Kind |
---|---|---|---|
2010152224 | Dec 2010 | RU | national |