This disclosure pertains generally to computerized telephony and audio enhancement technology, and more specifically to automatic chroma detection among music, speech, and noise in communication systems.
Music is becoming more and more popular in telephony applications, such as music on hold, tele-conferencing, and video communications using smart phones, etc., particularly, as sampling rates increase. For instance, with increasing bandwidth and sampling rate in telephony applications, from the original narrow-band 8000 Hz, to wide-band 16000 Hz, and even to full-band 48000 Hz, high fidelity music is practicable. As a result, there is a trend to use more music in telephony applications.
Audio enhancement may be performed in telephony applications to improve voice quality by removing impairments such as noise and echo from an audio signal; however, audio enhancement tuned for voice or other sounds may negatively affect music. Accordingly, previous technologies fail to address the constraints presented by encountering music of varying genres among speech, noise, or tones, which may share the same bandwidth of frequencies with the music.
Audio data describing an audio signal may be received and a set of frames of the audio signal may be determined using the audio data. The set of frames of the audio signal may be determined by performing a Fast Fourier Transform using a windowing function.
One or more potential music events may be identified based on a spectral analysis of the set of frames. Identifying the one or more potential music events based on the spectral analysis may include determining one or more chroma values for frequencies in the audio signal, estimating an energy for each of the one or more chroma values, identifying a chroma value of the one or more chroma values with a maximum energy in each of a plurality of octaves based on the estimated energies for the one or more chroma values, and determining a quantity of the plurality of octaves that includes a matching chroma value with the maximum energy. Identifying the one or more potential music events may include determining a chroma match counter value based on the quantity of the plurality of octaves that includes the matching chroma value with the maximum energy in the set of frames, and determining a potential music event based on the chroma match counter value.
One or more music states of the audio signal may be determined based on the one or more potential music events. In some instances, declaring that the audio signal includes music may be based on a transition of the one or more music states to a final state in a finite state machine. The transition of the one or more music states to the final state in the finite state machine may be based on a tone detection counter value accumulated over a subset of the set of frames satisfying a threshold, and the tone detection counter value may identify a tone event based on the spectral analysis. In some instances, the one or more music states of the audio signal may be determined based on a quantity of the one or more potential music events occurring within the set of frames. In some instances, a tone detection counter value may be set based on a quantity of chroma value changes over a defined time period, and music in the audio signal may be declared based on the one or more music states and the tone detection counter value.
Audio enhancement of the audio signal may be modified based on the one or more music states. Modifying the audio enhancement of the audio signal may comprise ceasing noise cancelation of the audio signal.
The features and advantages described in this summary and in the following detailed description are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the relevant art in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.
The Figures depict various example implementations for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative examples of the structures and methods illustrated herein may be employed without departing from the principles described herein.
The technology described herein monitors the content and/or sound characteristics of audio signals, automatically detects music, and, in some instances, may adjust audio enhancement based on the detection of music.
For instance, the disclosure describes a system and method for chroma detection in a communication system. Smart voice enhancement may improve voice quality by removing impairments such as noise and echo in telephony applications. In some implementations, the technology may detect music in real-time and bypass performing certain audio enhancement (e.g., reducing noise and echo) on it in order to deliver music to end users, because, for example, noise cancellation may distort music. It should be noted that although the term smart “voice” enhancement is used herein, the technology may be used to process and/or enhance any type of audio.
The technology described herein detects music in real-time as soon as possible among music, speech, and noise whenever music packets show up in telephony applications. For instance, to avoid an unpleasant experience for an end user, music detection time should be as short as possible (e.g., half a second to two seconds) for telephony applications, and detection accuracy should be very high. However, music detection in real-time by a computing device (e.g., on a client or server side) is difficult, in part, because music, speech, noise, and noisy speech share a common frequency bandwidth. Additionally, there are many different kinds of music, and assuming that a particular kind of music will be encountered may lead to decreased performance for other music types in audio streams. For example, music genres span an enormous range of forms and styles, from popular, rock, and jazz music, to symphonies with a full orchestra. Further, musical instruments may include, among others, percussion (e.g., piano, drum, bell, etc.), string (e.g., violin, viola, cello, guitar, etc.), woodwind (e.g., flute, clarinet, etc.), or brass (e.g., trombone, tuba, trumpet, etc.).
While previous technologies focused on heuristics for detecting specific songs, specific instruments, or specific genres of music, the technology described herein works across a variety of types of music, for example, by looking at underlying notes themselves. For example, the technology may perform music detection in real-time solely or partially based on processing incoming audio, which allows it to, for example, remove noise during speech without degrading music quality.
With reference to the figures, reference numbers may be used to refer to components found in any of the figures, regardless of whether those reference numbers are shown in the figure being described. Further, where a reference number includes a letter referring to one of multiple similar components (e.g., component 000a, 000b, and 000n), the reference number may be used without the letter to refer to one or all of the similar components.
A smart voice enhancement engine 101 is illustrated as residing on a server 115. It is to be understood that, in different implementations, the smart voice enhancement engine 101 can reside on different servers 115 or client devices 103, or be distributed between multiple computing systems in different ways, without departing from the scope of this disclosure.
Many different networking technologies can be used to provide connectivity from endpoint computer systems 103 to servers 115. Some examples include: LAN, WAN, and various wireless technologies. Endpoint systems 103 are able to access applications and/or data on server 115 using, for example, a web browser or other endpoint software (not shown). Endpoint client devices 103 can be in the form of, for example, desktop computers, laptop computers, smartphones, analog phones, or other communication devices capable of sending and/or receiving audio. Servers 115 can be in the form of, for example, rack mounted or tower computers or virtual servers implemented as software on a computing device, depending on the implementation.
Although
As illustrated in
The telephony engine 105 may include a switch 107 and, in some implementations, a smart voice enhancement engine 101. In some implementations, the switch 107 may include an application server that enables real-time communication of audio and/or video using telecommunications and/or Voice over Internet Protocol (VoIP), for example. The switch 107 may run one or more media bugs 109a and 109b, an audio mixer 111, and, in some instances, a smart voice enhancement engine 101 or components thereof.
In some implementations, a media bug 109 may include a dynamic library that provides an interface between one or more of the client devices 103, the smart voice enhancement engine 101, the audio mixer 111, the switch 107, and one or more other components of the telephony engine 105, such as a management interface (not shown). The audio mixer 111 may adjust volume levels, tones, or other elements of an audio signal, or perform other operations, depending on the implementation. The management interface may provide configuration and parameter setup for the modules of the smart voice enhancement engine 101, such as are shown in
In some implementations, the smart voice enhancement engine 101 may include a library implemented on top of the switch 107 platform, but independent of the switch 107 as a stand-alone library. The smart voice enhancement engine 101 may operate on the server 115, although it is possible for it to operate on one or more of the client devices 103 without departing from the scope of this disclosure. The smart voice enhancement engine 101 may improve voice quality in a communication system by removing impairments such as noise and echo in telephony applications. For instance, as described in further detail in reference to
One or more of the components of the telephony engine 105 (e.g., the switch 107, media bug 109, audio mixer 111, or smart voice enhancement engine 101) may include software including logic executable by a processor to perform their respective acts, although the component may be implemented in hardware (e.g., one or more application specific integrated circuits (ASICs) coupled to a bus for cooperation and communication with the other components of the telephony engine 105 and/or network architecture 100; sets of instructions stored in one or more discrete memory devices (e.g., a PROM, FPROM, ROM) that are coupled to a bus for cooperation and communication with the other components of the system; a combination thereof; etc.).
Other components (not illustrated) may be connected in a similar manner (e.g., document scanners, digital cameras, printers, etc.). Conversely, all of the components illustrated in
The bus 212 allows data communication between the processor 214 and system memory 217, which, as noted above may include ROM and/or flash memory as well as RAM. The RAM is typically the main memory into which the operating system and application programs are loaded. The ROM and/or flash memory can contain, among other code, the Basic Input-Output system (BIOS) which controls certain basic hardware operations. Application programs can be stored on a local computer readable medium (e.g., hard disk 244, solid state drive, flash memory) and loaded into system memory 217 and executed by the processor 214. Application programs can also be loaded into system memory 217 from a remote location (i.e., a remotely located computer system 210), for example via the network interface 248. In
The storage interface 234 is coupled to one or more hard disks 244 (and/or other standard storage media). The hard disk(s) 244 may be a part of computer system 210, or may be physically separate and accessed through other interface systems.
The network interface 248 can be directly or indirectly communicatively coupled to a network such as the Internet, a PSTN, etc. Such coupling can be wired or wireless.
As used herein, the terms “computer system,” “computer,” “endpoint,” “endpoint computer,” “server,” “server computer” and “computing device” mean one or more computers configured and/or programmed to execute the described functionality. Additionally, program code to implement the functionalities of the smart voice enhancement engine 101 can be stored on computer-readable storage media. Any form of tangible computer readable storage medium can be used in this context, such as magnetic, optical or solid state storage media. As used herein, the term “computer readable storage medium” does not mean an electrical signal separate from an underlying physical medium.
The smart voice enhancement engine 101 may use speech signal processing algorithms to enhance voice quality for VoIP, wireless, and PSTN telephony applications. As shown in the example illustrated in
Depending on the implementation, the FFT module 301 may convert an original time domain signal {x(n)} to the frequency domain. A voice activity detection algorithm may operate in the frequency domain, exploiting the fact that the frequency spectrum for noise tends to be flat. Similar to the voice activity detection algorithm, the smart music detection module 305 may operate in the frequency domain. The other modules (e.g., 307, 309, 311, or 313) may use the output of the smart music detection module to identify music, speech, or noise.
The SNC module 307 may remove ambient noise in the frequency domain, so that the listener feels much more comfortable when listening to the speech with the noise removed. The IFFT module 309 may convert the frequency domain signal back to the time domain by using the Inverse Fast Fourier Transform. The AEC 311 and SLC 313 may operate in the time domain to cancel acoustic echo and control audio volume levels, respectively. The output audio signal after smart voice enhancement processing is illustrated as {(n)}.
The AQE module 303 may use objective voice quality measurement algorithms to monitor smart voice enhancement for the audio signals before and after smart voice enhancement. In some implementations, the AQE module 303 may use ITU (International Telecommunications Union) standards for quality assessment, such as a G.107 E-model and/or a Perceptual Evaluation of Speech Quality (PESQ) test(s) to monitor quality of the audio signal. For example, the AQE module 303 may compare speech output in the outgoing audio signal with original clean audio in the incoming audio signal in order to get a mean opinion score (MOS). In some implementations, the G.107 E-model in the AQE module 303 may provide real-time and non-intrusive voice quality measurement, for example, in terms of the MOS value for each call. The MOS may represent a score of ratings gathered in a quality evaluation test, which may be manually or algorithmically performed.
The smart music detection module 305 may perform some or all of the operations described in reference to
For example, the smart music detection module 305 may increase a chroma consecutive match counter by one count if two or more octaves (e.g., among octaves 4-9) have the same chroma value with a maximum energy (also referred to as a peak chroma value), but may reset the counter to zero if one or fewer octaves have the same peak chroma value. Note that a chroma value may represent a note, frequency, or frequency range in a particular octave, as described in further detail below.
In some implementations, if chroma shows up in plural P consecutive frames consistently (e.g., a chroma shows up in a given percentage of a consecutive number of frames, such as 8 out of 10 consecutive frames), then a music event may be declared. Since the peak note in each octave for speech and noise normally shows a random pattern, the false detection probability of such a music event rather than speech or noise may be as small as one ten-millionth of a percent, depending on the circumstances. In some implementations, the smart music detection module 305 may also detect one or more noise or tone events during music detection based on spectral analysis of frames of the audio signal in order to rule out a false positive.
The smart music detection module 305 may include a finite state machine to further increase the music detection accuracy in the context of music, speech, and noise. One or more potential music events may be combined to form a music state of the finite state machine. In some implementations, detection of noise or a tone may reset a music state of the finite state machine. With increasing music events and satisfaction of other conditions, the finite state machine may move from state to state until a final state is reached, based upon which, the smart music detection module 305 may declare that music is present in an audio signal.
It should be noted that the smart music detection module 305 may include sub-components, algorithms, or routines, for example, which may perform one or more of the operations described in reference to the smart music detection module 305.
At 404, the smart voice enhancement engine 101 may determine a set of frames of the audio signal using the audio data. For instance, the smart voice enhancement engine 101 (e.g., the FFT module 301) may perform Fast Fourier Transform framing with a windowing function.
For example, the discrete Fourier transform (DFT) of the time-domain signal {x(n)} is given as follows:
where m is the frame number, k is the frequency bin, H is the frame hop size, N is the fast Fourier transform (FFT) size, and w(n) is the window function, n∈[0,N−1]. Example window functions that may be used may include rectangular, Bartlett, Hanning, Hamming, Blackman, and Kaiser windows, etc.
Similarly, it should be noted that, for use by the IFFT module 309 (or another component of the smart voice enhancement engine 101), the inverse DFT is given by
for the m-th frame.
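As a minimal sketch of the framing step, assuming the conventional short-time form in which frame m is taken as x(n+mH) multiplied by w(n) before an N-point DFT, the following computes one windowed frame with a plain (non-fast) DFT. The function names, the periodic Hann window, and the test signal are illustrative assumptions and are not taken from the described system.

```python
import cmath
import math


def hann_window(n_fft):
    """Periodic Hann window w(n), n in [0, n_fft - 1] (one of several possible windows)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]


def dft_frame(x, m, hop, n_fft, window):
    """Windowed DFT of frame m (assumed form):
    sum over n of x(n + m*hop) * w(n) * exp(-2j*pi*k*n/n_fft), for each bin k.
    """
    start = m * hop
    frame = [x[start + n] * window[n] for n in range(n_fft)]
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
        for k in range(n_fft)
    ]


# Example: a cosine centered exactly on bin 8 of a 64-point transform.
N_FFT, HOP = 64, 32
signal = [math.cos(2.0 * math.pi * 8 * n / N_FFT) for n in range(4 * N_FFT)]
spectrum = dft_frame(signal, m=1, hop=HOP, n_fft=N_FFT, window=hann_window(N_FFT))
peak_bin = max(range(N_FFT // 2 + 1), key=lambda k: abs(spectrum[k]))
```

In practice an FFT routine would replace the quadratic-time loop; the direct sum is used here only to keep the sketch dependency-free.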
One music symbolic representation is the Musical Instrument Digital Interface (MIDI) standard. Using MIDI note numbers, the equal-tempered scale gives the center frequency (Hz):
Fpitch(p)=440·2^((p−69)/12), 0≤p≤127, (3)
for each pitch p∈[0,127]. For example, for the reference pitch number p=69 corresponding to note A4, the frequency Fpitch(p)=440 Hz. For other notes from C1-B8, the corresponding frequencies can be found in the table illustrated in
BW(p)=Fpitch(p+0.5)−Fpitch(p−0.5), 0≤p≤127. (4)
From the relationship (4), the bandwidth BW(p) may be monotonically increasing with respect to the pitch p.
For each octave, there are twelve different notes. For example, each note may have a chroma value in the range [0, 11], where note C has chroma value 0 and note B has chroma value 11. In some instances, the note center frequency follows an exponential formula as in relationship (3), so the note with a given chroma value in octave i+1 has twice the frequency of the note with the same chroma value in octave i, for 0≤i≤9.
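Relationships (3) and (4) and the chroma mapping can be sketched directly, for example as follows. The function names are illustrative, and the chroma value is assumed to be the MIDI pitch number modulo 12, which is consistent with note C having chroma value 0 and note A4 being pitch 69.

```python
def f_pitch(p):
    """Center frequency in Hz of MIDI pitch p, per relationship (3)."""
    return 440.0 * 2.0 ** ((p - 69) / 12.0)


def bandwidth(p):
    """Bandwidth in Hz of pitch p, per relationship (4)."""
    return f_pitch(p + 0.5) - f_pitch(p - 0.5)


def chroma(p):
    """Chroma value in [0, 11]; C maps to 0 and B maps to 11 (assumed p mod 12)."""
    return p % 12


a4 = f_pitch(69)  # reference pitch A4
a5 = f_pitch(81)  # same chroma value, one octave higher
```

Because twelve semitones double the frequency, `a5` is twice `a4`, and `bandwidth(p)` grows with p in the same exponential fashion.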
In the DFT formula (1), frequency bin k corresponds to the physical frequency
Fcoef(k)=k·Fs/N, (5)
in Hz, where Fs is the sampling frequency in Hz, and N is the FFT size. It should be noted that, as illustrated in relationship (5), the frequencies corresponding to FFT bins may be linearly distributed, whereas the frequencies corresponding to pitches may follow logarithmic perception from (3). For a given pitch p, within its bandwidth BW(p), there may be multiple FFT bins, a single bin, or none at all. For pitch p, the smart music detection module 305 may define the FFT bin set as
BIN(p)={k:Fpitch(p−0.5)≤Fcoef(k)<Fpitch(p+0.5)}, 0≤p≤127. (6)
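The bin set in relationship (6) can be sketched as a simple filter over bin indices. This sketch assumes the narrow-band values Fs=8000 Hz and N=256 as defaults and assumes that only bins up to N/2 are considered; the function names are illustrative.

```python
def f_pitch(p):
    """Center frequency in Hz of MIDI pitch p, per relationship (3)."""
    return 440.0 * 2.0 ** ((p - 69) / 12.0)


def f_coef(k, fs, n_fft):
    """Physical frequency in Hz of FFT bin k, per relationship (5)."""
    return k * fs / n_fft


def bin_set(p, fs=8000, n_fft=256):
    """FFT bins whose frequency falls inside the band of pitch p, per relationship (6).

    Only bins 0..n_fft/2 are searched (an assumption of this sketch).
    """
    low, high = f_pitch(p - 0.5), f_pitch(p + 0.5)
    return [k for k in range(n_fft // 2 + 1) if low <= f_coef(k, fs, n_fft) < high]


bins_ds7 = bin_set(99)  # D#7: chroma value 3 in octave 7
bins_ds6 = bin_set(87)  # D#6: chroma value 3 in octave 6
bins_c1 = bin_set(24)   # C1: band narrower than the bin spacing
```

With these values, high pitches map to several bins while very low pitches may map to none, which is the behavior described above for the linear bin spacing.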
For the m-th frame, the pitch p has a log-frequency (LF) spectrogram corresponding to:
For chroma c∈[0,11], the smart music detection module 305 may define the chromagram as follows:
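One common formulation, assumed here for illustration, sums the squared spectral magnitudes over BIN(p) to obtain the LF spectrogram value for pitch p, and then sums the LF values over all pitches sharing a chroma value to obtain the chromagram. The sketch below follows that assumption for a single frame; the function names and the synthetic power spectrum are hypothetical.

```python
def f_pitch(p):
    """Center frequency in Hz of MIDI pitch p, per relationship (3)."""
    return 440.0 * 2.0 ** ((p - 69) / 12.0)


def bin_set(p, fs, n_fft):
    """FFT bins inside the band of pitch p, per relationship (6)."""
    low, high = f_pitch(p - 0.5), f_pitch(p + 0.5)
    return [k for k in range(n_fft // 2 + 1) if low <= k * fs / n_fft < high]


def lf_spectrogram(power, fs, n_fft):
    """Per-pitch energy for one frame: sum of power[k] over BIN(p) (assumed form)."""
    return [sum(power[k] for k in bin_set(p, fs, n_fft)) for p in range(128)]


def chromagram(lf):
    """Per-chroma energy: sum of LF values over pitches with p mod 12 == c (assumed form)."""
    result = [0.0] * 12
    for p, energy in enumerate(lf):
        result[p % 12] += energy
    return result


# Example: unit power at bins 40 and 80, which fall in D#6 and D#7 for Fs=8000, N=256.
FS, N_FFT = 8000, 256
power = [0.0] * (N_FFT // 2 + 1)
power[40] = 1.0
power[80] = 1.0
chroma = chromagram(lf_spectrogram(power, FS, N_FFT))
peak = max(range(12), key=lambda c: chroma[c])
```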
In a public switched telephone network (PSTN), the sampling rate may be fixed at Fs=8000 Hz, resulting in a maximum speech bandwidth of 4000 Hz, based on the sampling theorem, which corresponds to the narrow-band case. This sampling rate may also be used in voice over Internet Protocol (VoIP) and wireless cellular networks, for example, when the following speech codecs are used: G.711 (a-law and μ-law), G.729, G.723, G.726, AMR, GSM, GSM-HR, GSM-FR, etc. In some instances, a wide band with sampling rate Fs=16000 Hz and an effective signal bandwidth of 8000 Hz may be used. Wide-band codecs may include AMR-WB and G.722. Similarly, a full-band sampling rate Fs=48000 Hz with an effective signal bandwidth up to 24000 Hz, for example with the Opus codec, may be used.
In the narrow band case, N=256 points and the FFT has minimum granularity 8000/256=31.25 Hz based on (5) for the N bins, which may also be true for the wide band case with N=512. In the full band case, N=1024 points and the FFT has minimum granularity 48000/1024=46.875 Hz.
Although other implementations are possible (e.g., using the wide-band or full-band cases), for clarity of description, this disclosure is described using the narrow-band case. Based on the relationships (3)-(6), for the FFT size N=256, the frequency bins corresponding to each octave may be distributed as illustrated in
The last three columns 907, 909, and 911 in the table in
At 406, the smart music detection module 305 may identify one or more potential music events based on a spectral analysis of the set of frames. For instance, the smart music detection module 305 may perform spectral analysis per frame in the incoming audio signal and, based on the analysis of a set of frames, may construct one or more music events. For example, the smart music detection module 305 may determine a music event based on consecutive P frames where chroma shows up consistently (e.g., a threshold quantity in a given set) and may update chroma detection statistics in a storage device.
Performing spectral analysis for the incoming audio signal may include calculating a signal/spectral energy in a frequency domain per note in each octave (e.g., of octaves 4-9) based on frequencies for an equal-tempered scale. The smart music detection module 305 may also calculate energy per octave for octaves 0-9.
In some implementations, the smart music detection module 305 may find a peak note with maximum energy in each octave in the linear domain, as well as that with maximum averaged energy in decibel (dB) domain. If the smart music detection module 305 determines that, within a small dB range, there are too many chroma values (e.g., four to ten values) achieving the same maximum energy value, then the smart music detection module 305 may determine that the frame is a noise frame and no music is present and, depending on the implementation, may reset the state of the finite state machine to the initial state S0. These and other operations are described in further detail at least in reference to
At 408, the smart music detection module 305 may determine whether the one or more potential music events include a noise or tone event based on the spectral analysis. For example, a fixed-spectral pattern, such as a tone, noise, tone-like noise, sirens, etc., may be differentiated from a music event by implementing a tone detection algorithm.
In some implementations, the smart music detection module 305 may compare the power spectral density per critical band with that of a previous frame and, within a small dB range, if the power spectral density does not change too often (e.g., a quantity of changes falls below a defined threshold, such as eight times in ten consecutive frames), then the smart music detection module 305 may determine that no music is present. Similarly, the smart music detection module 305 may sum power spectral density differences over the critical bands, and may determine, based on frequent (e.g., beyond a defined threshold, such as five times in ten consecutive frames) peak note changes, that fixed-pattern noise is present. These and other operations are described in further detail at least in reference to
At 410, the smart music detection module 305 may determine one or more music states of the audio signal based on the one or more potential music events. For example, a finite state machine may be implemented for chroma detection to increase the music detection accuracy in the context of music, speech, and noise. The finite state machine may require multiple instances of music event detection (e.g., five to twenty times), within specified time duration, in order to declare the final music detection.
For instance, the finite state machine may include plural R music states. The finite state machine may transition between states based on the quantity of music events detected and, in some implementations, based on other conditions, as described in further detail below. Additionally, the smart music detection module 305 may reset or reduce the state of the finite state machine based on other conditions, such as an insufficient consistency or frequency of peak chroma values or detection of tone or noise events. For instance, the smart music detection module 305 may reset the finite state machine state to the initial state S0 if Q music events are not found within a specified plural L frames in any state, or may move the finite state machine to the next state otherwise.
In some instances, the smart music detection module 305 may reduce or reset the finite state machine to the initial state S0 if speech or noise is identified. In some implementations, the smart music detection module 305 may accumulate a chroma match counter, total note change counter, etc., across frames in the finite state machine. For example, in some implementations, the total note changes in the finite state machine must not exceed a boundary threshold in order for the final music detection to be declared, so that noise is excluded from a potential music event. Similarly, tone or tone-like events may be differentiated from a music event. The smart music detection module 305 may also accumulate a tone detection counter from the potential music events in the finite state machine. If the total tone detection counter exceeds a boundary threshold, the smart music detection module 305 may declare a tone event and, in some instances, reset the state of the finite state machine based on the tone event. The finite state machine and transitions between the states of the finite state machine based on music events, tones, and noise are described in further detail below in reference to
At 412, the smart music detection module 305 may declare that the audio signal includes music based on the one or more music states and whether the music events include a noise or tone event. For example, the smart music detection module 305 may declare music in the audio signal based on a transition to a final state of the finite state machine, such as is described in further detail in reference to
At 414, the smart voice enhancement engine 101 may modify audio enhancement of the audio signal based on the music declaration and/or music states. For example, if music is detected, the smart music detection module 305 may transmit a signal indicating the music to the SNC module 307, AEC module 311, or SLC module 313, which may cease or modify audio enhancement for a duration of the detected music. For example, smart voice enhancement engine 101 may cease noise cancelation of the audio signal during the frames that include detected music.
At 504, the smart music detection module 305 may estimate the energy for each chroma value in one or more frames in the audio signal.
For example, the smart music detection module 305 may calculate LF spectrogram (7) per chroma value in each octave (e.g., the set of octaves 4-7, as described above), based on the table described in
In some instances, the smart music detection module 305 may determine the signal energy estimate for each chroma value i in the m-th frame:
where α is a smoothing factor, 0≤α<1, and BH(i) and BL(i) are the highest and lowest FFT bins corresponding to chroma value i, respectively. For example, BH(3)=81 and BL(3)=78 for octave 7; BH(3)=40 and BL(3)=39 for octave 6. In some implementations, the smart music detection module 305 may select α from example values such as α=0.55, α=0.75, or α=0.9.
In some implementations, the smart music detection module 305 may evaluate an averaged chroma energy per FFT bin in dB domain, which may be defined by the relationship
where E(m,i) is given by the relationship (9).
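A minimal sketch of the two quantities follows, assuming that relationship (9) is a first-order recursive average of the per-frame band energy (with α weighting the previous estimate) and that relationship (10) divides the energy by the number of bins before converting to dB. Both forms, the function names, and the small floor that avoids a logarithm of zero are assumptions of this sketch.

```python
import math


def smoothed_energy(prev_energy, power, b_low, b_high, alpha=0.75):
    """Assumed form of (9): alpha * E(m-1, i) + (1 - alpha) * sum of power over [b_low, b_high]."""
    frame_energy = sum(power[k] for k in range(b_low, b_high + 1))
    return alpha * prev_energy + (1.0 - alpha) * frame_energy


def averaged_energy_db(energy, b_low, b_high, floor=1e-12):
    """Assumed form of (10): 10 * log10 of the energy averaged per FFT bin."""
    n_bins = b_high - b_low + 1
    return 10.0 * math.log10(max(energy / n_bins, floor))


# Example: chroma value 3 in octave 7 spans bins 78..81 in the narrow-band case.
power = [0.0] * 129
for k in range(78, 82):
    power[k] = 2.0
energy = smoothed_energy(prev_energy=0.0, power=power, b_low=78, b_high=81, alpha=0.75)
energy_db = averaged_energy_db(energy, b_low=78, b_high=81)
```

Averaging per bin keeps wide, high-octave bands from dominating narrow, low-octave bands simply because they contain more bins.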
In some implementations, the smart music detection module 305 may repeat the computations at (9) and (10) per chroma value in each octave (e.g., in the set of octaves 4-7), based on the table illustrated in
It should be noted that additional or alternative operations for spectral analysis, such as determining the maximum averaged energy in a dB domain, are described in reference to
At 506, the smart music detection module 305 may identify chroma value(s) with maximum energy (also referred to herein as peak chroma values) in one or more octaves based on the estimate. For instance, the smart music detection module 305 may find the peak note or chroma value with maximum energy in each octave in a linear domain. For example, the spectrogram per chroma value in each octave of octaves 5-7 may be given by the relationship (9), using which the smart music detection module 305 may determine the chroma value with a maximum energy in each octave. In some implementations, identifying the peak chroma value may include sorting the energies for the chroma values (e.g., determined at 504) in each octave and then selecting the chroma value with the highest energy, although other implementations are possible.
At 508, the smart music detection module 305 may set a chroma match score for the current frame based on the number of octaves with chroma value(s) with the same maximum energy. For instance, the smart music detection module 305 may count octaves that have matching peak chroma values with maximum energy.
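Peak selection per octave can be sketched as follows, assuming the per-chroma energies for each octave are already available as lists of twelve values. The data layout (a dictionary keyed by octave number) and the sample energies are hypothetical.

```python
def peak_chroma(energies):
    """Return the chroma index (0..11) with the maximum energy in one octave."""
    return max(range(12), key=lambda c: energies[c])


def peak_chromas(octave_energies):
    """Map each octave number to its peak chroma value."""
    return {octave: peak_chroma(e) for octave, e in octave_energies.items()}


# Example: octaves 5 and 7 peak on chroma value 3; octave 6 peaks on chroma value 10.
octave_energies = {
    5: [0.1, 0.2, 0.1, 0.9, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.3, 0.1],
    6: [0.2, 0.1, 0.1, 0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.1],
    7: [0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1],
}
peaks = peak_chromas(octave_energies)
```

Taking the maximum directly is equivalent to sorting and selecting the highest energy, and avoids the sort.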
For example, among a defined set of octaves (e.g., octaves 5-7), if two octaves or three octaves have the same note with peak energy, the smart music detection module 305 may assign a chroma match score for a current frame. In some instances, as shown in relationship (19) below, the smart music detection module 305 may assign a double match score if three octaves have the same peak chroma value. If no chroma is found, then the smart music detection module 305 may set the chroma match score to zero. As an example, the chroma match score may be defined as follows:
Match scores of four and eight may be chosen in (19) to represent the cases in which two or three octaves, respectively, have the same peak chroma value; however, it should be noted that other real numbers may be used without departing from the scope of this disclosure.
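The scoring rule described above can be sketched as a small function over the peak chroma values of the octaves in the defined set. The function name and the list-based input are illustrative assumptions; the score values of four and eight follow the example given for relationship (19).

```python
from collections import Counter


def chroma_match_score(peaks, two_score=4, three_score=8):
    """Score one frame from the peak chroma values of the octaves in the set.

    Returns three_score if three (or more) octaves share a peak chroma value,
    two_score if exactly two do, and zero otherwise.
    """
    if not peaks:
        return 0
    most_shared = Counter(peaks).most_common(1)[0][1]
    if most_shared >= 3:
        return three_score
    if most_shared == 2:
        return two_score
    return 0


score_three = chroma_match_score([3, 3, 3])  # all of octaves 5-7 agree
score_two = chroma_match_score([3, 7, 3])    # two octaves agree
score_none = chroma_match_score([1, 2, 3])   # no agreement
```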
In some implementations, the smart music detection module 305 may perform one or more of the operations described in reference to
At 510, the smart music detection module 305 may set a chroma match counter value based on the number of octaves with chroma value(s) with the same maximum energy. For instance, the chroma match counter value may be determined based on the chroma match score, which may be based on a quantity of the plurality of octaves that include a matching chroma value with maximum energy in the set of frames. For example, if the chroma match score is positive for the current frame, the smart music detection module 305 may increase a chroma consecutive match counter by one. In some implementations, if the chroma match score for the frame is zero, the smart music detection module 305 may reset the chroma match counter to zero, although it should be noted that, in other implementations, the smart music detection module 305 may forgo increasing or may decrease the chroma match counter.
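The counter update can be sketched as follows. The `reset_on_miss` flag is a hypothetical parameter that selects between resetting the counter on a zero score and simply leaving it unchanged; the decrementing variant is omitted.

```python
def update_match_counter(counter, match_score, reset_on_miss=True):
    """Advance the chroma consecutive match counter for one frame."""
    if match_score > 0:
        return counter + 1
    # On a zero score, either reset to zero or leave the counter unchanged.
    return 0 if reset_on_miss else counter


# Example: per-frame scores and the resulting counter trajectory.
scores = [4, 8, 0, 4]
counter = 0
trajectory = []
for score in scores:
    counter = update_match_counter(counter, score)
    trajectory.append(counter)
```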
In some implementations, at 512, the smart music detection module 305 may identify a noise or tone event based on spectral analysis for one or more of the set of frames. For example, in some instances, before declaring a music event being present, the smart music detection module 305 may exclude noise and tone-like signals from a potential music event by defining pre-requisite conditions for determination of a music event. For example, noise and tone spectrums tend to be relatively flat, so the smart music detection module 305 may declare a noise or tone event, for example based on multiple maxima in an octave (e.g., based on relationship (9)). An example method for identifying a noise or tone event is described in further detail in reference to
At 514, the smart music detection module 305 may determine a potential music event, for example, based on the chroma match counter satisfying a threshold. For instance, if the smart music detection module 305 determines that a chroma is identified consistently (e.g., at a threshold percentage) in a set of consecutive frames (e.g., ten frames), then a music event may be declared. For example, a peak chroma value may be identified as sufficiently consistent if it shows up in a threshold quantity of frames.
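The counter update at 510 and the event decision at 514 can be sketched together as follows. The sketch assumes the reset-to-zero variant of the counter, a requirement of ten consecutive matching frames, and a counter restart after each declared event; the function and parameter names are hypothetical.

```python
def detect_music_events(frame_scores, frames_per_event=10):
    """Return the frame indices at which a potential music event is declared.

    frame_scores: per-frame chroma match scores (0 means no chroma match).
    """
    counter = 0
    events = []
    for m, score in enumerate(frame_scores):
        # Increase on a positive score, reset on a zero score (one described option).
        counter = counter + 1 if score > 0 else 0
        if counter >= frames_per_event:
            events.append(m)
            counter = 0  # assumption: counting starts afresh after an event
    return events
```

With these assumptions, twenty consecutive matching frames yield two events, while a single zero-score frame in the middle of a run delays the next event until ten further matching frames have been seen.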
At 516, the smart music detection module 305 may determine a music state of the finite state machine based on the potential music event(s). As discussed above, a finite state machine may include multiple music states to increase the music detection accuracy in the context of music, speech, and noise. For example, a single music event, or multiple music events, may form a state, such that determining one or more music states of the audio signal may be based on a quantity of the one or more potential music events that occur within a set of frames.
In some implementations, the smart music detection module 305 may consider two music events to form a music state, although other implementations are possible and contemplated herein. For example, in implementations where the finite state machine includes a total of eight states, S0-S7, the final state S7 may be a music detected state. In some instances, each of the states S0-S6 may have a maximum life length L, for example L=200, 300, or 400 frames. After L frames in a state Si, if the smart music detection module 305 has not detected two music events (or another defined quantity), then it may reset the finite state machine to the initial state S0. However, in some instances, if the smart music detection module 305 detects two music events within L frames, it may move the finite state machine to the next state Si+1, 0≤i≤6.
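A minimal sketch of such a finite state machine is shown below, assuming eight states (S0-S7), two music events per state, and a life length of 300 frames. The class and attribute names are hypothetical, and the sketch deliberately omits any additional conditions that may apply to the transition into the final state.

```python
class MusicStateMachine:
    """Counts music events per state and advances, times out, or holds."""

    def __init__(self, final_state=7, events_per_state=2, max_life=300):
        self.final_state = final_state
        self.events_per_state = events_per_state
        self.max_life = max_life
        self.reset()

    def reset(self):
        """Return to the initial state S0 and clear the per-state tallies."""
        self.state = 0
        self.frames_in_state = 0
        self.events_in_state = 0

    def step(self, music_event):
        """Process one frame; music_event is True if an event was declared on it."""
        if self.state == self.final_state:
            return self.state  # music detected state is held
        self.frames_in_state += 1
        if music_event:
            self.events_in_state += 1
        if self.events_in_state >= self.events_per_state:
            # Enough events within the life length: move from Si to Si+1.
            self.state += 1
            self.frames_in_state = 0
            self.events_in_state = 0
        elif self.frames_in_state >= self.max_life:
            # Life length expired without enough events: back to S0.
            self.reset()
        return self.state
```

Under these assumptions, fourteen music events, each arriving before its state expires, carry the machine from S0 to S7.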
In some implementations, as described in reference to
At 518, the smart music detection module 305 may determine whether the music state of the finite state machine is at a final state transition. In some implementations, a transition between various of the states may include additional or different conditions for the transition. For example, if the finite state machine is at a state S6 (e.g., in the 8 state implementation described above), a transition to a final state may require additional conditions to be met. For instance, if two music events are found when at a state S6, the smart music detection module 305 may determine whether conditions are satisfied, for example, as described in reference to the operation at 522.
In response to determining, at 518, that the music state of the finite state machine is at a final state transition, the smart music detection module 305 may verify, at 522, that any additional criteria for transitioning to the final state are satisfied. For example, the smart music detection module 305 may verify whether the audio signal includes music based on a tone detection counter and a chroma match score.
As described above, the smart music detection module 305 may accumulate a music note change counter num_note_change for each music event and finite state machine state. At state S6, if two music events are found, then, before declaring a final music detection, the smart music detection module 305 may verify whether the following condition is satisfied
num_note_change<Δ3, (29)
where Δ3 is a constant (e.g., 20, 30, or 40). In some implementations, the smart music detection module 305 may reset the state to an initial state S0 in the case of a tone event and a note change counter satisfying or exceeding a threshold. For example, the smart music detection module 305 may determine that there are too many note changes in a short time based on condition (29) not being satisfied, which may indicate that music is not present. In some implementations, based on this determination, the smart music detection module 305 may reset the finite state machine state to an initial state S0.
Additionally, in some implementations, the smart music detection module 305 may accumulate the chroma match score (e.g., defined in (19)) during music events across the finite state machine states. The total match score may be tracked using a variable chroma_match_score. For instance, the smart music detection module 305 may accumulate the chroma match score over the P consecutive frames of each music event and over the states in the finite state machine. Similarly, the smart music detection module 305 may accumulate a tone detection counter num_tone_detect (e.g., described in reference to 732 below) for each music event in the states of the finite state machine.
In some implementations, at state S6, if two music events are found, the smart music detection module 305 may, before declaring a final music detection, verify whether the following conditions are satisfied
num_tone_detect≥Δ4, (30)
chroma_match_score<Δ5, (31)
where Δ4 and Δ5 are some constants (e.g., Δ4=15, 25, or 35, and Δ5=560, 660, or 760). In some implementations, if both (30) and (31) are satisfied simultaneously, then the smart music detection module 305 may determine that a tone event is present and, in some instances, may reset the state to S0.
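The checks applied at state S6 can be sketched as a single decision function. The sketch below uses the mid-range example constants (Δ3=30, Δ4=25, Δ5=660) and assumes that any combination other than the two described reset cases allows the transition to proceed; that fallback, the function name, and the returned labels are assumptions of this sketch.

```python
def verify_final_transition(num_note_change, num_tone_detect, chroma_match_score,
                            delta3=30, delta4=25, delta5=660):
    """Return "reset" or "music" for the transition out of state S6."""
    # Condition (29) not satisfied: too many note changes in a short time.
    if not (num_note_change < delta3):
        return "reset"
    # Conditions (30) and (31) both satisfied: treated as a tone event.
    if num_tone_detect >= delta4 and chroma_match_score < delta5:
        return "reset"
    return "music"  # assumption: all other cases allow the final transition
```

For example, a signal with many tone detections and a low accumulated match score would be rejected as a tone, whereas the same match score with few tone detections would pass in this sketch.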
In some implementations, if one of (30) and (31) is satisfied, the smart music detection module 305 may advance the state to a final state at 524 and 526. For example, at 524, the smart music detection module 305 may determine whether the audio signal includes verified music and, in response to a positive determination, may declare that music is detected at 526. In response to a negative determination at 524, the smart music detection module 305 may declare that the audio signal and/or the set of analyzed frames do not include music. In some implementations, whether the smart music detection module 305 declares music as present at 526 or not present at 528, the method described in
In response to determining, at 518, that the music state of the finite state machine is not a final state, the smart music detection module 305 may determine, at 520, whether the audio data includes another frame to analyze. In response to a positive determination at 520, it may return to the operation 502 for the next frame in the set of frames to be analyzed.
In response to determining, at 520, that the audio signal and/or set of frames of the audio signal does not include additional frames to analyze, the smart music detection module 305 may proceed to 528, where a non-music state may be declared by the smart music detection module 305.
The description herein indicates that a music event may consist of P consecutive frames, that Q music events may form a state, and that the finite state machine may include a total of R states. The description uses P=10, Q=2, and R=7, but it should be noted that there are many combinations of (P,Q,R) that may be used without departing from the scope of this disclosure and that these values are provided by way of example.
In some implementations, at 602, the smart music detection module 305 may estimate the energy for critical bands in the audio signal, for example, in a dB domain.
In some implementations, in order to discriminate a music event from speech or noise, the smart music detection module 305 may perform spectral analysis based on critical bands. In the voice spectrum, critical bands may be defined using the Bark scale: 100 Hz, 200 Hz, 300 Hz, 400 Hz, 510 Hz, 630 Hz, 770 Hz, 920 Hz, 1080 Hz, 1270 Hz, 1480 Hz, 1720 Hz, 2000 Hz, 2320 Hz, 2700 Hz, 3150 Hz, 3700 Hz, 4400 Hz, 5300 Hz, 6400 Hz, 7700 Hz, 9500 Hz, 12000 Hz, and 15500 Hz. In the case of narrow band, wide band, and full band, there may be eighteen, twenty-two, and twenty-five critical bands, respectively.
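The band counts quoted above can be reproduced from the listed band edges. The sketch below assumes sampling rates of 8000 Hz, 16000 Hz, and 48000 Hz for narrow, wide, and full band, and assumes that the last band is the one cut off at the Nyquist frequency; the constant and function names are hypothetical.

```python
from bisect import bisect_left

# Upper band edges in Hz, as listed in the text.
BARK_UPPER_EDGES_HZ = [
    100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
    2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500,
]


def num_critical_bands(sample_rate_hz):
    """Count critical bands up to the Nyquist frequency of the given rate."""
    nyquist = sample_rate_hz / 2
    # Bands whose upper edge lies below Nyquist are complete.
    complete = bisect_left(BARK_UPPER_EDGES_HZ, nyquist)
    # One further band spans from the last complete edge up to Nyquist.
    return complete + 1
```

With these assumptions, the function returns 18, 22, and 25 for the three sampling rates, matching the counts given in the text.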
The smart music detection module 305 may estimate the signal energy for the i-th critical band using
where 0≤i<Nc, α is a smoothing factor, 0≤α<1, Nc is the total number of critical bands, and CBH(i) and CBL(i) are the highest and lowest FFT bins for the i-th critical band, respectively. Nc=18, 22, and 25 for the narrow, wide, and full bands, respectively. In some instances, the dB value of the signal spectral energy for the i-th critical band is defined by
EdBcb(m,i)=10 log10Ecb(m,i), 0≤i<Nc. (12)
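A sketch of the per-band energy estimate and its dB conversion follows. It assumes that the energy relationship is a first-order recursive smoothing, with factor α, of the mean bin power between CBL(i) and CBH(i); that form, the function names, and the small floor used to avoid taking the logarithm of zero are assumptions of this sketch rather than the relationship as filed.

```python
import math


def update_band_energy(prev_energy, power_spectrum, cb_low, cb_high, alpha=0.5):
    """One smoothed energy update for a single critical band.

    power_spectrum: squared FFT magnitudes for the current frame.
    cb_low, cb_high: lowest and highest FFT bin of the band (inclusive).
    """
    bins = power_spectrum[cb_low:cb_high + 1]
    frame_energy = sum(bins) / len(bins)  # assumed: mean power over the band
    return alpha * prev_energy + (1.0 - alpha) * frame_energy


def energy_to_db(energy, floor=1e-12):
    """Relationship (12): 10*log10 of the band energy, floored to avoid log(0)."""
    return 10.0 * math.log10(max(energy, floor))
```

A larger α weights the previous frame more heavily and therefore smooths the estimate more strongly.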
The total signal energy in dB based on all critical bands may be given by
for the m-th frame.
At 604, the smart music detection module 305 may identify chroma value(s) with maximum averaged energy, and, at 606, may determine a noise event for frame(s) based on a threshold quantity of chroma values having energy within a defined range of the maximum averaged energy.
In some implementations, the smart music detection module 305 may find the peak note with maximum averaged energy in the dB domain. For example, the smart music detection module 305 may use the formula (10) to determine the chroma value with maximum averaged energy in the dB domain. In some instances, the peak note with maximum averaged energy in the dB domain may coincide with the peak note with maximum energy in the linear domain (e.g., as described in reference to
In some instances, the incoming audio may satisfy a minimum total energy requirement
EdBtotal(m)≥Δ1, (14)
where Δ1 is a small constant, e.g., −55 dB, −60 dB, or −65 dB. Within a small dB range (e.g., 1/20 dB, 1/10 dB, or ⅕ dB), the smart music detection module 305 may identify chroma values close to the maximum averaged energy (e.g., within a defined range) in dB from (10) in each octave. If the total number of identified chroma values within a defined range (e.g., based on (14) above) is greater than a threshold (e.g., five to ten), then the smart music detection module 305 may determine that the frame is a noise frame in which no chroma is present, and may reset the state of the finite state machine to the initial state S0. If chroma is present, the smart music detection module 305 may continue the chroma analysis for the frame.
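The noise-frame test can be sketched as a count of chroma values that sit within a small dB range of each octave's peak. The sketch assumes that the peak itself is not counted, a range of 1/10 dB, and a threshold of five; the function and parameter names are hypothetical.

```python
def is_noise_frame(chroma_db_by_octave, range_db=0.1, max_near_peak=5):
    """Flag a frame whose octaves have many chroma values near the peak energy.

    chroma_db_by_octave: for each analyzed octave, the twelve averaged chroma
    energies in dB.
    """
    near_peak = 0
    for octave in chroma_db_by_octave:
        peak = max(octave)
        # Count the other chroma values within range_db of this octave's peak.
        near_peak += sum(1 for e in octave if peak - e <= range_db) - 1
    return near_peak > max_near_peak
```

A flat spectrum puts nearly every chroma value within the range and is flagged, whereas a frame with one clear peak per octave is not.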
In some implementations, when evaluating the chroma values with maximum averaged energy per frame, the smart music detection module 305 may calculate the dB values for chroma values in octaves 5-7 (e.g., 36 log10 function calls). In some instances, to save CPU usage, the smart music detection module 305 may create an equivalent linear domain evaluation. In a linear domain, the following inequality between the maximum averaged energy Emax and the note energy Enote:
Emax−Enote≤γ0Emax, (15)
is equivalent to the following inequality in the dB domain
10 log10(Emax)−10 log10(Enote)≤Δ0, (16)
where γ0 is a constant. From (15) and (16), it follows that
γ0=1−10^(−Δ0/10). (17)
Thus, by choosing γ0 as in (17), the dB evaluation in (16) may be replaced by an equivalent linear domain evaluation (15), where Δ0 is a small dB number (e.g., 1/20 dB, 1/10 dB, or ⅕ dB).
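The equivalence can be checked numerically. Requiring the dB difference between Emax and Enote to be at most Δ0 gives Enote ≥ Emax·10^(−Δ0/10), which rearranges to the linear test with γ0 = 1 − 10^(−Δ0/10); the sketch below assumes that reading of (17), and its function names are hypothetical.

```python
import math


def gamma0_from_db(delta0_db):
    """Linear-domain constant equivalent to a dB margin of delta0_db."""
    return 1.0 - 10.0 ** (-delta0_db / 10.0)


def near_peak_linear(e_max, e_note, gamma0):
    """Linear-domain test (15): no logarithm calls needed."""
    return e_max - e_note <= gamma0 * e_max


def near_peak_db(e_max, e_note, delta0_db):
    """dB-domain test: two logarithm calls per comparison."""
    return 10.0 * math.log10(e_max) - 10.0 * math.log10(e_note) <= delta0_db
```

Because γ0 depends only on Δ0, it can be computed once, after which each per-note comparison costs a subtraction and a multiplication instead of a logarithm.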
Similar to (14), the maximum averaged energy in the dB domain in each octave may be required to be greater than a constant
where Δ2 is a small constant, for example, −55 dB, −60 dB, or −65 dB. If at least one octave among octaves 5-7 does not satisfy (18), then the frame may not satisfy the music event condition.
At 608, the smart music detection module 305 may determine a state of the finite state machine based on the noise event. For example, as described above, if the smart music detection module 305 detects a noise event, it may reset the state of the finite state machine to an initial state, depending on the implementation.
The chroma match score (19) may be based on the peak chroma value in octaves 5-7 in a current frame, for example. The smart music detection module 305 may also track peak chroma value changes across consecutive frames, because music notes tend to last for a while (e.g., 100 ms-2 seconds), depending on factors such as tempo and the sheet music. For example, if the FFT frame time is 10 ms, then ten frames last 100 ms. Frequent peak note changes within ten consecutive frames may indicate that no music event is present, as described in further detail below.
In some implementations, at 702, the smart music detection module 305 may store peak chroma values in each octave and frame in arrays. For example, the smart music detection module 305 may quantify peak chroma values in each octave (e.g., in the set of octaves 5-7) for one or more frames including saving the peak chroma values in arrays peak_note[ ] and peak_pre_note[ ] for the current and previous frames, respectively.
At 704, the smart music detection module 305 may determine peak chroma value changes over frames. For instance, the smart music detection module 305 may use
which represents the peak chroma value changes in the previous frames in octaves (e.g., past two frames in octaves 5-7).
At 706, the smart music detection module 305 may determine whether chroma change criteria are satisfied. If the criteria are satisfied, the smart music detection module 305 may proceed to the operation at 714, depending on the implementation. In some implementations, if the criteria are not satisfied, the smart music detection module 305 may proceed to the operation at 708.
In some implementations, the chroma change criteria may include that, for music, i) D0 should be less than or equal to a small number (e.g., 3), and that ii) at least two peak notes in octaves 5-7 remain the same (or a different quantity in a different set of octaves).
At 708, the smart music detection module 305 may determine a value of the music note change counter based on the criteria not being satisfied. For instance, the smart music detection module 305 may increase a music note change counter num_note_change by one if both of the criteria i) and ii) are not satisfied. For example, the smart music detection module 305 may increase the note change counter if, in a set of two consecutive frames, no two peak notes remain the same. In some implementations, the smart music detection module 305 may increase the note change counter if the peak note changes more than a threshold quantity of times.
At 710, the smart music detection module 305 may determine whether a threshold for the music note change counter has been satisfied. In some implementations, if the music note change counter threshold is satisfied, the smart music detection module 305 may declare that no music event(s) are present in the frames at 712. For example, if, within ten consecutive frames (or another quantity), the music note change counter satisfies or exceeds a defined threshold (e.g., 5, 7, 8, etc.), the smart music detection module 305 may declare that music is not present in those frames and, in some implementations, may reset the state of the finite state machine to the initial state.
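The note change bookkeeping at 702-712 can be sketched as follows. The sketch implements only criterion ii), that at least two peak notes remain the same between consecutive frames, and leaves out the D0 criterion; the function names and the threshold of seven are assumptions drawn from the example values.

```python
def count_note_changes(peak_notes_per_frame, min_same=2):
    """Count frame-to-frame transitions in which too few peak notes persist.

    peak_notes_per_frame: one tuple per frame holding the peak chroma index
    of each analyzed octave (e.g., octaves 5-7).
    """
    num_note_change = 0
    for prev, cur in zip(peak_notes_per_frame, peak_notes_per_frame[1:]):
        same = sum(1 for p, c in zip(prev, cur) if p == c)
        if same < min_same:
            num_note_change += 1
    return num_note_change


def music_ruled_out(num_note_change, threshold=7):
    """True when the note change counter satisfies or exceeds the threshold."""
    return num_note_change >= threshold
```

A sustained note produces no changes over ten frames, while peak notes that shift on every frame produce nine changes and rule out a music event under these assumptions.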
In some implementations, at 714, the smart music detection module 305 may compute power spectral density per critical band over a set of frames, and, at 716, the smart music detection module 305 may determine power spectral density change over the critical bands and over the set of frames.
In some implementations, to find signals with fixed spectral patterns (e.g., noise), the smart music detection module 305 may employ power spectral density per critical band introduced in (11)-(13). For example, the power spectral density change between consecutive frames may be determined as follows:
D1(m,i)=|EdBcb(m,i)−EdBcb(m−1,i)|, 0≤i<Nc. (21)
The total power spectral density change over Nc critical bands may be given by
Similarly, the power spectral density change between the m-th frame and the (m−2)-th frame may be given by
D2(m,i)=|EdBcb(m,i)−EdBcb(m−2,i)|, 0≤i<Nc. (23)
The total power spectral density change over Nc critical bands between the m-th frame and the (m−2)-th frame may be given by
At 718, the smart music detection module 305 may determine whether the quantity of critical bands satisfies a threshold and/or whether the total power spectral density change satisfies criteria over the set of frames.
For example, the smart music detection module 305 may check how many critical bands satisfy
D1(m,i)≤δ1, 0≤i<Nc, (25)
where δ1 is a small constant (e.g., ⅕ dB or 1/10 dB). The smart music detection module 305 may additionally or alternatively check the total power spectral density change/difference
D1(m)≤δ2, (26)
where δ2 is a small constant (e.g., ½ dB or ⅓ dB).
At 720, the smart music detection module 305 may determine whether the condition is satisfied for a threshold quantity of frames. In some implementations, if the condition is satisfied, the smart music detection module 305 may proceed to the operation at 722, where the smart music detection module 305 may declare that the analyzed set of frames includes a fixed spectral pattern event, such as a noise or tone event, based on the number of frames that satisfy the criteria in a set of frames (e.g., consecutive frames).
For example, if the total quantity of critical bands satisfying (25) is greater than a threshold (e.g., 13), or the total power spectral density change satisfies (26), then the smart music detection module 305 may increase the critical band match counter num_cb_match by one. Similarly, the smart music detection module 305 may compare the power spectral density changes between the m-th frame and the (m−2)-th frame, defined by (23) and (24), against the thresholds S and 4, respectively.
In some implementations, if num_cb_match is increased at least eight times within ten consecutive frames (or another quantity in a different set of frames), the smart music detection module 305 may determine (e.g., based on the power spectral density not changing in consecutive frames) that noise with a fixed spectral pattern is present. In such an instance, the smart music detection module 305 may determine that the analyzed set of frames does not include a music event at 724.
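The fixed spectral pattern test can be sketched using the one-frame differences of (21). The sketch assumes that the total change is the sum of the per-band changes, uses δ1 = ⅕ dB, δ2 = ½ dB, a band threshold of 13, and eight matches, and omits the two-frame comparison of (23) and (24); the function names are hypothetical.

```python
def is_static_spectrum_frame(cur_db, prev_db, delta1=0.2, delta2=0.5, min_bands=13):
    """Check conditions (25) and (26) for one pair of consecutive frames.

    cur_db, prev_db: per-critical-band energies in dB.
    """
    changes = [abs(c - p) for c, p in zip(cur_db, prev_db)]
    bands_unchanged = sum(1 for d in changes if d <= delta1)
    total_change = sum(changes)  # assumed: total is the sum over bands
    return bands_unchanged > min_bands or total_change <= delta2


def has_fixed_spectral_pattern(frames_db, min_matches=8):
    """Count matching consecutive frame pairs, mirroring num_cb_match."""
    num_cb_match = sum(
        1 for prev, cur in zip(frames_db, frames_db[1:])
        if is_static_spectrum_frame(cur, prev)
    )
    return num_cb_match >= min_matches
```

Passing eleven frames gives ten consecutive pairs, so a stationary noise floor reaches the match threshold while a spectrum that moves by several dB per frame does not.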
In some implementations, at 726, the smart music detection module 305 may sum log frequency spectrogram per chroma value in each octave, and, at 728, the smart music detection module 305 may compare energy of a chroma value against the sum of energy for the other chroma values in octaves using the log frequency spectrogram(s).
For example, the smart music detection module 305 may also differentiate a tone event from a music event. The smart music detection module 305 may sum up the LF spectrogram per chroma value (9) in each octave of a set (e.g., octaves 4-7). In some instances, the smart music detection module 305 may then sum up the total note energy for the 44 notes for octaves 4-7 as shown in the table in
Eother(m)=Etotal(m)−E(m,(i−1)*)−E(m,i*)−E(m,(i+1)*). (27)
In some instances, a tone event may be determined based on one note energy being bigger than the sum of the other notes. Since music may also have harmonics, a music event is different from tone event. For example, the smart music detection module 305 may identify a tone event using the following criterion
Emax(m)≥γ1Eother(m), (28)
where γ1 is a constant (e.g., 2, 3, or 6).
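The tone criterion can be sketched as follows. The sketch reads (27) as removing the peak note and its two adjacent notes from the total before comparing, and uses γ1 = 3; that reading, the function name, and the input layout (linear energies in ascending note order) are assumptions.

```python
def is_tone_frame(note_energy, gamma1=3.0):
    """Apply criterion (28) to the per-note linear energies of one frame."""
    peak = max(range(len(note_energy)), key=note_energy.__getitem__)
    lo = max(peak - 1, 0)
    hi = min(peak + 1, len(note_energy) - 1)
    # Energy of all notes other than the peak note and its neighbors.
    e_other = sum(note_energy) - sum(note_energy[lo:hi + 1])
    return note_energy[peak] >= gamma1 * e_other
```

A single dominant note satisfies the criterion, whereas a chord, whose energy is spread over several notes, leaves too much energy outside the peak and does not.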
At 730, the smart music detection module 305 may determine whether the compare condition is satisfied (e.g., using the criterion at (28)). In response to determining that the compare condition is satisfied at 730, the smart music detection module 305 may proceed to the operation at 732, where it may identify a tone event in the audio signal.
At 734, the smart music detection module 305 may set a value for the tone detection counter, for example, based on a quantity of chroma value changes over a defined time period. For instance, if the condition (28) is satisfied, then the smart music detection module 305 may increase the tone detection counter num_tone_detect by one. The smart music detection module 305 may accumulate the tone detection counter across tone events in the finite state machine.
In response to determining that the compare condition is not satisfied at 730, the smart music detection module 305 may proceed to the operation at 736, where it may determine whether there is another frame in a set of frames and/or the audio signal to analyze for music. If there is another frame or set of frames to analyze, the method may proceed at 726 or another operation. In some instances, in addition to processing a subsequent frame or if processing a given set of frames has completed, the method may continue to 738.
At 738, the smart music detection module 305 may determine a state of the finite state machine based on a total tone detection counter value and/or the one or more music states. For example, in some implementations, as described above, the smart music detection module 305 may declare that the audio signal includes music based on a transition of the one or more music states to a final state in a finite state machine. In some implementations, the transition of the one or more music states to the final state in the finite state machine may be based on a tone detection counter value, accumulated over a set of frames, satisfying a threshold, for example, as described in reference to 522-526 above.
As will be understood by those familiar with the art, the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Likewise, the particular naming and division of the portions, modules, agents, managers, components, functions, procedures, actions, layers, features, attributes, methodologies, data structures, and other aspects are not mandatory, and the mechanisms that implement the invention or its features may have different names, divisions and/or formats. The foregoing description, for purpose of explanation, has been described with reference to specific examples. However, the illustrative discussions above are not intended to be exhaustive or limiting to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The examples were chosen and described in order to best explain relevant principles and their practical applications, to thereby enable others skilled in the art to best utilize various examples with or without various modifications as may be suited to the particular use contemplated.
This application claims benefit as a continuation of U.S. application Ser. No. 16/399,738, filed on Apr. 30, 2019.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 16399738 | Apr 2019 | US |
Child | 17030136 | | US |