This application claims priority from European Patent Application No. 16306350.6, entitled “DEVICE AND METHOD FOR AUDIO FRAME PROCESSING”, filed on Oct. 13, 2016, the contents of which are hereby incorporated by reference in their entirety.
The present disclosure relates generally to audio recognition and in particular to calculation of audio recognition features.
This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present disclosure that are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Audio (acoustic, sound) recognition is particularly suitable for monitoring human activity as it is relatively non-intrusive, requires no detectors other than microphones and is relatively accurate. However, it is also a challenging task that, in order to be successful, often requires intensive computing operations.
A principal constraint for user acceptance of audio recognition is the preservation of privacy. Therefore, the audio processing should preferably be performed locally rather than by a cloud service. As a consequence, CPU consumption and, in some cases, battery life could seriously limit the deployment of such a service in portable devices.
An opposing constraint is technical: many distinct audio events have very similar characteristics, so that considerable processing power is required to extract the features that enable discrimination between them. Recognition can be enhanced by exploiting fine time-frequency characteristics of an audio signal, however at an increased computational cost. Indeed, among the functions that compose audio recognition, feature extraction is the most demanding. It corresponds to the computation of certain signature coefficients per audio frame (buffer), which characterize the audio signal over time, frequency or both.
Particularly efficient features for audio recognition, able to achieve high recognition accuracy, have been provided by Andén and Mallat; see J. Andén and S. Mallat, “Deep Scattering Spectrum,” IEEE Transactions on Signal Processing, 2014.
Their method has been shown, theoretically and empirically, to be superior to baseline methods commonly used for acoustic classification, such as Mel Frequency Cepstral Coefficients (MFCC); see P. Atrey, M. Namunu, and K. Mohan, “Audio based event detection for multimedia surveillance,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006, and D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange and M. Plumbley, “Detection and classification of acoustic scenes and events,” IEEE Transactions on Multimedia, 2015.
Their method comprises the computation of scattering features. First, from the captured raw audio signal, a frame (an audio buffer of fixed duration), denoted x, is obtained. This frame is convolved with a complex wavelet filter bank, comprising bandpass filters ψλ (λ denoting the central frequency index of a given filter) and a low-pass filter ϕ, designed such that the entire frequency spectrum is covered. Then, a modulus operator (|⋅|) is applied, which pushes the energy towards lower frequencies [see S. Mallat, “Group invariant scattering,” Communications on Pure and Applied Mathematics, 2012]. The low-pass portion of this generated set of coefficients, obtained after application of the modulus operator, is stored and labelled as “0th order” scattering features (S0). To compute the higher “scattering order” coefficients (S1, S2, . . . ), these operations are applied recursively to all remaining sequences of coefficients generated by the bandpass filters. This effectively yields a tree-like representation, as illustrated in
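Purely by way of non-limiting illustration, this recursion may be sketched in Python with numpy as follows. The frequency-domain filter responses psi_hats and phi_hat are assumed to be given (their design is outside the scope of this sketch) and to match the frame length; this is a sketch of the general technique, not the claimed implementation.

```python
import numpy as np

def lowpass(U, phi_hat):
    """phi * U: the low-passed feature sequence kept for one node U."""
    return np.real(np.fft.ifft(np.fft.fft(U) * phi_hat))

def bandpass_modulus(U, psi_hats):
    """|psi_lambda * U| for every band-pass filter; the modulus pushes
    energy towards lower frequencies and seeds the next order."""
    U_hat = np.fft.fft(U)
    return [np.abs(np.fft.ifft(U_hat * p)) for p in psi_hats]

def scattering_tree(x, psi_hats, phi_hat, max_order=2):
    """Return {m: list of order-m scattering feature sequences S_m}."""
    S, layer = {}, [np.asarray(x, dtype=np.float64)]
    for m in range(max_order + 1):
        S[m] = [lowpass(U, phi_hat) for U in layer]  # phi * U per node
        if m < max_order:
            # Recurse: each band-passed modulus sequence becomes a child.
            layer = [V for U in layer for V in bandpass_modulus(U, psi_hats)]
    return S
```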
It will be appreciated that there is a desire for a solution that addresses at least some of the shortcomings of the conventional solutions. The present principles provide such a solution.
In a first aspect, the present principles are directed to a device for calculating scattering features for audio signal recognition. The device includes an interface configured to receive an audio signal and at least one processor configured to process the audio signal to obtain audio frames, calculate first order scattering features from at least one audio frame and, only in case the energy in the n first order scattering features with the highest energy is below a threshold value, where n is an integer, calculate second order scattering features from the first order scattering features.
Various embodiments of the first aspect include:
In a second aspect, the present principles are directed to a method for calculating scattering features for audio signal recognition. At least one hardware processor processes a received audio signal to obtain at least one audio frame, calculates first order scattering features from the at least one audio frame and, only in case the energy in the n first order scattering features with the highest energy is below a threshold value, where n is an integer, calculates second order scattering features from the first order scattering features.
Various embodiments of the second aspect include:
In a third aspect, the present principles are directed to a computer program product which is stored on a non-transitory computer readable medium and comprises program code instructions executable by a processor for implementing the method according to the second aspect.
Preferred features of the present principles will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
An idea underpinning the present principles is to adaptively reduce the computational complexity of audio event recognition by including a feature extraction module that adapts to the time-varying behaviour of the audio signal. To this end, a metric is computed on a fixed frame of an audio track; it represents a classifier-independent estimate of belief in the classification performance of a given set of scattering features. Through the use of this metric, the order of the scattering transform can be optimized.
The present principles preferably use the “scattering transform” described hereinbefore as an effective feature extractor. As shown in
It can thus be seen that the present principles can achieve potentially significant savings in processing power when the scattering order is chosen adaptively, per frame, with respect to the observed time-varying behaviour of the audio signal.
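By way of example, reusing the helper functions from the sketch above together with an estimator `sufficient` (a callable standing in for the energy preservation test described hereinafter), a per-frame adaptive extraction limited to two orders might look as follows; this is a sketch under those assumptions, not a limiting implementation.

```python
import numpy as np

def extract_features_adaptive(frame, psi_hats, phi_hat, sufficient):
    """Compute second-order features only when the estimator judges the
    first-order ones insufficient; returns (order used, feature list)."""
    U1 = bandpass_modulus(np.asarray(frame, dtype=np.float64), psi_hats)
    S1 = [lowpass(U, phi_hat) for U in U1]
    if sufficient(U1):            # enough energy preserved: stop at order 1
        return 1, S1
    U2 = [V for U in U1 for V in bandpass_modulus(U, psi_hats)]
    S2 = [lowpass(V, phi_hat) for V in U2]
    return 2, S1 + S2
```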
The device 200 further comprises an input interface 240 and an output interface 250. The input interface 240 is configured to obtain audio for processing; it can be adapted to capture audio (for example, a microphone), but it can also be an interface adapted to receive captured audio. The output interface 250 is configured to output information about the analysed audio, for example for presentation on a screen or by transfer to a further device.
The device 200 is preferably implemented as a single device, but its functionality can also be distributed over a plurality of devices.
In “Group Invariant Scattering,” S. Mallat argues that the energy of the scattering representation approaches the energy of the input signal as the scattering order increases. The present principles use this property as a proxy indicator for the information content (thus discriminative performance) of a scattering representation.
It is assumed that there exists a pool of pre-trained classifiers based on the scattering features of different orders. Therefore, once the necessary scattering order for a given audio frame has been estimated, and the corresponding features have been computed, classification is performed using the appropriate model. Classification is an operation of fairly low computational complexity.
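As an illustrative sketch only, with a hypothetical mapping from scattering order to a pre-trained model exposing a predict() method (e.g., models trained offline on features of the corresponding order), the dispatch could be as simple as:

```python
import numpy as np

def classify_frame(frame, psi_hats, phi_hat, sufficient, classifiers):
    """Pick the model matching the scattering order actually computed.

    classifiers : hypothetical dict {order: pre-trained model} whose
    predict() accepts a single flattened feature vector.
    """
    order, features = extract_features_adaptive(
        frame, psi_hats, phi_hat, sufficient)
    vector = np.concatenate(features).reshape(1, -1)
    return classifiers[order].predict(vector)
```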
In the description hereinafter, the expression “signal” is to be interpreted as any sequence of coefficients Uλm = |ψλm*Uλm−1| obtained at scattering order m, with Uλ0 = x taken to be the audio frame itself.
The resulting sequence of positive numbers {γλ}, where γλ = ‖Uλ‖²/Σλ′‖Uλ′‖² is the energy of band λ normalised by the total energy over all bands of the same order, adds up to 1. The larger values of γλ indicate more important frequency bands, and can be seen as peaks of a probability mass function P that models the likelihood of observing the signal energy in a given band. An example of such a probability mass function is illustrated in
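As a minimal sketch, assuming numpy, a non-silent frame and a list of band sequences Uλ of a given order, the relevance map can be computed as:

```python
import numpy as np

def relevance_map(U_bands):
    """Normalised per-band energies {gamma_lambda}; they sum to 1 and act
    as a probability mass function over frequency bands."""
    energies = np.array([np.sum(np.abs(U) ** 2) for U in U_bands])
    return energies / energies.sum()  # assumes a non-silent frame
```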
As mentioned previously, the low-pass filter ϕ is applied to each signal Uλm, limiting its frequency range. This also limits the information content of the filtered signal. According to the present principles, the energy preserved by the low-pass filtered signal ϕ*Uλm relative to the input signal is measured as αλ = ‖ϕ*Uλm‖²/‖Uλm‖².
For a normalised filter ϕ, this ratio is necessarily bounded between 0 and 1 and indicates the preservation of energy for a given frequency band: the larger the ratio, the larger the amount of energy captured within the given features.
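A corresponding sketch of this ratio, again assuming a frequency-domain low-pass response phi_hat of matching length:

```python
import numpy as np

def energy_preservation_ratio(U, phi_hat):
    """alpha = ||phi * U||^2 / ||U||^2, in [0, 1] for a normalised phi."""
    low_passed = np.fft.ifft(np.fft.fft(U) * phi_hat)
    return np.sum(np.abs(low_passed) ** 2) / np.sum(np.abs(U) ** 2)
```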
According to the present principles, energy preservation is monitored only in “important” frequency bands, which are estimated using the relevance map. First, the normalised energies {γλ} are sorted in descending order. The number n of important bands is then chosen as the smallest integer for which the cumulative sum of the n largest values γλ reaches a threshold μ, where 0<μ<1.
Then, the final energy preservation estimator is computed as β = minλ∈[1,n] αλ, where the {αλ} are ordered according to the descending order of {γλ}, and 0<β≤1 is the minimal relative amount of energy in the important frequency bands. By setting a low threshold τ for β, it is possible to determine whether a given scattering feature contains sufficient information for accurate classification, or if features of a higher scattering order need to be computed. In the inventors' experiments, the best performance has been obtained for 0.5≤τ≤0.85 and 0.7≤μ≤0.9. An example performance is presented in the precision/recall curve illustrated in
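Combining the two sketches above, the decision step might read as follows; the default values of mu and tau are merely picked from the ranges reported above, not prescribed.

```python
import numpy as np

def needs_higher_order(U_bands, phi_hat, mu=0.8, tau=0.7):
    """True when beta = min alpha over the important bands falls below tau,
    i.e. when higher-order scattering features should be computed."""
    gammas = relevance_map(U_bands)
    order = np.argsort(gammas)[::-1]      # bands by descending relevance
    # Smallest n whose cumulative relevance reaches mu.
    n = int(np.searchsorted(np.cumsum(gammas[order]), mu)) + 1
    alphas = [energy_preservation_ratio(U_bands[i], phi_hat)
              for i in order[:n]]
    return min(alphas) < tau
```

In the adaptive extraction sketched earlier, the `sufficient` callable would then be, for instance, `lambda bands: not needs_higher_order(bands, phi_hat)`.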
In step S605, the input interface 240 obtains an audio signal for processing.
The skilled person will appreciate that the energy preservation estimate is a classifier-independent metric. However, if the classifier is specified in advance and provides a confidence metric (e.g., a class probability estimate), it is possible to consider the two estimates together in an attempt to boost performance.
It will be appreciated that the present principles can provide a solution for audio recognition that enables:
It should be understood that the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces. Herein, the phrase “coupled” is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.
The present description illustrates the principles of the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its scope.
All examples and conditional language recited herein are intended for educational purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage.
Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.