Embodiments of the present invention relate to a method and an apparatus for recognizing acoustic anomalies. Further embodiments relate to a corresponding computer program. In accordance with embodiments, a normal situation is recognized, and anomalies are recognized by comparison with this normal situation.
In real acoustic scenes, there is usually a complex superposition of several sound sources. These may be spatially positioned in the foreground and background as desired. Additionally, a plurality of potential sounds is conceivable, which may range from very short transient signals (like applause, gunshot) to longer, stationary sounds (alarm sirens, passing train). A recording usually covers a certain period of time which, when looked at subsequently, is subdivided into one or several time windows. Starting from this subdivision and depending on the length of the noises (for example transient or longer, stationary sounds), a noise may extend across one or more audio segments/time windows.
In many application scenarios, an anomaly, i.e. a sound deviating from the “acoustic normal state”, i.e. the set of noises considered to be “normal”, is to be recognized. Examples of such anomalies are glass breaking (burglar detection), gunshots (supervising public events) or a chainsaw (supervising natural reserves).
It is problematic that the sound of the anomaly (not-okay class) frequently is unknown or cannot be defined or described precisely (for example, what is the sound of a broken machine?).
The second problem is that new algorithms for sound classification by means of deep neural networks are very sensitive to changed (and frequently unknown) acoustic conditions in the application scenario. Classification models which are trained using audio data which were recorded using a high-quality microphone, for example, achieve only poor recognition rates when classifying audio data recorded by means of a poorer microphone. Potential solution approaches are in the field of “domain adaptation”, i.e. adapting the models or the audio data to be classified in order to achieve higher robustness for recognition. However, in practice, it is frequently logistically difficult and too expensive to record representative audio recordings at the future place of application of an audio analysis system and subsequently annotate the same relative to sound events contained therein.
The third problem of audio analysis of environmental noises is data-protection concerns, since classification methods may theoretically also be used for recognizing and transcribing voice signals (for example when recording a conversation close to the audio sensor).
The classification models of existing prior-art solutions are as follows:
When the sound anomaly to be detected can be specified precisely, a classification model can be trained based on machine learning algorithms by means of supervised learning for recognizing certain noise classes. Current studies have shown that neural networks in particular are very sensitive to changed acoustic conditions and that an additional adaptation of classification models to the respective acoustic situation of the application has to be performed.
When starting from the disadvantages as described before, there is demand for an improved approach. It is the object of the present invention to provide a concept for detecting anomalies which is optimized with regard to the learning behavior and allows reliably and precisely recognizing anomalies.
According to an embodiment, a method for recognizing acoustic anomalies may have the steps of: obtaining a long-term recording having a plurality of first audio segments associated to respective first time windows; analyzing the plurality of the first audio segments to obtain, for each of the plurality of the first audio segments, a first characteristic vector describing the respective first audio segment; obtaining a further recording having one or more second audio segments associated to respective second time windows; analyzing the one or more second audio segments to obtain one or more characteristic vectors describing the one or more second audio segments; matching the one or more second characteristic vectors with the plurality of the first characteristic vectors to recognize at least one anomaly when compared to an acoustic normal situation for this environment.
Another embodiment may have a non-transitory digital storage medium having stored thereon a computer program for performing a method for recognizing acoustic anomalies, having the steps of: obtaining a long-term recording having a plurality of first audio segments associated to respective first time windows; analyzing the plurality of the first audio segments to obtain, for each of the plurality of the first audio segments, a first characteristic vector describing the respective first audio segment; obtaining a further recording having one or more second audio segments associated to respective second time windows; analyzing the one or more second audio segments to obtain one or more characteristic vectors describing the one or more second audio segments; matching the one or more second characteristic vectors with the plurality of the first characteristic vectors to recognize at least one anomaly when compared to an acoustic normal situation for this environment, when said computer program is run by a computer.
According to another embodiment, an apparatus for recognizing acoustic anomalies may have: an interface for obtaining a long-term recording having a plurality of first audio segments associated to respective first time windows, and for obtaining a further recording having one or more second audio segments associated to respective second time windows; and a processor configured for analyzing the plurality of the first audio segments to obtain, for each of the plurality of the first audio segments, a first characteristic vector describing the respective first audio segment, and configured for analyzing the one or more second audio segments to obtain one or more characteristic vectors describing the one or more second audio segments, and configured for matching the one or more second characteristic vectors with the plurality of the first characteristic vectors to recognize at least one anomaly when compared to an acoustic normal situation for this environment.
Embodiments of the present invention provide a method for recognizing acoustic anomalies. The method comprises the steps of obtaining a long-term recording having a plurality of first audio segments associated to respective first time windows, and analyzing the plurality of first audio segments to obtain, for each of the plurality of the first audio segments, a first characteristic vector describing the respective first audio segment, like a spectrum for the audio segment (time-frequency spectrum) or an audio fingerprint having certain characteristics for the audio segment, for example. The result of the analysis of a long-term recording subdivided into a plurality of time windows, for example, is a plurality of first (one-dimensional or multi-dimensional) characteristic vectors for the plurality of the first audio segments (associated to the corresponding points in time/time windows of the long-term recording) representing the “normal state”. The method comprises further steps of obtaining a further recording having one or more second audio segments associated to respective second time windows, and analyzing the one or more second audio segments to obtain one or more second characteristic vectors describing the one or more second audio segments. This means that the result of the second part of the method exemplarily is a plurality of second characteristic vectors (for example, with corresponding points in time of the further recording). In a subsequent step, matching one or more second characteristic vectors with the plurality of the first characteristic vectors takes place (for example by comparing the identities or similarities or by recognizing an order) to recognize at least one anomaly. In accordance with embodiments, recognizing different forms of anomalies would be conceivable, i.e. a sound anomaly (i.e. recognizing a so far unheard sound for the first time), a temporal anomaly (for example a changed repetition pattern of a sound heard already) or a spatial anomaly (a sound heard already occurs at a so far unknown spatial position).
Embodiments of the present invention are based on the finding that an “acoustic normal state” and “normal noises” can be learned independently by a long-term sound analysis (phase 1 of the method including the steps of obtaining a long-term recording and analyzing the same) alone. This means that this long-term analysis allows independently or autonomously adapting an analysis system to a certain acoustic scene. Annotated training data (recording+semantic class annotation) are not required, which allows large savings in time, complexity and costs. When this acoustic “normal state” or the “normal” noises have been detected, the current noise environment can be analyzed in a subsequent analysis phase (phase 2 including the steps of obtaining a further recording and analyzing the same). The current audio segment/current noise scenario here is matched with the “normal” noises recognized or learned before/in phase 1. Generally, this means that phase 1 allows learning a model of the normal noise setting based on a statistical method or machine learning, wherein this model subsequently (in phase 2) allows rating currently recorded noise settings as to their degree of novelty (probability of anomaly).
Another advantage of this approach is that the privacy of persons potentially located in the direct surroundings of the acoustic sensors is protected. This is referred to as privacy-by-design. Due to the system involved, voice recognition is not possible since the interface is defined clearly (audio in, anomaly probability function out). This means that potential data protection concerns when using acoustic sensors can be dispelled.
Since the long-term recording represents the acoustic normal situation, the plurality of first audio segments themselves and/or in their order describe this normal situation. This means that the plurality of first audio segments themselves and/or when combined represent a kind of reference. The target of this method is recognizing anomalies when compared to this normal situation. This means that, in accordance with embodiments, the result of the clustering (described below) is a description of the reference using first audio segments. The step in which the anomaly is determined includes comparing the second audio segments themselves or their combination (i.e. order) to the reference in order to reveal the anomaly. The anomaly is a deviation of the current acoustic situation described by the second characteristic vectors from the reference described by the first characteristic vectors. In other words, this means that, in accordance with embodiments, the first characteristic vectors themselves or in combination represent a reference representation of the normal state, whereas the second characteristic vectors themselves or in combination describe the current acoustic situation so that, in step 126, the anomaly in the form of a deviation of the description of the current acoustic situation (cf. second characteristic vectors) from the reference (cf. first characteristic vectors) can be recognized. This means that the anomaly is defined by the fact that at least one of the second acoustic characteristic vectors deviates from the series of the first acoustic characteristic vectors. Potential deviations may be: sound anomalies, temporal anomalies and spatial anomalies.
In accordance with an embodiment, phase 1 means detecting a plurality of first audio segments, which are subsequently also referred to as “normal” noises/audio segments or those considered to be “normal”. In accordance with embodiments, knowing these “normal” audio segments allows recognizing a so-called sound anomaly. This entails performing the sub-step of identifying a second characteristic vector which differs from the analyzed first characteristic vector.
In accordance with further embodiments, when analyzing, the method comprises the sub-step of identifying a repetition pattern in the plurality of the first time windows. Repeating audio segments are identified here, and the resulting pattern is determined from it. In accordance with embodiments, identifying takes place using repeating, identical or similar first characteristic vectors belonging to different first audio segments. In accordance with embodiments, when identifying, grouping identical and similar first characteristic vectors or first audio segments to form one or more groups may take place.
In accordance with embodiments, the method comprises recognizing an order of first characteristic vectors belonging to the first audio segments, or recognizing an order of groups of identical or similar first characteristic vectors or first audio segments. The basic steps advantageously allow recognizing normal noises, or recognizing normal audio objects. The combination of these normal audio objects with regard to time to a certain order or a certain repetition pattern represents an acoustic normal state.
In accordance with further embodiments, it would also be conceivable for a repetition pattern in the one or more second time windows and/or an order of second characteristic vectors belonging to different second audio segments or groups of identical or similar second characteristic vectors to be recognized. In accordance with further embodiments, the method comprises, when matching, the sub-step of matching the repetition pattern of the first audio segments and/or the order in the first audio segments with the repetition pattern of the second audio segments and/or the order in the second audio segments. This matching allows recognizing a temporal anomaly.
In accordance with another embodiment, the method may comprise the step of determining a respective position for the respective first audio segments. In accordance with an embodiment, determining the respective position for the respective second audio segments can be performed. In accordance with an embodiment, this allows recognizing a spatial anomaly by the sub-step of matching the position associated to the respective first audio segments with the position associated to the respective second audio segment.
It is to be pointed out here that at least two microphones, for example, are used for spatial localization, whereas one microphone is sufficient for the other two types of anomalies.
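A minimal sketch of the position matching described above, under stated assumptions: each audio segment already carries a group label and a position estimate (how the position is derived from two or more microphones is not covered here), and positions are plain coordinate tuples. The function names and the `tolerance` parameter are hypothetical and are not taken from the embodiments.

```python
import math

def learn_positions(labels, positions):
    """Remember every position at which each sound group was heard in phase 1."""
    known = {}
    for label, position in zip(labels, positions):
        known.setdefault(label, []).append(position)
    return known

def is_spatial_anomaly(label, position, known, tolerance):
    """A known sound is spatially anomalous if it is far from all its usual positions."""
    usual = known.get(label)
    if not usual:
        # An unknown sound would be a sound anomaly, not a spatial one.
        return False
    return min(math.dist(position, p) for p in usual) > tolerance

# Hypothetical phase-1 data: sound A near the origin, sound B five units away.
known = learn_positions(["A", "B", "A"], [(0.0, 0.0), (5.0, 0.0), (0.2, 0.1)])
near = is_spatial_anomaly("A", (0.1, 0.0), known, tolerance=1.0)
far = is_spatial_anomaly("A", (5.0, 0.0), known, tolerance=1.0)
```

Here `near` is false because sound A was heard close to that position before, while `far` is true because sound A now appears where only sound B was heard.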
As indicated before, each characteristic vector (first and second characteristic vector) for the different audio segments may comprise one dimension or several dimensions. A potential realization of a characteristic vector would, for example, be a time-frequency spectrum. In accordance with an embodiment, the dimension space may also be reduced. This means that, in accordance with embodiments, the method comprises the step of reducing the dimensions of the characteristic vector.
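The reduction of the dimension space could, for instance, be sketched with a principal component analysis computed by a singular value decomposition. This is only one possible statistical method; the function names, the number of retained components and the synthetic input data are assumptions made for illustration.

```python
import numpy as np

def fit_pca(vectors, components):
    """Learn a projection onto the `components` directions of largest variance."""
    mean = vectors.mean(axis=0)
    # Rows of `vt` are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:components]

def reduce_dimensions(vectors, mean, basis):
    """Project characteristic vectors into the reduced space."""
    return (vectors - mean) @ basis.T

rng = np.random.default_rng(0)
# Hypothetical 6-dimensional vectors that essentially vary along one direction.
direction = np.array([1.0, 2.0, 0.0, 0.0, -1.0, 0.5])
first_vectors = rng.normal(size=(200, 1)) * direction + 0.01 * rng.normal(size=(200, 6))

mean, basis = fit_pca(first_vectors, components=2)
reduced = reduce_dimensions(first_vectors, mean, basis)
```

In such a sketch, `mean` and `basis` would be learned from the first characteristic vectors only and then reused unchanged for the second characteristic vectors, so that both sets live in the same reduced space before matching.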
In accordance with another embodiment, the method may comprise the step of determining a probability of occurrence of the respective first audio segment and outputting the probability of occurrence together with the respective first characteristic vector. Alternatively, the method may comprise the step of determining a probability of occurrence of the respective first audio segment and outputting the probability of occurrence including the respective first characteristic vector and a respective first time window. This means that the probability of occurrence for the respective audio segment or, more precisely, the probability of the occurrence of the audio segment at this point in time is output. Outputting is done together with the corresponding data set or characteristic vector.
In accordance with an embodiment, the method may also be computer-implemented. This means that further embodiments relate to a computer program having program code for performing the method.
Further embodiments relate to an apparatus having an interface and a processor. The interface serves for obtaining a long-term recording having a plurality of first audio segments associated to respective first time windows and for obtaining another recording having one or more second audio segments associated to respective second time windows. The processor is configured to analyze the plurality of first audio segments to obtain, for each of the plurality of first audio segments, a first characteristic vector describing the respective first audio segment. Additionally, the processor is configured to analyze the one or more second audio segments to obtain one or more characteristic vectors describing the one or more second audio segments. Additionally, the processor is configured to match the one or more second characteristic vectors with the plurality of the first characteristic vectors to recognize at least one anomaly.
In accordance with embodiments, the apparatus comprises a recording unit connected to the interface, like a microphone or microphone array, for example. The microphone array advantageously allows determining the position as discussed before. In accordance with further embodiments, the apparatus comprises an output interface for outputting the probability of occurrence discussed before.
Embodiments of the present invention will be discussed below referring to the appended drawings, in which:
Before discussing the following embodiments of the present invention making reference to the appended drawings, it is pointed out that elements and structures of equal effect are provided with equal reference numbers so that the description thereof is mutually applicable or interchangeable.
In the first phase 110, which is referred to as adjusting phase, there are two basic steps. This is indicated by the reference numerals 112 and 114. Step 112 comprises a long-term recording of the acoustic normal state in the application scenario. The analysis apparatus 10 (cf.
This long-term recording 113 is then subdivided, for example. The subdivision may be performed to form time regions of equal duration, like 1 second or 0.1 second, for example, or dynamic time regions. Every time region comprises an audio segment. In step 114, which is generally referred to as analyzing, these audio segments are examined separately or in combination. When analyzing, a so-called characteristic vector 115 (first characteristic vector) is determined for each audio segment. Expressed generally, this means that a conversion from a digital recording 113 to one or more characteristic vectors 115 takes place, for example by means of deep neural networks, wherein each characteristic vector 115 “encodes” the sound at a certain point in time. Characteristic vectors 115 can, for example, be determined by an energy spectrum for a certain frequency range or, generally, a time-frequency spectrum.
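The subdivision into time regions of equal duration and the energy-spectrum variant of the characteristic vector could be sketched as follows. This is an illustrative sketch, not the implementation of step 114: the function names, the number of frequency bands, the logarithmic scaling and the synthetic test tone are assumptions.

```python
import numpy as np

def split_into_segments(samples, sample_rate, window_seconds=0.1):
    """Cut a 1-D recording into equal-length, non-overlapping time windows."""
    window = int(round(sample_rate * window_seconds))
    count = len(samples) // window  # a trailing partial window is dropped
    trimmed = np.asarray(samples[: count * window], dtype=float)
    return trimmed.reshape(count, window)

def characteristic_vectors(segments, bands=8):
    """Describe each segment by its log energy in `bands` frequency bands."""
    spectrum = np.abs(np.fft.rfft(segments, axis=1)) ** 2
    # Group neighbouring frequency bins into bands and sum their energy.
    grouped = np.array_split(spectrum, bands, axis=1)
    energies = np.stack([band.sum(axis=1) for band in grouped], axis=1)
    return np.log10(energies + 1e-12)

# Hypothetical input: one second of a 440 Hz tone sampled at 8 kHz.
rate = 8000
t = np.arange(rate) / rate
recording = np.sin(2 * np.pi * 440 * t)

segments = split_into_segments(recording, rate, window_seconds=0.1)
vectors = characteristic_vectors(segments)
```

Each row of `vectors` then describes one time window. For the test tone, the energy concentrates in the lowest band of every row, since 440 Hz lies in the lowest eighth of the 0 to 4 kHz range.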
It is to be pointed out here that, optionally, it is possible to reduce the dimensionality of the characteristic space of the characteristic vectors 115 by means of statistical methods (like principal component analysis). In step 114, optionally, typical or dominant noises can be identified by means of unsupervised learning methods (like clustering). Here, time sections or audio segments comprising similar characteristic vectors 115 and correspondingly comprising a similar sound are grouped together. No semantic classification of a noise (like “car” or “airplane”) is necessary here. This means that so-called unsupervised learning using frequencies of repeating or similar audio segments takes place. In accordance with another embodiment, it would also be conceivable for unsupervised learning of the temporal order and/or typical repetition patterns of certain noises to take place in step 114.
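Grouping audio segments with similar characteristic vectors, without any semantic labels, might be sketched with a simple greedy grouping. The description only names clustering in general, so this particular procedure, the function name and the `max_distance` parameter are assumptions chosen for brevity.

```python
import math

def group_similar(vectors, max_distance):
    """Greedy grouping: a vector joins the first group whose centre is close
    enough, otherwise it starts a new group. Returns (labels, centres)."""
    centres, counts, labels = [], [], []
    for vector in vectors:
        best = None
        for index, centre in enumerate(centres):
            if math.dist(vector, centre) <= max_distance:
                best = index
                break
        if best is None:
            centres.append(list(vector))
            counts.append(1)
            best = len(centres) - 1
        else:
            # Update the running mean of the group centre.
            counts[best] += 1
            centres[best] = [
                c + (v - c) / counts[best] for c, v in zip(centres[best], vector)
            ]
        labels.append(best)
    return labels, centres

# Hypothetical two-dimensional characteristic vectors of six audio segments.
first_vectors = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (9.9, 0.0), (5.1, 4.9), (10.0, 0.1)]
labels, centres = group_similar(first_vectors, max_distance=1.0)
```

The resulting `labels` sequence assigns every time window to a group of similar sounds, which is the kind of representation on which repetition patterns and orders can subsequently be examined.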
The result of clustering is a composition of audio segments or noises which are normal or typical of this region. Exemplarily, a probability of occurrence may be associated to each audio segment. Additionally, a repetition pattern or order, i.e. a combination of several audio segments, which is typical or normal for the current environment can be identified. A probability can be associated here to each grouping, each repetition pattern or each series of different audio segments.
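One simple way to associate a probability of occurrence with each group of audio segments is to use relative frequencies, optionally conditioned on a coarse time slot. The following sketch assumes that group labels and time slots are already available; the function names and the "day"/"night" slots are hypothetical.

```python
from collections import Counter, defaultdict

def occurrence_probabilities(labels):
    """Relative frequency of each sound group in the long-term recording."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

def occurrence_by_time_slot(labels, slots):
    """Relative frequency of each sound group within each time slot."""
    per_slot = defaultdict(Counter)
    for label, slot in zip(labels, slots):
        per_slot[slot][label] += 1
    return {
        slot: {label: n / sum(counter.values()) for label, n in counter.items()}
        for slot, counter in per_slot.items()
    }

# Hypothetical group labels and the time slot of each audio segment.
labels = ["A", "B", "A", "A", "C", "C", "A", "C"]
slots = ["day", "day", "day", "day", "night", "night", "night", "night"]

overall = occurrence_probabilities(labels)
by_slot = occurrence_by_time_slot(labels, slots)
```

In this example, sound B has never been heard at night, so a later occurrence of B in a night-time window would be improbable under the learned normal state even though B itself is a known sound.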
At the end of the adjusting phase, audio segments or grouped audio segments are known and described as characteristic vectors 115 typical of this environment. In a next step or next phase 120, this learned knowledge is applied correspondingly. Phase 120 comprises three basic steps 122, 124, and 126.
In step 122, an audio recording 123 is recorded. When compared to the audio recording 113, it is typically much shorter. However, it may also be a continuous audio recording. This audio recording 123 is then analyzed in a downstream step 124. This step is comparable as regards content to step 114. Again, the digital audio recording 123 is converted to characteristic vectors. When these second characteristic vectors 125 are finally present, they can be compared to the characteristic vectors 115.
The comparison of step 126 is performed with the goal of determining anomalies. Very similar characteristic vectors and very similar orders of characteristic vectors hint at the fact that there is no anomaly. Deviations from patterns determined before (repetition patterns, typical orders etc.) or deviations from the audio segments determined before characterized by other/new characteristic vectors hint at an anomaly. These are recognized in step 126.
In step 126, different types of anomalies can be recognized. Examples of these are:
These anomalies will be discussed in detail referring to
Optionally, a probability can be output for each of the three types of anomalies at a time x. This is illustrated by the arrows 126z, 126k, and 126r (one arrow per type of anomaly) in
It is to be pointed out here that, when comparing the characteristic vectors, frequently there is no identity, but only similarity. This means that, in accordance with embodiments, threshold values can be defined which determine when characteristic vectors are similar or when groups of characteristic vectors are similar, so that the result also yields a threshold value for an anomaly. This threshold value application can follow outputting the probability distribution or occur in combination with it, for example in order to allow more precise temporal recognition of anomalies.
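The threshold-based similarity comparison can be illustrated by a nearest-reference distance: a second characteristic vector counts as similar to the normal state if at least one first characteristic vector lies within a threshold. The Euclidean distance, the function names and the threshold value are assumptions for this sketch.

```python
import math

def novelty(second_vector, first_vectors):
    """Distance from a new vector to its closest reference vector."""
    return min(math.dist(second_vector, reference) for reference in first_vectors)

def detect_sound_anomalies(second_vectors, first_vectors, threshold):
    """Return (index, novelty) for every second vector beyond the threshold."""
    hits = []
    for index, vector in enumerate(second_vectors):
        score = novelty(vector, first_vectors)
        if score > threshold:
            hits.append((index, score))
    return hits

# Hypothetical reference vectors (phase 1) and current vectors (phase 2).
first_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
second_vectors = [(0.1, 0.0), (0.9, 0.1), (4.0, 3.0)]

anomalies = detect_sound_anomalies(second_vectors, first_vectors, threshold=0.5)
```

Only the third vector is reported here, because the first two are close to a known reference vector without being identical to it. The index of each hit maps back to the time window of the corresponding second audio segment.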
In accordance with further embodiments, it is also possible to recognize spatial anomalies. Here, step 114, in the adjusting phase 110, may also comprise unsupervised learning of typical spatial positions and/or movements of certain noises. Typically, in such a case, instead of the microphone 18 illustrated in
Referring to
When precisely this pattern ABCABC is recognized in phase 2, it can be assumed that there is no anomaly, or at least no temporal anomaly. If, however, the pattern ABCAABC illustrated here is recognized, there is a temporal anomaly since a further audio segment A is arranged between the two groups ABC. This audio segment A or abnormal audio segment A is provided with a double frame.
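The ABCABC example can be reproduced with a small sketch that learns which pairs of consecutive sound groups occur in the reference order and flags every transition not seen before. Restricting the pattern to pairs of neighbours is a simplifying assumption, and the function names are hypothetical.

```python
def learn_transitions(reference_order):
    """Collect every pair of consecutive sound groups seen in phase 1."""
    return set(zip(reference_order, reference_order[1:]))

def temporal_anomalies(observed_order, known_transitions):
    """Return positions whose transition from the previous segment is new."""
    pairs = zip(observed_order, observed_order[1:])
    return [
        position
        for position, pair in enumerate(pairs, start=1)
        if pair not in known_transitions
    ]

known = learn_transitions("ABCABC")
flagged = temporal_anomalies("ABCAABC", known)
```

For the observed order ABCAABC, `flagged` contains only position 4 (counting from zero), which is the additional audio segment A following another A. Applying the same check to ABCABC yields no flagged positions.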
A sound anomaly is illustrated in
A spatial anomaly is illustrated in
Referring to
Additionally, at the interface 16, a probability of anomalies or probability of anomalies at certain points in time or, generally, a probability of characteristic vectors at certain points in time can be determined.
In accordance with embodiments, the apparatus 10 or the audio system is configured to recognize (simultaneously) different types of anomalies, like at least two anomalies, for example. The following fields of application are conceivable:
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method such that a block or device of an apparatus also corresponds to a respective method step or a feature of a method step. Analogously, aspects described in the context with or as a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like, for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, some or several of the most important method steps may be executed by such an apparatus.
Depending on certain implementation requirements, embodiments of the invention may be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray disc, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, a hard drive or another magnetic or optical memory having electronically readable control signals stored thereon, which cooperate or are capable of cooperating with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer-readable.
Some embodiments according to the invention include a data carrier comprising electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with program code, the program code being operative for performing one of the methods when the computer program product runs on a computer.
The program code may, for example, be stored on a machine-readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, wherein the computer program is stored on a machine-readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program comprising program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the computer-readable medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises processing means, for example a computer, or a programmable logic device, configured or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer a computer program for performing at least one of the methods described herein to a receiver. The transmission can, for example, be performed electronically or optically. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field-programmable gate array, FPGA) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field-programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, in some embodiments, the methods are performed by any hardware apparatus. This can be universally applicable hardware, such as a computer processor (CPU), or hardware specific for the method, such as ASIC.
The apparatus described herein may be implemented, for example, using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The apparatus described herein, or any component of the apparatus described herein may be implemented at least partly in hardware and/or software (computer program).
The methods described herein may be implemented, for example, using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The methods described herein, or any component of the methods described herein may be performed at least partly by hardware and/or software.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
10 2020 200 946.5 | Jan 2020 | DE | national |
This application is a continuation of copending International Application No. PCT/EP2021/051804, filed Jan. 27, 2021, which is incorporated herein by reference in its entirety, and additionally claims priority from German Application No. 10 2020 200 946.5, filed Jan. 27, 2020, which is also incorporated herein by reference in its entirety.
 | Number | Date | Country |
---|---|---|---|
Parent | PCT/EP2021/051804 | Jan 2021 | US |
Child | 17874072 | | US |