Field
The disclosed embodiments generally relate to the design of automated systems for recognizing sounds. More specifically, the disclosed embodiments relate to the design of an automated system that uses an inferential technique to recognize non-speech sounds based on patterns of sound primitives.
Related Art
Recent advances in computing technology are making it possible for computer systems to automatically recognize sounds, such as the sound of a gunshot, or the sound of a baby crying. This has led to the development of automated systems for detecting corresponding events, such as gunshot-detection systems and baby-monitoring systems. However, these existing systems are presently unable to detect higher-level events that are associated with collections of related sounds. For example, the sound of a baby crying followed by the sound of a human voice and then silence might indicate that a person has taken care of a crying baby. Detecting such higher-level events is a complicated task because the related sounds might occur in different sequences or at the same time.
Hence, what is needed is a system for detecting higher-level events that are associated with patterns of related sounds.
The disclosed embodiments provide a system that performs a sound-recognition operation. During operation, the system recognizes a sequence of sound primitives in an audio stream, wherein a sound primitive is associated with a semantic label comprising one or more words that describe a sound characterized by the sound primitive. Next, the system feeds the sequence of sound primitives into a finite-state automaton that recognizes events associated with sequences of sound primitives. Finally, the system feeds the recognized events into an output system that generates an output associated with the recognized events to be displayed to a user.
In some embodiments, the finite-state automaton is a non-deterministic finite-state automaton that can exist in multiple states at the same time, wherein the non-deterministic finite-state automaton maintains a probability value for each of the multiple states that the finite-state automaton can exist in.
In some embodiments, feeding the sequence of sound primitives into the finite-state automaton involves: (1) feeding the sequence of sound primitives into a first-level finite-state automaton that recognizes first-level events from the sequence of sound primitives to generate a sequence of first-level events; (2) feeding the sequence of first-level events into a second-level finite-state automaton that recognizes second-level events from the sequence of first-level events to generate a sequence of second-level events; and (3) repeating the process for zero or more additional levels of finite-state automata to generate the recognized events.
In some embodiments, if a probability value for a state in the non-deterministic finite-state automaton does not meet an activation-potential-related threshold value after a state-transition operation, the probability value for the state is set to zero.
In some embodiments, the finite-state automaton performs state-transition operations by performing computations involving one or more sequence matrices containing coefficients that define state transitions.
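The following minimal sketch illustrates one way such a matrix-based state-transition operation could look. It assumes one transition matrix per sound primitive and a simple vector-matrix product; the function `step`, the table `MATRICES`, the three-state layout, and all coefficient values are hypothetical and are not taken from the disclosed embodiments. The sketch also zeroes any state whose probability falls below a threshold, in the manner described above for the activation-potential-related threshold.

```python
def step(state_probs, transition, threshold=0.1):
    """Advance the state probabilities by one (hypothetical) sequence matrix.

    transition[i][j] is an assumed coefficient for moving from state i to
    state j when the current sound primitive is observed.
    """
    n = len(state_probs)
    nxt = [
        sum(state_probs[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]
    # States that do not meet the threshold are reset to zero.
    return [p if p >= threshold else 0.0 for p in nxt]


# One illustrative matrix per sound primitive, for a 3-state automaton.
MATRICES = {
    "cry": [[0.2, 0.8, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "voice": [[1.0, 0.0, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],
}

probs = [1.0, 0.0, 0.0]  # start with all probability in state 0
for primitive in ["cry", "voice"]:
    probs = step(probs, MATRICES[primitive])
```

In this sketch the automaton occupies several states at once, and the middle state drops out after the second primitive because its probability falls below the assumed threshold of 0.1.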
In some embodiments, recognizing a sequence of sound primitives in an audio stream comprises first performing a feature-detection operation on a sequence of sound samples from the audio stream to detect a set of sound features. Each of these sound features comprises a measurable characteristic for a time window of consecutive sound samples, and detecting each sound feature involves generating a coefficient indicating a likelihood that the sound feature is present in the time window. Next, the system creates a set of feature vectors from coefficients generated by the feature-detection operation, wherein each feature vector comprises a set of coefficients for sound features in the set of sound features. Finally, the system identifies the sequence of sound primitives from the sequence of feature vectors.
In some embodiments, the output system triggers an alert when a probability that a tracked event is occurring exceeds a threshold value.
The disclosed embodiments also provide a system that generates a set of sound primitives through an unsupervised learning process. During this process, the system performs a feature-detection operation on a sequence of sound samples to detect a set of sound features. Next, the system creates a set of feature vectors from coefficients generated by the feature-detection operation, wherein each feature vector comprises a set of coefficients for sound features in the set of sound features. The system then performs a clustering operation on the set of feature vectors to produce a set of feature clusters, wherein each feature cluster comprises a set of feature vectors that are proximate to each other in a vector space that contains the set of feature vectors. Next, the system defines the set of sound primitives, wherein each sound primitive is defined to be associated with a feature cluster in the set of feature clusters. Finally, the system associates semantic labels with the sound primitives, wherein a semantic label for a sound primitive comprises one or more words that describe a sound characterized by the sound primitive.
In some embodiments, while associating a semantic label with a sound primitive, the system performs the following operations. If semantic labels already exist for feature vectors in a feature cluster for the sound primitive, the system examines the semantic labels to determine a dominant semantic label for the feature cluster. On the other hand, if semantic labels do not exist for the feature vectors in the feature cluster, the system queries one or more users to obtain semantic labels for sounds associated with feature vectors in the feature cluster to determine the dominant semantic label for the feature cluster. Finally, the system associates the dominant semantic label with the sound primitive.
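The labeling operations above can be sketched as follows. This is only an illustration: it assumes that the "dominant" label is the most frequent label in the cluster, that an unlabeled feature vector is represented by `None`, and that users are queried through a hypothetical callback `ask_user`. None of these names or representations come from the disclosed embodiments.

```python
from collections import Counter


def dominant_label(cluster_labels, ask_user=None):
    """Return the most frequent semantic label for one feature cluster.

    cluster_labels holds one entry per feature vector; None means the
    vector has no label yet (an assumed representation).
    """
    known = [label for label in cluster_labels if label is not None]
    if not known:
        if ask_user is None:
            return None
        # No labels exist, so query a user once per feature vector.
        known = [ask_user(i) for i in range(len(cluster_labels))]
    return Counter(known).most_common(1)[0][0]
```

For example, `dominant_label(["bark", "bark", "laugh", None])` selects "bark", while a cluster with no labels falls back to whatever the callback returns.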
In some embodiments, a sound feature includes one or more of the following: (1) an average value for a parameter of a sound signal over a time window of consecutive sound samples; (2) a spectral-content-related parameter for a sound signal over the time window of consecutive sound samples; and (3) a shape-related metric for a sound signal over the time window of consecutive sound samples.
The following description is presented to enable any person skilled in the art to make and use the present embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present embodiments. Thus, the present embodiments are not limited to the embodiments shown, but are to be accorded the widest scope consistent with the principles and features disclosed herein.
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable media now known or later developed.
The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium. Furthermore, the methods and processes described below can be included in hardware modules. For example, the hardware modules can include, but are not limited to, application-specific integrated circuit (ASIC) chips, field-programmable gate arrays (FPGAs), and other programmable-logic devices now known or later developed. When the hardware modules are activated, the hardware modules perform the methods and processes included within the hardware modules.
Overview
The objective of sound-recognition systems is to provide humans with relevant information extracted from sounds. People recognize sounds as belonging to specific categories, such as sounds associated with a car, sounds associated with a baby crying, or sounds associated with shattering glass. However, a car can produce a wide variety of sounds that a person can recognize as falling into the car category. This is because a person typically has experienced sounds related to cars for many years, and all of these sounds have been incorporated into a semantic category associated with the concept of a car.
At present, a sound category such as “car” does not make sense to a computer system. This is because a category for the concept of “car” is not actually a category associated with lower-level sound characteristics, but is in fact a “semantic category” that is associated with the activity of operating a car. In this example, the sound-recognition process is actually the process of identifying an “activity” associated with one or more sounds.
When a computer system processes an audio signal, the computer system can group similar sounds into categories based on patterns contained in the audio signal, such as patterns related to frequencies and amplitudes of various components of the audio signal. Note that such sound categories may not make sense to people. However, the computer system can easily recognize such sound categories, which we refer to as “sound primitives.” (Note that the term “sound primitive” can refer to both machine-generated sound categories, and human-defined categories matching machine-generated sound categories.) We refer to the discrepancy between human-recognized sound categories and machine-recognized sound categories as the “human-machine semantic gap.”
We now describe a system that monitors an audio stream to recognize sound-related activities based on patterns of sound primitives contained in the audio stream. Note that these patterns of sound primitives can include sequences of sound primitives and also overlapping sound primitives.
Computing Environment
Fat edge device 120 also includes a real-time audio acquisition unit 122, which can acquire and digitize an audio signal. However, in contrast to skinny edge device 110, fat edge device 120 possesses more internal computing power, so the audio signals can be processed locally in a local meaning-extraction module 124.
The output from both local meaning-extraction module 124 and cloud-based meaning-extraction module 132 feeds into an output post-processing module 134, which is also located inside cloud-based virtual device 130. This output post-processing module 134 provides an Application-Programming Interface (API) 136, which can be used to communicate results produced by the sound-recognition process to a customer platform 140.
Referring to the model-creation system 200 illustrated in
Model-Building Process
During the model-building process, the system can use an unsupervised learning technique to generate a model to recognize a set of sound primitives as is illustrated in the flow chart that appears in
For example, a sound feature can be computed over a 5-second sliding time window comprising a set of audio samples acquired at 46-millisecond intervals from an audio stream. In general, the set of sound features can include: (1) an average value for a parameter of a sound signal over the time window; (2) a spectral-content-related parameter for a sound signal over the time window; and (3) a shape-related metric for a sound signal over the time window. More specifically, the set of sound features can include: (1) a “pulse” that comprises a peak in intensity of a highest energy component of the sound signal, which can be compared against a delta function, and wherein parameters for the pulse can include a total energy, a duration, and a peak energy; (2) a “shock ratio,” which relates to a local variation in amplitude of the sound wave; (3) a “wave-linear length,” which measures a total length of the sound wave over the time window; (4) a “spectral composition of a peak” over the time window; (5) a “trajectory of the leading spectrum component” in the sound signal over the time window; for example, the trajectory can be ascending, descending or V-shaped; (6) a “leading spectral component” (or a set of leading spectral components) at each moment in the time window; (7) an “attack strength,” which reflects the most abrupt variation in sound intensity over the time window; and (8) a “high-peak number,” which specifies a number of peaks that are within 80% of the peak amplitude in the time window.
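To make two of these features concrete, the sketch below computes a "wave-linear length" and a "high-peak number" for one window of samples. The exact definitions are assumptions made for illustration: the wave-linear length is taken to be the sum of absolute differences between consecutive samples, and a "peak" is taken to be a local maximum whose value is at least 80% of the largest absolute amplitude in the window. The function names and the sample window are hypothetical, and the sketch returns raw values rather than likelihood coefficients.

```python
def wave_linear_length(samples):
    """Assumed definition: total variation of the waveform in the window."""
    return sum(abs(b - a) for a, b in zip(samples, samples[1:]))


def high_peak_number(samples, ratio=0.8):
    """Count local maxima that reach at least `ratio` of the peak amplitude."""
    peak = max(abs(s) for s in samples)
    count = 0
    for prev, cur, nxt in zip(samples, samples[1:], samples[2:]):
        is_local_max = cur > prev and cur >= nxt
        if is_local_max and cur >= ratio * peak:
            count += 1
    return count


# A short illustrative window of amplitude samples.
window = [0.0, 0.5, 0.1, 1.0, 0.2, 0.9, 0.0]
feature_vector = [wave_linear_length(window), high_peak_number(window)]
```

Here the local maximum at 0.5 is not counted because it falls below 80% of the peak amplitude, while the maxima at 1.0 and 0.9 are.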
Note that it is advantageous to use a sound feature that can be computed using simple incremental computations instead of more-complicated computational operations. For example, the system can compute the “wave-linear length” instead of the more computationally expensive signal-to-noise ratio (SNR).
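As an illustration of such an incremental computation, the sketch below maintains a wave-linear length over a sliding window with constant work per new sample. It assumes the wave-linear length is the sum of absolute differences between consecutive samples; the class name `IncrementalWaveLength` and its interface are hypothetical.

```python
from collections import deque


class IncrementalWaveLength:
    """Sliding-window wave-linear length, updated in O(1) per sample."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.samples = deque()
        self.length = 0.0

    def push(self, sample):
        # Add the segment between the newest sample and its predecessor.
        if self.samples:
            self.length += abs(sample - self.samples[-1])
        self.samples.append(sample)
        # Drop the oldest sample and the segment that started at it.
        if len(self.samples) > self.window_size:
            oldest = self.samples.popleft()
            self.length -= abs(self.samples[0] - oldest)
        return self.length
```

Each call to `push` adds one segment and removes at most one, so the feature never has to be recomputed over the whole window.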
Next, the system creates a set of feature vectors from coefficients generated by the feature-detection operation, wherein each feature vector comprises a set of coefficients for sound features in the set of sound features (step 304). The system then performs a clustering operation on the set of feature vectors to produce a set of feature clusters, wherein each feature cluster comprises a set of feature vectors that are proximate to each other in a vector space that contains the set of feature vectors (step 306). This clustering operation can involve any known clustering technique, such as the “k-means clustering technique,” which is commonly used in data mining systems. This clustering operation also makes use of a distance metric, such as the “normalized Google distance,” to form the clusters of proximate feature vectors.
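The clustering step can be sketched with a small k-means routine. This is a simplified illustration rather than the disclosed implementation: it uses squared Euclidean distance in place of the distance metric named above, it seeds the centroids with the first k vectors so that the result is deterministic, and the function name `kmeans` and the sample vectors are hypothetical.

```python
def kmeans(vectors, k, iterations=10):
    """Group feature vectors into k clusters (plain k-means sketch)."""
    centroids = [list(v) for v in vectors[:k]]  # deterministic seeding
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        for idx, v in enumerate(vectors):
            assignment[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Move each centroid to the mean of its member vectors.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


vectors = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
assignment, centroids = kmeans(vectors, k=2)
```

Each resulting cluster index would then correspond to one candidate sound primitive.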
The system then defines the set of sound primitives, wherein each sound primitive is defined to be associated with a feature cluster in the set of feature clusters (step 308). Finally, the system associates semantic labels with sound primitives in the set of sound primitives, wherein a semantic label for a sound primitive comprises one or more words that describe a sound characterized by the sound primitive (step 310).
Referring to the flow chart in
After the model for recognizing the set of sound primitives has been generated, the system generates a model that recognizes “events” from patterns of lower-level sound primitives. Like sound primitives, events are associated with concepts that have a semantic meaning, and are also associated with corresponding semantic labels. Moreover, each event is associated with a pattern of one or more sound primitives, wherein the pattern for a particular event can include one or more sequences of sound primitives, wherein the sound primitives can potentially overlap in the sequences. For example, an event associated with the concept of “wind” can be associated with sound primitives for “rustling” and “blowing.” In another example, an event associated with the concept of “washing dishes” can be associated with a sequence of sound primitives, which include “metal clanging,” “glass clinking” and “running water.”
Note that the model that recognizes events can be created based on input obtained from a human expert. During this process, the human expert defines each event in terms of a pattern of lower-level sound primitives. Moreover, the human expert can also define higher-level events based on patterns of lower-level events. For example, the higher-level event “storm” can be defined as a combination of the lower-level events “wind,” “rain” and “thunder.” Instead of (or in addition to) receiving input from a human expert to define events, the system can also use a machine-learning technique to make associations between lower-level events and higher-level events based on feedback from a human expert as is described in more detail below. Once these associations are determined, the system converts the associations into a grammar that is used by a non-deterministic finite-state automaton to recognize events as is described in more detail below.
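A minimal sketch of such expert-supplied definitions follows. It deliberately simplifies the approach: each event is recognized as soon as all of its constituents have been observed, ignoring ordering and probabilities, whereas the embodiments use a grammar and a non-deterministic finite-state automaton. The table `EVENTS`, the function `recognize`, and the primitives listed for "rain" and "thunder" are hypothetical.

```python
# Each event is defined by the lower-level items it is composed of.
EVENTS = {
    "wind": {"rustling", "blowing"},
    "rain": {"dripping", "pattering"},   # hypothetical primitives
    "thunder": {"rumbling"},             # hypothetical primitive
    "storm": {"wind", "rain", "thunder"},
}


def recognize(observed_primitives):
    """Return every event whose constituents have all been observed."""
    recognized = set(observed_primitives)
    changed = True
    while changed:  # repeat so higher-level events can build on lower ones
        changed = False
        for event, parts in EVENTS.items():
            if event not in recognized and parts <= recognized:
                recognized.add(event)
                changed = True
    return recognized - set(observed_primitives)
```

With this table, observing only "rustling" and "blowing" yields the event "wind", while observing all five primitives also yields "rain", "thunder" and the higher-level event "storm".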
Note that a sound primitive can be more clearly defined by examining other temporally proximate sound primitives. For example, the sound of an explosion can be more clearly defined as a gunshot if it is followed by more explosions, the sound of people screaming, and the sound of a police siren. In another example, a sound that could be either a laugh or a bark can be more clearly defined as a laugh if it is followed by the sound of people talking.
Sound-Recognition Process
Next, the system feeds the sequence of sound primitives into a finite-state automaton that recognizes events associated with sequences of sound primitives. This finite-state automaton can be a non-deterministic finite-state automaton that can exist in multiple states at the same time, wherein the non-deterministic finite-state automaton maintains a probability value for each of the multiple states that the finite-state automaton can exist in (step 508). Finally, the system feeds the recognized events into an output system that triggers an alert when a probability that a tracked event is occurring exceeds a threshold value (step 510).
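The recognition and alerting steps can be illustrated with the following sketch, which tracks a probability for every state the automaton may occupy and records an alert whenever the tracked event's probability exceeds a threshold. The state names, the transition weights in `TRANSITIONS`, the rule that a state with no matching transition keeps its probability, and the threshold of 0.5 are all assumptions chosen for the example.

```python
# (state, sound primitive) -> {next state: assumed transition probability}
TRANSITIONS = {
    ("idle", "baby crying"): {"crying": 1.0},
    ("crying", "human voice"): {"attended": 0.7, "crying": 0.3},
    ("attended", "silence"): {"cared_for": 0.9, "attended": 0.1},
}


def advance(state_probs, primitive):
    """Spread each state's probability across its possible next states."""
    nxt = {}
    for state, p in state_probs.items():
        # A state with no matching transition keeps its probability.
        targets = TRANSITIONS.get((state, primitive), {state: 1.0})
        for target, weight in targets.items():
            nxt[target] = nxt.get(target, 0.0) + p * weight
    return nxt


def run(primitives, tracked="cared_for", alert_threshold=0.5):
    """Return final state probabilities and the steps at which alerts fire."""
    probs = {"idle": 1.0}
    alerts = []
    for t, primitive in enumerate(primitives):
        probs = advance(probs, primitive)
        if probs.get(tracked, 0.0) > alert_threshold:
            alerts.append(t)
    return probs, alerts


probs, alerts = run(["baby crying", "human voice", "silence"])
```

In this run the automaton is simultaneously in the "cared_for", "attended" and "crying" states after the third primitive, and the alert fires at that step because the tracked state's probability exceeds the assumed threshold.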
Non-Deterministic Finite-State Automaton
As mentioned above, the system can recognize events based on other events (or from sound primitives) through use of a non-deterministic finite-state automaton. An exemplary state-transition process 800 for an exemplary non-deterministic finite-state automaton is illustrated in
Matrix Operations
In some embodiments, the system receives feedback from a human who reviews the highest-level feature vector 916 and also listens to the associated audio stream, and then provides feedback about whether the highest-level feature vector 916 is consistent with the audio stream. This feedback can be used to modify the lower-level matrices through a machine-learning process to more accurately produce higher-level feature vectors. Note that this system can use any one of a variety of well-known machine-learning techniques to modify these lower-level matrices.
Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The foregoing descriptions of embodiments have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present description to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present description. The scope of the present description is defined by the appended claims.
This application is a continuation-in-part of, and hereby claims priority under 35 U.S.C. §120 to, pending U.S. patent application Ser. No. 14/616,627, entitled “Systems and Methods for Identifying a Sound Event,” by inventor Sebastien J. V. Christian, filed 6 Feb. 2015. U.S. patent application Ser. No. 14/616,627 itself claims priority under 35 U.S.C. §119 to U.S. Provisional Application No. 61/936,706, entitled “Sound Source Identification System,” by inventor Sebastien J. V. Christian, filed 6 Feb. 2014. This application also claims priority under 35 U.S.C. §119 to U.S. Provisional Application No. 62/387,126, entitled “Systems and Methods for Identifying a Sound Event Using Perceived Patterns,” by inventor Sebastien J. V. Christian, filed 23 Dec. 2015. The above-listed applications are all hereby incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20160330557 A1 | Nov 2016 | US |
Number | Date | Country | |
---|---|---|---|
61936706 | Feb 2014 | US | |
62387126 | Dec 2015 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 14616627 | Feb 2015 | US |
Child | 15209251 | US |