This invention relates to determining an environment context by the classification of sounds, especially sounds that are detectable within a vehicle cabin.
Most in-vehicle activities create sound. The sound created by each in-vehicle activity may be called a “sound activity”. The sound activity created by each in-vehicle activity is unique and can be considered as a signature of the corresponding in-vehicle activity. These sound activities are either directly associated with in-vehicle events (e.g. horn sound, indicator sound, speech, music, etc.) or indirectly associated with in-vehicle events (e.g. vehicle engine sound, wiper operation sound, mechanical gear operation sound, tyre sound, sound due to wind, sound due to rain, door operation sound, etc.).
Sound activities can affect the performance of the vehicle's audio systems, e.g. an audio enhancement system, a speech recognition system, or a noise cancellation system. It would be desirable to capture and analyse sound activities in order to improve the performance of the vehicle's audio systems.
A first aspect of the invention provides a method of determining contexts for a vehicle, the method including:
A second aspect of the invention provides a system for determining contexts for a vehicle, the system including:
A third aspect of the invention provides a vehicle audio system comprising a system for determining contexts for a vehicle, the context determining system including:
Preferred embodiments of the invention facilitate capturing and analysing sound activities in order to detect a range of in-vehicle activities that are problematic or expensive to detect using conventional vehicular sensor systems (e.g. wind blowing, rainy weather, emergency braking, vehicle engine health, and so on). Related advantages offered by preferred embodiments include: provision of a non-intrusive means of sensing; robustness to the position and orientation of the activity with respect to the sensors; deployability at relatively low cost; capability of capturing information on multiple activities simultaneously; ability to readily distinguish between activities.
Identifying individual sound activities facilitates identifying the corresponding in-vehicle activity that created the sound activity. This in turn allows enhancement of in-vehicle audio systems, e.g. an audio player, an audio enhancement system, a speech recognition system, a noise cancellation system, and so on. For example, detecting the presence of a horn sound in the audio is a cue that can be used by an audio enhancement system to improve its performance and thereby improve the performance of the speech recognition system.
It can be advantageous to determine a wider context associated with an in-vehicle activity. This is because, in real in-vehicle scenarios, sound activities interact with one another based on the context and hence have contextual associations. Context, in general, may be defined as information that characterizes the situation of a person, place, or object. In-vehicle context may be considered as the information that characterizes the nature of the environment in the vehicle or events that have occurred within that environment. The following descriptors are examples of in-vehicle contexts:
In preferred embodiments, contextual information is used to enhance user interactions with in-vehicle devices and inter-device interactions and operations. For example, contextual information indicating that a mobile phone is operating can be used by in-vehicle audio system(s) to adapt the phone volume and thereby provide better service to the user.
One aspect of the invention provides a method for classifying contexts in a vehicle by capturing and analysing sound activities in the vehicle. The preferred method segments the captured audio into segments, each representing an in-vehicle context; then, for each audio segment, the respective context and the individual sound activities present in that segment are identified.
Preferred embodiments provide a method for classifying in-vehicle contexts from in-vehicle audio signals. The method may include organizing audio training data into a set of sound models, each representing a sound component of a sound mixture forming the in-vehicle context. The method may include organizing audio training data into a set of sound models representing the sound that is directly formed by an in-vehicle context. Preferably, the method includes building an association table containing a list of in-vehicle contexts, with each context mapped to one or more sound models. Optionally, the method involves organizing the in-vehicle context dynamics into n-gram models. Advantageously, the method includes utilizing data from the vehicle sensor systems. The preferred method involves joint identification of the context and the sound activities from an audio segment. Preferably, a list of past contexts is used in the joint identification process. Joint identification preferably involves model reduction, advantageously utilizing data from the vehicle sensor systems.
Joint identification may involve using a probabilistic technique to derive matching scores between the audio features extracted from the audio segment and the model sets associated with the contexts in a context list. The probabilistic technique preferably assumes temporal sparsity in the short-time audio features of the audio segment. The probabilistic technique preferably includes a context n-gram weighting to derive the model score.
Other preferred features are recited in the dependent claims attached hereto.
Further advantageous aspects of the invention will become apparent to those ordinarily skilled in the art upon review of the following description of a specific embodiment and with reference to the accompanying drawings.
An embodiment of the invention is now described by way of example and with reference to the accompanying drawings, in which:
The vehicle 10 includes an audio system 20 that is co-operable with the microphones 12 and loudspeakers 14 to detect audio signals from, and render audio signals to, the cabin 11. The audio system 20 may include one or more audio rendering devices 22 for causing audio signals to be rendered via the loudspeakers 14. The audio system 20 may include one or more speech recognition devices 24 for recognising speech uttered by the occupants 18 and detected by the microphones 12. The audio system 20 may include one or more noise cancellation devices 26 for processing audio signals detected by the microphones 12 and/or for rendering by the loudspeakers 14 to reduce the effects of signal noise. The audio system 20 may include one or more noise enhancement devices 28 for processing audio signals detected by the microphones 12 and/or for rendering by the loudspeakers 14 to enhance the quality of the audio signal. The devices 22, 24, 26, 28 (individually or in any combination) may be co-operable with, or form part of, one or more of the vehicle's audio-utilizing devices (e.g. radio, CD player, media player, telephone system, satellite navigation system or voice command system), which equipment may be regarded as part of, or as respective sub-systems of, the overall vehicle audio system 20. The devices 22, 24, 26, 28 may be implemented individually or in any combination in any convenient manner, for example as hardware and/or computer software supported by one or more data processors, and may be conventional in form and function. In preferred embodiments, contextual information relating to the vehicle is used to enhance user interactions with such in-vehicle audio devices and inter-device interactions and operations.
The audio system 20 includes a context classification system (CCS) 32 embodying one aspect of the present invention. The CCS 32 may be implemented in any convenient manner, for example as hardware and/or computer software supported by one or more data processors. In use, the CCS 32 determines one or more contexts for the cabin 11 based on one or more sounds detected by the microphones 12 and/or on one or more non-audio inputs. In order to generate the non-audio inputs, the vehicle 10 includes at least one electrical device, typically comprising a sensor 16, that is operable to produce a signal indicative of the status of a respective aspect of the vehicle 10, especially an aspect that may affect the sound in the cabin 11. For example, each sensor 16 may be configured to indicate the operational status of any one of the following vehicle aspects: left/right indicator operation; windshield wiper operation; media player on/off; window open/closed; rain detection; telephone operation; fan operation; sun roof operation; air conditioning or heater operation; amongst others. Three sensors 16 are shown in the accompanying drawings.
The CCS 32 determines, or classifies, context from in-vehicle audio signals captured by one or more of the microphones 12, as exemplified by audio signal 40. In preferred embodiments, this is achieved by: 1) segmenting the audio signal 40 into smaller audio segments 42 each representing a respective in-vehicle context; and 2) jointly identifying the in-vehicle context and sound activities present in each audio segment.
The preferred CCS 32 includes an audio segmentation module 48 that segments the input audio signal 40 into shorter-length audio segments 42.
Preferably, the audio segments 42 are analysed to determine whether they have audio content that is suitable for use in context determination, e.g. whether they contain identifiable sound(s). This may be performed using any convenient conventional technique(s), for example the Bayesian Information Criterion, model-based segmentation, and so on. This analysis is conveniently performed by the audio segmentation module 48.
The audio segmentation module 48 may also use the non-audio data 44 to enhance the audio segmentation. For example, the non-audio data 44 may be used in determining the boundaries for the audio segments 42 during the segmentation process.
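By way of illustration only, the following Python sketch shows one way a boundary search of the kind described above could be realised using the Bayesian Information Criterion. It is a minimal sketch under invented parameters (window length, step size, penalty weight, threshold); the description does not prescribe any particular implementation.

```python
# Minimal sketch of BIC-based change detection for audio segmentation.
# Window length, step, penalty weight and threshold are illustrative
# assumptions, not values taken from this description.
import numpy as np

def delta_bic(frames: np.ndarray, split: int, penalty: float = 1.0) -> float:
    """BIC gain of splitting `frames` (N x d feature matrix) at `split`.

    A positive value suggests the two halves are better modelled by
    separate Gaussians, i.e. a likely context boundary.
    """
    n, d = frames.shape

    def logdet(x):
        # Log-determinant of the sample covariance, lightly regularised.
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    likelihood_gain = 0.5 * (n * logdet(frames)
                             - split * logdet(frames[:split])
                             - (n - split) * logdet(frames[split:]))
    complexity = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return likelihood_gain - penalty * complexity

def find_boundaries(frames, win=200, step=50, threshold=0.0):
    """Slide a window over the frame sequence and report split points
    whose BIC gain exceeds `threshold`."""
    boundaries = []
    for start in range(0, len(frames) - win, step):
        if delta_bic(frames[start:start + win], win // 2) > threshold:
            boundaries.append(start + win // 2)
    return boundaries
```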
The preferred CCS 32 also includes a feature extraction module 50 that is configured to perform feature extraction on the audio segments 42. This results in each segment 42 being represented as a plurality of audio features.
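The description does not fix a particular feature set. Mel-frequency cepstral coefficients (MFCCs) are a common choice for sound-activity work, so the sketch below uses them, via the librosa library, purely as an assumed illustration of what the feature extraction module 50 might compute.

```python
# Sketch of feature extraction for one audio segment 42; MFCCs are an
# assumed feature set, and the frame parameters are illustrative.
import numpy as np
import librosa

def extract_features(segment: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Represent one audio segment as a matrix of short-time feature
    vectors (frames x coefficients)."""
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=256)
    # librosa returns (n_mfcc, frames); transpose so each row is one frame.
    return mfcc.T
```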
The preferred CCS 32 includes a sound activity module 52. This module 52 comprises a plurality of mathematical sound activity models 53 that are used by the CCS 32 to identify the audio content of the audio segments 42. Each model may define a specific sound (e.g. wiper operating), or a specific sound type (e.g. speech or music), or a specific sound source (e.g. a horn), or a known combination of sounds, sound types and/or sound sources. For example, in the preferred embodiment, each model comprises a mathematical representation of one or other of the following: the steady-state sound from a single sound source (e.g. a horn blast); a single specific sound activity of a sound source (e.g. music from a radio); or a mixture of two or more specific sound activities from multiple sound sources (e.g. music from a radio combined with speech from an occupant). Advantageously, the sound activity models 53 are elementary in that they can be arbitrarily combined with one another to best represent respective in-vehicle contexts. In any event, each model can be associated directly or indirectly with a specific in-vehicle sound activity or combination of in-vehicle sound activities. The CCS 32 may assign any one or more sound activities 45 to each audio segment 42 depending on the audio content of the segment 42.
The sound activity models 53 may be obtained by a training process, for example as illustrated in
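As a minimal sketch of such a training process, one Gaussian mixture model (GMM) could be fitted per elementary sound activity from labelled feature frames. The use of GMMs (via scikit-learn) and the label set are assumptions made for illustration; the description leaves the model family open.

```python
# Train one sound activity model 53 per elementary activity label.
# GMMs and the illustrated labels are assumptions, not prescribed here.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_sound_activity_models(training_data: dict[str, np.ndarray],
                                n_components: int = 8
                                ) -> dict[str, GaussianMixture]:
    """`training_data` maps an activity label (e.g. 'wiper', 'horn',
    'music') to a (frames x coefficients) feature matrix."""
    models = {}
    for label, frames in training_data.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(frames)
        models[label] = gmm
    return models
```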
The preferred CCS 32 maintains an association table 56 associating a plurality of in-vehicle contexts 43 with a respective one or more sound activity models 53, i.e. a single sound activity model 53 or a combination of sound activity models 53.
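For example, the association table 56 might be realised as a simple mapping from each context 43 to the labels of its associated sound activity models 53. The context names and model labels below are invented for illustration only.

```python
# One possible realisation of the association table 56; every name
# shown here is a hypothetical example.
ASSOCIATION_TABLE = {
    "driving_in_rain": {"engine", "rain", "wiper"},
    "city_driving":    {"engine", "horn", "indicator"},
    "music_listening": {"engine", "music"},
    "phone_call":      {"engine", "speech"},
}

def models_for_context(context: str) -> set[str]:
    """Return the sound activity model labels associated with a context."""
    return ASSOCIATION_TABLE[context]
```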
In preferred embodiments, the CCS 32 uses context dynamics models 60 to analyse the assignment of contexts 43 to audio segments 42 using a statistical modelling process. Preferably, an n-gram statistical modelling process is used to produce the models 60. By way of example only, a unigram (1-gram) model may be used. In general, an n-gram model represents the dynamics (time evolution) of a sequence by capturing the statistics of contiguous sequences of n items from a given sequence. In the preferred embodiment, a respective n-gram model 60 representing the dynamics of each in-vehicle context 43 is provided. The n-gram models 60 may be obtained by a training process.
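A minimal sketch of such a training process is shown below for a bigram (n = 2) over labelled context sequences; the bigram order and the maximum-likelihood estimate with add-one smoothing are assumptions chosen for illustration, and the description does not prescribe a particular estimator.

```python
# Sketch of training a bigram context dynamics model 60 from labelled
# context sequences; estimator and smoothing are illustrative choices.
from collections import Counter
from itertools import pairwise  # Python 3.10+

def train_bigram(context_sequences: list[list[str]], vocab: set[str]):
    """Return P(next_context | previous_context) as a nested dict."""
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in context_sequences:
        for prev, nxt in pairwise(seq):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    # Add-one smoothing so unseen transitions keep non-zero probability.
    return {
        prev: {nxt: (pair_counts[(prev, nxt)] + 1)
                    / (prev_counts[prev] + len(vocab))
               for nxt in vocab}
        for prev in vocab
    }
```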
The preferred CCS 32 includes a context history buffer 66 for storing a sequence of identified contexts that are output from a joint identification module 68, typically in a first-in-first-out (FIFO) buffer (not shown), and feeds the identified contexts back to the joint identification module 68. A respective context is identified for each successive audio segment 42. The number of identified contexts to be stored in the buffer 66 depends on the value of “n” in the n-gram model. The information stored in the buffer 66 can be used jointly with the n-gram model to track the dynamics of the context identified for subsequent audio segments 42.
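A fixed-length FIFO is sufficient for the context history buffer 66 described above, since only the n-1 most recent identified contexts are needed by an n-gram model. As a sketch, Python's collections.deque with a maximum length gives exactly this behaviour; the value of n here is an illustrative assumption.

```python
# Sketch of the context history buffer 66 as a fixed-length FIFO.
from collections import deque

n = 2                                  # assumed order of the n-gram model
context_history = deque(maxlen=n - 1)  # FIFO of past identified contexts

def on_context_identified(context: str):
    """Called once per audio segment 42 with its identified context 43."""
    context_history.append(context)    # oldest entry drops out automatically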
The joint identification module 68 generates an in-vehicle context together with one or more associated sound activities for each audio segment 42. In the preferred embodiment, the joint identification module 68 receives the following inputs: the extracted features from the feature extraction module 50; the sound activity models 53; the association table 56; the n-gram context models 60; and the sequence of identified contexts for audio segments immediately preceding the current audio segment (from the context history buffer 66). The preferred module 68 generates two outputs for each audio segment 42: the identified in-vehicle context 43; and the individual identified sound activities 45.
In the preferred embodiment, the joint identification module 68 applies sequential steps, namely model reduction and model scoring, to each segment 42 to generate the outputs 43, 45. The preferred model reduction step produces a reduced list 70 of candidate contexts 43 for the current segment 42.
Optionally, the module 68 uses the context dynamics models 60 to perform context dynamics modelling, n-gram modelling in this example, to analyse the assignment of contexts 43 to audio segments 42. This improves the model reduction process by eliminating incompatible contexts 43 from the list 70 for the current segment 42 based on the time evolution of data over the previous n−1 segments.
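A minimal sketch of the model reduction step follows: start from every known context, discard those contradicted by the non-audio sensor data 44, then discard those that the n-gram model makes near-impossible given the recent context history. The sensor rule and the probability floor shown are invented examples.

```python
# Sketch of model reduction: prune the context list 70 using sensor
# data 44 and the n-gram dynamics model 60. Rules are hypothetical.
def reduce_contexts(all_contexts, sensor_data, bigram, history,
                    min_prob=1e-3):
    candidates = set(all_contexts)
    # Example sensor rule: no rain detected -> rain-related contexts out.
    if not sensor_data.get("rain_detected", False):
        candidates.discard("driving_in_rain")
    # n-gram pruning: drop contexts highly unlikely to follow the
    # previously identified context.
    if history:
        prev = history[-1]
        candidates = {c for c in candidates
                      if bigram.get(prev, {}).get(c, 0.0) >= min_prob}
    return candidates
```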
Pseudo code of an exemplary implementation of the model scoring process is given below.
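The listing below is an illustrative Python rendering rather than a reproduction of any original listing. It follows the description above: frame-wise matching scores are computed under each candidate context's associated sound activity models (here assumed to be the Gaussian mixtures sketched earlier), a temporal sparsity assumption attributes each short-time frame to its single best-matching model, and an n-gram weighting term biases the total score.

```python
# Sketch of model scoring and joint identification for one segment 42.
# Model family, weighting and probability floor are assumptions.
import numpy as np

def score_context(frames, context, models, association_table,
                  bigram, history, ngram_weight=1.0):
    """Matching score for one candidate context over one audio segment."""
    member_models = [models[label] for label in association_table[context]]
    acoustic = 0.0
    for frame in frames:
        # Temporal sparsity: each short-time frame is attributed to the
        # one sound activity model that explains it best.
        acoustic += max(m.score(frame.reshape(1, -1))
                        for m in member_models)
    prior = 0.0
    if history:
        prob = bigram.get(history[-1], {}).get(context, 1e-12)
        prior = np.log(prob)
    return acoustic + ngram_weight * prior

def identify(frames, candidates, models, association_table,
             bigram, history):
    """Jointly pick the best context 43 and its sound activities 45."""
    best = max(candidates,
               key=lambda c: score_context(frames, c, models,
                                           association_table,
                                           bigram, history))
    return best, association_table[best]
```

A reduced context list of the kind produced by the model reduction sketch above would be passed in as `candidates`, so that only compatible contexts are scored.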
The invention is not limited to the embodiment(s) described herein but can be amended or modified without departing from the scope of the present invention.