The present invention relates to audio distortion compensation and acoustic channel estimation, especially but not exclusively in vehicles.
Vehicle manufacturers are introducing speech recognition technology into vehicles for both voice command control of vehicle equipment and as a natural language interface to wider internet-based services. This technology currently performs well with a close-talking microphone but performance drops significantly when the microphone is placed at a distance from the speaker. Sound from the speaker's mouth takes a multi-path route to the microphone because the sound is reflected off different in-vehicle surfaces and reverberated before finally entering the microphone capsule. The microphone also has a characteristic electrical response to the acoustic waves and this cascade of systems leads to distortion of the original speech, making the function of speech recognition more difficult. A similar problem arises during mobile (cellular) telephone conversations when the microphone is remote from the speaker's mouth.
Audio and speech content reproduction technology is well established in the modern vehicle but the quality and intelligibility of the reproductions are often poor. Original recordings are made in environments that are acoustically very different from that inside the vehicle and sounds that appeared bright in the recording studio may be dulled inside the vehicle because the in-vehicle environment acoustically damps out critical component frequencies. Other frequency components of sound may also be boosted by the acoustic environment inside the vehicle to the point that they dominate and create an unnatural or unbalanced listening experience for the user.
Audio capture and reproduction systems offer manufacturers potential to add new value to their vehicles and it would be desirable to provide such systems with the ability to model and correct for the distortion of sound in the vehicle by estimation of acoustic channels.
A first aspect of the invention provides a method of defining an acoustic channel in an environment, the method comprising: providing a respective definition of at least one sound source in said environment, said respective definition comprising a definition of a respective sound associated with said at least one sound source and a respective location within said environment associated with said at least one sound source; identifying in an output signal of at least one microphone located in said environment at least one output signal segment corresponding to a respective one of said respective sounds; and generating from said at least one output signal segment, and optionally from the respective sound definition, a respective definition of a respective acoustic channel for association with the respective location associated with the respective sound source, and optionally with said at least one microphone, and wherein said at least one sound source comprises a respective intrinsic part of said environment.
A second aspect of the invention provides a method of compensating an audio signal for distortion caused by an acoustic channel in an environment, said method comprising: maintaining an acoustic channel map for said environment, said map comprising at least one acoustic channel definition associated with a respective one of a plurality of locations within said environment; determining a location within said environment corresponding to a source of said audio signal or a destination of said audio signal; selecting from said acoustic channel map at least one of said acoustic channel definitions based on a comparison of said determined location for said audio signal with the respective location associated with said at least one of said acoustic channel definitions; compensating said audio signal using said at least one selected acoustic channel definition.
A third aspect of the invention provides a system for defining an acoustic channel in an environment, the system comprising: at least one storage device storing a respective definition of at least one sound source in said environment, said respective definition comprising a definition of a respective sound associated with said at least one sound source and a respective location within said environment associated with said at least one sound source; an identification module configured to identify in an output signal of at least one microphone located in said environment at least one output signal segment corresponding to a respective one of said respective sounds; and an acoustic channel estimation module configured to generate from said at least one output signal segment, and optionally the respective sound definition, a respective definition of a respective acoustic channel for association with the respective location associated with the respective sound source, and optionally with said at least one microphone, wherein said at least one sound source comprises a respective intrinsic part of said environment.
A fourth aspect of the invention provides a system for compensating an audio signal for distortion caused by an acoustic channel in an environment, said system comprising a distortion compensation module configured to maintain an acoustic channel map for said environment, said map comprising at least one acoustic channel definition associated with a respective one of a plurality of locations within said environment, said distortion compensation module being further configured to determine a location within said environment corresponding to a source of said audio signal or a destination of said audio signal, to select from said acoustic channel map at least one of said acoustic channel definitions based on a comparison of said determined location for said audio signal with the respective location associated with said at least one of said acoustic channel definitions, and to compensate said audio signal using said at least one selected acoustic channel definition.
Preferred embodiments of the invention employ naturally occurring vehicle sounds to characterize the acoustic environment inside a vehicle. Sounds such as doors opening and shutting, doors locking, hazard and indicator light relay clicks, window up and down, and seat position locking are short term audio sounds that fully or partially represent the audio bandwidth range. These sounds tend to have a high acoustic energy and generate electrical signals on microphone outputs that are high above (i.e. distinguishable from) the ambient noise floor inside the vehicle. Preferably, any specific sound that is impulsive in nature and contains a range of audio frequencies (for example the sound of a door shutting, which typically includes wideband speech frequencies up to approximately 8 kHz), preferably substantially the full bandwidth of audio frequencies, can be used to characterize the acoustic channel (provided the amplitude of the sound is sufficiently high to allow it to be distinguished from the ambient noise). Optionally, the measurements from multiple sounds that each contain a range of audio frequencies can be combined to represent the same channel.
Advantageously, an acoustic channel map is constructed using sounds generated from different physical locations inside a vehicle. Preferably, pattern matching techniques are used to uniquely identify a sound type. A particular sound type may be associated with a particular location and so identification of the sound type results in identification of the location. Alternatively, a user may provide an input to the system indicating the sound's location. Alternatively still, a location estimation algorithm may be used to determine a location for a detected sound. Any conventional algorithm may be used for this purpose. For example, in the case where a sound of a particular type may emanate from any one of multiple locations (e.g. a door closing sound), a location estimation algorithm may determine which of the multiple locations is the relevant one by analysing one or more characteristics of the detected sound, e.g. time delay and/or amplitude and/or direction. Hence, embodiments of the invention may comprise means for determining the location of a detected sound by any one or more of: direct association with the sound type; user input; or application of a location estimation algorithm.
Preferred embodiments of the invention involve using naturally occurring vehicle sounds to blindly estimate acoustic channels in the vehicle that can be grouped to form an acoustic channel map of the interior of the vehicle.
Preferred embodiments of the invention support an audio based method for in-vehicle acoustic channel characterization using naturally occurring vehicle sounds. In particular, preferred embodiments of the invention support characterization of the acoustic environment of the interior of a vehicle using sound sources that are intrinsic to the vehicle and so do not require additional equipment. The preferred method is repeatable and comprehensive across all vehicle types and models.
Further preferred features are recited in the claims appended hereto. Other advantageous aspects of the invention will become apparent to those ordinarily skilled in the art upon review of the following description of a specific embodiment and with reference to the accompanying drawings.
An embodiment of the invention is now described by way of example and with reference to the accompanying drawings in which:
An acoustic channel can be defined as a description of the multiple paths that a sound travels from a source to the receiver. This could be from a loudspeaker to the listener's ears or from a human speaker or a faulty engine component to a microphone. In-vehicle acoustic channels are complex and are characterized by the size of the interior of the vehicle, the reflective and absorption properties of interior surfaces, components, seats and passengers and the relative positions of source and receiver inside the vehicle.
The characteristics of the acoustic channel have a distorting effect on sound emanating from a sound source in the vehicle. For example, speech uttered by the human speaker 12 is distorted by both the acoustic environment of the vehicle 10, i.e. the reverberation channel of which paths A, B are part, and the receiving microphone 18, i.e. the microphone channel, before it is presented to the signal processing system for speech recognition. The greater the distance between the speaker 12 and the receiving microphone 18, the greater the channel distortion tends to be. The aim of a channel compensation technique is to reveal the original speech sound through the channel distortions.
The characteristics of an acoustic channel are defined by the acoustic path (e.g. paths A, B) that the sound travels from source 12 to microphone 18 and the electrical characteristics of the microphone and any associated electrical equipment through which the electrical signal passes before reaching the signal processing system. Since there can be multiple speakers and multiple microphones in a vehicle, there can be multiple acoustic channels by which sound can travel in the vehicle. These acoustic channels can be grouped together in a channel map. The characteristics of each acoustic channel in the channel map depend on the physical co-ordinates of the respective sound source and microphone, and a characterisation of the relevant reverberation and microphone channels. The channel map contains information that can be used to reveal a more accurate digital representation of the original sound, e.g. speech.
Acoustic channels can be modelled in the time and frequency domain and acoustic channel compensation techniques can then be used to correct the acoustic distortion introduced by the channel and deliver a signal to the receiver (human or machine) that is much more representative of the original sound.
Conveniently, each acoustic channel in the channel map 38 is represented by an acoustic channel definition, typically comprising a mathematical definition, for example a transfer function, that is applied to an input signal (e.g. speech uttered at source 12) to produce an output signal (e.g. the electrical, typically digital, representation of the input signal that is rendered to the signal processing system via a microphone). In the following description the acoustic channel definition is assumed to comprise a transfer function but it will be understood that the invention is not limited to this and that any other suitable definition, typically comprising a mathematical channel representation, may be used.
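By way of non-limiting illustration, the relationship between a transfer function, an input signal and an output signal may be sketched as follows. The sketch assumes that the channel definition is held as a finite impulse response `h`; the function names `apply_channel` and `compensate` and the regularisation constant `eps` are hypothetical and are not taken from the embodiment. The regularised frequency-domain inverse shown is only one of many possible ways of deriving a compensating function from a channel definition.

```python
import numpy as np


def apply_channel(x, h):
    """Model the channel: convolve source signal x with impulse response h."""
    return np.convolve(x, h)


def compensate(y, h, eps=1e-3):
    """Approximately undo the channel using a regularised inverse filter.

    eps (an assumed tuning constant) limits the gain at frequencies where
    the channel response is close to zero.
    """
    n = len(y)
    H = np.fft.rfft(h, n)          # channel frequency response
    Y = np.fft.rfft(y, n)          # spectrum of the distorted signal
    X_hat = Y * np.conj(H) / (np.abs(H) ** 2 + eps)
    return np.fft.irfft(X_hat, n)  # estimate of the original signal


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.standard_normal(256)
    h = np.array([1.0, 0.5, 0.25])  # toy three-tap channel
    distorted = apply_channel(source, h)
    restored = compensate(distorted, h, eps=1e-9)
    print(np.max(np.abs(restored[:256] - source)))
```

In this toy case the channel has no spectral nulls, so a very small `eps` recovers the source almost exactly; a measured in-vehicle channel would generally call for a larger value.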
For each microphone 16, 18, the channel map 38 may comprise a respective acoustic channel defined by a respective transfer function for a respective one of multiple locations within the vehicle 10. Advantageously, each location can be correlated with a respective location where a sound is expected to emanate from, e.g. the expected location of a driver or passenger. Hence, when sound, e.g. speech, is detected by a microphone, the DC 34 estimates the location of its source and selects the most appropriate, e.g. closest or best matching, acoustic channel in the channel map 38. The DC 34 then uses the transfer function of the selected acoustic channel to eliminate or reduce the distortion on the output signal from the microphone before rendering the distortion-compensated signal to the signal processing system 19. Typically the transfer function is inverted for application to the microphone output signal. With the distortion reduced or removed, the output signal is more readily recognisable by the speech recognition system, or other signal processing system, to which it is provided.
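The selection of the closest acoustic channel from the channel map may, purely by way of example, be sketched as below. The `ChannelEntry` record, its field names, the use of Cartesian coordinates in metres and the Euclidean distance criterion are all assumptions made for illustration; the description leaves the form of the map and the matching criterion open.

```python
import math
from dataclasses import dataclass


@dataclass
class ChannelEntry:
    """One hypothetical channel map entry."""
    mic_id: int               # which microphone the channel ends at
    location: tuple           # assumed (x, y, z) of the sound source, metres
    impulse_response: list    # assumed form of the channel definition


def select_channel(channel_map, mic_id, source_location):
    """Return the entry for mic_id whose location is nearest the source."""
    candidates = [e for e in channel_map if e.mic_id == mic_id]
    if not candidates:
        raise LookupError(f"no channel stored for microphone {mic_id}")
    return min(candidates,
               key=lambda e: math.dist(e.location, source_location))


if __name__ == "__main__":
    channel_map = [
        ChannelEntry(16, (0.0, 0.0, 0.0), [1.0, 0.3]),
        ChannelEntry(16, (1.0, 0.0, 0.0), [0.8, 0.4]),
        ChannelEntry(18, (0.9, 0.0, 0.0), [0.7, 0.2]),
    ]
    print(select_channel(channel_map, 16, (0.8, 0.1, 0.0)))
```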
A source sound and corresponding distorted microphone output signal, together with the relevant physical locations, may be used to define an appropriate acoustic channel transfer function. This may be achieved by introducing a known test sound into the acoustic space of the vehicle 10 at predefined sound source and receiver (microphone) positions, to estimate a respective acoustic channel. A disadvantage of this approach is that additional equipment has to be temporarily introduced into the vehicle, which is relatively expensive and impractical.
In preferred embodiments of the invention, the ACE 32 uses blind channel estimation techniques that require the input test sound and/or its source location to be only partially known. Blind channel estimation techniques are only effective however when constraints are imposed on the nature of the input sound source.
When the vehicle door 14 closes, a sound is generated that has relatively high energy compared to ambient vehicle noises. In addition, the door closing sound is repeatable, relatively short in duration and has a wideband audio frequency response. These signal characteristics make the door closing sound suitable for use with blind channel estimation techniques. Moreover, the location of the door 14 is known or can be measured. Typically a location that is deemed to represent the source of the door closing sound is defined, as illustrated in
The next step is to use the estimation of the acoustic channel, which in the present example is embodied by the respective transfer function, to improve speech recognition accuracy. This distortion correction is performed by the DC 34 in conjunction with the channel map 38. For example, when speaker 12 begins to talk, the DC 34 may determine his location by applying a speaker localization algorithm to the received output signal from the relevant microphone 16, 18. Any conventional algorithm may be used for this purpose, for example the generalized cross-correlation (GCC) time delay estimation algorithm. Alternatively, a location for the speaker can be determined by determining the direction of the detected speech with respect to the microphone. This method is particularly useful for vehicles in which there are relatively few (typically up to 7) possible seating positions for the speaker. The DC 34 then correlates the speaker's location with at least one acoustic channel in the channel map 38 associated with a sound source that is closest to the location determined for the speaker 12, e.g. the acoustic channel corresponding to the closest door closing sound. The characteristics of the selected acoustic channel from the channel map 38, as defined by the respective transfer function, are then used by the DC 34 to correct the microphone output signal for the speaker 12 to compensate for channel distortion. Typically, this involves applying an inverse of the transfer function to the microphone output signal for the speaker 12, although it will be understood that this depends on how the channels are defined in the channel map. More generally, compensation involves applying a mathematical function derived from the mathematical representation of the channel in order to fully or partially compensate for the effects of the acoustic channel. 
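A generalized cross-correlation time delay estimate of the kind mentioned above may be sketched as follows. The sketch assumes that signals from two microphones are available and uses the phase transform (PHAT) weighting, which is one common choice and is not specified by the description; the function name `gcc_phat_delay` and its parameters are hypothetical. The resulting delay could then be mapped to one of the candidate seating positions.

```python
import numpy as np


def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay (seconds) of sig relative to ref using GCC-PHAT.

    A positive result means sig arrives later than ref.
    """
    n = len(sig) + len(ref)                 # zero-pad to avoid wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12          # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs


if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(1)
    burst = rng.standard_normal(1000)
    mic_a = np.concatenate((burst, np.zeros(24)))
    mic_b = np.concatenate((np.zeros(5), burst, np.zeros(19)))  # 5 samples late
    print(gcc_phat_delay(mic_b, mic_a, fs) * fs)
```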
It is noted that some transfer functions cannot be inverted as a single channel but can be inverted in combination with one or more other transfer functions, for example in accordance with the multiple input/output inverse theorem (MINT). Even though the acoustic channel selected from the map may not be identical to the channel from the speaker 12, they are close enough to allow an improvement in speech recognition accuracy.
In cases where the channel map 38 is deemed not to include an acoustic channel estimation for a location close enough to the determined location of the speaker 12, the DC 34 may interpolate respective channel estimations for two or more acoustic channels in the channel map to produce an acoustic channel estimation for an acoustic channel closer to the determined location of the speaker 12.
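One possible form of such interpolation is sketched below. It assumes that each channel estimation is stored as an impulse response of equal length and that the responses are time-aligned, and it uses inverse-distance weighting, which is an illustrative choice only; the description does not prescribe an interpolation scheme. The function name `interpolate_channels` and the `power` parameter are hypothetical.

```python
import math


def interpolate_channels(entries, target, power=2.0):
    """Blend stored impulse responses by inverse-distance weighting.

    entries: list of (location, impulse_response) pairs, where the
             responses are assumed equal-length and time-aligned.
    target:  location for which a channel estimate is wanted.
    """
    weights = []
    for location, response in entries:
        distance = math.dist(location, target)
        if distance == 0.0:
            return list(response)       # exact match, no blending needed
        weights.append(1.0 / distance ** power)
    total = sum(weights)
    length = len(entries[0][1])
    return [
        sum(w * response[i] for w, (_, response) in zip(weights, entries)) / total
        for i in range(length)
    ]


if __name__ == "__main__":
    stored = [((0.0, 0.0, 0.0), [1.0, 0.0]),
              ((2.0, 0.0, 0.0), [0.0, 1.0])]
    print(interpolate_channels(stored, (1.0, 0.0, 0.0)))  # [0.5, 0.5]
```

Interpolating magnitude responses in the frequency domain instead of raw impulse responses may be more robust where the stored channels are not time-aligned.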
As well as compensating for the effects of distortion on speech emanating from a human speaker, the system 30 may be used to compensate for the effects of distortion on sound emanating from the loudspeaker 24 of an audio rendering system (e.g. radio, CD player, mp3 player, telephone system) incorporated into the vehicle 10. To this end the DC 34 may adjust the audio output signal from the audio system 40 before it is rendered by the loudspeaker 24 using at least one selected acoustic channel estimation from the channel map 38. The selected acoustic channel may be one associated with a location inside the vehicle 10 where the driver or a passenger is seated. In cases where the channel map 38 is deemed not to include an acoustic channel estimation for a suitable location, the DC 34 may interpolate respective channel estimations for two or more acoustic channels in the channel map to produce an acoustic channel estimation for a suitable acoustic channel. The DC 34 may select or adjust an acoustic channel, or produce an interpolated acoustic channel, in response to the detection of one or more events by the microphones 16, 18, e.g. the detection of speech from one or more locations within the vehicle, or the detection of a door opening or closing.
The channel map 38 preferably comprises estimations of acoustic channels between more than one sound source location and the microphone(s) in the vehicle. For example, the ACE 32 may create a respective acoustic channel estimation for each microphone using each of the vehicle doors as the sound source. Alternatively, or in addition, other naturally occurring sounds inside the vehicle, in particular sounds made by parts that are intrinsic to the vehicle e.g. the click of a key in the ignition, doors opening and shutting, doors locking, hazard and indicator light relay clicks, window operation, seat position locking, switch clicks, user control operation, wiper operation, or seat belt operation, can be used by the ACE 32 to generate acoustic channel estimations for the channel map 38. In particular sounds having relatively high energy compared to ambient noise, being of relatively short duration and having a wideband audio frequency content are suitable for this purpose. More generally, any specific sound that is impulsive in nature and contains multiple audio frequency components, preferably across substantially the entire bandwidth of audio frequencies (for example the sound of a door shutting), can be used to characterize an acoustic channel. Suitable sounds typically result from mechanical operation of a respective part of the vehicle, including mechanical operations caused by the action of a user. Optionally, the measurements from multiple sounds, in particular localised sounds, that each contain a range of audio frequencies can be combined to represent a single channel (which may be referred to as the full channel). Optionally, one or more devices (not shown) may be incorporated into the vehicle at one or more known locations (and which may be regarded as intrinsic) that are operable to generate, or which automatically generate, one or more suitable sounds, especially while the vehicle is driving.
The vehicle sound types identified above are highly repeatable in the same vehicle and consistent in character between different types and models of vehicle. Each sound is easily identifiable and originates from different identifiable locations within the vehicle 10. This allows the acoustic channel map 38 to be generated by the ACE 32 at different time intervals during the use of the vehicle. Since vehicle sounds re-occur when the vehicle is used and are associated with an in-vehicle event, the acoustic map 38 can be updated regularly while the vehicle is being used.
Sounds that occur naturally inside the vehicle often indicate that something has happened that may affect the accuracy of the current acoustic channel map 38. Preferably, detection of such sounds triggers an adjustment of the acoustic channel map. If, for instance, the passenger 20 leaves the vehicle 10, opening and closing the passenger door 22, the channel(s) of the acoustic channel map that have the passenger door 22 as the sound source may be re-calculated (two channels in the example of
During typical use of the vehicle 10, the driver enters the vehicle and turns on the ignition. Both of these actions generate a vehicle sound that can be used by the ACE 32 to characterize fully or partially the acoustic environment by the creation of, or updating of, respective acoustic channel estimations for the channel map 38. Should a further passenger enter the vehicle, the action of opening and closing the vehicle door creates sounds that allow the map 38 of acoustic channels to be updated. The vehicle 10 is then driven off and, should the driver or a passenger open a window, a further sound is generated that allows an update to the acoustic channel map. More generally, the system 30, and in particular the ACE 32, is configured to recognise at least one sound source that occurs during normal vehicle use and is detected by the, or each, microphone 16, 18 (more generally a single microphone or multiple microphones), and to use the corresponding microphone output signal, together with relevant sound source data (typically comprising a location associated with the sound and optionally a mathematical representation corresponding to the original sound, e.g. a model of the relevant sound type), to produce an acoustic channel estimation, e.g. comprising a transfer function, for inclusion in the channel map 38. It is noted that a representation, e.g. model, of the source sound is used in order to identify suitable segments of the microphone output signal, but depending on which acoustic (blind) channel estimation algorithm(s) are used the source sound representation is not necessarily needed to perform the acoustic channel estimation. However, the source sound representation is involved in creating the acoustic channel map since each sound source representation is associated with a location in the vehicle and so, once the acoustic channel has been estimated, the channel estimation may be associated with the said location to maintain the acoustic channel map.
The sound segmentation module 56 cuts a relatively long, and typically buffered, sound signal 62 into smaller sound segments 64 as shown in
The source identification module 58 determines whether or not each sound segment 64 corresponds with one of the naturally occurring vehicle sounds that can be used to characterize the acoustic environment inside a vehicle as described above, i.e. whether or not each sound segment 64 corresponds to a sound source that the ACE 32 is configured, or trained, to recognize.
With reference to
In use, source identification module 58 compares the sound segments 64 against the mathematical models 68 by any suitable pattern matching process 70 in order to identify which sound segments 64 correspond to valid recognizable sound sources. By way of example, any conventional probabilistic pattern matching algorithm may be used to identify the sound source. However, any conventional single channel or multi channel source estimation technique may alternatively be used.
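As a simplified stand-in for the pattern matching process, the following sketch compares a sound segment against stored reference waveforms using peak normalised cross-correlation. This is a template matching approach rather than the probabilistic model matching mentioned above, and it is offered only as an illustration; the function names, the `templates` dictionary and the acceptance `threshold` are hypothetical.

```python
import numpy as np


def normalised_peak_correlation(segment, template):
    """Peak absolute cross-correlation, scaled to lie between 0 and 1."""
    seg = segment - np.mean(segment)
    tpl = template - np.mean(template)
    denom = np.linalg.norm(seg) * np.linalg.norm(tpl)
    if denom == 0.0:
        return 0.0
    corr = np.correlate(seg, tpl, mode="full")
    return float(np.max(np.abs(corr)) / denom)


def identify_source(segment, templates, threshold=0.6):
    """Return (label, score) of the best template, or (None, score) if weak."""
    best_label, best_score = None, 0.0
    for label, template in templates.items():
        score = normalised_peak_correlation(segment, template)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    door = rng.standard_normal(200) * np.exp(-np.arange(200) / 40.0)
    relay = rng.standard_normal(50) * np.exp(-np.arange(50) / 10.0)
    templates = {"door_close": door, "relay_click": relay}
    segment = np.concatenate((np.zeros(30), door, np.zeros(30)))
    segment = segment + 0.01 * rng.standard_normal(len(segment))
    print(identify_source(segment, templates))
```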
The acoustic channel estimation module 60 supports the implementation of one or more algorithms that estimate the acoustic channel 52 (i.e. generate a definition, typically a mathematical representation such as a transfer function, of the channel) from one or more distorted sound signals 54 corresponding to a valid sound source, as identified by the source identification module 58. In preferred embodiments, two kinds of algorithms can be used to estimate the acoustic channel: a single channel algorithm; and/or a multi-channel algorithm. Conventional channel estimation algorithms, especially blind channel estimation algorithms, may be used by module 60, for example the blind single channel deconvolution using non-stationary signal processing technique proposed by Hopgood and Rayner (for single channels) or the multichannel frequency-domain LMS (MCFLMS) algorithm proposed by Huang and Benesty (for multiple channels).
The multi-channel source deconvolution module 82 estimates the frequency response of the channel by deconvolving the source sound data from the distorted sound inputs 54. The sound source data is provided by the sound source estimation module 80 as described above. In the present example, deconvolution is performed in the log spectral domain and the source data is deconvolved by way of log spectral subtraction. However, any conventional multi-channel deconvolution technique may be used for this purpose, for example involving time domain deconvolution or frequency domain division.
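Deconvolution by log spectral subtraction may be sketched as follows. Convolution in the time domain becomes multiplication of spectra, and hence addition of log spectra, so subtracting the log magnitude spectrum of the source estimate from that of each distorted input leaves an estimate of the log magnitude response of the respective channel. The function name, the `n_fft` and `floor` parameters and the use of a single analysis frame are assumptions for illustration; the sketch recovers magnitude only and discards phase.

```python
import numpy as np


def estimate_log_magnitude_responses(distorted_signals, source_estimate,
                                     n_fft=1024, floor=1e-10):
    """Estimate log|H(f)| per microphone by log spectral subtraction.

    n_fft should be at least as long as the longest input signal so that
    no samples are truncated. floor guards against taking log of zero.
    Returns an array of shape (number of microphones, n_fft // 2 + 1).
    """
    source_mag = np.abs(np.fft.rfft(source_estimate, n_fft))
    log_source = np.log(np.maximum(source_mag, floor))
    responses = []
    for distorted in distorted_signals:
        distorted_mag = np.abs(np.fft.rfft(distorted, n_fft))
        log_distorted = np.log(np.maximum(distorted_mag, floor))
        responses.append(log_distorted - log_source)
    return np.vstack(responses)


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    source = rng.standard_normal(200)
    channels = [np.array([1.0, 0.5]), np.array([1.0, -0.3, 0.1])]
    mics = [np.convolve(source, h) for h in channels]
    est = estimate_log_magnitude_responses(mics, source)
    true = np.log(np.abs(np.fft.rfft(channels[0], 1024)))
    print(np.max(np.abs(est[0] - true)))
```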
The sound segmentation process can be performed using any one or combination of conventional methods (for example Bayesian Information Criteria, model based, amongst others).
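As a much simpler alternative to the methods named above, and purely for illustration, segmentation may be sketched using short-time energy: frames whose energy rises well above an estimate of the ambient noise floor are grouped into segments. The function name `segment_by_energy`, the frame length, the use of the median frame energy as the noise floor estimate and the `threshold_ratio` are all assumptions.

```python
def segment_by_energy(signal, frame_len=256, threshold_ratio=4.0):
    """Return (start, end) sample indices of high-energy runs of frames.

    The noise floor is approximated by the median frame energy, which
    presumes that most of the buffered signal is ambient noise.
    """
    n_frames = len(signal) // frame_len
    energies = [
        sum(s * s for s in signal[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]
    if not energies:
        return []
    noise_floor = sorted(energies)[len(energies) // 2]
    threshold = threshold_ratio * max(noise_floor, 1e-12)

    segments, start = [], None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i                                  # segment begins
        elif energy <= threshold and start is not None:
            segments.append((start * frame_len, i * frame_len))
            start = None                               # segment ends
    if start is not None:                              # still open at the end
        segments.append((start * frame_len, n_frames * frame_len))
    return segments


if __name__ == "__main__":
    quiet = [0.01] * 1024
    loud = [1.0] * 512
    print(segment_by_energy(quiet + loud + quiet))  # [(1024, 1536)]
```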
The source identification process can be performed using any one or combination of conventional methods (for example threshold based methods, model based methods, template matching methods, amongst others).
Single channel deconvolution can be performed using any one or combination of conventional methods (for example frequency domain methods, time domain methods, model based methods, amongst others).
Multi channel source deconvolution can be performed using any one or combination of conventional methods (for example Independent component analysis, information maximization methods, adaptive beamforming methods, model based methods, amongst others).
The following advantageous aspects of preferred embodiments of the invention will be apparent from the foregoing. The sound sources used to characterize the acoustic environment are naturally occurring in the vehicle and so estimation of acoustic channels is simplified because no external sound reproduction equipment is required. Advantageously, the sound pressures generated by the sound sources are at levels where all frequency bands sit above the ambient noise floor level yet are acceptable to vehicle passengers. The sound sources are repeatable within the context of a single vehicle. The preferred sound sources are re-occurring and associated with in-vehicle events that commonly change the acoustic environment. Each sound source can be uniquely identified and physically located. The sound sources are at different physical locations within the vehicle and so allow generation of an acoustic channel map.
Although the invention is described herein in the context of a vehicle, it may be applied to other acoustic environments in which similar sound sources occur and are detectable by one or more microphones, for example an auditorium, theatre, cinema and so on.
The invention is not limited to the embodiment(s) described herein but can be amended or modified without departing from the scope of the present invention.
Number | Date | Country | |
---|---|---|---|
20150195647 A1 | Jul 2015 | US |