The invention relates to a method for processing speech, in particular to a method for emotion recognition and speaker identification.
In many systems with man-machine interfaces (MMI) it is desirable to integrate as much information as possible that can be derived from the various communication channels used by humans. In particular, it is often useful to include emotional information that describes the emotions of a user of a system, i.e. for example whether the user is angry, happy, or sad. This emotional information may be derived from a speech signal of the user and can then be used, e.g., to generate an appropriate response of the system. An example of a system where emotional information can be useful is a speech-operated automatic teller machine (ATM). If the user gets annoyed with the system, e.g. because the system has asked him to repeat an order several times, he may become impatient. This emotional state may be detected by the system, and the system's input mode may then be switched from speech to graphic/haptic input via a touch screen.
Another important aspect of today's MMI systems is the identification of speakers. In many systems it is important to know who is interacting with the system. For example, several people may share a car and certain parameters of the system may be set dependent on the current driver. It is therefore necessary that the driver be identified, which is commonly achieved by a speaker identification routine within the MMI system.
It is an object underlying the invention to provide a method for processing speech, in particular for emotion recognition and/or speaker identification.
To achieve this object, the invention provides a method according to claim 1. In addition, the invention provides a speech processing system, a computer program product, and a computer readable storage medium as defined in claims 9, 10, and 11, respectively. Further features and preferred embodiments are respectively defined in respective subclaims and/or in the following description.
According to the invention, a method for processing speech comprises the steps of receiving a speech input of a speaker, generating speech parameters from said speech input, determining parameters describing an absolute loudness of said speech input, and evaluating said speech input and/or said speech parameters using said parameters describing the absolute loudness.
This means that absolute loudness is used during the evaluation of said speech input in addition to the other parameters typically used in a classifier (e.g. a classifier for determining an emotional state of said speaker), such as prosodic features or voice quality features. Voice quality features, i.e. auditory features, arise from variations in the source signal and in the vocal tract properties, and are therefore strongly speaker dependent.
Preferably, the step of evaluation comprises a step of emotion recognition and/or a step of speaker identification. The use of absolute loudness as a parameter for emotion recognition and speaker identification is a key feature of the invention. The rates of successful emotion recognition and of successful speaker identification improved significantly when absolute loudness was used as an additional input parameter for the respective recognition systems.
Advantageously, a microphone array comprising a plurality of microphones, i.e. at least two microphones, is used for determining said parameters describing the absolute loudness. With a microphone array the distance of the speaker from the microphone array can be determined by standard algorithms and the loudness can be normalized by the distance.
This is done by estimating a time difference between microphones using correlation techniques.
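As a rough, non-normative sketch of such a correlation-based estimate, the delay between two microphone signals can be read off the peak of their cross-correlation; the sampling rate and the toy signals below are illustrative assumptions, and turning several pairwise delays into an actual position and distance (e.g. by triangulation over the array geometry) is left out.

```python
import numpy as np

def estimate_time_difference(x_ref, x_delayed, sample_rate):
    """Delay (in seconds) of x_delayed relative to x_ref, taken from the
    peak of the full cross-correlation of the two microphone signals."""
    corr = np.correlate(x_delayed, x_ref, mode="full")
    lag = int(np.argmax(corr)) - (len(x_ref) - 1)   # lag in samples
    return lag / sample_rate

# Toy check: the same noise burst arrives 5 samples later at the second microphone.
fs = 16000
rng = np.random.default_rng(0)
source = rng.standard_normal(1024)
mic_a = source
mic_b = np.concatenate([np.zeros(5), source[:-5]])
print(estimate_time_difference(mic_a, mic_b, fs))   # approx. 5 / 16000 s
```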
Further, a location and/or distance of the speaker is determined, and the absolute loudness is determined using standard algorithms for auditory and/or binaural processing. For this purpose, an artificial head or a similar shape with two microphones mounted at the ear positions is used. Processing in the ear is simulated, i.e. the time delay and amplitude difference between the two "ears" are estimated and used to determine the speaker's position exactly.
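For this two-microphone "artificial head" variant, a minimal sketch of turning the interaural cues into a direction estimate could look as follows; the ear spacing, the speed of sound, and the far-field arcsin model are assumptions made purely for illustration and are not taken from the description.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s at room temperature (assumed)
EAR_SPACING = 0.18       # m between the two "ears" of the artificial head (assumed)

def interaural_cues(left, right, sample_rate):
    """Interaural time difference (s) and level difference (dB) of two ear signals."""
    corr = np.correlate(right, left, mode="full")
    itd = (int(np.argmax(corr)) - (len(left) - 1)) / sample_rate
    ild = 10.0 * np.log10(np.sum(right ** 2) / np.sum(left ** 2))
    return itd, ild

def azimuth_from_itd(itd):
    """Far-field approximation: the delay maps to an arrival angle via arcsin."""
    s = np.clip(SPEED_OF_SOUND * itd / EAR_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))
```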
Said absolute loudness is preferably computed by normalizing the loudness measured at the microphones (signal gain), or the signal energy, by said distance. Preferably, this is done by multiplication, i.e. distance times energy.
Said distance is thereby determined using standard algorithms for speaker localization. The normalization by the distance is a key feature of the invention, because it transforms the measured loudness into the absolute loudness. By normalizing the loudness by the distance, the determined absolute loudness becomes independent of the distance of the speaker from the microphone.
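Written out, this normalization amounts to a single multiplication; the helper below is a direct transcription of the "distance times energy" rule and nothing more.

```python
def absolute_loudness(measured_energy, distance_m):
    """Distance-normalized ("absolute") loudness: distance times measured energy.
    Under the assumption that the measured energy falls off with 1/distance,
    the result no longer depends on how far the speaker is from the microphone."""
    return distance_m * measured_energy
```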
In prior art emotion recognition systems and speaker identification systems, loudness could not be used, because a speaker speaking with the same loudness appeared to speak with a different loudness depending on his distance from the microphone.
A speech processing system according to the invention is capable of performing or realizing the inventive method for processing speech and/or the steps thereof.
A computer program product according to the invention comprises computer program means adapted to perform and/or to realize the inventive method for processing speech and/or the steps thereof, when it is executed on a computer, a digital signal processing means, and/or the like.
A computer readable storage medium according to the invention comprises the inventive computer program product.
The invention and advantageous details thereof will be explained in the following by way of an exemplary embodiment with reference to the accompanying drawings.
In a distance computing step CD, the distance D of the speaker S from the microphone array MA is determined, i.e. the speaker is localized. Thereby, the time difference TD (also referred to as time delay) between the microphones is estimated using correlation techniques.
The distance D is further used in a loudness computing step CL, wherein the absolute loudness L is determined, which is measured in units of dB(A). The absolute loudness L is determined using the signal energy, i.e. the absolute loudness is the energy normalized by the distance D.
Thereby, the signal energy is measured in a window, e.g. by

E = Σ_n s_n²,

where s_n is the digitized speech signal within the window. Many alternative formulas exist. In a similar way, the signal energy E can be computed from the spectrum; in that case, a frequency-based weighting according to the ear's sensitivity in different frequency bands can be applied. Since the energy decreases in proportion to 1/D, with D being the distance between the speaker and the microphone, the absolute loudness L is obtained by multiplying the measured energy E by the distance D.
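A small sketch of this energy computation is given below; the optional per-bin weights (standing in for an ear-sensitivity curve such as A-weighting) and the dB reference value are illustrative assumptions, not values taken from the description.

```python
import numpy as np

def window_energy(s, start, length):
    """Plain time-domain energy E = sum(s_n ** 2) over one analysis window."""
    frame = np.asarray(s[start:start + length], dtype=np.float64)
    return float(np.sum(frame ** 2))

def spectral_energy(s, start, length, weights=None):
    """The same energy computed from the magnitude spectrum (Parseval);
    'weights' is an optional per-bin ear-sensitivity weighting sampled at the
    FFT bin frequencies (an assumption, standing in for e.g. A-weighting)."""
    frame = np.asarray(s[start:start + length], dtype=np.float64)
    power = np.abs(np.fft.rfft(frame)) ** 2
    if weights is not None:
        power = power * weights
    scale = np.full_like(power, 2.0)       # rfft keeps only half the spectrum,
    scale[0] = 1.0                         # so double all bins except DC ...
    if length % 2 == 0:
        scale[-1] = 1.0                    # ... and, for even lengths, Nyquist.
    return float(np.sum(scale * power) / length)

def absolute_loudness_db(energy, distance_m, reference=1e-12):
    """Distance-normalized level expressed in dB relative to an arbitrary reference."""
    return 10.0 * np.log10(distance_m * energy / reference)
```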
The absolute loudness L is now used in an evaluation step EV. This evaluation step EV may comprise a speaker identification and/or emotion recognition.
Besides the absolute loudness L, standard features ESF are used in the evaluation step EV. These standard features ESF are extracted in a standard feature extracting step SFES, in parallel to the distance computing step CD and the loudness computing step CL. In this standard feature extracting step SFES, the speech parameters SP received from the microphone array MA are processed.
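To picture how the evaluation step EV could consume these inputs, the sketch below simply appends the absolute loudness to a small stand-in feature vector and hands everything to a generic classifier; the particular statistics and the SVM are assumptions made for illustration and do not represent the standard features ESF or the classifier actually used.

```python
import numpy as np
from sklearn.svm import SVC      # any off-the-shelf classifier would do here

def standard_features(s, sample_rate):
    """Tiny stand-in for the standard feature set ESF: a few per-utterance
    energy and zero-crossing statistics (illustrative only)."""
    s = np.asarray(s, dtype=np.float64)
    frame = sample_rate // 100                       # 10 ms frames
    frames = s[: len(s) - len(s) % frame].reshape(-1, frame)
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-12)
    zcr = np.mean(np.abs(np.diff(np.sign(s)))) / 2.0
    return np.array([log_e.mean(), log_e.std(), zcr])

def feature_vector(s, sample_rate, distance_m):
    """Standard features plus the absolute loudness as one additional dimension."""
    s = np.asarray(s, dtype=np.float64)
    loudness = distance_m * float(np.sum(s ** 2))    # absolute loudness feature
    return np.append(standard_features(s, sample_rate), loudness)

# Training and classification then proceed as with any other feature set, e.g.:
#   X = np.vstack([feature_vector(s, fs, d) for s, d in labelled_utterances])
#   clf = SVC().fit(X, labels)
#   prediction = clf.predict(feature_vector(new_s, fs, new_d).reshape(1, -1))
```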
It should be noted that loudness could not be used for emotion recognition and/or speaker identification in prior art systems, because such systems use only one microphone. With only one microphone, the measured loudness depends on the distance of the speaker to the microphone. Moreover, in prior art systems the speech signal is normalized to eliminate any "disturbing" variance of loudness, which further prevents the use of loudness for emotion recognition and/or speaker identification.
With the invention, the absolute loudness can now be determined and be used for emotion recognition and/or speaker identification. In this context it is assumed that absolute loudness can be important for emotion recognition and also is characteristic for speakers and thus carries valuable information for speaker identification.