Priority is claimed on Japanese Patent Application No. 2013-261544, filed on Dec. 18, 2013, the contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to a sound processing apparatus, a sound processing method, and a sound processing program.
2. Background
A sound system has been proposed which adjusts the sound quality and sound volume of a sound signal to be broadcast inside a room. In such a sound system, a plurality of predetermined band noise signals are output from a loudspeaker provided inside the room, and a noise signal detected by a microphone provided in the sound field of the loudspeaker is analyzed to measure a transfer function (for example, refer to Japanese Patent Application, Publication No. 2002-328682A).
In this way, a sound signal emitted from a loudspeaker is collected by a microphone, and a transfer function is obtained from the collected sound signal. The obtained transfer function is used for noise suppression, or estimation of the direction and the position of a sound source.
However, according to the technique described above, when a process on a speech signal uttered by a talker (a speaker, a person) is performed, if a measuring point by a loudspeaker and an uttering position of a talker are slightly mismatched, the accuracy of the process is degraded. Further, according to the technique described above, it is difficult to match a sound volume of an actual talker and a sound volume of a preliminary measurement for measuring a transfer function. As a result, according to the technique described above, there is a problem that since reverberation characteristics and the like change due to the difference of the sound volumes, the accuracy of the process is insufficient.
An object of an aspect of the present invention is to provide a sound processing apparatus, a sound processing method, and a sound processing program capable of accurately estimating a transfer function in a sound field.
According to the aspect of the above (1), (9), or (11), it is possible to accurately estimate a transfer function in a sound field.
According to the aspect of the above (2), since the second sound collecting unit is unnecessary, the size of the apparatus can be reduced, and it is possible to estimate a transfer function when a talker utters.
According to the aspect of the above (3), (10), or (12), only by the first sound collecting unit, it is possible to accurately estimate a transfer function based on the delayed sound signals and a selected representative signal.
According to the aspect of the above (4), since the second sound collecting unit can collect the sound signal uttered by a talker in a state where there is no reflected sound, it is possible to accurately estimate a transfer function.
According to the aspect of the above (5), since a transfer function which is already stored in the storage unit can be used, it is possible to save time to estimate a transfer function.
According to the aspect of the above (6), since the sound signal uttered by a talker can be collected when a transfer function is not stored in the storage unit, it is possible to accurately estimate a transfer function.
According to the aspect of the above (7) or (8), since the estimated transfer function can be sequentially updated or interpolated, it is possible to accurately estimate a transfer function.
First, a problem is described that arises when, in a narrow space such as the inside of a vehicle, a loudspeaker is assumed to be a talker (a speaker, a person) and a sound signal emitted from the loudspeaker is collected by a microphone to estimate a transfer function.
For example, since the diameter of the loudspeaker is greater than the size of the mouth of a talker, the reflection time of a reflected sound differs between sound signals emitted from the vibration plate of the loudspeaker, depending on the position between the center and the periphery of the vibration plate. Further, depending on the sound volume from the loudspeaker, multiple reflections may occur. An example of multiple reflections is double reflection, in which a sound signal emitted from the loudspeaker is reflected by a seat of the vehicle and then further reflected by a steering wheel of the vehicle. In such a case, since the sound signal after reflection differs from the assumed speech signal uttered by a talker, it is impossible to estimate a transfer function with good accuracy by using the sound signal after reflection. Further, it is difficult to arrange a loudspeaker inside a vehicle at the same position as the mouth of a talker.
Because of these problems, when a loudspeaker and a microphone are arranged inside a vehicle, a sound signal emitted from the loudspeaker is collected by the microphone, and a transfer function is estimated from the collected sound signal, speech recognition using the transfer function achieves a recognition rate of only about 30%.
Next, an outline of an embodiment of the present invention is described.
In a sound processing apparatus according to an embodiment of the present invention, a transfer function of a sound field is estimated using speech by an actual talker.
Thereby, the difference in reflection caused by the diameter of the loudspeaker described above is resolved, the number of reflections inside the room matches that of an actual talker, and the problem relating to the position of the mouth of a talker is also solved.
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
The second sound collecting unit 20 collects a sound signal of one channel and transmits the collected sound signal of one channel to the sound processing apparatus 10. The second sound collecting unit 20 is a close-talking microphone worn by a talker. The second sound collecting unit 20 includes, for example, one microphone which receives a sound wave having a component of a frequency band (for example, 200 Hz to 4 kHz). The second sound collecting unit 20 may transmit the collected sound signal of one channel in a wireless manner or a wired manner.
The second sound collecting unit 20 may be, for example, a mobile phone having a microphone. In this case, the mobile phone may transmit an acquired sound signal to the second sound signal acquiring unit 101, for example, in a wireless manner.
The first sound collecting unit 30 collects sound signals of M (M is an integer greater than 1, for example, 8) channels and transmits the collected sound signals of M channels to the sound processing apparatus 10. The first sound collecting unit 30 includes, for example, M microphones 301-1 to 301-M which receive a sound wave having a component of a frequency band (for example, 200 Hz to 4 kHz). Hereinafter, the microphones 301-1 to 301-M are referred to simply as the microphone 301 unless otherwise stated. The first sound collecting unit 30 may transmit the collected sound signals of M channels in a wireless manner or a wired manner. When M is greater than 1, the sound signals only have to be synchronized with each other between the channels at the time of transmission.
The second sound signal acquiring unit 101 acquires the one sound signal collected by the one microphone of the second sound collecting unit 20. The second sound signal acquiring unit 101 outputs the acquired one sound signal to the transfer function estimating unit 103. Alternatively, the second sound signal acquiring unit 101 applies Fourier transform on the acquired one sound signal for each frame in a time domain and thereby generates an input signal in a frequency domain. The second sound signal acquiring unit 101 outputs the one sound signal applied with Fourier transform to the transfer function estimating unit 103.
The first sound signal acquiring unit 102 acquires the M sound signals collected by the M microphones 301 of the first sound collecting unit 30. The first sound signal acquiring unit 102 outputs the acquired M sound signals to the transfer function estimating unit 103. Alternatively, the first sound signal acquiring unit 102 applies Fourier transform on the acquired M sound signals for each frame in a time domain and thereby generates input signals in a frequency domain. The first sound signal acquiring unit 102 outputs the M sound signals applied with Fourier transform to the transfer function estimating unit 103.
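The framing and per-frame Fourier transform performed by the acquiring units may be sketched, for example, as follows. This is only a minimal illustration assuming a fixed frame length and shift length; the function name frames_to_spectra and its parameters are hypothetical and are not defined in the embodiment.

```python
import numpy as np

def frames_to_spectra(signal, frame_len, shift):
    """Split a 1-D time-domain signal into frames and transform each frame."""
    signal = np.asarray(signal, dtype=float)
    # Number of complete frames that fit into the signal.
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])
    # One row of frequency-domain values per frame.
    return np.fft.rfft(frames, axis=1)

spectra = frames_to_spectra(np.arange(16.0), frame_len=8, shift=4)
```

For the M channels of the first sound collecting unit 30, the same function could be applied to each channel separately.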
The transfer function estimating unit 103 estimates a transfer function as described below by using the sound signal input from the second sound signal acquiring unit 101 and the first sound signal acquiring unit 102 and causes the storage unit 109 to store the estimated transfer function. The transfer function estimating unit 103 may associate a talker and a transfer function and may cause the storage unit 109 to store the transfer function associated with the talker, for example, in such a case that there are a plurality of drivers who use a vehicle. In this case, for example, in response to information input by a driver via an operation unit (not shown), the transfer function estimating unit 103 reads out and uses a transfer function corresponding to the driver, of the transfer functions stored in the storage unit 109.
A transfer function is stored in the storage unit 109. In such a case that there are a plurality of drivers who use a vehicle, a talker and a transfer function are associated and stored in the storage unit 109.
The sound source localizing unit 104 reads out a transfer function stored in the storage unit 109 corresponding to a sound signal input from the first sound signal acquiring unit 102 and estimates a sound source direction by using the transfer function which is read out (hereinafter, referred to as sound source localization). The sound source localizing unit 104 outputs information indicating a result of performing sound source localization to the sound source separating unit 105.
The sound source separating unit 105 reads out a transfer function stored in the storage unit 109 corresponding to the information indicating a result of performing sound source localization input from the sound source localizing unit 104 and performs sound source separation of a target sound and noise by using the transfer function which is read out. The sound source separating unit 105 outputs a signal corresponding to each sound source obtained by the sound source separation to the sound feature value extracting unit 106. The target sound includes, for example, speech uttered by a talker. Noise includes a sound other than the target sound, such as wind noise or a sound emitted from another apparatus disposed in a room where sound collection is performed.
The sound feature value extracting unit 106 extracts a sound feature value of the signal corresponding to each sound source input from the sound source separating unit 105 and outputs information indicating each extracted sound feature value to the speech recognizing unit 107.
When speech uttered by a person is included in a sound source, the speech recognizing unit 107 performs speech recognition based on the sound feature value input from the sound feature value extracting unit 106 and outputs a recognition result of the speech recognition to the output unit 108.
The output unit 108 is, for example, a display device, a sound signal output device, or the like. The output unit 108 displays information based on the recognition result input from the speech recognizing unit 107 on, for example, a display unit.
As shown by an image of an arrow 401, a sound signal uttered by a talker is propagated directly to the second sound collecting unit 20. On the other hand, as shown by an image of an arrow 402, a sound signal uttered by a talker is propagated directly to or is propagated, after being reflected by a seat, a steering wheel, and the like of the vehicle, to the first sound collecting unit 30.
The relation between a transfer function and a sound signal collected by the second sound collecting unit 20 and the first sound collecting unit 30 is described.
In
$$x_1(t) = a_1(t) \otimes s(t) \tag{1}$$
In Expression (1), an operator indicated by X in a circle is an operator of tensor product. Further, when the order is N, Expression (1) is expressed by Expression (2).
Further, Expression (1) is expressed by Expression (3) in a frequency domain.
$$X_1(\omega) = A_1(\omega) S(\omega) \tag{3}$$
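The relation between the time-domain expression and the frequency-domain product of Expression (3) can be checked numerically, for example, as follows. This sketch assumes that the operator in Expression (1) acts as a convolution in time; the impulse response a and the signal s are hypothetical values chosen only for illustration.

```python
import numpy as np

a = np.array([1.0, 0.5, 0.25])          # hypothetical impulse response a_1(t)
s = np.array([0.2, -0.4, 0.1, 0.3])     # hypothetical close-talking signal s(t)

# Time domain: the observation is the convolution of a_1(t) and s(t).
x_time = np.convolve(a, s)

# Frequency domain: zero-pad both to the full output length so that the
# product A_1(w) S(w) corresponds to the linear (not circular) convolution.
n = len(a) + len(s) - 1
x_freq = np.fft.ifft(np.fft.fft(a, n) * np.fft.fft(s, n)).real
```

Both computations give the same observation, which is why the transfer function can be estimated in either domain.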
Next, an acoustic model when the number of the microphone 301 of the first sound collecting unit 30 is M is described.
In
Further, when the order is N, Expression (4) is expressed by Expression (5).
Further, Expression (4) is expressed by Expression (6) in a frequency domain.
Next, an estimation method of a transfer function in the present embodiment is described. In the present embodiment, the transfer function estimating unit 103 estimates a transfer function by using any of the following seven methods.
First, a method in which the transfer function estimating unit 103 calculates a transfer function by using a regression model is described. The regression model is a model used when the correlation between independent values is examined or the like. The regression model is expressed by a product of a regressor (independent variable) and a base parameter which is an unknown parameter. The method described below is also referred to, hereinafter, as a TD (Time Domain) method.
First, when assuming first to N-th samples as one frame, an observation value x[N]T of one frame in a time domain is expressed by Expression (7).
$$x_{[N]}^{T} = s_{[1:N]}^{T} a^{T}(t) \tag{7}$$
In Expression (7), x[N]T is an observation value, s[1:N]T is a regressor, and aT(t) is a base parameter, in the regression model. The x[N]T is a value based on a sound signal collected by the first sound collecting unit 30, s[1:N]T is a value based on a sound signal collected by the second sound collecting unit 20, and aT(t) is a transfer function to be obtained. In Expression (7), superscript T represents a transposed matrix.
Next, the observation values for F frames are expressed by Expression (8).
In Expression (8), the shift length between the frames is arbitrary, but the shift length for the TD method in the present embodiment is one in general. Therefore, in case of F frames, Expression (9) may be used.
In Expression (8), when the left-hand term is defined as x[N|1:F] and the right-hand term relating to s is defined as Φ, a least square estimation value of the transfer function aT(t) which makes a residual error square sum minimum is expressed by Expression (10). That is, the transfer function estimating unit 103 estimates a transfer function by using Expression (10).
$$a^{T}(t) = (\Phi^{T}\Phi)^{-1}\Phi^{T} x_{[N|1:F]} \tag{10}$$
In Expression (10), (ΦTΦ)−1ΦT is a pseudo inverse matrix of Φ. That is, Expression (10) represents that the transfer function aT(t) is estimated by multiplying the observation value x[N|1:F] by the pseudo inverse matrix of Φ.
In the present embodiment, only T samples from the beginning of samples in a signal are used. Hereinafter, T is referred to as a usage order.
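As a minimal sketch of the TD method, the least square estimation of Expression (10) may be written, for example, as follows. The sketch assumes a shift length of one between frames and a regressor matrix Φ whose rows hold the most recent samples of the signal of the second sound collecting unit 20; the function name estimate_td and the argument order are hypothetical.

```python
import numpy as np

def estimate_td(s, x, order):
    """Least-squares estimate of a so that x[n] ~ sum_k a[k] * s[n - k]."""
    s = np.asarray(s, dtype=float)
    x = np.asarray(x, dtype=float)          # shape (n_samples,) or (n_samples, M)
    # Regressor matrix Phi: one row per frame (shift length 1);
    # the row for sample n holds s[n], s[n-1], ..., s[n-order+1].
    rows = [s[n - order + 1 : n + 1][::-1] for n in range(order - 1, len(s))]
    phi = np.array(rows)
    target = x[order - 1 : len(s)]
    # Multiply the observations by the pseudo inverse matrix of Phi.
    return np.linalg.pinv(phi) @ target

# Usage with a synthetic, noise-free signal (hypothetical values).
rng = np.random.default_rng(0)
s = rng.standard_normal(200)
a_true = np.array([1.0, -0.5, 0.25])
x = np.convolve(s, a_true)[:200]
a_hat = estimate_td(s, x, order=3)
```

Because Φ has one row per sample, the cost grows quickly with the signal length, which is consistent with the larger computation amount attributed to the TD method.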
In the present embodiment, an example is described in which estimation of a transfer function in a sound signal is performed. However, the present method can be applied to estimation of a transfer function in a non-linear model in the control of a mechanical system or the like. For example, according to the present embodiment, it is possible to estimate a parameter of a model, such as mass or inertia moment of an inverted pendulum which is one of non-linear mechanical systems, by using a regression model derived from Lagrange's motion equation.
Next, a method in which the transfer function estimating unit 103 estimates a transfer function by using a complex regression model in a frequency domain is described. The complex regression model is a complexly extended model of the regression model in a time domain. The method described below is also referred to, hereinafter, as a FD (Frequency Domain) method.
First, when assuming first to N-th samples as one frame, an observation value X[N]T of one frame in a frequency domain is expressed by Expression (11).
$$X_{[N]}^{T} = S_{[N]} A^{T}(\omega) \tag{11}$$
In Expression (11), X[N]T is an observation value, S[N] is a regressor, and AT(ω) is a base parameter, in the regression model. The X[N]T is a value based on a sound signal collected by the first sound collecting unit 30, S[N] is a value based on a sound signal collected by the second sound collecting unit 20, and AT(ω) is a transfer function to be obtained. In Expression (11), S[N] is a complex scalar.
Next, the observation values for F frames are expressed by Expression (12).
In Expression (12), when the left-hand term is defined as x[N|1:F] and the right-hand term relating to S is defined as Φ, a least square estimation value of the transfer function AT(ω) which makes a residual error square sum minimum is expressed by Expression (13). That is, the transfer function estimating unit 103 estimates a transfer function by using Expression (13).
$$A^{T}(\omega) = (\Phi^{T}\Phi)^{-1}\Phi^{T} x_{[N|1:F]} \tag{13}$$
Similar to Expression (10), Expression (13) represents that the transfer function AT(ω) is estimated by multiplying the observation value x[N|1:F] by the pseudo inverse matrix of Φ.
In the FD method described above, only T samples from the beginning of samples in a signal are used.
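The FD method may be sketched, for example, as a separate least square problem for each frequency bin. The array layout (F frames, K frequency bins, M channels) and the function name estimate_fd are hypothetical. Note also one assumption: Expression (13) is written with a transpose, whereas for complex values the least square solution is normally formed with the conjugate transpose, so the sketch delegates the solution to np.linalg.lstsq, which accepts complex input.

```python
import numpy as np

def estimate_fd(S, X):
    """Per-bin complex least squares.

    S: (F, K) spectra of the close-talking signal (F frames, K bins).
    X: (F, K, M) spectra of the M channels of the microphone array.
    Returns the transfer function estimate with shape (K, M).
    """
    S = np.asarray(S)
    X = np.asarray(X)
    A = np.empty(X.shape[1:], dtype=complex)
    for k in range(S.shape[1]):
        phi = S[:, k:k + 1]                       # regressor column, shape (F, 1)
        # lstsq minimises the residual square sum over the F frames.
        solution = np.linalg.lstsq(phi, X[:, k, :], rcond=None)[0]
        A[k] = solution[0]
    return A
```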
In the FD method described above, a window function can be used when x[n]T is converted into X[n]T by Fourier transform and when s[n] is converted into S[n] by Fourier transform. For example, the window function to be used is a Hamming window function. Thereby, in the FD method described above, since the number of samples cut out of the signal for use in estimation of a transfer function can be appropriately selected, it is possible to reduce the computation amount compared to the TD method.
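Applying a Hamming window before the Fourier transform may look, for example, as follows. The function name windowed_spectrum is hypothetical, and the sketch handles a single frame of a single channel.

```python
import numpy as np

def windowed_spectrum(frame):
    """Apply a Hamming window to one time-domain frame, then transform it."""
    frame = np.asarray(frame, dtype=float)
    window = np.hamming(len(frame))
    return np.fft.rfft(frame * window)

spectrum = windowed_spectrum(np.ones(8))
```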
Here, selection of a window function to be used is described.
The transfer function estimating unit 103 may predetermine a window function to be used. Alternatively, the transfer function estimating unit 103 may prepare a plurality of window functions to be used and may select any of the window functions depending on a sound field or a talker. For example, speech recognition may be performed by use of the configuration shown in
The shift length between frames in the FD method may be arbitrary since a transfer function of a sound field is unchanged by time. When the shift length is long, a calculation amount can be reduced, but the performance of estimation is degraded since the number of frames used in estimation of a transfer function is reduced. Therefore, the shift length between frames in the FD method is appropriately set corresponding to a desired estimation accuracy.
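The trade-off between the shift length and the number of frames available for estimation can be illustrated, for example, as follows. The signal length, frame length, and shift lengths are hypothetical values, and count_frames is a hypothetical helper.

```python
def count_frames(n_samples, frame_len, shift):
    """Number of complete frames obtained with a given shift length."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // shift

n_samples, frame_len = 16000, 512      # hypothetical signal and frame lengths
frame_counts = {shift: count_frames(n_samples, frame_len, shift)
                for shift in (128, 256, 512)}
```

A longer shift length yields fewer frames, so fewer values are averaged or regressed per frequency bin.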
In the FD method, since a regression model is used, a transfer function that makes a square error in an observation sample minimum can be obtained. Therefore, it is possible to estimate a transfer function having high accuracy.
Next, a method in which the transfer function estimating unit 103 estimates a transfer function by use of an addition average between frames in a frequency domain is described. The method described below is also referred to, hereinafter, as a FDA (Frequency Domain Average) method.
First, similar to the FD method, when assuming first to N-th samples as one frame, an observation value X[N]T of one frame is the same as that of the FD method expressed by Expression (11). The observation values for F frames are the same as those of the FD method expressed by Expression (12).
The transfer function estimating unit 103 estimates a transfer function AT(ω) by calculating an average of values obtained by dividing an output value by an input value, using Expression (14).
Expression (14) represents that a transfer function AT(ω) is estimated by calculating an average value of values, each of the values being obtained in each frame by dividing a value X[N]T based on a sound signal collected by the first sound collecting unit 30 which is an output value, by a value S[N] based on a sound signal collected by the second sound collecting unit 20 which is an input value.
The transfer function AT(ω) is converted into N samples by inverse Fourier transform. In the present embodiment, only T samples from the beginning of samples in a signal are used.
In the FDA method described above, similar to the FD method, a window function can be used when x[n]T is converted into X[n]T by Fourier transform and when s[n] is converted into S[n] by Fourier transform. For example, the window function to be used is a Hamming window function. Thereby, in the FDA method described above, since the number of samples cut out of the signal for use in estimation of a transfer function can be appropriately selected, it is possible to reduce the computation amount compared to the TD method.
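The FDA method, followed by the inverse Fourier transform and truncation to the usage order T, may be sketched, for example, as follows. The array layout (F frames, K bins, M channels) and the names estimate_fda and to_impulse_response are hypothetical. The sketch also assumes that no frequency bin of the input value is zero, since each frame is divided by it.

```python
import numpy as np

def estimate_fda(S, X):
    """Average over frames of the per-frame ratio output / input.

    S: (F, K) spectra of the close-talking signal (input value).
    X: (F, K, M) spectra of the microphone array (output value).
    """
    S = np.asarray(S)
    X = np.asarray(X)
    return np.mean(X / S[:, :, None], axis=0)     # shape (K, M)

def to_impulse_response(A, frame_len, usage_order):
    """Inverse transform A (K, M) and keep the first usage_order samples."""
    return np.fft.irfft(A, n=frame_len, axis=0)[:usage_order]
```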
Also in the FDA method, the shift length between frames may be arbitrary since a transfer function of a sound field is unchanged by time. When the shift length is long, the calculation amount can be reduced, but the performance of estimation is degraded since the number of frames used in estimation of a transfer function is reduced. Therefore, the shift length between frames in the FDA method is appropriately set corresponding to a desired estimation accuracy.
Next, a method in which the transfer function estimating unit 103 estimates a transfer function by use of an addition average between frames in a frequency domain is described. The method described below is also referred to, hereinafter, as a FDN (Frequency Domain Normalize) method.
First, similar to the FD method, when assuming first to N-th samples as one frame, an observation value X[N]T of one frame is the same as that of the FD method expressed by Expression (11). The observation values for F frames are the same as those of the FD method expressed by Expression (12).
The transfer function estimating unit 103 estimates a transfer function AT(ω) by calculating an average value of output values and an average value of input values separately and dividing the calculated output average value by the calculated input average value, using Expression (15).
Expression (15) represents that a transfer function AT(ω) is estimated by dividing an average value of values X[N]T by an average value of values S[N], each of the values X[N]T being obtained in each frame based on a sound signal collected by the first sound collecting unit 30 and being an output value, and each of the values S[N] being obtained in each frame based on a sound signal collected by the second sound collecting unit 20 and being an input value.
The transfer function AT(ω) is converted into N samples by inverse Fourier transform. In the present embodiment, only T samples from the beginning of samples in a signal are used.
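The FDN method differs from the FDA method only in the order of averaging and division, which may be sketched, for example, as follows. The array layout and the function name estimate_fdn are hypothetical.

```python
import numpy as np

def estimate_fdn(S, X):
    """Ratio of the frame-averaged output to the frame-averaged input.

    S: (F, K) spectra of the close-talking signal (input value).
    X: (F, K, M) spectra of the microphone array (output value).
    """
    S = np.asarray(S)
    X = np.asarray(X)
    mean_output = np.mean(X, axis=0)              # shape (K, M)
    mean_input = np.mean(S, axis=0)               # shape (K,)
    return mean_output / mean_input[:, None]
```

Since only one division per frequency bin is performed, a single frame with a very small input value has less influence than in a per-frame division.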
In the FDN method described above, similar to the FD method, a window function can be used when x[n]T is converted into X[n]T by Fourier transform and when s[n] is converted into S[n] by Fourier transform. For example, the window function to be used is a Hamming window function. Thereby, in the FDN method described above, since the number of samples cut out of the signal for use in estimation of a transfer function can be appropriately selected, it is possible to reduce the computation amount compared to the TD method.
Also in the FDN method, the shift length between frames may be arbitrary since a transfer function of a sound field is unchanged by time. When the shift length is long, a calculation amount can be reduced, but the performance of estimation is degraded since the number of frames used in estimation of a transfer function is reduced. Therefore, the shift length between frames in the FDN method is appropriately set based on the desired estimation accuracy.
Next, a method in which the transfer function estimating unit 103 estimates a transfer function by use of an addition average between frames in a frequency domain is described. The method described below is also referred to, hereinafter, as a FDP (Frequency Domain Phase Average) method.
First, similar to the FD method, when assuming first to N-th samples as one frame, an observation value X[N]T of one frame is the same as that of the FD method expressed by Expression (11). The observation values for F frames are the same as those of the FD method expressed by Expression (12).
By using an amplitude value averaged between frames and selecting the phase of the frame considered most reliable (assume this frame is the k-th frame; here, k is a value equal to or more than 1 and equal to or less than F), a transfer function AT(ω) is expressed by Expression (16).
In Expression (16), ∠ represents a phase angle. In the right-hand first term of Expression (16), an average value of absolute values of X[N]T, each of the absolute values of X[N]T being obtained in each frame based on a sound signal collected by the first sound collecting unit 30, is divided by an average value of absolute values of S[N], each of the absolute values of S[N] being obtained in each frame based on a sound signal collected by the second sound collecting unit 20. That is, the right-hand first term represents averaging amplitudes between frames.
Next, the right-hand second term represents that the phase angle of the value X[N]T in the k-th frame, which is considered reliable, based on a sound signal collected by the first sound collecting unit 30 is divided by the phase angle of the value S[N] in the same k-th frame based on a sound signal collected by the second sound collecting unit 20.
Then, by multiplying the right-hand first term by the right-hand second term, a transfer function AT(ω) is estimated.
The transfer function estimating unit 103 selects the k-th frame considered most reliable based on a selection index. As the selection index, the power over the entire usage frequency band can be used, so that a frame having a large power is selected.
The transfer function AT(ω) is converted into N samples by inverse Fourier transform. In the present embodiment, only T samples from the beginning of samples in a signal are used.
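The FDP method may be sketched, for example, as follows. The sketch makes two assumptions that are not stated in the embodiment: the selection index is computed from the power of the signal of the second sound collecting unit 20, and the division of phase angles is realized as a difference of angles in a complex exponential. The array layout and the function name estimate_fdp are hypothetical.

```python
import numpy as np

def estimate_fdp(S, X):
    """Frame-averaged amplitude combined with the phase of one selected frame.

    S: (F, K) spectra of the close-talking signal.
    X: (F, K, M) spectra of the microphone array.
    """
    S = np.asarray(S)
    X = np.asarray(X)
    # First term: ratio of amplitudes averaged between frames.
    amplitude = np.mean(np.abs(X), axis=0) / np.mean(np.abs(S), axis=0)[:, None]
    # Selection index (assumption): frame with the largest power over all bins.
    k = int(np.argmax(np.sum(np.abs(S) ** 2, axis=1)))
    # Second term: phase of the output relative to the input in frame k.
    phase = np.exp(1j * (np.angle(X[k]) - np.angle(S[k])[:, None]))
    return amplitude * phase
```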
According to the FDP method described above, similar to the FD method or the like, a window function can be applied when x[n]T is converted into X[n]T by Fourier transform. Similarly, a window function can be applied when s[n] is converted into S[n] by Fourier transform. Therefore, in the FDP method, it is possible to reduce the computation amount compared to the TD method.
Also in the FDP method, the shift length between frames may be arbitrary since a transfer function of a sound field is unchanged by time. When the shift length is long, a calculation amount can be reduced, but the performance of estimation is degraded since the number of frames used in estimation of a transfer function is reduced. Therefore, the shift length between frames in the FDP method is appropriately set corresponding to a desired estimation accuracy.
Next, a method in which the transfer function estimating unit 103 estimates a transfer function by use of an addition average between frames in a frequency domain, which is further applied with a cross spectrum method, is described. The method described below is also referred to, hereinafter, as a FDC (Frequency Domain Cross Spectrum) method.
First, similar to the FD method, when assuming first to N-th samples as one frame, an observation value X[N]T of one frame is the same as that of the FD method expressed by Expression (11). The observation values for F frames are the same as those of the FD method expressed by Expression (12).
By using the cross spectrum method, a transfer function A(ω) is expressed by Expression (17). In Expression (17), superscript * (asterisk) represents the complex conjugate.
The cross spectrum method is described.
A power spectrum density function Sx(f) can be obtained by applying Fourier transform on an autocorrelation function Rx, and a cross spectrum density Sxy(f) can be obtained by applying Fourier transform on a crosscorrelation function Rxy.
Further, according to the convolution theorem in which a convolution relation in a time domain is a product relation in a frequency domain, the cross spectrum density Sxy(f) is represented by a frequency domain expression of an impulse response, that is, the product of a transfer function H(f) and the power spectrum density function Sx(f).
Further, according to the Fourier transform relation between the power spectrum density and the signal, the power spectrum density function Sx(f) is represented by Expression (18), and the cross spectrum density Sxy(f) is represented by Expression (19).
$$S_x(f) = E[X^{*}(f)X(f)] \tag{18}$$
$$S_{xy}(f) = E[X^{*}(f)Y(f)] \tag{19}$$
That is, by applying Fourier transform on the observed input signal x(t) and the observed output signal y(t), or applying Fourier transform on a discrete time expression x(n) of the signal x(t) and a discrete time expression y(n) of the signal y(t), and performing calculations of Expression (18) and Expression (19), estimation of the impulse response can be performed.
In Expression (17) described above, the denominator of the right-hand term corresponds to the sum of Expression (18), and the numerator corresponds to the sum of Expression (19). Accordingly, by dividing the sum of Expression (19) by the sum of Expression (18), the transfer function H(f)=A(ω) can be calculated.
The transfer function AT(ω) is converted into N samples by inverse Fourier transform. In the present embodiment, only T samples from the beginning of samples in a signal are used.
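The cross spectrum estimate may be sketched, for example, as follows, with the expectation replaced by a sum over the F frames. The array layout and the function name estimate_fdc are hypothetical.

```python
import numpy as np

def estimate_fdc(S, X):
    """Cross spectrum divided by the input power spectrum, summed over frames.

    S: (F, K) spectra of the close-talking signal (input).
    X: (F, K, M) spectra of the microphone array (output).
    """
    S = np.asarray(S)
    X = np.asarray(X)
    # Numerator: sum over frames of conj(S) * X (cross spectrum).
    cross = np.sum(np.conj(S)[:, :, None] * X, axis=0)   # shape (K, M)
    # Denominator: sum over frames of conj(S) * S (power spectrum).
    power = np.sum(np.abs(S) ** 2, axis=0)               # shape (K,)
    return cross / power[:, None]
```

Frames with a large input power contribute more to both sums, so frames in which the talker is nearly silent have little effect on the estimate.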
As described above, according to the FDC method, similar to the FD method or the like, a window function can be applied when x[n]T is converted into X[n]T by Fourier transform. Similarly, a window function can be applied when s[n] is converted into S[n] by Fourier transform. Therefore, in the FDC method, it is possible to reduce the computation amount compared to the TD method.
Also in the FDC method, the shift length between frames may be arbitrary since a transfer function of a sound field is unchanged by time. When the shift length is long, a calculation amount can be reduced, but the performance of estimation is degraded since the number of frames used in estimation of a transfer function is reduced. Therefore, the shift length between frames in the FDC method is appropriately set corresponding to a desired estimation accuracy.
Next, a method in which the transfer function estimating unit 103 estimates a transfer function by use of one frame in a frequency domain is described. The method described below is also referred to, hereinafter, as a FDS (Frequency Domain Single Frame) method.
First, similar to the FD method, when assuming first to N-th samples as one frame, an observation value X[N]T of one frame is the same as that of the FD method expressed by Expression (11).
According to Expression (11), a transfer function AT(ω) for one frame is calculated. The calculated transfer function is expressed by Expression (20).
Since a transfer function is estimated by the observation value only for one frame, the number of samples in one frame can be greater than that used in the FD method or the like.
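Solving Expression (11) for the transfer function with a single frame amounts to one division per frequency bin, which may be sketched, for example, as follows. The function name estimate_fds and the array shapes are hypothetical, and the sketch assumes that no frequency bin of the input frame is zero.

```python
import numpy as np

def estimate_fds(s_frame, x_frame):
    """Single-frame estimate: output spectrum divided by input spectrum.

    s_frame: (N,) time-domain frame of the close-talking signal.
    x_frame: (N, M) time-domain frame of the M microphone channels.
    """
    S = np.fft.rfft(np.asarray(s_frame, dtype=float))
    X = np.fft.rfft(np.asarray(x_frame, dtype=float), axis=0)
    return X / S[:, None]                          # shape (N // 2 + 1, M)
```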
Next, a process sequence performed by the transfer function estimating unit 103 in the FDS method is described.
As described above, according to the FDS method, similar to the FD method or the like, a window function can be applied when x[n]T is converted into X[n]T by Fourier transform. Similarly, a window function can be applied when s[n] is converted into S[n] by Fourier transform. Therefore, in the FDS method, it is possible to reduce the computation amount compared to the TD method.
As described above, the sound processing apparatus 10 of the present embodiment includes: the first sound collecting unit 30 that is placed in a sound field and collects a sound signal which is speech of a talker; the second sound collecting unit 20 that is arranged to be movable to a position which is closer to a talker than the first sound collecting unit 30 and collects the sound signal; the transfer function estimating unit 103 that estimates a transfer function from a sound signal collected by the first sound collecting unit 30 and a sound signal collected by the second sound collecting unit 20 when a talker is at a predetermined position in the sound field; and a sound signal processing unit (sound source localizing unit 104, sound source separating unit 105, sound feature value extracting unit 106, speech recognizing unit 107) that performs a process of the sound signal by use of the transfer function estimated by the transfer function estimating unit 103.
In addition, in the sound processing apparatus 10 of the present embodiment, the second sound collecting unit 20 is arranged at a position where the direct sound of a talker can be collected.
According to this configuration, the sound processing apparatus 10 of the present embodiment is capable of accurately estimating a transfer function in a sound field.
Next, a test result is described in a case where the sound processing apparatus 10 of the present embodiment is used.
Thus, according to the sound processing apparatus 10 of the present embodiment, it was possible to improve the speech recognition rate by about 40% compared to the conventional technique.
Estimation of a transfer function by the methods described above may be performed only the first time. The transfer function estimating unit 103 may cause the storage unit 109 to store the estimated transfer function and may use the transfer function stored in the storage unit 109 from the second time onward. The first measurement may be performed, for example, when the seat position inside the vehicle is adjusted, in accordance with a command from a control unit that performs various kinds of control of the vehicle.
In addition, in a case where the second sound collecting unit 20 is a mobile phone such as a smartphone, when a driver makes a phone call with the mobile phone while the vehicle is stopped, the transfer function estimating unit 103 may acquire a sound signal and may estimate a transfer function. Further, each time the driver makes a phone call with the mobile phone, the transfer function may be sequentially updated.
In addition, in the present embodiment, only a driver is described as an example of a talker. However, a transfer function can be estimated as described above with respect to a sound signal of a person seated at a passenger seat, a rear seat, or the like. In this case, for example, the transfer function estimating unit 103 may switch one of the transfer functions stored in the storage unit 109 to another, corresponding to a result of operation of the operation unit (not shown) by the driver or another person.
In the first embodiment, an example is described in which the transfer function estimating unit 103 estimates a transfer function by using one of the methods described above; however, the embodiment is not limited thereto. The transfer function estimating unit 103 may estimate a transfer function by using two or more of the methods.
For example, the transfer function estimating unit 103 may integrate the FD method and the TD method and may estimate a transfer function as described below. The transfer function estimating unit 103 integrates A(ω) and a(t) obtained by least square estimation. Then, the transfer function estimating unit 103 performs analogical reasoning at the time of transfer function interpolation. Further, the transfer function estimating unit 103 calculates an accuracy of phase in the FD method and an accuracy of amplitude in the TD method. Then, the transfer function estimating unit 103 compares the calculated accuracy of phase or accuracy of amplitude with a predetermined accuracy. The transfer function estimating unit 103 estimates a transfer function by the FD method when the accuracy of phase is better than the predetermined accuracy. On the other hand, the transfer function estimating unit 103 estimates a transfer function by the TD method when the accuracy of amplitude is better than the predetermined accuracy.
The first embodiment is described using an example in which a sound signal uttered by a talker is collected by use of the second sound collecting unit 20 and the first sound collecting unit 30, and a transfer function is estimated based on the collected sound signal; however, the embodiment is not limited thereto. For example, the first sound collecting unit 30 acquires a sound signal emitted from a loudspeaker instead of a talker. Then, the transfer function estimating unit 103 may obtain a transfer function by using the acquired sound signal as an observation value and may integrate the obtained transfer function and an estimated transfer function by any of the methods described above.
The transfer function Ã(ω) estimated based on the sound signal of the talker collected by the second sound collecting unit 20 and the first sound collecting unit 30 is represented by Expression (21) and Expression (23).
In Expression (21), Ã(ω) is expressed by Expression (22), and D is expressed by Expression (23).
Ã(ω)=λ[T]e^(−jωt) (22)
Ã(ω)=DA(ω)+(1−D)·Ã(ω) (23)
In Expression (23), the interpolated transfer function Ã(ω) is expressed by Expression (24).
Ã(ω)=λ[F]e^(−jωt) (24)
From Expression (21) and Expression (23), Ã(ω) is expressed by Expression (25).
Ã(ω)=λ[T]e^(−jωt) (25)
It is possible to adjust which one of Expression (21) and Expression (23) is weighted by the value of D.
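The weighting by D can be sketched as follows, under the assumption that the integration is the per-frequency linear combination written in Expression (23). The function name integrate_transfer_functions, the argument names, and the sample values are hypothetical.

```python
import numpy as np

def integrate_transfer_functions(a_estimated, a_interpolated, d):
    """Return D * A1(w) + (1 - D) * A2(w) for each frequency bin."""
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must lie in [0, 1]")
    return d * np.asarray(a_estimated) + (1.0 - d) * np.asarray(a_interpolated)

# Two toy transfer functions sampled at two frequency bins.
a_one = np.array([1.0 + 0.0j, 0.5 - 0.5j])
a_two = np.array([0.0 + 1.0j, 0.5 + 0.5j])
a_mix = integrate_transfer_functions(a_one, a_two, 0.25)
```

With D = 1 only the first transfer function is used, with D = 0 only the second, and intermediate values shift the weight between the two.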
The meaning of integrating a transfer function measured based on a sound signal output from a loudspeaker and a transfer function estimated based on a sound signal of a talker collected by the second sound collecting unit 20 and the first sound collecting unit 30 is to interpolate two transfer functions of the same direction and further interpolate a GMM described below.
As described above, by integrating a transfer function measured based on a sound signal output from a loudspeaker and a transfer function estimated based on a sound signal of a talker collected by the second sound collecting unit 20 and the first sound collecting unit 30, it is possible to estimate a transfer function in consideration of individual differences of drivers (for example, body height, direction of speech).
In addition, when switching between transfer functions of a plurality of talkers, the transfer function estimating unit 103 (talker identifying unit) may perform talker identification by using a sound signal collected by the first sound collecting unit 30 and switch to a transfer function corresponding to the identified talker. In this case, prior learning may be performed for talker identification, by using a GMM (Gaussian Mixture Model). Alternatively, the transfer function estimating unit 103 may generate an acoustic model used for identification from a sound signal used when a transfer function is estimated based on a sound signal collected by the second sound collecting unit 20 and the first sound collecting unit 30 and may cause the storage unit 109 to store the generated acoustic model. Then, the transfer function estimating unit 103 obtains the likelihood for each talker of the GMM by using a feature value extracted by the sound feature value extracting unit 106. Accordingly, by using a ratio of such calculated likelihoods, D in Expression (21) and Expression (23) may be determined. In other words, a transfer function of an acoustic model corresponding to the likelihood of the largest value is employed. In a case where a transfer function to be used is manually switched, D is 0 or 1.
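One possible way of deriving D from the likelihoods is sketched below. It assumes diagonal-covariance GMMs, two talkers, and a normalized ratio D = L_a / (L_a + L_b). The text only states that a ratio of the likelihoods may be used, so this particular ratio, the function names, and the toy model parameters are assumptions.

```python
import numpy as np

def gmm_log_likelihood(feats, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM.

    feats: (T, d) frames; weights: (K,); means, variances: (K, d).
    """
    diff = feats[:, None, :] - means[None, :, :]               # (T, K, d)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff ** 2 / variances, axis=2))  # (T, K)
    # Log-sum-exp over mixture components, then sum over frames.
    peak = log_comp.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return float(frame_ll.sum())

def weight_from_likelihoods(ll_a, ll_b):
    """D = L_a / (L_a + L_b), evaluated from log-likelihoods."""
    delta = np.clip(ll_b - ll_a, -700.0, 700.0)  # keeps exp() in range
    return float(1.0 / (1.0 + np.exp(delta)))

rng = np.random.default_rng(1)
w = np.array([0.5, 0.5])
var = np.ones((2, 2))
means_a = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy model of talker A
means_b = np.array([[4.0, 4.0], [5.0, 5.0]])   # toy model of talker B
feats = rng.normal(0.5, 1.0, size=(20, 2))     # frames resembling talker A

ll_a = gmm_log_likelihood(feats, w, means_a, var)
ll_b = gmm_log_likelihood(feats, w, means_b, var)
D = weight_from_likelihoods(ll_a, ll_b)
```

When one talker is far more likely than the other, D approaches 0 or 1, which matches the case in which the transfer function of the most likely talker is employed.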
The first embodiment is described using an example in which a sound signal is collected by using the second sound collecting unit 20 which is a close-talking microphone and the first sound collecting unit 30 which is a microphone array, and a transfer function is estimated based on the collected sound signal. The present embodiment is described using an example in which a sound signal is collected by using the first sound collecting unit 30 without using the second sound collecting unit 20, and a transfer function is estimated based on the collected sound signal.
For example, the imaging unit 40 which captures an image including the mouth of a talker is connected to the mouth position estimating unit 110. The mouth position estimating unit 110 estimates a position of the mouth of a talker relative to the first sound collecting unit 30 based on the image captured by the imaging unit 40. The mouth position estimating unit 110, for example, estimates a position of the mouth of a talker relative to the first sound collecting unit 30 based on the size of an image of the mouth included in the captured image. The mouth position estimating unit 110 outputs information indicating the estimated mouth position to the transfer function estimating unit 103A.
When a position of a sound source is estimated by using a Kalman filter based on a sound signal only, the transfer function estimating unit 103A may include the mouth position estimating unit 110.
The transfer function estimating unit 103A estimates a transfer function by using the information indicating a mouth position output from the mouth position estimating unit 110 and the sound signal collected by the first sound collecting unit 30 and causes the storage unit 109 to store the estimated transfer function.
A time difference t[l] with reference to a first microphone 301 described below and information indicating a position of a talker relative to the microphone 301 are input to the observation model unit 701. As described below, the observation model unit 701 uses an observation model to calculate an observation model ζ[l] and outputs the calculated observation model ζ[l] to the updating unit 702.
The updating unit 702 uses the observation model ζ[l] input from the observation model unit 701, a variance P̂[l|l−1] input from the predicting unit 703, and an observation value h(ζ̂[l]) input from the observation unit 704 to update an observation model ζ̂[l] and a variance P̂[l] and outputs the updated observation model ζ̂[l] and variance P̂[l] to the predicting unit 703.
The predicting unit 703 predicts the next observation model ζ̂[l|l−1] and variance P̂[l|l−1] by using the observation model ζ̂[l] and variance P̂[l] input from the updating unit 702. The predicting unit 703 outputs the predicted observation model ζ̂[l|l−1] and variance P̂[l|l−1] to the observation unit 704 and outputs the predicted variance P̂[l|l−1] to the updating unit 702.
The observation unit 704 calculates the observation value h(ζ̂[l]) by using the observation model ζ̂[l|l−1] and variance P̂[l|l−1] input from the predicting unit 703 and outputs the calculated observation value h(ζ̂[l]) to the updating unit 702.
A propagating wave model is described. In the description below, a signal in a frequency domain based on a sound signal uttered by a talker is referred to as S(ω), a signal in a frequency domain based on a sound signal collected by a microphone is referred to as X[n](ω), and a transfer function is referred to as A(ξs, ξm[n], ω).
A signal X[n](ω) in a frequency domain in a case where a sound signal is one channel is expressed by Expression (26). Here, n represents the number of a microphone, ξs represents a speech position, and ξm[n] represents the position of an n-th microphone.
$$X_{[n]}(\omega) = A(\xi_s, \xi_{m[n]}, \omega)\, S(\omega) \quad (26)$$
In Expression (26), ξs is expressed by Expression (27), and ξm[n] is expressed by Expression (28).
$$\xi_s = [x_s, y_s]^T \quad (27)$$
$$\xi_{m[n]} = [x_{m[n]}, y_{m[n]}]^T \quad (28)$$
A signal X(ω) in the frequency domain in a case where the sound signal is multichannel is expressed by Expression (29).
$$X(\omega) = [X_{[1]}(\omega), \ldots, X_{[N]}(\omega)]^T \quad (29)$$
In Expression (29), the transfer function A(ξs, ξm, ω) is expressed by Expression (30).
$$A(\xi_s, \xi_m, \omega) = [A(\xi_s, \xi_{m[1]}, \omega), \ldots, A(\xi_s, \xi_{m[N]}, \omega)] \quad (30)$$
As shown in
In Expression (31), c represents the speed of sound. From Expression (27) and Expression (28), the distance D[n] is expressed by Expression (32).
$$D_{[n]} = \sqrt{(x_s - x_{m[n]})^2 + (y_s - y_{m[n]})^2} \quad (32)$$
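The distance of Expression (32) and the resulting propagation time can be computed as in the following sketch. Expression (31) is not reproduced in this section, so the relation delay = D[n] / c, the value of 343 m/s for the speed of sound, and the function names are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def mic_distances(xi_s, xi_m):
    """Distance D[n] from the speech position to each microphone.

    xi_s: [x_s, y_s]; xi_m: array of shape (N, 2) with rows [x_m[n], y_m[n]].
    """
    xi_s = np.asarray(xi_s, dtype=float)
    xi_m = np.asarray(xi_m, dtype=float)
    return np.sqrt(np.sum((xi_s - xi_m) ** 2, axis=1))

def propagation_delays(xi_s, xi_m, c=SPEED_OF_SOUND):
    """Propagation time from the speech position to each microphone (assumed D/c)."""
    return mic_distances(xi_s, xi_m) / c

mics = [[0.0, 0.0], [0.3, 0.0]]
dist = mic_distances([3.0, 4.0], mics)
tau = propagation_delays([3.0, 4.0], mics)
```

For the first microphone in this example the distance is 5 m, so the assumed propagation time is 5/343 s.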
Next, a motion model is described.
The motion model (random walk model) of a talker is expressed by Expression (33).
$$\xi_{s[l+1]} = \xi_{s[l]} + W_{s[l]} \quad (33)$$
In Expression (33), Ws[l] is expressed by Expression (34).
$$W_{s[l]} = [N(0, \sigma_x), N(0, \sigma_y)]^T \quad (34)$$
The motion model (random walk model) of a microphone is expressed by Expression (35).
$$\xi_{m[l+1]} = \xi_{m[l]} + W_{m[l]} \quad (35)$$
In Expression (35), Wm[l] is expressed by Expression (36), and Wm[n][l] is expressed by Expression (37).
$$W_{m[l]} = [W_{m[1][l]}, \ldots, W_{m[N][l]}]^T \in R^{2N \times 1} \quad (36)$$
$$W_{m[n][l]} = [N(0, \sigma_m), N(0, \sigma_m)]^T \quad (37)$$
In Expression (36), R represents a covariance matrix.
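The random walk model of the talker can be simulated as in the sketch below. It treats σx and σy as standard deviations of zero-mean Gaussian steps. The text does not state whether they denote a standard deviation or a variance, so this reading, the function name, and the numeric values are assumptions.

```python
import numpy as np

def simulate_random_walk(xi0, sigmas, steps, rng):
    """Generate positions with xi[l+1] = xi[l] + W[l], W[l] ~ N(0, sigma)."""
    xi0 = np.asarray(xi0, dtype=float)
    noise = rng.normal(0.0, sigmas, size=(steps, xi0.size))
    # Row 0 is the initial position; each later row adds the accumulated steps.
    return np.vstack([xi0, xi0 + np.cumsum(noise, axis=0)])

rng = np.random.default_rng(0)
sigma_xy = np.array([0.05, 0.02])           # hypothetical step sizes in metres
traj = simulate_random_walk([0.6, 0.9], sigma_xy, 5000, rng)
steps_taken = np.diff(traj, axis=0)         # recovers the individual W[l]
```

The same function could be applied to each microphone position with σm in place of σx and σy.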
Next, an observation model is described. The observation model described below is stored in the observation model unit 701.
When observing a time difference with reference to the first microphone 301, the time difference is expressed by Expression (38).
The observation model is expressed by Expression (39).
The observation model unit 701 calculates an observation model ζ[l] by using Expression (38) and Expression (39) and outputs the calculated observation model ζ[l] to the updating unit 702.
Next, a prediction step performed by the predicting unit 703 is described.
The predicting unit 703 performs update of an average by using Expression (40).
The predicting unit 703 performs update of a variance P by using Expression (41).
In Expression (41), I represents a unit matrix, and diag( ) represents a diagonal matrix. P represents a variance, F represents a linear model relating to the time transition of a system, and R represents a covariance matrix. The predicting unit 703 updates an observation model ζ̂[l|l−1] by an observation model ζ̂[l−1] input from the updating unit 702 and outputs the updated observation model ζ̂[l|l−1] to the observation unit 704. Further, the predicting unit 703 updates a variance P̂[l|l−1] by a variance P̂[l−1] input from the updating unit 702 and outputs the updated variance P̂[l|l−1] to the observation unit 704 and the updating unit 702.
Next, an observation step performed by the observation unit 704 is described.
The observation unit 704 observes the observation model ζ̂[l|l−1] input from the predicting unit 703, calculates an observation value h(ζ̂[l]) using Expression (42), and outputs the calculated observation value h(ζ̂[l]) to the updating unit 702.
Next, an update step performed by the updating unit 702 is described.
The updating unit 702 updates a Kalman gain K using Expression (43).
$$K_{[l]} = P_{[l|l-1]} H_{[l]}^T \left( H_{[l]} P_{[l|l-1]} H_{[l]}^T + Q_{[l]} \right)^{-1} \quad (43)$$
In Expression (43), H represents an observation model that linearly maps the state space onto the observation space, and Q represents a covariance matrix.
The updating unit 702 updates the observation model ζ̂[l] using Expression (44).
$$\hat{\xi}_{[l]} = \hat{\xi}_{[l|l-1]} + K_{[l]} \left( \xi_{[l]} - h(\hat{\xi}_{[l]}) \right) \quad (44)$$
In Expression (43), P[l] is expressed by Expression (45), H[l] is expressed by Expression (46), and Q[l] is expressed by Expression (47).
In Expression (47), σr represents a variance with respect to an observation.
The updating unit 702 updates the observation model ζ̂[l] and variance P̂[l] by using the observation model ζ[l] input from the observation model unit 701, the observation value h(ζ̂[l]) input from the observation unit 704, the variance P̂[l|l−1] input from the predicting unit 703, and Expression (44) to Expression (47) described above and outputs the updated observation model ζ̂[l] and variance P̂[l] to the predicting unit 703.
The transfer function updating unit 103A-1 performs the update described above until an estimation error becomes minimum and estimates a transfer function A(ξ̂s[l], τ̂m[l], ω).
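The prediction, observation, and update steps can be sketched as a small extended Kalman filter. This is a simplified illustration, not the filter of the embodiment: the state holds only the talker position (the microphone positions are treated as known and fixed), the state transition F is assumed to be the identity, the observation h() is assumed to be the arrival-time difference relative to the first microphone, and the variance update P = (I − KH)P is the textbook form, since Expressions (38) to (42) and (45) to (47) are not reproduced in this section. All names and numeric values are hypothetical.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def h(xi_s, mics):
    """Arrival-time differences relative to the first microphone."""
    d = np.linalg.norm(mics - xi_s, axis=1)
    return (d[1:] - d[0]) / C

def jacobian(xi_s, mics):
    """Linearization H of h() around the current position estimate."""
    diff = xi_s - mics
    u = diff / np.linalg.norm(diff, axis=1, keepdims=True)
    return (u[1:] - u[0]) / C

def kalman_step(xi_hat, P, z, mics, R, Q):
    # Prediction: random walk, so the mean is kept and the variance grows by R.
    xi_pred, P_pred = xi_hat, P + R
    # Update: gain as in Expression (43), state as in Expression (44).
    H = jacobian(xi_pred, mics)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + Q)
    xi_new = xi_pred + K @ (z - h(xi_pred, mics))
    P_new = (np.eye(len(xi_hat)) - K @ H) @ P_pred
    return xi_new, P_new

mics = np.array([[0.0, 0.0], [1.5, 0.0], [0.0, 1.5], [1.5, 1.5]])
true_pos = np.array([0.6, 0.9])
z = h(true_pos, mics)                      # noise-free observation for the demo
xi_hat, P = np.array([0.75, 0.75]), 0.1 * np.eye(2)
R, Q = 1e-4 * np.eye(2), (1e-5 ** 2) * np.eye(3)
for _ in range(50):
    xi_hat, P = kalman_step(xi_hat, P, z, mics, R, Q)
```

In this noise-free toy setting the estimate xi_hat moves from the initial guess toward true_pos. With real observations the loop would instead consume one measured time-difference vector per frame.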
As described above, the sound processing apparatus 10A of the present embodiment includes: the first sound collecting unit 30 that is placed in a sound field and collects a sound signal which is speech of a talker; a talker position estimating unit (mouth position estimating unit 110) that estimates a talker position which is a position of a talker relative to the first sound collecting unit 30; the transfer function estimating unit 103A that estimates a transfer function from the estimated talker position and a sound signal collected by the first sound collecting unit 30 when a talker is at a predetermined position in the sound field; and a sound signal processing unit (sound source localizing unit 104, sound source separating unit 105, sound feature value extracting unit 106, speech recognizing unit 107) that performs a process of the sound signal by use of the transfer function estimated by the transfer function estimating unit 103A.
By this configuration, according to the present embodiment, it is possible to estimate a transfer function without using the second sound collecting unit 20, by using the first sound collecting unit 30 only.
When a sound signal is collected by using the second sound collecting unit 20 and the first sound collecting unit 30 and a transfer function is estimated based on the collected sound signal only the first time, a sound signal may be collected by using only the first sound collecting unit 30 from the second time onward. The transfer function estimating unit 103 may use the sound signal collected by the first sound collecting unit 30 as an observation value and, by sequentially updating a Kalman filter, adjust the transfer function estimated the first time.
Since such sequential updating is performed, the transfer function estimating unit 103 may estimate the transfer function by using a time-domain method among the methods described above.
The first embodiment is described using an example in which, in a case where there are a plurality of drivers, a sound signal is collected by using the second sound collecting unit 20 and the first sound collecting unit 30, and a transfer function is estimated based on the collected sound signal; however, the embodiment is not limited thereto.
For example, only speech of a first driver is collected by using the second sound collecting unit 20 and the first sound collecting unit 30, and a transfer function is estimated based on the collected sound signal. Speech of another driver is collected by using the first sound collecting unit 30. Then, the transfer function estimating unit 103 or 103A may use the collected speech signal of that driver as an observation value and, by sequentially updating a Kalman filter, adjust the transfer function of the first driver. The transfer function estimating unit 103 or 103A may associate the transfer function adjusted in this way with the driver as a talker and cause the storage unit 109 to store the associated transfer function.
Similarly, since sequential updating is performed, the transfer function estimating unit 103 or 103A may estimate the transfer function by using a time-domain method among the methods described above.
Also in the sound processing apparatus 10 of the first embodiment, the talker identification described above may be performed. The transfer function estimating unit 103 or 103A determines whether or not a transfer function corresponding to an identified talker is already stored in the storage unit 109. When a transfer function corresponding to the talker is already stored in the storage unit 109, the transfer function estimating unit 103 or 103A reads out the transfer function corresponding to the talker from the storage unit 109 and uses the transfer function which is read out.
On the other hand, when a transfer function corresponding to the talker is not already stored in the storage unit 109, the transfer function estimating unit 103 or 103A may perform notification which prompts a talker to talk. For example, the notification may be performed by use of a sound signal from a loudspeaker (not shown) connected to the sound processing apparatus 10 or the like, or may be performed by use of an image or character information from a display unit (not shown) connected to the sound processing apparatus 10 (or 10A) or the like.
Hereinafter, an example of a process sequence in which identification of a talker is performed and a transfer function is set is described by using
First, an example of a process of setting a transfer function is described by using
Next, another example of a process sequence of setting a transfer function is described by using
In Step S303, for example, when the user selects information indicating that a speech recognition function is not used, the transfer function estimating unit 103A may determine that measurement of a transfer function is not performed. Alternatively, when the user selects information indicating that a speech recognition function is used, the transfer function estimating unit 103A may determine that measurement of a transfer function is performed.
Next, still another example of a process sequence of setting a transfer function is described by using
In the example shown in
The process sequences shown in
In this way, by using a plurality of acoustic models or language models, for example, even in such a case that a first user is a man who speaks Japanese and a second user is a woman who speaks English, the sound processing apparatus 10A of the present embodiment can measure a transfer function in a space such as in a vehicle by using an acoustic model or a language model for each user. As a result, according to the present embodiment, it is possible to improve a speech recognition rate in a space such as in a vehicle.
The first embodiment is described using an example in which the transfer function estimating unit 103 estimates a transfer function based on a sound signal collected by the second sound collecting unit 20 which is a close-talking microphone and the first sound collecting unit 30 which is a microphone array.
The present embodiment is described using an example in which a transfer function is estimated by using only the microphone array without using the close-talking microphone.
The first sound signal acquiring unit 102B acquires M sound signals, one of the M sound signals being collected by each of the M microphones 301 of the first sound collecting unit 30B. The first sound signal acquiring unit 102B outputs the acquired M sound signals to the transfer function estimating unit 103B, the delaying unit 111, and the selecting unit 112.
The delaying unit 111 applies a delay operation (time delay, time shift) of a predetermined time to the M sound signals input from the first sound signal acquiring unit 102B. Here, as described below, the predetermined time is a time chosen so that, in the calculation, the impulse response of any sound signal collected closer to the sound source than the microphone 301 corresponding to the representative channel selected by the selecting unit 112 starts at a positive time. The delaying unit 111 applies a Fourier transform to the time-delayed M time-domain sound signals for each frame and thereby generates input signals in the frequency domain. The delaying unit 111 outputs the Fourier-transformed M sound signals to the transfer function estimating unit 103B. The sound signal input to the sound source localizing unit 104 may be a signal that has been delayed by the delaying unit 111 but to which the Fourier transform has not yet been applied.
The selecting unit 112 selects one sound signal of the M sound signals input from the first sound signal acquiring unit 102B. The selected sound signal may be arbitrary, or may be the one corresponding to a predetermined microphone 301. The selecting unit 112 outputs information indicating the selection result to the transfer function estimating unit 103B. The selection of a sound signal may instead be performed by the transfer function estimating unit 103B.
The transfer function estimating unit 103B estimates a transfer function as described below by using the information indicating the selection result input from the selecting unit 112 and the sound signal input from the delaying unit 111 and outputs the estimated transfer function to the sound source localizing unit 104. Further, the transfer function estimating unit 103B causes the storage unit 109 to store the estimated transfer function. The transfer function estimating unit 103B may associate a talker and a transfer function and may cause the storage unit 109 to store the transfer function associated with the talker, for example, in such a case that there are a plurality of drivers who use a vehicle. In this case, for example, in response to information input by a driver via an operation unit (not shown), the transfer function estimating unit 103B reads out and uses a transfer function corresponding to the driver, of the transfer functions stored in the storage unit 109.
In the example shown in
As shown in
In the following description, a first channel sound signal that arrives at the microphone 301-1 is referred to as 1ch, a second channel sound signal that arrives at the microphone 301-2 is referred to as 2ch, a third channel sound signal that arrives at the microphone 301-3 is referred to as 3ch, and a fourth channel sound signal that arrives at the microphone 301-4 is referred to as 4ch.
In
The signals x1(t) to x4(t) are the time-domain signals of the sound signals collected by the microphones 301-1 to 301-4, respectively. Further, ã1(t) is a transfer function estimated between the microphone 301-1 and the microphone 301-1, ã2(t) is a transfer function estimated between the microphone 301-1 and the microphone 301-2, ã3(t) is a transfer function estimated between the microphone 301-1 and the microphone 301-3, and ã4(t) is a transfer function estimated between the microphone 301-1 and the microphone 301-4.
Next, a case where the number of microphones 301 is M is described.
Here, a1(t) to a4(t) are the transfer functions of the microphones 301-1 to 301-4, respectively. First, it is assumed that the sound signal collected by the microphone 301-1 is the representative channel. When the order is N, time domain signals x1[N] to xM[N] are expressed by Expression (48).
In
Here, it is assumed that the 1ch is a representative channel, and it is assumed that as the waveform g1, the start time of the impulse response of the 1ch transfer function is 0. As the waveform g2, a time t13 is the start time of the impulse response of the 2ch transfer function, and as the waveform g3, a time t12 is the start time of the impulse response of the 3ch transfer function. As the waveform g4, a time −t11 is the start time of the impulse response of the 4ch transfer function.
That is, when an arbitrary one of the microphones 301 is selected and there is a microphone 301 closer to the mouth of the talker Sp than the selected microphone 301, the direct wave arrives at a negative time in the impulse response of the transfer function for that closer microphone 301.
Therefore, in the present embodiment, even when an arbitrary one of the microphones 301 is selected by the selecting unit 112, the delaying unit 111 performs a delay operation of a predetermined time T such that the start time of any channel closer to the sound source than the representative channel is not negative, and the transfer function is then estimated.
As shown in
That is, even when an arbitrary one of the microphones 301 is selected and there is a microphone 301 closer to the mouth of the talker Sp than the selected microphone 301, the direct wave arrives at a positive time in the impulse response of the transfer function for every microphone 301.
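The effect of the delay operation can be illustrated with the toy example below. It assumes that the channel being modeled is delayed by T samples while the representative channel is used as is, and it estimates a causal FIR filter by least squares. Which signals are delayed relative to which, the least-squares estimator, and all names and values are assumptions for illustration.

```python
import numpy as np

def delay(x, T):
    """Delay signal x by T samples, padding the start with zeros."""
    return np.concatenate([np.zeros(T), x[:len(x) - T]])

def estimate_fir(ref, target, order):
    """Least-squares causal FIR a with target[t] ~ sum_k a[k] * ref[t - k]."""
    X = np.column_stack([delay(ref, k) for k in range(order)])
    a, *_ = np.linalg.lstsq(X, target, rcond=None)
    return a

rng = np.random.default_rng(0)
s = rng.standard_normal(4000)      # toy source signal
x1 = delay(s, 5)                   # representative channel (farther microphone)
x4 = delay(s, 2)                   # closer microphone: leads x1 by 3 samples

T = 8                              # assumed delay amount in samples
a_raw = estimate_fir(x1, x4, 16)                # direct wave at -3 cannot be captured
a_delayed = estimate_fir(x1, delay(x4, T), 16)  # direct wave moved to T - 3 = 5
```

Without the delay, the causal filter has no tap at the negative time and its coefficients stay near zero. With the delay, the direct wave appears as a clear peak at a positive tap index.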
When the number of the microphones 301 is M and the order is N, the time domain signals x1[N] to xM[N] are delayed from those of Expression (48) by the time T and are therefore expressed by Expression (49).
In Expression (49), the left-hand term is defined as x[N], the first right-hand term is defined as a(t), and the second right-hand term is defined as x1[1−T:N−T].
When Fourier transform is applied on Expression (49), Expression (49) is converted into Expression (50).
$$X_{[N]} = A(\omega)\, X_1(\omega) \quad (50)$$
In Expression (50), ω is a frequency in a frequency domain, and X1[N] is a complex scalar.
From Expression (50), when assuming first to N-th samples as one frame, an observation value X[N]T of one frame in a frequency domain is expressed by Expression (51).
$$X_{[N]}^T = X_{1[N]}\, A^T(\omega) \quad (51)$$
The transfer function estimating unit 103B estimates a transfer function by the same process as that of the TD method, FD method, FDA method, FDN method, FDP method, FDC method, and/or FDS method described in the first embodiment using Expression (51) as the observation value of one frame.
Next, test results in a case where the sound processing apparatus 10B of the present embodiment is used are described.
First, the test conditions are described. The sound source used for the test was a loudspeaker whose angle could be changed in 30-degree steps. Speech uttered by a person was recorded, and the recorded sound signal was output from the loudspeaker. The sound signal was collected by using eight microphones 301.
In the sound processing apparatus 10B, the order N is 4096, and the usage sample number is 16384×1. The transfer function estimating unit 103B estimated a transfer function by using the FD method. As the estimation conditions, the usage order T is 4096, the frame length N is 1638, the shift length is 10, the window function used is a Hamming window, and the delay amount T is 128. The test was performed with the angle of the loudspeaker set to −60 degrees, −30 degrees, 0 degrees, 30 degrees, and 60 degrees.
Next, a result of performing sound source localization by using the sound processing apparatus 10B is described.
In
As the lines g31 to g33 in
As described above, the sound processing apparatus 10B of the present embodiment includes: a first sound collecting unit (first sound collecting unit 30B, first sound signal acquiring unit 102B) that is placed in a sound field and collects a sound signal which is speech of a talker, by use of a plurality of microphones 301-1 to 301-M; the delaying unit 111 that delays all sound signals collected by the first sound collecting unit, by a predetermined time; the selecting unit 112 that selects one microphone of the plurality of microphones 301-1 to 301-M; the transfer function estimating unit 103B that estimates a transfer function of another microphone relative to the selected one microphone by use of a sound signal delayed by the delaying unit 111; and a sound signal processing unit (sound source localizing unit 104, sound source separating unit 105, sound feature value extracting unit 106, speech recognizing unit 107) that performs a process of the sound signal by use of the transfer function estimated by the transfer function estimating unit 103B.
According to this configuration, in the sound processing apparatus 10B of the present embodiment, an arbitrary microphone 301 of the plurality of microphones 301 included in the first sound collecting unit 30B is selected as a representative channel.
Then, by shifting the start time of the impulse response in the transfer function of the representative channel by a time T, it is possible to estimate a transfer function even when there is a microphone 301 closer to the sound source than the microphone 301 corresponding to the selected representative channel. As a result, it is possible to accurately estimate a transfer function by using a microphone array without using a close-talking microphone, even in a narrow space such as the inside of a vehicle.
The present embodiment is described using an example in
Further, the sound processing apparatus 10B may include the mouth position estimating unit 110 (
Further, the present embodiment is described using an example in which the acquired sound signal is delayed by a predetermined time T; however, the delay time T may be calculated by the sound processing apparatus 10B. For example, when the sound processing apparatus 10B is placed in a vehicle, a known sound signal is emitted from an assumed position of the mouth of the driver, and the emitted sound signal is acquired by the first sound collecting unit 30B and the first sound signal acquiring unit 102B. Then, the sound processing apparatus 10B may calculate the delay time T based on the timing of the acquired sound signal of each channel.
For example, the sound processing apparatus 10B may calculate the difference between a time when the sound signal is acquired earliest and a time when the sound signal is acquired latest and calculate, as the delay time T, a time obtained by adding a predetermined margin to the calculated difference or a time obtained by multiplying the calculated difference by a predetermined value.
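The calculation of the delay time T described above can be written as the following sketch. The function name delay_time, the way the per-channel arrival times are obtained, and the numeric values are hypothetical.

```python
def delay_time(arrival_times, margin=None, factor=None):
    """Delay time T from the earliest and latest per-channel arrival times.

    Either a margin (seconds) is added to the difference, or the difference
    is multiplied by a factor; exactly one of the two should be given.
    """
    spread = max(arrival_times) - min(arrival_times)
    if margin is not None:
        return spread + margin
    if factor is not None:
        return spread * factor
    raise ValueError("specify either margin or factor")

# Hypothetical arrival times (seconds) of a known signal at three channels.
arrivals = [0.0100, 0.0105, 0.0112]
t_margin = delay_time(arrivals, margin=0.0005)   # difference plus a margin
t_factor = delay_time(arrivals, factor=2.0)      # difference times a factor
```

In this example the difference between the latest and earliest arrival is 1.2 ms, so the two variants give about 1.7 ms and 2.4 ms, respectively.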
In the first to third embodiments, a vehicle is described as an example of a sound field; however, the embodiment is not limited thereto. For example, the sound field may be an indoor room, a conference room, or the like. In this case, the position of a talker may be substantially fixed such as a case in which, for example, a talker sits on a sofa provided in the room or the like. When the position of a talker is substantially fixed in this way, estimation of a transfer function based on the sound signal collected by the second sound collecting unit 20 and the first sound collecting unit 30 in the sound processing apparatus 10 may be performed only once. Alternatively, estimation of a transfer function based on the sound signal collected by the first sound collecting unit 30A in the sound processing apparatus 10A may be performed only once. Alternatively, estimation of a transfer function based on the sound signal collected by the first sound collecting unit 30B in the sound processing apparatus 10B may be performed only once. After the estimation, speech recognition may be performed by using a transfer function stored in the storage unit 109, or by using a transfer function obtained by updating the stored transfer function by use of the sound signal collected by the first sound collecting unit 30 (or 30A, 30B). In this way, when the sound field is a room or the like, the second sound collecting unit 20 in the sound processing apparatus 10 may be a mobile phone or the like. In a case where the second sound collecting unit 20 in the sound processing apparatus 10 is a mobile phone or the like, a transfer function may be estimated or be updated when a talker makes a phone call.
The sound processing apparatuses 10, 10A, and 10B output the result of such speech recognition, for example, to an apparatus (for example, TV, air conditioner, projector) provided inside a room or the like.
The apparatus provided inside a room may operate corresponding to the input speech recognition result.
The sound source direction may be estimated by recording a program for performing the functions of the sound processing apparatus 10 (or 10A, 10B) according to the invention on a computer-readable recording medium, reading the program recorded on the recording medium into a computer system, and executing the program. Here, the “computer system” may include an OS or hardware such as peripherals. The “computer system” may include a WWW system including a homepage providing environment (or display environment). Examples of the “computer-readable recording medium” include portable media such as a flexible disk, a magneto-optical disk, a ROM, and a CD-ROM, and storage devices such as a hard disk built into a computer system. The “computer-readable recording medium” may include a medium that temporarily holds a program for a predetermined time, like a volatile memory (RAM) in a computer system serving as a server or a client in a case where the program is transmitted via a network such as the Internet or a communication circuit such as a telephone circuit.
The program may be transmitted from a computer system storing the program in a storage device or the like to another computer system via a transmission medium or by transmission waves in the transmission medium. Here, the “transmission medium” via which the program is transmitted means a medium having a function of transmitting information such as a network (communication network) such as the Internet or a communication circuit (communication line) such as a telephone line. The program may be configured to realize a part of the above-mentioned functions or may be configured to realize the above-mentioned functions by combination with a program recorded in advance in a computer system, like a so-called differential file (differential program).
While preferred embodiments of the invention have been described and shown above, it should be understood that these are exemplary of the invention and are not to be considered as limiting. Additions, omissions, substitutions, and other modifications can be made without departing from the scope of the present invention. Accordingly, the invention is not to be considered as being limited by the foregoing description, and is only limited by the scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2013-261544 | Dec 2013 | JP | national |