This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2016-169999, filed on Aug. 31, 2016; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a signal processing system, a signal processing method, and a computer program product.
Consider capturing sounds in high quality for each of a plurality of sound sources when the microphone is located away from the sound sources. The microphone observes the signals coming from the sound sources mixed together in the space. For this reason, it is desirable either that the signals be separated for each sound source or, when a single sound source is targeted, that sound capture be performed while the signals coming from the other sound sources (noise sources) are suppressed. To this end, signal processing techniques have been proposed that enhance a target speech using multichannel acoustic signals obtained by a microphone array, that is, a plurality of microphones.
In the conventional techniques, however, variations in the acoustic characteristics of the space, deviations from the expected arrangement or sensitivity of the microphones, and other factors have in some cases decreased the accuracy of sound source estimation.
According to one embodiment, a signal processing system includes a filter unit, a conversion unit, a decomposition unit, and an estimation unit. The filter unit applies, to a plurality of time series input signals, N filters estimated by independent component analysis of the input signals to output N output signals. The conversion unit converts the output signals into nonnegative signals each taking on a nonnegative value. The decomposition unit decomposes the nonnegative signals into a spatial basis that includes nonnegative three-dimensional elements, that is, K first elements, N second elements, and I third elements, a spectral basis matrix of I rows and L columns that includes L nonnegative spectral basis vectors expressed by I-dimensional column vectors, and a nonnegative L-dimensional activity vector. The estimation unit estimates sound source signals representing signals of the signal sources based on the output signals using the spatial basis, the spectral basis matrix, and the activity vector.
Exemplary embodiments of a signal processing system according to the present invention are described below in detail with reference to the accompanying drawings.
Techniques have been proposed for estimating a sound source signal in a particular direction (region) on the basis of the outputs of a plurality of linear spatial filters. Such techniques estimate a sound source signal in a particular direction by, for example, modeling the power spectral densities of the output signals of the linear spatial filters as the product of the power spectral densities of the sound source signals in the respective directions (regions) and a gain matrix prepared in advance, and multiplying a (pseudo) inverse matrix of the gain matrix by the output vectors of the respective linear spatial filters. In doing so, the gain matrix is calculated in advance from the spatial arrangement of the microphones and the parameters of the linear spatial filters. As described above, a variation in the acoustic characteristics of the space and other factors may cause a mismatch between the environment assumed in advance and the actual environment in which the signals are observed, deteriorating the quality of the estimated results.
The signal processing system according to the first embodiment does not make such assumptions in advance; instead, it simultaneously estimates, from the observation signals themselves, information equivalent to the gain matrix together with the parameters of the observation signals. Sound source estimation of higher quality than before is thus possible. In the present embodiment, the model parameters for the processing are adaptively estimated from the input while spatial information obtained from the output of multichannel signal processing and from the observation signals is utilized. First, a plurality of output signals of multichannel signal processing are obtained so as to be separated for the individual sound sources as much as possible, by means of blind sound source separation, for example. The problem of sound source separation is then formulated as a problem of nonnegative tensor (matrix) factorization (NTF (NMF)), with the amplitude or power spectrum of the multichannel output signals viewed as a second-order or third-order tensor (matrix). The result of the factorization is used to constitute a noise suppression filter.
In the following embodiments, an example is described in which a sound source serves as a signal source and an acoustic signal (sound source signal) generated from the sound source serves as a signal source signal. The signal source and the signal source signal are not limited to a sound source and a sound source signal, respectively. Other signals (such as a brain wave signal and a radio wave signal) having a space propagation model similar to that of an acoustic signal may be applied as time series input signals, which are series of data points indexed in time order.
The microphone array 101 includes a plurality of microphones (sensors). Each microphone (detection unit) detects a sound source signal from a sound source. The microphone array 101 can observe acoustic signals at a plurality of points in a space. The acoustic signals observed at the respective points, even at the same time, differ from one another depending on the location of the sound source and acoustic characteristics of the space. Proper use of the difference between these acoustic signals realizes spatial filters. Signals acquired by the microphone array 101 are sometimes referred to as observation signals.
The filter unit 102 applies N (where N is an integer of 2 or greater) linear spatial filters having spatial characteristics different from one another to two or more observation signals observed using the microphone array 101, and outputs N output signals (spatial filter output signals). N linear spatial filters are also referred to as a spatial filter bank. Observation signals input to the filter unit 102 correspond to a plurality of time series input signals. If the signal source signal is a sound source signal, the observation signals observed using the microphone array 101 correspond to the time series input signal. If the signal source signal is other signal such as a brain wave signal and a radio wave signal, the observation signals observed using a sensor that detects the other signal correspond to the time series input signal. As described later, a proper combination of linear spatial filters can improve the final accuracy of estimating the sound source.
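As a rough illustration of what the filter unit 102 computes when the filters operate in the frequency domain, the following Python sketch applies a per-frequency separation matrix to multichannel observations; the function name, array shapes, and the use of NumPy are illustrative assumptions rather than the embodiment's actual implementation.

```python
import numpy as np

def apply_filter_bank(X, W):
    """Apply N linear spatial filters to M-channel observations.

    X : complex array, shape (I, J, M) -- observation STFT
        (I frequencies, J frames, M microphones)
    W : complex array, shape (I, N, M) -- one N x M spatial
        filter bank W_i per frequency
    Returns Y, shape (I, J, N): the filter bank outputs y_ij.
    """
    # y_ij = W_i x_ij for every frequency i and frame j
    return np.einsum('inm,ijm->ijn', W, X)
```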
The conversion unit 103 converts each output signal output from the filter unit 102 into a nonnegative signal taking on a nonnegative value. For example, the conversion unit 103 converts each output signal output from the filter unit 102 into a frequency domain signal by performing frequency analysis on the output signal. Furthermore, the conversion unit 103 converts a value of each frequency domain signal into a nonnegative value by taking an absolute value or a square of the absolute value for each time. The conversion unit 103 outputs N nonnegative signals thus obtained.
Any conventionally known method of frequency analysis, such as Fourier analysis, filter bank analysis, or wavelet analysis, can be applied. When the filter unit 102 applies the linear spatial filters in the frequency domain, it may input the frequency domain signals directly to the conversion unit 103, in which case the conversion unit 103 need not perform frequency analysis on the signals. Additionally, when the observation signals are mixed through an instantaneous mixing process in the frequency domain and observed by the microphones as such, the conversion unit 103 need not convert the observation signals into frequency domain signals.
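When the filter outputs are time-domain signals, the processing of the conversion unit 103 could be sketched as below: frequency analysis by STFT followed by taking the absolute value or its square per time-frequency point. The parameter values and the function name are assumptions for illustration.

```python
import numpy as np
from scipy.signal import stft

def to_nonnegative(y, fs=16000, nperseg=512, power=True):
    """Convert one filter output into a nonnegative time-frequency
    signal: STFT, then |.| or |.|^2 at each time-frequency point."""
    _, _, Y = stft(y, fs=fs, nperseg=nperseg)  # complex spectrogram (I x J)
    return np.abs(Y) ** 2 if power else np.abs(Y)
```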
The decomposition unit 110 decomposes each nonnegative signal into a spatial basis matrix and an activity vector (activity vector 1) using the NMF method. The spatial basis matrix is a matrix that includes nonnegative two-dimensional elements, that is, K (where K is an integer of 2 or greater according to the number of sound sources) first elements and N second elements. The activity vector is a nonnegative K-dimensional vector.
The decomposition unit 110 includes a spatial basis update unit 111 and an activity update unit 112. The spatial basis update unit 111 updates the spatial basis matrix with reference to its corresponding nonnegative signal and activity vector. The activity update unit 112 updates the activity vector with reference to its corresponding nonnegative signal and spatial basis matrix. The decomposition unit 110 repeats such update processing in order to improve the accuracy of decomposition.
The estimation unit 104 estimates a sound source signal on the basis of the output signal output from the filter unit 102 using the spatial basis matrix and the activity vector, and outputs the estimated signal (estimated sound source signal).
The units described above (the filter unit 102, the conversion unit 103, the decomposition unit 110, and the estimation unit 104) may be implemented by causing one or more processors such as a central processing unit (CPU) to execute a computer program, that is, via software, may be implemented via hardware such as one or more integrated circuits (IC), or may be implemented by combining both software and hardware.
The following describes signal processing performed by the signal processing system 100 thus configured according to the first embodiment with reference to
The filter unit 102 applies N linear spatial filters to the observation signals (input signals) observed by the microphone array 101, and outputs N output signals (Step S101). The conversion unit 103 converts the output signals into nonnegative signals (Step S102). The decomposition unit 110 decomposes the nonnegative signals into a spatial basis matrix and an activity vector (Step S103). The estimation unit 104 estimates sound source signals on the basis of the output signals using the spatial basis matrix and the activity vector, and outputs the estimated sound source signals (Step S104).
Observation and Decomposition Models in Power Spectral Domain Using Spatial Filter Bank
The following further describes the details of the present embodiment. Models for observing and decomposing signals using a spatial filter bank are described first. A spatial filter bank takes as input the observation signals observed by a plurality of microphones and outputs the respective output signals of a plurality of linear spatial filters. Here, an observation model is considered in which mixed signals are observed through this spatial filter bank system.
The model observes, using M microphones, acoustic signals coming from sound sources k (1≦k≦K) in directions θk viewed from the microphones in a space. This system is considered a linear time-invariant system. When the impulse response between a sound source and a microphone is sufficiently shorter than the window length of the short-time Fourier transform (STFT) applied to the observation signals, the relation between a sound source signal sijk and an observation signal xijk, for a frequency i (1≦i≦I, where I is an integer of 2 or greater) and a time j (1≦j≦J), can be represented by expression (1) below.
xijk = ai(θk)sijk (1)
Let ai (θk) represent a steering vector in the direction θk. The sound source signal sijk is expressed by a complex number, and the observation signal xijk and ai (θk) are each expressed by an M-dimensional complex number. The steering vector is uniquely determined between the sound source and the microphone array 101.
To simplify the description here, the steering vector is determined only by the direction θk viewed from the microphone array 101. In fact, the steering vector varies depending on various spatial factors, such as the distance between the microphone array 101 and the sound source, and the location of the microphone array 101 in a room, even if the same microphone array 101 is used.
Furthermore, when K sound sources are present, the observation signal xij can simply be represented by the sum of the observation signals from the respective sound sources, as shown in expression (2) below. Note that xij is expressed by an M-dimensional complex number.

xij = Σk ai(θk)sijk (2)
The observation signal xij can also be represented in a matrix form as shown in expression (3) below.
xij = Aisij (3)
Ai is a mixing matrix expressed by an M×K-dimensional complex number and defined as expression (4) below. sij is a sound source vector expressed by a K-dimensional complex number and defined as expression (5) below. “t” on the right side of expression (5) denotes the transpose of the matrix.
Ai = [ai(θ1) . . . ai(θk) . . . ai(θK)] (4)

sij = [sij1 . . . sijk . . . sijK]t (5)
It is now considered to obtain N output signals by applying N spatial filters to the observation signal. When output signals are expressed by an N-dimensional vector yij, an output signal yij can be represented as expression (6) below using a separation matrix Wi representing the N spatial filters. The separation matrix Wi is expressed by an N×M-dimensional complex number. A spatial filter group expressed by the separation matrix Wi is sometimes referred to as a spatial filter bank Wi.
yij = WiAisij (6)
In other words, the observation signal xij = Aisij is filtered by the spatial filter group Wi (the spatial filter bank), which has N mutually different spatial characteristics, and is thereby analyzed into N output signals.
Here, considering a matrix Gi that is defined as Gi = WiAi and expressed by an N×K-dimensional complex number, the output signal yij can further be represented as expression (7) below. The output signal yij corresponds to the N output signals output by the filter unit 102.
yij = Gisij (7)
If the steering vector ai(θk) in each direction could be accurately known in advance, Gi would be known, which would enable sij to be determined from yij. In practice, the assumed direction θk cannot be known in advance, and even if θk is known, there is a gap between the theoretical value and the actual value of the steering vector ai(θk). That is, the steering vector ai(θk) is difficult to estimate accurately.
Here, the problem is considered in the power domain. The n-th (1≦n≦N) element of yij, yijn = {yij}n, can be represented as expression (8) below using the element in the n-th row and k-th column of Gi, {Gi}nk.

yijn = Σk{Gi}nksijk (8)

Granted that the sound sources have no correlation with one another, taking the square of the absolute value of each term allows the approximation shown in expression (9) below.

|yijn|2 ≈ Σk|{Gi}nk|2|sijk|2 (9)

Thus, when the square of the absolute value of each element of a matrix B is expressed as |B|2, expression (7) can be approximated in the power domain as shown in expression (10). The conversion unit 103 converts the output signals into nonnegative signals by computing the left side of expression (10), for example.

|yij|2 ≈ |Gi|2|sij|2 (10)
Similarly to expression (7), if |Gi|2 is known, it is possible to estimate a power spectral density (PSD) vector |sij|2 of a sound source.
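The uncorrelated-sources approximation behind expressions (9) and (10) can also be checked numerically. The following sketch uses purely synthetic random data (all values are illustrative, not from the embodiment): for a fixed mixing matrix and independent sources, the time-averaged output power matches |G|² applied to the source powers.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, J = 4, 3, 20000
# fixed complex mixing matrix G_i and uncorrelated complex sources s_ij
G = rng.normal(size=(N, K)) + 1j * rng.normal(size=(N, K))
s = rng.normal(size=(K, J)) + 1j * rng.normal(size=(K, J))
y = G @ s
lhs = np.mean(np.abs(y) ** 2, axis=1)                     # average |y_ijn|^2 over j
rhs = (np.abs(G) ** 2) @ np.mean(np.abs(s) ** 2, axis=1)  # |G|^2 |s|^2 model
print(np.allclose(lhs, rhs, rtol=0.1))                    # True for large J
```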
In the local PSD estimation method, such as the method disclosed in Japanese Patent No. 4724054, a local space R(θk) = [θk−δ, θk+δ] having an angular width centered on the direction θk is defined instead of the direction θk itself, and the average power spectral density is considered for each local space. Accordingly, each element of Gi is replaced by the expectation represented by expression (11) below.

|{Gi}nk|2 = E[|whniai(θ)|2]θ∈R(θk) (11)

E[·] denotes an expectation operation, whni is the vector in the n-th row of the separation matrix Wi, and the symbol h denotes the Hermitian transpose of the matrix. In this manner, expression (10) can be used to estimate the PSD of a sound source in a local space having a certain range, instead of at a specific point whose location is difficult to specify. With a local space having a certain range, it also becomes realistic to assume that the location of a target sound source is estimated in advance in accordance with the application.
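For contrast with the adaptive approach of the present embodiment, expression (11) could be evaluated in advance roughly as follows. The sketch assumes a far-field steering vector for a linear microphone array and a uniform average over sampled directions; none of these choices comes from the source.

```python
import numpy as np

def local_psd_gain(w_ni, mic_pos, freq, theta_k, delta, c=340.0, n_grid=64):
    """Approximate |{G_i}_nk|^2 = E[|w_ni^h a_i(theta)|^2] over the local
    space R(theta_k) = [theta_k - delta, theta_k + delta] (expression (11)).

    w_ni    : complex (M,) -- n-th row of the separation matrix W_i
    mic_pos : float (M,)   -- microphone positions on a line [m]
    freq    : float        -- center frequency of bin i [Hz]
    """
    thetas = np.linspace(theta_k - delta, theta_k + delta, n_grid)
    # far-field steering vectors a_i(theta) for the sampled directions
    delays = np.outer(np.sin(thetas), mic_pos) / c       # (n_grid, M)
    A = np.exp(-2j * np.pi * freq * delays)
    # uniform average of |w^h a(theta)|^2 over the local region
    return np.mean(np.abs(A @ w_ni.conj()) ** 2)
```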
In order to calculate |{Gi}nk|2 in advance, the steering vector ai(θ) needs to be determined as shown in expression (11). However, the steering vector varies depending on the acoustic characteristics of the space, which are affected by the room or place used, and on deviations from the expected arrangement or sensitivity of the microphones, as described above. Consequently, the quality of sound source estimation may deteriorate.
For this reason, the present embodiment enables accurate estimation of the sound sources, independently of the accuracy of |{Gi}nk|2, by treating the problem of estimating the sound source PSD (power) as an NMF problem in the model of expression (10). Hereinafter, the operator |·|2 for the square of the absolute value of each element of a matrix is omitted for simplicity unless specifically mentioned.
Derivation of Multichannel Post Filter
It has been shown that observation signals can be represented in the power spectral domain by the decomposition model of expression (10) using a spatial filter bank. The following shows that this problem can be solved as an NMF problem.
First, the problem of expression (10) is described as an NMF problem at each frequency. Expression (12) below rewrites expression (10) with the operator |·|2 omitted.
yij ≈ Gisij (12)
In the local PSD estimation method, Gi is given in advance: ai(θ) in expression (11) needs to be calculated for each direction on the basis of information such as the microphone arrangement, and whni needs to be preset according to some criterion. Then, sij is calculated from yij using a (pseudo) inverse matrix of Gi. In doing so, elements of sij can become negative, which requires a correction such as setting the relevant terms to zero.
Because the elements of the matrices on both sides of expression (12) are all nonnegative, this problem can be treated as a typical NMF problem. NMF is the problem of decomposing the left side, in which all values are nonnegative, into the two matrices on the right side, in which all values are likewise nonnegative. Let Yi and Si be the matrices having the vectors yij and sij as their j-th columns, respectively. The problem can then be represented as expression (13) below and treated as an NMF problem. Yi is expressed by a nonnegative N×J-dimensional real number, and Si by a nonnegative K×J-dimensional real number.
Yi ≈ GiSi (13)
Thus, Gi may be unknown as well; Gi and Si can be estimated simultaneously. As described above, the method of the present embodiment can therefore be applied even if the microphone arrangement is unknown.
At this time, the k-th column of Gi corresponds to the output pattern obtained when only the signals from the sound source k pass through the spatial filter bank, that is, to the power ratio between the outputs of the respective spatial filters. As is evident from expression (12), this power ratio is constant regardless of the power (the sound source signal sijk) of the corresponding sound source k. Furthermore, if the spatial filter bank is properly set, the power ratio pattern differs greatly for each of the sound sources k. Applying NMF to expression (13) extracts, from the matrix Yi on the left side, the K different patterns that appear consistently across the J columns into the respective columns of the matrix Gi. Thus, the factorization should output, for each sound source, the pattern of power ratios between the outputs of the respective spatial filters of the bank described above.
Here, the PSD pattern that appears in each column of Gi is called a spatial basis vector, following the spectral basis vector used when NMF is applied to decompose the spectrogram of a one-channel signal. Additionally, Gi, in which the spatial basis vectors are arranged, is called a spatial basis matrix. Although each element of sij corresponds to the power of each sound source, its scale is arbitrary relative to Gi. For this reason, sij is called an activity vector, following the conventional terminology of NMF.
Sound source separation based on the fact that the power ratio is constant for each sound source has been formulated by NMF, as a problem of sound source separation and speech enhancement with a plurality of dispersedly arranged microphones, in M. Togami, Y. Kawaguchi, H. Kokubo and Y. Obuchi: "Acoustic echo suppressor with multichannel semi-blind non-negative matrix factorization", Proc. of APSIPA, pp. 522-525 (2010) (hereinafter, the non-patent document), for example. Such conventional methods differ from the present embodiment in that the formulation is applied directly to the observations of the plurality of microphones rather than to the output of a spatial filter bank.
As described above, in order for the sound sources to be decomposed into different patterns by NMF, different sound sources need to have different observation patterns. Techniques such as that of the non-patent document arrange the microphones apart from one another so that the PSD pattern of a sound source close to a particular microphone differs from that of a sound source far from any microphone. Specifically, they utilize the fact that the PSD observed by a microphone is larger the closer the signal source is to that microphone. In the PSD pattern of a sound source close to a particular microphone, the element observed by that microphone is therefore large and the other elements are small, whereas in the PSD pattern of a sound source far from any microphone, the differences in value between the elements are relatively small. Generating such patterns requires a specific assumption about the positional relation between the microphones and the sound sources.
By contrast, the present embodiment requires no such assumption about the microphone arrangement and the locations of the sound sources, because properly setting the spatial filter bank produces a difference in PSD pattern between the sound sources even if the microphones are close to one another. Varying the directional characteristics among the spatial filters constituting the spatial filter bank produces this difference in PSD pattern.
Furthermore, appropriately making the difference in PSD pattern large in accordance with the locations of the sound sources and the microphones can improve the accuracy of sound source estimation in the present embodiment. For example, a group of linear spatial filters used to separate the sound sources by frequency-domain independent component analysis is desirably used as the spatial filter bank. With such a configuration, each filter is learned so as to output a single sound source as separately as possible, so that the PSD pattern naturally differs for each sound source. Consequently, sound source estimation of higher quality can be expected from the nature of NMF described above. A spatial filter bank may also be made up of a group of beamformers each oriented in a different direction, for example. When the overall length of the microphone array used for observation is short or the number of microphones is small, however, the directivity cannot be made sharp, and the difference in PSD pattern between the sound sources cannot be increased. With a spatial filter bank based on independent component analysis, the spatial filters are configured in accordance with the corresponding observation signals, so that the difference in PSD pattern between the sound sources can be increased even with a microphone array of short overall length and few microphones.
A conventional general method can be used for the decomposition into the nonnegative matrices Gi and Si by the above-described NMF. For example, the decomposition unit 110 estimates Gi and Si so that the distance d(Yi, GiSi) between Yi and GiSi is small, under the condition that all the elements of Gi and Si are nonnegative. For the distance d(•, •), the square error (expression (16) described later), the Itakura-Saito distance (expression (20) described later), and other measures can be used. In doing so, a method can be used that estimates Gi and Si by an iterative update rule that ensures convergence to a local optimum solution.
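As a concrete sketch of such an iteration, the following implements expression (13) with the standard multiplicative updates for the squared Euclidean distance; the function name and the random initialization are assumptions, and the embodiment only requires some update rule with guaranteed convergence to a local optimum.

```python
import numpy as np

def nmf_per_frequency(Y, K, n_iter=100, eps=1e-12):
    """Decompose a nonnegative N x J matrix Y_i into G_i (N x K) and
    S_i (K x J) as in expression (13)."""
    N, J = Y.shape
    rng = np.random.default_rng(0)
    G = rng.random((N, K)) + eps  # spatial basis matrix (spatial basis update unit 111)
    S = rng.random((K, J)) + eps  # activities           (activity update unit 112)
    for _ in range(n_iter):
        # alternate multiplicative updates; every value stays nonnegative
        G *= (Y @ S.T) / (G @ S @ S.T + eps)
        S *= (G.T @ Y) / (G.T @ G @ S + eps)
    return G, S
```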
As described above, the signal processing system according to the first embodiment can estimate sound sources more accurately independently of a variation in acoustic characteristics of a space and other factors by applying nonnegative matrix factorization to each output signal output from the corresponding filter.
A signal processing system according to a second embodiment formulates the problem of sound source separation as an NTF problem, with the amplitude or power spectrum of the multichannel signals viewed as a third-order tensor. The second embodiment corresponds to an extension of the first embodiment, which formulated the problem as a decomposition for each frequency, in the frequency direction.
In the second embodiment, functions of the decomposition unit 110-2 and the estimation unit 104-2 differ from those of the equivalent of the first embodiment. The other configurations and functions are similar to those illustrated in
The decomposition unit 110-2 decomposes each nonnegative signal into a spatial basis, a spectral basis matrix, and an activity vector (activity vector 3) using the NTF method. The spatial basis is a tensor that includes nonnegative three-dimensional elements, that is, K (where K is an integer of 2 or greater according to the number of sound sources) first elements, N second elements, and I (where I is an integer of 2 or greater and denotes the number of frequencies) third elements. The spectral basis matrix is a matrix of I rows and L columns that includes L (where L is an integer of 2 or greater) nonnegative spectral basis vectors expressed by I-dimensional column vectors. The activity vector is a nonnegative L-dimensional vector.
The activity vector (activity vector 1) of the first embodiment can be calculated by the product of the spectral basis matrix and the activity vector (activity vector 3) of the second embodiment.
The decomposition unit 110-2 includes a spatial basis update unit 111-2, an activity update unit 112-2, and a spectral basis update unit 113-2. The spatial basis update unit 111-2 updates the spatial basis with reference to its corresponding output signal, spectral basis matrix, and the activity vector. The spectral basis update unit 113-2 updates the spectral basis matrix with reference to its corresponding output signal, spatial basis, and activity vector. The activity update unit 112-2 updates the activity vector with reference to its corresponding output signal, spatial basis, and spectral basis matrix. The decomposition unit 110-2 repeats such update processing in order to improve the accuracy of decomposition.
The estimation unit 104-2 estimates a sound source signal representing the signal of a signal source on the basis of the output signal using the spatial basis, the spectral basis matrix, and the activity vector, and outputs the estimated sound source signal (estimated sound source signal).
The flow of signal processing according to the second embodiment is similar to that of the signal processing (
The following shows that the problem of sound source separation, formulated by being extended in the frequency direction, can be solved as an NTF problem. In expressions (12) and (13) above, the decomposition is considered for each frequency, which generally involves a permutation problem of determining which spatial basis belongs to which sound source at each frequency.
The present embodiment addresses the permutation problem by introducing a spectral basis in addition to the spatial basis. This is based on the assumption that values for power components of signals coming from the same sound source vary in synchronization with one another in all frequencies.
Because the number of sound sources is often smaller than the number of input channels, accurate separation by NMF for each frequency has conventionally been difficult without additional measures, such as including a penalty term in the objective function of NMF or learning bases in advance. As in the present embodiment, introducing a spectral basis that associates the frequencies with one another adds a constraint between frequencies, enabling accurate separation without such measures.
First, the decomposition shown in expression (14) below is considered for the output {yij}n of the spatial filter bank.

yijn ≈ ŷijn = Σk gink Σl t(k)ilv(k)lj (14)

Here, gink is a coefficient of the spatial basis (a redefinition of the earlier notation). t(k)il is a coefficient of the spectral basis of the sound source k, and v(k)lj is a coefficient of the activity.
These coefficients are all nonnegative real numbers. l (1≦l≦L) denotes an index of the spectral basis.
Here, each sound source has L separate spectral bases. Alternatively, L may be different depending on the sound source, or sound sources may share a spectral basis.
Expression (14) shows a problem of decomposing a third-order tensor {yijn} of a nonnegative element into tensors {gink}, {t(k)il}, and {v(k)lj} having nonnegative values, and can be considered as a type of NTF problem.
The NTF of the present embodiment optimizes the coefficients gink, t(k)il, and v(k)lj so as to decrease the distance between the observation signal yijn obtained by the spatial filter bank and the estimated value ŷijn obtained through the decomposition, similarly to NMF. That is, when the distance between x and y is d(x, y), the problem expressed by expression (15) below is solved, with all coefficients constrained to be nonnegative.

min Σi,j,n d(yijn, ŷijn) (15)
This problem can use an estimation method that is based on an update rule using the auxiliary function method and ensures convergence on a local optimum solution, similarly to NMF.
The distance criterion d in this case can be selected in accordance with the purpose. When the square error (the Euclidean distance) dEuc represented by expression (16) below is used as the distance criterion, the update rule for each of the coefficients is represented as expressions (17), (18), and (19). Note that yijn in this case is not a power spectrum but an amplitude spectrum.

dEuc(x, y) = (x - y)2 (16)

When the Itakura-Saito distance dIS represented by expression (20) below is used as the distance criterion, the update rule for each of the coefficients is represented as expressions (21), (22), and (23). Note that yijn in this case is a power spectrum. A more general update expression based on the β-divergence may also be applied.

dIS(x, y) = x/y - log(x/y) - 1 (20)
In order to eliminate arbitrariness between the basis and the activity, gink and t(k)il are subjected to normalization represented by expressions (24) and (25) below for each update.
The decomposition unit 110-2 performs the updates in the order of expressions (17), (24), (18), (25), and (19), or in the order of expressions (21), (24), (22), (25), and (23), as one round of updating, and repeats such rounds.
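Because expressions (17) through (25) are not reproduced above, the sketch below stands in for one round of updating: standard multiplicative updates for the Euclidean distance applied to the model of expression (14), with sum-to-one normalizations assumed for expressions (24) and (25). The update forms, the normalization axes, and all names are assumptions, not the embodiment's exact rules.

```python
import numpy as np

def ntf_step(Y, G, T, V, eps=1e-12):
    """One round of NTF updates for yhat_ijn = sum_k g_ink sum_l t(k)_il v(k)_lj.

    Y : (I, J, N) nonnegative filter bank outputs
    G : (I, N, K) spatial basis      T : (K, I, L) spectral bases
    V : (K, L, J) activities
    """
    U = np.einsum('kil,klj->kij', T, V)              # per-source model spectrograms
    Yhat = np.einsum('ink,kij->ijn', G, U) + eps
    # spatial basis update (stand-in for expression (17))
    G *= np.einsum('ijn,kij->ink', Y, U) / (np.einsum('ijn,kij->ink', Yhat, U) + eps)
    G /= G.sum(axis=1, keepdims=True) + eps          # normalization, expression (24)
    Yhat = np.einsum('ink,kij->ijn', G, U) + eps
    # spectral basis update (stand-in for expression (18))
    T *= (np.einsum('ijn,ink,klj->kil', Y, G, V)
          / (np.einsum('ijn,ink,klj->kil', Yhat, G, V) + eps))
    T /= T.sum(axis=1, keepdims=True) + eps          # normalization, expression (25)
    U = np.einsum('kil,klj->kij', T, V)
    Yhat = np.einsum('ink,kij->ijn', G, U) + eps
    # activity update (stand-in for expression (19))
    V *= (np.einsum('ijn,ink,kil->klj', Y, G, T)
          / (np.einsum('ijn,ink,kil->klj', Yhat, G, T) + eps))
    return G, T, V
```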
As described above, the signal processing system according to the second embodiment can estimate sound sources more accurately independently of a variation in acoustic characteristics of a space and other factors by applying nonnegative tensor factorization to each output signal output from the corresponding filter.
Application to Speech Enhancement and Sound Source Separation
In order to perform speech enhancement or sound source separation using the coefficients obtained through the NMF (first embodiment) or NTF (second embodiment), the estimated coefficients are used to obtain a gain coefficient or a separation matrix, which is then applied.
For the n-th filter bank output yijn, a gain coefficient hijnk to estimate a component of the sound sources k can be calculated as shown in expression (26) below, for example.
This is used to estimate a complex spectral component zijnk of the sound sources k as shown in expression (27) below on the basis of the filter bank output yijn (here, a complex spectrum, not the power spectrum taking |·|2).
zijnk = hijnk·yijn (27)
In this case, any component that has already been lost in the n-th filter bank output cannot be restored, even if it remains in outputs other than the n-th. Alternatively, a separation matrix Hij in the amplitude or power domain may be considered. Hij is expressed by a K×N-dimensional real number.
In this case, the estimated sound source complex spectrum zijk of the sound sources k can be found from expression (29) below. Here, the filter bank output yijn is also a complex spectrum.
Note that the methods of speech enhancement and sound source separation shown in expressions (27) and (29) are merely examples. For example, the square root of the right side of expression (26) or (28) may be taken, or the terms in the numerator and denominator of expression (26) or (28) may be raised to the p-th power and the q-th root taken of the entire right side. Methods such as minimum mean square error short-time spectral amplitude (MMSE-STSA) estimation may also be used.
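Since expressions (26) and (28) are not reproduced above, the sketch below assumes the common Wiener-like ratio form for the gain hijnk, namely the modeled power of the sound source k divided by the total modeled power in output n, and applies it per time-frequency point as in expression (27). This form is an assumption, not the patent's exact expression.

```python
import numpy as np

def estimate_source_k(Y_complex, G, T, V, k, eps=1e-12):
    """Estimate the complex spectral component z_ijnk of sound source k
    from the complex filter bank output (expression (27)).

    Y_complex : (I, J, N) complex filter bank outputs
    G, T, V   : NTF factors, shaped as in the earlier sketch
    """
    U = np.einsum('kil,klj->kij', T, V)        # modeled source spectrograms
    P = np.einsum('ink,kij->ijnk', G, U)       # per-source modeled power per output
    h = P[..., k] / (P.sum(axis=-1) + eps)     # assumed Wiener-like gain h_ijnk
    return h * Y_complex                       # z_ijnk = h_ijnk * y_ijn
```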
Semi-Supervised Learning for Speech Enhancement
Because no information on the sound sources k is provided in advance in the coefficient updates described above, which sound source is the desired one cannot be known directly, as in the typical problem of blind sound source separation. For application to speech enhancement, assuming the number of sound sources is K = 2, for example, the two sound sources of speech and noise are considered, but it is unknown to which of them k = 1 corresponds.
Here, bases learned in advance from clean speech are set for all the spectral bases t(k=1)il of k = 1 during learning. No update is then performed for the k = 1 coefficients alone in the update rule of expression (18) or (22). The signal corresponding to k = 1 can thus be expected to be a speech signal. Because the k = 1 spectral bases are not updated, a reduction in the amount of calculation during learning can also be expected.
Bases learned in advance from clean speech (learning data) may instead be set for the k = 1 spectral bases merely as learning initial values. In this case, the amount of calculation for updates during learning increases. When the observed speech is distorted compared with the clean speech learned in advance, however, the effect of adapting the speech spectral bases to the distortion through learning can be expected.
When clean speech bases are set for only part of the k = 1 spectral bases, which are then left unupdated during learning, while the remainder of the k = 1 bases and all the k ≠ 1 bases are updated, noise coming from the direction of k = 1, which is assumed to be speech, can be learned by the bases other than the fixed speech bases. Consequently, noise coming from the same direction as the k = 1 sound source can also be separated from the speech.
The learning initial value is not limited to the above. A value calculated from spatial arrangement of a microphone array and linear spatial filters, for example, may be set as a learning initial value.
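The semi-supervised scheme above can be sketched as follows, reusing ntf_step from the earlier example: the spectral bases of k = 1 (index 0 here) are set from bases pre-trained on clean speech and restored after every round, so that the update of expression (18) or (22) is effectively skipped for them. T_speech and all names are hypothetical.

```python
def semi_supervised_ntf(Y, G, T, V, T_speech, n_iter=100):
    """Run NTF with the k = 1 spectral bases fixed to pre-trained
    clean-speech bases (T_speech: assumed (I, L) nonnegative array)."""
    T = T.copy()
    T[0] = T_speech                    # k = 1 is mapped to index 0 here
    for _ in range(n_iter):
        G, T, V = ntf_step(Y, G, T, V)
        T[0] = T_speech                # undo the update for k = 1 alone
    return G, T, V
```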
In a third embodiment, an example of applying a signal processing system to a speech input device is described. The signal processing system of the present embodiment accurately recognizes speech using an estimated sound source signal even in an environment in which speech recognition (a technique of converting speech into text) is usually difficult, such as under noise. The system then performs control, such as using the result to operate equipment and displaying the result of speech recognition for a user.
The third embodiment differs from the first embodiment in that the identification unit 105-3, the calculation unit 106-3, the output control unit 107-3, and the display unit 120-3 are added. The other configurations and functions are similar to those illustrated in
The identification unit 105-3 performs identification processing based on a sound source signal. For example, the identification unit 105-3 identifies a category of a signal at each time for estimated sound source signals obtained by the estimation unit 104. When the signal is an acoustic signal and the sound source is uttered speech, for example, the identification unit 105-3 identifies a phoneme for each time, transcribes contents uttered by a speaker (performs what is called speech recognition), and outputs the recognition result. In this manner, category identification includes processing of identifying the type or the contents of speech uttered by the user. Examples of the category identification include continuous speech recognition that uses the phoneme identification described above, specific keyword detection for detecting the presence of an uttered specific word, and speech detection for simply detecting the presence of uttered speech.
The calculation unit 106-3 calculates the degree of separation indicating a degree that a signal source is separated by the filter unit 102, based on a distribution of values of spatial bases (spatial basis matrix), for example. The degree of separation indicates the extent to which a sound source signal is separated from the other sound source signals.
The output control unit 107-3 performs control so as to change the output of the result of the identification processing performed by the identification unit 105-3 in accordance with the degree of separation. For example, the output control unit 107-3 controls the display on the display unit 120-3 on the basis of the category obtained by the identification unit 105-3. In doing so, the output control unit 107-3 changes the display mode with reference to the degree of separation output from the calculation unit 106-3. For example, if the degree of separation is low, the output control unit 107-3 considers that the estimation accuracy of the sound source signal estimated by the estimation unit 104 is also low and that the result from the identification unit 105-3 is unreliable, and displays the reason together with a message or the like prompting the speaker, who is the user, to utter again.
The display unit 120-3 is a device such as a display that displays various types of information including images, videos, and speech signals. The output control unit 107-3 controls the contents displayed on the display unit 120-3.
A method of outputting information is not limited to display of an image, for example. A method of outputting speech may be used. In this case, the system may include a speech output unit such as a loudspeaker with the display unit 120-3, or in place of the display unit 120-3. The system may also be configured to control operation of equipment, for example, using an identification result.
As described above, the calculation unit 106-3 calculates the degree of separation indicating how well a sound source signal can be estimated, and the output control unit 107-3 uses the calculated result to control the output; this is why the present embodiment is not merely a combination of a signal processing device and other devices.
The following describes signal processing performed by the signal processing system 100-3 thus configured according to the third embodiment with reference to
The signal processing from Step S201 to Step S204 is similar to the processing from Step S101 to Step S104 in the signal processing system 100 according to the first embodiment, and thus description thereof is omitted.
The identification unit 105-3 performs identification processing on the signals (estimated sound source signals) estimated by the estimation unit 104, and outputs an identification result (such as a category) (Step S205). The calculation unit 106-3 calculates the degree of separation based on the spatial basis (Step S206). The output control unit 107-3 controls output of the identification result in accordance with the calculated degree of separation (Step S207).
The following describes a specific example of how to calculate the degree of separation. The k-th column vector gik of the spatial basis matrix Gi in expression (13) represents the PSD output pattern of the sound source k in the spatial filter output. If the sound sources k are sufficiently separated by the linear spatial filters of the filter unit 102, only one or a few elements of gik should have large values while the remainder have small values. Consequently, whether the sound sources are sufficiently separated at the filter unit 102 can be determined by checking the sparseness in the magnitudes of the values of the elements of gik (the distribution of the values). Furthermore, a prerequisite for the estimation unit 104 to estimate sound sources more accurately is that the sound source signals are separated at the filter unit 102 to some extent. The accuracy of the estimated sound source signals input to the identification unit 105-3 can therefore also be gauged by checking the sparseness in the magnitudes of the values of the elements of gik.
The sparseness in the magnitudes of the values of the elements of gik can be quantified by calculating entropy as shown in expression (30) below, for example, where gn denotes the n-th element of a column vector g.

H(g) = -Σn gn log gn (30)

The column vector g is assumed to be normalized as shown in expression (31) below.

Σn gn = 1 (31)

H(g) is smaller with a larger sparseness in the values, whereas H(g) is larger with a smaller sparseness. For example, the reciprocal 1/H(g) of the entropy in expression (30) is taken as the degree of separation of the sound sources k. In practice, expression (30) is used with a cumulative sum also taken in the frequency direction i, for example.
Whether the decomposition unit 110 can accurately decompose the signals depends on whether the difference in PSD pattern between the sound sources in the spatial filter output is sufficiently large. When the spatial basis vectors gik of different sound sources are similar to one another, specifically, when the square error between them is small, the signals are unlikely to be sufficiently separated. The reciprocal of such a similarity may therefore also be output as the degree of separation.
The calculation unit 106-3 may also calculate the degree of separation using the activity vector (activity vector 1) instead of the spatial basis matrix. For example, the calculation unit 106-3 may calculate the entropy H(sij) using the activity vector sij in place of the column vector gik of the spatial basis matrix in expressions (30) and (31). If speech is input from one direction and the sound source is sufficiently estimated, the values of the activity vector become sparse and H(sij) decreases. Thus, H(sij) can be used as the degree of separation in the same way as H(g).
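A minimal sketch of the entropy-based degree of separation of expressions (30) and (31) (the function name and the smoothing constant are assumptions):

```python
import numpy as np

def degree_of_separation(g, eps=1e-12):
    """1/H(g) for a nonnegative spatial basis (or activity) vector g."""
    g = g / (g.sum() + eps)               # normalization of expression (31)
    H = -np.sum(g * np.log(g + eps))      # entropy of expression (30)
    return 1.0 / (H + eps)                # sparse g -> small H -> high degree
```

As noted above, in practice the entropy could be accumulated over the frequency direction i before taking the reciprocal.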
Use Cases of Signal Processing Systems
Actual use cases of the signal processing systems described above are described.
Case 1: Meeting Transcription System
As a use case, a meeting transcription system is considered that is set up in a meeting room during a meeting and transcribes utterance contents of the meeting. The system includes one of the signal processing systems of the above embodiments, and is set up in the center of a meeting table in the meeting room, for example. A plurality of microphones provided to a main unit observe speech signals coming from a plurality of speakers, and output estimated sound source signals estimated for each speaker. A speech recognition device (the identification unit 105-3) recognizes the estimated sound source signals output for each speaker, and converts utterance contents of each speaker into characters. The transcribed contents can be utilized later to review the details of the meeting.
In speech recognition performed on speech recorded using a microphone set up at a location away from the speaker, the influence of the speech of other speakers, reverberation in the room, ambient noise, and self-noise caused by the electric circuits connected to the microphone reduces the accuracy of correct transcription. An estimation device for estimating the sound source signals is thus required to eliminate these influences. With the signal processing systems of the above embodiments, the speech signal of each speaker can be estimated more accurately than with conventional methods, improving the accuracy of speech recognition.
In the signal processing systems of the above embodiments, because arrangement of microphones does not need to be known in advance, the microphones may be moved individually. For example, locating some microphones near meeting participants can further improve the accuracy of speech recognition. Additionally, flexible operation is possible. The location of the microphones may be changed in each meeting, for example.
With the mechanism using the calculation unit 106-3, the signal processing system itself can determine that a user's speech has not been sufficiently estimated. If the time is recorded together with the meeting speech, a user of the transcription or an assistant operating the transcription system can listen again to the meeting speech corresponding to that time, so that recognition errors in the transcribed text can be corrected more quickly than by listening to the entire recording again.
In particular, when the speech of a specific speaker continues to be insufficiently estimated, the likely causes are that the microphone is located away from that participant or that the directivity of the microphone is not directed at the participant. In such a case, the system can notify the meeting participant that the utterances have not been captured properly and prompt the participant to relocate the microphone, by moving it closer or directing it toward the participant.
Case 2: Speech Response System
As another use case, a speech response system under noise is considered. The speech response system receives a question or a request from a user by speech, understands the content, and accesses a database, for example, in order to present a response desired by the user. If the system is installed in a public space such as a station and a store, it cannot catch a user's speech correctly in some cases. For this reason, the speech input device of the above embodiment is applied to the speech response system.
Similarly to the meeting transcription use case described above, user speech of higher quality, that is, speech with noise suppressed more appropriately, can be obtained with the above embodiments. Thus, the speech response system can provide the user with a more appropriate response than conventional systems.
With a mechanism using the calculation unit 106-3, the signal processing system itself can determine that the user's speech is not sufficiently estimated. In such a case, the user can be notified that utterance given by the user has not been caught properly and prompted to utter again.
Consequently, the mechanism can prevent the system from mistakenly catching and understanding a question of the user and responding improperly.
As described above, according to the first to the third embodiments, sound sources can be estimated more accurately independently of a variation in acoustic characteristics of a space and other factors.
The following describes the hardware configuration of the signal processing systems according to the first to the third embodiments with reference to
The signal processing systems according to the first to the third embodiments include a control unit such as a central processing unit (CPU) 51, a storage device such as a read only memory (ROM) 52 and a random access memory (RAM) 53, a communication I/F 54 connected to a network for communications, and a bus 61 for connecting the units.
A computer program to be executed on the signal processing systems according to the first to the third embodiments is preinstalled and provided on the ROM 52, for example.
A computer program to be executed on the signal processing systems according to the first to the third embodiments may be recorded and provided as an installable or executable file on a computer-readable recording medium such as a compact disc read only memory (CD-ROM), a flexible disc (FD), a compact disc recordable (CD-R), and a digital versatile disc (DVD); and can be provided as a computer program product.
Furthermore, a computer program to be executed on the signal processing systems according to the first to the third embodiments may be stored on a computer connected to a network such as the Internet to be provided by being downloaded via the network. A computer program to be executed on the signal processing systems according to the first to the third embodiments may be provided or distributed via a network such as the Internet.
A computer program to be executed on the signal processing systems according to the first to the third embodiments can cause a computer to function as the units of the signal processing systems described above. In this computer, the CPU 51 can read the computer program from a computer-readable storage medium, load it onto a main memory, and execute it.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.