This application is based upon and claims the benefit of priority from prior Japanese Patent Application No. 2005-084443, filed Mar. 23, 2005, the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to acoustic signal processing, and particularly to estimation of the number of sound sources emitting acoustic waves propagating through a medium, the directions of the sound sources, the frequency components of the acoustic waves coming from the sound sources, and the like.
2. Description of the Related Art
Recently, sound source localization and separation systems have been proposed in the field of robot auditory research. In such a system, the number and directions of plural target sound sources are estimated under a noise environment (sound source localization), and each of the source sounds is separated and extracted (sound source separation). For example, F. Asano, "dividing sounds" Instrument and Control vol. 43, No. 4, p 325-330 (2004) discloses a method in which N source sounds are observed by M microphones in an environment in which background noise exists, a spatial correlation matrix is generated from data obtained by performing a short-time Fourier transform on each microphone output, and main eigenvalues having larger values are determined by eigenvalue decomposition, thereby estimating the number N of sound sources as the number of main eigenvalues. This utilizes the characteristic that the source sounds, which have a directional property, are mapped only to the main eigenvalues, while the background noise, which has no directional property, is mapped to all the eigenvalues.
Namely, the eigenvectors corresponding to the main eigenvalues become the basis vectors of the signal subspace spanned by the signals from the sound sources, and the eigenvectors corresponding to the remaining eigenvalues become the basis vectors of the noise subspace spanned by the background noise signal. The position vector of each sound source can be searched for by applying the MUSIC method with the basis vectors of the noise subspace, and the sound from each sound source can be extracted by a beam former whose directivity is pointed in the direction obtained as a result of the search.
However, the noise subspace cannot be defined when the number N of sound sources is equal to the number M of microphones, and undetectable sound sources exist when N exceeds M. Therefore, the number of estimable sound sources is necessarily lower than the number M of microphones. This method imposes no particularly strong limitation on the sound sources and is mathematically simple. However, in order to deal with many sound sources, it has the limitation that the number of microphones must exceed the number of sound sources.
A method in which the sound source localization and the sound source separation are performed using a pair of microphones is described in K. Nakadai et al., "real time active chase of person by hierarchy integration of audio-visual information" Japan Society for Artificial Intelligence AI Challenge Kenkyuukai, SIG-Challenge-0113-5, p 35-42, June 2001. This method focuses on the harmonic structure (a frequency structure including a fundamental wave and its harmonics) unique to sounds generated through a tube (articulator), such as the human voice: harmonic structures having different fundamental frequencies are detected from the Fourier transform of the sound signals obtained by the microphones. The number of detected harmonic structures is set at the number of speakers, the direction of each speaker is estimated with a certainty factor using the interaural phase difference (IPD) and interaural intensity difference (IID) of each harmonic structure, and each source sound is estimated from the harmonic structure itself. Because plural harmonic structures can be detected from one Fourier transform, this method can deal with a number of sound sources that is not lower than the number of microphones. However, since the estimation of the number, directions, and contents of the sound sources is all based on the harmonic structure, the sound sources that can be dealt with are limited to sounds having a harmonic structure, such as the human voice, and the method cannot be adapted to a wider variety of sounds.
Thus, the conventional methods face a dilemma: (1) when no limitation is imposed on the sound sources, the number of sound sources must be lower than the number of microphones, and (2) when a number of sound sources not lower than the number of microphones is to be handled, a limitation such as the assumption of a harmonic structure is imposed on the sound sources. A system that can deal with a number of sound sources not lower than the number of microphones without limiting the kinds of sound sources has not been established yet.
In view of the foregoing, an object of the invention is to provide an acoustic signal processing apparatus, an acoustic signal processing method, and an acoustic signal processing program for sound source localization and sound source separation, in which the limitation on the sound sources is relaxed and a number of sound sources not lower than the number of microphones can be dealt with, as well as a computer-readable recording medium in which the acoustic signal processing program is recorded.
According to one aspect of the present invention, there is provided an acoustic signal processing apparatus comprising: an acoustic signal input device configured to input n acoustic signals including voice from a sound source, the n acoustic signals being detected at n different points (n is a natural number 3 or more); a frequency resolution device configured to resolve each of the acoustic signals into a plurality of frequency components to obtain n pieces of frequency resolved information including phase information of each frequency component; a two-dimensional data generating device configured to compute phase difference between a pair of pieces of frequency resolved information in each frequency component with respect to m pairs of pieces of frequency resolved information different from each other in the n pieces of frequency resolved information (m is a natural number 2 or more), the two-dimensional data generating device generating m pieces of two-dimensional data in which a frequency function is set at a first axis and a function of the phase difference is set at a second axis; a graphics detection device configured to detect predetermined graphics from each piece of the two-dimensional data; a sound source candidate information generating device configured to generate sound source candidate information including at least one of the number of a plurality of sound source candidates, a spatial existence range of each sound source candidate, and the frequency component of the acoustic signal from each sound source candidate based on each of the detected graphics, the sound source candidate information generating device generating corresponding information indicating a corresponding relationship between the pieces of sound source candidate information; and a sound source information generating device configured to generate sound source information including at least one of the number of sound sources, the spatial existence range of the sound source, an existence period of the voice, a frequency component configuration of the voice, amplitude information on the voice, and symbolic contents of the voice based on the sound source candidate information and corresponding information which are generated by the sound source candidate information generating device.
Embodiments of the invention will be described below with reference to the accompanying drawings.
As shown in the figure, the acoustic signal processing apparatus according to the embodiment includes microphones 1a to 1c, an acoustic signal input unit 2, a frequency resolution unit 3, a two-dimensional data generating unit 4, a graphics detection unit 5, a graphics matching unit 6, a sound source information generating unit 7, an output unit 8, and a user interface unit 9.
[Basic Concept of Sound Source Estimation Based on Phase Difference in Each Frequency Component]
The microphones 1a to 1c are arranged at predetermined intervals in a medium such as air. The microphones 1a to 1c convert medium vibrations (acoustic waves) at n different points into electric signals (acoustic signals), and form m different pairs of microphones (m is a natural number larger than 1).
The acoustic signal input unit 2 periodically performs analog-to-digital conversion of the n-channel acoustic signals obtained by the microphones 1a to 1c at a predetermined sampling frequency Fr, thereby generating n-channel digitized amplitude data in time series.
Assuming that the sound source is located sufficiently far away compared with the distance between the microphones, as shown in the figure, the acoustic wave arrives at the two microphones of a pair as a substantially plane wave, and is received with an arrival time difference ΔT that depends on the sound source direction.
K. Suzuki et al., implementation of "coming by an oral command" function of home robots by audio-visual association, Proceedings of Fourth Conference of the Society of Instrument and Control Engineers System Integration Division (SI2003), 2F4-5 (2003) discloses a method in which the arrival time difference ΔT between two acoustic signals (103 and 104 in the figure) is derived by searching, through pattern matching, which part of one piece of amplitude data resembles which part of the other piece of amplitude data.
In the embodiment, the inputted amplitude data is instead analyzed by resolving it into the phase difference of each frequency component. Even if plural sound sources exist, the phase difference corresponding to each sound source direction is observed between the two pieces of data for the frequency components unique to that sound source. Therefore, if the phase differences of the frequency components can be classified into groups of the same sound source direction without assuming strong limitations on the sound sources, the number of sound sources, the direction of each sound source, and the main characteristic frequency components generated by each sound source can be grasped for wide-ranging sound sources. Although this is a straightforward idea, there are problems which need to be overcome when actual data is analyzed. The functional blocks used for the grouping (the frequency resolution unit 3, the two-dimensional data generating unit 4, and the graphics detection unit 5) will be described below along with those problems.
[Frequency Resolution Unit 3]
The fast Fourier transform (FFT) can be cited as a general technique of resolving the amplitude data into frequency components. The Cooley-Tukey DFT algorithm is known as a representative algorithm.
As shown in the figure, the FFT is performed on a frame of N consecutive samples extracted from the amplitude data, while the extraction position is shifted by a predetermined frame shift amount Fs.
After a windowing process (120 in the figure) is performed on the extracted frame, the FFT (121 in the figure) is executed, and the result is stored as a real part R(k) and an imaginary part I(k) in a buffer (122 in the figure).
At this point, the generated short-time Fourier transform data becomes the data in which the amplitude data of the frame is resolved into N/2 frequency components, and the numerical values of the real part R(k) and the imaginary part I(k) in the buffer 122 indicate a point Pk on a complex coordinate system 123 for the k-th frequency component fk, as shown in the figure. The distance between Pk and the origin gives the power Po(fk) of the component, and the rotational angle of Pk gives its phase Ph(fk).
When the sampling frequency is set at Fr (Hz) and the frame length is set at N (samples), k runs over integer values from 0 to (N/2)−1, where k=0 expresses 0 Hz (the direct-current component) and k=(N/2)−1 expresses Fr/2 (Hz) (the highest frequency component). The frequencies are equally spaced between these two with frequency resolution Δf=(Fr/2)/((N/2)−1) (Hz), and the frequency for each k is expressed by fk=k·Δf.
As described above, the frequency resolution unit 3 generates the frequency-resolved data in time series by continuously performing the process at predetermined intervals (the amount of frame shift Fs). The frequency-resolved data includes a power value and a phase value in each frequency of the inputted amplitude data.
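As an illustration of the processing in this unit, the following Python sketch computes the power and phase of each frequency component of one frame. The function name, the window choice, and the use of NumPy are illustrative assumptions, not the patent's reference implementation.

    import numpy as np

    def frequency_resolve(frame, Fr):
        """Short-time FFT of one frame of N amplitude samples sampled at Fr (Hz)."""
        N = len(frame)
        windowed = frame * np.hanning(N)           # windowing process before the FFT
        spec = np.fft.rfft(windowed)[:N // 2]      # keep components k = 0 .. N/2 - 1
        power = np.abs(spec) ** 2                  # Po(fk): squared distance from the origin
        phase = np.angle(spec)                     # Ph(fk): rotational angle of Pk
        df = (Fr / 2) / (N / 2 - 1)                # frequency resolution, as in the text
        freqs = np.arange(N // 2) * df             # fk = k * df
        return freqs, power, phase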
[Two-Dimensional Data Generating Unit 4 and Graphics Detection Unit 5]
As shown in the figure, the two-dimensional data generating unit 4 includes a phase difference computing unit 301 and a coordinate value determining unit 302, and the graphics detection unit 5 includes a voting unit 303 and a straight-line detection unit 304.
[Phase Difference Computing Unit 301]
The phase difference computing unit 301 compares the two pieces of frequency-resolved data a and b obtained by the frequency resolution unit 3 at the same time, and generates phase difference data by computing the difference ΔPh(fk) between the phase values of a and b for each frequency component fk, as shown in the figure.
[Coordinate Value Determining Unit 302]
Based on the phase difference data obtained by the phase difference computing unit 301, the coordinate value determining unit 302 determines a coordinate value for treating each phase difference as a point on a predetermined two-dimensional XY coordinate system. The X-coordinate value x(fk) and the Y-coordinate value y(fk) corresponding to the phase difference ΔPh(fk) of the frequency component fk are determined as x(fk)=ΔPh(fk) and y(fk)=k, i.e. the phase difference is plotted on the X-axis and the frequency component number k on the Y-axis.
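A minimal sketch of the phase difference computing unit 301 and the coordinate value determining unit 302 under the conventions above (x(fk)=ΔPh(fk), y(fk)=k); the wrapping of the phase difference into one cycle anticipates the cyclicity discussed in the following subsections.

    import numpy as np

    def phase_difference(phase_a, phase_b):
        """Per-component phase difference between data a and b, wrapped into [-pi, pi)."""
        dph = phase_a - phase_b
        return (dph + np.pi) % (2 * np.pi) - np.pi

    def to_xy(dph):
        """Treat each frequency component k as a point on the XY coordinate system."""
        x = dph                        # x(fk) = dPh(fk): phase difference on the X-axis
        y = np.arange(len(dph))        # y(fk) = k: frequency component number on the Y-axis
        return x, y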
[Frequency Proportionality of Phase Difference for the Same Time Difference]
The phase differences computed for the frequency components by the phase difference computing unit 301 all derive from the same arrival time difference ΔT when they derive from the same sound source. Since ΔPh(fk)=2π·fk·ΔT, the phase difference is proportional to the frequency, and the points corresponding to one sound source are arranged on a line passing through the origin of the XY coordinate system, whose gradient corresponds to the sound source direction.
[Cyclicity of Phase Difference]
However, the proportionality between the frequency and the inter-microphone phase difference does not hold over the whole frequency range as a directly observable value. The phase value available for each frequency component is obtained only as the rotational angle on the complex coordinate system, so that the phase difference can only be observed as a value folded into one cycle (±π). Therefore, as the frequency becomes higher, the true phase difference exceeds this range and is observed as a value shifted by an integral multiple of 2π. This is the cyclicity of the phase difference.
[Phase Difference When Plural Sound Sources Exist]
On the other hand, when acoustic waves are generated from plural sound sources, the frequency-phase difference plot schematically becomes a mixture of point groups, each arranged along a different line corresponding to one of the sound source directions.
The problem of estimating the number of source sounds and the directions of the sound sources thus comes down to the discovery of such lines from the plotted point group.
[Voting Unit 303]
As described later, the voting unit 303 applies a linear Hough transform to each frequency component to which the (x, y) coordinate is given by the coordinate value determining unit 302, and votes its locus into a Hough voting space by a predetermined method. The Hough transform is described in A. Okazaki, "Primary image processing," Kogyotyousakai, p 100-102 (2000), but it is briefly reviewed here.
[Linear Hough Transform]
As schematically shown in the figure, any straight line passing through a point (x, y) on the XY coordinate system can be specified by two parameters: the angle θ of the perpendicular dropped from the origin to the line, and the distance ρ between the line and the origin. The set of all lines passing through (x, y) draws a locus ρ=x·cos θ+y·sin θ in a (θ, ρ) space; this locus is referred to as a Hough curve.
A Hough curve can be determined independently for each point on the XY coordinate system. As shown in the figure, when the Hough curves of plural points are superposed, the curves intersect at the (θ, ρ) position corresponding to the line passing through all of those points.
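A one-line sketch of the Hough curve of a point (x, y); thetas is assumed to be a sampled grid of the angle parameter.

    import numpy as np

    def hough_curve(x, y, thetas):
        """Locus in (theta, rho) space of all lines passing through the point (x, y)."""
        return x * np.cos(thetas) + y * np.sin(thetas)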
[Hough Voting]
The engineering technique of Hough voting is used in order to detect lines from the point group. In this technique, the sets of θ and ρ through which each locus passes are voted into a two-dimensional Hough voting space having the coordinate axes θ and ρ, so that a position obtaining a large number of votes suggests the existence of a line, i.e. a set of θ and ρ through which many loci pass. Generally, a two-dimensional array (the Hough voting space) covering the searching ranges of θ and ρ is prepared and initialized to zero. Then, the locus of each point is determined by the Hough transform, and every value on the array through which the locus passes is incremented by 1. This is referred to as Hough voting. When the voting of the loci is completed for all the points, no line exists at a position where the number of votes is 0 (no locus passes through), a line passing through one point exists at a position where the number of votes is 1, and, in general, a line passing through n points exists at a position where the number of votes is n. If the resolution of the Hough voting space could be increased to infinity, only the exact positions through which loci pass would obtain votes. However, because the actual Hough voting space is quantized with a finite resolution for θ and ρ, a high vote distribution is also generated near the positions where plural loci intersect one another. Therefore, the intersecting positions must be determined more accurately by searching for the positions having maximum values in the vote distribution of the Hough voting space.
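A minimal Hough voting sketch with a fixed-value vote of 1 per cell (the "addition method 1" described below); the quantization parameters are illustrative.

    import numpy as np

    def hough_vote(points, thetas, rho_min, rho_max, n_rho):
        """Accumulate the loci of all points into a quantized Hough voting space."""
        S = np.zeros((len(thetas), n_rho))            # S(theta, rho), initialized to zero
        rho_step = (rho_max - rho_min) / n_rho
        for x, y in points:
            rhos = x * np.cos(thetas) + y * np.sin(thetas)
            idx = ((rhos - rho_min) / rho_step).astype(int)
            ok = (idx >= 0) & (idx < n_rho)
            S[np.nonzero(ok)[0], idx[ok]] += 1        # one vote per cell the locus passes
        return S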
The voting unit 303 performs the Hough voting only for the frequency components satisfying all the following conditions, so that only the frequency components having a power not lower than a predetermined threshold within a given frequency band participate in the vote:
(Voting condition 1): The frequency is in a predetermined range (low-frequency cut and high-frequency cut), and
(Voting condition 2): Power P(fk) of the frequency component fk is not lower than the predetermined threshold.
The voting condition 1 is generally used in order to cut the low frequencies on which background noise is superposed and the high frequencies at which the accuracy of the FFT decreases. The ranges of the low-frequency cut and the high-frequency cut can be adjusted according to the operation. When the widest frequency band is used, it is preferable that only the direct-current component is cut by the low-frequency cut and only the maximum frequency component is cut by the high-frequency cut.
For a frequency component whose level is close to the background noise, the reliability of the FFT result is not high. The voting condition 2 is used so that such low-reliability frequency components do not participate in the vote, by thresholding on the power. With the power value Po1(fk) at the microphone 1a and the power value Po2(fk) at the microphone 1b, the estimated power P(fk) can be determined by the following three methods, which can be selected according to the operation.
(Average value): An average value of Po1(fk) and Po2(fk) is used. It is necessary that both the power values of Po1(fk) and Po2(fk) are appropriately strong.
(Minimum value): The lower one of Po1(fk) and Po2(fk) is used. It is necessary that both the power values of Po1(fk) and Po2(fk) are not lower than the threshold value at the minimum.
(Maximum value): The larger one of Po1(fk) and Po2(fk) is used. Even if one of the power values is lower than the threshold value, the vote is performed when the other power value is sufficiently strong.
Further, the voting unit 303 can use either of the following two addition methods when voting.
(Addition method 1): A predetermined fixed value (for example, 1) is added to the position through which the locus passes.
(Addition method 2): A function value of power P(fk) of the frequency component fk is added to the position through which the locus passes.
The addition method 1 is the one usually used in line detection problems by the Hough transform. Because the votes rank lines in proportion to the number of passing points, this method preferentially detects the lines (i.e. sound sources) including many frequency components. Since no harmonic-structure limitation (equally spaced frequencies) is imposed on the frequency components included in a line, sound sources other than the human voice can also be detected.
In the addition method 2, even a line passing through a small number of points can obtain a high-order maximum value when it includes frequency components having large powers. This method is preferable for detecting a line (i.e. sound source) whose promising components have large power even though the number of its frequency components is small. In the addition method 2, the added value is computed as a function value G(P(fk)) of the power P(fk).
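The following sketch combines the voting conditions and addition methods described above; the combination rule for P(fk) ("avg"/"min"/"max") is a selectable parameter, and G(P)=sqrt(P) is only an assumed example of the weighting function.

    import numpy as np

    def voting_weights(po1, po2, freqs, f_lo, f_hi, p_thresh,
                       combine="min", method=1):
        """Vote weight of each frequency component under the conditions above."""
        if combine == "avg":                           # average of the two power values
            p = (po1 + po2) / 2
        elif combine == "min":                         # lower one of the two power values
            p = np.minimum(po1, po2)
        else:                                          # larger one of the two power values
            p = np.maximum(po1, po2)
        in_band = (freqs >= f_lo) & (freqs <= f_hi)    # voting condition 1
        strong = p >= p_thresh                         # voting condition 2
        eligible = in_band & strong
        if method == 1:
            return eligible.astype(float)              # addition method 1: fixed value 1
        return np.where(eligible, np.sqrt(p), 0.0)     # addition method 2: G(P) = sqrt(P)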
[Collective Voting of Plural FFT Results]
Although the voting unit 303 can perform the voting for each FFT time, in the embodiment the voting unit 303 performs collective voting for m successive time-series FFT results (m≧1). On a long-term basis, the frequency components of a sound source fluctuate. However, when the frequency components are stable over a suitably short time, collective voting of the successive m-time FFT results provides more pieces of data and thus a Hough voting result having higher reliability. m can be set as a parameter according to the operation.
[Straight-Line Detection Unit 304]
The straight-line detection unit 304 detects promising lines by analyzing the vote distribution in the Hough voting space generated by the voting unit 303. At this point, a higher-accuracy line detection can be realized by considering the circumstances unique to this problem, such as the cyclicity of the phase difference described above.
The amplitude data obtained by the pair of microphones is converted into power value data and phase value data of each frequency component by the frequency resolution unit 3; the referenced figure shows an example of the vote distribution obtained from such actual data.
[Limitation of ρ=0]
When the analog-to-digital conversion of the signals of the microphone 1a and the microphone 1b is performed in phase by the acoustic signal input unit 2, the line which should be detected always passes through ρ=0, i.e. the origin of the XY coordinate system. Therefore, the sound source estimation problem comes down to the problem of searching for the maximum positions in the vote distribution S(θ, 0) located on the θ axis, where ρ is zero, of the Hough voting space.
[Definition of Line Group in Consideration of Phase Difference Cyclicity]
A line 197 shown in the referenced figure is the reference line of a line group: the line passing through the origin, which corresponds directly to the arrival time difference of a sound source. Due to the cyclicity of the phase difference, the points deriving from the same sound source also appear on lines obtained by shifting the reference line along the ρ axis, because for frequencies at which the true phase difference exceeds ±π the observed phase difference is shifted by integral multiples of 2π. These shifted lines are referred to as cyclic extension lines, and one sound source is expressed not by a single line but by a line group consisting of the reference line and its cyclic extension lines, which are separated from one another by a movement amount Δρ determined by θ.
[Maximum Position Detection in Consideration of Phase Difference Cyclicity]
As described above, due to the cyclicity of the phase difference a sound source is not expressed by one line, but is dealt with as a line group including the reference line and its cyclic extension lines. This must also be considered when detecting the maximum positions in the vote distribution. Usually, searching for the maximum positions using only the vote values on ρ=0 (or ρ=ρ0), i.e. the votes of the reference line, is sufficient from a performance viewpoint, and also has the effect of reducing the searching time and improving the accuracy, in the case where the cyclicity of the phase difference does not occur, or where the sound source is detected only near the front face of the pair of microphones even if it occurs. However, in the case where sound sources existing in a wider range are to be detected, the maximum positions must be searched for by summing, for each θ, the vote values at the points separated from one another by Δρ. The difference will be described below.
The frequency resolution unit 3 converts the amplitude data obtained by the pair of microphones into the power value data and the phase value data of each frequency component, and the Hough voting is performed as described above. When the vote distribution is viewed longitudinally at the position of θ=θ0, the vote H(θ0) of a certain θ0 is computed as the sum of the votes on the θ axis 241 and the votes on the broken lines 242 to 249, i.e. H(θ0)=Σ{S(θ0, aΔρ(θ0))}. This operation corresponds to summing the vote of the reference line 200 at θ=θ0 and the votes of its cyclic extension lines. The numeral 250 represents a bar graph of the resulting vote distribution H(θ). Unlike the bar graph indicated by the numeral 222, which counts only the votes on the reference line, the maximum positions corresponding to the sound sources appear clearly.
[Generalization: Maximum Position Detection in Consideration of Non-In-Phase]
When the acoustic signal input unit 2 does not perform the analog-to-digital conversion of the signals of the microphone 1a and the microphone 1b in phase, the line to be detected does not necessarily pass through ρ=0, i.e. the origin of the XY coordinate system. In this case, the limitation of ρ=0 must be removed in searching for the maximum positions.
When the reference line is generalized to (θ0, ρ0) by removing the limitation of ρ=0, the line group (the reference line and its cyclic extension lines) can be described as (θ0, aΔρ(θ0)+ρ0), where a is an integer and Δρ(θ0) is the average movement amount of the cyclic extension lines determined by θ0. When a sound source exists in a certain direction, only one most promising line group exists at the θ0 corresponding to that direction; it is given by (θ0, aΔρ(θ0)+ρ0max) using the value ρ0max that maximizes the vote of the line group Σ{S(θ0, aΔρ(θ0)+ρ0)} as ρ0 is changed. Therefore, by setting the vote V at the maximum vote value Σ{S(θ, aΔρ(θ)+ρ0max)} for each θ, the same maximum position detection algorithm as in the ρ=0 case can be applied to perform the line detection.
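A sketch of the line-group vote summation: for each θ, the votes of the reference line and its cyclic extension lines are summed. Here delta_rho is assumed to be a callable returning the movement amount for a given θ, and rho0=0 reproduces the in-phase case.

    import numpy as np

    def line_group_votes(S, thetas, rhos, delta_rho, rho0=0.0, a_max=8):
        """H(theta) = sum over a of S(theta, a*delta_rho(theta) + rho0)."""
        H = np.zeros(len(thetas))
        for i, th in enumerate(thetas):
            d = delta_rho(th)
            for a in range(-a_max, a_max + 1):
                if d == 0 and a != 0:
                    continue                       # no extension lines when the shift is zero
                r = a * d + rho0
                if rhos[0] <= r <= rhos[-1]:
                    j = int(np.argmin(np.abs(rhos - r)))   # nearest quantized rho cell
                    H[i] += S[i, j]
        return H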
[Graphics Matching Unit 6]
The detected line groups are candidates of the sound sources at each time, and the candidates are estimated independently for each pair of microphones. The voice emitted from one sound source is simultaneously detected as a line group by each of plural pairs of microphones. Therefore, when the line groups deriving from the same sound source can be put into correspondence across the plural pairs of microphones, the information on the sound source can be obtained with higher reliability. The graphics matching unit 6 performs this correspondence. The information edited for each line group by the graphics matching unit 6 is referred to as sound source candidate information.
As shown in the figure, the graphics matching unit 6 includes a directional estimation unit 311, a sound source component estimation unit 312, a time-series tracking unit 313, a duration estimation unit 314, and a sound source component matching unit 315.
[Directional Estimation Unit 311]
The directional estimation unit 311 receives the line detection result, i.e. the θ value of each line group, from the straight-line detection unit 304, and computes the existence range of the sound source corresponding to each line group. At this point, the number of detected line groups becomes the number of sound source candidates. When the distance between the baseline of the pair of microphones and the sound source is sufficiently large, the existence range of the sound source becomes a conical surface forming a fixed angle with the baseline, as described below.
The arrival time difference ΔT between the microphone 1a and the microphone 1b can vary within the range of ±ΔTmax, where ΔTmax is the time the acoustic wave takes to travel the baseline length between the two microphones, i.e. ΔTmax=L/Vs for a baseline length L and an acoustic velocity Vs.
Next, a more general condition will be described. As shown in the figure, the gradient of the detected line gives the phase difference per unit frequency, from which the arrival time difference ΔT is computed; the sound source direction Φ, measured from the front face of the pair of microphones, is then obtained from the ratio of ΔT to ΔTmax as Φ=sin⁻¹(ΔT/ΔTmax).
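Under the conventions used here (x=ΔPh(fk), y=k, fk=k·Δf, far-field source), one consistent derivation from the Hough angle θ of a reference line to the direction Φ is sketched below; the sign convention and the default acoustic velocity are assumptions, not the patent's exact equations.

    import numpy as np

    def direction_from_theta(theta, df, L, Vs=340.0):
        """Map the Hough angle of a reference line to a source direction (degrees)."""
        # On the line x*cos(theta) + y*sin(theta) = 0 we have dPh(fk) = -k*tan(theta),
        # so dT = dPh(fk) / (2*pi*fk) = -tan(theta) / (2*pi*df).
        dT = -np.tan(theta) / (2 * np.pi * df)
        dT_max = L / Vs                            # maximum possible arrival time difference
        s = np.clip(dT / dT_max, -1.0, 1.0)
        return np.degrees(np.arcsin(s))            # phi measured from the front face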
[Sound Source Component Estimation Unit 312]
The sound source component estimation unit 312 evaluates the distance between the (x, y) coordinate value of each frequency component given by the coordinate value determining unit 302 and the line detected by the straight-line detection unit 304, and the sound source component estimation unit 312 detects the points (i.e. frequency component) located near the line as the frequency component of the line group (i.e. sound source). Then, the sound source component estimation unit 312 estimates the frequency component in each sound source based on the detection result.
[Detection by Distance Threshold Method]
As shown in the figure, an area 286 within a predetermined horizontal distance of the line group corresponding to one sound source is defined, and the frequency components located in the area 286 are detected as the components of that sound source. Similarly, as shown in the figure, an area 288 is defined for the line group corresponding to the other sound source, and the frequency components located in the area 288 are detected as its components.
At this point, the frequency component 289 and the origin (direct-current component) are included in both the areas 286 and 288, so that they are doubly detected as components of both sound sources (multiple belonging). This method, in which threshold processing is performed on the horizontal distance between a frequency component and the line, the frequency components within the threshold are selected for each line group (sound source), and the power and phase of those frequency components are directly set as the source sound components, is referred to as the "distance threshold method."
[Detection by Nearest Neighbor Method]
In the "nearest neighbor method," each frequency component is assigned only to the line group whose line is nearest in terms of the horizontal distance, provided that the distance is within the predetermined threshold. Accordingly, except for the origin (direct-current component), which lies on every line passing through it, the multiple belonging of a frequency component to plural sound sources does not occur.
[Detection by Distance Coefficient Method]
In the above two methods, only the frequency components within the predetermined horizontal-distance threshold of the lines constituting a line group are selected, and the power and phase of those frequency components are directly set as the frequency components of the source sound corresponding to the line group. On the other hand, in the "distance coefficient method" described below, a non-negative coefficient α that decreases monotonically with the horizontal distance d between the frequency component and the line is computed, and the power of the frequency component is multiplied by α. Therefore, every frequency component belongs to the source sound, while its power is reduced as its horizontal distance from the line increases.
In this method, threshold processing on the horizontal distance is unnecessary. For each frequency component, the horizontal distance d to a line group (the horizontal distance to the nearest line in the line group) is determined, and the value obtained by multiplying the power of the frequency component by the coefficient α determined from d is set as the power of the frequency component in that line group. The equation for computing the non-negative, monotonically decreasing coefficient α can be set arbitrarily; for example, a sigmoid (S-shaped curve) function α=exp(−(B·d)^C) can be used, where B and C are constants.
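A sketch of the coefficient computation; the default values of the tuning constants B and C are arbitrary assumptions.

    import numpy as np

    def distance_coefficient(d, B=2.0, C=4.0):
        """Monotonically decreasing weight alpha = exp(-(B*d)^C); alpha = 1 at d = 0."""
        return np.exp(-((B * np.asarray(d, dtype=float)) ** C))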
[Treatment of Plural FFT Results]
As described above, the voting unit 303 can perform the voting not only for each one-time FFT result but also for the successive m-time FFT results in a collective manner. Accordingly, the functional blocks subsequent to the straight-line detection unit 304, which process the Hough voting result, operate with the period of one Hough transform as their unit. When the Hough voting is performed with m≧2, the FFT results of plural times are classified into the components constituting the source sounds, and the same frequency component at different times may belong to different source sounds. Therefore, irrespective of the value of m, the coordinate value determining unit 302 imparts the starting time of the frame from which each frequency component (i.e. each dot in the plots) was obtained to that component as time information.
[Power Retention Option]
In the above methods, for a frequency component belonging to plural (N) line groups (sound sources) (only the direct-current component in the nearest neighbor method, and all the frequency components in the distance coefficient method), the powers distributed to the sound sources at the same time can also be normalized such that their total equals the power value Po(fk) before the distribution. The total power is then retained at the same level as the input power over the whole of the sound sources for each frequency component. This is referred to as the "power retention option." There are two distribution methods. Namely, (1) the power is equally divided into N segments (applicable to the distance threshold method and the nearest neighbor method), and (2) the power is distributed according to the distance between the frequency component and each line group (applicable to the distance threshold method and the distance coefficient method).
The method (1) is the distribution method in which normalization is automatically achieved by equally dividing the power into N segments. The method (1) can be applied to the distance threshold method and the nearest neighbor method, in which the distribution is determined independently of the distance.
The method (2) is the distribution method in which, after the coefficients are determined in the same manner as in the distance coefficient method, the total power is retained by normalizing the coefficients such that their total becomes 1. The method (2) can be applied to the distance threshold method and the distance coefficient method, in which multiple belonging occurs at points other than the origin as well.
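A sketch covering both distribution methods of the power retention option; alphas stands for the distance coefficients of the N line groups to which a component belongs.

    import numpy as np

    def distribute_power(po, n=None, alphas=None):
        """Split power po among N sources so the distributed powers sum to po."""
        if alphas is None:
            return np.full(n, po / n)              # method (1): equal division into n
        alphas = np.asarray(alphas, dtype=float)
        return po * alphas / alphas.sum()          # method (2): normalized by the coefficients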
The sound source component estimation unit 312 can perform all of the distance threshold method, the nearest neighbor method, and the distance coefficient method according to the setting. Further, in the distance threshold method and the nearest neighbor method, the above-described power retention option can be selected.
[Time-Series Tracking Unit 313]
As described above, the straight-line detection unit 304 determines the line groups for each Hough voting performed by the voting unit 303, and the Hough voting is performed collectively for the successive m-time (m≧1) FFT results. As a result, the line groups are determined in time series with a period of m frames (hereinafter referred to as the "graphics detection period"). Because θ of a line group corresponds one-to-one to the sound source direction Φ computed by the directional estimation unit 311, the locus of θ (or Φ) corresponding to a stable sound source should continue on the time axis whether the sound source stands still or moves. On the other hand, depending on the threshold settings, line groups corresponding to background noise (referred to as "noise line groups") are sometimes included in the line groups detected by the straight-line detection unit 304. However, the locus of θ (or Φ) of a noise line group does not continue on the time axis, or is short even if it continues.
The time-series tracking unit 313 determines the loci of Φ on the time axis by dividing the Φ values determined in each graphics detection period into groups continuous on the time axis. The grouping procedure is described below.
(1) A locus data buffer is prepared. The locus data buffer is an array of pieces of locus data. One piece of locus data Kd stores a starting time Ts, an end time Te, an array (line group list) of pieces of line group data Ld constituting the locus, and a label number Ln. One piece of line group data Ld is a group of pieces of data including the θ value and ρ value of one line group constituting the locus (obtained by the straight-line detection unit 304), the Φ value indicating the corresponding sound source direction (obtained by the directional estimation unit 311), the frequency components corresponding to the line group (obtained by the sound source component estimation unit 312), and the times when these values were obtained. Initially the locus data buffer is empty. A new label number is prepared as a parameter for issuing label numbers, and its initial value is set at zero.
(2) For each Φn newly obtained at a time T (hereinafter it is assumed that two Φs, shown by dots 303 and 304 in the referenced figure, are obtained), the locus data buffer is searched for locus data whose difference between the current time T and the end time Te is within a predetermined time and whose latest Φ value is within a predetermined range of Φn.
(3) When locus data satisfying the condition (2) is found, as for the dot 303, Φn is assumed to belong to the same locus: Φn, the θ value and ρ value corresponding to Φn, the frequency components, and the current time T are added to the line group list as new line group data of the locus data Kd, and the current time T is set as the new end time Te of the locus. When plural pieces of locus data are found, they are all assumed to form the same locus and are integrated into the locus data having the youngest label number, and the remaining pieces are deleted from the locus data buffer. The starting time Ts of the integrated locus data is the earliest starting time among the pieces of locus data before the integration, the end time Te is the latest end time among them, and the line group list is the union of their line group lists. As a result, the dot 303 is added to the locus data 301.
(4) When no locus data satisfying the condition (2) is found, as for the dot 304, new locus data is produced in an empty part of the locus data buffer as the start of a new locus: both the starting time Ts and the end time Te are set at the current time T; Φn, the θ value and ρ value corresponding to Φn, the frequency components, and the current time T are set as the initial line group data of the line group list; the value of the new label number is given as the label number Ln of the locus; and the new label number is incremented by 1. When the new label number reaches a predetermined maximum value, it is returned to zero. Accordingly, the dot 304 is entered as new locus data in the locus data buffer.
(5) When locus data for which the predetermined time Δt has elapsed since the last update (i.e. since the end time Te) exists in the locus data buffer, no new Φn to be added is expected for it, i.e. the tracking is completed; the locus data is outputted to the next-stage duration estimation unit 314 and deleted from the locus data buffer.
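A condensed sketch of steps (1) to (5); the two thresholds and the data layout are illustrative, and the line group data is reduced to (time, Φ) pairs for brevity.

    class Locus:
        """One piece of locus data: Ts, Te, label number, and a line group list."""
        def __init__(self, t, phi, label):
            self.Ts, self.Te, self.label = t, t, label
            self.groups = [(t, phi)]

    def track(loci, finished, t, phis, dphi_max, dt_max, next_label):
        for phi in phis:
            near = [k for k in loci
                    if t - k.Te <= dt_max and abs(phi - k.groups[-1][1]) <= dphi_max]
            if near:                                   # step (3): extend / integrate loci
                near.sort(key=lambda k: k.label)       # keep the youngest label number
                keep = near[0]
                for k in near[1:]:
                    keep.Ts = min(keep.Ts, k.Ts)
                    keep.groups += k.groups
                    loci.remove(k)
                keep.groups.append((t, phi))
                keep.Te = t
            else:                                      # step (4): start a new locus
                loci.append(Locus(t, phi, next_label))
                next_label += 1
        done = [k for k in loci if t - k.Te > dt_max]  # step (5): tracking completed
        finished += done
        loci[:] = [k for k in loci if k not in done]
        return next_label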
[Duration Estimation Unit 314]
The duration estimation unit 314 computes the duration of each locus from the starting time and the end time of the locus data outputted from the time-series tracking unit 313 upon completion of tracking. The duration estimation unit 314 certifies locus data whose duration exceeds a predetermined threshold as deriving from a source sound, and certifies the other locus data as deriving from noise. The locus data based on a source sound is referred to as sound source stream information; it includes the starting time Ts and the end time Te of the source sound and the time-series locus data of θ, ρ, and Φ indicating the sound source direction. The number of line groups obtained by the graphics detection unit 5 gives the number of sound sources including noise sound sources, while the number of pieces of sound source stream information obtained by the duration estimation unit 314 gives a reliable number of sound sources excluding those based on noise.
[Sound Source Component Matching Unit 315]
The sound source component matching unit 315 puts the pieces of sound source stream information deriving from the same sound source into correspondence with one another, and generates sound source candidate corresponding information. The pieces of sound source stream information are obtained for the different pairs of microphones through the time-series tracking unit 313 and the duration estimation unit 314, respectively. Voices emitted from the same sound source at the same time should resemble one another in their frequency components. Therefore, a degree of similarity is computed by matching the patterns of the frequency components of the sound source streams at the same times, based on the sound source components estimated at each time for each line group by the sound source component estimation unit 312, and the streams whose patterns give the maximum degree of similarity not lower than a predetermined threshold are put into correspondence with each other. Although the pattern matching can be performed over the whole ranges of the sound source streams, it is more efficient to search for the sound source streams in which the total or average degree of similarity becomes the maximum (and not lower than the predetermined threshold) by matching the frequency component patterns only at times within the period in which the matched sound source streams exist simultaneously. Setting the times to be matched at the times when the powers of both matched sound source streams are not lower than a predetermined threshold further improves the matching reliability.
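A sketch of the inter-pair stream matching; cosine similarity is an assumed similarity measure, and the streams are reduced to dictionaries mapping times to power spectra.

    import numpy as np

    def stream_similarity(comp_a, comp_b, p_thresh=0.0):
        """Average similarity of two streams' component patterns at shared times."""
        sims = []
        for t in set(comp_a) & set(comp_b):        # period in which both streams exist
            a, b = comp_a[t], comp_b[t]
            if a.sum() < p_thresh or b.sum() < p_thresh:
                continue                           # match only sufficiently powerful times
            sims.append(np.dot(a, b) /
                        (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        return float(np.mean(sims)) if sims else 0.0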
It should be noted that the information can be exchanged among the functional blocks of the graphics matching unit 6 through a cable (not shown) if necessary.
[Sound Source Information Generating Unit 7]
As shown in the figure, the sound source information generating unit 7 includes a sound source existence range estimation unit 401, a pair selection unit 402, an in-phasing unit 403, an adaptive array processing unit 404, and a voice recognition unit 405.
[Sound Source Existence Range Estimation Unit 401]
The sound source existence range estimation unit 401 computes a spatial existence range of the sound source based on the sound source candidate corresponding information generated by the graphics matching unit 6. The computing method includes the two following methods, and the two methods can be switched by the parameter.
(Computing method 1) The sound source directions indicated by the pieces of sound source stream information, which are put into correspondence with one another because they derive from the same sound source, are each assumed to define a conical surface (see the directional estimation unit 311), and the range in which the conical surfaces intersect one another is set at the spatial existence range of the sound source.
(Computing method 2) The spatial existence range of the sound source is determined as follows, using the sound source directions indicated by the pieces of sound source stream information which are put into correspondence with one another because they derive from the same sound source. Namely, (1) concentric spherical surfaces whose center is the origin of the apparatus are assumed, and a table in which the angle for each pair of microphones is computed for each discrete point (spatial coordinate) on the concentric spherical surfaces is prepared in advance. (2) The discrete point whose angles for the pairs of microphones best satisfy the set of sound source directions under a least square error criterion is searched for, and the position of that point is set at the spatial existence range of the sound source.
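A sketch of computing method 2: the previously prepared table holds, for every discrete point on the concentric spherical surfaces, the angle seen from each pair of microphones, and the best point minimizes the squared error against the observed directions.

    import numpy as np

    def locate_source(table, observed):
        """table: (n_points, n_pairs) precomputed angles; observed: (n_pairs,) directions.
        Returns the index of the discrete point with the least square error."""
        err = ((table - np.asarray(observed)) ** 2).sum(axis=1)
        return int(np.argmin(err))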
[Pair Selection Unit 402]
The pair selection unit 402 selects the optimum pair for the sound source voice separation and extraction based on the sound source candidate corresponding information generated by the graphics matching unit 6. The selection method includes the two following methods, and the two methods can be switched by the parameter.
(Selection method 1) The sound source directions indicated by the pieces of sound source stream information, which are put into correspondence with one another because they derive from the same sound source, are compared with one another to select the pair of microphones detecting the sound source stream nearest to its front face. The pair of microphones observing the sound source closest to the front face is thus used to extract the sound source voice.
(Selection method 2) The sound source directions indicated by the pieces of sound source stream information, which are put into correspondence with one another because they derive from the same sound source, are assumed to define conical surfaces (see the directional estimation unit 311), the spatial existence range of the sound source is determined from them, and the pair of microphones whose front face is nearest to the determined existence range is selected to extract the sound source voice.
[In-Phasing Unit 403]
The in-phasing unit 403 obtains the time transition of the sound source direction Φ of the stream from the sound source stream information selected by the pair selection unit 402, and determines a width Φw=Φmax−Φmid by computing the intermediate value Φmid=(Φmax+Φmin)/2 from the maximum value Φmax and the minimum value Φmin of Φ. The in-phasing unit 403 extracts the time-series data of the two pieces of frequency-resolved data a and b from which the sound source stream information originated, from a time a predetermined interval before the starting time Ts of the stream to a time a predetermined interval after the end time Te, and corrects the data such that the arrival time difference computed back from the intermediate value Φmid is cancelled. The in-phasing unit 403 thereby performs the in-phasing.
Alternatively, the in-phasing unit 403 can set the sound source direction Φ obtained at each time by the directional estimation unit 311 at Φmid, and perform the in-phasing of the time-series data of the two pieces of frequency-resolved data a and b simultaneously. Whether the sound source stream information or the Φ of each time is referred to is determined by an operation mode, which can be set as a parameter.
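A sketch of the in-phasing computation; the rotation applied to stream b assumes the far-field relation ΔT=ΔTmax·sin(Φ) and an arbitrary sign convention.

    import numpy as np

    def in_phase(spec_b, freqs, phis, dT_max):
        """Rotate the spectrum of data b so the delay implied by phi_mid is cancelled."""
        phi_mid = (max(phis) + min(phis)) / 2          # intermediate direction of the stream
        phi_w = max(phis) - phi_mid                    # half width of the direction range
        dT = dT_max * np.sin(np.radians(phi_mid))      # arrival time difference to cancel
        rot = np.exp(-2j * np.pi * freqs * dT)         # per-frequency phase rotation
        return spec_b * rot, phi_w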
[Adaptive Array Processing Unit 404]
The adaptive array processing unit 404 separates and extracts the source sound (time-series data of the frequency components) of the stream with high accuracy by applying an adaptive array process to the extracted and in-phased time-series data of the two pieces of frequency-resolved data a and b, with the center directivity directed to the front face (0°) and the tracking range set at ±Φw plus a predetermined margin. As disclosed in Tadashi Amada et al., "Microphone array technique for speech recognition," Toshiba review, vol. 59, No. 9, 2004, a method that clearly separates and extracts the voice within the set directivity range using main and sub Griffith-Jim type generalized side-lobe cancellers can be used as the adaptive array process.
When the adaptive array process is used, the tracking range is usually set in advance to wait for a voice from that direction, so that waiting for voices from all directions requires preparing many adaptive arrays with different tracking ranges. On the contrary, in the apparatus of the embodiment, the number of sound sources and their directions are determined first, so that only as many adaptive arrays as there are sound sources need to be operated, with each tracking range set to a predetermined narrow range according to the sound source direction. Therefore, the voices can be separated and extracted efficiently and with high quality.
Further, because the time-series data of the two pieces of frequency-resolved data a and b are in-phased in advance, sounds from all directions can be processed simply by setting the tracking range of the adaptive array process at the neighborhood of the front face.
[Voice Recognition Unit 405]
The voice recognition unit 405 analyzes and verifies the time-series data of the source sound extracted by the adaptive array processing unit 404, and thereby extracts the symbolic contents of the stream, i.e. symbols (a symbol string) expressing the linguistic meaning, the kind of sound source, or the speaker.
[Output Unit 8]
The output unit 8 outputs, as the sound source candidate information generated by the graphics matching unit 6, information including at least one of the number of sound source candidates, the spatial existence range of each sound source candidate (the angle Φ determining the conical surface), the voice component configuration (time-series data of the power and phase of each frequency component), the number of sound source candidates (sound source streams) excluding noise sound sources, and the temporal existence period of each voice. The number of sound source candidates is obtained as the number of line groups by the graphics detection unit 5. The spatial existence range of each sound source candidate, which is an emitting source of an acoustic signal, is estimated by the directional estimation unit 311. The voice component configuration of each sound source candidate is estimated by the sound source component estimation unit 312. The number of sound source candidates excluding noise and the temporal existence period of each voice are obtained by the time-series tracking unit 313 and the duration estimation unit 314. Alternatively, the output unit 8 outputs, as the sound source information generated by the sound source information generating unit 7, information including at least one of the number of sound sources, the finer spatial existence range of each sound source (the conical surface intersecting range or the table-searched coordinate value), the separated voice of each sound source (time-series data of amplitude values), and the symbolic contents of each sound source voice. The number of sound sources is obtained as the number of corresponding line groups (sound source streams) by the graphics matching unit 6. The finer spatial existence range of each sound source, which is an emitting source of an acoustic signal, is estimated by the sound source existence range estimation unit 401. The separated voice of each sound source is obtained by the pair selection unit 402, the in-phasing unit 403, and the adaptive array processing unit 404. The symbolic contents of each sound source voice are obtained by the voice recognition unit 405.
[User Interface Unit 9]
The user interface unit 9 displays the various setting contents necessary for the acoustic signal processing to a user and receives setting inputs from the user. The user interface unit 9 also stores the setting contents in an external storage device or reads them from the external storage device. As shown in the figure, the user interface unit 9 further visualizes the various processing results and intermediate results and displays them to the user, as described in Step S11 below.
[Process Flowchart]
In the initial setting process Step S1, a part of the process of the user interface unit 9 is performed. In Step S1, the various setting contents necessary for the acoustic signal processing are read from the external storage device, and the apparatus is initialized to a predetermined setting state.
In the acoustic signal input process Step S2, the process of the acoustic signal input unit 2 is performed. The acoustic signals captured at the plurality of spatially different positions are inputted in Step S2.
In the frequency resolution process Step S3, the process in the frequency resolution unit 3 is performed. In Step S3, the frequency resolution is performed on each of the acoustic signals inputted in Step S2, and at least the phase value (and the power value if necessary) is computed for each frequency.
In the two-dimensional data generating process Step S4, the process of the two-dimensional data generating unit 4 is performed. In Step S4, the phase values computed for each frequency in Step S3 are compared between the acoustic signals of each pair to compute the phase difference at each frequency. Each phase difference is then treated as a point on the XY coordinate system, in which a function of the phase difference is set on the X-axis and a function of the frequency is set on the Y-axis, and is converted into the (x, y) coordinate value uniquely determined by the frequency and the phase difference.
In the graphics detection process Step S5, the process of the graphics detection unit 5 is performed. In Step S5, the predetermined graphics are detected from the two-dimensional data generated in Step S4.
In the graphics matching process Step S6, the process of the graphics matching unit 6 is performed. The graphics detected in Step S5 are set as sound source candidates, and the graphics deriving from the same sound source are put into correspondence with one another among the different pairs of microphones. The pieces of graphics information from the plural pairs of microphones (the sound source candidate corresponding information) are thereby integrated for each sound source.
In the sound source information generating process Step S7, the process of the sound source information generating unit 7 is performed. In Step S7, the sound source information, including at least one of the number of sound sources which are the emitting sources of the acoustic signals, the finer spatial existence range of each sound source, the component configuration of the voice emitted from each sound source, the separated voice of each sound source, the temporal existence period of the voice emitted from each sound source, and the symbolic contents of the voice emitted from each sound source, is generated based on the graphics information (the sound source candidate corresponding information) integrated for each sound source in Step S6.
In the output process Step S8, the process in the output unit 8 is performed. The sound source candidate information generated by Step S6 and the sound source information generated by Step S7 are outputted in Step S8.
In the ending determination process Step S9, a part of the process in the user interface unit 9 is performed. In Step S9, whether an ending command from the user is present or absent is confirmed. When the ending command exists, the process flow is controlled to go to Step S12. When the ending command does not exist, the process flow is controlled to go to Step S10.
In the confirming determination process Step S10, a part of the process in the user interface unit 9 is performed. In Step S10, whether a confirmation command from the user is present or absent is confirmed. When the confirmation command exists, the process flow is controlled to go to Step S11. When the confirmation command does not exist, the process flow is controlled to go to Step S2.
In the information display and setting receiving process Step S11, a part of the process in the user interface unit 9 is performed. Step S11 is performed by receiving the confirmation command from the user. Step S11 enables the display of various kinds of setting contents necessary for the acoustic signal processing to the user, the reception of the setting input from the user, the storage of the setting contents in the external storage device by the storage command, the readout of the setting contents from the external storage device by the read command, and the visualization of the various processing results and the intermediate results, and the display of the various processing results and the intermediate results to the user. Further, in Step S11, the user selects the desired data to visualize the data in more detail. Therefore, the user can confirm the operation of the acoustic signal processing, the user can adjust the apparatus such that the apparatus performs the desired operation, and the process can be continued in the adjusted state.
In the ending process Step S12, a part of the process in the user interface unit 9 is performed. Step S12 is performed by receiving the ending command from the user. In Step S12, the various kinds of setting contents necessary for the acoustic signal processing are automatically stored.
[Modification]
The modifications of the above-described embodiment will be described below.
[Detection of Vertical Line]
In the embodiment, the two-dimensional data generating unit 4 generates the point group with the X coordinate value set at the phase difference ΔPh(fk) and the Y coordinate value set at the frequency component number k by the coordinate value determining unit 302. It is also possible that the X coordinate value is set at an estimated value ΔT(fk)=(ΔPh(fk)/2π)×(1/fk) of the arrival time difference computed from the phase difference ΔPh(fk) at each frequency. When the arrival time difference is used instead of the phase difference, the points having the same arrival time difference, i.e. the points deriving from the same sound source, are arranged on a perpendicular line.
At this point, as the frequency increases, the time difference ΔT(fk) which can be expressed by one cycle of the phase difference ΔPh(fk) decreases. As shown in the figure, the range of time differences expressible by a phase difference within ±π is ±1/(2fk), which becomes narrower than the maximum arrival time difference ±ΔTmax above a certain limit frequency (292 in the figure).
Therefore, in order to handle the cyclicity of the phase difference, for frequencies exceeding the limit frequency 292 the coordinate value determining unit 302 forms the two-dimensional data by generating redundant points at the positions of all the arrival time differences ΔT(fk) within ±ΔTmax that correspond to phase differences shifted by integral multiples of 2π, as shown in the figure.
Accordingly, the voting unit 303 and the straight-line detection unit 304 can detect a promising perpendicular line (295 in the figure) from the two-dimensional data including the redundant points.
The maximum position can also be determined by voting the X coordinate values of the redundant point group into a one-dimensional vote distribution (the peripheral distribution obtained by projection in the Y-axis direction) and detecting the maximum positions which obtain votes not lower than a predetermined threshold. Thus, by using the arrival time difference instead of the phase difference on the X-axis, all the pieces of evidence indicating sound sources in different directions are projected onto lines having the same gradient (i.e. perpendicular lines), so that the detection can be performed simply from the peripheral distribution without performing the Hough transform.
The sound source direction information obtained by determining a perpendicular line is the arrival time difference ΔT(fk), which is obtained not as the angle θ but as the intercept ρ. Therefore, the directional estimation unit 311 can immediately compute the sound source direction Φ from the arrival time difference ΔT without using θ.
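For illustration, the conversion from ΔT to Φ can be written as below; the far-field relation ΔT = (d/Vs)·sinΦ, with Φ measured from the broadside direction of the microphone pair, is assumed here, and the embodiment's exact geometric convention may differ.

```python
import numpy as np

def direction_from_time_difference(dt, mic_distance, vs=340.0):
    """Estimate the sound source direction (degrees) from the arrival
    time difference dt (s), assuming the far-field relation
    dt = (d / Vs) * sin(phi)."""
    s = np.clip(vs * dt / mic_distance, -1.0, 1.0)  # guard against |s| > 1
    return np.degrees(np.arcsin(s))
```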
Thus, the two-dimensional data generated by the two-dimensional data generating unit 4 is not limited to one kind, and the graphics detection method performed by the graphics detection unit 5 is not limited to one method. The point group plot described above is only one example of such a combination.
[Program: Realization with Computer]
The invention can also be realized as a program which causes a computer to execute the above-described acoustic signal processing procedure.
[Recording Medium]
The program can also be stored in a computer-readable recording medium and distributed, and the invention can be realized by causing a computer to read the program from the recording medium and execute it.
[Acoustic Velocity Correction with Temperature Sensor]
The invention can also be realized such that the acoustic signal processing apparatus includes a temperature sensor which measures the ambient temperature, and the acoustic velocity Vs used in the above-described computations is corrected according to the measured temperature, so that an accurate Tmax is determined.
Alternatively, the invention can be realized such that the acoustic signal processing apparatus includes means for transmitting an acoustic wave and means for receiving the acoustic wave, arranged at a predetermined interval, and the acoustic velocity Vs is directly computed and corrected by measuring, with measurement means, the time the acoustic wave emitted from the transmitting means takes to reach the receiving means, thereby determining an accurate Tmax.
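Both corrections reduce to a few lines of computation. The sketch below uses the common linear approximation Vs ≈ 331.5 + 0.61t (m/s, t in °C) for the temperature-based correction; the function names are illustrative.

```python
def speed_of_sound_from_temperature(temp_c):
    """Approximate speed of sound in air (m/s) from temperature (deg C),
    using the common linear approximation Vs ~ 331.5 + 0.61 * t."""
    return 331.5 + 0.61 * temp_c

def speed_of_sound_from_time_of_flight(distance_m, travel_time_s):
    """Directly measured speed of sound from a transmitter-receiver
    pair placed distance_m apart."""
    return distance_m / travel_time_s

def max_time_difference(mic_distance_m, vs):
    """Tmax: the largest possible arrival time difference for a
    microphone pair separated by mic_distance_m."""
    return mic_distance_m / vs
```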
[Unequal Division of θ for Equal Intervals of Φ]
In the invention, when the Hough transform is performed in order to detect the gradient of the line group, quantization is performed by, for example, dividing θ in steps of 1°. When θ is equally divided, the values of the estimable sound source direction Φ are unequally quantized. Therefore, in the invention, the quantization of θ may instead be performed by equally dividing Φ, so that no variation in estimation accuracy arises across sound source directions.
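A sketch of this quantization follows; the mapping theta_from_phi between the line gradient angle θ and the sound source direction Φ depends on the embodiment's coordinate scaling and is represented here by a hypothetical monotonic placeholder.

```python
import numpy as np

# Hypothetical placeholder: the exact theta-phi relation depends on how
# the two-dimensional coordinates are scaled in the embodiment.
def theta_from_phi(phi_deg, scale=1.0):
    return np.degrees(np.arctan(scale * np.sin(np.radians(phi_deg))))

def theta_bins_for_uniform_phi(n_bins=181, scale=1.0):
    """Quantize theta so that the corresponding phi values are equally
    spaced over [-90, +90] degrees; the resulting theta bins are
    unequally spaced."""
    phis = np.linspace(-90.0, 90.0, n_bins)
    return np.array([theta_from_phi(p, scale) for p in phis])
```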
[Variation of Graphics Matching]
In the embodiment, the sound source component matching unit 315 matches the sound source streams (time series of graphics) obtained from different pairs, based on the similarity of their frequency components at the same time. This matching method enables separation and extraction, using as a clue the difference in the frequency components of the source sounds, when plural sound sources to be detected exist at the same time.
Depending on the purpose of operation, the sound source to be detected at a given time is sometimes the strongest one and sometimes the one having the longest duration. Therefore, the sound source component matching unit 315 may be realized so as to include options in which it matches, across pairs, the sound source streams whose power is maximum, the sound source streams whose duration is longest, or the sound source streams whose durations overlap the longest. The selection among these options can be set as a parameter.
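These options may be realized, for example, as a single selection parameter, as in the following illustrative sketch (the Stream container and its fields are assumptions, not structures defined by the embodiment):

```python
from dataclasses import dataclass

@dataclass
class Stream:            # illustrative container for a sound source stream
    power: float         # e.g. mean power over the stream
    start: float         # start time (s)
    end: float           # end time (s)

def pick_stream(streams, mode="max_power", reference=None):
    """Select the stream to match in one microphone pair according to
    the configured option: maximum power, longest duration, or longest
    overlap with a reference stream from another pair."""
    if mode == "max_power":
        return max(streams, key=lambda s: s.power)
    if mode == "longest":
        return max(streams, key=lambda s: s.end - s.start)
    if mode == "max_overlap" and reference is not None:
        return max(streams, key=lambda s: max(0.0, min(s.end, reference.end)
                                              - max(s.start, reference.start)))
    raise ValueError("unknown matching option")
```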
[Directivity Control of Another Sensor]
In the embodiment, the sound source existence range estimation unit 401 determines the point having the least error as the spatial existence range of the sound source by searching the discrete points on the concentric spherical surfaces for the point giving the least square error, using computing method 2. In addition to the point having the least error, the top-k points in order of increasing error, such as the point having the second least error and the point having the third least error, can also be determined. The acoustic signal processing apparatus can include another sensor such as a camera. In an application in which the camera is trained toward the sound source direction, by training the camera on the determined top-k points in order of least error, the apparatus can visually detect the target object. Since both the direction and the distance of each point are determined, the angle and zoom of the camera can be controlled smoothly, so that the visual object which should exist at the sound source position can be searched for and detected efficiently. Specifically, the apparatus can be applied to an application in which the camera is trained toward the direction of a voice to find a face.
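For illustration, the top-k selection and the camera control loop might be sketched as follows; the camera and detector interfaces are hypothetical, and each candidate point is assumed to carry an azimuth, an elevation, and a distance.

```python
import numpy as np

def top_k_candidates(points, errors, k=3):
    """Return the k candidate source positions with the smallest
    least-square errors, in increasing order of error."""
    order = np.argsort(errors)[:k]
    return [points[i] for i in order]

# Hypothetical camera interface (not part of the embodiment):
# camera.point_at(azimuth, elevation) and camera.set_zoom(distance).
def search_with_camera(camera, detector, candidates):
    """Train the camera on each candidate in order of least error and
    stop when the visual detector (e.g. a face detector) succeeds."""
    for az, el, dist in candidates:
        camera.point_at(az, el)
        camera.set_zoom(dist)
        if detector():
            return (az, el, dist)
    return None
```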
In the method disclosed in K. Nakadai et al., "Real time active chase of person by hierarchy integration of audio-visual information," Japan Society for Artificial Intelligence AI Challenge Kenkyuukai, SIG-Challenge-0113-5 (in Japanese), p 35-42, June 2001, the number of sound sources, the directions of the sound sources, and the components are estimated by detecting, from the frequency-resolved data, the fundamental frequency component constituting a harmonic structure and its harmonic components. Because the method assumes a harmonic structure, it is specialized to the human voice. However, many sound sources having no harmonic structure, such as the opening and closing sounds of a door, exist in an actual environment, so the method cannot deal with source sounds emitted from such sound sources.
Although the method disclosed in F. Asano, "Dividing sounds," Transaction of the Society of Instrument and Control Engineers (in Japanese) vol. 43, No. 4, p 325-330 (2004) is not limited to a particular sound source model, the number of sound sources which can be dealt with by this method is limited to only one as long as two microphones are used.
On the contrary, according to the embodiment of the invention, the phase differences of the frequency components are divided into groups, one per sound source, by the Hough transform. Therefore, even with only two microphones, the directions of at least two sound sources can be determined and at least two sound sources can be separated. Moreover, no restrictive model such as a harmonic structure is used, so the invention can be applied to wide-ranging sound sources.
Other effects and advantages obtained by the embodiment of the invention are summarized as follows:
(1) Wide-ranging sound sources can be detected stably, because the Hough voting method is suited to detecting a sound source having many frequency components or a sound source having strong power.
(2) A sound source can be detected efficiently and with high accuracy by considering the restriction ρ = 0 and the cyclicity of the phase difference when detecting the lines.
(3) The line detection result can be used to determine useful sound source information, including the spatial existence range of the sound source which is the emitting source of the acoustic signal, the temporal existence period of the source sound emitted from the sound source, the component configuration of the source sound, the separated voice of the source sound, and the symbolic contents of the source sound.
(4) In estimating the frequency components of each sound source, a component near a line is simply selected, the line to which the component belongs is determined, and a coefficient is multiplied according to the distance between the line and the component; therefore, the source sounds can be separated individually in a simple manner (see the sketch after this list).
(5) The directivity range of the adaptive array processing is set adaptively by learning the direction of the frequency components in advance, which allows the source sounds to be separated with higher accuracy.
(6) The symbolic contents of the source sound can be determined by recognizing the source sound while separating it with high accuracy.
(7) The user can confirm the operation of the apparatus, adjust it so that the desired operation is performed, and utilize it in the adjusted state.
(8) The sound source direction is estimated for each pair of microphones, and the estimation results from plural pairs are matched and integrated. Therefore, not merely the sound source direction but the spatial position of the sound source can be estimated.
(9) An appropriate pair of microphones is selected from the plural pairs with respect to each sound source. Therefore, even when reception quality is low in one pair, the source voice can be extracted with high quality from a pair having good reception quality, and the source voice can thus be recognized.
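As an illustration of effect (4), the following sketch assigns each frequency component to its nearest detected line (in the Hough normal form ρ = x·cosθ + y·sinθ) and weights it by a coefficient decaying with distance; the Gaussian weighting is an assumption, as the embodiment does not fix the exact coefficient.

```python
import numpy as np

def separate_components(points, lines, sigma=0.5):
    """Assign each frequency component (point) to its nearest detected
    line and weight it by a coefficient that decays with the distance
    to the line, yielding a simple per-source separation.
    points: iterable of (x, y); lines: list of (rho, theta)."""
    out = {i: [] for i in range(len(lines))}
    for x, y in points:
        d = [abs(x * np.cos(th) + y * np.sin(th) - rho) for rho, th in lines]
        i = int(np.argmin(d))                        # nearest line = source
        w = np.exp(-(d[i] ** 2) / (2 * sigma ** 2))  # distance-based coefficient
        out[i].append(((x, y), w))
    return out
```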
Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspect is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.
Foreign Application Priority Data

| Number | Date | Country | Kind |
| --- | --- | --- | --- |
| 2005-084443 | Mar. 2005 | JP | national |

Foreign Patent Documents

| Number | Date | Country |
| --- | --- | --- |
| 2003-337164 | Nov. 2003 | JP |

Publication

| Number | Date | Country |
| --- | --- | --- |
| 20060215854 A1 | Sep. 2006 | US |