Acoustic signal processing device, acoustic signal processing method, and program

CROSS-REFERENCE TO RELATED APPLICATION

Priority is claimed on Japanese Patent Application No. 2018-165504, filed Sep. 4, 2018, the content of which is incorporated herein by reference.

BACKGROUND
Field of the Invention

The present invention relates to an acoustic signal processing device, an acoustic signal processing method, and a program.

Related Art

Conventionally, there is a technology of collecting sounds using a plurality of microphones and acquiring identification of a sound source on the basis of the collected sounds and information based on the collected sounds. In such a technology, the sounds collected by the microphones are converted into sampled electrical signals, signal processing is executed on the converted electrical signals, and thereby the information based on the collected sounds is acquired. In addition, signal processing in such a technology is processing in which the converted electrical signals are assumed to be electrical signals obtained by sampling sounds collected by microphones located at different positions at the same sampling frequency (for example, refer to Katsutoshi Itoyama, Kazuhiro Nakadai, “Synchronization between channels of a plurality of A/D converters based on probabilistic generation model,” Proceedings of the 2018 Spring Conference, Acoustical Society of Japan, 2018, pp. 505-508).

However, in practice, an AD converter provided for each microphone samples the converted electrical signals in synchronization with a clock generated by a vibrator provided for each AD converter. For this reason, there are cases in which sampling at the same sampling frequency is not necessarily performed depending on individual differences of the vibrators. In addition, in robots or the like which operate in extreme environments, external influences such as temperature or humidity are different for each vibrator. For this reason, in this case, not only the individual differences of each vibrator but also the external influences may cause a gap in a clock of each vibrator. To reduce such a gap, it has been proposed to use an oven-controlled crystal oscillator (OCXO), an oscillator with small individual difference such as an atomic clock, a large capacity capacitor, or the like. However, it is not realistic to actually mount it on a robot or the like and to operate it. Therefore, in such conventional technologies, accuracy of information based on sounds collected by a plurality of microphones may deteriorate in some cases.

SUMMARY OF THE INVENTION

Aspects of the present invention have been made in view of the above circumstances, and an object thereof is to provide an acoustic signal processing device, an acoustic signal processing method, and a computer program which can suppress deterioration in accuracy of information based on sounds collected by a plurality of microphones.

In order to solve the above problems, the present invention adopts the following aspects.

(1) An acoustic signal processing device according to one aspect of the present invention includes an acoustic signal processing unit configured to calculate a spectrum of each acoustic signal and a steering vector having m elements on the basis of m acoustic signals converted into m digital signals by sampling m analog signals representing sounds collected by m microphones (m is an integer of 1 or more and M or less, and M is an integer of 2 or more), and to estimate a sampling frequency ω_min the sampling on the basis of the spectrum, the steering vector, and a sampling frequency ω_idealthat is a predetermined value.

(2) In the aspect (1) described above, the steering vector may represent a difference between positions of the microphones having a transfer characteristic from a sound source of the sounds to each of the microphones.

(3) In the aspect (1) or (2) described above, a matrix representing a conversion of an analog signal from a spectrum of an ideal signal into a spectrum of a signal sampled at the sampling frequency ω_mand a sample time τ_mis set to a spectrum expansion matrix, and the acoustic signal processing unit may estimate the sampling frequency ω_mon the basis of the steering vector, the spectrum expansion matrix, and a spectrum X_m.

(4) An acoustic signal processing method according to another aspect of the present invention includes a spectrum calculation step of calculating a spectrum of each acoustic signal on the basis of m acoustic signals converted into m digital signals by sampling m analog signals representing sounds collected by m microphones (m is an integer of 1 or more and M or less, and M is an integer of 2 or more), a steering vector calculation step of calculating a steering vector having m elements on the basis of the m converted acoustic signals, and an estimation step of estimating a sampling frequency ω_min the sampling on the basis of the spectrum, the steering vector, and a sampling frequency ω_idealthat is a predetermined value.

(5) A computer readable non-transitory storage medium according to still another aspect of the present invention stores a program causing a computer of an acoustic signal processing device to execute a spectrum calculation step of calculating a spectrum of each acoustic signal on the basis of m acoustic signals converted into m digital signals by sampling m analog signals representing sounds collected by m microphones (m is an integer of 1 or more and M or less, and M is an integer of 2 or more), a steering vector calculation step of calculating a steering vector having m elements on the basis of the m converted acoustic signals, and an estimation step of estimating a sampling frequency ω_min the sampling on the basis of the spectrum, the steering vector, and a sampling frequency ω_idealthat is a predetermined value.

According to the aspects (1), (4), and (5), it is possible to synchronize a plurality of acoustic signals having different sampling frequencies. For this reason, according to the aspects (1), (4), and (5), it is possible to suppress deterioration in accuracy of information based on sounds collected by a plurality of microphones.

According to the aspect (2) described above, it is possible to include a distance difference of a microphone from a sound source, a direct sound, and a reflected sound.

According to the aspect (3) described above, it is possible to correct a gap between sampling frequencies ω_mand ω_ideal.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram which shows an example of a configuration of an acoustic signal output device 1 of an embodiment.

FIG. 2 is a diagram which shows an example of a functional configuration of an acoustic signal processing device 20 in the embodiment.

FIG. 3 is a flowchart which shows an example of a flow of processing executed by the acoustic signal output device 1 of the embodiment.

FIG. 4 is a diagram which shows an application example of the acoustic signal output device 1 of the embodiment.

FIG. 5 is an explanatory diagram which describes a steering vector and a spectrum expansion matrix in the embodiment.

FIG. 6 is a first diagram which shows simulation results.

FIG. 7 is a second diagram which shows simulation results.

FIG. 8 is a third diagram which shows simulation results.

FIG. 9 is a fourth diagram which shows simulation results.

FIG. 10 is a fifth diagram which shows simulation results.

FIG. 11 is a sixth diagram which shows simulation results.

FIG. 12 is a seventh diagram which shows simulation results.

FIG. 13 is an eighth diagram which shows simulation results.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 is a diagram which shows an example of a configuration of an acoustic signal output device 1 of an embodiment. The acoustic signal output device 1 includes a microphone array 10 and an acoustic signal processing device 20. The microphone array 10 includes microphones 11-m (m is an integer of 1 or more and M or less; M is an integer of 2 or more). The microphones 11-m are located at different positions. The microphones 11-m collect a sound Z1_mwhich has arrived at the microphones 11-m. The sound Z1_marriving at the microphones 11-m includes, for example, a direct sound that is emitted by a sound source and an indirect sound that arrives after being reflected, absorbed, or scattered by a wall or the like. For this reason, a frequency spectrum of a sound source and a frequency spectrum of a sound collected by the microphones 11-m are not necessarily the same.

The microphones 11-m convert the collected sound Z1_minto an acoustic signal such as an electrical signal or an optical signal. The converted electrical signal or optical signal is an analog signal Z2_mwhich represents a relationship between a magnitude of the collected sound and a time at which the sound is collected. That is, the analog signal Z2_mrepresents a waveform in a time domain of the collected sound.

The microphone array 10 which includes m microphones 11-m outputs acoustic signals of M channels to the acoustic signal processing device 20.

The acoustic signal processing device 20 includes, for example, a central processing unit (CPU), a memory, an auxiliary storage device, and the like connected by a bus, and executes a program. The acoustic signal processing device 20 functions as a device including, for example, an analog to digital (AD) converter 21-1, an AD converter 21-2, . . . , an AD converter 21-M, an acoustic signal processing unit 22, and an ideal signal conversion unit 23 according to execution of a program. The acoustic signal processing device 20 acquires the acoustic signals of M channels from the microphone array 10, estimates the sampling frequency ω_mwhen an acoustic signal collected by the microphones 11-m is converted into a digital signal, and calculates an acoustic signal resampled at a virtual sampling frequency ω_idealusing an estimated sampling frequency ω_m.

The AD converter 21-m is included in each of the microphones 11-m and acquires the analog signal Z2_moutput by the microphones 11-m. The AD converter 21-m samples the acquired analog signal Z2_mat the sampling frequency ω_min the time domain. Hereinafter, a signal representing a waveform after execution of the sampling is referred to as a time domain digital signal Yall_m. Hereinafter, a signal in one frame, which is part of the time domain digital signal Yall_m, is referred to as a single frame time domain digital signal Y_mto simplify the description. Hereinafter, a g^thframe arranged in a time order is referred to as a frame g. Hereinafter, it is assumed that the frame is a frame g to simplify the description.

The single frame time domain digital signal Y_mis represented by the following expression (1).

Y_m=(y_m,0,y_m,1, . . . ,y_m,L-1)^T (1)

y_m, ξ is a (ξ+1)^thelement of the single frame time domain digital signal Ym. ξ is an integer of 0 or more and (L−1) or less. The element y_m, ξ is a magnitude of sound represented by the single frame time domain digital signal Y_m, and is a magnitude of sound at a ξ^thtime after the execution of the sampling, which is a time in one frame. Note that T in Expression (1) represents a transposition of a vector. Hereinafter, T in an expression like Expression (1) represents a transposition of a vector. Note that, L is a signal length of the single frame time domain digital signal Y_m.

The AD converter 21-m (analog to digital converter) includes a vibrator 211-m. The AD converter 21-m operates in synchronization with a sampling frequency generated by the vibrator 211-m.

The acoustic signal processing unit 22 acquires a sampling frequency ω_mand a sample time τ_m. The acoustic signal processing unit 22 converts a time domain digital signal Yall_minto an ideal signal to be described below on the basis of the acquired sampling frequency ω_mand sample time τ_m.

Note that the sample time τ_mis a start time for the AD converter 21-m to start sampling of the analog signal Z2_m. The sample time τ_mis a time difference which represents a gap between an initial phase of sampling by the AD converter 21-m and a phase serving as a predetermined reference.

Here, a sampling frequency generated by a vibrator will be described.

Since there are individual differences in respective vibrators 211-m and environmental influences such as heat or humidity for respective vibrators 211-m are not necessarily the same, sampling frequencies generated by respective vibrators 211-m are not necessarily the same in all of the vibrators 211-m. For this reason, all of the sampling frequencies ωm are not necessarily the same sampling frequency ω_ideal.

Hereinafter, a virtual sampling frequency of the vibrator 211-m is referred to as the virtual frequency ω_ideal. Note that a variation in sampling frequency generated by each of M vibrators 211-m is near a variation in reference transmission frequency of the vibrators 211-m, and a nominal frequency is, for example, about ×10⁻⁶±20% with respect to 16 kHz.

In addition, since the sampling frequencies generated by the vibrators 211-m are not necessarily the same in all of the vibrators 211-m, not all sample times τ_mare necessarily the same time.

Hereinafter, a sample time in a case in which there are no individual differences between the vibrators 211-m and no environmental influences such as heat or humidity with respect to the vibrators 211-m is referred to as a virtual time τ_ideal.

In this manner, respective sampling frequencies ω_mare not necessarily the same, and respective sample times τ_mare not necessarily the same. In addition, the microphones 11-m are not positioned at the same position. For this reason, each single frame time domain digital signal Y_mis not necessarily the same as an ideal signal. An ideal signal is a signal obtained by sampling the analog signal Z2_mat the virtual frequency ω_idealand the virtual time τ_ideal.

FIG. 2 is a diagram which shows an example of a functional configuration of the acoustic signal processing unit 22 in the embodiment.

The acoustic signal processing unit 22 includes a storage unit 220, a spectrum calculation processing unit 221, a steering vector generation unit 222, a spectrum expansion matrix generation unit 223, an evaluation unit 224, and a resampling unit 225.

The storage unit 220 is configured using a storage device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 220 stores the virtual frequency ω_ideal, the virtual time τ_ideal, the trial frequency ω_m, and the trial time T_m. The virtual frequency ω_idealand the virtual time τ_idealare known values stored in the storage unit 220 in advance. The trial frequency W_mis a value that is updated according to an evaluation result of the evaluation unit 224 to be described below and is a value of a physical quantity having the same dimension as the sampling frequency ω_m. The trial frequency W_mis a predetermined initial value until it is updated according to the evaluation result of the evaluation unit 224. The trial time T_mis a value that is updated according to the evaluation result of the evaluation unit 224 to be described below and is a value of a physical quantity having the same dimension as the sample time τ_m. The trial time T_mis a predetermined initial value until it is updated according to the evaluation result of the evaluation unit 224.

Note that, as an example, when the virtual frequency ω_idealis 16000 Hz, a trial frequency W₁is 15950 Hz, a trial time τ₁is 0 msec, a trial frequency W₂is 15980 Hz, a trial time τ₂is 0 msec, a trial frequency W₃is 16020 Hz, a trial time τ₃is 0 msec, a trial frequency W₄is 16050 Hz, a trial time τ₄is 0 msec, and the like.

Note that the acoustic signal processing unit 22 performs processing on an acquired acoustic signal, for example, for every length L.

The spectrum calculation processing unit 221 acquires an acoustic signal output by the AD converter 21 and calculates a spectrum by performing a Fourier transform on the acquired acoustic signal. The spectrum calculation processing unit 221 acquires a spectrum of a waveform represented by a single frame time domain digital signal Y_mfor all frames.

The spectrum calculation processing unit 221 acquires, for example, first, a time domain digital signal Yall_mfor each frame. Next, the spectrum calculation processing unit 221 acquires a spectrum X_mof the single frame time domain digital signal Y_min the frame g by performing a discrete Fourier transform on the single frame time domain digital signal Y_mfor each frame g.

Since the spectrum X_mis a Fourier component of the digital signal Y_m, the following expression (2) is established between the spectrum X_mand the digital signal Y_m.

X_m=DY_m (2)

In Expression (2), D is a matrix of L rows and L columns. An element D_<j_x,j_y> (j_xand j_yare integers of 1 or more and L of less) at row j_xand column j_yof the matrix D is represented by the following expression (3). Hereinafter, D is referred to as a discrete Fourier transform matrix.

X_mis a vector having L elements. In Expression (3), i represents an imaginary unit.

Note that an underscore represents that a letter or number to the right of the underscore is a subscript of a letter or number to the left of the underscore. For example, j_x represents j_x.

Note that < . . . > to the left of an underscore represents that the letters or numbers in < . . . > are a subscript of a letter or number to the right of the underscore. For example, y_<n,ξ> represents y_n,ξ.

$\begin{matrix} D_{j_{x}, j_{y}} = \frac{1}{\sqrt{L}} e^{- \frac{2 π i (j_{x} - 1) (j_{y} - 1)}{L}} & (3) \end{matrix}$

The steering vector generation unit 222 generates a steering vector for each microphone 11-m on the basis of the spectrum X_m. A steering vector is a vector having a transfer function from a microphone to a sound source as an element. The steering vector generation unit 222 may also generate a steering vector in a known manner.

The steering vector represents a difference between positions of the microphones 11-m having a transfer characteristic from the sound source to each of the microphones 11-m. The positions of the microphones 11-m are positions at which the microphones 11-m collect sounds.

The spectrum expansion matrix generation unit 223 acquires the trial frequency W_mand the trial time T_mstored in the storage unit 220, and generates a spectrum expansion matrix on the basis of the acquired trial frequency W_mand trial time T_m. A spectrum expansion matrix is a matrix representing a conversion from a frequency spectrum of an ideal signal into a frequency spectrum of a signal obtained by sampling the analog signal Z2_mat the sampling frequency W_mand the sampling time T_m.

The evaluation unit 224 determines whether the trial frequency W_mand the trial time T_msatisfy a predetermined condition (hereinafter referred to as an “evaluation condition”) on the basis of the steering vector, the spectrum expansion matrix, and the spectrum X_m.

Note that an evaluation condition is a condition based on the steering vector, the spectrum expansion matrix, and the spectrum X_m. The evaluation condition is, for example, a condition of satisfying Expression (21) described below.

The evaluation condition may be any other condition as long as all values obtained by multiplying the spectrum X_mby an inverse matrix of the spectrum expansion matrix, and dividing each element value of a vector of a result of the multiplication by an element value of the steering vector are values within a predetermined range.

The evaluation unit 224 determines the trial frequency W_mas the sampling frequency ω_mand determines the trial time T_mas the sample time τ_mwhen the trial frequency W_mand the trial time T_msatisfy the evaluation condition.

The evaluation unit 224 updates the trial frequency W_mand the trial time T_musing, for example, a Metropolis algorithm when the trial frequency W_mand the trial time T_mdo not satisfy the evaluation condition. A method of updating, by the evaluation unit 224, the trial frequency W_mand the trial time T_mis not limited thereto, and any algorithm such as a Monte Carlo method and the like may be used.

The resampling unit 225 converts the time domain digital signal Yall_minto an ideal signal on the basis of the sampling frequency ω_mand sample time τ_mdetermined by the evaluation unit 224.

FIG. 3 is a flowchart which shows an example of a flow of processing executed by the acoustic signal output device 1 of the embodiment.

Each microphone 11-m collects a sound and converts the collected sound into an electrical signal or an optical signal (step S101).

The AD converter 21-m samples a time domain digital signal Yall_mwhich is the electrical signal or optical signal converted in step S101 using the frequency ω_min the time domain (step S102).

The spectrum calculation processing unit 221 calculates a spectrum (step S103).

The steering vector generation unit 222 generates a steering vector for each microphone 11-m on the basis of the spectrum X_m(step S104).

The spectrum expansion matrix generation unit 223 acquires a trial frequency W_mand a trial time T_mstored in the storage unit 220, and generates a spectrum expansion matrix on the basis of the acquired trial frequency W_mand trial time T_m(step S105).

The evaluation unit 224 determines whether the trial frequency W_mand the trial time T_msatisfy the evaluation condition on the basis of the steering vector, the spectrum expansion matrix, and the spectrum X_m(step S106).

When the trial frequency W_mand the trial time T_msatisfy the evaluation condition (YES in step S106), the evaluation unit 224 determines the trial frequency W_mas the sampling frequency ω_m, and determines the trial time T_mas the sample time τ_m. Next, the resampling unit 225 converts the time domain digital signal Yall_minto an ideal signal on the basis of the sampling frequency ω_mand the sample time τ_mdetermined by the evaluation unit 224.

On the other hand, when the trial frequency W_mand the trial time T_mdo not satisfy the evaluation condition (No in step S106), values of the trial frequency W_mand the trial time T_mare updated.

Note that, in the processing from step S105 to S106, a spectrum expansion matrix is generated on the basis of the trial frequency W_mand the trial time T_m, and other processing may be used as long as it is based on an optimization algorithm for determining the sampling frequency ω_mand the sample time τ_mwhich satisfy the evaluation condition on the basis of the spectrum expansion matrix and the steering vector.

The optimization algorithm may also be another algorithm. The optimization algorithm may be, for example, a gradient descent method. In addition, the optimization algorithm may be, for example, a Metropolis algorithm. A Metropolis algorithm is one simulation method and is a kind of Monte Carlo method.

The acoustic signal output device 1 configured in this manner estimates the sampling frequency ω_mand the sample time τ_mon the basis of the spectrum expansion matrix and the steering vector and converts the time domain digital signal Yall_minto an ideal signal on the basis of the estimated sampling frequency ω_mand sample time τ_m. For this reason, the acoustic signal output device 1 configured in this manner can suppress deterioration in accuracy of information based on sounds collected by a plurality of microphones.

Application Example

FIG. 4 is a diagram which shows an application example of the acoustic signal output device 1 of the embodiment. FIG. 4 shows a sound source identification device 100 which is an application example of the acoustic signal output device 1.

The sound source identification device 100 includes, for example, a CPU, a memory, an auxiliary storage device, and the like connected by a bus and executes a program. The sound source identification device 100 functions as a device including the acoustic signal output device 1, an ideal signal acquisition unit 101, a sound source localization unit 102, a sound source separation unit 103, a speech zone detection unit 104, a feature amount extraction unit 105, an acoustic model storage unit 106, and a sound source identification unit 107 by executing the program.

Hereinafter, components having the same function as in FIG. 1 will be denoted by the same reference numerals and description thereof will be omitted.

Hereinafter, it is assumed that there are a plurality of sound sources to simplify the description.

The ideal signal acquisition unit 101 acquires ideal signals of M channels which are converted by the acoustic signal processing unit 22 and outputs the acquired ideal signals of the M channels to the sound source localization unit 102 and the sound source separation unit 103.

The sound source localization unit 102 determines a direction in which the sound sources are located (sound source localization) on the basis of the ideal signals of the M channels output by the ideal signal acquisition unit 101. The sound source localization unit 102 determines, for example, a direction in which each sound source is located for each frame of a predetermined length (for example, 20 ms). The sound source localization unit 102 calculates, for example, a spatial spectrum indicating power in each direction using a multiple signal classification (MUSIC) method in sound source localization. The sound source localization unit 102 determines a sound source direction for each sound source on the basis of the spatial spectrum. The sound source localization unit 102 outputs sound source direction information indicating a sound source direction to the sound source separation unit 103 and the speech zone detection unit 104.

The sound source separation unit 103 acquires the sound source direction information output by the sound source localization unit 102 and the ideal signals of the M channels output by the ideal signal acquisition unit 101. The sound source separation unit 103 separates the ideal signals of the M channels into ideal signals for each sound source which are signals indicating components of each sound source on the basis of a sound source direction indicated by the sound source direction information. The sound source separation unit 103 uses, for example, a geometric-constrained high-order decorrelation-based source separation (GHDSS) method at the time of separating the ideal signals into ideal signals for each sound source. The sound source separation unit 103 calculates spectrums of the separated ideal signals and outputs them to the speech zone detection unit 104.

The speech zone detection unit 104 acquires the sound source direction information output by the sound source localization unit 102 and the spectrums of ideal signals output by the sound source localization unit 102. The speech zone detection unit 104 detects a speech zone for each sound source on the basis of the acquired spectrums of separated acoustic signals and the acquired sound source direction information. For example, the speech zone detection unit 104 performs sound source detection and speech zone detection at the same time by performing threshold processing on an integrated spatial spectrum obtained by integrating spatial spectrums obtained for each frequency using the MUSIC method in the frequency direction. The speech zone detection unit 104 outputs a result of the detection, the direction information, and the spectrums of acoustic signals to the feature amount extraction unit 105.

The feature amount extraction unit 105 calculates an acoustic feature amount for acoustic recognition from the separated spectrums output by the speech zone detection unit 104 for each sound source. The feature amount extraction unit 105 calculates an acoustic feature amount by calculating, for example, a static mel-scale log spectrum (MSLS), a delta MSLS and one delta power for each predetermined time (for example, 10 ms). Note that the MSLS is obtained by performing an inverse discrete cosine conversion on a mel-frequency cepstrum coefficient (MFCC) using a spectrum feature amount as a feature amount of acoustic recognition. The feature amount extraction unit 105 outputs the obtained acoustic feature amount to the sound source identification unit 107.

The acoustic model storage unit 106 stores a sound source model. The sound source model is a model used by the sound source identification unit 107 to identify collected acoustic signals. The acoustic model storage unit 106 sets an acoustic feature amount of the acoustic signals to be identified as a sound source model and stores it in association with information indicating a sound source name for each sound source.

The sound source identification unit 107 identifies a sound source with reference to an acoustic model stored by the acoustic model storage unit 106, which indicates an acoustic feature amount output by the feature amount extraction unit 105.

Since the sound source identification device 100 configured in this manner includes the acoustic signal output device 1, it is possible to suppress an increase in errors of the identification of a sound source that are errors caused by the fact that all of the microphones 11-m are not located at the same position.

Hereinafter, a spectrum expansion matrix and a steering vector will be described according to a mathematical expression.

First, the spectrum expansion matrix will be described.

The spectrum expansion matrix is, for example, a function satisfying the following expression (4).

X_n=A_nX_ideal (4)

In Expression (4), A_nrepresents the spectrum expansion matrix. The spectrum expansion matrix A_nin Expression (4) represents conversion from a spectrum X_idealof an ideal signal to a spectrum X_nof a time domain digital signal Yall_n. Note that n is an integer of 1 or more and M or less.

Since the spectrum X_nand the spectrum X_idealof an ideal signal are vectors, A_nis a matrix.

A_nsatisfies a relationship of Expression (5).

A_n=DB_nD⁻¹ (5)

Expression (5) shows that A_nis a value obtained by applying a discrete Fourier transform matrix D from a left side and an inverse matrix of the discrete Fourier transform matrix D from a right side to a resampling matrix B_n.

The resampling matrix B_nis a matrix which converts a single frame time domain digital signal Y_idealinto a single frame time domain digital signal Y_n. To represent this using a mathematical expression, the resampling matrix B_nis a matrix satisfying a relationship of the following Expression (6). The single frame time domain digital signal Y_idealis a signal of the frame g of the ideal signal.

Y_n=B_nY₀ (6)

The value at row θ and column φ of the resampling matrix B_nis set to B_n,θ,φ (θ and φ are integers of 1 or more), and B_n,θ,φ satisfies a relationship of the following Expression (7).

$\begin{matrix} b_{n, θ, ϕ} = \sin c (\frac{π}{ω_{μ}} (θ \cdot ω_{n} + (τ_{n} - τ_{ideal}) - ϕ \cdot ω_{ideal})) & (7) \end{matrix}$

In Expression (7), ω_nrepresents a sampling frequency in a channel n. The channel n is an n^thchannel among a plurality of channels. In Expression (7), τ_nrepresents a sample time in the channel n.

The function sin c ( . . . ) appearing on the right side of Expression (7) is a function defined by the following Expression (8). In Expression (8), t is an arbitrary number.

$\begin{matrix} \sin c (t) = \frac{\sin (t)}{t} & (8) \end{matrix}$

The relationships represented by Expression (6) to Expression (8) are expressions known to be established between the single frame time domain digital signal Y_nand the single frame time domain digital signal Y_ideal.

Next, the steering vector will be described.

To simply the following description, a steering vector in a frequency bin f will be described. The steering vector in the frequency bin f is a function R_fsatisfying the following Expression (9). The steering vector R_fin the frequency bin f is a vector having M elements.

(χ_1,f, . . . ,χ_M,f)^T=R_fs_f (9)

In Expression (9), s_frepresents a spectrum intensity of a sound source in the frequency bin f. In Expression (9), χ_m,fis a spectrum intensity in the frequency bin f of a frequency spectrum of the analog signal Z2_msampled at the virtual frequency ω_ideal.

Hereinafter, vectors (χ_1,f, . . . , χ_M,f) on the left side in Expression (9) are referred to as a simultaneous observation spectrum E_fat the frequency bin f.

Here, a vector E_allin which the simultaneous observation spectrum E_fthroughout the frequency bin f is integrated is defined. Hereinafter, E_allis referred to as an entire simultaneous observation spectrum. The entire simultaneous observation spectrum E_allis a direct product of E_fin all frequency bins f. Specifically, the entire simultaneous observation spectrum E_allis represented by Expression (10).

Hereinafter, it is assumed that f is an integer of 0 or more and (F−1) or less, and the total number of frequency bins is F to simplify the description.

$\begin{matrix} E_{all} \equiv (\begin{matrix} χ_{1, 0} \\ ⋮ \\ χ_{M, 0} \\ ⋮ \\ χ_{1, F - 1} \\ ⋮ \\ χ_{M, F - 1} \end{matrix}) & (10) \end{matrix}$

The entire simultaneous observation spectrum E_allsatisfies relationships of the following Expression (11) and Expression (12).

$\begin{matrix} E_{all} \equiv (\begin{matrix} χ_{1, 0} \\ ⋮ \\ χ_{M, 0} \\ ⋮ \\ χ_{1, F - 1} \\ ⋮ \\ χ_{M, F - 1} \end{matrix}) = (\begin{matrix} r_{1, 0} s_{0} \\ ⋮ \\ r_{M, 0} s_{0} \\ ⋮ \\ r_{1, F - 1} s_{F - 1} \\ ⋮ \\ r_{M, F - 1} s_{F - 1} \end{matrix}) = (\begin{matrix} R_{0} \\ ⋱ \\ R_{F - 1} \end{matrix}) S & (11) \\ S \equiv {(s_{0}, \dots, s_{F - 1})}^{T} & (12) \end{matrix}$

Hereinafter, S defined in Expression (12) is referred to as a sound source spectrum. In Expression (11), r_m,fis an m^thelement value of the steering vector R_f.

Alternatively, a relationship of the following Expression (14) is established for a modified simultaneous observation spectrum H_mdefined by Expression (13) in which an order of subscripts of x is switched according to Expression (11).

$\begin{matrix} H_{m} = {(χ_{m, 0}, \dots, χ_{m, F - 1})}^{T} & (13) \\ (\begin{matrix} H_{1} \\ ⋮ \\ H_{M} \end{matrix}) = (\begin{matrix} r_{1, 0} \\ ⋱ \\ r_{1, F - 1} \\ ⋮ \\ r_{M, 0} \\ ⋱ \\ r_{M, F - 1} \end{matrix}) S & (14) \end{matrix}$

Here, if a permutation matrix P of (M×F) rows and (M×F) columns having an element value p_<k_x,k_y> is used, Expression (14) is modified to the following Expression (15). Note that k_xand k_yare integers of 1 or more and (M×F) or less.

$\begin{matrix} (\begin{matrix} H_{1} \\ ⋮ \\ H_{M} \end{matrix}) = P (\begin{matrix} R_{0} \\ ⋱ \\ R_{F - 1} \end{matrix}) S & (15) \end{matrix}$

The element p_<k_x,k_y> at row k_xand column k_yof P is 1 when k_xand k_ysatisfying the following Expression (16) and Expression (17) are present, and is 0 when they are not present.

k_x=f×M+(m−1)+1 (16)
k_y=f+(m−1)×M+1 (17)

The permutation matrix P is, for example, the following Expression (18) when M is 2 and F is 3.

$\begin{matrix} P = (\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{matrix}) & (18) \end{matrix}$

P is a unitary matrix. In addition, a determinant of P is +1 or −1.

Here, a relationship between a sound source spectrum s and the spectrum X_mwill be described.

Hereinafter, the relationship between the sound source spectrum s and the spectrum X_min a spectrum expansion model will be described.

In the spectrum expansion and contraction model, a situation in which each microphone 11-m performs sampling at a different sampling frequency is considered. In the spectrum expansion and contraction model, it is assumed that conversion of a sampling frequency is performed independently using each microphone 11-m, and thus it does not affect a transmission system. Note that a spatial correlation matrix in this situation is a spatial correlation matrix when each microphone 11-m performs synchronous sampling at the virtual frequency ω_ideal.

A relationship of the following Expression (19) is established between the modified simultaneous observation spectrum H_mand the spectrum X_maccording to Expression (4).

$\begin{matrix} (\begin{matrix} X_{1} \\ ⋮ \\ X_{M} \end{matrix}) = (\begin{matrix} A_{1} \\ ⋱ \\ A_{M} \end{matrix}) (\begin{matrix} H_{1} \\ ⋮ \\ H_{M} \end{matrix}) & (19) \end{matrix}$

If Expression (15) is substituted into Expression (19), Expression (20) which represents a relationship between the sound source spectrum s and the spectrum X_mis derived.

$\begin{matrix} (\begin{matrix} X_{1} \\ ⋮ \\ X_{M} \end{matrix}) = (\begin{matrix} A_{1} \\ ⋱ \\ A_{M} \end{matrix}) P (\begin{matrix} R_{1} \\ ⋱ \\ R_{F - 1} \end{matrix}) S & (20) \end{matrix}$

An example of evaluation condition will be described using a mathematical expression.

The evaluation condition may be, for example, a condition that all of differences between values obtained by dividing the element χ_m,fof the simultaneous observation E_fby the element value r_m,fof the steering vector R_fare within a predetermined range when the following three incidental conditions are satisfied.

A first incidental condition is, for example, a condition that a probability distribution of possible values for the sampling frequency ω_mis a normal distribution having a dispersion σ_ω²centered about the virtual frequency ω_ideal.

A second incidental condition is a condition that a probability distribution of possible values for the sample time τ_mis a normal distribution having a dispersion σ_τ²centered about the virtual time τ_ideal.

A third incidental condition is a condition that possible values of each element value of the simultaneous observation spectrum E_fhave a probability distribution represented by a likelihood function p of the following Expression (21).

$\begin{matrix} p (X_{1}, \dots X_{M} ❘ ω_{m}, τ_{m}) \equiv \frac{1}{M σ^{2}} { \sum_{m = 1}^{M} \frac{A_{m}^{- 1} X_{m}}{r_{m}} }_{2}^{2} & (21) \end{matrix}$

In Expression (21), σ represents a dispersion of a spectrum in a process in which a sound source spectrum is observed using each microphone 11-m. In Expression (21), A_m⁻¹represents an inverse matrix of a spectrum expansion matrix A_m.

Expression (21) is a function having a maximum value when a sound source is set to white noise, and when the sampling frequency ω_mis all the same, the sample time τ_mis all the same, and the microphones 11-m are located at the same position. When the sound source is white noise and the value of Expression (21) becomes maximum, a value obtained by dividing an element value of the simultaneous observation spectrum in each frame g and each frequency bin f by an element value of the steering vector in each frame g and each frequency bin f coincides with the sound source spectrum. Specifically, a relationship of Expression (22) is established.

$\begin{matrix} \frac{χ_{1, f}}{r_{1, f}} = \dots = \frac{χ_{M, f}}{r_{M, f}} = s_{f} & (22) \end{matrix}$

An evaluation condition may be in a form of using a sum of L1 norms (absolute values) instead of a sum of norms (squares of absolute values) in Expression (21) as a third incidental condition. In addition, the evaluation condition may be in a form of defining a likelihood function using a cosine similarity degree of each term in Expression (22).

Here, the steering vector and the spectrum expansion matrix in the embodiment will be described with reference to FIG. 5.

FIG. 5 is an explanatory diagram which describes the steering vector and the spectrum expansion matrix in the embodiment.

In FIG. 5, sounds emitted from a sound source are collected by a (virtual) synchronous microphone group.

The (virtual) synchronous microphone group includes a plurality of virtual synchronous microphones 31-m. The virtual synchronous microphones 31-m in FIG. 5 are virtual microphones which include AD converters and convert collected sounds into digital signals. All of the virtual synchronous microphones 31-m include a common oscillator and have the same sampling frequency. The sampling frequency of all of the virtual synchronous microphones 31-m is ω_ideal. The virtual synchronous microphones 31-m are located differently in a space.

In FIG. 5, an asynchronous microphone group includes a plurality of asynchronous microphones 32-m. The asynchronous microphones 32-m include oscillators. The oscillators provided in the asynchronous microphones 32-m are independent from each other. For this reason, sampling frequencies of the asynchronous microphones 32-m are not necessarily the same. The sampling frequencies of the asynchronous microphones 32-m are ω_m. Positions of the asynchronous microphones 32-m are the same as those of the virtual synchronous microphones 31-m.

Sounds emitted from a sound source are modulated due to a transmission path until they reach each virtual synchronous microphone 31-m. The sounds collected by each virtual synchronous microphone 31-m are affected by a difference between the virtual synchronous microphones 31-m in a distance from the sound source to each virtual synchronous microphone 31-m, and differs from one another for each virtual synchronous microphone 31-m. The sounds collected by each virtual synchronous microphone 31-m are direct sounds and reflected sounds from walls or floors, and direct sounds and reflected sounds reaching each virtual synchronous microphone differ in accordance with a difference of a position of each microphone.

Such a difference in modulation due to the transmission path for each virtual synchronous microphone 31-m is represented by a steering vector. In FIG. 5, r₁, . . . , r_Mare element values of the steering vector, and represents modulation which the sounds emitted by the sound source are subjected to until they are collected by the virtual synchronous microphones 31-m due to the transmission path of the sounds.

The sampling frequencies of the asynchronous microphones 32-m are not necessarily the same as ω_ideal. For this reason, a frequency component of a digital signal by the virtual synchronous microphone 31-m and a frequency component of a digital signal by the asynchronous microphone 32-m are not necessarily the same. The spectrum expansion matrix represents a change in digital signal caused by such a difference in sampling frequency.

x_m,frepresents a spectrum intensity of the spectrum X_mat the frequency bin f.

Experimental Result

FIGS. 6 to 13 are simulation results which indicate a corresponding relationship between the virtual frequency and virtual time acquired by the acoustic signal processing unit 22 in the embodiment and actual sampling frequency and sample time. FIGS. 6 to 13 are first to eighth diagrams that show simulation results.

FIGS. 6 to 13 are experimental results of experiments using two microphones with an interval of 20 cm.

That is, the simulation results in FIGS. 6 to 13 are experimental results in a case in which M is 2. FIGS. 6 to 13 are experimental results in a case in which there is one sound source. FIGS. 6 to 13 are experimental results of experiments when the sound source is on a line connecting two microphones and the sound source is located at a distance of 1 m from a center of the line connecting two microphones. FIGS. 6 to 13 are experimental results of experiments in which the sampling frequency in a calculation of the steering vector is 16 kHz, the number of samples in the Fourier transform is 512, and the sound source is white noise.

In FIGS. 6 to 13, the horizontal axis represents a sampling frequency on, and the vertical axis represents a sampling frequency on.

FIGS. 6 to 13 show a combination of the sampling frequency on and the sampling frequency ω₂, which maximizes a posteriori probability acquired by the acoustic signal processing unit 22 when the sampling frequencies on and on are changed every 10 Hz between 15900 Hz to 16100 Hz.

In FIGS. 6 to 13, the sampling frequency on is a sampling frequency for a sound collected by a microphone close to the sound source. In FIGS. 6 to 13, the sampling frequency ω₂is a sampling frequency for a sound collected by a microphone far from the sound source. Note that the sample time τ_min the simulations indicating simulation results in FIGS. 6 to 13 is 0.

In FIG. 6, a marker A indicating the combination of the sampling frequencies on and on of a microphone in the simulation and a marker B indicating the combination of the sampling frequencies ω₁and ω₂, which maximizes a posteriori probability indicated by the simulation results, coincide with each other.

FIG. 6 shows that both of the sampling frequencies ω₁and ω₂which maximize the posteriori probability indicated by the simulation results are 16000 Hz when both of the sampling frequencies ω₁and ω₂of a microphone in the simulation are set to 16000 kHz.

In FIG. 7, the marker A indicating the combination of the sampling frequencies ω₁and ω₂of a microphone in the simulation and the marker B indicating the combination of the sampling frequencies ω₁and ω₂, which maximizes a posteriori probability indicated by the simulation results, coincide with each other.

FIG. 7 shows that both of the sampling frequencies ω₁and ω₂which maximize the posteriori probability indicated by the simulation results are 16020 Hz when both of the sampling frequencies ω₁and ω₂of a microphone in the simulation are set to 16020 kHz.

Hereinafter, values of the sampling frequencies ω₁and ω₂of a microphone in the simulation are referred to as true values.

FIGS. 6 and 7 shows that the values of the sampling frequencies ω₁and ω₂which maximize the posteriori probability coincide with the true values. For this reason, FIGS. 6 and 7 shows that the acoustic signal processing unit 22 can acquire a virtual frequency and a virtual time with high accuracy.

In FIG. 8, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results are close to each other even though they do not coincide with each other.

The marker B of FIG. 8 indicates the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω₂is 16000 Hz and the true value of the sampling frequency ω₁is 15950 Hz.

In FIG. 9, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results are less likely to coincide with each other.

The marker B of FIG. 9 indicates the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω₂is 16000 Hz and the true value of the sampling frequency ω₁is 15980 Hz.

In FIG. 10, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results are less likely to coincide with each other.

The marker B of FIG. 10 indicates the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω₂is 16000 Hz and the true value of the sampling frequency ω₁is 16050 Hz.

In FIG. 11, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results are less likely to coincide with each other.

The marker B of FIG. 11 indicates the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω₂is 15990 Hz and the true value of the sampling frequency ω₁is 16010 Hz.

In FIG. 12, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results are less likely to coincide with each other.

The marker B of FIG. 12 indicates the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω₂is 15980 Hz and the true value of the sampling frequency ω₁is 16020 Hz.

In FIG. 13, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results are less likely to coincide with each other.

The marker B of FIG. 13 indicates the combination of the sampling frequencies ω₁and ω₂maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω₂is 15950 Hz and the true value of the sampling frequency ω₁is 16050 Hz.

In FIG. 8, the sampling frequency ω₁maximizing the posteriori probability is 15960 Hz, the sampling frequency ω₂maximizing the posteriori probability is 16010 Hz, the true value of the sampling frequency ω₁is 15950 Hz, and the true value of the sampling frequency ω₂is 16000 Hz. For this reason, in FIG. 8, a difference between the sampling frequency ω₁maximizing the posteriori probability and the sampling frequency ω₂maximizing the posteriori probability is equal to a difference between the true value of the sampling frequency ω₁and the true value of the sampling frequency ω₂.

This is because, even if the acoustic signal processing unit 22 does not acquire the sampling frequencies ω₁and ω₂that maximize the posteriori probability and equal to the true values, the result of FIG. 8 shows that the acoustic signal processing unit 22 acquires a reasonable combination of sampling frequencies to a certain extent.

Note that the posteriori probability is a product of a distribution of the sampling frequency ω_m, which is assumed in advance before simulation results are acquired, and a probability of the simulation results. The distribution of the sampling frequency ω_m, which is assumed in advance before simulation results are acquired, is, for example, a normal distribution. The probability of the simulation results is, for example, a likelihood function represented by Expression (21).

Modified Example

Note that the AD conversion unit 21-1 is not necessarily required to be included in the acoustic signal processing device 20, and may be included in the microphone array 10. In addition, the acoustic signal processing device 20 is not necessarily mounted on one case, and may be a device configured to be divided into a plurality of cases. In addition, the acoustic signal processing device 20 may be a device configured in a single case, or may be a device configured to be divided into a plurality of cases. When it is configured to be divided into a plurality of cases, a function of a part of the acoustic signal processing device 20 described above may be mounted at physically separated positions via a network. The acoustic signal output device 1 may also be a device configured in a single case or a device configured to be divided into a plurality of cases. When it is configured to be divided in a plurality of cases, the function of a part of the acoustic signal output device 1 may also be mounted on physically separated positions via the network.

Note that all or a part of respective functions of the acoustic signal output device 1, the acoustic signal processing device 20, and the sound source identification device 100 may be realized by using hardware such as an application specific integrated circuit (ASIC), a programmable logic device (PLD), or a field programmable gate array (FPGA). A program may be recorded in a computer-readable recording medium. The computer-readable recording medium is, for example, a portable disk such as a flexible disk a magneto-optical disc, a ROM, or a CD-ROM, or a storage device such as a hard disk built in a computer system. The program may be transmitted via an electric telecommunication line.

As described above, the embodiment of the present invention has been described in detail with reference to the drawings, but a specific configuration is not limited to this embodiment, and design and the like within a range not departing from the gist of the present invention can be included.

Acoustic signal processing device, acoustic signal processing method, and program

Information

Patent Number

Date Filed

Date Issued

Inventors

Original Assignees

Examiners

Agents

CPC

Field of Search

US

CPC

International Classifications

Abstract

Description

Claims

Priority Claims (1)

Non-Patent Literature Citations (2)

Related Publications (1)