Acoustic signal processing device, acoustic signal processing method, and program

Information

  • Patent Grant
  • Patent Number
    10,863,271
  • Date Filed
    Wednesday, August 28, 2019
  • Date Issued
    Tuesday, December 8, 2020
Abstract
An acoustic signal processing device includes an acoustic signal processing unit configured to calculate a spectrum of each acoustic signal and a steering vector having m elements on the basis of m acoustic signals converted into m digital signals by sampling m analog signals representing sounds collected by m microphones (m is an integer of 1 or more and M or less, and M is an integer of 2 or more), and to estimate a sampling frequency ωm in the sampling on the basis of the spectrum, the steering vector, and a sampling frequency ωideal that is a predetermined value.
Description
CROSS-REFERENCE TO RELATED APPLICATION

Priority is claimed on Japanese Patent Application No. 2018-165504, filed Sep. 4, 2018, the content of which is incorporated herein by reference.


BACKGROUND
Field of the Invention

The present invention relates to an acoustic signal processing device, an acoustic signal processing method, and a program.


Related Art

Conventionally, there is a technology of collecting sounds using a plurality of microphones and identifying a sound source on the basis of the collected sounds and information derived from them. In such a technology, the sounds collected by the microphones are converted into sampled electrical signals, signal processing is executed on the converted electrical signals, and the information based on the collected sounds is thereby acquired. In addition, the signal processing in such a technology assumes that the converted electrical signals are obtained by sampling sounds collected by microphones located at different positions at the same sampling frequency (for example, refer to Katsutoshi Itoyama, Kazuhiro Nakadai, "Synchronization between channels of a plurality of A/D converters based on probabilistic generation model," Proceedings of the 2018 Spring Conference, Acoustical Society of Japan, 2018, pp. 505-508).


However, in practice, an AD converter provided for each microphone samples the converted electrical signal in synchronization with a clock generated by a vibrator provided for each AD converter. For this reason, sampling is not necessarily performed at the same sampling frequency, depending on individual differences between the vibrators. In addition, in robots and the like which operate in extreme environments, external influences such as temperature and humidity differ for each vibrator. In this case, not only the individual differences of each vibrator but also the external influences may cause a gap between the clocks of the vibrators. To reduce such a gap, it has been proposed to use an oven-controlled crystal oscillator (OCXO), an oscillator with small individual differences such as an atomic clock, a large-capacity capacitor, or the like. However, it is not realistic to actually mount such components on a robot or the like and operate them. Therefore, in such conventional technologies, the accuracy of information based on sounds collected by a plurality of microphones may deteriorate in some cases.


SUMMARY OF THE INVENTION

Aspects of the present invention have been made in view of the above circumstances, and an object thereof is to provide an acoustic signal processing device, an acoustic signal processing method, and a computer program which can suppress deterioration in accuracy of information based on sounds collected by a plurality of microphones.


In order to solve the above problems, the present invention adopts the following aspects.


(1) An acoustic signal processing device according to one aspect of the present invention includes an acoustic signal processing unit configured to calculate a spectrum of each acoustic signal and a steering vector having m elements on the basis of m acoustic signals converted into m digital signals by sampling m analog signals representing sounds collected by m microphones (m is an integer of 1 or more and M or less, and M is an integer of 2 or more), and to estimate a sampling frequency ωm in the sampling on the basis of the spectrum, the steering vector, and a sampling frequency ωideal that is a predetermined value.


(2) In the aspect (1) described above, the steering vector may represent a difference between positions of the microphones having a transfer characteristic from a sound source of the sounds to each of the microphones.


(3) In the aspect (1) or (2) described above, a matrix representing a conversion of an analog signal from a spectrum of an ideal signal into a spectrum of a signal sampled at the sampling frequency ωm and a sample time τm is set to a spectrum expansion matrix, and the acoustic signal processing unit may estimate the sampling frequency ωm on the basis of the steering vector, the spectrum expansion matrix, and a spectrum Xm.


(4) An acoustic signal processing method according to another aspect of the present invention includes a spectrum calculation step of calculating a spectrum of each acoustic signal on the basis of m acoustic signals converted into m digital signals by sampling m analog signals representing sounds collected by m microphones (m is an integer of 1 or more and M or less, and M is an integer of 2 or more), a steering vector calculation step of calculating a steering vector having m elements on the basis of the m converted acoustic signals, and an estimation step of estimating a sampling frequency ωm in the sampling on the basis of the spectrum, the steering vector, and a sampling frequency ωideal that is a predetermined value.


(5) A computer readable non-transitory storage medium according to still another aspect of the present invention stores a program causing a computer of an acoustic signal processing device to execute a spectrum calculation step of calculating a spectrum of each acoustic signal on the basis of m acoustic signals converted into m digital signals by sampling m analog signals representing sounds collected by m microphones (m is an integer of 1 or more and M or less, and M is an integer of 2 or more), a steering vector calculation step of calculating a steering vector having m elements on the basis of the m converted acoustic signals, and an estimation step of estimating a sampling frequency ωm in the sampling on the basis of the spectrum, the steering vector, and a sampling frequency ωideal that is a predetermined value.


According to the aspects (1), (4), and (5), it is possible to synchronize a plurality of acoustic signals having different sampling frequencies. For this reason, according to the aspects (1), (4), and (5), it is possible to suppress deterioration in accuracy of information based on sounds collected by a plurality of microphones.


According to the aspect (2) described above, it is possible to take into account the difference in distance from the sound source to each microphone as well as direct sounds and reflected sounds.


According to the aspect (3) described above, it is possible to correct a gap between the sampling frequency ωm and the sampling frequency ωideal.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a diagram which shows an example of a configuration of an acoustic signal output device 1 of an embodiment.



FIG. 2 is a diagram which shows an example of a functional configuration of an acoustic signal processing device 20 in the embodiment.



FIG. 3 is a flowchart which shows an example of a flow of processing executed by the acoustic signal output device 1 of the embodiment.



FIG. 4 is a diagram which shows an application example of the acoustic signal output device 1 of the embodiment.



FIG. 5 is an explanatory diagram which describes a steering vector and a spectrum expansion matrix in the embodiment.



FIG. 6 is a first diagram which shows simulation results.



FIG. 7 is a second diagram which shows simulation results.



FIG. 8 is a third diagram which shows simulation results.



FIG. 9 is a fourth diagram which shows simulation results.



FIG. 10 is a fifth diagram which shows simulation results.



FIG. 11 is a sixth diagram which shows simulation results.



FIG. 12 is a seventh diagram which shows simulation results.



FIG. 13 is an eighth diagram which shows simulation results.





DETAILED DESCRIPTION OF THE INVENTION


FIG. 1 is a diagram which shows an example of a configuration of an acoustic signal output device 1 of an embodiment. The acoustic signal output device 1 includes a microphone array 10 and an acoustic signal processing device 20. The microphone array 10 includes microphones 11-m (m is an integer of 1 or more and M or less; M is an integer of 2 or more). The microphones 11-m are located at different positions. The microphones 11-m collect a sound Z1m which has arrived at the microphones 11-m. The sound Z1m arriving at the microphones 11-m includes, for example, a direct sound that is emitted by a sound source and an indirect sound that arrives after being reflected, absorbed, or scattered by a wall or the like. For this reason, a frequency spectrum of a sound source and a frequency spectrum of a sound collected by the microphones 11-m are not necessarily the same.


The microphones 11-m convert the collected sound Z1m into an acoustic signal such as an electrical signal or an optical signal. The converted electrical signal or optical signal is an analog signal Z2m which represents a relationship between a magnitude of the collected sound and a time at which the sound is collected. That is, the analog signal Z2m represents a waveform in a time domain of the collected sound.


The microphone array 10, which includes the M microphones 11-1 to 11-M, outputs acoustic signals of M channels to the acoustic signal processing device 20.


The acoustic signal processing device 20 includes, for example, a central processing unit (CPU), a memory, an auxiliary storage device, and the like connected by a bus, and executes a program. The acoustic signal processing device 20 functions as a device including, for example, an analog to digital (AD) converter 21-1, an AD converter 21-2, . . . , an AD converter 21-M, an acoustic signal processing unit 22, and an ideal signal conversion unit 23 according to execution of the program. The acoustic signal processing device 20 acquires the acoustic signals of M channels from the microphone array 10, estimates the sampling frequency ωm at which the acoustic signal collected by each microphone 11-m was converted into a digital signal, and calculates an acoustic signal resampled at a virtual sampling frequency ωideal using the estimated sampling frequency ωm.


The AD converter 21-m is provided for each of the microphones 11-m and acquires the analog signal Z2m output by the microphone 11-m. The AD converter 21-m samples the acquired analog signal Z2m at the sampling frequency ωm in the time domain. Hereinafter, a signal representing a waveform after execution of the sampling is referred to as a time domain digital signal Yallm. Hereinafter, a signal in one frame, which is part of the time domain digital signal Yallm, is referred to as a single frame time domain digital signal Ym to simplify the description. Hereinafter, the gth frame in time order is referred to as frame g, and, to simplify the description, the frame under consideration is assumed to be frame g.


The single frame time domain digital signal Ym is represented by the following expression (1).

Ym=(ym,0,ym,1, . . . ,ym,L-1)T  (1)


y_<m,ξ> is the (ξ+1)th element of the single frame time domain digital signal Ym. ξ is an integer of 0 or more and (L−1) or less. The element y_<m,ξ> is the magnitude of the sound represented by the single frame time domain digital signal Ym at the ξth sampling instant within the frame. Note that T in Expression (1) represents the transposition of a vector; hereinafter, T in expressions like Expression (1) represents transposition. Note that L is the signal length of the single frame time domain digital signal Ym.
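As a concrete illustration of Expression (1), the following minimal sketch (assuming non-overlapping frames, which the description does not specify) slices frame g of length L out of the full time domain digital signal:

```python
import numpy as np

def extract_frame(y_all, g, L):
    """Return frame g (0-indexed) of length L from the full time-domain
    digital signal y_all, i.e. the single frame signal of Expression (1).
    Non-overlapping frames are an assumption of this sketch."""
    start = g * L
    return y_all[start:start + L]

# Example: one second of a 16 kHz signal split into frames of L = 512.
y_all = np.random.randn(16000)
Y_m = extract_frame(y_all, g=3, L=512)
assert Y_m.shape == (512,)
```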


The AD converter 21-m (analog to digital converter) includes a vibrator 211-m. The AD converter 21-m operates in synchronization with a sampling frequency generated by the vibrator 211-m.


The acoustic signal processing unit 22 acquires a sampling frequency ωm and a sample time τm. The acoustic signal processing unit 22 converts a time domain digital signal Yallm into an ideal signal to be described below on the basis of the acquired sampling frequency ωm and sample time τm.


Note that the sample time τm is a start time for the AD converter 21-m to start sampling of the analog signal Z2m. The sample time τm is a time difference which represents a gap between an initial phase of sampling by the AD converter 21-m and a phase serving as a predetermined reference.


Here, a sampling frequency generated by a vibrator will be described.


Since there are individual differences between the respective vibrators 211-m, and environmental influences such as heat or humidity are not necessarily the same for each vibrator 211-m, the sampling frequencies generated by the respective vibrators 211-m are not necessarily the same for all of the vibrators 211-m. For this reason, the sampling frequencies ωm are not necessarily all equal to the sampling frequency ωideal.


Hereinafter, the sampling frequency that the vibrators 211-m would ideally generate is referred to as the virtual frequency ωideal. Note that the variation in the sampling frequencies generated by the M vibrators 211-m is on the order of the frequency tolerance of the vibrators 211-m, for example about ±20 × 10⁻⁶ (±20 ppm) with respect to a nominal frequency of 16 kHz.


In addition, since the sampling frequencies generated by the vibrators 211-m are not necessarily the same in all of the vibrators 211-m, not all sample times τm are necessarily the same time.


Hereinafter, a sample time in a case in which there are no individual differences between the vibrators 211-m and no environmental influences such as heat or humidity with respect to the vibrators 211-m is referred to as a virtual time τideal.


In this manner, respective sampling frequencies ωm are not necessarily the same, and respective sample times τm are not necessarily the same. In addition, the microphones 11-m are not positioned at the same position. For this reason, each single frame time domain digital signal Ym is not necessarily the same as an ideal signal. An ideal signal is a signal obtained by sampling the analog signal Z2m at the virtual frequency ωideal and the virtual time τideal.



FIG. 2 is a diagram which shows an example of a functional configuration of the acoustic signal processing unit 22 in the embodiment.


The acoustic signal processing unit 22 includes a storage unit 220, a spectrum calculation processing unit 221, a steering vector generation unit 222, a spectrum expansion matrix generation unit 223, an evaluation unit 224, and a resampling unit 225.


The storage unit 220 is configured using a storage device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 220 stores the virtual frequency ωideal, the virtual time τideal, the trial frequency Wm, and the trial time Tm. The virtual frequency ωideal and the virtual time τideal are known values stored in the storage unit 220 in advance. The trial frequency Wm is a value that is updated according to an evaluation result of the evaluation unit 224 to be described below, and is a value of a physical quantity having the same dimension as the sampling frequency ωm. The trial frequency Wm is a predetermined initial value until it is first updated according to the evaluation result of the evaluation unit 224. The trial time Tm is a value that is updated according to the evaluation result of the evaluation unit 224 to be described below, and is a value of a physical quantity having the same dimension as the sample time τm. The trial time Tm is a predetermined initial value until it is first updated according to the evaluation result of the evaluation unit 224.


Note that, as an example, when the virtual frequency ωideal is 16000 Hz, the trial frequency W1 is 15950 Hz, the trial time T1 is 0 msec, the trial frequency W2 is 15980 Hz, the trial time T2 is 0 msec, the trial frequency W3 is 16020 Hz, the trial time T3 is 0 msec, the trial frequency W4 is 16050 Hz, and the trial time T4 is 0 msec.


Note that the acoustic signal processing unit 22 performs processing on an acquired acoustic signal, for example, for every length L.


The spectrum calculation processing unit 221 acquires the acoustic signals output by the AD converters 21-m and calculates a spectrum by performing a Fourier transform on each acquired acoustic signal. The spectrum calculation processing unit 221 acquires the spectrum of the waveform represented by the single frame time domain digital signal Ym for all frames.


The spectrum calculation processing unit 221 acquires, for example, first, a time domain digital signal Yallm for each frame. Next, the spectrum calculation processing unit 221 acquires a spectrum Xm of the single frame time domain digital signal Ym in the frame g by performing a discrete Fourier transform on the single frame time domain digital signal Ym for each frame g.


Since the spectrum Xm is a Fourier component of the digital signal Ym, the following expression (2) is established between the spectrum Xm and the digital signal Ym.

Xm=DYm  (2)


In Expression (2), D is a matrix of L rows and L columns. The element D_<jx,jy> (jx and jy are integers of 1 or more and L or less) at row jx and column jy of the matrix D is represented by the following expression (3). Hereinafter, D is referred to as a discrete Fourier transform matrix.


Xm is a vector having L elements. In Expression (3), i represents an imaginary unit.


Note that an underscore represents that a letter or number to the right of the underscore is a subscript of a letter or number to the left of the underscore. For example, j_x represents jx.


Note that < . . . > to the right of an underscore represents that the letters or numbers in < . . . > are a subscript of the letter or number to the left of the underscore. For example, y_<n,ξ> represents yn,ξ.










$$D_{j_x,\, j_y} = \frac{1}{\sqrt{L}}\, e^{-2\pi i \,(j_x - 1)(j_y - 1)/L} \qquad (3)$$
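The following sketch constructs the discrete Fourier transform matrix D of Expression (3) and checks Expression (2) numerically. The unitary 1/√L normalization is an assumption of the reconstruction above; it has the convenient property that D⁻¹ is simply the conjugate transpose.

```python
import numpy as np

def dft_matrix(L):
    """Discrete Fourier transform matrix D of Expression (3). The
    1/sqrt(L) (unitary) normalization is an assumption of this sketch."""
    jx, jy = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
    return np.exp(-2j * np.pi * jx * jy / L) / np.sqrt(L)

L = 8
D = dft_matrix(L)
Y = np.random.randn(L)
X = D @ Y                                        # Expression (2): X_m = D Y_m
assert np.allclose(X, np.fft.fft(Y) / np.sqrt(L))
assert np.allclose(D.conj().T @ D, np.eye(L))    # D is unitary
```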







The steering vector generation unit 222 generates a steering vector for each microphone 11-m on the basis of the spectrum Xm. A steering vector is a vector whose elements are transfer functions from the sound source to the microphones. The steering vector generation unit 222 may generate the steering vector in a known manner.


The steering vector represents a difference between positions of the microphones 11-m having a transfer characteristic from the sound source to each of the microphones 11-m. The positions of the microphones 11-m are positions at which the microphones 11-m collect sounds.


The spectrum expansion matrix generation unit 223 acquires the trial frequency Wm and the trial time Tm stored in the storage unit 220, and generates a spectrum expansion matrix on the basis of the acquired trial frequency Wm and trial time Tm. A spectrum expansion matrix is a matrix representing a conversion from the frequency spectrum of an ideal signal into the frequency spectrum of a signal obtained by sampling the analog signal Z2m at the trial frequency Wm and the trial time Tm.


The evaluation unit 224 determines whether the trial frequency Wm and the trial time Tm satisfy a predetermined condition (hereinafter referred to as an “evaluation condition”) on the basis of the steering vector, the spectrum expansion matrix, and the spectrum Xm.


Note that an evaluation condition is a condition based on the steering vector, the spectrum expansion matrix, and the spectrum Xm. The evaluation condition is, for example, a condition of satisfying Expression (21) described below.


The evaluation condition may be any other condition as long as all of the values obtained by multiplying the spectrum Xm by an inverse matrix of the spectrum expansion matrix and dividing each element value of the resulting vector by the corresponding element value of the steering vector are within a predetermined range.


The evaluation unit 224 determines the trial frequency Wm as the sampling frequency ωm and determines the trial time Tm as the sample time τm when the trial frequency Wm and the trial time Tm satisfy the evaluation condition.


The evaluation unit 224 updates the trial frequency Wm and the trial time Tm using, for example, a Metropolis algorithm when the trial frequency Wm and the trial time Tm do not satisfy the evaluation condition. The method by which the evaluation unit 224 updates the trial frequency Wm and the trial time Tm is not limited thereto, and any algorithm such as a Monte Carlo method may be used.
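A minimal sketch of such a Metropolis update follows; the callable log_likelihood is a hypothetical stand-in for the evaluation based on the steering vector, the spectrum expansion matrix, and the spectrum Xm (cf. Expression (21) below), and the step sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_step(W, T, log_likelihood, step_w=5.0, step_t=1e-5):
    """One Metropolis update of the trial frequency W and trial time T.
    log_likelihood(W, T) is a hypothetical stand-in for the evaluation
    used by the evaluation unit 224."""
    W_new = W + rng.normal(0.0, step_w)      # propose a nearby frequency
    T_new = T + rng.normal(0.0, step_t)      # propose a nearby sample time
    log_ratio = log_likelihood(W_new, T_new) - log_likelihood(W, T)
    if np.log(rng.uniform()) < log_ratio:    # accept with prob. min(1, ratio)
        return W_new, T_new
    return W, T

# Toy example: a dummy likelihood peaked at 16 kHz and T = 0.
toy = lambda W, T: -((W - 16000.0) ** 2) / 50.0**2 - (T / 1e-4) ** 2
W, T = 15950.0, 0.0
for _ in range(2000):
    W, T = metropolis_step(W, T, toy)
print(W, T)   # wanders near 16000 Hz and 0 s
```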


The resampling unit 225 converts the time domain digital signal Yallm into an ideal signal on the basis of the sampling frequency ωm and sample time τm determined by the evaluation unit 224.



FIG. 3 is a flowchart which shows an example of a flow of processing executed by the acoustic signal output device 1 of the embodiment.


Each microphone 11-m collects a sound and converts the collected sound into an electrical signal or an optical signal (step S101).


The AD converter 21-m samples, at the sampling frequency ωm in the time domain, the analog signal Z2m that is the electrical signal or optical signal obtained by the conversion in step S101, and thereby generates a time domain digital signal Yallm (step S102).


The spectrum calculation processing unit 221 calculates a spectrum (step S103).


The steering vector generation unit 222 generates a steering vector for each microphone 11-m on the basis of the spectrum Xm (step S104).


The spectrum expansion matrix generation unit 223 acquires a trial frequency Wm and a trial time Tm stored in the storage unit 220, and generates a spectrum expansion matrix on the basis of the acquired trial frequency Wm and trial time Tm (step S105).


The evaluation unit 224 determines whether the trial frequency Wm and the trial time Tm satisfy the evaluation condition on the basis of the steering vector, the spectrum expansion matrix, and the spectrum Xm (step S106).


When the trial frequency Wm and the trial time Tm satisfy the evaluation condition (YES in step S106), the evaluation unit 224 determines the trial frequency Wm as the sampling frequency ωm, and determines the trial time Tm as the sample time τm. Next, the resampling unit 225 converts the time domain digital signal Yallm into an ideal signal on the basis of the sampling frequency ωm and the sample time τm determined by the evaluation unit 224.


On the other hand, when the trial frequency Wm and the trial time Tm do not satisfy the evaluation condition (NO in step S106), the values of the trial frequency Wm and the trial time Tm are updated, and the processing returns to step S105.


Note that, in the processing from step S105 to step S106, a spectrum expansion matrix is generated on the basis of the trial frequency Wm and the trial time Tm; other processing may be used as long as it is based on an optimization algorithm that determines the sampling frequency ωm and the sample time τm satisfying the evaluation condition on the basis of the spectrum expansion matrix and the steering vector.


The optimization algorithm may also be another algorithm. The optimization algorithm may be, for example, a gradient descent method or a Metropolis algorithm. A Metropolis algorithm is a simulation method that is a kind of Monte Carlo method.


The acoustic signal output device 1 configured in this manner estimates the sampling frequency ωm and the sample time τm on the basis of the spectrum expansion matrix and the steering vector and converts the time domain digital signal Yallm into an ideal signal on the basis of the estimated sampling frequency ωm and sample time τm. For this reason, the acoustic signal output device 1 configured in this manner can suppress deterioration in accuracy of information based on sounds collected by a plurality of microphones.


Application Example


FIG. 4 is a diagram which shows an application example of the acoustic signal output device 1 of the embodiment. FIG. 4 shows a sound source identification device 100 which is an application example of the acoustic signal output device 1.


The sound source identification device 100 includes, for example, a CPU, a memory, an auxiliary storage device, and the like connected by a bus and executes a program. The sound source identification device 100 functions as a device including the acoustic signal output device 1, an ideal signal acquisition unit 101, a sound source localization unit 102, a sound source separation unit 103, a speech zone detection unit 104, a feature amount extraction unit 105, an acoustic model storage unit 106, and a sound source identification unit 107 by executing the program.


Hereinafter, components having the same function as in FIG. 1 will be denoted by the same reference numerals and description thereof will be omitted.


Hereinafter, it is assumed that there are a plurality of sound sources to simplify the description.


The ideal signal acquisition unit 101 acquires ideal signals of M channels which are converted by the acoustic signal processing unit 22 and outputs the acquired ideal signals of the M channels to the sound source localization unit 102 and the sound source separation unit 103.


The sound source localization unit 102 determines a direction in which the sound sources are located (sound source localization) on the basis of the ideal signals of the M channels output by the ideal signal acquisition unit 101. The sound source localization unit 102 determines, for example, a direction in which each sound source is located for each frame of a predetermined length (for example, 20 ms). The sound source localization unit 102 calculates, for example, a spatial spectrum indicating power in each direction using a multiple signal classification (MUSIC) method in sound source localization. The sound source localization unit 102 determines a sound source direction for each sound source on the basis of the spatial spectrum. The sound source localization unit 102 outputs sound source direction information indicating a sound source direction to the sound source separation unit 103 and the speech zone detection unit 104.
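As an illustration of the MUSIC spatial spectrum mentioned above, the following is a minimal narrowband sketch for a single frequency bin; the far-field linear-array steering model and all parameter values are assumptions for the example, not the configuration used by the sound source localization unit 102:

```python
import numpy as np

def music_spectrum(X_frames, steering, n_sources=1):
    """Minimal narrowband MUSIC pseudo-spectrum for one frequency bin.
    X_frames: (M, G) observed spectra over G frames.
    steering: (M, K) candidate steering vectors for K directions."""
    R = X_frames @ X_frames.conj().T / X_frames.shape[1]  # spatial correlation
    w, V = np.linalg.eigh(R)                    # eigenvalues in ascending order
    En = V[:, : R.shape[0] - n_sources]         # noise subspace
    num = np.einsum("mk,mk->k", steering.conj(), steering).real
    den = np.linalg.norm(En.conj().T @ steering, axis=0) ** 2
    return num / den                            # peaks at source directions

# Toy usage: M = 4 element linear array, K = 181 candidate angles.
M, G, K, f, c, d = 4, 200, 181, 1000.0, 343.0, 0.05
angles = np.deg2rad(np.linspace(-90, 90, K))
delays = d * np.arange(M)[:, None] * np.sin(angles)[None, :] / c
A = np.exp(-2j * np.pi * f * delays)            # candidate steering vectors
src = A[:, 120:121] * (np.random.randn(1, G) + 1j * np.random.randn(1, G))
X = src + 0.1 * (np.random.randn(M, G) + 1j * np.random.randn(M, G))
p = music_spectrum(X, A, n_sources=1)
print(np.rad2deg(angles[np.argmax(p)]))         # near the true direction (30 deg)
```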


The sound source separation unit 103 acquires the sound source direction information output by the sound source localization unit 102 and the ideal signals of the M channels output by the ideal signal acquisition unit 101. The sound source separation unit 103 separates the ideal signals of the M channels into ideal signals for each sound source which are signals indicating components of each sound source on the basis of a sound source direction indicated by the sound source direction information. The sound source separation unit 103 uses, for example, a geometric-constrained high-order decorrelation-based source separation (GHDSS) method at the time of separating the ideal signals into ideal signals for each sound source. The sound source separation unit 103 calculates spectrums of the separated ideal signals and outputs them to the speech zone detection unit 104.


The speech zone detection unit 104 acquires the sound source direction information output by the sound source localization unit 102 and the spectrums of the ideal signals output by the sound source separation unit 103. The speech zone detection unit 104 detects a speech zone for each sound source on the basis of the acquired spectrums of the separated acoustic signals and the acquired sound source direction information. For example, the speech zone detection unit 104 performs sound source detection and speech zone detection at the same time by performing threshold processing on an integrated spatial spectrum obtained by integrating, in the frequency direction, the spatial spectrums obtained for each frequency using the MUSIC method. The speech zone detection unit 104 outputs the result of the detection, the direction information, and the spectrums of the acoustic signals to the feature amount extraction unit 105.


The feature amount extraction unit 105 calculates, for each sound source, an acoustic feature amount for acoustic recognition from the separated spectrums output by the speech zone detection unit 104. The feature amount extraction unit 105 calculates an acoustic feature amount by calculating, for example, a static mel-scale log spectrum (MSLS), a delta MSLS, and one delta power for each predetermined time (for example, 10 ms). Note that the MSLS is obtained by performing an inverse discrete cosine transform on mel-frequency cepstrum coefficients (MFCC), and uses a spectrum feature amount as a feature amount for acoustic recognition. The feature amount extraction unit 105 outputs the obtained acoustic feature amount to the sound source identification unit 107.


The acoustic model storage unit 106 stores a sound source model. The sound source model is a model used by the sound source identification unit 107 to identify collected acoustic signals. The acoustic model storage unit 106 sets an acoustic feature amount of the acoustic signals to be identified as a sound source model and stores it in association with information indicating a sound source name for each sound source.


The sound source identification unit 107 identifies the sound source indicated by the acoustic feature amount output by the feature amount extraction unit 105, with reference to the acoustic model stored in the acoustic model storage unit 106.


Since the sound source identification device 100 configured in this manner includes the acoustic signal output device 1, it is possible to suppress an increase in sound source identification errors caused by the fact that the microphones 11-m are not all located at the same position.


<Description of Spectrum Expansion Matrix and Steering Vector Using Mathematical Expression>


Hereinafter, a spectrum expansion matrix and a steering vector will be described according to a mathematical expression.


First, the spectrum expansion matrix will be described.


The spectrum expansion matrix is, for example, a function satisfying the following expression (4).

Xn=AnXideal  (4)


In Expression (4), An represents the spectrum expansion matrix. The spectrum expansion matrix An in Expression (4) represents conversion from a spectrum Xideal of an ideal signal to a spectrum Xn of a time domain digital signal Yalln. Note that n is an integer of 1 or more and M or less.


Since the spectrum Xn and the spectrum Xideal of an ideal signal are vectors, An is a matrix.


An satisfies a relationship of Expression (5).

An=DBnD−1  (5)


Expression (5) shows that An is obtained by multiplying the resampling matrix Bn by the discrete Fourier transform matrix D from the left and by the inverse matrix of the discrete Fourier transform matrix D from the right.


The resampling matrix Bn is a matrix which converts a single frame time domain digital signal Yideal into a single frame time domain digital signal Yn. To represent this using a mathematical expression, the resampling matrix Bn is a matrix satisfying a relationship of the following Expression (6). The single frame time domain digital signal Yideal is a signal of the frame g of the ideal signal.

Yn=BnYideal  (6)


Let b_<n,θ,φ> (θ and φ are integers of 1 or more and L or less) denote the element at row θ and column φ of the resampling matrix Bn; b_<n,θ,φ> satisfies the relationship of the following Expression (7).










$$b_{n,\theta,\phi} = \mathrm{sinc}\left(\frac{\pi}{\omega_{ideal}}\left(\theta \cdot \omega_n + (\tau_n - \tau_{ideal}) - \phi \cdot \omega_{ideal}\right)\right) \qquad (7)$$







In Expression (7), ωn represents a sampling frequency in a channel n. The channel n is an nth channel among a plurality of channels. In Expression (7), τn represents a sample time in the channel n.


The function sinc( . . . ) appearing on the right side of Expression (7) is the function defined by the following Expression (8). In Expression (8), t is an arbitrary number.










$$\mathrm{sinc}(t) = \frac{\sin(t)}{t} \qquad (8)$$







The relationships represented by Expression (6) to Expression (8) are known to hold between the single frame time domain digital signal Yn and the single frame time domain digital signal Yideal.
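The pieces above can be combined into a short sketch that builds Bn from Expression (7) (as reconstructed here, so the exact sinc argument should be treated as an assumption) and then forms An = D Bn D⁻¹ per Expression (5):

```python
import numpy as np

def sinc(t):
    """Unnormalized sinc of Expression (8): sin(t)/t with sinc(0) = 1.
    np.sinc(x) computes sin(pi x)/(pi x), hence the rescaling."""
    return np.sinc(np.asarray(t) / np.pi)

def resampling_matrix(L, w_n, tau_n, w_ideal, tau_ideal):
    """Resampling matrix B_n of Expressions (6)-(7), following the
    reconstruction of Expression (7) above (an assumption, not a
    verbatim transcription of the patent's formula)."""
    theta, phi = np.meshgrid(np.arange(1, L + 1), np.arange(1, L + 1),
                             indexing="ij")
    arg = (np.pi / w_ideal) * (theta * w_n + (tau_n - tau_ideal)
                               - phi * w_ideal)
    return sinc(arg)

# Expression (5): A_n = D B_n D^{-1}, with D the DFT matrix of Expression (3).
L, w_ideal = 64, 16000.0
D = np.exp(-2j * np.pi * np.outer(np.arange(L), np.arange(L)) / L) / np.sqrt(L)
B_n = resampling_matrix(L, w_n=16050.0, tau_n=0.0,
                        w_ideal=w_ideal, tau_ideal=0.0)
A_n = D @ B_n @ np.linalg.inv(D)   # Expression (4): X_n = A_n X_ideal
```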


Next, the steering vector will be described.


To simplify the following description, a steering vector in a frequency bin f will be described. The steering vector in the frequency bin f is a vector Rf satisfying the following Expression (9). The steering vector Rf in the frequency bin f is a vector having M elements.

(χ1,f, . . . ,χM,f)T=Rfsf  (9)


In Expression (9), sf represents a spectrum intensity of a sound source in the frequency bin f. In Expression (9), χm,f is a spectrum intensity in the frequency bin f of a frequency spectrum of the analog signal Z2m sampled at the virtual frequency ωideal.


Hereinafter, the vector (χ1,f, . . . , χM,f)T on the left side of Expression (9) is referred to as the simultaneous observation spectrum Ef at the frequency bin f.


Here, a vector Eall in which the simultaneous observation spectra Ef over all frequency bins f are integrated is defined. Hereinafter, Eall is referred to as the entire simultaneous observation spectrum. The entire simultaneous observation spectrum Eall is the direct product of Ef over all frequency bins f. Specifically, the entire simultaneous observation spectrum Eall is represented by Expression (10).


Hereinafter, it is assumed that f is an integer of 0 or more and (F−1) or less, and the total number of frequency bins is F to simplify the description.










$$E_{all} \equiv \left(\chi_{1,0},\ \ldots,\ \chi_{M,0},\ \ldots,\ \chi_{1,F-1},\ \ldots,\ \chi_{M,F-1}\right)^T \qquad (10)$$







The entire simultaneous observation spectrum Eall satisfies relationships of the following Expression (11) and Expression (12).











$$E_{all} = \begin{pmatrix} \chi_{1,0} \\ \vdots \\ \chi_{M,0} \\ \vdots \\ \chi_{1,F-1} \\ \vdots \\ \chi_{M,F-1} \end{pmatrix} = \begin{pmatrix} r_{1,0}\, s_0 \\ \vdots \\ r_{M,0}\, s_0 \\ \vdots \\ r_{1,F-1}\, s_{F-1} \\ \vdots \\ r_{M,F-1}\, s_{F-1} \end{pmatrix} = \begin{pmatrix} R_0 & & \\ & \ddots & \\ & & R_{F-1} \end{pmatrix} S \qquad (11)$$

$$S \equiv (s_0,\ \ldots,\ s_{F-1})^T \qquad (12)$$







Hereinafter, S defined in Expression (12) is referred to as a sound source spectrum. In Expression (11), rm,f is an mth element value of the steering vector Rf.


Alternatively, a relationship of the following Expression (14) is established, according to Expression (11), for the modified simultaneous observation spectrum Hm defined by Expression (13), in which the ordering of the subscripts of χ is switched.










$$H_m \equiv (\chi_{m,0},\ \ldots,\ \chi_{m,F-1})^T \qquad (13)$$

$$\begin{pmatrix} H_1 \\ \vdots \\ H_M \end{pmatrix} = \begin{pmatrix} \mathrm{diag}(r_{1,0},\ \ldots,\ r_{1,F-1}) \\ \vdots \\ \mathrm{diag}(r_{M,0},\ \ldots,\ r_{M,F-1}) \end{pmatrix} S \qquad (14)$$







Here, if a permutation matrix P of (M×F) rows and (M×F) columns having an element value p_<kx,ky> is used, Expression (14) is modified to the following Expression (15). Note that kx and ky are integers of 1 or more and (M×F) or less.










$$\begin{pmatrix} H_1 \\ \vdots \\ H_M \end{pmatrix} = P \begin{pmatrix} R_0 & & \\ & \ddots & \\ & & R_{F-1} \end{pmatrix} S \qquad (15)$$







The element p_<kx,ky> at row kx and column ky of P is 1 when there exist m and f satisfying the following Expression (16) and Expression (17), and is 0 otherwise.

kx=(m−1)×F+f+1  (16)
ky=f×M+(m−1)+1  (17)


The permutation matrix P is, for example, the following Expression (18) when M is 2 and F is 3.









$$P = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \qquad (18)$$







P is a unitary matrix. In addition, a determinant of P is +1 or −1.
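The following sketch builds P from Expressions (16) and (17) and reproduces Expression (18) for M = 2 and F = 3, confirming the index formulas:

```python
import numpy as np

def permutation_matrix(M, F):
    """Permutation matrix P of Expressions (16)-(17): maps the
    frequency-major ordering of E_all to the channel-major ordering
    of (H_1, ..., H_M)."""
    P = np.zeros((M * F, M * F))
    for m in range(1, M + 1):
        for f in range(F):
            kx = (m - 1) * F + f + 1        # Expression (16)
            ky = f * M + (m - 1) + 1        # Expression (17)
            P[kx - 1, ky - 1] = 1.0
    return P

P = permutation_matrix(2, 3)
print(P.astype(int))                 # matches Expression (18)
assert np.allclose(P @ P.T, np.eye(6))   # P is unitary (orthogonal)
```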


Here, a relationship between the sound source spectrum S and the spectrum Xm will be described.


Hereinafter, the relationship between the sound source spectrum S and the spectrum Xm in a spectrum expansion model will be described.


In the spectrum expansion model, a situation in which each microphone 11-m performs sampling at a different sampling frequency is considered. In the spectrum expansion model, it is assumed that the conversion of the sampling frequency occurs independently for each microphone 11-m and therefore does not affect the transmission system. Note that the spatial correlation matrix in this situation is the spatial correlation matrix obtained when each microphone 11-m performs synchronous sampling at the virtual frequency ωideal.


A relationship of the following Expression (19) is established between the modified simultaneous observation spectrum Hm and the spectrum Xm according to Expression (4).










$$\begin{pmatrix} X_1 \\ \vdots \\ X_M \end{pmatrix} = \begin{pmatrix} A_1 & & \\ & \ddots & \\ & & A_M \end{pmatrix} \begin{pmatrix} H_1 \\ \vdots \\ H_M \end{pmatrix} \qquad (19)$$







If Expression (15) is substituted into Expression (19), Expression (20), which represents the relationship between the sound source spectrum S and the spectrum Xm, is derived.










$$\begin{pmatrix} X_1 \\ \vdots \\ X_M \end{pmatrix} = \begin{pmatrix} A_1 & & \\ & \ddots & \\ & & A_M \end{pmatrix} P \begin{pmatrix} R_0 & & \\ & \ddots & \\ & & R_{F-1} \end{pmatrix} S \qquad (20)$$
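A shape-level sketch of Expression (20) with random placeholder matrices (not real acoustic data) shows how the blocks compose; scipy's block_diag builds the block-diagonal factors:

```python
import numpy as np
from scipy.linalg import block_diag

# Shape check of Expression (20): M channels, F frequency bins, frame length F.
M, F = 2, 3
A = [np.random.randn(F, F) for _ in range(M)]   # spectrum expansion matrices A_m
R = [np.random.randn(M, 1) for _ in range(F)]   # steering vectors R_f
S = np.random.randn(F, 1)                       # sound source spectrum

def permutation_matrix(M, F):
    # Expressions (16)-(17), 0-indexed (same construction as the earlier sketch).
    P = np.zeros((M * F, M * F))
    for m in range(M):
        for f in range(F):
            P[m * F + f, f * M + m] = 1.0
    return P

E_all = block_diag(*R) @ S                      # Expression (11)
H = permutation_matrix(M, F) @ E_all            # Expression (15)
X = block_diag(*A) @ H                          # Expressions (19)-(20)
print(X.reshape(M, F))                          # X_1, ..., X_M stacked by channel
```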







<Description of Evaluation Condition Using Mathematical Expression>


An example of the evaluation condition will be described using mathematical expressions.


The evaluation condition may be, for example, a condition that all of the differences between the values obtained by dividing the elements χm,f of the simultaneous observation spectrum Ef by the corresponding element values rm,f of the steering vector Rf are within a predetermined range when the following three incidental conditions are satisfied.


A first incidental condition is, for example, a condition that the probability distribution of possible values of the sampling frequency ωm is a normal distribution with variance σω² centered on the virtual frequency ωideal.


A second incidental condition is a condition that the probability distribution of possible values of the sample time τm is a normal distribution with variance στ² centered on the virtual time τideal.


A third incidental condition is a condition that the possible values of each element of the simultaneous observation spectrum Ef follow the probability distribution represented by the likelihood function p of the following Expression (21).










$$p(X_1,\ \ldots,\ X_M \mid \omega_m, \tau_m) \propto \frac{1}{M\sigma^2} \sum_{m=1}^{M} \left\| \frac{A_m^{-1} X_m}{r_m} \right\|_2^2 \qquad (21)$$







In Expression (21), σ represents the dispersion of the spectrum in the process in which the sound source spectrum is observed by each microphone 11-m. In Expression (21), Am−1 represents the inverse matrix of the spectrum expansion matrix Am, and the division by rm is performed elementwise.


Expression (21) is a function having a maximum value when the sound source is white noise, the sampling frequencies ωm are all the same, the sample times τm are all the same, and the microphones 11-m are located at the same position. When the sound source is white noise and the value of Expression (21) is maximized, the value obtained by dividing each element value of the simultaneous observation spectrum in each frame g and each frequency bin f by the corresponding element value of the steering vector coincides with the sound source spectrum. Specifically, the relationship of Expression (22) is established.











$$\frac{\chi_{1,f}}{r_{1,f}} = \cdots = \frac{\chi_{M,f}}{r_{M,f}} = s_f \qquad (22)$$







The evaluation condition may instead use a sum of L1 norms (absolute values) in place of the sum of squared L2 norms (squares of absolute values) in Expression (21) as the third incidental condition. In addition, the evaluation condition may define the likelihood function using the cosine similarity between the terms in Expression (22).
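A minimal sketch of such an evaluation follows; it checks agreement of the per-channel ratios of Expression (22) within a tolerance, which is one of the several admissible conditions named above, and the function and parameter names are assumptions:

```python
import numpy as np

def ratios(A_list, X_list, R):
    """Per-channel ratios (A_m^{-1} X_m) / r_m of Expression (22),
    elementwise over frequency bins; R has shape (F, M) with
    R[f, m] = r_{m,f}."""
    out = []
    for m, (A_m, X_m) in enumerate(zip(A_list, X_list)):
        ideal = np.linalg.solve(A_m, X_m)      # A_m^{-1} X_m
        out.append(ideal / R[:, m])
    return np.stack(out)                        # shape (M, F)

def evaluation_ok(A_list, X_list, R, tol):
    """Evaluation condition used in this sketch: all channel ratios
    agree within tol (the text also allows the likelihood of
    Expression (21) or a cosine-similarity variant)."""
    q = ratios(A_list, X_list, R)
    return np.all(np.abs(q - q.mean(axis=0)) < tol)

# Toy usage with M = 2 channels and F = 4 bins.
F = 4
A_list = [np.eye(F), np.eye(F)]
R = np.ones((F, 2))
s = np.random.randn(F)
X_list = [R[:, 0] * s, R[:, 1] * s]
print(evaluation_ok(A_list, X_list, R, tol=1e-9))   # True
```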


Here, the steering vector and the spectrum expansion matrix in the embodiment will be described with reference to FIG. 5.



FIG. 5 is an explanatory diagram which describes the steering vector and the spectrum expansion matrix in the embodiment.


In FIG. 5, sounds emitted from a sound source are collected by a (virtual) synchronous microphone group.


The (virtual) synchronous microphone group includes a plurality of virtual synchronous microphones 31-m. The virtual synchronous microphones 31-m in FIG. 5 are virtual microphones which include AD converters and convert collected sounds into digital signals. All of the virtual synchronous microphones 31-m share a common oscillator and have the same sampling frequency. The sampling frequency of all of the virtual synchronous microphones 31-m is ωideal. The virtual synchronous microphones 31-m are located at different positions in space.


In FIG. 5, an asynchronous microphone group includes a plurality of asynchronous microphones 32-m. The asynchronous microphones 32-m include oscillators. The oscillators provided in the asynchronous microphones 32-m are independent from each other. For this reason, sampling frequencies of the asynchronous microphones 32-m are not necessarily the same. The sampling frequencies of the asynchronous microphones 32-m are ωm. Positions of the asynchronous microphones 32-m are the same as those of the virtual synchronous microphones 31-m.


Sounds emitted from a sound source are modulated by the transmission path until they reach each virtual synchronous microphone 31-m. The sound collected by each virtual synchronous microphone 31-m is affected by the difference in distance from the sound source to that microphone, and therefore differs for each virtual synchronous microphone 31-m. The sounds collected by each virtual synchronous microphone 31-m consist of direct sounds and sounds reflected from walls or floors, and the direct and reflected sounds reaching each virtual synchronous microphone differ in accordance with the difference in the position of each microphone.


Such a difference in modulation due to the transmission path for each virtual synchronous microphone 31-m is represented by the steering vector. In FIG. 5, r1, . . . , rM are the element values of the steering vector, and represent the modulation that the sounds emitted by the sound source undergo, due to the transmission path, until they are collected by the virtual synchronous microphones 31-m.


The sampling frequencies of the asynchronous microphones 32-m are not necessarily the same as ωideal. For this reason, a frequency component of a digital signal by the virtual synchronous microphone 31-m and a frequency component of a digital signal by the asynchronous microphone 32-m are not necessarily the same. The spectrum expansion matrix represents a change in digital signal caused by such a difference in sampling frequency.


x_<m,f> represents the spectrum intensity of the spectrum Xm at the frequency bin f.


Experimental Result


FIGS. 6 to 13 show simulation results which indicate the correspondence between the sampling frequency and sample time estimated by the acoustic signal processing unit 22 in the embodiment and the actual sampling frequency and sample time. FIGS. 6 to 13 are the first to eighth diagrams that show simulation results.



FIGS. 6 to 13 are experimental results of experiments using two microphones with an interval of 20 cm.


That is, the simulation results in FIGS. 6 to 13 are experimental results in a case in which M is 2. FIGS. 6 to 13 are experimental results in a case in which there is one sound source. FIGS. 6 to 13 are experimental results of experiments when the sound source is on a line connecting two microphones and the sound source is located at a distance of 1 m from a center of the line connecting two microphones. FIGS. 6 to 13 are experimental results of experiments in which the sampling frequency in a calculation of the steering vector is 16 kHz, the number of samples in the Fourier transform is 512, and the sound source is white noise.


In FIGS. 6 to 13, the horizontal axis represents the sampling frequency ω1, and the vertical axis represents the sampling frequency ω2.



FIGS. 6 to 13 show the combination of the sampling frequency ω1 and the sampling frequency ω2 which maximizes the posteriori probability acquired by the acoustic signal processing unit 22 when the sampling frequencies ω1 and ω2 are changed every 10 Hz between 15900 Hz and 16100 Hz.


In FIGS. 6 to 13, the sampling frequency ω1 is the sampling frequency for the sound collected by the microphone close to the sound source. In FIGS. 6 to 13, the sampling frequency ω2 is the sampling frequency for the sound collected by the microphone far from the sound source. Note that the sample time τm in the simulations shown in FIGS. 6 to 13 is 0.


In FIG. 6, a marker A indicating the combination of the sampling frequencies ω1 and ω2 of the microphones in the simulation and a marker B indicating the combination of the sampling frequencies ω1 and ω2 which maximizes the posteriori probability indicated by the simulation results coincide with each other.



FIG. 6 shows that both of the sampling frequencies ω1 and ω2 which maximize the posteriori probability indicated by the simulation results are 16000 Hz when both of the sampling frequencies ω1 and ω2 of the microphones in the simulation are set to 16000 Hz.


In FIG. 7, the marker A indicating the combination of the sampling frequencies ω1 and ω2 of the microphones in the simulation and the marker B indicating the combination of the sampling frequencies ω1 and ω2 which maximizes the posteriori probability indicated by the simulation results coincide with each other.



FIG. 7 shows that both of the sampling frequencies ω1 and ω2 which maximize the posteriori probability indicated by the simulation results are 16020 Hz when both of the sampling frequencies ω1 and ω2 of the microphones in the simulation are set to 16020 Hz.


Hereinafter, the values of the sampling frequencies ω1 and ω2 of the microphones in the simulation are referred to as true values.



FIGS. 6 and 7 show that the values of the sampling frequencies ω1 and ω2 which maximize the posteriori probability coincide with the true values. For this reason, FIGS. 6 and 7 show that the acoustic signal processing unit 22 can estimate the sampling frequency and sample time with high accuracy.


In FIG. 8, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω1 and ω2 maximizing a posteriori probability indicated by the simulation results are close to each other even though they do not coincide with each other.


The marker B of FIG. 8 indicates the combination of the sampling frequencies ω1 and ω2 maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω2 is 16000 Hz and the true value of the sampling frequency ω1 is 15950 Hz.


In FIG. 9, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω1 and ω2 maximizing the posteriori probability indicated by the simulation results do not coincide with each other.


The marker B of FIG. 9 indicates the combination of the sampling frequencies ω1 and ω2 maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω2 is 16000 Hz and the true value of the sampling frequency ω1 is 15980 Hz.


In FIG. 10, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω1 and ω2 maximizing the posteriori probability indicated by the simulation results do not coincide with each other.


The marker B of FIG. 10 indicates the combination of the sampling frequencies ω1 and ω2 maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω2 is 16000 Hz and the true value of the sampling frequency ω1 is 16050 Hz.


In FIG. 11, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω1 and ω2 maximizing the posteriori probability indicated by the simulation results do not coincide with each other.


The marker B of FIG. 11 indicates the combination of the sampling frequencies ω1 and ω2 maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω2 is 15990 Hz and the true value of the sampling frequency ω1 is 16010 Hz.


In FIG. 12, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω1 and ω2 maximizing the posteriori probability indicated by the simulation results do not coincide with each other.


The marker B of FIG. 12 indicates the combination of the sampling frequencies ω1 and ω2 maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω2 is 15980 Hz and the true value of the sampling frequency ω1 is 16020 Hz.


In FIG. 13, the marker A indicating the true value and the marker B indicating the combination of the sampling frequencies ω1 and ω2 maximizing the posteriori probability indicated by the simulation results do not coincide with each other.


The marker B of FIG. 13 indicates the combination of the sampling frequencies ω1 and ω2 maximizing a posteriori probability indicated by the simulation results when the true value of the sampling frequency ω2 is 15950 Hz and the true value of the sampling frequency ω1 is 16050 Hz.


In FIG. 8, the sampling frequency ω1 maximizing the posteriori probability is 15960 Hz, the sampling frequency ω2 maximizing the posteriori probability is 16010 Hz, the true value of the sampling frequency ω1 is 15950 Hz, and the true value of the sampling frequency ω2 is 16000 Hz. For this reason, in FIG. 8, a difference between the sampling frequency ω1 maximizing the posteriori probability and the sampling frequency ω2 maximizing the posteriori probability is equal to a difference between the true value of the sampling frequency ω1 and the true value of the sampling frequency ω2.


This shows that, even when the acoustic signal processing unit 22 does not acquire sampling frequencies ω1 and ω2 that maximize the posteriori probability and are equal to the true values, as the result of FIG. 8 indicates, the acoustic signal processing unit 22 still acquires a reasonable combination of sampling frequencies to a certain extent.


Note that the posteriori probability is the product of the distribution of the sampling frequency ωm assumed in advance before the simulation results are acquired (the prior) and the probability of the simulation results (the likelihood). The distribution of the sampling frequency ωm assumed in advance is, for example, a normal distribution. The probability of the simulation results is, for example, the likelihood function represented by Expression (21).
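A hypothetical sketch of the grid search described above (10 Hz steps over 15900 Hz to 16100 Hz, with a Gaussian prior centered on ωideal; the prior width and the likelihood are stand-ins, not values from the patent) might look like:

```python
import numpy as np

def grid_search(log_likelihood, w_ideal=16000.0, sigma_w=50.0,
                lo=15900.0, hi=16100.0, step=10.0):
    """Evaluate the log posterior (Gaussian log prior centered on
    w_ideal plus a hypothetical log likelihood) on a 10 Hz grid of
    (omega_1, omega_2) and return the maximizing pair."""
    grid = np.arange(lo, hi + step, step)
    best, best_lp = None, -np.inf
    for w1 in grid:
        for w2 in grid:
            lp = (-((w1 - w_ideal) ** 2 + (w2 - w_ideal) ** 2)
                  / (2.0 * sigma_w ** 2) + log_likelihood(w1, w2))
            if lp > best_lp:
                best, best_lp = (w1, w2), lp
    return best

# Toy likelihood peaked at the "true" pair (15950 Hz, 16000 Hz).
toy = lambda w1, w2: -((w1 - 15950.0) ** 2 + (w2 - 16000.0) ** 2) / 20.0**2
print(grid_search(toy))   # -> (15950.0, 16000.0)
```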


Modified Example

Note that the AD converters 21-m are not necessarily required to be included in the acoustic signal processing device 20, and may be included in the microphone array 10. In addition, the acoustic signal processing device 20 may be a device configured in a single case, or may be a device configured to be divided into a plurality of cases. When it is configured to be divided into a plurality of cases, some of the functions of the acoustic signal processing device 20 described above may be mounted at physically separated positions connected via a network. The acoustic signal output device 1 may likewise be a device configured in a single case or a device configured to be divided into a plurality of cases; when it is divided into a plurality of cases, some of the functions of the acoustic signal output device 1 may be mounted at physically separated positions connected via the network.


Note that all or a part of the respective functions of the acoustic signal output device 1, the acoustic signal processing device 20, and the sound source identification device 100 may be realized by using hardware such as an application specific integrated circuit (ASIC), a programmable logic device (PLD), or a field programmable gate array (FPGA). A program may be recorded in a computer-readable recording medium. The computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disc, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The program may be transmitted via an electric telecommunication line.


As described above, the embodiment of the present invention has been described in detail with reference to the drawings; however, a specific configuration is not limited to this embodiment, and designs and the like within a range not departing from the gist of the present invention are also included.

Claims
  • 1. An acoustic signal processing device comprising: an acoustic signal processing unit configured to calculate a spectrum of each acoustic signal and a steering vector having m elements on the basis of m acoustic signals converted into m digital signals by sampling m analog signals representing sounds collected by m microphones (m is an integer of 1 or more and M or less, and M is an integer of 2 or more) and to estimate a sampling frequency ωm in the sampling on the basis of the spectrum, the steering vector, and a sampling frequency ωideal that is a predetermined value; wherein the steering vector represents a difference between positions of the microphones having a transfer characteristic from a sound source of the sounds to each of the microphones.
  • 2. The acoustic signal processing device according to claim 1, wherein a matrix representing a conversion of an analog signal from a spectrum of an ideal signal into a spectrum of a signal sampled at the sampling frequency ωm and a sample time τm is set to a spectrum expansion matrix, and the acoustic signal processing unit estimates the sampling frequency ωm on the basis of the steering vector, the spectrum expansion matrix, and a spectrum Xm.
  • 3. An acoustic signal processing method comprising: a spectrum calculation step of calculating a spectrum of each acoustic signal on the basis of m acoustic signals converted into m digital signals by sampling m analog signals representing sounds collected by m microphones (m is an integer of 1 or more and M or less, and M is an integer of 2 or more); a steering vector calculation step of calculating a steering vector having m elements on the basis of the m converted acoustic signals; and an estimation step of estimating a sampling frequency ωm in the sampling on the basis of the spectrum, the steering vector, and a sampling frequency ωideal that is a predetermined value; wherein the steering vector represents a difference between positions of the microphones having a transfer characteristic from a sound source of the sounds to each of the microphones.
  • 4. A computer readable non-transitory storage medium which stores a program causing a computer of an acoustic signal processing device to execute: a spectrum calculation step of calculating a spectrum of each acoustic signal on the basis of m acoustic signals converted into m digital signals by sampling m analog signals representing sounds collected by m microphones (m is an integer of 1 or more and M or less, and M is an integer of 2 or more); a steering vector calculation step of calculating a steering vector having m elements on the basis of the m converted acoustic signals; and an estimation step of estimating a sampling frequency ωm in the sampling on the basis of the spectrum, the steering vector, and a sampling frequency ωideal that is a predetermined value; wherein the steering vector represents a difference between positions of the microphones having a transfer characteristic from a sound source of the sounds to each of the microphones.
Priority Claims (1)
Number: 2018-165504 | Date: Sep 2018 | Country: JP | Kind: national
Non-Patent Literature Citations (2)
Entry
Itoyama, Katsutoshi and Nakadai, Kazuhiro, “Synchronization between channels of a plurality of A/D converters based on probabilistic generation model,” Proceedings of the 2018 Spring Conference, Acoustical Society of Japan, 2018, pp. 505-508 (Year: 2018).
Itoyama, Katsutoshi and Nakadai, Kazuhiro, Synchronization of multiple A/D converters based on a statistical generative model*, 2018, pp. 505-508, discussed in specification, English translation included, 15 pages.
Related Publications (1)
Number: 20200077187 A1 | Date: Mar 2020 | Country: US