The present invention relates to a technology for extracting a signal from each sound source included in a mixed acoustic signal observed by using a plurality of microphones.
Sound source extraction technology for estimating a signal from each sound source before mixing from a mixed acoustic signal (hereinafter simply referred to as an observed signal) observed using a plurality of microphones is widely used for speech recognition preprocessing. As a sound source extraction technology, for example, independent vector extraction (IVE) described in NPL 1 is known.
However, independent vector extraction of the related art has the problem that the processing time required for sound source extraction increases as the number of microphones increases.
Therefore, an object of the present invention is to provide a sound source signal generation technology based on an optimization algorithm that enables high-speed processing of sound source extraction.
An aspect of the present invention is a sound source signal generation device in which K and M are integers satisfying 1≤K<M, x(f, t) (f=1, . . . , F, t=1, . . . , T) (where f is an index indicating a frequency bin and t is an index indicating a time frame) is an observed signal of mixed sound from K sound sources observed using M microphones, xi(f, t) (i=1, . . . , K, f=1, . . . , F, t=1, . . . , T) is an i-th sound source signal, the i-th sound source signal being an estimation signal of an i-th sound source, W(f)=[w1(f), . . . , wK(f), WZ(f)] (where wi(f)∈CM (i=1, . . . , K) is a separation filter for the i-th sound source signal, and WZ(f)∈CM×(M−K) is a separation filter for a noise signal) is a separation matrix, Vi(f) (i=1, . . . , K) is an auxiliary function of the i-th sound source signal, and VZ(f) is an auxiliary function of the noise signal, the sound source signal generation device including an initialization unit configured to initialize a separation matrix W(f) and an auxiliary function VZ(f); an optimization unit configured to optimize the separation matrix W(f) using the observed signal x(f, t); and a sound source signal generation unit configured to generate an i-th sound source signal xi(f, t) from the observed signal x(f, t) using the separation matrix W(f), wherein the optimization unit includes an auxiliary function calculation unit configured to calculate the auxiliary function Vi(f) (i=1, . . . , K) using predetermined equations; a first separation filter calculation unit configured to calculate the separation filter wi(f) (i=1, . . . , K) using auxiliary functions Vi(f) (i=1, . . . , K) and Vz(f); and a second separation filter calculation unit configured to calculate a separation filter WZ(f) according to a predetermined equation when a convergence condition is satisfied.
According to the present invention, it is possible to execute sound source extraction processing at a high speed.
Hereinafter, embodiments of the present invention will be described in detail. Components having the same function are denoted by the same reference signs, and redundant description is omitted.
A notation method used in this specification will be described before the embodiments are described.
^ (caret) represents a superscript. For example, x^{y^z} means that y^z is a superscript to x, and x_{y^z} means that y^z is a subscript to x. Further, _ (underscore) represents a subscript. For example, x^{y_z} means that y_z is a superscript to x, and x_{y_z} means that y_z is a subscript to x.
Superscripts “^” and “~” as in ^x and ~x for a certain character x would normally be written directly above “x,” but are written as ^x or ~x here due to restrictions on notation in this specification.
$\mathbb{C}$ is the set of complex numbers, d and d′ are integers equal to or greater than 1, $I_d \in \mathbb{C}^{d \times d}$ represents the d-dimensional unit matrix, and $O_{d,d'} \in \mathbb{C}^{d \times d'}$ represents the d×d′ zero matrix. Further, $e_j^{(d)}$ represents the d-dimensional unit vector in which the j-th element is 1 and the other elements are 0.
For a vector v and a matrix A, $v^T$ and $A^T$ represent the transpose of the vector v and the transpose of the matrix A, respectively. Further, $v^h$ and $A^h$ represent the complex conjugate transpose of the vector v and the complex conjugate transpose of the matrix A, respectively.
$\|v\|$ represents the Euclidean norm of the vector v. That is, $\|v\| = (v^h v)^{1/2}$.
Hereinafter, sound source extraction technology is treated as sound source extraction in the short-time Fourier transform (STFT) domain.
A situation in which signals from K sound sources and (M−K)-dimensional noise signals are observed using the M microphones is considered. Here, it is assumed that 1≤K<M. f is an index indicating a frequency bin, t is an index indicating time, and an observed signal x(f, t) (f=1, . . . , F and t=1, . . . , T) in the short-time Fourier transform domain is represented as follows:
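$x(f,t) = A_s(f)\,s(f,t) + A_z(f)\,z(f,t) \in \mathbb{C}^{M}$ [Math. 1]
$A_s(f) = [a_1(f), \ldots, a_K(f)] \in \mathbb{C}^{M \times K}$ [Math. 2]
$s(f,t) = [s_1(f,t), \ldots, s_K(f,t)]^T \in \mathbb{C}^{K}$ [Math. 3]
$A_z(f) \in \mathbb{C}^{M \times (M-K)}$ [Math. 4]
$z(f,t) \in \mathbb{C}^{M-K}$ [Math. 5]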
Here, $s_i(f,t) \in \mathbb{C}$ (i=1, . . . , K) is an STFT coefficient of the i-th sound source, and $z(f,t) \in \mathbb{C}^{M-K}$ is an STFT coefficient of the noise. Further, $a_i(f) \in \mathbb{C}^{M}$ (i=1, . . . , K) is an acoustic transfer function from the i-th sound source to the M microphones, and $A_z(f) \in \mathbb{C}^{M \times (M-K)}$ is an acoustic transfer function from the noise to the M microphones.
A blind sound source extraction problem (hereinafter referred to as a BSE problem) and a semi-blind sound source extraction problem (hereinafter referred to as a semi-BSE problem) are formulated as follows.
The BSE problem is the problem of obtaining the i-th sound source signal $x_i(f,t) \in \mathbb{C}^{M}$ (i=1, . . . , K, f=1, . . . , F, t=1, . . . , T), which is an estimation signal of the i-th sound source, with the number K of sound sources and the observed signal x(f, t) (f=1, . . . , F, t=1, . . . , T) as inputs.
$x_i(f,t) = a_i(f)\,s_i(f,t)$ [Math. 6]
The semi-BSE problem is the problem of obtaining the i-th sound source signal $x_i(f,t)$ (i=1, . . . , K, f=1, . . . , F, t=1, . . . , T), which is an estimation signal of the i-th sound source, with the number K of sound sources, the observed signal x(f, t) (f=1, . . . , F, t=1, . . . , T), and the acoustic transfer functions $a_i(f)$ (i=1, . . . , L, where L is an integer that satisfies 1≤L≤K) as inputs. When L=K, the semi-BSE problem is called a beamforming problem.
Next, the assumptions for the BSE problem and the semi-BSE problem dealt with in the present invention (hereinafter referred to as an independent vector extraction model) will be described. A matrix $A(f) \in \mathbb{C}^{M \times M}$ is defined by the following equation.
$A(f) = [a_1(f), \ldots, a_K(f), A_z(f)]$ [Math. 7]
Further, a vector $s_i(t) \in \mathbb{C}^{F}$ is defined by the following equation.
$s_i(t) = [s_i(1,t), \ldots, s_i(F,t)]^T$ [Math. 8]
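(Assumption 1) It is assumed that there is a matrix $W(f) \in \mathbb{C}^{M \times M}$ that satisfies $W(f)^h A(f) = I_M$ for the matrix $A(f) \in \mathbb{C}^{M \times M}$. Here,
$W(f) = [w_1(f), \ldots, w_K(f), W_z(f)]$ [Math. 9]
$w_i(f) \in \mathbb{C}^{M}$ (i=1, . . . , K) [Math. 10]
$W_z(f) \in \mathbb{C}^{M \times (M-K)}$ [Math. 11]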
Here, wi(f)∈CM (i=1, . . . , K) is called a separation filter for an i-th sound source signal, WZ(f)∈CM×(M−K) is called a separation filter for a noise signal, and the matrix W(f) is called the separation matrix.
$W(f)^h A(f) = I_M$ is equivalent to the following equations.
$s_i(f,t) = w_i(f)^h x(f,t) \in \mathbb{C}$ (i=1, . . . , K) [Math. 12]
$z(f,t) = W_z(f)^h x(f,t) \in \mathbb{C}^{M-K}$ [Math. 13]
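The relationship between the mixing model [Math. 1] and the separation equations [Math. 12] and [Math. 13] can be checked numerically. The following is a minimal sketch for a single frequency bin, assuming NumPy; the sizes, the random mixing matrix, and the helper `crandn` are illustrative assumptions and not part of the specification. Gaussian samples are used for the sources only to exercise the algebra, not to model Assumption 3.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, T = 4, 2, 100  # microphones, sources, time frames (illustrative sizes)

def crandn(*shape):
    """Unit-variance complex Gaussian samples (hypothetical helper)."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# One frequency bin f: A(f) = [a_1(f), ..., a_K(f), A_z(f)]
A = crandn(M, M)
s = crandn(K, T)       # source coefficients s_i(f, t)
z = crandn(M - K, T)   # noise coefficients z(f, t)

# Observed signal x(f, t) = A_s(f) s(f, t) + A_z(f) z(f, t)   [Math. 1]
x = A[:, :K] @ s + A[:, K:] @ z

# A separation matrix with W^h A = I_M is W = A^{-h}
W = np.linalg.inv(A).conj().T

s_hat = W[:, :K].conj().T @ x   # s_i(f, t) = w_i(f)^h x(f, t)   [Math. 12]
z_hat = W[:, K:].conj().T @ x   # z(f, t)   = W_z(f)^h x(f, t)   [Math. 13]
```

In practice A(f) is unknown, which is why the separation matrix has to be estimated from the observed signal alone.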
(Assumption 2) The random variables $\{s_i(t), z(f,t)\}_{i,f,t}$ are assumed to be independent of each other. That is, it is assumed that the following equation is established.
$p(\{s_i(t), z(f,t)\}_{i,f,t}) = \prod_{i,t} p(s_i(t)) \cdot \prod_{f,t} p(z(f,t))$ [Math. 14]
(Assumption 3) The vector $s_i(t)$ is assumed to follow a circularly symmetric super-Gaussian distribution. That is, it is assumed that the following equation is established.
$-\log p(s_i(t)) = G(\|s_i(t)\|) + \mathrm{const}$ [Math. 15]
Here, G(r) is a differentiable function from the set $\mathbb{R}_{\ge 0}$ of real numbers equal to or greater than 0 to the set $\mathbb{R}$ of real numbers, and the function G′(r)/r (where G′ represents the derivative of G) is assumed to be non-increasing for r>0.
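As a worked illustration of the condition on G, consider the power family $G(r) = r^{\gamma}$ with a hypothetical exponent $1 \le \gamma \le 2$. This family is chosen here only as an example and is not necessarily the function G used in Algorithm 1.

```latex
G(r) = r^{\gamma}
\quad\Longrightarrow\quad
\frac{G'(r)}{r} = \gamma\, r^{\gamma-2},
\qquad
\frac{d}{dr}\!\left(\frac{G'(r)}{r}\right)
  = \gamma(\gamma-2)\, r^{\gamma-3} \le 0
\quad (r>0,\ 1\le\gamma\le 2).
```

The case γ=1 corresponds to a Laplace-type density $p(s) \propto \exp(-\|s\|)$, and γ=2 is the Gaussian boundary case in which G′(r)/r is constant.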
(Assumption 4) It is assumed that the STFT coefficient of the noise $z(f,t) \in \mathbb{C}^{M-K}$ follows a complex Gaussian distribution whose mean is the zero vector $0_{M-K}$ and whose covariance is the unit matrix $I_{M-K}$. That is,
$z(f,t) \sim \mathcal{CN}(0_{M-K}, I_{M-K})$
is assumed to be established.
Therefore, the independent vector extraction model handled in the present invention is a model that satisfies Assumptions 1 to 4, and both the BSE problem and the semi-BSE problem come down to a problem of obtaining the separation matrix W(f) (f=1, . . . , F).
An algorithm for obtaining the separation matrix W(f) that is used in each embodiment of the present invention will be described herein. The present algorithm is based on a majorization-minimization (MM) approach, and consists of Algorithms 1, 2, 3, and 4.
Algorithm 1 optimizes the separation matrix W(f) using auxiliary functions Vi(f) (i=1, . . . , K) and Vz(f). Algorithm 1 is roughly divided into initialization processing, optimization processing, and sound source extraction processing. Any one of Algorithm 2, Algorithm 3, and Algorithm 4 is used in the optimization processing.
Algorithm 2 is an algorithm for solving the BSE problem when K=1. Here, high-speed sound source extraction is realized by optimizing only the separation filter $w_1(f)$ corresponding to the first sound source, instead of optimizing the entire separation matrix W(f).
Algorithm 3 is an algorithm for solving the BSE problem when K>1. Here, only the separation filters $w_1(f), \ldots, w_K(f)$ corresponding to the K sound sources are optimized, instead of the entire separation matrix W(f), to achieve high-speed sound source extraction.
Algorithm 4 is an algorithm for solving the semi-BSE problem. An optimization algorithm of a linearly constrained minimum variance (LCMV) beamformer is used for the separation filters $w_1(f), \ldots, w_L(f)$ corresponding to the L sound sources whose acoustic transfer functions are known, whereas the remaining separation filters $w_{L+1}(f), \ldots, w_K(f)$ corresponding to the K−L sound sources are optimized in the same manner as in Algorithm 2 or Algorithm 3, thereby realizing high-speed sound source extraction.
First, Algorithm 1 is shown. In this algorithm, a function defined by the following equation is used as a function G in Assumption 3, and a parameter αi (i=1, . . . , K) is also an optimization target.
(where β is a predetermined constant.)
Here, it is assumed that $\bar{W}(f) = [\bar{w}_{L+1}(f), \ldots, \bar{w}_K(f), \bar{W}_z(f)]$ (here, $\bar{w}_i(f) \in \mathbb{C}^{M-L}$ (i=L+1, . . . , K) is a separation filter for the i-th sound source signal, and $\bar{W}_z(f) \in \mathbb{C}^{(M-L) \times (M-K)}$ is a separation filter for a noise signal).
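[Algorithm 1 listing not recoverable from the source. Surviving fragments: $\bar{V}_z(f) \leftarrow W_2'(f)^h V_z(f) W_2'(f)$ and $\bar{E}_z = [e_{K-L+1}^{(M-L)}, \ldots, e_{M-L}^{(M-L)}]$.]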
Next, Algorithm 2 is shown.
Next, Algorithm 3 is shown.
Finally, Algorithm 4 is shown.
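[Algorithm 4 listing not recoverable from the source. Surviving fragments: $\bar{V}_i(f) \leftarrow W_2'(f)^h V_i(f) W_2'(f)$ and $\bar{V}_z(f)\,\bar{u} = \lambda_{\max} \ldots$]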
In the present embodiment, a form for solving the BSE problem will be described.
The sound source signal generation device 100 generates the i-th sound source signal xi(f, t) (i=1, . . . , K, f=1, . . . , F, t=1, . . . , T) that is an estimation signal of the i-th sound source from the observed signal x(f, t) (f=1, . . . , F, t=1, . . . , T) of mixed sound from the K sound sources observed using the M microphones. Here, K and M are integers satisfying 1≤K<M. Further, W(f)=[w1(f), . . . , wK(f), WZ(f)] (where wi(f)∈CM (i=1, . . . , K) is a separation filter for an i-th sound source signal, and WZ(f)∈CM×(M−K) is a separation filter for a noise signal) is a separation matrix, Vi(f) (i=1, . . . , K) is an auxiliary function of the i-th sound source signal, and VZ(f) is an auxiliary function of the noise signal.
The sound source signal generation device 100 will be described below with reference to
The operation of the sound source signal generation device 100 will be described according to
In S110, the initialization unit 110 initializes and outputs the separation matrix W(f) and the auxiliary function VZ(f). The separation matrix W(f) and the auxiliary function VZ(f) may be initialized, for example, by processing from 1 to 5 in Algorithm 1 described in <Technical Background>.
In S120, the optimization unit 120 receives the observed signal x(f, t) as an input, optimizes the separation matrix W(f) using the observed signal x(f, t), and outputs a result thereof.
Hereinafter, the optimization unit 120 will be described with reference to
The operation of the optimization unit 120 will be described according to
In S121, the auxiliary function calculation unit 121 calculates an auxiliary function $V_i(f)$ (i=1, . . . , K) using the following equations.
$s_i(f,t) \leftarrow w_i(f)^h x(f,t)$ [Math. 48]
$r_i(t) \leftarrow \|s_i(t)\|$ [Math. 49]
(where $s_i(t) = [s_i(1,t), \ldots, s_i(F,t)]^T$)
(where β is a predetermined constant)
The auxiliary function calculation unit 121 may further perform processing for stabilizing numerical calculation, as in Algorithm 1.
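The first two steps of S121, [Math. 48] and [Math. 49], can be sketched as follows. This is a minimal illustration assuming NumPy; the array layout (frequency, channel, frame) and the function name `source_estimates_and_norms` are assumptions made here. The subsequent computation of $V_i(f)$ is not sketched because its equations are not available in this text.

```python
import numpy as np

def source_estimates_and_norms(W_s, X):
    """Compute s_i(f, t) and r_i(t) for all sources.

    W_s: complex array (F, M, K), column k of W_s[f] is the filter w_k(f).
    X:   complex array (F, M, T), X[f, :, t] is the observed signal x(f, t).
    Returns s with shape (F, K, T) and r with shape (K, T).
    """
    # s_i(f, t) = w_i(f)^h x(f, t)            [Math. 48]
    s = np.einsum("fmk,fmt->fkt", W_s.conj(), X)
    # r_i(t) = ||s_i(t)||, Euclidean norm over the F frequency bins   [Math. 49]
    r = np.sqrt(np.sum(np.abs(s) ** 2, axis=0))
    return s, r
```

Because $r_i(t)$ pools all frequency bins of the same source, it couples the otherwise independent per-frequency problems.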
In S122, the first separation filter calculation unit 122 calculates the separation filter $w_i(f)$ (i=1, . . . , K) using the auxiliary functions $V_i(f)$ (i=1, . . . , K) and $V_z(f)$. Specifically, where $I_{M-K}$ is the (M−K)-dimensional unit matrix, $e_j^{(M)}$ (j=1, . . . , M) is the M-dimensional unit vector in which the j-th element is 1 and the other elements are 0, $E_s = [e_1^{(M)}, \ldots, e_K^{(M)}]$, and $E_z = [e_{K+1}^{(M)}, \ldots, e_M^{(M)}]$, the first separation filter calculation unit 122
calculates the separation filter $w_1(f)$ using the following equation when K=1,
$w_1(f) \leftarrow u\,(u^h V_1(f)\,u)^{-1/2}$ [Math. 53]
(where the vector u is an eigenvector corresponding to the maximum eigenvalue $\lambda_{\max}$ that satisfies $V_z(f)\,u = \lambda_{\max} V_1(f)\,u$), and calculates the separation filters $w_i(f)$ (i=1, . . . , K) using the following equations when K>1.
$G_k(f) \leftarrow P_k(f)^h V_k(f) P_k(f)$ (k=i, z) [Math. 55]
$w_i(f) \leftarrow P_i(f)\,b\,(b^h G_i(f)\,b)^{-1/2}$ [Math. 56]
(where the vector b is an eigenvector corresponding to the maximum eigenvalue $\lambda_{\max}$ that satisfies $G_i(f)\,b = \lambda_{\max} G_z(f)\,b$)
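The K=1 update of [Math. 53] can be sketched as below for one frequency bin. This is a minimal illustration assuming NumPy and assuming that $V_1(f)$ and $V_z(f)$ are Hermitian positive definite; the function name `update_w1` and the reduction to a standard eigenvalue problem via a linear solve are choices made here, not taken from the specification.

```python
import numpy as np

def update_w1(V1, Vz):
    """Return w_1 = u (u^h V1 u)^(-1/2) for one frequency bin.

    u is an eigenvector for the largest eigenvalue of the generalized
    problem Vz u = lambda V1 u. V1 and Vz are (M, M) Hermitian positive
    definite matrices (an assumption of this sketch).
    """
    # Rewrite Vz u = lambda V1 u as the standard problem (V1^{-1} Vz) u = lambda u
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(V1, Vz))
    u = eigvecs[:, np.argmax(eigvals.real)]
    # Scale so that w_1^h V1 w_1 = 1
    scale = np.sqrt((u.conj() @ V1 @ u).real)
    return u / scale
```

A dedicated Hermitian generalized eigensolver would typically be preferable numerically; the plain solve-then-eig form is used here only to keep the sketch dependent on NumPy alone.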
In S123, the convergence condition determination unit 123 determines whether or not a predetermined convergence condition is satisfied. When the convergence condition is satisfied, the unit outputs the separation filters wi(f) (i=1, . . . , K) and the processing proceeds to S124; when the convergence condition is not satisfied, the processing returns to S121 and the processing of S121 to S123 is repeated. As the predetermined convergence condition, for example, a condition that a predetermined number of repetitions has been reached, or a condition that an update amount of each parameter (for example, the separation filters wi(f) (i=1, . . . , K)) is equal to or smaller than (or is smaller than) a predetermined threshold value can be used.
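The two example convergence conditions in S123 can be sketched as a single check. This is a minimal illustration assuming NumPy; the function name `has_converged`, the default values of `max_iter` and `tol`, and the use of a relative (rather than absolute) update amount are assumptions made here.

```python
import numpy as np

def has_converged(W_new, W_old, iteration, max_iter=50, tol=1e-6):
    """Return True when either example condition of S123 holds.

    W_new, W_old: arrays of separation filters after and before the update.
    iteration:    number of repetitions completed so far.
    """
    # Condition 1: a predetermined number of repetitions has been reached
    if iteration >= max_iter:
        return True
    # Condition 2: the (relative) update amount is at or below a threshold
    update = np.linalg.norm(W_new - W_old) / max(np.linalg.norm(W_old), 1e-12)
    return bool(update <= tol)
```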
In S124, the second separation filter calculation unit 124 calculates a separation filter WZ(f) using the following equation.
(where Ws(f)=[w1(f), . . . , wK(f)])
In S130, the sound source signal generation unit 130 receives the observed signal x(f, t) and the separation matrix W(f) output in S120 as inputs, generates the i-th sound source signal xi(f, t) from the observed signal x(f, t) using the separation matrix W(f), and outputs the i-th sound source signal xi(f, t). The i-th sound source signal xi(f, t) may be calculated, for example, using the following equation.
$x_i(f,t) \leftarrow (W(f)^{-h} e_i^{(M)})\, w_i(f)^h x(f,t)$ [Math. 58]
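The calculation of [Math. 58] can be sketched as follows for one frequency bin. This is a minimal illustration assuming NumPy; the function name `project_back` and the zero-based source index are assumptions made here.

```python
import numpy as np

def project_back(W, x, i):
    """x_i(f, t) = (W^{-h} e_i) w_i^h x(f, t) for one frequency bin.

    W: (M, M) separation matrix W(f).
    x: (M, T) observed signal x(f, t) for all frames.
    i: zero-based index of the sound source.
    Returns an (M, T) array, the i-th source signal at the M microphones.
    """
    A_hat = np.linalg.inv(W.conj().T)   # W^{-h}; its i-th column is W^{-h} e_i
    s_i = W[:, i].conj() @ x            # w_i^h x(f, t), shape (T,)
    return np.outer(A_hat[:, i], s_i)   # rank-one product, shape (M, T)
```

When $W(f)^h A(f) = I_M$ holds exactly, $W(f)^{-h} e_i^{(M)}$ equals $a_i(f)$, so the result coincides with $a_i(f)\,s_i(f,t)$ of [Math. 6].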
According to the embodiment of the present invention, it is possible to execute sound source extraction processing at high speed.
In the present embodiment, a form for solving the semi-BSE problem will be described.
The sound source signal generation device 200 generates the i-th sound source signal xi(f, t) (i=1, . . . , K, f=1, . . . , F, t=1, . . . , T) that is an estimation signal of the i-th sound source from the observed signal x(f, t) (f=1, . . . , F, t=1, . . . , T) of mixed sound from the K sound sources observed using the M microphones. Here, K and M are integers satisfying 1≤K<M. Further, W(f)=[w1(f), . . . , wK(f), WZ(f)] (where wi(f)∈CM (i=1, . . . , K) is a separation filter for an i-th sound source signal, and WZ(f)∈CM×(M−K) is a separation filter for a noise signal) is a separation matrix, Vi(f) (i=1, . . . , K) is an auxiliary function of the i-th sound source signal, and VZ(f) is an auxiliary function of the noise signal. L is an integer satisfying 1≤L≤K, ai(f)∈CM (i=1, . . . , L) is an acoustic transfer function from the i-th sound source to the M microphones, and A1(f)=[a1(f), . . . , aL(f)].
Hereinafter, the sound source signal generation device 200 will be described with reference to
The operation of the sound source signal generation device 200 will be described according to
In S210, the initialization unit 210 initializes and outputs the separation matrix W(f) and the auxiliary function VZ(f). The separation matrix W(f) and the auxiliary function VZ(f) may be initialized, for example, by processing 1 to 2 and 6 to 10 of Algorithm 1 described in <Technical Background>.
In S220, the optimization unit 220 receives the observed signal x(f, t), optimizes the separation matrix W(f) using the observed signal x(f, t), and outputs a result thereof.
Hereinafter, the optimization unit 220 will be described with reference to
The operation of the optimization unit 220 will be described according to
In S121, the auxiliary function calculation unit 121 calculates the auxiliary function Vi(f) (i=1, . . . , K) according to a predetermined equation. The auxiliary function calculation unit 121 may perform calculation using the equation used by the auxiliary function calculation unit 121 of the first embodiment.
In S222, the first separation filter calculation unit 122 calculates the separation filter $w_i(f)$ (i=1, . . . , K) using the auxiliary functions $V_i(f)$ (i=1, . . . , K) and $V_z(f)$. Specifically, the following notation is used: $I_{M-K}$ is the (M−K)-dimensional unit matrix; $e_j^{(d)}$ (j=1, . . . , d) is the d-dimensional unit vector in which the j-th element is 1 and the other elements are 0; $E_2 = [e_{L+1}^{(M)}, \ldots, e_M^{(M)}]$; $W_2'(f) = [A_1(f), E_2]^{-h} E_2$; $\bar{V}_z(f) = W_2'(f)^h V_z(f) W_2'(f)$; $\bar{W}(f) = [\bar{w}_{L+1}(f), \ldots, \bar{w}_K(f), \bar{W}_z(f)]$ (where $\bar{w}_i(f) \in \mathbb{C}^{M-L}$ (i=L+1, . . . , K) is a separation filter for the i-th sound source signal, and $\bar{W}_z(f) \in \mathbb{C}^{(M-L) \times (M-K)}$ is a separation filter for a noise signal); $\bar{E}_s = [e_1^{(M-L)}, \ldots, e_{K-L}^{(M-L)}]$; and $\bar{E}_z = [e_{K-L+1}^{(M-L)}, \ldots, e_{M-L}^{(M-L)}]$. With this notation, the first separation filter calculation unit 122 calculates the separation filter $w_i(f)$ (i=1, . . . , K) using the following equation when L=K, and
$w_i(f) \leftarrow V_i(f)^{-1} A_1(f)\,\bigl(A_1(f)^h V_i(f)^{-1} A_1(f)\bigr)^{-1} e_i^{(K)}$ [Math. 59]
calculates the separation filter wi(f) (i=1, . . . , K−1) using the following equation
and calculates the separation filter wK(f) using the following equation when L=K−1,
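$\bar{V}_K(f) \leftarrow W_2'(f)^h V_K(f) W_2'(f)$ [Math. 61]
$w_K(f) \leftarrow W_2'(f)\,\bar{u}\,(\bar{u}^h \bar{V}_K(f)\,\bar{u})^{-1/2}$ [Math. 62]
(where the vector $\bar{u}$ is an eigenvector corresponding to the maximum eigenvalue $\lambda_{\max}$ that satisfies $\bar{V}_z(f)\,\bar{u} = \lambda_{\max} \bar{V}_K(f)\,\bar{u}$), and calculates the separation filter $w_i(f)$ (i=1, . . . , L) using the following equation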
and calculates the separation filter wi(f) (i=L+1, . . . , K) using the following equation when L<K−1.
(where the vector $\bar{b}$ is an eigenvector corresponding to the maximum eigenvalue $\lambda_{\max}$ that satisfies $\bar{G}_i(f)\,\bar{b} = \lambda_{\max} \bar{G}_z(f)\,\bar{b}$)
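Two building blocks of S222 can be sketched for one frequency bin: the filter for the case L=K in the standard LCMV form, and the matrix $W_2'(f) = [A_1(f), E_2]^{-h} E_2$. This is a minimal illustration assuming NumPy, assuming that V is Hermitian positive definite and that $[A_1(f), E_2]$ is invertible; the function names `lcmv_filter` and `blocking_basis` and the zero-based index are assumptions made here.

```python
import numpy as np

def lcmv_filter(V, A1, i):
    """w = V^{-1} A1 (A1^h V^{-1} A1)^{-1} e_i, which satisfies A1^h w = e_i.

    V:  (M, M) Hermitian positive definite matrix (stands in for V_i(f)).
    A1: (M, L) known acoustic transfer functions [a_1(f), ..., a_L(f)].
    i:  zero-based index of the target source.
    """
    L = A1.shape[1]
    VinvA = np.linalg.solve(V, A1)          # V^{-1} A1
    e_i = np.eye(L)[:, i]
    return VinvA @ np.linalg.solve(A1.conj().T @ VinvA, e_i)

def blocking_basis(A1):
    """W2' = [A1, E2]^{-h} E2, whose columns satisfy W2'^h A1 = 0."""
    M, L = A1.shape
    E2 = np.eye(M)[:, L:]                   # E2 = [e_{L+1}, ..., e_M]
    B = np.hstack([A1, E2])                 # [A1, E2], assumed invertible
    return np.linalg.inv(B.conj().T) @ E2   # shape (M, M - L)
```

Because $W_2'(f)^h A_1(f) = O$, any filter of the form $W_2'(f)\,\bar{u}$ has zero response to the L sound sources whose acoustic transfer functions are known, which is what allows the remaining filters to be optimized in the reduced (M−L)-dimensional space.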
In S123, the convergence condition determination unit 123 determines whether or not a predetermined convergence condition is satisfied. When the convergence condition is satisfied, the unit outputs the separation filters wi(f) (i=1, . . . , K) and the processing proceeds to S224; when the convergence condition is not satisfied, the processing returns to S121 and the processing of S121 to S123 is repeated.
In S224, the second separation filter calculation unit 124 calculates the separation filter WZ(f) using the following equation.
In S130, the sound source signal generation unit 130 receives the observed signal x(f, t) and the separation matrix W(f) output in S220 as inputs, generates the i-th sound source signal xi(f, t) from the observed signal x(f, t) using the separation matrix W(f), and outputs the i-th sound source signal xi(f, t).
According to the embodiment of the present invention, it is possible to execute sound source extraction processing at high speed.
The device of the present invention includes, for example, as single hardware entities, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communication with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, a register, and the like), a RAM or a ROM that is a memory, an external storage device that is a hard disk, and a bus connected for data exchange with the input unit, the output unit, the communication unit, the CPU, the RAM, the ROM, and the external storage devices. Further, a device (drive) capable of reading and writing from and to a recording medium such as a CD-ROM may be provided in the hardware entity as necessary. An example of a physical entity including such hardware resources is a general-purpose computer.
A program necessary to realize the above-described functions, data necessary for processing of this program, and the like are stored in the external storage device of the hardware entity (the present invention is not limited to the external storage device, for example, the program may be stored in a ROM that is a read only storage device). Further, for example, data obtained by the processing of the program is appropriately stored in a RAM, the external storage device, or the like.
In the hardware entity, each program and data necessary for the processing of each program stored in the external storage device (or a ROM, for example) are read into a memory as necessary and appropriately interpreted, executed, or processed by a CPU. As a result, the CPU realizes a predetermined function (each of components represented by the unit, means, or the like).
The present invention is not limited to the above-described embodiment, and appropriate changes can be made without departing from the spirit of the present invention. Further, the processes described in the embodiments are not only executed in time series in the described order, but also may be executed in parallel or individually according to a processing capability of a device that executes the processes or as necessary.
As described above, when a processing function in the hardware entity (the device of the present invention) described in the embodiment is realized by a computer, processing content of a function that the hardware entity should have is described by a program. By executing this program using the computer, the processing function in the hardware entity is realized on the computer.
A program describing this processing content can be recorded on a computer-readable recording medium. An example of the computer-readable recording medium may include any recording medium such as a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as a magnetic recording device, a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like can be used as an optical disc, an MO (Magneto-Optical disc) or the like can be used as a magneto-optical recording medium, and an EEP-ROM (Electrically Erasable and Programmable-Read Only Memory) or the like can be used as a semiconductor memory.
Further, this program is distributed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program has been recorded. Further, the program may be stored in a storage device of a server computer and distributed by being transferred from the server computer to another computer via a network.
The computer that executes such a program first temporarily stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in a storage device of the computer. When the computer executes the processing, the computer reads the program stored in the storage device of the computer and executes processing according to the read program. Further, as another embodiment of the program, the computer may directly read the program from the portable recording medium and execute a process according to the program, and further, a process according to a received program may be sequentially executed each time the program is transferred from the server computer to the computer. Further, a configuration in which the above-described process is executed by a so-called ASP (Application Service Provider) type service for realizing a processing function according to only an execution instruction and result acquisition without transferring the program from the server computer to the computer may be adopted. It is assumed that the program in the present embodiment includes information provided for a process of an electronic computer and being pursuant to the program (such as data that is not a direct command to the computer, but has properties defining a process of the computer).
Further, although the hardware entity is configured by a predetermined program being executed on the computer in the present embodiment, at least a part of the processing content of the hardware entity may be realized in hardware.
The above description of the embodiments of the present invention is presented for the purpose of illustration and description. It is not intended to be exhaustive or to limit the invention to the exact form disclosed. Modifications or variations are possible in light of the above-described teachings. The embodiments were chosen and described in order to provide the best illustration of the principle of the present invention and to enable those skilled in the art to use the present invention in various embodiments and with various modifications suited to the practical use contemplated. All such modifications and variations are within the scope of the present invention as defined by the appended claims when interpreted in accordance with the breadth to which they are fairly, legally, and equitably entitled.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/046508 | 12/14/2020 | WO | |