The present invention relates to the field of speaker diarization, and more particularly relates to a method of using labeled training data to train a speaker diarization system.
There is thus provided in accordance with the invention, a method of segmenting an audio stream into speaker homogeneous segments, the method comprising the steps of creating a plurality of intra-speaker variability profiles from training data and analyzing said audio stream using said intra-speaker variability profiles, thereby marking speaker homogeneous segments within said audio stream.
There is also provided in accordance with the invention, a method of modeling intra-speaker variability in an audio stream, the method comprising the steps of segmenting said audio stream into a plurality of evenly spaced segments; associating each said evenly spaced segment with a particular speaker identity; calculating a score representing the similarity between adjacent evenly spaced segments associated with the same speaker identity; and clustering said scores, thereby creating an intra-speaker variability profile for each said speaker identity.
There is further provided a computer program product for segmenting an audio stream into speaker homogeneous segments, the computer program product comprising a computer usable medium having computer usable code embodied therewith, the computer program product comprising computer usable code configured for creating a plurality of intra-speaker variability profiles from training data and computer usable code configured for analyzing said audio stream using said intra-speaker variability profiles, thereby marking speaker homogeneous segments within said audio stream.
The invention is herein described, by way of example only, with reference to the accompanying drawings, wherein:
The following notation is used throughout this document:
The present invention is a method of using labeled training data and machine learning tools to train a speaker diarization system. Intra-speaker variability profiles are created from training data consisting of an audio stream labeled where speaker changes occur (i.e. which participant is speaking at any given time). These intra-speaker variability profiles are then applied to an (unlabeled) audio stream to cluster the audio stream into speaker homogeneous segments and to combine adjacent segments according to speaker identity.
One example application of the invention is to facilitate the development of tools to segment unlabeled audio streams into speaker homogeneous segments. Automated segmentation of audio streams helps optimize the performance and accuracy of speech and speaker recognition systems.
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method, computer program product or any combination thereof. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Any combination of one or more computer usable or computer readable medium(s) may be utilized. The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, a transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
A block diagram illustrating an example computer processing system adapted to implement the trainable speaker diarization method of the present invention is shown in
The computer system is connected to one or more external networks such as a LAN or WAN 23 via communication lines connected to the system via data I/O communications interface 22 (e.g., network interface card or NIC). The network adapters 22 coupled to the system enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters. The system also comprises magnetic or semiconductor based storage device 52 for storing application programs and data. The system comprises computer readable storage medium that may include any suitable memory means, including but not limited to, magnetic storage, optical storage, semiconductor volatile or non-volatile memory, biological memory devices, or any other memory storage device.
Software adapted to implement the trainable speaker diarization method of the present invention is adapted to reside on a computer readable medium, such as a magnetic disk within a disk drive unit. Alternatively, the computer readable medium may comprise a floppy disk, removable hard disk, Flash memory 16, EEROM based memory, bubble memory storage, ROM storage, distribution media, intermediate storage media, execution memory of a computer, and any other medium or device capable of storing for later reading by a computer a computer program implementing the method of this invention. The software adapted to implement the trainable speaker diarization method of the present invention may also reside, in whole or in part, in the static or dynamic main memories or in firmware within the processor of the computer system (i.e. within microcontroller, microprocessor or microcomputer internal memory).
Other digital computer system configurations can also be employed to implement the trainable speaker diarization method of the present invention, and to the extent that a particular system configuration is capable of implementing the system and methods of this invention, it is equivalent to the representative digital computer system of
Once they are programmed to perform particular functions pursuant to instructions from program software that implements the system and methods of this invention, such digital computer systems in effect become special purpose computers particular to the method of this invention. The techniques necessary for this are well-known to those skilled in the art of computer systems.
It is noted that computer programs implementing the system and methods of this invention will commonly be distributed to users on a distribution medium such as floppy disk or CD-ROM or may be downloaded over a network such as the Internet using FTP, HTTP, or other suitable protocols. From there, they will often be copied to a hard disk or a similar intermediate storage medium. When the programs are to be run, they will be loaded either from their distribution medium or their intermediate storage medium into the execution memory of the computer, configuring the computer to act in accordance with the method of this invention. All these operations are well-known to those skilled in the art of computer systems.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In accordance with the invention, intra-speaker variability profiles are first created from training data comprising an audio stream labeled where each participant is speaking. The intra-speaker variability profiles are then applied to an unlabeled audio stream. Analysis of the unlabeled audio stream (using the intra-speaker variability profiles) segments the audio stream into speaker homogeneous segments.
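As a rough illustration of how labeled training data might be prepared, the sketch below cuts labeled speaker turns into evenly spaced, non-overlapping segments that each contain a single speaker. The `(start, end, speaker)` tuple layout, the function name `fixed_length_segments`, and the choice to drop leftover audio shorter than one segment are assumptions made for this example only.

```python
from typing import List, Tuple

# Hypothetical label format: (start_sec, end_sec, speaker_label)
Turn = Tuple[float, float, str]

def fixed_length_segments(turns: List[Turn], seg_len: float) -> List[Turn]:
    """Cut each labeled speaker turn into non-overlapping pieces of seg_len seconds.

    Audio left over at the end of a turn (shorter than seg_len) is dropped,
    so every returned segment contains exactly one speaker.
    """
    segments = []
    for start, end, speaker in turns:
        n_full = int((end - start) // seg_len)  # number of whole segments in this turn
        for k in range(n_full):
            seg_start = start + k * seg_len
            segments.append((seg_start, seg_start + seg_len, speaker))
    return segments

# Toy labeled stream: speaker A, then B, then A again.
labeled_turns = [(0.0, 5.0, "A"), (5.0, 8.0, "B"), (8.0, 12.5, "A")]
segments = fixed_length_segments(labeled_turns, seg_len=2.0)
```

Dropping the short remainder keeps every segment the same length, at the cost of discarding a small amount of labeled audio at each speaker change.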
A block diagram illustrating an example implementation of the intra-speaker variability profile creation method of the present invention is shown in
A block diagram illustrating an example implementation of the speaker diarization method of the present invention is shown in
A flow diagram illustrating the intra-speaker variability profile creation method of the present invention is shown in
A flow diagram illustrating the speaker diarization of the present invention is shown in
In one embodiment of the present invention, kernel principal component analysis (kernel-PCA) is used to create the intra-speaker variability profiles from the training data (i.e., the labeled audio stream) and to define the speaker homogeneous segments in the test data (i.e., the unlabeled audio stream). Kernel-PCA is a kernelized version of the PCA algorithm. A function K(x,y) is a kernel if there exists a dot product space F (named “feature space”) and a mapping f:V→F from observation space V (named “input space”) for which:
$$\forall x, y \in V:\quad K(x,y) = \langle f(x), f(y) \rangle \tag{1}$$
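To make the kernel definition concrete, the following sketch checks the identity K(x,y) = ⟨f(x), f(y)⟩ for a kernel whose feature map can be written out explicitly. The degree-2 polynomial kernel and the names `poly2_kernel` and `poly2_feature_map` are illustrative choices for this example and are not the kernel used for diarization.

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel K(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def poly2_feature_map(x):
    """Explicit map f: R^2 -> R^3 whose dot product reproduces poly2_kernel."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

k_direct = poly2_kernel(x, y)                                          # evaluated in input space
k_via_map = float(np.dot(poly2_feature_map(x), poly2_feature_map(y)))  # evaluated in feature space
```

Both routes give the same value, which is why a kernel lets distances and dot products in feature space be computed without ever forming f(x).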
Given a set of reference vectors A1, . . . , An in V, the kernel-matrix K is defined as Ki,j=K(Ai, Aj). The goal of kernel-PCA is to find an orthonormal basis for the subspace spanned by the set of mapped reference vectors f(A1), . . . , f(An). The outline of the kernel-PCA algorithm is as follows:
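First, the centered kernel matrix is computed:
$$\tilde{K} = K - 1_n K - K 1_n + 1_n K 1_n \tag{2}$$
where $1_n$ is the n×n matrix whose entries all equal 1/n. The eigenvectors $v_i$ of $\tilde{K}$ with non-zero eigenvalues $\lambda_i$ are then normalized:
$$\tilde{v}_i = v_i / \sqrt{\lambda_i}, \quad i \in \{1, \dots, m\} \tag{3}$$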
The ith eigenvector in feature space denoted by fi is:
$$f_i = \big(f(A_1), \dots, f(A_n)\big)\,\tilde{v}_i \tag{4}$$
The set of eigenvectors {f1, . . . , fm} is an orthonormal basis for the subspace spanned by {f(A1), . . . , f(An)}.
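The kernel-PCA outline above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the Gaussian (RBF) kernel, the value of `gamma`, the relative threshold used to discard near-zero eigenvalues, and the function names are all choices made for the example, not part of the described method.

```python
import numpy as np

def rbf_kernel_matrix(A, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A_i - A_j||^2) for reference vectors stored as rows of A."""
    sq = np.sum(A * A, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_pca_basis(K, rel_tol=1e-8):
    """Return normalized eigenvectors and eigenvalues of the centered kernel matrix."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # centering, equation (2)
    lambdas, V = np.linalg.eigh(K_c)                       # eigenvalues in ascending order
    keep = lambdas > rel_tol * lambdas.max()               # discard (near-)zero eigenvalues
    lambdas, V = lambdas[keep][::-1], V[:, keep][:, ::-1]  # reorder to descending
    V_tilde = V / np.sqrt(lambdas)                         # normalization, equation (3)
    return V_tilde, lambdas

rng = np.random.default_rng(0)
A = rng.normal(size=(12, 3))            # 12 toy reference vectors in R^3
K = rbf_kernel_matrix(A)
V_tilde, lambdas = kernel_pca_basis(K)

# Gram matrix of the feature-space eigenvectors f_i from equation (4):
# <f_i, f_j> = v~_i^T K v~_j, which should be the identity for an orthonormal basis.
gram = V_tilde.T @ K @ V_tilde
```

Centering removes one degree of freedom, so with 12 reference vectors the basis has at most 11 elements.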
Let x be a vector in input space V whose mapping into feature space is denoted by f(x). Then f(x) can be uniquely expressed as a linear combination of the basis vectors $\{f_i\}$ with coefficients $\{\alpha_i^x\}$, plus a vector $u_x$ lying in the subspace of F that is complementary to span $\{f_1, \dots, f_m\}$:
$$f(x) = \sum_{i=1}^{m} \alpha_i^x f_i + u_x \tag{5}$$
Note that $\alpha_i^x = \langle f(x), f_i \rangle$. Using equations (1) and (4), $\alpha_i^x$ can be expressed as:
$$\alpha_i^x = \big(K(x,A_1), \dots, K(x,A_n)\big)\,\tilde{v}_i \tag{6}$$
We define a projection $T: V \to \mathbb{R}^m$ as:
$$T(x) = (\tilde{v}_1, \dots, \tilde{v}_m)^{\mathsf{T}} \big(K(x,A_1), \dots, K(x,A_n)\big)^{\mathsf{T}} \tag{7}$$
The following property holds for projection T:
$$\|f(x) - f(y)\|^2 = \|T(x) - T(y)\|^2 + \|u_x - u_y\|^2 \tag{8}$$
Equation (8) implies that projection T preserves distances in the feature subspace spanned by {f(A1), . . . , f(An)}.
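The distance-preservation property can be checked numerically. The sketch below projects two of the reference vectors with T and compares the squared distance in R^m with the squared distance in feature space computed directly from the kernel. The RBF kernel, the toy data, and the helper names `fit_basis` and `project_T` are assumptions made for the illustration.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2); an illustrative choice."""
    d = np.asarray(x) - np.asarray(y)
    return float(np.exp(-gamma * np.dot(d, d)))

def fit_basis(A, kernel, rel_tol=1e-8):
    """Normalized eigenvectors of the centered kernel matrix of the reference vectors."""
    n = len(A)
    K = np.array([[kernel(a, b) for b in A] for a in A])
    one_n = np.full((n, n), 1.0 / n)
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    lam, V = np.linalg.eigh(K_c)
    keep = lam > rel_tol * lam.max()          # keep only non-zero eigenvalues
    return V[:, keep] / np.sqrt(lam[keep])

def project_T(x, A, V_tilde, kernel):
    """T(x) = V_tilde^T (K(x, A_1), ..., K(x, A_n))^T, as in equation (7)."""
    k_x = np.array([kernel(x, a) for a in A])
    return V_tilde.T @ k_x

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 3))                  # toy reference vectors
V_tilde = fit_basis(A, rbf)

a, b = A[0], A[1]
diff = project_T(a, A, V_tilde, rbf) - project_T(b, A, V_tilde, rbf)
dist_T = float(np.sum(diff ** 2))                    # squared distance after projection
dist_F = rbf(a, a) + rbf(b, b) - 2.0 * rbf(a, b)     # squared distance in feature space
```

For vectors outside the reference set the two distances generally differ, and the gap is exactly the part of the distance that lies outside the spanned subspace.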
Given a set of sequences of frames corresponding to speaker homogeneous segments, it is desirable to project them into a space where speaker variation can naturally be modeled, while still preserving relevant information. Relevant information is defined herein as distances in feature space F defined by a kernel function. Equation (7) suggests such a projection. Using projection T as the chosen projection has the advantage of having $\mathbb{R}^m$ as a natural target space for modeling. Equation (8) quantifies the amount by which distances are distorted by projection T. In order to capture some of the information lost by projection T we define a second projection:
$$U(x) = u_x \tag{9}$$
Although we cannot explicitly apply projection U, we can easily calculate the distance between two vectors $u_x$ and $u_y$ using the distance between x and y in feature space F and their distance after projection with T:
$$\|U(x) - U(y)\|^2 = \|f(x) - f(y)\|^2 - \|T(x) - T(y)\|^2 \tag{10}$$
Using both projections T and U enables capturing the relevant information. The subspace spanned by {f(A1), . . . , f(An)} is named the common-speaker subspace, as attributes that are common to several speakers will typically be projected into it. The complementary space is named the speaker-unique space, as attributes that are unique to a speaker will typically be projected to that subspace.
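The distance in the speaker-unique subspace never requires u_x or u_y explicitly. A minimal sketch of that computation follows; the function name, the argument layout, and the clipping of small negative values caused by floating-point round-off are assumptions of this example.

```python
import numpy as np

def unique_subspace_sq_distance(k_xx, k_yy, k_xy, t_x, t_y):
    """Squared distance between U(x) and U(y), following equation (10).

    k_xx, k_yy, k_xy are the kernel values K(x,x), K(y,y), K(x,y);
    t_x and t_y are the projections T(x) and T(y).
    """
    feature_sq = k_xx + k_yy - 2.0 * k_xy                       # ||f(x) - f(y)||^2
    common_sq = float(np.sum((np.asarray(t_x) - np.asarray(t_y)) ** 2))
    return max(feature_sq - common_sq, 0.0)                     # clip round-off below zero

# Hypothetical numbers for two segments with 2-dimensional projections.
d2 = unique_subspace_sq_distance(k_xx=1.0, k_yy=1.0, k_xy=0.5,
                                 t_x=[0.6, 0.2], t_y=[0.1, 0.4])
```

Here the feature-space squared distance is 1.0 and the common-speaker part is 0.29, leaving 0.71 for the speaker-unique subspace.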
The next step is modeling in the common-speaker subspace. The purpose of projecting the common-speaker subspace into $\mathbb{R}^m$ using projection T is to enable modeling of inter-segment speaker variability. Inter-segment speaker variability is closely related to intersession variability modeling, which has proven to be extremely successful for speaker recognition. We model speakers' distributions in the common-speaker subspace as multivariate normal distributions with a shared full covariance matrix Σ of dimension m×m (m is the dimension of the common-speaker subspace).
Given an annotated training dataset, we extract non-overlapping speaker homogeneous segments (of fixed length). Given speakers s1, . . . , sk with n(si) segments for speaker si, T(xs
where μs
We regularize Σ by adding a positive noise component η to the elements of its diagonal:
$$\tilde{\Sigma} = \Sigma + \eta I \tag{13}$$
Because Σ is positive semi-definite, the resulting covariance matrix is guaranteed to have eigenvalues no smaller than η and is therefore invertible.
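The covariance estimation and regularization steps might be sketched as follows. The pooled within-speaker estimator used here (subtract each speaker's mean, then average the outer products over all segments) is an assumption: it is one standard way to estimate a covariance matrix shared across speakers and may differ in normalization from the estimator of the described method. The function names and toy data are likewise hypothetical.

```python
import numpy as np

def shared_covariance(projections, speakers):
    """Pooled within-speaker covariance of projected segments (assumed estimator)."""
    projections = np.asarray(projections, dtype=float)
    speakers = np.asarray(speakers)
    centered = np.empty_like(projections)
    for s in np.unique(speakers):
        mask = speakers == s
        # Remove the per-speaker mean so only within-speaker variation remains.
        centered[mask] = projections[mask] - projections[mask].mean(axis=0)
    return centered.T @ centered / len(projections)

def regularize(sigma, eta):
    """Add eta to every diagonal element, as in equation (13)."""
    return sigma + eta * np.eye(sigma.shape[0])

rng = np.random.default_rng(2)
T_proj = rng.normal(size=(40, 5))                  # 40 toy segments projected into R^5
labels = np.repeat(["s1", "s2", "s3", "s4"], 10)   # 10 segments per speaker
eta = 0.1

sigma_reg = regularize(shared_covariance(T_proj, labels), eta)
min_eig = float(np.linalg.eigvalsh(sigma_reg).min())
sigma_inv = np.linalg.inv(sigma_reg)               # succeeds because min_eig >= eta
```

The regularization matters most when the number of training segments is small relative to m, in which case the unregularized estimate can be singular.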
Given a pair of segments x and y projected into common-speaker subspace (T(x) and T(y) respectively), the likelihood of T(y) conditioned on T(x) and assuming x and y share the same speaker identity is
where $2\tilde{\Sigma}$ is the covariance matrix of the random variable T(y)−T(x).
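Under the stated assumptions (normally distributed projections with shared covariance, so that the difference T(y)−T(x) has zero mean and covariance twice the regularized matrix), one form this likelihood can take is the standard multivariate normal density of the difference. The expression below is a sketch consistent with the surrounding definitions and is not necessarily the exact form intended:

$$\Pr\big(T(y) \mid T(x), x \sim y\big) = \frac{1}{\sqrt{(2\pi)^m \,\lvert 2\tilde{\Sigma} \rvert}} \exp\!\Big(-\tfrac{1}{2}\,\big(T(y) - T(x)\big)^{\mathsf{T}} \big(2\tilde{\Sigma}\big)^{-1} \big(T(y) - T(x)\big)\Big)$$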
For the sake of efficiency, the covariance matrix $2\tilde{\Sigma}$ is diagonalized by computing its eigenvectors $\{e_i\}$ and eigenvalues $\{\nu_i\}$. Defining E as $(e_1^{\mathsf{T}}, \dots, e_m^{\mathsf{T}})$, equation (14) reduces to:
where $\tilde{T}(x) = E \cdot T(x)$, $\tilde{T}(y) = E \cdot T(y)$ and $[x]_i$ is the ith coefficient of x.
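The efficiency gain from diagonalization can be illustrated numerically. The sketch below assumes a zero-mean Gaussian model for the difference of two projections, with covariance equal to twice the regularized matrix, and checks that the coordinate-wise form after rotating by E gives the same log-likelihood as the full-matrix form. The function names, the variable `nu` for the eigenvalues, and the toy matrix are assumptions of the example.

```python
import numpy as np

def loglik_full(t_x, t_y, sigma_reg):
    """log N(t_y - t_x; 0, 2*sigma_reg), evaluated with the full covariance matrix."""
    d = np.asarray(t_y) - np.asarray(t_x)
    cov = 2.0 * sigma_reg
    _, logdet = np.linalg.slogdet(cov)
    quad = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(d) * np.log(2.0 * np.pi) + logdet + quad)

def diagonalize(sigma_reg):
    """Eigenvalues nu and matrix E (rows are eigenvectors) of 2*sigma_reg."""
    nu, vecs = np.linalg.eigh(2.0 * sigma_reg)
    return nu, vecs.T

def loglik_diag(t_x, t_y, E, nu):
    """Same log-likelihood, computed coordinate by coordinate after rotating by E."""
    d = E @ np.asarray(t_y) - E @ np.asarray(t_x)   # differences of rotated coefficients
    return -0.5 * float(np.sum(np.log(2.0 * np.pi * nu) + d * d / nu))

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
sigma_reg = B @ B.T + 0.1 * np.eye(4)   # toy symmetric positive-definite matrix
t_x, t_y = rng.normal(size=4), rng.normal(size=4)

nu, E = diagonalize(sigma_reg)          # done once, offline
full = loglik_full(t_x, t_y, sigma_reg)
diag = loglik_diag(t_x, t_y, E, nu)
```

Since E and the eigenvalues are computed once, every segment can be rotated a single time, after which each pairwise score costs O(m) instead of O(m²).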
There is also modeling in the speaker-unique subspace. $\Delta_u(x,y)^2$ denotes the squared distance between segments x and y projected into the speaker-unique subspace. We assume
and estimate $\sigma_u$ from the development data.
When modeling in segment space, the likelihood of segment y given segment x and given the assumption that both segments share the same speaker identity is
$$\Pr(y \mid x, x \sim y) = \Pr\big(T(y) \mid T(x), x \sim y\big)\,\Pr\big(\Delta_u(x,y)^2 \mid x \sim y\big) \tag{17}$$
The expression in equation (17) can be calculated using equations (15) and (16).
To normalize scores, the speaker similarity score between segments x and y is defined as $\log \Pr(y \mid x, x \sim y)$. Score normalization is a standard and extremely effective method in speaker recognition. We use T-norm (4) and TZ-norm (2) for score normalization in the context of speaker diarization. Given held out segments $t_1, \dots, t_T$ from a development set, the T-normalized score (S(x,y)) of segment y given segment x is:
The TZ-normalized score of segment y given segment x is calculated similarly according to equation (10).
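For illustration, T-norm as commonly used in speaker recognition standardizes a raw score by the mean and standard deviation of scores obtained against a cohort. The sketch below assumes that convention, with the cohort scores taken to be the scores of segment y against each held out segment. Both the pairing of arguments and the use of the population standard deviation are assumptions, and the function name `t_norm` is hypothetical.

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    """Standardize raw_score by the mean and standard deviation of the cohort scores."""
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()

# raw_score stands for the similarity score of y given x;
# cohort_scores stand for the scores of y given each held out segment t_i.
normalized = float(t_norm(raw_score=3.0, cohort_scores=[1.0, 2.0, 3.0]))
```

With a cohort mean of 2 and a population standard deviation of about 0.816, the raw score of 3 maps to roughly 1.22 standard deviations above the cohort.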
Finally, kernels for speaker diarization are defined. In equation (5) it was shown that under reasonable assumptions a GMM trained on a test utterance is as appropriate for representing the utterance as the actual test frames (the GMM is approximately a sufficient statistic for the test utterance w.r.t. GMM scoring). Therefore the kernels used are based on GMM parameters trained for the scored segments. GMMs are maximum a posteriori (MAP) adapted from a universal background model (UBM) of order 1024 with diagonal covariance matrices.
The kernel described supra was inspired by equation (14). The kernel is based on the weighted-normalized GMM means:
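One common way to build a kernel from weighted-normalized GMM means is to scale each adapted mean by the square root of its mixture weight and by the inverse standard deviation of the corresponding UBM component, and then take an ordinary dot product of the stacked vectors. The sketch below follows that reading. It is an assumption about what "weighted-normalized" means here, and the function name, argument names, and toy values are hypothetical.

```python
import numpy as np

def gmm_mean_kernel(means_x, means_y, ubm_weights, ubm_vars):
    """Dot product of weight- and variance-normalized GMM mean supervectors.

    means_x, means_y : (G, D) MAP-adapted means for segments x and y
    ubm_weights      : (G,)   mixture weights of the UBM
    ubm_vars         : (G, D) diagonal covariances of the UBM
    """
    means_x = np.asarray(means_x, dtype=float)
    means_y = np.asarray(means_y, dtype=float)
    scale = np.sqrt(np.asarray(ubm_weights, dtype=float))[:, None] / np.sqrt(ubm_vars)
    sv_x = (scale * means_x).ravel()   # normalized supervector of segment x
    sv_y = (scale * means_y).ravel()   # normalized supervector of segment y
    return float(sv_x @ sv_y)

# Toy model with 2 Gaussians in 2 dimensions (a real UBM would have order 1024).
weights = np.array([0.5, 0.5])
variances = np.array([[1.0, 1.0], [1.0, 4.0]])
mu_x = np.array([[1.0, 0.0], [0.0, 2.0]])
mu_y = np.array([[1.0, 1.0], [2.0, 1.0]])

k_xy = gmm_mean_kernel(mu_x, mu_y, weights, variances)
k_yx = gmm_mean_kernel(mu_y, mu_x, weights, variances)
```

Because the kernel is an explicit dot product of fixed-length vectors, it is symmetric and positive semi-definite by construction, which is what kernel-PCA requires.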
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
As numerous modifications and changes will readily occur to those skilled in the art, it is intended that the invention not be limited to the limited number of embodiments described herein. Accordingly, it will be appreciated that all suitable variations, modifications and equivalents may be resorted to, falling within the spirit and scope of the present invention.