This invention relates to automated recognition of events in a data series using self-organizing units, and in particular to recognition of events in a speech signal.
Many speech applications require large amounts of transcribed audio for supervised training of speech recognition models. For some domains, transcribed audio can be difficult to come by. Various approaches to speech recognition training using limited resources have recently been proposed, such as adapting models from related languages or bootstrapping with a small amount of transcribed data.
Many approaches for the analysis of speech signals use an automated transcription (i.e., the word sequence output of an automated speech recognizer, also referred to as a speech-to-text system) as an intermediate representation of a speech signal. The automated transcription is then used for further processing. For example, a topic identification (TID) system for speech signals can be based on the characteristics of the words in the automated transcription. Note that some approaches do not require that the words in the transcripts are meaningful—what is important is that the word sequences in the automated transcriptions capture the information that is useful for further processing, for example, by statistically capturing information indicative of the topic of a conversation.
One general approach to zero resource (i.e., no transcriptions) speech recognition is described in A. Park and J. Glass, “Towards unsupervised pattern discovery in speech,” Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, San Juan, Puerto Rico, 2005; A. Park, T. J. Hazen, and J. Glass, “Automatic processing of audio lectures for information retrieval: Vocabulary selection and language modeling,” Proc. ICASSP, Philadelphia, 2005, pp. 497-450; and Y. Zhang and J. Glass, “Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams,” Proc. ASRU, 2009, pp. 398-403. This approach generally employs a dynamic time warping (DTW) matching of features to find common occurrences of patterns. These DTW methods rely on individual string pattern results without necessarily capturing the natural variations in string patterns that are generated by the same underlying event. For example, DTW-based speech recognizers are highly speaker dependent.
Another approach to self-organizing speech recognition for information extraction is described in U.S. Pat. No. 7,389,233, issued to Gish on Jun. 17, 2008. Related approaches are described in H. Gish, et al, “Unsupervised training of an HMM-based speech recognition system for topic classification,” Interspeech 2009; in M. Siu et al., “Improved topic classification and keyword discovery using an HMM based speech recognizer trained without supervision”, Interspeech 2010; and in M. Siu et al. “Unsupervised Audio Patterns Discovery using HMM-based Self-Organized Units”, Interspeech 2011. This prior patent and related papers are incorporated herein by reference. In some of these approaches, an iterative unsupervised HMM training strategy is used in which the HMM transcribes the audio into a sequence of self-organized speech units (SOUs), using only untranscribed speech for training. One application of the resulting unit sequences is for topic identification (TID). One significant advantage of completely unsupervised training is a reduction of mismatch between training and test data, for example, because the untranscribed test data can, if needed, be added for acoustic training.
In one aspect, in general, an approach to automated processing of audio or other data series or signals, where little or no transcribed training data is available, makes use of identification of self-organizing units (SOUs) in conjunction with automated creation of, or augmentation of, a dictionary with “pseudo-words” or tokens represented in terms of the SOUs. In some examples, the dictionary is iteratively updated (i.e., augmented) during training, optionally with updating of models of the SOUs during the iteration.
In one aspect, in general, an approach to automatically forming a dictionary of tokens from a signal makes use of an iterative approach in which successive dictionaries are determined through successive processing of the signal, and in at least some iterations, a signal model is determined from the signal for the tokens of the dictionary determined at that iteration. In some examples, the approach is applied to a speech signal, for example, to automatically form a dictionary of discovered words or commonly repeated phonetic patterns in a language or vocabulary domain for which an adequate dictionary is not available. Such a dictionary can be applied to speech processing tasks, for example, for automated topic or speaker classification.
A computer-implemented method is used to form a dictionary representing events in a first signal (120) in a series of iterations. In each iteration (N) of a series of iterations (N≧1), a current dictionary (DN) that includes a plurality of tokens determined prior to the iteration is used to determine a modified dictionary (DN+1) that includes tokens not present in the current dictionary. Each iteration includes the following steps. First, a current token series (TN) representing the first signal (120) in terms of tokens of the current dictionary is determined by using a computer-implemented signal analysis module (206) to process the first signal using a current signal model (MN) characterizing signal characteristics of tokens in the current dictionary. Then the modified dictionary (DN+1) is determined by identifying one or more events represented in the current token series (TN), and one or more tokens are added to the current dictionary to form the modified dictionary. Each added token represents one of the events identified in the current token series. In at least some iterations (e.g., other than a final iteration of the series of iterations), a modified token series ({tilde over (T)}N+1) in terms of tokens of the modified dictionary (DN+1) is determined using the current token series (TN), and a computer-implemented model training module (216) is used to process the first signal (120) according to the modified token series ({tilde over (T)}N+1) to determine a modified signal model (MN+1) characterizing signal characteristics of the tokens of the modified dictionary (DN+1). The modified dictionary is then used as the current dictionary and the modified signal model is used as the current signal model in a subsequent iteration of the series of iterations.
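By way of illustration and not limitation, one iteration of the method described above can be sketched as follows. The callables (recognize, find_events, rewrite, retrain) are hypothetical stand-ins for the signal analysis module (206), the event identification step, the token series processing, and the model training module (216); their names and signatures are assumptions of this sketch and not part of the method itself.

```python
# Sketch of one iteration N of the dictionary-forming procedure.
# The callables are hypothetical stand-ins for the modules in the text.
def iterate(signal, dictionary, model, recognize, find_events, rewrite, retrain):
    # Step 1: determine the current token series T_N using model M_N.
    token_series = recognize(signal, dictionary, model)
    # Step 2: identify events and add a token for each, forming D_{N+1}.
    events = find_events(token_series)
    new_dictionary = dict(dictionary)
    for i, event in enumerate(events):
        new_dictionary[f"W{len(dictionary) + i + 1}"] = event
    # Step 3: rewrite the token series in terms of D_{N+1}, then retrain
    # the signal model against the rewritten series to obtain M_{N+1}.
    new_series = rewrite(token_series, new_dictionary)
    new_model = retrain(signal, new_series, new_dictionary)
    return new_dictionary, new_series, new_model
```

The returned dictionary and model then serve as the current dictionary and current signal model for the subsequent iteration.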
Aspects can include one or any combination of more than one of the following features.
Identifying the one or more events represented in the token series includes identifying repeated token sequences in the token series. Identifying repeated events can include counting occurrences of token n-grams in the current token series, and selecting one or more of the token n-grams according to their counts of occurrences as the one or more events.
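By way of illustration, the counting of token n-gram occurrences and selection of frequent n-grams as events can be sketched as follows (the function names are hypothetical):

```python
from collections import Counter

def count_ngrams(tokens, n):
    # Count occurrences of each token n-gram in the token series.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def select_events(tokens, n=2, top=1):
    # Select the most frequently occurring n-grams as candidate events.
    return [ngram for ngram, _ in count_ngrams(tokens, n).most_common(top)]
```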
The modified dictionary determined at an iteration includes a representation of each added token in terms of units used to represent tokens in the current dictionary.
The modified signal model includes data representing a Hidden Markov Model (HMM) characterizing the tokens of the modified dictionary. The data representing the HMM can include data characterizing a plurality of units used to represent the tokens of the modified dictionary.
Prior to the series of iterations a dictionary (D1) is initialized by grouping segments of the first signal into groups according to similarity of signal characteristics, each group of segments being associated with a label of the group, each token of the dictionary (D1) corresponding to one group. An initial token series (T0) is also determined according to the labels associated with successive segments of the data signal.
Prior to the series of iterations the model training module (214) is used to process the data signal according to the initial token series (T0) to determine an initial signal model (M1) characterizing signal characteristics of the tokens of the initialized dictionary (D1).
Prior to the series of iterations, a dictionary (D1) is initialized to include tokens each representing a predetermined signal unit, and providing an initial signal model (M1) trained on a second signal other than the first signal. The first data signal can represent a speech signal and the predetermined signal units comprise word units. The initial model can be trained using a transcription of at least some of the second signal. The predetermined signal units can comprise subword units. At least some of the subword units can be phoneme units. The initial model can be trained on a transcribed speech signal other than the first signal. The subword units can be associated with a language other than that represented in the first signal.
The first data signal represents a speech signal, and a word transcription of at least some of the first data signal is accepted in which each word of the transcription has a spelling in a pre-specified alphabet. A token series of the at least some of the first data signal and the word transcription are used to form a mapping from spellings to token sequences. The mapping is used to add tokens to a dictionary, including accepting a word to add to the dictionary and mapping a spelling of the word to a token sequence for the word. The spellings can be orthographic spellings. The spellings can alternatively be phonetic spellings, and the pre-specified alphabet is a phonetic alphabet.
After the series of iterations, a third signal is processed to determine a token series (T) representing the third signal in terms of tokens of a modified dictionary (DN+1) determined in the series of iterations. This processing includes using the signal analysis module (202) to process the third signal using a modified signal model (MN+1) characterizing signal characteristics of tokens in the modified dictionary (DN+1). The third signal is classified according to statistical characteristics of the determined token series. The classifying can be according to a topic, or a speaker.
At least some of the tokens of a modified dictionary can correspond to vocabulary items. At least some of the tokens of a modified dictionary can correspond to prosodic patterns.
The first signal can be a video signal, or a biological signal (e.g., an ECG signal).
Advantages of one or more aspects include being able to apply vocabulary-based speech analysis approaches to domains in which a dictionary of vocabulary items is not available. For example, a suitable dictionary may not be available because the language is not known or a dictionary in a known language is not available, or because the signal includes vocabulary items that are particular to a domain, for example, relating to technical terms, proper names, etc. that are not available prior to processing.
The approaches introduced above can be used in applications where a time series (e.g., speech and video) has events that can be characterized by similar repeated patterns. For example, in speech, there is repeated occurrence of certain words and in video we can have repeated patterns such as a moving vehicle. The detection of such repeated patterns can be employed in the characterization of the time series, e.g., the word occurrences are indicative of a particular topic being spoken about.
Other features and advantages of the invention are apparent from the following description, and from the claims.
In a first implementation, an approach to speech processing is directed to a situation in which a speech corpus is available, but no corresponding transcription or dictionary is available for that corpus. For example, the language being spoken may not be known, or if known, a dictionary or other enumeration of linguistic units may not be available for that language. Nevertheless, the approach infers structure (e.g., linguistic units) that represents or is analogous to words from the speech itself. In the discussion below, these units are referred to as “pseudo-words” or “tokens” without intending to require that the units are truly related to words in the language being spoken. Occurrences of these tokens in a signal are treated as events, which can be used for further processing of the signal.
In some examples, this speech corpus is referred to as a “training” corpus in that a speech processing system is configured (i.e., trained using statistical techniques) based on this corpus, and then the configured system is used to process yet other speech, which is referred to as a “test” corpus. Note that although the language being spoken and transcription is not available, other information about the training corpus may be known. For example, topic labels may be associated with parts of the training corpus, and the system may be configured to distinguish topics in the test corpus based on the inferred structure and its correlation with topics in the training corpus. In other examples, the “training” corpus is itself the target of analysis, for example, based on clustering or other grouping of parts of the corpus according to occurrence of the inferred tokens.
This implementation involves an initialization phase in which a set of underlying self-organizing units are identified and then the training corpus is transcribed (at least partially) according to those units. An iterative phase is then conducted in which in each iteration (a) the dictionary is updated with tokens represented in terms of the units, and (b) models of the signal realization of the units are updated, and optionally a sequence model (e.g., “language model”) for the tokens of the dictionary is updated based on the current dictionary.
Without intending to limit the meaning or scope of the terms being used, the units (i.e., self-organizing units) can be considered analogous to phonetic units, and the tokens of the dictionary considered analogous to words. However, the terms “units” and “tokens” are generally used below to reinforce the fact that there may be no linguistic basis for the units and tokens. Furthermore, in some examples, the training corpus may be partitioned, for example, into separate utterances. However, this is not required, and the input, whether partitioned or not, is generally referred to as a “signal”. In the discussion below, the term “dictionary” is used broadly to include a data structure that encodes a mapping from tokens (e.g., pseudo-words) to structure represented in terms of the underlying units. An example of such a structure, without being limited to this form, is an enumeration of the tokens in the dictionary, and for each token, a sequential list of units that represent the realization of that token in terms of the units, analogous to a phonetic spelling if the units were phonemes and the tokens words. Other forms of dictionary are within the scope of the discussion below, including dictionaries that for each token can represent multiple alternative unit sequences (e.g., as alternatives, as a graph, and/or generated according to a probabilistic process).
This implementation, as well as a number of alternative implementations discussed below, makes use of an initialization phase which results in a full or partial transcription of the training signal in terms of a set of units.
Referring to
Generally, initialization is continued by using the cluster labels to form statistical models for each of the clusters, for example, using a segmental Gaussian model for each segment, and using this to initialize a mixture of segmental Gaussian models that is iteratively improved using an Expectation-Maximization procedure. The result is that each segment is associated with a probability distribution over the clusters, as well as with the cluster with the highest probability. These highest-probability labels for the segments are used as the initializing transcription of the training signal.
An initial dictionary, referred to as D1, is formed such that each token in the dictionary represents a different signal unit determined in the clustering procedure above. In some implementations, the dictionary has a set of tokens {W1, W2, . . . } where each token has a corresponding representation as a sequence of units. This initial dictionary includes m words, such that the ith word Wi is represented as the sequence of a single unit Wi→[Pi]. The initial transcription of the training signal in terms of the tokens Wi is referred to as {tilde over (T)}1. The transcription comprises a sequence of tokens Wi from the initial dictionary D1, which corresponds directly to the sequence of labels Pi determined from the segment labels.
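By way of illustration, forming the initial dictionary D1 and the initial transcription from the segment labels can be sketched as follows, under the assumption that units and labels are represented as simple strings (the function names are hypothetical):

```python
def initial_dictionary(units):
    # D1: the ith token W_i is represented by the single unit P_i.
    return {f"W{i}": [p] for i, p in enumerate(units, start=1)}

def initial_transcription(segment_labels, units):
    # Map the sequence of segment labels directly to the corresponding tokens.
    token_for = {p: f"W{i}" for i, p in enumerate(units, start=1)}
    return [token_for[p] for p in segment_labels]
```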
A final stage of initialization makes use of the initial transcription {tilde over (T)}1 to form an initial model M1 for the units. Specifically, a conventional approach to training a Hidden Markov Model (HMM) makes use of the transcription and the training signal to estimate HMM models for the units. In general this initial training is itself iterative (e.g., using an iterative Baum-Welch approach). The model M1 also optionally includes a statistical sequence model (e.g., language model) for the tokens of the dictionary as represented in the initial transcription {tilde over (T)}1, for example, represented as an n-gram (Markov) model. Although not explicitly relied upon, a result of this stage is a new segmentation of the training signal, which is not necessarily the same as the original segmentation from which the clustering was performed.
Note that some prior approaches make use of these trained HMM models to convert training and test signals into sequences of segment labels. Then, subsequent processing makes use of characteristics of these label sequences, for example, based on identification of repeated subsequences.
Referring to
At the Nth iteration, a signal analysis module 206 processes (recognizes) the training signal 120 according to the current dictionary DN and the current model MN. Note that optionally, the training signal 120 used in the procedure shown in
A next step of the Nth iteration involves using the transcription TN to process the current dictionary DN in the dictionary processing element 208 to form the incrementally changed (e.g., augmented) dictionary DN+1. A number of specific procedures for performing this incremental change are discussed below. As a representative procedure, a single new token is added to represent a concatenation of two existing tokens in the dictionary. The two tokens to concatenate are chosen according to the statistics of their joint occurrence in the transcription TN. Based on the new dictionary, the transcription TN is processed by a token series processing element 210 to use the newly added tokens in the dictionary DN+1. As an example, in the first iteration, a sequence of tokens Wi, Wj may be identified in the transcription TN. In this example for the first iteration, Wi→[Pi] and Wj→[Pj]. An (m+1)st word is added to the dictionary as W(m+1)→[Pi,Pj]. In this example, the token series processing element 210 replaces each occurrence of Wi, Wj in TN with W(m+1) to form TN+1. More generally at subsequent iterations, the new token added to the dictionary is formed by concatenation of multiple-unit subsequences.
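By way of illustration, the representative procedure of adding a single token for the most frequent adjacent token pair, and replacing each occurrence of that pair in the transcription, can be sketched as follows (the names are hypothetical, and ties among equally frequent pairs are broken arbitrarily in this sketch):

```python
from collections import Counter

def merge_most_frequent_pair(transcription, dictionary):
    # Choose the most frequent adjacent token pair (W_i, W_j).
    pairs = Counter(zip(transcription, transcription[1:]))
    (wi, wj), _ = pairs.most_common(1)[0]
    # Add a new token whose representation concatenates the unit
    # sequences of the two constituent tokens.
    new_token = f"W{len(dictionary) + 1}"
    new_dict = dict(dictionary)
    new_dict[new_token] = dictionary[wi] + dictionary[wj]
    # Replace each occurrence of the pair with the new token.
    out, k = [], 0
    while k < len(transcription):
        if k + 1 < len(transcription) and \
                (transcription[k], transcription[k + 1]) == (wi, wj):
            out.append(new_token)
            k += 2
        else:
            out.append(transcription[k])
            k += 1
    return new_dict, out
```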
A next step of the Nth iteration is to improve the models MN to form new models MN+1. For example, a conventional HMM training module 216 optionally uses the models MN as an initial estimate and the new transcription TN+1 and dictionary DN+1 to determine the new models MN+1. This model training module is similar to the model initialization module 116, with the exception that the HMM training module 216 optionally starts with the current models.
This completes the Nth iteration, and the (N+1)st iteration begins. Assuming that a single token is added to the dictionary at each iteration, and there are initially m units and therefore tokens in D1, there are now m+N tokens in dictionary DN.
Referring to
Note that addition of only a single new token to the dictionary in each iteration is not required. Rather, multiple new tokens can be added at each iteration. Furthermore, the new tokens can represent sequences of more than two tokens in the recognized sequence. For example, all n-grams with frequency in the recognized sequence that is significantly higher than predicted by the sequence model can be added to the dictionary. In some examples, tokens may be removed from the dictionary at an iteration, for example, because those tokens are underrepresented in the transcription.
While using frequently occurring strings of tokens as the basis for creating new tokens to be added to the dictionary is an important way of augmenting the dictionary, there are other ways of adding tokens. In addition to frequently occurring tokens, less frequently occurring tokens can be added, especially those that may carry significant information. For example, if the training signal consists of audio from conversational speech and each conversation in the training corpus is treated as a “document”, we can, for any string of tokens, compute a term frequency-inverse document frequency score for assessing the possible importance of that token string. This measure, usually referred to as a TF-IDF (“term frequency-inverse document frequency”) score, is well known in text classification applications, and assigns high scores to strings that occur frequently but do not occur in all “documents”. Also, we can find strings of tokens that are effective in discriminating between topics of conversations that we may want to identify. These strings of tokens can be generated by a feature-generating application for improving classification performance between topics. For example, a feature generator for Support Vector Machine (SVM) classifiers (see, e.g., “Discriminative Keyword Selection Using Support Vector Machines”, by Campbell and Richardson) can be used to add such features (strings of pseudo-words) to the dictionary as part of the iterative process.
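By way of illustration, a TF-IDF score for token strings can be sketched as follows, where each “document” is represented as a list of token strings extracted from one conversation. Taking the maximum per-document score for each term is one of several reasonable scoring choices and is an assumption of this sketch, not a requirement of the approach:

```python
import math
from collections import Counter

def tf_idf_scores(documents):
    # documents: list of documents, each a list of token strings.
    n_docs = len(documents)
    df = Counter()  # document frequency of each token string
    for doc in documents:
        df.update(set(doc))
    scores = {}
    for doc in documents:
        tf = Counter(doc)  # term frequency within this document
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            # Keep the best score for the term across documents.
            scores[term] = max(scores.get(term, 0.0), count * idf)
    return scores
```

Token strings occurring in every document receive a zero score, while strings concentrated in few documents score highly.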
In the implementation described above, the initialization stage does not require any transcription or any prior models of the units represented in the training signal. In a first alternative implementation, the initial set of units (and thereby the initial dictionary D1) is chosen to be a predefined set of linguistically based units (e.g., English phonemes). Referring to
In a second alternative implementation, the training signal 120 is not assumed to be completely untranscribed. In a first variant of this approach, a very small amount of transcribed signal (e.g., on the order of 15 minutes of audio) is available, while the rest is untranscribed. We assume that there is inadequate data to train phonetic models as is done in the first alternative implementation described above. In this alternative, the initialization procedure that is described with reference to
In a second variant of the second alternative approach, more transcription than for the first variant is available (e.g., two hours), but the training signal is still largely untranscribed. In this variant, it becomes feasible to train a phone recognizer in the language of interest, assuming that the transcription is accompanied by a dictionary or text-to-sound rules suitable for mapping the words of the transcription to phonemes. Having trained phoneme models, these phoneme models can be used in place of the segmental Gaussian models in the initialization procedure described above, or alternatively, the phone models can be used as described in the first alternative implementation described above.
In a third variant, a significant amount of the training signal is transcribed (e.g., more than 10 hours). A word recognizer is trained on the transcribed signal and the dictionary is initialized to include the words of the transcription. Note that this assumes that we have a dictionary of the words of the language in terms of the phonemes of the language. In the dynamic dictionary part of the iterative training we will be able to create compound words from the words that exist in the dictionary, as well as pseudo-words as described above. The combination of the two aspects of dictionary updating exploits structure that exists in the data in a way that conventional recognizer training does not utilize.
Although described in the context of speech processing, the approaches are clearly not limited to speech. For example, the approaches can be used to form token sequences from a dictionary in other contexts. For other signals that represent a scalar or vector time series in which underlying events have the property that similar events produce similar time series (just as repeated words, the events, produce similar audio patterns), it is possible to tokenize/label the time series using the approaches described above. That is, a state-of-the-art Hidden Markov Model recognizer can be employed to recognize the repeated patterns that occur in the data as similar strings of units, utilizing tokens (e.g., pseudo-words) in the recognition process. Success of applying these approaches to other signals may depend on having a meaningful way of creating the initial tokenization (i.e., segmentation of the signal).
The use of an SGMM (segmental Gaussian mixture model) for initialization for other time series is quite feasible if it is possible to automatically segment different events and to measure the similarity between events. The segmentation and similarity-measure requirements enable us to cluster the event segments, for which it is then possible to create an SGMM model.
It should be understood that the form of the dictionary described above is only one example of a representation of tokens (e.g., events) that are present in a data series. In some alternatives, a hierarchical representation is maintained, for example, in the form of a phrase-structured grammar where the dictionary maintains the record of the combinations that were added to the dictionary (e.g., Wn→Wi, Wj, Wk) and a token series can effectively be parsed to reflect not only top-level tokens, but also the constituents that make them up. Various approaches to forming such a grammar could also be used.
Also, as discussed above, each token in the dictionary can be represented as a sequence of units. It should be recognized that the models MN can include context-dependent models as is done with phonemes in phonetic-based speech recognition such that a unit Pj in the context [ . . . , Pi, Pj, Pk, . . . ] has model parameters that depend on Pi and Pk.
The approaches described above may be applied to audio processing other than speech recognition. For example, one audio process that is not speech recognition is the analysis of the prosodic behavior of the human voice. Prosody is characterized by variations in acoustic energy and pitch movement and can be segmented based on changes in these quantities. Using the technology disclosed in U.S. Pat. No. 7,389,233 we are able to create an SGMM recognizer for the prosodic patterns of a speaker or group of speakers. Using the technology of this patent disclosure we can use the SGMM recognizer to perform an initialization of the tokenization/labeling into SOUs of the prosodic data for a speaker or group of speakers and then iteratively train the HMM, using dictionary updating. The result is an HMM prosody recognizer that can tokenize/label prosodic activity in patterns of discovered “pseudo-words”. Such patterns can be of importance in ascertaining a speaker's identity as well as the speaker's emotional state.
The same process employed for prosody can be applied to audio obtained from animals such as whales. A recognizer and SOUs created for whale sounds can be used to detect different species and the presence of such species in certain locations. One can use the spectral discontinuity measure for segmentation described in U.S. Pat. No. 7,389,233, or some other means. After segmentation, the process follows the process specified above for speech.
Similar to audio, video can be represented as a vector time series. Video is naturally divided into frames, and the vector features can be extracted either on a per-frame basis or, more generally, from variable frame lengths. The most trivial features would be the pixels of each frame, but the features can increase in complexity to the motion vectors used in video coding, or to features extracted from scene analysis, etc. Depending on the features extracted, video SOUs may represent particular video objects, scene movements, or other video patterns.
As we have noted above, when we have very limited transcribed audio we can create a multi-gram mapping between the letters of a language and SOUs. This mapping can help with the dictionary of the SOU recognizer as well as enable one to find words in audio based on their SOU representations. In addition to learning such a mapping with very limited transcribed audio, such a mapping can be learned without supervision (that is, without any transcribed audio) if there is a large collection of audio that can be tokenized into SOUs and a large collection of text from sources similar to the audio. Then such a mapping can be learned by minimizing the differences between the sequence statistics of the SOU sequences from the audio and the sequence statistics of the token sequences after mapping the text sequences into SOUs. The idea of learning a mapping between two non-parallel token corpora based on their frequency statistics has been used in cryptography to break classical ciphers such as the substitution cipher. In the present case, this mapping is generalized to handle mapping of variable-length token strings, higher-order sequence statistics such as n-grams, as well as fractional count estimates. The problem can also be formulated as finding two concatenated mappings, one from text to phonemes and one from phonemes to SOUs, using the same approach. If a pronunciation dictionary is available for the words in the text, in terms of phonemes, we can solve the assignment problem by finding the mapping between phonemes and SOUs.
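By way of illustration, the simplest form of such frequency-based matching, aligning the two alphabets by unigram frequency rank as in classical frequency analysis, can be sketched as follows. The full approach described above generalizes this far beyond the sketch, to variable-length strings, higher-order n-gram statistics, and fractional counts:

```python
from collections import Counter

def frequency_match(text_units, sou_sequence):
    # Rank each alphabet by unigram frequency and pair units by rank,
    # as in classical frequency analysis of substitution ciphers.
    text_ranks = [u for u, _ in Counter(text_units).most_common()]
    sou_ranks = [s for s, _ in Counter(sou_sequence).most_common()]
    return dict(zip(text_ranks, sou_ranks))
```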
The approach can also be used to analyze other data series, for example, in analysis of biological signals (e.g., electro-cardiograms, electro-encephalograms, etc.) or analysis and prediction of financial data series. For example, in the electro-cardiogram case, tokens can correspond to sequences of multiple beats. The approach can also be applied to analysis of printed or handwritten text, for example, by forming a data series representing segments along lines of text.
Implementations of the approaches described above can be in software, for example, comprising instructions stored on a tangible computer readable medium having instructions for causing one or more data processing systems to perform the procedures described above. The data processing systems may be distributed in time or space, for example, with the initialization procedure performed on one computer and the iterative training procedure performed on another computer. Implementations can include signal acquisition or storage components, for example, to acquire the training or test signals (e.g., using microphones, data network interfaces, etc.). Implementations can also include output components, for example, to store models, dictionaries, transcriptions, etc. on data storage devices for presentation to a user or for further processing, or to transmit such elements (e.g., over a data communication network) to other data processing systems.
It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Other embodiments are within the scope of the following claims.
This invention was made with government support under H98230-06-C-0482/0000. The government has certain rights in the invention.