The present disclosure is related to audio data processing and, in particular, to a system and method for identifying a spoken language in audio data which reduce processing time and improve identification efficiency and throughput.
In audio data processing, it is often desirable to identify the language being spoken by a speaker in the audio data. Language identification (LID) is the labeling of audio data (recording) with the identity of the language being spoken. Conventionally, language identification can require a large amount of processing resources due to the amount of audio data typically being analyzed to make the identification determination.
According to a first aspect, a system for identifying a language in audio data is provided. The system includes a feature extraction module for receiving an unknown input audio data stream and dividing the unknown input audio data stream into a plurality of segments of unknown input audio data. A similarity module receives the plurality of segments of the unknown input audio data and receives a plurality of known-language audio data models for a respective plurality of known languages. For each segment of the unknown input audio data, the similarity module performs comparisons between the segment of unknown input audio data and the plurality of known-language audio data models and generates a respective plurality of probability values representative of the probabilities that the segment includes audio data of the known languages. A processor receives the plurality of probability values for each segment and computes an entropy value for the probabilities for each segment. If the entropy value for a segment is less than the entropy value for a previous segment, the processor terminates the comparisons prior to completing comparisons for all segments of the unknown input audio data.
According to some exemplary embodiments, each segment of unknown audio data comprises an unknown data vector comprising a plurality of data values associated with the segment of unknown input audio data; and each known-language audio data model comprises a known data vector comprising a plurality of data values associated with the known-language audio data model.
According to some exemplary embodiments, the feature extraction module comprises a deep neural network.
According to some exemplary embodiments, the similarity module performs a probabilistic linear discriminant analysis (PLDA) in generating the plurality of probability values.
According to some exemplary embodiments, extents of each segment are defined by a time duration. Alternatively, extents of each segment are defined by a quantity of data in the segment.
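By way of non-limiting illustration, the two ways of defining segment extents can be sketched as follows. The function names `split_by_count` and `split_by_duration`, the list-of-samples representation, and the `sample_rate_hz` parameter are assumptions made for this sketch and do not appear in the disclosure.

```python
def split_by_count(samples, samples_per_segment):
    """Split samples into adjacent segments holding a fixed quantity of data.

    The final segment may be shorter than the others.
    """
    if samples_per_segment <= 0:
        raise ValueError("samples_per_segment must be positive")
    return [
        samples[start:start + samples_per_segment]
        for start in range(0, len(samples), samples_per_segment)
    ]


def split_by_duration(samples, sample_rate_hz, segment_seconds):
    """Split samples into adjacent segments of a fixed time duration.

    The duration is converted to a sample count using the (assumed known)
    sample rate, after which the split is the same as split_by_count.
    """
    if sample_rate_hz <= 0 or segment_seconds <= 0:
        raise ValueError("sample_rate_hz and segment_seconds must be positive")
    samples_per_segment = int(sample_rate_hz * segment_seconds)
    if samples_per_segment == 0:
        raise ValueError("segment_seconds is shorter than one sample")
    return split_by_count(samples, samples_per_segment)
```

For example, ten samples at an assumed rate of 2 samples per second, split into 2-second segments, yield segments of four, four, and two samples.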
According to another aspect, a method for identifying a language in audio data is provided. The method comprises: (i) receiving, at a feature extraction module, an unknown input audio data stream and dividing the unknown input audio data stream into a plurality of segments of unknown input audio data; (ii) receiving, at a similarity module, the plurality of segments of the unknown input audio data and receiving, at the similarity module, a plurality of known-language audio data models for a respective plurality of known languages, for each segment of the unknown input audio data, the similarity module performing comparisons between the segment of unknown input audio data and the plurality of known-language audio data models and generating a respective plurality of probability values representative of the probabilities that the segment includes audio data of the known languages; (iii) computing an entropy value for the probabilities for each segment; and (iv) if the entropy value for a segment is less than the entropy value for a previous segment, terminating the comparisons prior to completing comparisons for all segments of the unknown input audio data.
According to some exemplary embodiments, each segment of unknown audio data comprises an unknown data vector comprising a plurality of data values associated with the segment of unknown input audio data; and each known-language audio data model comprises a known data vector comprising a plurality of data values associated with the known-language audio data model.
According to some exemplary embodiments, the feature extraction module comprises a deep neural network.
According to some exemplary embodiments, the similarity module performs a probabilistic linear discriminant analysis (PLDA) in generating the plurality of probability values.
According to some exemplary embodiments, extents of each segment are defined by a time duration. Alternatively, extents of each segment are defined by a quantity of data in the segment.
The present disclosure is further described in the detailed description which follows, in reference to the noted plurality of drawings by way of non-limiting examples of embodiments of the present disclosure, in which like reference numerals represent similar parts throughout the several views of the drawings.
Language identification (LID) is the labeling of audio data (a recording) with the identity of a language being spoken by a person whose speech is contained within the audio. Current approaches to LID typically include an offline component (training) and a runtime component (recognition). The “fast-forward” approach of the current disclosure improves this architecture by speeding up the recognition component.
Current LID systems perform recognition of a new audio recording by applying a feature extraction module, which computes a low-dimensional feature vector from the entire recording duration. This feature vector is compared, using a similarity function, to a set of known feature vectors, one feature vector per target language. The similarities between the compared unknown feature vector and the known-language feature vectors are output as the probabilities of detection for each of the multiple languages being compared.
In the “fast-forward” approach of the present disclosure, the above system flow is substantially improved upon such that language detections/determinations are reported without using the entire unknown audio recording. According to the present disclosure, a feature vector is computed on each of one or more segments, i.e., “chunks,” of the new unknown audio. That is, according to the approach of the disclosure, a feature vector for the entire audio recording is not generated. The similarity function is applied to these individual segment feature vectors, and probabilities between the segment feature vectors and each of the known feature vectors for each language are obtained. For each segment, the entropy of the set of probabilities is computed. If the entropy from one segment to the next decreases, then the certainty of the language with the highest similarity being the correct determination increases to the point that the most likely language will not change with more data. With this conclusion, processing is stopped early, i.e., before the entire unknown audio recording is processed. This approach of the present disclosure reduces the amount of audio data needed at runtime to make a language designation, which directly reduces the processing cost and resources in terms of computation cycles. According to the present disclosure, using incremental processing and entropy-based processing termination, the overall computation load is reduced, and processing speed is increased. The approach of the present disclosure can be implemented as an add-on module that enhances the runtime performance of an existing language identification system.
According to some exemplary embodiments, the approach to language identification is carried out in two stages, namely, a feature extraction stage and a similarity function stage.
In the embodiment illustrated in
The feature extraction process 210, which generates the input audio data segments, is the slowest and most processing-intensive part of the process. According to the approach of the present disclosure, the amount of input audio data, i.e., the number of input audio data segments, required to be processed to arrive at a language identification is substantially reduced, resulting in a more efficient language identification process and system.
It will be understood that either or both of feature extraction module 210 and similarity function module 212 include all of the processing capabilities required to carry out their individual functions and the overall functions of language identification system 200, as described herein in detail. These processing capabilities can be implemented in either or both modules 210, 212, and can include for example, one or more dedicated processors, memories, input/output devices, interconnection devices, and any other required devices or subsystems. Alternatively, these modules 210, 212 and system 200 can be implemented on a general purpose computer executing instructions to implement the technology described herein.
Feature extraction module 210 takes audio 202 as input, and outputs a fixed-dimensional feature vector 214, i.e., X-vector as shown in
X=Extract(Audio) (1)
The similarity module takes two vectors, i.e., unknown audio X-vector 214 and known model audio M-vector 216, as input and outputs a single numeric value that captures the “closeness” of the vectors. For example, a similarity function can be defined as:
S=Similarity(X,Y) (2)
The intuition behind this function is that the larger the value of S, the “closer”, i.e., more similar, X and Y are. Two common geometric similarity functions are based on the Euclidean distance and the Cosine similarity. A distance function is turned into a similarity function by subtracting the distance from 1. In two dimensions, the Euclidean distance is given by the Pythagorean Theorem.
S=Similarity(X,Y) (3)
S=1−Distance(X,Y) (4)
$S = 1 - \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$ (5)
The Cosine similarity captures the angle between two vectors and is a common metric used in high dimensions (greater than 3).
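By way of non-limiting illustration, the two geometric similarity functions discussed above can be sketched as follows, generalized from two dimensions to vectors of any equal length. The function names and the plain-list vector representation are assumptions made for this sketch.

```python
import math


def euclidean_similarity(x, y):
    """Return 1 minus the Euclidean distance between x and y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 - distance


def cosine_similarity(x, y):
    """Return the cosine of the angle between x and y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

It is noted that the distance-based similarity is negative whenever the distance exceeds 1, whereas the cosine similarity always lies between −1 and 1. In both cases, a larger value indicates that the vectors are more similar.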
Probability functions, which return a value of 0 to 1, are also an intuitive set of similarity functions. If there is a probability that X and Y are the same, then the higher the probability S=P(X,Y), the “closer” or more similar X and Y are. In some exemplary embodiments, the similarity function 212 is Probabilistic Linear Discriminant Analysis (PLDA). PLDA is a probability-based metric that is a log-likelihood ratio, a comparison of two probabilities:
PLDA and other log-likelihood ratios range from −∞ to ∞, with 0 being the point at which it is completely uncertain whether the unknown audio is the known language. Positive values indicate that it is more likely than not that the unknown audio is the known language, and negative values indicate that it is more likely than not that the unknown audio is not the known language. This fits the requirement for a similarity metric in that larger values of S mean “closer” or “more similar.”
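By way of non-limiting illustration, the sign behavior of a generic log-likelihood ratio can be sketched as follows. This sketch is not a PLDA model; the two input probabilities, named here `p_same` and `p_different`, are assumed to be supplied by some model, and the function name is likewise an assumption.

```python
import math


def log_likelihood_ratio(p_same, p_different):
    """Return log(p_same / p_different) for two positive probabilities.

    p_same: assumed probability of the observation if the unknown audio
        is the known language.
    p_different: assumed probability of the observation if it is not.
    """
    if p_same <= 0.0 or p_different <= 0.0:
        raise ValueError("both probabilities must be positive")
    return math.log(p_same / p_different)
```

With equal probabilities the ratio is 0, indicating complete uncertainty. The ratio is positive when `p_same` is the larger of the two and negative when it is the smaller.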
According to some exemplary embodiments, feature extraction module 210 uses the neural network model developed at Johns Hopkins University, which is commonly referred to as an x-vector extractor. This model is well-known in the art, and is described in, for example, D. Snyder, et al., “Spoken Language Recognition using X-vectors,” in Proc. Odyssey, 2018. The x-vector neural network is an extraction function that satisfies the condition for equation 1 but is internally implemented with a neural network.
As illustrated below in detail, feeding additional audio data into feature extraction module 210, without re-initializing the network, gives a better estimate of feature vector 214. This is related to the effect in statistical estimation whereby more data points give a more accurate estimate.
According to the present disclosure, entropy of the probability scores 204 is computed and analyzed to determine whether processing can be stopped before all of the input audio data 202 is processed. Entropy is a measure of uncertainty and is computed over the set of probability scores. Specifically, if the unknown input audio must be one of N known languages, entropy E can be computed from the probability P(j) that the unknown input audio is language j, as follows:
$E = -\sum_{j=1}^{N} P(j) \cdot \log P(j)$ (7)
Entropy is mathematically zero when one language is entirely certain, for example, if P(French)=1.0. In contrast, entropy is highest when uncertainty is equal across all languages, for example, if P(French)=P(Spanish)=P(Russian)≈0.33.
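By way of non-limiting illustration, the entropy computation of equation (7) can be sketched as follows. The use of the natural logarithm is an assumption, since the base of the logarithm is not specified above; terms with zero probability are skipped, following the usual convention that they contribute nothing to the sum.

```python
import math


def entropy(probabilities):
    """Return -sum(p * log(p)) over the given probabilities.

    Natural logarithm is assumed. Zero-probability terms are skipped
    because p * log(p) tends to 0 as p tends to 0.
    """
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)
```

For three languages, the entropy is zero when one language has probability 1.0 and reaches its maximum, log 3, when all three probabilities are equal.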
According to some exemplary embodiments, similarity scores are converted to probabilities, and then the probabilities are converted to entropies. To that end, the similarity scores are first transformed to positive values, preserving their relative magnitudes. Then, each is divided by the sum of the scores, which results in N values that sum to 1.
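By way of non-limiting illustration, the conversion of similarity scores to probabilities can be sketched as follows. The transformation to positive values is not specified above; exponentiation is assumed here because it maps any real score to a positive value while preserving the ordering of the scores. Other order-preserving transformations could be used instead. The function name is an assumption.

```python
import math


def scores_to_probabilities(scores):
    """Convert real-valued similarity scores to N values that sum to 1.

    Assumed transform: exponentiate each score (after subtracting the
    maximum score for numerical stability), then divide by the sum.
    """
    if not scores:
        raise ValueError("at least one score is required")
    top = max(scores)
    positive = [math.exp(s - top) for s in scores]
    total = sum(positive)
    return [value / total for value in positive]
```

Under this assumption, the language with the largest similarity score always receives the largest probability, and equal scores receive equal probabilities.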
According to the approach of the present disclosure, as noted above, the two main components of a language identification system, i.e., feature extraction 210 and similarity 212 operations, are used in efficiently producing a language identification in audio data of an unknown language. System 200 receives as inputs the unknown audio 202 and a set of target language models 206, e.g., French 206(a), Spanish 206(b), Russian 206(c). System 200 generates as outputs a set of probabilities for each language.
According to the present disclosure, input audio 202 is broken into multiple adjacent chunks or segments 202(a), 202(b), 202(c) of a particular time duration, for example, c=10 sec, each. For each chunk i from 0 to N (the number of chunks), feature extraction 210 is called to compute a feature vector X[i]. Similarity function 212 is called to compute similarity scores S[i] for each target language model M[j] 206, i.e., Similarity (X[i], M[j]), where j ranges from 1 to L, where L is the number of language models 206, which in the illustrated embodiments is three. The similarity scores are normalized to probabilities P[j] for each language j, as noted above. The entropy of the language probabilities is computed as E[i], for each chunk i. If entropy drops from one chunk to the next succeeding chunk, that is, if E[i]<E[i−1], then processing stops, even if all chunks 202(a), 202(b), 202(c) have not been processed. According to exemplary embodiments, the last set of probability scores P[j] for each language j are returned as the final probability scores for each language.
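By way of non-limiting illustration, the chunk-by-chunk flow with entropy-based early termination can be sketched as follows. The `extract` and `similarity` callables stand in for feature extraction 210 and similarity function 212 and are supplied by the caller. The function names, the dictionary of models, and the exponentiation-based normalization are assumptions made for this sketch. Each chunk is processed independently here; embodiments that carry state between chunks are not modeled.

```python
import math


def entropy(probabilities):
    # Natural logarithm assumed; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def normalize(scores):
    # Assumed transform: exponentiate, then divide by the sum.
    top = max(scores)
    positive = [math.exp(s - top) for s in scores]
    total = sum(positive)
    return [value / total for value in positive]


def identify_language(chunks, models, extract, similarity):
    """Process chunks in order and stop as soon as entropy drops.

    chunks: iterable of audio segments.
    models: dict mapping a language name to its model vector M[j].
    Returns (probabilities by language, number of chunks processed).
    """
    names = list(models)
    previous_entropy = None
    probabilities = {}
    processed = 0
    for chunk in chunks:
        x = extract(chunk)                                   # X[i]
        scores = [similarity(x, models[n]) for n in names]   # one score per language
        probabilities = dict(zip(names, normalize(scores)))  # P[j]
        processed += 1
        current_entropy = entropy(probabilities.values())    # E[i]
        if previous_entropy is not None and current_entropy < previous_entropy:
            break                                            # E[i] < E[i-1]: stop early
        previous_entropy = current_entropy
    return probabilities, processed
```

As a toy usage example, with an identity `extract` and a dot-product `similarity`, the chunks `[0.1, 0.1]`, `[2.0, 0.0]`, `[5.0, 0.0]` compared against models `[1.0, 0.0]` and `[0.0, 1.0]` stop after the second chunk, because the entropy of the second chunk is lower than that of the first.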
Next, as illustrated in
Next, as illustrated in
Next, as illustrated in
The output of language identification system 200 is the set of probability scores 205. In some exemplary systems, the highest score can be interpreted as being the identified language. Thus, in the case of
Hence, the approach of the invention saves considerable processing cost and time because of its ability to eliminate the processing of large amounts of audio data. In particular, reducing usage of feature extraction module 210 is beneficial, especially since, under operation, that is where the bulk of processing time and cost is expended.
As described above in detail, according to the present disclosure, the language identification processing proceeds in chunks or segments. In some particular exemplary embodiments, the processing of a particular chunk builds on the information identified from the previous chunk or chunks. In these embodiments, a layer of the network keeps a running tally of statistics. To obtain the result, the state from previous chunks is maintained, so subsequent chunks incorporate the information from one or more previous chunks. The effect is the same, i.e., when the system receives data for a particular chunk, it only processes the audio data of that chunk.
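By way of non-limiting illustration, a running tally of statistics maintained across chunks can be sketched as follows. The disclosure does not specify which statistics are tallied; a per-dimension mean and standard deviation over frames is assumed here, and the class name `RunningStats` and the frame representation are likewise assumptions.

```python
import math


class RunningStats:
    """Accumulate per-dimension mean and standard deviation across chunks."""

    def __init__(self, dim):
        self.count = 0
        self.total = [0.0] * dim      # running sum per dimension
        self.total_sq = [0.0] * dim   # running sum of squares per dimension

    def update(self, frames):
        """Fold the frames of one chunk into the running tally."""
        for frame in frames:
            if len(frame) != len(self.total):
                raise ValueError("frame has the wrong dimension")
            self.count += 1
            for k, value in enumerate(frame):
                self.total[k] += value
                self.total_sq[k] += value * value

    def mean(self):
        if self.count == 0:
            raise ValueError("no frames have been accumulated")
        return [t / self.count for t in self.total]

    def std(self):
        means = self.mean()
        # Clamp at zero to guard against small negative rounding errors.
        return [
            math.sqrt(max(sq / self.count - m * m, 0.0))
            for sq, m in zip(self.total_sq, means)
        ]
```

Because only the sums are retained, each call to `update` touches only the frames of the current chunk, yet the resulting mean and standard deviation reflect all chunks received so far.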
Whereas many alterations and modifications of the disclosure will become apparent to a person of ordinary skill in the art after having read the foregoing description, it is to be understood that the particular embodiments shown and described by way of illustration are in no way intended to be considered limiting. Further, the subject matter has been described with reference to particular embodiments, but variations within the spirit and scope of the disclosure will occur to those skilled in the art. It is noted that the foregoing examples have been provided merely for the purpose of explanation and are in no way to be construed as limiting of the present disclosure.
While the present inventive concept has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present inventive concept as defined by the following claims.
Entry |
---|
Snyder et al., “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” Center for Language and Speech Processing & Human Language Technology Center of Excellence, The Johns Hopkins University, Baltimore, Maryland, USA. |
Snyder et al., “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. IEEE ICASSP, 2018. |
Dehak et al., “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, No. 4, pp. 788-798, 2011. |
Kinnunen, et al., “Real-time speaker identification and verification.” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, No. 1, 277-288, 2005. |
Kinnunen et al., “A speaker pruning algorithm for real-time speaker identification,” in International Conference on Audio-and Video-Based Biometric Person Authentication. Springer, 639-646, 2003. |
Sarkar et al., “Fast Approach to Speaker Identification for Large Population using MLLR and Sufficient Statistics,” in 2010 National Conference on Communications (NCC) IEEE, 1-5, 2010. |
Schmidt et al., “Large-scale Speaker Identification,” in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP) IEEE, 1650-1654, 2014. |
Zhu et al., “Self-attentive Speaker Embeddings for Text-Independent Speaker Verification,” INTERSPEECH, 2018. |
David Snyder, “SRE16 Xvector Model,” http://kaldi-asr.org/models/m3, 2017, Accessed: Oct. 10, 2018. |
He et al., “Streaming End-to-End Speech Recognition for Mobile Devices,” ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 12-17, 2019. |
Chen et al., “Query-by-Example Keyword Spotting Using Long Short-Term Memory Networks,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 19-24, 2015. |
Zhang et al., “Unsupervised Spoken Keyword Spotting Via Segmental DTW on Gaussian Posteriorgrams,” Proceedings of the Automatic Speech Recognition & Understanding (ASRU) Workshop, IEEE, 2009, 398-403. |
Miller et al., “Rapid and Accurate Spoken Term Detection,” Proceedings of Interspeech, ISCA, 2007, pp. 314-317. |
International Search Report and Written Opinion for International Application No. PCT/US2020/066298 dated Mar. 26, 2021. |
Number | Date | Country | |
---|---|---|---|
20220013107 A1 | Jan 2022 | US |