This application is related to U.S. application Ser. No. 13/800,764, entitled “Data Shredding for Speech Recognition Acoustic Model Training under Data Retention Restrictions,” filed on Mar. 13, 2013. The entire teachings of the above application are incorporated herein by reference.
A speech recognition system typically collects automatic speech recognition (ASR) statistics to train the speech recognition system. The ASR statistics can be used to train language models and acoustic models, which may be employed by the speech recognition system. In general, language models relate to the probability of particular word sequences. Acoustic models relate to sounds in a language.
A method or system for enabling training of a language model according to an example embodiment of the present invention includes producing segments of text in a text corpus and counts corresponding to the segments of text, the corpus being in a depersonalized state. The method further includes enabling a system to train a language model using the segments of text in the depersonalized state and the counts.
The text corpus may be one or more messages, e.g., voice mail messages, or transcripts of interview recordings. The segments of text can be n-tuples (or n-grams) and may be non-overlapping segments of text. In some embodiments, the method may further include maintaining a store of the segments of text and the counts. Maintaining the store can include removing all segments of text whose corresponding counts are less than N, and maintaining only the remaining segments of text and the counts.
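A minimal sketch of producing text segments and counts and pruning by a threshold is given below; whitespace tokenization, overlapping trigrams, the function names, and the example threshold value are assumptions made for illustration, not features required by the embodiments.

```python
from collections import Counter

def ngram_counts(corpus_lines, n=3):
    """Produce n-gram (n-tuple) text segments and their counts from a depersonalized corpus."""
    counts = Counter()
    for line in corpus_lines:
        words = line.split()  # assumed whitespace tokenization
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def prune(counts, min_count):
    """Remove all segments whose counts are less than min_count (the threshold N)."""
    return Counter({seg: c for seg, c in counts.items() if c >= min_count})

corpus = [
    "please call me back at <PhoneNumber> thanks",
    "please call me back when you can",
]
store = prune(ngram_counts(corpus, n=3), min_count=2)
# store now holds only ('please', 'call', 'me') and ('call', 'me', 'back'), each with count 2
```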
The method may further include depersonalizing the corpus to change it from a personalized state to the depersonalized state. In an embodiment, depersonalizing the corpus includes replacing personally identifiable information in the corpus with class labels, the class labels identifying the type of the personally identifiable information being replaced. For example, the personally identifiable information being replaced can include at least one of the following: a phone number, credit card number, name of a person, name of a business, or location. The method may include maintaining a list, not linked to the corpus, of the class labels and counts corresponding to the class labels.
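A minimal sketch of class-label replacement using simple pattern matching follows; the regular expressions, label names, and example text are illustrative assumptions, and a practical depersonalizer could use stronger recognizers (e.g., named-entity tagging) for names, businesses, and locations.

```python
import re

# Illustrative patterns only; credit cards are matched before phone numbers so a
# 16-digit number is not partially matched as a phone number.
CLASS_PATTERNS = [
    ("CreditCardNumber", re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")),
    ("PhoneNumber", re.compile(r"\b\d{3}[-. ]?\d{3}[-. ]?\d{4}\b")),
]

def depersonalize(text):
    """Replace personally identifiable information with class labels."""
    for label, pattern in CLASS_PATTERNS:
        text = pattern.sub("<" + label + ">", text)
    return text

print(depersonalize("Call me back at 617-555-0123."))
# "Call me back at <PhoneNumber>."
```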
In an embodiment, the method may further include filtering the segments of text by removing from the segments of text those segments that contain personally identifiable information.
The method may further include labeling the corpus or the text segments and counts with metadata. The metadata may include at least one of the following: time of day of the message, area code of the sender, area code of the recipient, call duration, device type, or message type (e.g., automated customer service message).
In an embodiment, the method includes replacing one or more words of the corpus with corresponding one or more word indices, wherein each word index is generated through use of a random hash. A map to the random hashes may be kept secure.
In one embodiment, a system for enabling training of a language model includes a segmentation module configured to produce segments of text in a text corpus and counts corresponding to the segments of text, the text corpus being in a depersonalized state. The system further includes an enabling module configured to enable a system to train a language model using the segments of text in the depersonalized state and their counts.
Embodiments of the present invention have many advantages. Dynamically shredding the text and/or speech corpus, as described herein, results in a list of text segments, e.g., n-grams, and their associated depersonalized audio features (DAFs). The text segments and DAFs cannot be traced back to the original messages, since the original messages (text and audio) themselves are not retained, i.e., they are deleted. Furthermore, embodiments can prevent re-construction of the original messages, since all the text segments and corresponding DAFs (e.g., the shreds) can be randomized and aggregated across a large number of messages. In addition, embodiments allow for all other data from the original message (such as time of conversion, calling identifiers, etc.) to be deleted. What remains is a large collection of text segments (e.g., n-grams or n-tuples), with associated audio features, representing an aggregation of what has been said to the system. The collection of text segments (e.g., n-grams or n-tuples) and audio features can be maintained in a generic, impersonal form that is useful for training a speech recognition system to recognize future utterances. In certain embodiments, the resulting ASR statistics may contain no Personally Identifiable Information (PII).
The collection of ASR statistics is useful for (re-)training a speech recognition system that employs Language Models (LMs) and/or Acoustic Models (AMs). For example, when the original data cannot be retained, the ASR statistics can be used to retrain the ASR models (LM and AM). Benefits of using ASR statistics to (re-)train a speech recognition system include better accuracy of conversions, an ability to keep up to date with user trends in speech and usage, an ability to customize the speech recognition to the needs of specific users, and a reduction in the volume of unconvertible messages.
The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the present invention.
A description of example embodiments of the invention follows.
Training of speech recognition systems typically requires in-domain training data, but the in-domain data often contains personally identifiable information (PII) that cannot be retained due to data retention restrictions.
In general, the audio data 102 is captured or generated by a user of the speech recognition system 100, and may be considered an input to the speech recognition system. The metadata 110 relates to the audio data 102, may be generated or used as part of the processing of the audio data 102, and may be provided to the speech recognition system 100. The metadata is usually delivered in addition to the audio itself. For example, the carrier will send the voice mail recording and, at the same time (e.g., in an XML format), the number of the caller. This additional descriptive data, i.e., data about the actual audio data, is commonly referred to as metadata. In a dictation application, metadata can, for example, include the time of the dictation and the name of the dictating user. In the police interview case, metadata can, for example, include the participant(s), the police case number, and the like. The metadata 110 may be used in an embodiment to label the text corpus, segments of text, and/or counts of the text segments with the metadata. The transcript data 120 typically relates to the output of the speech recognition system 100, for example, the presentation of the converted text to the user. In some cases, the transcript data 120 can include corrections of the automatic speech recognition output by a human operator/user, or an entirely manually created transcription.
Embodiments of the invention split the speech and/or text corpus up into smaller bits or shreds that are still usable for training while not containing personally identifiable information (PII). This process can be a compromise, because the smaller the bits or shreds, the less useful they are for training but the lower the risk of accidentally retaining any PII. By depersonalizing and/or filtering the data, e.g., the audio features and text segments, the systems and methods described herein can keep larger shreds while still removing PII.
Labeling the corpus and/or the text segments and counts with metadata is useful so that one can still train specific ASR models, or sub-models, after shredding. For example, if one wanted to train a model for the weekend, the metadata can be used to select, from the collection of shreds or segments, the shreds or segments from messages received during weekends.
Acoustic feature extraction results in a compression of the audio data. For example, the audio data may be processed in frames, where each frame includes a certain number of audio samples. In one example, the audio data is processed at 80 audio samples per frame, and the acoustic feature extraction results in 13 depersonalized audio features per frame. Because the original audio data are not retained, feature extraction effectively compresses the audio data.
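As one illustration, a cepstral front end such as the following yields 13 features per frame with an 80-sample frame advance; the 8 kHz sample rate, the file name, and the use of MFCCs are assumptions for the sketch, not requirements of the embodiments.

```python
import librosa

# 8 kHz telephone speech and a 10 ms (80-sample) frame advance are assumed;
# "message.wav" is a hypothetical input file.
audio, sr = librosa.load("message.wav", sr=8000)
features = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=80)
print(features.shape)  # (13, number_of_frames): 13 features per 80-sample frame

# The original waveform is then discarded; only the compact features are retained.
del audio
```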
Acoustic feature extraction is a partially reversible process. The original audio cannot be re-created, but some form of audio can be created. The audio that can be created contains the same words as originally spoken, but, for example, without speaker-specific intonation.
In some embodiments, the audio features 210 are extracted from the speech corpus 302 and depersonalized. Depersonalization of the audio features may include applying cepstral mean subtraction (CMS), cepstral variance normalization, Gaussianisation, or vocal tract length normalization (VTLN) to the audio features. CMS is useful in removing an offset; for example, CMS can be used to remove the effect of the communication channel from the voice data. VTLN is useful to normalize voices or voice data. It has been observed that female speakers typically have a shorter vocal tract than male speakers. VTLN can be used to normalize the voice data based on that observation.
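A minimal sketch of CMS combined with variance normalization, applied per feature dimension over one utterance, is shown below; the array layout and function name are assumptions.

```python
import numpy as np

def cms_and_variance_normalize(features):
    """Cepstral mean subtraction (CMS) plus variance normalization, applied per
    feature dimension over one utterance. `features` has shape (frames, dims)."""
    mean = features.mean(axis=0, keepdims=True)   # CMS: removes a constant offset,
    std = features.std(axis=0, keepdims=True)     # e.g., channel characteristics
    return (features - mean) / (std + 1e-8)       # small epsilon avoids divide-by-zero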
Depersonalizing the audio features can include using a neural network to depersonalize the audio features. For example, a neural network system based on trainable features may be used, where the trainable features are trained to produce the posterior probability that the current frame of input (or a frame at a fixed offset from the current frame) corresponds to one or more of a set of linguistic units, including word and sub-word units such as phone units, context-dependent phone units, grapheme units, and the like. The depersonalized features can be a fixed linear or non-linear transform of the trainable features. The depersonalized features may be produced via an intermediate "bottleneck" layer in trainable structures, such as multi-layer perceptrons, deep neural networks, and deep belief networks. Furthermore, depersonalizing the audio features can include applying one or more (e.g., a set of) speaker-specific transforms to the audio features to remove speaker information. The types of speaker-specific transforms that may be used include linear transforms, such as constrained maximum likelihood linear regression and its variants, and speaker-specific non-linear transforms. An advantage of applying speaker-specific transforms is that the system can train a transform for each speaker in the set; by modeling the speaker characteristics, the system can remove them and thereby depersonalize the audio features.
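A sketch of the bottleneck idea in a trainable structure follows; the layer sizes, context window, and phone-unit inventory are illustrative assumptions, not parameters of the disclosed system.

```python
import torch
import torch.nn as nn

class BottleneckFeatureNet(nn.Module):
    """A multi-layer perceptron trained to predict phone-unit posteriors for the
    current frame; the narrow bottleneck layer yields the depersonalized features."""
    def __init__(self, num_features=13, context=9, bottleneck_dim=40, num_phones=40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(num_features * context, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck_dim),          # bottleneck layer
        )
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(bottleneck_dim, num_phones),    # phone posteriors used only for training
        )

    def forward(self, frames):
        bottleneck = self.encoder(frames)             # depersonalized feature vector
        return self.classifier(bottleneck), bottleneck

# After training against phone targets (e.g., with cross-entropy loss), only the
# bottleneck output is retained per frame as the depersonalized audio feature.
```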
The collection or store 314 of strips or shreds 308 can be mixed up in a randomization or mixing operation 316. Each text segment 312 and the corresponding depersonalized audio feature 310 can be stored in a store and maintained for use by the system. For example, the text segments 312 and audio features 310 can be used to enable training of an acoustic model. The fact that the shreds 308 are in randomized order does not affect the training, because acoustic models for speech recognition relate to individual sounds. Maintaining the store can include storing each segment 312 together with its corresponding depersonalized audio feature 310, the text segments and corresponding depersonalized audio features being randomized, as shown at 318.
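A short sketch of such a randomization or mixing step is shown below; the data layout (a flat list of (text segment, DAF) pairs aggregated across many messages) is an assumption.

```python
import random

def build_shred_store(shreds):
    """Randomize (text_segment, daf) pairs so the stored shreds cannot be
    reassembled into the original messages."""
    store = list(shreds)      # each item: (text_segment, daf)
    random.shuffle(store)     # randomized order across the aggregated messages
    return store
```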
In some embodiments, the method or system of enabling training of an acoustic model may further include filtering the depersonalized audio features, for example, by removing the depersonalized audio features that are longer than a certain length. Filtering the depersonalized audio features can include examining the content of the text segments and removing depersonalized audio features based on the content of the corresponding text segments. In some embodiments, removing the depersonalized audio features includes removing the depersonalized audio features whose corresponding text segments contain a phone number and at least two more words.
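One way such filtering might be realized is sketched below; the frame-count limit, the single-token digit-string phone-number pattern, and the function name are assumptions for illustration.

```python
import re

DIGIT_PHONE = re.compile(r"\d{3}[-.]?\d{3}[-.]?\d{4}")

def keep_shred(text_segment, daf, max_frames=200):
    """Drop DAFs longer than max_frames, and drop shreds whose text contains a
    phone number plus at least two more words."""
    if len(daf) > max_frames:
        return False
    words = text_segment.split()
    phone_tokens = [w for w in words if DIGIT_PHONE.search(w)]
    other_tokens = [w for w in words if not DIGIT_PHONE.search(w)]
    if phone_tokens and len(other_tokens) >= 2:
        return False
    return True
```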
It should be noted that the shredding process as described herein is a one-way only process. The original message, or message corpus, cannot be reconstructed from the shreds.
The number of occurrences of audio features for each text segment is an indication of how many times a particular text segment was spoken in a particular speech corpus.
Optionally, filtering the DAFs can be combined with content identification. For example, filtering the DAFs can include examining content of the text segments and removing DAFs based on the content of the corresponding text segments.
The segments of text can be n-tuples (or n-grams) and may be non-overlapping segments of text. In some embodiments, the system 700 may be configured to maintain a store of the segments of text and the counts.
The system 700 may be configured to maintain a list of the class labels and counts corresponding to the class labels, the list not being linked to the corpus. For example, the system may retain one or more general class membership frequency lists, which are stored separately and without link or reference to any individual message or document. The system 700 can maintain the counts of what has been replaced per class label. For example, if "Uwe" is replaced with the class label MaleFirstName and "Jill" with the class label FemaleFirstName, maintaining the list results in count(MaleFirstName, Uwe) += 1 and count(FemaleFirstName, Jill) += 1. The system does not, however, maintain any link to where in the depersonalized corpus these instances came from; it only keeps track of how common "Uwe" is as a MaleFirstName. In an embodiment, the depersonalization module 718 is configured to maintain the list of class labels.
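A sketch of such a class-membership frequency list, maintained without any link back to individual messages, is shown below; the data structure and function name are assumptions.

```python
from collections import Counter

# Frequency of class members, kept separately and with no reference to any message.
class_member_counts = Counter()

def record_replacement(class_label, original_value):
    """Record that original_value was replaced by class_label somewhere in the
    (now discarded) corpus, without storing where it occurred."""
    class_member_counts[(class_label, original_value)] += 1

record_replacement("MaleFirstName", "Uwe")     # count(MaleFirstName, Uwe) += 1
record_replacement("FemaleFirstName", "Jill")  # count(FemaleFirstName, Jill) += 1
```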
The system 700 can further include a filtering module 710 configured to filter the segments of text by removing from the segments of text those segments that contain personally identifiable information. The system 700 may further include a labeling module 714 configured to label the text segments and the counts with metadata. The metadata can be leveraged to accumulate statistics per metadata value/cluster. For example, the system may track Count(Year=2012,WordTuple), where WordTuple denotes the text segment(s). The metadata may include at least one of the following: time of day of the message, area code of the sender, area code of the recipient, or call duration.
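A sketch of accumulating statistics per metadata value, in the spirit of Count(Year=2012, WordTuple) above, follows; the data structures and example values are assumptions.

```python
from collections import Counter, defaultdict

# Counts of text segments accumulated per metadata value or cluster.
counts_by_metadata = defaultdict(Counter)

def add_segment(metadata_value, word_tuple):
    """Accumulate Count(metadata_value, word_tuple), e.g. Count(Year=2012, WordTuple)."""
    counts_by_metadata[metadata_value][word_tuple] += 1

add_segment(("Year", 2012), ("please", "call", "me"))
```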
In an embodiment, the system 700 includes an indexing module 716 configured to replace one or more words of the corpus with corresponding one or more word indices, wherein each word index is generated by a random hash. Furthermore, the system, e.g., indexing module 716, may be configured to keep a map to the random hashes secure.
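A sketch of word indexing with a secured map is given below; approximating the "random hash" with a random token drawn per word type is an assumption (a keyed hash such as an HMAC would be another option), and the class and method names are illustrative.

```python
import secrets

class SecureWordIndexer:
    """Replace words with random indices; the word-to-index map is kept in a
    separate, secured store so the indexed corpus alone reveals no words."""
    def __init__(self):
        self._map = {}  # kept secure, never shipped with the indexed corpus

    def index(self, word):
        if word not in self._map:
            self._map[word] = secrets.token_hex(8)  # random index for this word type
        return self._map[word]

indexer = SecureWordIndexer()
indexed = [indexer.index(w) for w in "please call me back".split()]
```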
A system in accordance with the invention has been described that enables a system, e.g., a speech recognition system, to train a language model and/or an acoustic model. Components of such a system, for example a shredding module, segmentation module, enabling module, and other modules discussed herein, may be implemented as portions of program code operating on a computer processor.
Portions of the above-described embodiments of the present invention can be implemented using one or more computer systems, for example, to permit generation of ASR statistics for training of a language and/or an acoustic model. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be stored on any form of non-transient computer-readable medium and loaded and executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.
Further, it should be appreciated that a computer may be embodied in any of a number of forms, such as a rack-mounted computer, desktop computer, laptop computer, or tablet computer. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone or any other suitable portable or fixed electronic device.
Also, a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible format.
Such computers may be interconnected by one or more networks in any suitable form, including as a local area network or a wide area network, such as an enterprise network or the Internet. Such networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
In this respect, at least a portion of the invention may be embodied as a computer readable medium (or multiple computer readable media) (e.g., a computer memory, one or more floppy discs, compact discs, optical discs, magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments of the invention discussed above. The computer readable medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the present invention as discussed above.
In this respect, it should be appreciated that one implementation of the above-described embodiments comprises at least one computer-readable medium encoded with a computer program (e.g., a plurality of instructions), which, when executed on a processor, performs some or all of the above-described functions of these embodiments. As used herein, the term “computer-readable medium” encompasses only a non-transient computer-readable medium that can be considered to be a machine or a manufacture (i.e., article of manufacture). A computer-readable medium may be, for example, a tangible medium on which computer-readable information may be encoded or stored, a storage medium on which computer-readable information may be encoded or stored, and/or a non-transitory medium on which computer-readable information may be encoded or stored. Other non-exhaustive examples of computer-readable media include a computer memory (e.g., a ROM, RAM, flash memory, or other type of computer memory), magnetic disc or tape, optical disc, and/or other types of computer-readable media that can be considered to be a machine or a manufacture.
The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of the present invention as discussed above. Additionally, it should be appreciated that according to one aspect of this embodiment, one or more computer programs that when executed perform methods of the present invention need not reside on a single computer or processor, but may be distributed in a modular fashion amongst a number of different computers or processors to implement various aspects of the present invention.
Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
While this invention has been particularly shown and described with references to example embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims. It should also be appreciated that the various technical features of the embodiments that have been described may be combined in various ways to produce numerous additional embodiments.
Number | Name | Date | Kind |
---|---|---|---|
6141753 | Zhao et al. | Oct 2000 | A |
6404872 | Goldberg et al. | Jun 2002 | B1 |
6600814 | Carter et al. | Jul 2003 | B1 |
6874085 | Koo et al. | Mar 2005 | B1 |
7512583 | Benson et al. | Mar 2009 | B2 |
7526455 | Benson et al. | Apr 2009 | B2 |
8185392 | Strope et al. | May 2012 | B1 |
8229742 | Zimmerman et al. | Jul 2012 | B2 |
8401859 | Dhawan et al. | Mar 2013 | B2 |
8423476 | Bishop et al. | Apr 2013 | B2 |
8433658 | Bishop et al. | Apr 2013 | B2 |
8473451 | Hakkani-Tur | Jun 2013 | B1 |
8489513 | Bishop et al. | Jul 2013 | B2 |
8515745 | Garrett et al. | Aug 2013 | B1 |
8515895 | Benson et al. | Aug 2013 | B2 |
8561185 | Muthusrinivasan | Oct 2013 | B1 |
8700396 | Mengibar et al. | Apr 2014 | B1 |
9131369 | Ganong, III et al. | Sep 2015 | B2 |
20020023213 | Walker et al. | Feb 2002 | A1 |
20030037250 | Walker et al. | Feb 2003 | A1 |
20030172127 | Northrup et al. | Sep 2003 | A1 |
20050065950 | Chaganti et al. | Mar 2005 | A1 |
20060085347 | Yiachos | Apr 2006 | A1 |
20060136259 | Weiner et al. | Jun 2006 | A1 |
20060190263 | Finke et al. | Aug 2006 | A1 |
20070118399 | Avinash et al. | May 2007 | A1 |
20070282592 | Huang et al. | Dec 2007 | A1 |
20080086305 | Lewis et al. | Apr 2008 | A1 |
20080147412 | Shaw et al. | Jun 2008 | A1 |
20080209222 | Narayanaswami et al. | Aug 2008 | A1 |
20080294435 | Reynolds et al. | Nov 2008 | A1 |
20090132803 | Leonard et al. | May 2009 | A1 |
20100071041 | Ikegami | Mar 2010 | A1 |
20100242102 | Cross et al. | Sep 2010 | A1 |
20100255953 | McCullough et al. | Oct 2010 | A1 |
20100281254 | Carro | Nov 2010 | A1 |
20110022835 | Schibuk | Jan 2011 | A1 |
20110054899 | Phillips et al. | Mar 2011 | A1 |
20110131138 | Tsuchiya | Jun 2011 | A1 |
20110197159 | Chaganti et al. | Aug 2011 | A1 |
20120010887 | Boregowda et al. | Jan 2012 | A1 |
20120011358 | Masone | Jan 2012 | A1 |
20120059653 | Adams et al. | Mar 2012 | A1 |
20120079581 | Patterson | Mar 2012 | A1 |
20120095923 | Herlitz | Apr 2012 | A1 |
20120101817 | Mocenigo et al. | Apr 2012 | A1 |
20120166186 | Acero et al. | Jun 2012 | A1 |
20120201362 | Crossan et al. | Aug 2012 | A1 |
20120278061 | Weinstein et al. | Nov 2012 | A1 |
20130073672 | Ayed | Mar 2013 | A1 |
20130104251 | Moore et al. | Apr 2013 | A1 |
20130243186 | Poston, Jr. et al. | Sep 2013 | A1 |
20130262873 | Read et al. | Oct 2013 | A1 |
20130263282 | Yamada et al. | Oct 2013 | A1 |
20130346066 | Deoras et al. | Dec 2013 | A1 |
20140058723 | Shen et al. | Feb 2014 | A1 |
20140067738 | Kingsbury | Mar 2014 | A1 |
20140143533 | Ganong, III et al. | May 2014 | A1 |
20140143550 | Ganong, III et al. | May 2014 | A1 |
20140163954 | Joshi et al. | Jun 2014 | A1 |
20140207442 | Ganong, III et al. | Jul 2014 | A1 |
20140278366 | Jacob et al. | Sep 2014 | A1 |
20140278426 | Jost et al. | Sep 2014 | A1 |
Entry |
---|
U.S. Appl. No. 13/800,764, “Data Shredding for Speech Recognition Acoustic Model Training Under Data Retention Restrictions,” filed Mar. 13, 2013. |
Calpe, J., et al., “Toll-quality digital secraphone,” IEEE conference, 8th Mediterranean vol. 3:1714-1717 (1996). |
De Andrade, J. et al., “Speech privacy for modern mobile communication systems,” IEEE ICASSP 2008 conference Las Vegas, NV, vol. 1: 1777-1780 (2008). |
Fazeen, M. et al., Context-Aware Multimedia Encryption in Mobile Platforms, 9th Annual Cyber and Information Security Research Conference, CISR '14:53-56 (2014). |
Office Action dated Apr. 2, 2015 for U.S. Appl. No. 13/800,765 entitled “Data Shredding for Speech Recognition Acoustic Model Training Under Data Retention Restrictions”. |
Servetti, A. et al., “Perception-based partial encryption of compressed speech,” IEEE Transactions on Speech and Audio Processing, 10(8):637-643 (2002). |
Office Action dated Sep. 2, 2015 for U.S. Appl. No. 13/800,764. |
Office Action dated Dec. 10, 2015 for U.S. Appl. No. 13/800,764. |
Chaudhari et al., “Privacy Protection for Life-log Video,” Signal Processing Applications for Public Security and Forensics, 2007, Published Apr. 11-13, 2007. |
Final Office Action for U.S. Appl. No. 13/800,764 dated May 6, 2016. |
Notice of Allowance for U.S. Appl. No. 13/800,764, dated Aug. 26, 2016. |
Number | Date | Country | |
---|---|---|---|
20140278425 A1 | Sep 2014 | US |