This invention relates generally to document sanitization and, more specifically, to a system and method for automatically sanitizing text in heterogeneous documents.
Automating aspects of customer service enables a business to increase efficiency and reduce required resources. Many such businesses employ Automatic Speech Recognition (ASR) systems and various user interface systems. Training the neural networks used to automate such systems requires large volumes of user-interaction data. Unfortunately, the data used to train such systems often contains sensitive information, and it is difficult to determine which of the data is sensitive and which is non-sensitive in nature. Known document sanitization techniques rely on the detection of specific information from concrete domains (e.g., credit card and social security numbers from particular databases). In other words, such techniques require customization for each domain and lack generality and scalability. Therefore, there is a need for a document sanitization method that performs automatically with respect to heterogeneous documents.
Any such method requires a careful balance between how much data to redact and how much data to retain. If a system over-redacts, the remaining data is less useful for training purposes. If the system redacts too little, it may leave sensitive information in the documents. Therefore, there is also a need to minimize redaction while still removing all sensitive information, such that the data remains useful for training.
The present invention provides a system and method for automatically sanitizing text in heterogeneous documents. It does so through three main sanitization steps performed with respect to a corpus of document vectors (i.e., vector representations of documents based on the tokens in the documents). First, the system filters the tokens in a new document against a privacy threshold: tokens having a frequency in the corpus of document vectors below the threshold are flagged as unsafe. This is based on the principle that very infrequent words are likely to be sensitive information, whereas frequent words are likely to be common, non-sensitive information. Second, the system performs a k-anonymity sanitization process such that the new document vector becomes indistinguishable from at least k other document vectors in the corpus of document vectors. While k-anonymity is a known technique, it is not well-understood, routine, or conventional in the field to combine it with privacy threshold filtering. Furthermore, it is challenging to select the optimal k-value such that the system redacts the minimum number of tokens. While a linear programming approach may be helpful in finding an optimal k vector, it is computationally very expensive. The present invention provides a heuristic approach to finding a near-optimal k vector, which is computationally much more efficient and, therefore, faster. Third, the system replaces or redacts the tokens in the document flagged as unsafe, using a machine-learning language model to predict a vector representation.
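By way of illustration only, the privacy threshold filtering step may be sketched as follows, assuming a precomputed count of how many corpus documents contain each token (the names doc_frequency and threshold are illustrative, not part of the disclosure):

```python
# Illustrative sketch of privacy threshold filtering: tokens whose corpus
# document frequency falls below the threshold are flagged as unsafe.
from collections import Counter

def privacy_threshold_filter(tokens, doc_frequency, threshold):
    """Map each distinct token to True (safe) or False (unsafe)."""
    return {t: doc_frequency.get(t, 0) >= threshold for t in set(tokens)}

# Usage: with a threshold of 2, a token seen in only one corpus document
# (e.g., an account identifier) is flagged unsafe.
corpus = [{"my", "order", "number", "is", "ax93b"},
          {"my", "order", "is", "late"},
          {"where", "is", "my", "order"}]
doc_frequency = Counter(t for doc in corpus for t in doc)
flags = privacy_threshold_filter(["my", "order", "ax93b"], doc_frequency, 2)
# flags == {"my": True, "order": True, "ax93b": False}
```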
In one embodiment, a method for text sanitization comprises the following steps:
The present disclosure describes a system, method, and computer program for text sanitization. The method is performed by a computer system that includes servers, storage systems, networks, operating systems, and databases (“the system”).
Example implementations of the methods are described in more detail with respect to
1. Method for Privacy-Preserving Text Sanitization
1.1 Corpus Building
As seen in
In certain embodiments, for each of the vector representations, the system creates an equivalent vector of cryptographically secure hashes (e.g., the equivalent vector comprises the SHA-256 hash of each of the words in the original vector, computed using a PBKDF2 approach) (step 220) and builds a corpus of document vectors comprising the equivalent vectors (step 225), where a corpus is a database of vectors. This allows the original vector to be maintained without storing any sensitive information.
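A minimal sketch of this hashing step, assuming Python's standard hashlib library; the salt and iteration count shown are assumptions, and a fixed salt is used so that identical tokens hash identically across documents, which corpus-wide frequency counting relies on:

```python
# Illustrative sketch: building an equivalent vector of PBKDF2-HMAC-SHA256
# token hashes. SALT and ITERATIONS are assumed values, not from the
# disclosure; a fixed salt keeps equal tokens mapping to equal hashes.
import hashlib

SALT = b"corpus-wide-secret-salt"   # hypothetical fixed salt
ITERATIONS = 100_000                # hypothetical work factor

def hash_token(token):
    digest = hashlib.pbkdf2_hmac("sha256", token.encode("utf-8"), SALT, ITERATIONS)
    return digest.hex()

def to_equivalent_vector(tokens):
    """Replace each word in the original vector with its secure hash."""
    return [hash_token(t) for t in tokens]
```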
1.2 Text Preprocessing
The system receives a new document for text sanitization (step 230). As seen in
1.3 Privacy Threshold Filtering
The privacy threshold filtering step of
1.4 k-Anonymity Sanitization
The k-anonymity sanitization step of
1.5 Document Sanitization
The document sanitization step of
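By way of illustration only, one way to realize the replacement of an unsafe token is a masked-language-model prediction. The specific model (bert-base-uncased) and the Hugging Face transformers API below are assumptions; the disclosure specifies only that a machine-learning language model predicts a representation for the flagged token:

```python
# Hypothetical sketch: predict a replacement for a token flagged unsafe
# by masking it and letting a masked language model fill the blank.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def replace_unsafe(text, unsafe_token):
    masked = text.replace(unsafe_token, fill_mask.tokenizer.mask_token, 1)
    return fill_mask(masked)[0]["sequence"]   # top-scoring completion

# Usage: replace_unsafe("my order number is ax93b", "ax93b")
```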
1.6 Updating Corpus
The updating corpus step of
1.7 Graphical Example
2. Method for Heuristic Approach to k-Anonymity Sanitization of a Document
The system begins processing each token in the sorted list of tokens (in order) (step 420). The system identifies all other documents in the set S that include the token (step 425). The system determines if the number of documents is greater than k (step 430). If the system determines that the number of documents is not greater than k, the system flags the token as unsafe and proceeds to step 450 (step 435). If the system determines that the number of documents is greater than k, the system flags the token as safe (step 440).
If the token being processed is flagged as safe, the system updates set S by removing all documents not having the token from S (step 445). If, however, the token is flagged as unsafe, the system leaves set S unchanged. The system determines whether there is another token for processing (step 450). If the system determines that there is another token for processing, the system returns to step 420 to begin processing the next token (step 455). If the system determines there is no other token for processing, the k-anonymity sanitization is complete (step 460). When the words flagged as unsafe are redacted, the document will be indistinguishable (with respect to token content) from at least k other documents in set S (step 460).
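A minimal sketch of this heuristic, under stated assumptions: each document is represented as a set of tokens, and the sort order of the token list (not specified in this excerpt) is taken here to be descending corpus frequency:

```python
# Hypothetical sketch of the heuristic k-anonymity pass (steps 420-460).
def k_anonymity_flags(sorted_tokens, corpus_docs, k):
    """Flag tokens safe/unsafe so that, once unsafe tokens are redacted,
    the document is indistinguishable from at least k documents in S."""
    S = list(corpus_docs)                        # working set S
    flags = {}
    for token in sorted_tokens:                  # step 420: process in order
        matching = [d for d in S if token in d]  # step 425
        if len(matching) > k:                    # step 430
            flags[token] = True                  # step 440: safe
            S = matching                         # step 445: keep docs having token
        else:
            flags[token] = False                 # step 435: unsafe; S unchanged
    return flags                                 # step 460: complete
```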
In certain embodiments, where a BOW vector is used (i.e., a bag of words vector where “1” means the token is present in the document, and “0” means the token is not present in the document), the bit value in the vector corresponding to an unsafe token is changed from “1” to “0,” and then the unsafe token is later redacted from the document.
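Continuing the illustrative sketch above, updating a BOW vector and redacting the unsafe token might look as follows (all names are hypothetical):

```python
# Illustrative BOW update: flip the bit for each unsafe token from "1" to
# "0", then redact the token from the document text.
vocabulary = ["my", "order", "is", "late", "ax93b"]
bow = [1, 1, 1, 0, 1]                     # new document's BOW vector
flags = {"my": True, "order": True, "is": True, "ax93b": False}

for i, word in enumerate(vocabulary):
    if not flags.get(word, True):
        bow[i] = 0                        # unsafe token: "1" becomes "0"
# bow == [1, 1, 1, 0, 0]; "ax93b" is subsequently redacted or replaced
```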
3. Method for Document Sanitization
4. Example System Architecture
As illustrated in
5. General
The methods described with respect to
As will be understood by those familiar with the art, the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Accordingly, the above disclosure is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.