1. Field of the Invention
The present invention relates to speech processing and more specifically to reusing existing spoken dialog data to generate a new natural language spoken dialog system.
2. Introduction
Natural language spoken dialog systems receive spoken language as input, analyze the received spoken language input to derive meaning from the input, and perform some action, which may include generating speech, based on the meaning derived from the input. Building natural language spoken dialog systems requires large amounts of human intervention. For example, a number of recorded speech utterances may require manual transcription and labeling for the system to reach a useful level of performance for operational service. In addition, the design of such complex systems typically includes a human being, such as a User Experience (UE) expert, to manually analyze and define system core functionalities, such as a system's semantic scope (call-types and named entities) and a dialog manager strategy, which will drive the human-machine interaction. This approach to building natural language spoken dialog systems is expensive and error-prone because it involves the UE expert making non-trivial design decisions, the results of which can only be evaluated after the actual system deployment. Thus, a complex system may require the UE expert to define the system's core functionalities via several design cycles, which may include defining or redefining the core functionalities, deploying the system, and analyzing the performance of the system. Moreover, scalability is compromised by time, costs, and the high level of UE know-how needed to reach a consistent design. A new approach that reduces the amount of human intervention required to build a natural language spoken dialog system is desired.
Applications for natural language dialog systems have already been built. Some new applications may be able to benefit from the data accumulated from existing natural language dialog applications. An approach that reuses the data accumulated from existing natural language dialog applications to build new natural language dialog applications would greatly reduce the time, labor, and expense of building such a system.
In a first aspect of the invention, a method is provided. User input indicating selections of spoken language dialog data may be received. The selections of spoken language dialog data may be extracted from a library of reusable spoken language dialog components. A Spoken Language Understanding (SLU) model or an Automatic Speech Recognition (ASR) model may be built based on the selected spoken language dialog data.
In a second aspect of the invention, a system for reusing spoken dialog components is provided. The system may include a processing device, an extractor, and a model building module. The processing device may be configured to receive user input selections indicating ones of a group of spoken dialog data stored in a library. The extractor may be configured to extract the ones of the group of spoken dialog data, and the model building module may be configured to build at least one of a SLU model or an ASR model based on the extracted ones of the group of spoken dialog data.
In a third aspect of the invention, a machine-readable medium is provided. The machine-readable medium may include, recorded thereon, a set of instructions for receiving user input indicating selections of spoken language dialog data from a library, a set of instructions for extracting the selections of spoken language dialog data from the library, and a set of instructions for building at least one of an Automatic Speech Recognition (ASR) model or a SLU model based on the selected spoken language dialog data.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, explain the invention. In the drawings,
Various embodiments of the invention are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without parting from the spirit and scope of the invention.
Designing a new natural language spoken dialog system may require a great deal of human intervention. The first step may be collecting recordings of utterances from customers. These collected utterances may then be transcribed, either manually or via an ASR module. The transcribed utterances may provide a baseline for the types of requests (namely, the user's intent) that users make when they call. A UE expert working with a business customer, according to specific business rules and services requirements, may use either a spreadsheet or a text document to classify these calls into call-types. For example, the UE expert may classify or label input such as, for example, “I want a refund” as a REFUND call-type, and input such as, for example, “May I speak with an operator” as a GET_CUSTOMER_REP call-type. In this example, call-type is synonymous with label.
The end result of this process may be an annotation guide document that describes the semantic domain in terms of the types of calls that may be received and how to classify the calls. The annotation guide may be given to a group of “labelers” who are individuals trained to label thousands of utterances. The utterances and labels may then be used to create a SLU model for an application. The result of this labeling phase is typically a graphical requirement document, namely, a call flow document, which may describe the details of the human-machine interaction. The call flow document may define prompts, error recovery strategies and routing destinations based on the SLU call-types. Once this document is completed, the development of a dialog application may begin. After field tests, results may be given to the UE expert, who then may refine the call-types, create a new annotation guide, retrain the labelers, redo the labels and create new labels or call-types from new data and rebuild the SLU model.
U.S. patent application Ser. No. ______, entitled “SYSTEM AND METHOD FOR AUTOMATIC GENERATION OF A NATURAL LANGUAGE UNDERSTANDING MODEL,” (Attorney Docket No. 2003-0059), filed on ______ and herein incorporated by reference in its entirety, describes various tools for generating a Natural or Spoken Language Understanding model.
When models for an application are built, spoken dialog data for the application may be stored in a library of reusable components and may be reused to bootstrap another application. The spoken dialog data may include utterance data, which may further include a category or verb, positive utterances, and negative utterances. The utterance data may be stored as part of a collection. A group of collections may be stored in a sector data set. The library is discussed in more detail below.
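The grouping described above (utterance data within collections, collections within sector data sets) can be sketched as follows; the class and field names are hypothetical and are used only for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class UtteranceData:
    category: str              # e.g., "Service Queries"
    verb: str                  # e.g., "Request"
    positive_utterances: list  # utterances that match the call-type
    negative_utterances: list  # utterances that do not

@dataclass
class Collection:
    name: str                  # e.g., a data collection period
    utterances: list = field(default_factory=list)

@dataclass
class SectorDataSet:
    sector: str                # e.g., "healthcare"
    collections: list = field(default_factory=list)

# A sector data set groups collections, each holding reusable utterance data.
healthcare = SectorDataSet("healthcare", [
    Collection("collection-1", [
        UtteranceData("Service Queries", "Request",
                      ["I want a refund"], ["what time is it"]),
    ]),
])
```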
User device 102 may be a processing device such as, for example, a personal computer (PC), handheld computer, or any other device that may include a processor and memory. Server 104 may also be a processing device, such as, for example, a PC, a handheld computer, or other device that may include a processor and memory. User device 102 may be connected to server 104 via a network, for example, the Internet, a Local Area Network (LAN), Wide Area Network (WAN), wireless network, or other type of network, or may be directly connected to server 104, which may provide a user interface (not shown), such as a graphical user interface (GUI) to user device 102. Alternatively, in some implementations consistent with the principles of the invention, user device 102 and server 104 may be the same device. In one implementation consistent with the principles of the invention, user device 102 may execute a Web browser application, which may permit user device 102 to interface with a GUI on server 104 through a network.
Server 104 may include extractor 106 for receiving indications of selected reusable components from user device 102 and for retrieving the selected reusable components from library 108. Model building module 107 may build a model, such as a SLU model or an ASR model or both the SLU model and the ASR model, from the retrieved reusable components. Model building module 107 may reside on server 104, may be included as part of extractor 106, or may reside in a completely separate processing device from server 104.
Library 108 may include a database, such as, for example, an XML database, a SQL database, or other type of database. Library 108 may be included in server 104 or may be separate from and remotely located from server 104, but may be accessible by server 104 or extractor 106. Server 104 may include extractor 106, which may extract information from library 108 in response to receiving selections from a user. A request from a user may be specific (e.g., "extract information relevant to requesting a new credit card"). Alternatively, extractor 106 may operate in an automated fashion in which it would use examples in library 108 to extract information from library 108 with only minimal guidance from the user (e.g., "Extract the best combination of Healthcare and Insurance libraries and build a consistent call flow").
Processor 220 may include at least one conventional processor or microprocessor that interprets and executes instructions. Memory 230 may be a random access memory (RAM) or another type of dynamic storage device that stores information and instructions for execution by processor 220. Memory 230 may also store temporary variables or other intermediate information used during execution of instructions by processor 220. ROM 240 may include a conventional ROM device or another type of static storage device that stores static information and instructions for processor 220. Storage device 250 may include any type of media, such as, for example, magnetic or optical recording media and their corresponding drives.
Input device 260 may include one or more conventional mechanisms that permit a user to input information to system 200, such as a keyboard, a mouse, a pen, a microphone, a voice recognition device, etc. Output device 270 may include one or more conventional mechanisms that output information to the user, including a display, a printer, one or more speakers, or a medium, such as a memory, or a magnetic or optical disk and a corresponding disk drive. Communication interface 280 may include any transceiver-like mechanism that enables system 200 to communicate via a network. For example, communication interface 280 may include a modem, or an Ethernet interface for communicating via a local area network (LAN). Alternatively, communication interface 280 may include other mechanisms for communicating with other devices and/or systems via wired, wireless or optical connections.
System 200 may perform such functions in response to processor 220 executing sequences of instructions contained in a computer-readable medium, such as, for example, memory 230, a magnetic disk, or an optical disk. Such instructions may be read into memory 230 from another computer-readable medium, such as storage device 250, or from a separate device via communication interface 280.
Spoken dialog data are data from existing applications, which may be stored in a library of reusable components. The library of reusable components may include SLU models, ASR models, named entity grammars, manual transcriptions, ASR transcriptions, call-type labels, audio data (utterances), dialog level templates, prompts, and other reusable data.
The data may be organized in various ways. For instance, in an implementation consistent with the principles of the invention, the data may be organized by industrial sector, such as, for example, financial, healthcare, insurance, etc. Thus, for example, to create a new natural language spoken dialog system in the healthcare sector, all the library components from the healthcare sector could be used to bootstrap the new natural language spoken dialog system. Alternatively, in other implementations consistent with the principles of the invention, the data may be organized by category (e.g., Service Queries, Billing Queries, etc.), according to call-types of individual utterances, or by words in the utterances, such as, for example, frequently occurring words.
Any given utterance may belong to one or more call-types. Call-types may be given mnemonic names and textual descriptions to help describe their semantic scope. In some implementations, call-types may be assigned attributes that may be used to assist in library management and browsing and to provide a level of discipline to the call-type design process. Attributes may indicate whether the call-type is generic, reusable, or specific to a given application. Call-types may include a category attribute or, at a lower level, may be characterized by a "verb" attribute, such as "Request, Report, Ask, etc." A given call-type may belong to a single industrial sector or to multiple industrial sectors. The UE expert may make a judgment call with respect to how to organize various application datasets into industrial sectors. Because the collection of utterances for any particular application is usually done in phases, each new application may have datasets from several data collection periods. Thus, each call-type may also have an attribute describing the data collection, such as, for example, a date and/or time of the data collection.
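The call-type attributes described above (mnemonic name, textual description, reuse scope, category, verb, sectors, and data collection date) might be represented as a simple record; the field names and values below are assumptions for illustration, not an actual schema:

```python
# Hypothetical call-type record; every field name and value is illustrative.
call_type = {
    "name": "GET_CUSTOMER_REP",       # mnemonic name
    "description": "Caller asks to speak with a live operator",
    "scope": "reusable",              # generic | reusable | application-specific
    "category": "Service Queries",    # category attribute
    "verb": "Request",                # lower-level "verb" attribute
    "sectors": ["financial", "healthcare"],  # may belong to multiple sectors
    "collection_date": "2004-06-15",  # which data collection the labels came from
}
```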
Each of sectors 302 may include a SLU model, an ASR model, and named entity grammars and may have the same data organization. An exemplary data organization of a sector, such as financial sector 302-1, is illustrated in
One of ordinary skill in the art would understand that the audio data and corresponding transcriptions may be used to train an ASR model, and the call-type labels may be used to build new SLU models.
The labeled and transcribed data for each of data collections 304 may be imported into separate data collection databases. In one implementation consistent with the principles of the invention, the data collection databases may be XML databases (data stored in XML), which may keep track of the number of utterances imported from each natural language speech dialog application as well as data collection dates. XML databases or files may also include information describing locations of relevant library components on the computer-readable medium that may include library 108. In other implementations, other types of databases may be used instead of XML databases. For example, in one implementation consistent with the principles of the invention a relational database, such as, for example, a SQL database may be used.
The data for each collection may be maintained in a separate file structure. As an example, for browsing application data, it may be convenient to represent the hierarchical structure as a tree {category, verb, call-type, utterance items}. A call-type library hierarchy may be generated from the individual data collection databases and the sector database. The call-type library hierarchy may include sector, data collection, category, verb, call-type, and utterance items. However, users may be interested in all of the call-types with "verb=Request," which suggests that the library may be maintained in a relational database. In one implementation that employs XML databases, widely available tools can be used, such as tools that support, for example, XSLT or XPath, to render interactive user interfaces with standard Web browser clients. XPath is a language for addressing parts of an XML document. XSLT is a language for transforming XML documents into other XML documents.
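A "verb=Request" query of the kind described above can be sketched with standard XPath-style tooling; the XML element and attribute names below are assumptions, not the actual library schema:

```python
import xml.etree.ElementTree as ET

# A toy XML fragment standing in for a data-collection database; the
# element and attribute names here are illustrative assumptions.
doc = ET.fromstring("""
<sector name="financial">
  <collection date="2004-06">
    <calltype name="REFUND" category="Billing Queries" verb="Request">
      <utterance>I want a refund</utterance>
    </calltype>
    <calltype name="ASK_BALANCE" category="Billing Queries" verb="Ask">
      <utterance>what is my balance</utterance>
    </calltype>
  </collection>
</sector>
""")

# XPath-style query: find all call-types whose verb attribute is "Request".
requests = doc.findall(".//calltype[@verb='Request']")
print([ct.get("name") for ct in requests])  # ['REFUND']
```

The same predicate syntax works with full XPath processors in Web browser clients, which is what makes this organization convenient for interactive browsing.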
In some implementations consistent with the principles of the invention, methods for building SLU models, methods for text normalization, feature extraction, and named entity extraction may be stored in a file, such as an XML file or other type of file, so that the methods used to build the SLU models may be tracked. Similarly, in implementations consistent with the principles of the invention, data that is relevant to building an ASR module or dialog manager may be saved.
When building an application from data in a library, such as, for example, library 108, a sector data set, associated with a selected model from library 108, may be used to bootstrap a SLU model and/or an ASR model for the new application. In this case, all or part of the utterances from the sector data set may be used to build the SLU model and/or the ASR model for the new application.
Next, model building module 107 may build an ASR model based on the extracted spoken dialog data. One of ordinary skill in the art would understand various methods for building the ASR model. Further, model building module 107 may build a SLU model based on the extracted spoken dialog data. One of ordinary skill in the art would understand various methods for building the SLU model.
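As one minimal sketch of how a SLU model might be built from extracted utterance/label pairs, a simple word-overlap classifier is shown below; production systems use far more sophisticated statistical models, and all names and data here are hypothetical:

```python
from collections import Counter, defaultdict

def build_slu_model(labeled_utterances):
    """Count word frequencies per call-type label (a toy SLU 'model')."""
    model = defaultdict(Counter)
    for text, label in labeled_utterances:
        model[label].update(text.lower().split())
    return model

def classify(model, utterance):
    """Pick the label whose word profile overlaps the utterance most."""
    words = utterance.lower().split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))

# Hypothetical extracted spoken dialog data: (utterance, call-type) pairs.
data = [
    ("I want a refund", "REFUND"),
    ("refund my payment please", "REFUND"),
    ("may I speak with an operator", "GET_CUSTOMER_REP"),
    ("get me a customer representative", "GET_CUSTOMER_REP"),
]
model = build_slu_model(data)
print(classify(model, "I would like a refund"))  # REFUND
```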
The process illustrated in
Although the above description may contain specific details, they should not be construed as limiting the claims in any way. Other configurations of the described embodiments of the invention are part of the scope of this invention. For example, alternative methods may be used to select data to be extracted from a library in other implementations consistent with the principles of the invention. For example, an alternative interface may be used to select data from a library. Accordingly, other embodiments are within the scope of the following claims.
The present invention is related to U.S. patent application Ser. No. ______ (attorney docket no. 2004-0101), entitled “A LIBRARY OF EXISTING SPOKEN DIALOG DATA FOR USE IN GENERATING NEW NATURAL LANGUAGE SPOKEN DIALOG SYSTEMS,” U.S. patent application Ser. No. ______ (attorney docket no. 2004-0125), entitled “A SYSTEM OF PROVIDING AN AUTOMATED DATA-COLLECTION IN SPOKEN DIALOG SYSTEMS,” and U.S. patent application Ser. No. ______ (attorney docket no. 2004-0021), entitled “BOOTSTRAPPING SPOKEN DIALOG SYSTEMS WITH DATA REUSE.” The above U.S. Patent Applications are filed concurrently herewith and the contents of the above U.S. Patent Applications are herein incorporated by reference in their entirety.