This description relates to trend discovery in audio signals.
A contact center provides a communication channel through which business entities can manage their customer contacts. In addition to handling various customer requests, a contact center can be used to deliver valuable information about a company to appropriate customers and to aggregate customer data for making business decisions. Improving the efficiency and effectiveness of agent-customer interactions can result in greater customer satisfaction, reduced operational costs, and more successful business processes.
In a general aspect, a method includes processing, by a keyphrase generation engine, data representative of text associated with one or more content sources to generate a specification of a set of keyphrases of interest; processing, by a word spotting engine, a first set of audio signals collected during a first time period to generate first data characterizing putative occurrences of one or more keyphrases of the set of keyphrases of interest in the first set of audio signals; evaluating, by an analysis engine, the first data to generate keyphrase-specific comparison values for the first set of audio signals; deriving, by the analysis engine, first trending data between the first set of audio signals and a second set of audio signals based in part on an analysis of the keyphrase-specific comparison values for the first set of audio signals relative to stored keyphrase-specific baseline values; and generating, by a user interface engine, a visual representation of at least some of the first trending data and causing the visual representation of the first trending data to be presented on a display terminal.
Embodiments may include one or more of the following. The visual representation further includes trending data other than the first trending data. The keyphrase generation engine, the word spotting engine, the analysis engine, and the user interface engine form part of a contact center system. The first set of audio signals is representative of interactions between contact center callers and contact center agents. The display terminal is associated with a contact center user.
The method further includes processing, by the word spotting engine, the second set of audio signals collected during a second time period to generate second data characterizing putative occurrences of one or more keyphrases of the set of keyphrases of interest in the second set of audio signals. The second set of audio signals is representative of interactions between contact center callers and contact center agents. The second time period is prior to the first time period, and the method further includes evaluating, by the analysis engine, the second data to generate the keyphrase-specific baseline values; and storing, by the analysis engine, the keyphrase-specific baseline values in a machine-readable data store.
The second time period is subsequent to the first time period, and the method further includes evaluating, by the analysis engine, the second data to generate keyphrase-specific comparison values for the second set of audio signals; deriving, by the analysis engine, second trending data between the second set of audio signals and a third set of audio signals based in part on an analysis of the keyphrase-specific comparison values of the second set of audio signals relative to the stored keyphrase-specific baseline values; and generating, by the user interface engine, a visual representation of at least some of the second trending data and causing the visual representation of the second trending data to be presented on a display terminal.
The second time period is subsequent to the first time period, and the method further includes evaluating, by the analysis engine, the second data to generate keyphrase-specific comparison values for the second set of audio signals; deriving, by the analysis engine, second trending data between the second set of audio signals and the first set of audio signals based in part on an analysis of the keyphrase-specific comparison values of the second set of audio signals relative to the keyphrase-specific comparison values for the first set of audio signals; and generating, by the user interface engine, a visual representation of at least some of the second trending data and causing the visual representation of the second trending data to be presented on a display terminal.
The specification of the set of keyphrases of interest includes at least one phonetic representation of each keyphrase of the set. For each set of audio signals, the processing includes identifying time locations in the set of audio signals at which a spoken instance of a keyphrase of the set of keyphrases of interest is likely to have occurred based on a comparison of data representing the set of audio signals with the specification of the set of keyphrases of interest. Evaluating each of the first data and the second data includes computing values representative of one or more of the following: hit count, call count, call percentage, total call duration. The method further includes filtering the first set of audio signals prior to processing the first set of audio signals by the word spotting engine. The filtering is based on one or more of the following techniques: clip spotting and natural language processing.
In another general aspect, a method includes processing, by a keyphrase generation engine, data representative of text associated with one or more content sources to generate a specification of a set of keyphrases of interest; processing, by a word spotting engine, a first set of audio signals collected during a first time period to generate first data characterizing putative occurrences of one or more keyphrases of the set of keyphrases of interest in the first set of audio signals; evaluating, by an analysis engine, the first data to identify cooccurrences of spoken instances of keyphrases of the set of keyphrases of interest within the first set of audio signals; and generating, by a user interface engine, a visual representation of at least some of the identified cooccurrences of the spoken instances of keyphrases and causing the visual representation to be presented on a display terminal.
Embodiments may include one or more of the following. The identified cooccurrences of spoken instances of keyphrases represent salient pairs of cooccurring keyphrases of interest. The method further includes deriving, by the analysis engine, first trending data between the first set of audio signals and a second set of audio signals based in part on an analysis of the salient pairs of cooccurring keyphrases of interest within the first set of audio signals relative to salient pairs of cooccurring keyphrases of interest within a second set of audio signals, and generating, by the user interface engine, a visual representation of at least some of the first trending data and causing the visual representation of the first trending data to be presented on the display terminal. The visual representation further includes trending data other than the first trending data.
The identified cooccurrences of spoken instances of keyphrases represent clusters of keyphrases of interest. The method further includes deriving, by the analysis engine, first trending data between the first set of audio signals and a second set of audio signals based in part on an analysis of the clusters of keyphrases of interest within the first set of audio signals relative to clusters of cooccurring keyphrases of interest within a second set of audio signals; and generating, by the user interface engine, a visual representation of at least some of the first trending data and causing the visual representation of the first trending data to be presented on the display terminal.
The keyphrase generation engine, the word spotting engine, the analysis engine, and the user interface engine form part of a contact center system. The first set of audio signals is representative of interactions between contact center callers and contact center agents. The display terminal is associated with a contact center user.
Evaluating the first data includes recording a number of times a spoken instance of a first keyphrase of the set of keyphrases of interest occurs within a predetermined range of another keyphrase of the set of keyphrases of interest; generating a vector for the first keyphrase based on the recording; and storing the generated first keyphrase vector in a machine-readable data store for further processing. The method further includes repeating the recording, vector generating, and storing actions for at least some other keyphrases of the set of keyphrases of interest.
In general, the techniques described in this document can be applied in many contact center contexts to ultimately benefit both contact center performance and caller experience. Trend discovery techniques are capable of identifying meaningful words and phrases using the high speed and high recall of phonetic indexing analysis. The vocabulary can be automatically refreshed and expanded as new textual sources become available without the need for user intervention. Trend discovery analyzes millions of spoken phrases to automatically detect emerging trends that may not be apparent to a user. The integration of call metadata adds relevance to the results of an analysis by including call parameters such as handle times, repeat call patterns, and other important metrics. Trend discovery provides an initial narrative into contact center activity by highlighting important topics and generating clear visualizations of the results. These techniques thus help decision makers in choosing where to focus further empirical analysis of calls.
Other features and advantages of the invention are apparent from the following description and from the claims.
The following description discusses techniques and approaches that can be applied for discovering trends in sets of audio signals. For purposes of illustration and without limitation, the techniques are illustrated in detail below in the context of customer interactions with contact centers. In this context, trend discovery identifies and tracks trends in topics discussed in those customer interactions. The identified trends highlight critical issues in a contact center, such as common customer concerns, and serve as a starting point for a more in-depth analysis of a particular topic. For instance, trend discovery helps decision makers to gauge the success of a newly introduced promotion by tracking the frequency with which that promotion is mentioned during phone calls. The analytic tools provided by trend discovery thus add significant business value to contact centers.
Generally, the database 140 includes media data 142, for example, voice recordings of past and present calls handled by agents. The database 140 also includes metadata 144, which contains descriptive, non-audio data stored in association with media data 142. Examples of metadata 144 include phonetic audio track (PAT) files that provide a searchable phonetic representation of the media data, transcripts of the media data, and descriptive information such as customer identifiers, customer characteristics (e.g., gender), agent identifiers, call durations, transfer records, day and time of a call, general categorization of calls (e.g., payment vs. technical support), agent notes, and customer-inputted dual-tone multi-frequency (DTMF; i.e., touch-tone) tones.
By processing data maintained in the contact center database 140, the data processor 120 can extract management-relevant information for use in a wide range of analyses, including for example:
Results of these analyses can be presented to the user in desired forms through an output unit 130 (e.g., on a computer screen), allowing user interactions through the user interface 110 for further analysis and processing of the data if needed.
In some embodiments, data processor 120 includes a set of engines that detect trends in audio data. Those engines include a keyphrase generation engine 122, a word spotting engine 124, an analysis engine 126, and a user interface engine 128, as described in detail below.
Keyphrase generation engine 122 uses conventional natural language processing techniques to identify keyphrases of interest. The keyphrase generation engine employs a language model (LM) that models word frequencies and context and that is trained on a large textual corpus. For instance, the LM may be trained on the Gigaword corpus, which includes 1.76 billion word tokens obtained from web pages. Keyphrase generation engine 122 applies the LM to text corpus 300 using natural language processing techniques to identify keyphrases of interest. For example, the keyphrase generation engine may employ part-of-speech tagging, which automatically tags words or phrases in text corpus 300 with their grammatical categories according to lexicon and contextual rules. In the phrase “activated the phone,” the word “activated” is tagged as a past tense verb, the word “the” is tagged as a determiner, and the word “phone” is tagged as a singular noun. Keyphrase generation engine 122 may also use noun phrase chunking based on Hidden Markov Models or Transformation-Based Learning to locate noun phrases, such as “account management” or “this phone,” in the text corpus 300. Keyphrase generation engine 122 also applies rule-based filters to text corpus 300. Keyphrase generation engine 122 may also identify keyphrases of interest based on an analysis of the nature of a phrase given its constituent parts. The set of vocabulary identified as keyphrases of interest represents topics most likely to occur in audio signals (e.g., contact center conversations) to be analyzed for trends.
Once keyphrases of interest 302 are identified in text corpus 300, prior audio data 304 are fed into word spotting engine 124. Prior audio data include stored searchable data (e.g., PAT data) representing, for instance, audio recordings of telephone calls made over a period of time. For contact center applications, prior audio data 304 may include data for more than 20,000 calls made over a period of 13 weeks. Word spotting engine 124 receives keyphrases of interest 302 and evaluates prior audio data 304 to generate data 306 characterizing putative occurrences of one or more of the keyphrases of interest 302 in the prior audio data 304.
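For purposes of illustration only, the following is a minimal sketch of locating noun phrases in part-of-speech tagged text. It does not implement the Hidden Markov Model or Transformation-Based Learning chunkers mentioned above; a simple rule-based pattern (optional determiner, then adjectives and nouns, ending in a noun) is swapped in to keep the example self-contained. The function name `chunk_noun_phrases`, the use of Penn Treebank style tags, and the hand-tagged input are assumptions made for this example.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, part-of-speech tag); tags are assumed Penn Treebank style

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
MODIFIER_TAGS = {"JJ"}


def chunk_noun_phrases(tagged: List[TaggedToken]) -> List[str]:
    """Return candidate noun phrases matching (DT)? (JJ | NN*)* NN*.

    Hypothetical rule-based stand-in for a trained chunker.
    """
    phrases: List[str] = []
    current: List[TaggedToken] = []

    def flush() -> None:
        # A phrase must end in a noun, so drop trailing determiners/adjectives.
        while current and current[-1][1] not in NOUN_TAGS:
            current.pop()
        if current:
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged:
        if tag == "DT":
            flush()  # a determiner always starts a new candidate phrase
            current.append((word, tag))
        elif tag in MODIFIER_TAGS or tag in NOUN_TAGS:
            current.append((word, tag))
        else:
            flush()  # verbs, prepositions, etc. end the current candidate
    flush()
    return phrases


# Hand-tagged input corresponding to the phrase "activated the phone".
example = [("activated", "VBD"), ("the", "DT"), ("phone", "NN")]
noun_phrases = chunk_noun_phrases(example)
```

In a fuller pipeline, the tags would come from an automatic tagger and the resulting candidates would still pass through the rule-based filters described above before being accepted as keyphrases of interest.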
Current audio data 310 corresponding to, for instance, recently completed telephone calls are received by word spotting engine 124. Current audio data 310 is searchable data, such as PAT files, representing audio signals. Word spotting engine 124 evaluates current audio data 310 to generate data 312 characterizing putative occurrences of one or more of the keyphrases of interest 302 in the current audio data 310. The data 312 characterizing putative keyphrase occurrences are then processed by analysis engine 126, which generates current keyphrase-specific comparison values relating to a frequency of occurrence, a call metadata parameter (such as call count, call percentage, or total call duration), or a score for each keyphrase in current audio data 310. Analysis engine 126 retrieves stored baseline data 308 and uses robust, non-parametric statistical analysis to compare baseline values 308 with the current keyphrase-specific comparison values to produce trending data 314. Comparisons may be made on the basis of a frequency of occurrence, a call metadata parameter, or a score for each keyphrase of interest.
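As a non-limiting sketch, the keyphrase-specific comparison values and their comparison against baseline values might be computed as follows. The record layout `Hit`, the score threshold `min_score`, and the function names are hypothetical. The description above does not name a particular robust, non-parametric statistic; the median and median absolute deviation (MAD) are used here only as one plausible choice. The sketch also assumes every call referenced by a hit appears in `call_durations`.

```python
import math
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One putative keyphrase occurrence (hypothetical record layout)."""
    call_id: str
    keyphrase: str
    offset_s: float  # time location within the call, in seconds
    score: float     # word spotter confidence, assumed to lie in [0, 1]


def comparison_values(hits: Iterable[Hit],
                      call_durations: Dict[str, float],
                      min_score: float = 0.5) -> Dict[str, Dict[str, float]]:
    """Compute hit count, call count, call percentage, and total call duration."""
    hit_count: Dict[str, int] = defaultdict(int)
    calls = defaultdict(set)
    for h in hits:
        if h.score < min_score:
            continue  # discard low-confidence putative occurrences
        hit_count[h.keyphrase] += 1
        calls[h.keyphrase].add(h.call_id)
    total_calls = len(call_durations)
    return {
        kp: {
            "hit_count": hit_count[kp],
            "call_count": len(ids),
            "call_percentage": 100.0 * len(ids) / total_calls,
            "total_call_duration": sum(call_durations[c] for c in ids),
        }
        for kp, ids in calls.items()
    }


def robust_deviation(current: float, baseline: List[float]) -> float:
    """Distance of the current value from the baseline median, in MAD units."""
    med = median(baseline)
    mad = median([abs(x - med) for x in baseline])
    if mad == 0:
        return 0.0 if current == med else math.copysign(math.inf, current - med)
    return (current - med) / mad


hits = [
    Hit("c1", "new promotion", 12.0, 0.9),
    Hit("c1", "new promotion", 40.0, 0.8),
    Hit("c2", "new promotion", 5.0, 0.7),
    Hit("c3", "new promotion", 3.0, 0.2),  # below threshold, ignored
]
durations = {"c1": 300.0, "c2": 120.0, "c3": 60.0, "c4": 200.0}
current = comparison_values(hits, durations)
# Compare the current call percentage with hypothetical weekly baseline values.
deviation = robust_deviation(current["new promotion"]["call_percentage"],
                             [10.0, 12.0, 11.0, 13.0, 9.0])
```

A large positive deviation would indicate that the keyphrase appears in a notably larger share of calls than during the baseline period.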
Trending data 314 represents a rank or a change in a frequency of occurrence of a keyphrase of interest between prior audio data 304 and current audio data 310. Trending data 314 may also represent a change in a call parameter of keyphrases of interest 302. For instance, trending data 314 may include a ranking or change in ranking of top keyphrases of interest in a given time period, a listing of keyphrases that appeared or disappeared between two time periods, or a change in call duration for calls including a given keyphrase of interest.
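By way of example, trending data of the kinds listed above (changes in ranking, and keyphrases that appeared or disappeared between two time periods) might be derived from per-keyphrase counts as sketched below. The function names and the sample counts are hypothetical and serve only to illustrate the comparison.

```python
from typing import Dict, List, Union

Trend = Dict[str, Union[List[str], Dict[str, int]]]


def rank(counts: Dict[str, int]) -> Dict[str, int]:
    """Rank keyphrases by descending count (1 = most frequent); ties break alphabetically."""
    ordered = sorted(counts, key=lambda kp: (-counts[kp], kp))
    return {kp: i + 1 for i, kp in enumerate(ordered)}


def trending(prior: Dict[str, int], current: Dict[str, int]) -> Trend:
    """Compare keyphrase counts from a prior period with those of a current period."""
    prior_rank, current_rank = rank(prior), rank(current)
    shared = sorted(prior.keys() & current.keys())
    return {
        "appeared": sorted(current.keys() - prior.keys()),
        "disappeared": sorted(prior.keys() - current.keys()),
        # Positive rank change means the keyphrase moved up the ranking.
        "rank_change": {kp: prior_rank[kp] - current_rank[kp] for kp in shared},
        "count_change": {kp: current[kp] - prior[kp] for kp in shared},
    }


# Hypothetical hit counts for two time periods.
prior_counts = {"billing": 40, "upgrade": 25, "roaming": 10}
current_counts = {"upgrade": 60, "billing": 35, "new promotion": 20}
trend = trending(prior_counts, current_counts)
```

The same comparison could be applied to a call metadata parameter, such as total call duration per keyphrase, in place of raw counts.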
In some instances, portions of prior and/or current audio data may not be relevant to an analysis of a set of audio signals. For instance, in contact center applications, on-hold messages, such as advertising and promotions, and interactive voice response (IVR) messages skew the statistics of keyphrase detection and trend discovery. Automatically filtering out repeated segments, such as on-hold and IVR messages, by a clip spotting process (e.g., using optional clip spotting engines 305, 311 in
Analysis engine 126 evaluates data corresponding to the putative occurrences 1012 of keyphrases to identify co-occurrences 1012 of spoken instances of the keyphrases. For example, analysis engine 126 records a number of times a spoken instance of a first keyphrase occurs within a predetermined time range of another keyphrase. The analysis engine 126 generates a co-occurrence vector for the first keyphrase representing co-occurring keyphrases, which vector is stored for further processing. The strength of association can also be weighted by measuring the distance apart of spoken instances of two keyphrases at each instance of co-occurrence. In some embodiments, analysis engine 126 compares the co-occurrence vector for a keyphrase in a first set of audio data with a co-occurrence vector for that keyphrase in a prior set of audio data to detect trends in co-occurring keyphrases of interest, such as changes in keyphrases that form salient pairs over time. User interface engine 128 receives co-occurrence vectors and generates a visual representation 1016 of at least some of the salient pairs.
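The recording of co-occurrences within a predetermined time range can be sketched as follows. This is an illustrative example only: the tuple layout `Occurrence`, the function name, and the 30-second default window are assumptions, and the optional distance-based weighting mentioned above is omitted for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Occurrence = Tuple[str, str, float]  # (call_id, keyphrase, offset in seconds); hypothetical layout


def cooccurrence_vectors(occurrences: Iterable[Occurrence],
                         window_s: float = 30.0) -> Dict[str, Dict[str, int]]:
    """Count how often each keyphrase is spoken within window_s seconds of another."""
    by_call: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for call_id, keyphrase, offset in occurrences:
        by_call[call_id].append((offset, keyphrase))

    vectors: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for events in by_call.values():
        events.sort()  # order occurrences by time within the call
        for i, (t_i, kp_i) in enumerate(events):
            for t_j, kp_j in events[i + 1:]:
                if t_j - t_i > window_s:
                    break  # later events are even farther away
                if kp_i != kp_j:
                    vectors[kp_i][kp_j] += 1
                    vectors[kp_j][kp_i] += 1
    return {kp: dict(row) for kp, row in vectors.items()}


occurrences = [
    ("c1", "blackberry", 10.0),
    ("c1", "outlook exchange", 25.0),
    ("c1", "billing", 200.0),  # too far from the other occurrences to co-occur
    ("c2", "blackberry", 5.0),
    ("c2", "outlook exchange", 12.0),
]
vectors = cooccurrence_vectors(occurrences)
```

Each entry of the returned mapping plays the role of the co-occurrence vector for one keyphrase; vectors computed for two time periods can then be compared to detect changes in salient pairs.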
Similarly, analysis engine 126 may be configured to detect clusters of keyphrases of interest that tend to be spoken within a same conversation or within a predetermined time range of each other. As an example, a contact center for a mobile phone company may receive phone calls in which a customer inquires about several models of smartphones. A clustering analysis would then detect a cluster of keyphrases of interest related to the names of smartphone models. Analysis engine 126 may also detect trends in clusters of keyphrases, such as changes in keyphrases that form clusters over a period of time.
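The description does not specify a particular clustering algorithm. As one possible sketch, clusters may be approximated as connected components of a graph whose edges join keyphrases that co-occur at least a minimum number of times. The input format (nested dictionaries of co-occurrence counts), the threshold `min_count`, and the keyphrase names below are assumptions made for this example.

```python
from typing import Dict, List, Set


def keyphrase_clusters(vectors: Dict[str, Dict[str, int]],
                       min_count: int = 2) -> List[List[str]]:
    """Group keyphrases into connected components of the thresholded co-occurrence graph."""
    adjacency: Dict[str, Set[str]] = {}
    for a, row in vectors.items():
        for b, count in row.items():
            if count >= min_count:
                adjacency.setdefault(a, set()).add(b)
                adjacency.setdefault(b, set()).add(a)

    seen: Set[str] = set()
    clusters: List[List[str]] = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects every keyphrase reachable from start.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


# Hypothetical co-occurrence counts; model names are placeholders.
counts = {
    "phone alpha": {"phone beta": 3, "billing": 1},
    "phone beta": {"phone alpha": 3, "phone gamma": 2},
    "phone gamma": {"phone beta": 2},
    "billing": {"phone alpha": 1, "refund": 4},
    "refund": {"billing": 4},
}
clusters = keyphrase_clusters(counts)
```

Raising `min_count` yields smaller, more tightly associated clusters; comparing the clusters found in two time periods gives one way to observe changes in cluster membership.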
Tracking of clusters and salient pairs of co-occurring keyphrases of interest may be useful to aid a user (e.g., a contact center agent) in responding to customer concerns or in generating new queries. As an example, the keyphrases “BlackBerry®” and “Outlook® Exchange” may have a high co-occurrence rate during a particular week. When a contact center agent fields a call from a customer having problems using his BlackBerry®, the contact center agent performs a query on the word BlackBerry®. In response, the system suggests to the contact center agent that the customer's problems may be related to email and provides links to BlackBerry®-related knowledge articles that help troubleshoot Outlook® Exchange issues. More generally, tracking of clusters and salient pairs of co-occurring keyphrases of interest aids a user in identifying problem areas for troubleshooting or training.
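A minimal sketch of how such a suggestion might be produced from stored co-occurrence counts is shown below. The function name `suggest_related`, the lower-casing of the query, and the sample counts are hypothetical; retrieval of the linked knowledge articles is outside the scope of the example.

```python
from typing import Dict, List


def suggest_related(query: str,
                    vectors: Dict[str, Dict[str, int]],
                    top_n: int = 3) -> List[str]:
    """Return the keyphrases that co-occur most often with the queried keyphrase."""
    row = vectors.get(query.lower(), {})
    ranked = sorted(row.items(), key=lambda item: (-item[1], item[0]))
    return [keyphrase for keyphrase, _ in ranked[:top_n]]


# Hypothetical co-occurrence counts for one week of calls.
weekly_vectors = {
    "blackberry": {"outlook exchange": 12, "battery": 4, "billing": 1},
}
suggestions = suggest_related("BlackBerry", weekly_vectors, top_n=2)
```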
To enable the keyphrase generation engine to more readily identify keyphrases of interest specific to a particular company or application, a custom language model (LM) may be used instead of the generic LM described above. A custom LM is developed by training with relevant textual resources such as company websites, marketing literature, and training materials. In addition, a user may provide a list of important phrases obtained from, e.g., existing taxonomy, structured queries, and listener queries.
The use of a custom LM to evaluate a text source increases recall over the use of a baseline LM and a standard search.
The above-discussed trend discovery techniques are generally applicable to a number of contact center implementations in which various types of hardware and network architectures may be used.
Traditionally, a customer 1102 contacts a contact center by placing telephone calls through a telecommunication network, for example, via the public switched telephone network (PSTN) 1106. In some implementations, the customer 1102 may also contact the contact center by initiating data-based communications through a data network 1108, for example, via the Internet using voice over internet protocol (VoIP) technology.
Upon receiving an incoming request, a control module 1120 in the monitoring and processing engine 1110 uses a switch 1124 to route the customer call to a contact center agent 1104. Once call connections are established, a media gateway 1126 is used to convert voice streams transmitted from the PSTN 1106 or from the data network 1108 into a suitable form of media data compatible for use by a media processing module 1134.
In many situations, the media processing module 1134 records voice calls received from the media gateway 1126 and stores them as media data 1144 in a storage module 1140. Some implementations of the media processing module are further configured to process the media data to generate non-audio based representations of the media files, such as phonetic audio track (PAT) files that provide a searchable phonetic representation of the media files, based on which the content of the media can be conveniently searched. Those non-audio based representations are stored as metadata 1142 in the storage module 1140.
The monitoring and processing engine 1110 also includes a call management module 1132 that obtains descriptive information about each voice call based on data supplied by the control module 1120. Examples of such information include caller identifiers (e.g., phone number, IP address, customer number), agent identifiers, call duration, transfer records, day and time of the call, and general categorization of calls (e.g., as determined based on touchtone input), all of which can be saved as metadata 1142 in the storage module 1140.
Data in the storage module 1140 can be accessed by the application engine 1160 over a data network 1150. Depending on the particular implementation, the application engine 1160 may employ a set of functional modules, each configured for a different analytic task. For example, the application engine 1160 may include a trend discovery module 1170 that provides trend discovery functionality in a manner similar to the data management system 100 described above with reference to
Note that this embodiment of contact center service platform 1100 offers an integration of telecommunication-based and data-based networks to enable user interactions in different forms, including voice, Web communication, text messaging, and email. Examples of telecommunication networks include both fixed and mobile telecommunication networks. Examples of data networks include local area networks (“LANs”) and wide area networks (“WANs”), e.g., the Internet, and include both wired and wireless networks.
Also, customers and agents who interact on the same contact center service platform 1100 do not necessarily have to reside within the same physical or geographical region. For example, a customer located in the U.S. may be connected to an agent who works at an outsourced contact center in India.
In some examples, each of the two service engines 1110 and 1160 may reside on a separate server and individually operate in a centralized manner. In some other examples, the functional modules of a service engine may be distributed onto multiple hardware components, between which data communication channels are provided. Although in the example of
The techniques described herein can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The techniques can be implemented as a computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers. A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
Method steps of the techniques described herein can be performed by one or more programmable processors executing a computer program to perform functions of the invention by operating on input data and generating output. Method steps can also be performed by, and apparatus of the invention can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). Modules can refer to portions of the computer program and/or the processor/special circuitry that implements that functionality.
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in special purpose logic circuitry.
To provide for interaction with a user, the techniques described herein can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer (e.g., interact with a user interface element, for example, by clicking a button on such a pointing device). Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
The techniques described herein can be implemented in a distributed computing system that includes a back-end component, e.g., as a data server, and/or a middleware component, e.g., an application server, and/or a front-end component, e.g., a client computer having a graphical user interface and/or a Web browser through which a user can interact with an implementation of the invention, or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet, and include both wired and wireless networks.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact over a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Other embodiments are within the scope of the following claims.
This application is related to the following patent applications, the contents of which are incorporated herein by reference: application Ser. No. 12/490,757, filed Jun. 24, 2009, and entitled “Enhancing Call Center Performance” (Attorney Docket No. 30004-041001); Provisional Application Ser. No. 61/231,758, filed Aug. 6, 2009, and entitled “Real-Time Agent Assistance” (Attorney Docket No. 30004-042P01); and Provisional Application Ser. No. 61/219,983, filed Jun. 24, 2009, and entitled “Enterprise speech intelligence analysis” (Attorney Docket No. 30004-043P01).