1. Field of the Invention
The present invention relates to audio analysis in general and to a method and apparatus for segmenting an audio interaction, in particular.
2. Discussion of the Related Art
Audio analysis refers to the extraction of information and meaning from audio signals for purposes such as word statistics, trend analysis, quality assurance, and the like. Audio analysis can be performed in interaction-intensive working environments, such as call centers, financial institutions, health organizations, public safety organizations, or the like. Typically, audio analysis is used to extract useful information associated with or embedded within captured or recorded audio signals carrying interactions. Audio interactions contain valuable information that can provide enterprises with insights into their business, users, customers, activities, and the like. The extracted information can be used for issuing alerts, generating reports, sending feedback, or the like, and can be further manipulated and processed, for example stored, retrieved, synthesized, or combined with additional sources of information. Extracted information can include, for example, continuous speech, spotted words, identified speakers, emotional (positive or negative) segments within an interaction, and data related to the call flow, such as the number of bursts from each side, segments of mutual silence, or the like.

The customer side of an interaction recorded in a commercial organization can be used for purposes such as trend analysis, competitor analysis, and emotion detection (finding emotional calls) to improve customer satisfaction. The service provider side of such interactions can be used for purposes such as script adherence and emotion detection to track deficient agent behavior. The most common interaction recording format is summed audio, which is the product of analog line recording, observation mode, and legacy systems. A summed interaction may include, in addition to two or more speakers who may at times talk simultaneously (co-speakers), music, tones, background noises on either side of the interaction, or the like.

Audio analysis performance, as measured in terms of accuracy, detection, real-time efficiency, and resource efficiency, depends directly on the quality and integrity of the captured or recorded signals carrying the audio interaction, on the availability and integrity of additional meta-information, on the capabilities of the computer programs that constitute the audio analysis process, and on the available computing resources. Many analysis tasks are highly sensitive to the audio quality of the processed interactions. Multiple speakers, music (often present during hold periods), tones, background noises such as street noise or ambient noise, convolutional noises introduced by the channel type and handset type, keystrokes, and the like severely degrade the performance of the analysis engines, sometimes to the point of complete uselessness, for example in emotion detection, where it is mandatory to analyze only one speaker's speech segments. It is therefore crucial to identify only those segments of an interaction in which a single speaker is speaking. The customary solution is to use an unsupervised speaker segmentation module as part of the audio analysis.
Traditionally, unsupervised speaker segmentation algorithms are based on bootstrap (bottom-up) classification methods, starting with short discriminative segments and extending them using additional, not necessarily adjacent, segments. Initially, a homogeneous speaker segment is located and regarded as an anchor. The anchored segment is used to create an initial model of the first speaker. In the next phase a second homogeneous speaker segment is located, in which the speaker characteristics differ most from the first segment, and this second segment is used to create a model of the second speaker. By deploying an iterative maximum-likelihood (ML) classifier based on the anchored speaker models, all other utterance segments can be roughly classified. The conventional methods suffer from several limitations. The performance of the speaker segmentation algorithm is highly sensitive to the initial phase: a poor choice of the initial (anchored) segment can lead to unreliable segmentation results. Additionally, the methods provide no verification mechanism for assessing the success of the segmentation or the convergence of the method, which would allow poorly segmented interactions to be kept from further processing by audio analysis tools and from producing further inaccurate results. Another drawback is that additional sources of information, such as computer-telephony-integration (CTI) data, screen events, and the like, are not used. Yet another drawback is the inability of the methods to tell which collection of segments belongs to one speaking side, such as the customer, and which belongs to the other, even though different analyses, serving different needs, are performed on each side.
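By way of non-limiting illustration, the following Python sketch outlines the conventional anchor-and-classify flow described above. It is not taken from any particular prior-art system: the fixed segment length, the diagonal-Gaussian speaker models, the homogeneity criterion used to pick the first anchor and the small number of iterations are all simplifying assumptions, and the input is assumed to be a matrix of pre-computed feature vectors such as MFCCs.

```python
import numpy as np

def fit_diag_gauss(frames):
    """Fit a diagonal-covariance Gaussian to a (n_frames, n_dims) array."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6  # small floor avoids zero variance

def mean_loglik(frames, mean, var):
    """Average per-frame log-likelihood under a diagonal Gaussian."""
    return float(np.mean(-0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)))

def bottom_up_segmentation(features, seg_len=100, iterations=5):
    """features: (n_frames, n_dims) feature matrix; returns one 0/1 label per segment."""
    segs = [features[i:i + seg_len]
            for i in range(0, len(features) - seg_len + 1, seg_len)]
    # Anchor for speaker 1: the most homogeneous segment (smallest average variance).
    a1 = int(np.argmin([s.var(axis=0).mean() for s in segs]))
    model1 = fit_diag_gauss(segs[a1])
    # Anchor for speaker 2: the segment least well explained by speaker 1's model.
    a2 = int(np.argmin([mean_loglik(s, *model1) for s in segs]))
    model2 = fit_diag_gauss(segs[a2])
    labels = np.zeros(len(segs), dtype=int)
    for _ in range(iterations):
        # Maximum-likelihood classification of every segment against both models.
        labels = np.array([0 if mean_loglik(s, *model1) >= mean_loglik(s, *model2) else 1
                           for s in segs])
        if labels.min() == labels.max():
            break  # degenerate split; keep the current models and labels
        # Re-estimate each speaker model from the segments currently assigned to it.
        model1 = fit_diag_gauss(np.vstack([s for s, l in zip(segs, labels) if l == 0]))
        model2 = fit_diag_gauss(np.vstack([s for s, l in zip(segs, labels) if l == 1]))
    return labels
```

The sensitivity to the anchor choice is visible in the sketch itself: every later assignment derives from the two initially selected segments, and no score is produced that would indicate whether the final split is trustworthy.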
It should be readily apparent to one with ordinary skill in the art that there is a need for an unsupervised segmentation method and apparatus that segment an unconstrained interaction into segments that should not be analyzed, such as music, tones, low-quality segments or the like, and segments carrying the speech of a single speaker, where segments of the same speaker are grouped or marked accordingly. Additionally, identifying the sides of the interaction is required. The segmentation tool has to be effective, i.e., extract as many and as long segments as possible in which a single speaker is speaking, with as little compromise as possible on reliability, i.e., on the quality of the segments. Additionally, the tool should be fast and efficient, so as not to introduce delays into further processing or place an additional burden on the computing resources of the organization. It is also required that the tool provide a performance estimate that can be used to decide whether the speech segments are to be sent for analysis.
It is an object of the present invention to provide a novel method for speaker segmentation which overcomes the disadvantages of the prior art. In accordance with the present invention, there is thus provided a speaker segmentation method for associating one or more segments of each of two or more sides of one or more audio interactions with one of the sides of the interaction, using additional information, the method comprising: a segmentation step for associating the one or more segments with one side of the interaction, and a scoring step for assigning a score to said segmentation. The additional information can be one or more of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the at least one interaction; or data related to at least one other interaction performed by a speaker of the at least one interaction. The method can further comprise a model association step for scoring the segments against one or more statistical models of one side, and obtaining a model association score. The scoring step can use discriminative information for discriminating between the two or more sides of the interaction. The scoring step can comprise a model association step for scoring the segments against a statistical model of one side, and obtaining a model association score. Within the method, the scoring step can further comprise a normalization step for normalizing the one or more model scores. The scoring step can also comprise evaluating the association of the one or more segments with a side of the interaction, using additional information. The additional information can be one or more of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the at least one interaction; or data related to at least one other interaction performed by a speaker of the at least one interaction. The scoring step can comprise statistical scoring. The method can further comprise: a step of comparing the score to a threshold; and repeating the segmentation step and the scoring step if the score is below the threshold. The threshold can be predetermined or dynamic, or can depend on information associated with said at least one interaction, information associated with at least one speaker thereof, or external information associated with the interaction. The segmentation step can comprise a parameterization step for transforming the speech signal into a set of feature vectors in order to generate data more suitable for statistical modeling; an anchoring step for locating an anchor segment for each side of the interaction; and a modeling and classification step for associating at least one segment with one side of the interaction.
The anchoring step or the modeling and classification step can comprise using additional data, wherein the additional data is one or more of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the at least one interaction; or data related to at least one other interaction performed by a speaker of the at least one interaction. The method can comprise a preprocessing step for enhancing the quality of the interaction, or a speech/non-speech segmentation step for eliminating non-speech segments from the interaction. The segmentation step can comprise scoring the one or more segments with a voice model of a known speaker.
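By way of non-limiting illustration, the following sketch shows the segment-then-score control flow summarized above, assuming that concrete segmentation and scoring routines are supplied by the caller. The names segment_and_verify and Segmentation, the default threshold and the retry count are illustrative assumptions rather than elements of the claimed method.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class Segmentation:
    """A set of single-speaker segments: (start_sec, end_sec, side) triples."""
    segments: List[Tuple[float, float, str]] = field(default_factory=list)
    score: float = 0.0

def segment_and_verify(audio,
                       additional_info: Dict,
                       segment_interaction: Callable[..., Segmentation],
                       score_segmentation: Callable[..., float],
                       threshold: float = 0.5,
                       max_attempts: int = 3) -> Optional[Segmentation]:
    """Run the segmentation step, assign it a score, and accept, retry or reject,
    so that poorly segmented interactions are not passed on to the analysis engines."""
    for attempt in range(max_attempts):
        seg = segment_interaction(audio, additional_info, attempt=attempt)
        seg.score = score_segmentation(seg, audio, additional_info)
        if seg.score >= threshold:
            return seg   # segmentation accepted
    return None          # segmentation rejected after repeated attempts
```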
Another aspect of the disclosed invention relates to a speaker segmentation apparatus for associating one or more segments of each of two or more speakers participating in one or more audio interactions with a side of the interaction, using additional information, the apparatus comprising: a segmentation component for associating one or more segments within the interaction with one side of the interaction; and a scoring component for assigning a score to said segmentation. Within the apparatus, the additional information can be one or more of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the interaction; or data related to one or more other interactions performed by a speaker of the interaction.
Yet another aspect of the disclosed invention relates to a quality management apparatus for interaction-rich environments, the apparatus comprising: a capturing or logging component for capturing or logging one or more audio interactions; a segmentation component for segmenting the interactions; and a playback component for playing one or more parts of the one or more audio interactions.
The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the drawings in which:
The present invention overcomes the disadvantages of the prior art by providing a novel method and system for locating segments within an audio interaction in which a single speaker is speaking, dividing the segments into two or more groups, wherein all segments in a group belong to the same speaker, and discriminating in which group of segments a certain participant or participant type, such as a service representative (agent) of an organization, is speaking, and in which group another participant or participant type, such as a customer, is speaking. The disclosed invention utilizes additional types of data collected in interaction-intensive environments, such as call centers, financial institutions or the like, in addition to the captured or recorded audio interactions, in order to enhance the segmentation and the association of a group of segments with a specific speaker or speaker type, such as an agent, a customer or the like. The discussion below is oriented toward applications involving commerce or service, but the method is applicable to any required domain, including public safety, financial organizations such as trade floors, health organizations and others.
The information includes raw information, such as meta-data, as well as information extracted by processing the interactions. Raw information includes, for example, Computer Telephony Integration (CTI) information, which includes hold periods, number called, DNIS, VDN, ANI or the like, agent details, screen events related to the current or other interactions with the customer, information exchanged between the parties, and other relevant information that can be retrieved from external sources such as CRM data, billing information, workflow management, mail messages and the like. The extracted information can include, for example, certain words spotted within the interaction, such as greetings, compliance phrases or the like, continuous speech recognition output, emotion detected within an interaction, and call flow information, such as bursts of one speaker while the other speaker is talking, mutual silence periods and others. Other data used includes, for example, voice models of single or multiple speakers.
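By way of non-limiting illustration, one possible container for the kinds of raw and extracted information listed above is sketched below. The field names and types are assumptions made for the example and do not exhaust the sources mentioned; CRM data, billing information, workflow management records or mail messages could be attached in the same manner.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class SpottedWord:
    text: str
    time_sec: float
    confidence: float  # certainty level reported by the word-spotting engine

@dataclass
class AdditionalInfo:
    # Raw CTI / telephony and desktop metadata
    ani: Optional[str] = None                  # calling number
    dnis: Optional[str] = None                 # dialed number identification service
    vdn: Optional[str] = None
    agent_id: Optional[str] = None
    hold_periods: List[Tuple[float, float]] = field(default_factory=list)
    screen_events: List[Tuple[float, str]] = field(default_factory=list)
    # Information extracted by processing the interaction
    spotted_words: List[SpottedWord] = field(default_factory=list)
    emotion_segments: List[Tuple[float, float, float]] = field(default_factory=list)  # start, end, level
    mutual_silence: List[Tuple[float, float]] = field(default_factory=list)
```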
The collected data is used in the process of segmenting the audio interaction in a number of ways. First, the information can be used to obtain an accurate anchor point for the initial selection of a segment of a single speaker. For example, a segment in which a compliance phrase was spotted can be a good anchor point for one speaker, specifically the agent, while a highly emotional segment can be used as an anchor for the customer side. Such information can be used during the classification of segments into speakers, and also for a posteriori assessment of the performance of the segmentation. Second, the absence or presence, and the certainty level, of specific events within the segments of a certain speaker can contribute to discriminating the agent side from the customer side, and also to assessing the performance of the segmentation. For example, the presence of both compliance sentences and typical customer-side noises (such as a barking dog) in segments allegedly belonging to the same speaker can suggest a deficient segmentation. The discrimination of the speakers can be enhanced by utilizing agent-customer-discriminating information, such as screen events, emotion levels, and voice models of a specific agent, a specific customer, a group of agents, a universal agent model or a universal customer model. If segments attributed to one side have a high probability of complying with a specific agent's characteristics or with a universal agent model, relating those segments to the agent side will receive a higher score, and vice versa. Thus, the segmentation can be assessed and, according to the assessment result, accepted, rejected or repeated.
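By way of non-limiting illustration, the following sketch makes two of these uses concrete: selecting an agent-side anchor from a spotted compliance phrase, and deciding which segment group belongs to the agent by comparing how much better each group fits a universal agent voice model than a universal customer model. The phrase list, the SpottedWord-like objects from the previous sketch and the externally supplied model-scoring callables are all assumptions.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Illustrative agent-side phrases; a real deployment would configure its own list.
COMPLIANCE_PHRASES = ("this call may be recorded", "how can i help you")

def agent_anchor_from_words(spotted_words: Iterable) -> Optional[Tuple[float, float]]:
    """Return (time_sec, confidence) of the best compliance-phrase hit, if any,
    to be used as an agent-side anchor. Expects SpottedWord-like objects."""
    hits = [(w.time_sec, w.confidence) for w in spotted_words
            if any(p in w.text.lower() for p in COMPLIANCE_PHRASES)]
    return max(hits, key=lambda h: h[1]) if hits else None

def label_sides(group_a, group_b,
                agent_model_score: Callable[..., float],
                customer_model_score: Callable[..., float]) -> Dict[str, str]:
    """Decide which segment group is the agent by comparing the margin between
    universal agent and universal customer voice-model scores for each group."""
    margin_a = agent_model_score(group_a) - customer_model_score(group_a)
    margin_b = agent_model_score(group_b) - customer_model_score(group_b)
    if margin_a >= margin_b:
        return {"A": "agent", "B": "customer"}
    return {"A": "customer", "B": "agent"}
```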
In step 232 the method further uses additional data evaluation in order to evaluate the contribution of each segment attributed to a certain speaker. Additional data can include spotted words that are typical of a certain side, such as “how can I help you” on the agent side and “how much would that cost” on the customer side, CTI events, screen events, external or internal information, or the like. The presence, possibly associated with a certainty level, of such events in segments associated with a specific side is accumulated or otherwise combined into a single additional-data score. The scores of statistical scoring 204, model association 212 and additional data scoring 232 are combined at step 236, and a general score is issued. If the score is below a predetermined threshold, as is evaluated at step 140 of
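By way of non-limiting illustration, the combination at step 236 could be as simple as a weighted sum, as in the sketch below; the weights, the assumed 0-to-1 score scaling and the threshold value are illustrative choices rather than values prescribed by the method.

```python
def combine_scores(statistical: float,
                   model_association: float,
                   additional_data: float,
                   weights: tuple = (0.4, 0.4, 0.2)) -> float:
    """Weighted combination of the three per-segmentation scores, each assumed in [0, 1]."""
    w_stat, w_model, w_extra = weights
    return w_stat * statistical + w_model * model_association + w_extra * additional_data

def accept_segmentation(general_score: float, threshold: float = 0.6) -> bool:
    """Threshold test: only sufficiently well-scored segmentations go on to analysis."""
    return general_score >= threshold
```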
As mentioned above in relation to the statistical model scoring, and as is applicable to all types of data, the same data item should not be used in the scoring phase if it was already used during the segmentation phase. Using the same data item in both phases would bias the results and give a higher, unjustified score to a certain segmentation. For example, if the phrase “Company X good morning” was spotted at a certain location, and the segment in which it appeared was used as an anchor for the agent side, considering this phrase during the additional data scoring step would raise the score artificially, since it is already known that the segment in which the phrase was said is associated with the agent side.
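By way of non-limiting illustration, the sketch below applies this rule by filtering out spotted-word events that were already consumed as anchors during the segmentation phase before the additional-data score is computed, so that the evidence counted is independent of the evidence that produced the segmentation. The interfaces are assumed for the example.

```python
from typing import Callable, Iterable, Set, Tuple

def score_additional_data(word_events: Iterable[Tuple[float, str]],
                          side_at: Callable[[float], str],
                          anchor_times: Set[float]) -> float:
    """Fraction of independent spotted-word events that fall on their expected side.
    word_events: (time_sec, expected_side) pairs, e.g. a compliance phrase -> 'agent';
    side_at: maps a time to the side the segmentation assigned at that moment;
    anchor_times: times of events already consumed as anchors during segmentation,
    which are excluded here so they cannot artificially inflate the score."""
    usable = [(t, side) for t, side in word_events if t not in anchor_times]
    if not usable:
        return 0.5  # neutral score when no independent evidence remains
    agree = sum(1 for t, side in usable if side_at(t) == side)
    return agree / len(usable)
```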
It will be appreciated by people skilled in the art that some of the presented methods and scorings can be partitioned differently over the described steps without significant change in the results. It will also be appreciated by people skilled in the art that additional scoring methods can exist and be applied in addition to, or instead of, the presented scoring. The scoring method can be applied to the results of any segmentation method, and not necessarily the one presented above. Also, different variations can be applied to the segmentation and the scoring methods as described, without significant change to the proposed solution. It will further be appreciated by people skilled in the art that the disclosed invention can be extended to segmenting an interaction between more than two speakers, without significant changes to the described method. The described rules and parameters, such as the acceptable score values, stopping criteria for the segmentation and the like, can be predetermined or set dynamically. For example, the parameters can take into account the type or length of the interaction, the customer type as received from an external system, or the like.
The disclosed invention provides a novel approach to segmenting an audio interaction and associating each group of segments with one speaker. The disclosed invention further provides a scoring and control mechanism over the quality of the resulting segmentation. The system
It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention is defined only by the claims which follow.