This invention relates to speaker identification.
Speaker “diarization” of an audio recording of a conversation is a process for partitioning the recording according to a number of speakers participating in the conversation. For example, an audio recording of a conversation between two speakers can be partitioned into a number of portions, with some of the portions corresponding to a first speaker of the two speakers speaking and others of the portions corresponding to a second speaker of the two speakers speaking.
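For illustration, a diarized recording of this kind can be represented as a simple data structure. The following sketch is illustrative only; the names (`Segment`, the `"speaker_1"` part labels, and the example times) are not part of the described system:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    part: str        # diarized part label, e.g. "speaker_1" (the speaker's role is not yet known)
    start: float     # start time of the portion, in seconds
    end: float       # end time of the portion, in seconds

# A diarized record is a sequence of portions, each attributed to one part.
record = [
    Segment("speaker_1", 0.0, 4.2),
    Segment("speaker_2", 4.2, 9.7),
    Segment("speaker_1", 9.7, 12.1),
]

# Group the portions by part, yielding one track per speaker.
parts = {}
for seg in record:
    parts.setdefault(seg.part, []).append(seg)
```

Note that at this stage the parts are anonymous: the structure records that two distinct speakers alternate, but not which speaker is which.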
Various post-processing of the diarized audio recording can be performed.
In an aspect, in general, a system includes a first input for receiving a first data representing an interaction among a plurality of parties, the first data identifying a plurality of parts of the interaction and identifying a plurality of segments associated with each part of the plurality of parts, a second input for receiving a second data associating each of one or more labels with one or more corresponding query phrases, a searching module for searching the first data to identify putative instances of the query phrases, and a classifier for labeling the parts of the interaction associated with the identified putative instances of the query phrases with the labels corresponding to the identified query phrases.
Aspects may include one or more of the following features.
The first data may represent an audio signal including the interaction among the plurality of parties. The first data may represent a text based chat log including the interaction among the plurality of parties. The system may include a recording module for forming the first data representing the audio signal including recording an audio signal of the interaction between the plurality of parties, segmenting the audio signal into the plurality of segments, and associating each of the segments with a part of the plurality of parts, wherein each part of the plurality of parts corresponds to one of the parties of the plurality of parties. The recording module may be configured to segment the audio signal according to the different acoustic characteristics of the plurality of parties.
The system may include a recording module for forming the first data representing the text based chat log including logging a textual interaction between the plurality of parties, segmenting the textual interaction into the plurality of segments, and associating each of the segments with a part of the plurality of parts, wherein each part of the plurality of parts corresponds to one of the parties of the plurality of parties.
The searching module may be configured to, for each label of at least some of the one or more labels, search for putative instances of at least some of the one or more query phrases corresponding to the label in at least some of the plurality of segments which are associated with at least some of the plurality of parts. The searching module may include a speech processor and each putative instance is associated with a hit quality that characterizes a quality of recognition of a corresponding query phrase of the one or more query phrases. The searching module may include a wordspotting system. The searching module may include a text processor. At least some of the query phrases may be known to be present in the first data. The first data may be diarized according to the interaction.
In another aspect, in general, a computer implemented method includes receiving a first data representing an interaction among a plurality of parties, the first data identifying a plurality of parts of the interaction and identifying a plurality of segments associated with each part of the plurality of parts, receiving a second data associating each of one or more labels with one or more corresponding query phrases, searching the first data to identify putative instances of the query phrases, and labeling the parts of the interaction associated with the identified putative instances of the query phrases with the labels corresponding to the identified query phrases.
Aspects may include one or more of the following features.
The first data may represent an audio signal comprising the interaction among the plurality of parties. The first data may represent a text based chat log comprising the interaction among the plurality of parties. The method may include forming the first data representing the audio signal including recording an audio signal of the interaction between the plurality of parties, segmenting the audio signal into the plurality of segments, and associating each of the segments with a part of the plurality of parts, wherein each part of the plurality of parts corresponds to one of the parties of the plurality of parties. Segmenting the audio signal into the plurality of segments may include segmenting the audio signal according to the different acoustic characteristics of the plurality of parties.
The method may include forming the first data representing the text based chat log including logging a textual interaction between the plurality of parties, segmenting the textual interaction into the plurality of segments, and associating each of the segments with a part of the plurality of parts, wherein each part of the plurality of parts corresponds to one of the parties of the plurality of parties. Searching the first data may include, for each label of at least some of the one or more labels, searching for putative instances of at least some of the one or more query phrases corresponding to the label in at least some of the plurality of segments which are associated with at least some of the plurality of parts.
Searching the first data may include associating each putative instance with a hit quality that characterizes a quality of recognition of a corresponding query phrase of the one or more query phrases. At least some of the query phrases may be known to be present in the first data. The first data may be diarized according to the interaction.
In another aspect, in general, software stored on a computer-readable medium comprises instructions for causing a data processing system to receive a first data representing an interaction among a plurality of parties, the first data identifying a plurality of parts of the interaction and identifying a plurality of segments associated with each part of the plurality of parts, receive a second data associating each of one or more labels with one or more corresponding query phrases, search the first data to identify putative instances of the query phrases, and label the parts of the interaction associated with the identified putative instances of the query phrases with the labels corresponding to the identified query phrases.
Embodiments may have one or more of the following advantages.
Among other advantages, the speaker identification system can improve the speed and accuracy of searching an audio recording.
Other features and advantages of the invention are apparent from the following description, and from the claims.
In general, the systems described herein process transcriptions of interactions between users of one or more communication systems. For example, the transcriptions can be derived from audio recordings of telephone conversations between users or from text logs of chat sessions between users. The following description relates to one such system which processes call records from a customer service call center. However, the reader will recognize that the system and the techniques applied therein can also be applied to other types of transcriptions of interactions between users such as logs of chat sessions between users.
Referring to
Referring to
One use of a diarized call record 116 such as that shown in
However, one problem associated with a diarized conversation 116 such as that shown in
Referring to
In some examples, the user 328 supplies the cue phrases for the different speaker types (e.g. customer service agent, customer) by using a command such as:
SPEAKER_IDEN(speakerType,phrase(s))
The system 324 processes one or more diarized call records 116 of the database of diarized call records 118 using the cue phrases 326, 330 to generate one or more diarized call records with one or more of the speakers in the call records identified, referred to as speaker ID'd call records 342. The speaker ID'd call records 342 are stored in a database of speaker ID'd call records 332.
Within the query based speaker identification system 324, a diarized call record 116 from the database of diarized call records 118 and the customer service cue phrase 326 are passed to a first speech processor 336 (e.g., a wordspotting system). The first speech processor 336 searches all of the portions of the diarized call record 116 to identify portions which include putative instances of the customer service cue phrase 326. Each identified putative instance includes a hit quality score which characterizes how confident the first speech processor 336 is that the identified putative instance of the customer service cue phrase matches the actual customer service cue phrase 326.
In general, the customer service cue phrase 326 is a phrase that is known to be commonly spoken by customer service agents 104 and to be rarely spoken by customers 102. Thus, it is likely that the portions of the diarized call record 116 which correspond to the customer service agent 104 speaking will include the majority, if not all, of the putative instances of the customer service cue phrase 326 identified by the first speech processor 336. The speaker associated with the portions of the diarized call record 116 which include the majority of the putative instances of the customer service cue phrase 326 is identified as the customer service agent 104. The result of the first speech processor 336 is a first speaker ID'd diarized call record 338 in which the customer service agent 104 is identified.
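The majority-based assignment described above can be sketched as follows. This is a simplified illustration only: the `hits` list of (part, hit quality) pairs stands in for the wordspotter's output, whose exact form is not specified here, and the function and label names are not part of the described system:

```python
from collections import Counter

def label_part_by_hits(hits, label):
    """Given putative instances of a cue phrase as (part, hit_quality) pairs,
    find the part containing the majority of the putative instances and
    return a mapping from that part to the given role label."""
    counts = Counter(part for part, _score in hits)
    if not counts:
        return {}  # no putative instances: no part can be labeled
    majority_part, _count = counts.most_common(1)[0]
    return {majority_part: label}

# Example: most putative instances of the agent cue phrase fall within
# the portions attributed to "speaker_1", so that part gets the agent label.
hits = [("speaker_1", 0.92), ("speaker_1", 0.81), ("speaker_2", 0.40)]
labels = label_part_by_hits(hits, "customer_service_agent")
# labels == {"speaker_1": "customer_service_agent"}
```

A refinement consistent with the hit quality scores described above would be to discard putative instances whose score falls below a threshold before taking the majority.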
The first speaker ID'd diarized call record 338 is provided, along with the customer cue phrase 330, to a second speech processor 340 (e.g., a wordspotting system). The second speech processor 340 searches all of the portions of the first speaker ID'd diarized call record 338 to identify portions which include putative instances of the customer cue phrase 330. As was the case above, each identified putative instance includes a hit quality score which characterizes how confident the second speech processor 340 is that the identified putative instance of the customer cue phrase matches the actual customer cue phrase 330.
In general, the customer cue phrase 330 is a phrase that is known to be commonly spoken by customers 102 and to be rarely spoken by customer service agents 104. Thus, it is likely that the portions of the first speaker ID'd diarized call record 338 which correspond to the customer 102 speaking will include the majority, if not all, of the putative instances of the customer cue phrase 330 identified by the second speech processor 340. The speaker associated with the portions of the first speaker ID'd diarized call record 338 which include the majority of the putative instances of the customer cue phrase 330 is identified as the customer 102. The result of the second speech processor 340 is a second speaker ID'd diarized call record 342 in which the customer service agent 104 and the customer 102 are identified. The second speaker ID'd call record 342 is stored in the database of speaker ID'd call records 332 for later use.
Referring to
Referring to
In some examples, the query 546 specified by the user takes the following form:
Q=(speakerType, phrase(s));
For example, the user 548 may specify a query such as:
Q=(Customer, “I received a letter”);
Within the speaker specific searching system 544, the query 546 and a speaker ID'd diarized call record 550 are provided to a speaker specific speech processor 552 which processes the portions of the speaker ID'd diarized call record 550 which are associated with the speakerType specified in the query to identify putative instances of the phrase(s) included in the query. Each identified putative instance includes a hit quality score which characterizes how confident the speaker specific speech processor 552 is that the identified putative instance of the phrase(s) matches the actual phrase(s) specified by the user. In this way, searching the audio recording 112 is made more efficient and accurate since the searching operation is limited to only those portions of the audio recording 112 which are related to a specific speaker, thereby restricting the search space.
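The restriction of the search to one speaker's portions can be sketched as below. This is an illustrative simplification: the speech processor is mocked by substring matching over text transcripts, and the record layout and field names are assumptions, not part of the described system:

```python
def speaker_specific_search(record, speaker_type, phrase):
    """Search only those segments of a speaker ID'd record that are
    associated with the requested speaker type, returning putative hits."""
    hits = []
    for seg in record:
        if seg["speaker"] != speaker_type:
            continue  # skip other speakers' segments: this restricts the search space
        if phrase.lower() in seg["text"].lower():
            hits.append({"start": seg["start"], "phrase": phrase})
    return hits

# A mock speaker ID'd diarized record with text transcripts per segment.
record = [
    {"speaker": "customer_service", "start": 0.0, "text": "Hi, how may I help you"},
    {"speaker": "customer", "start": 4.2, "text": "I received a letter yesterday"},
]
hits = speaker_specific_search(record, "customer", "I received a letter")
# hits == [{"start": 4.2, "phrase": "I received a letter"}]
```

Because only the customer's segments are examined, a phrase spoken coincidentally by the other party would not produce a hit, which is the accuracy benefit noted above.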
The query result 553 of the speaker specific speech processor 552 is provided to the user 548. In some examples, each of the putative instances, including the quality and temporal location of each putative instance, is shown to the user 548 on a computer screen. In some examples, the user 548 can interact with the computer screen to verify that a putative instance is correct, for example, by listening to the audio recording at and around the temporal location of the putative instance.
Referring to
In some examples, the user 628 supplies the cue phrases for the different speaker types (e.g., customer service agent, customer) by using a command such as:
SPEAKER_IDEN(Customer Service, “Hi, how may I help you”)
or
SPEAKER_IDEN(Customer, “I received a letter”)
In the present example, a diarized call record 616, which is the same as the diarized call record 116 illustrated in
The result 638 of the first speech processor 636 is passed to a second speech processor 640 along with the customer cue phrase 630 (i.e., “I received a letter”). The second speech processor 640 searches the result 638 of the first speech processor 636 for the customer cue phrase 630 and locates a putative instance of the customer cue phrase in the second portion of the result 638. Since the second portion of the result 638 is associated with the second speaker 322, the second speech processor 640 identifies the second speaker 322 as the customer. The result of the second speech processor 640 is a second speaker ID'd diarized call record 642 in which the first speaker 320 is identified as the customer service agent and the second speaker 322 is identified as the customer. The second speaker ID'd call record 642 is stored in a database of speaker ID'd call records 632 for later use.
Referring to
Q=(Customer Service, “I can help you with that”)
Such a query indicates that portions of a diarized call record which are associated with a customer service agent should be searched for putative instances of the term “I can help you with that.”
In the present example, a speaker ID'd diarized call record 750, which is the same as the second speaker ID'd diarized call record 342 of
In some examples, a conversation involving more than two speakers is included in a diarized call record. In other examples, a diarized call record of a conversation between a number of speakers includes more diarized groups than there are speakers.
While the examples described above identify all speakers in a diarized call record, in some examples, it is sufficient to identify less than all of the speakers (i.e., a speaker of interest) in the diarized call record.
The examples described above generally label speaker segregated (i.e., diarized) data by the roles of the speakers as indicated by the presence of user specified queries. However, the speaker segregated data can be labeled according to a number of different criteria. For example, the speaker segregated data may be labeled according to two or more topics discussed by the speakers in the speaker segregated data.
In some examples, the individual tracks (i.e., the single speaker records) of the diarized call records are identified by an automated segmentation process which identifies two or more speakers on the call based on the voice characteristics of the two or more speakers.
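An automated segmentation of this kind can be approximated by clustering per-segment acoustic feature vectors. The sketch below is a deliberate simplification and not the described segmentation process: real systems cluster multidimensional features (e.g., spectral features), whereas here the segments are split on a single mock pitch-like value:

```python
def cluster_two_speakers(features):
    """Assign each segment to one of two clusters by a simple 1-D split:
    segments whose feature lies above the midpoint of the observed range
    go to cluster 1, the rest to cluster 0."""
    lo, hi = min(features), max(features)
    midpoint = (lo + hi) / 2.0
    return [1 if f > midpoint else 0 for f in features]

# Mock per-segment pitch estimates (Hz): two voices with distinct ranges.
pitches = [110.0, 210.0, 115.0, 205.0]
assignments = cluster_two_speakers(pitches)
# assignments == [0, 1, 0, 1]
```

The resulting cluster indices play the role of the anonymous parts (single speaker tracks) of a diarized call record, which the query based identification described above can then label with speaker roles.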
In some examples, the speaker identification system can be used to segregate data into portions that do or do not include sensitive information such as credit card numbers.
While the above description relates to speaker identification in diarized call records recorded at customer service call centers, it is noted that the same techniques can be used to identify the parties in a log of a text interaction (e.g., a chat session) where the parties in the interaction are not labeled. In such a case, rather than using speech processors, text parsing and searching algorithms (e.g., a structured query language) are used.
In some examples, a text interaction between two or more parties includes macros (e.g., automatically generated text) that are used by agents in chat rooms for basic or common interactions. In such examples, a macro may be a valid speaker type.
4 Implementations
Systems that implement the techniques described above can be implemented in software, in firmware, in digital electronic circuitry, or in computer hardware, or in combinations of them. The system can include a computer program product tangibly embodied in a machine-readable storage device for execution by a programmable processor, and method steps can be performed by a programmable processor executing a program of instructions to perform functions by operating on input data and generating output. The system can be implemented in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. Each computer program can be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired; and in any case, the language can be a compiled or interpreted language. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, a processor will receive instructions and data from a read-only memory and/or a random access memory. Generally, a computer will include one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM disks. Any of the foregoing can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Other embodiments are within the scope of the following claims.