The present invention relates to a method of generating a single-channel audio signal representing a multi-party conversation. It has particular utility in recording conversations carried by enterprise voice systems such as teleconferencing systems, call centre systems and trading room systems.
Automatic speech analytics (SA) for contact-centre interactions can be used to understand the drivers of customer experience, assess agent performance and conformance, and to perform root-cause analysis. An important element of speech analytics is the automatic production of a transcript of a conversation which includes an indication of who said what (or the automatic production of a transcript of what a particular party to the conversation said).
US patent application US 2015/0025887 teaches that each conversation in a contact centre is recorded in a mono audio file. Recording conversations in mono audio files is a great deal cheaper than recording conversations in files having separate audio channels for different speakers. However, recording in mono introduces a need, in any subsequent analysis of the recorded conversation, to separate the talkers in the conversation. The above patent application achieves this using a transcription process in which blind diarization (establishing which utterances were made by the same person) is followed by speaker diarization (establishing the identity of that person). The blind diarization uses clustering to identify models of the speech of each of the participants in the multi-party conversation. A hidden Markov model is then used to establish which participant said each utterance in the recorded conversation. The speaker models are then compared with stored voiceprints to establish the identity of each of the participants in the conversation.
There is a need to provide enterprise voice systems which avoid the complexity of training the system to recognise voiceprints of all the users of those systems whilst still enabling talker separation and the use of single-channel audio recording. The use of single-channel audio recording reduces memory and bandwidth costs associated with the enterprise voice system.
According to a first aspect of the present invention, there is provided a method of generating a single-channel audio signal representing a multi-party conversation, said method comprising:
receiving a plurality of audio signals representing the voices of respective participants in the multi-party conversation, and for at least one of the participants, marking the audio signal representing the participant's voice by:
i) finding the current energy in the audio signal representing the participant's voice;
ii) generating a speaker-dependent signal having an energy proportional to the current energy in the audio signal representing the participant's voice; and
iii) adding said speaker-dependent signal to the audio signal representing the participant's voice to generate a marked audio signal;
generating a single-channel audio signal by summing said at least one marked audio signal and any of said plurality of audio signals which have not been marked.
Because, in general, only one person speaks at any one time in a multi-party conversation, and because the energy in a telephony signal increases greatly when the signal contains speech rather than background noise, marking the audio signal of at least one participant as set out above, and thereafter generating a single-channel audio signal by summing the marked audio signal with any audio signals which have not been marked, has the following effect. At points in the multi-party conversation where the at least one participant is speaking, the speaker-dependent signal for that participant contains sufficient energy, in comparison to the other signals in the single-channel audio signal, to render it detectable by a subsequent diarization process, despite it being mixed with the audio signals representing the input of the other participant or participants to the multi-party conversation (that input often being merely background noise).
In some embodiments, said speaker-dependent signal is generated from a predetermined speaker identification signal. This simplifies generating a speaker-dependent signal with an energy which is proportional to the energy in the speaker's audio signal, measured over whatever time period the added speaker-dependent signal extends over. In particular, the speaker identification signal, or a portion of the speaker identification signal added during an energy analysis time period, can be scaled by an amount proportional to the energy found in the audio signal over that energy analysis time period to generate said speaker-dependent signal.
In some embodiments, said speaker identification, speaker-dependent and audio signals comprise digital signals. This allows the use of digital signal processing techniques.
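By way of illustration only, the following sketch (which does not form part of the application; the block length, identification signal and target ratio are assumed values) shows the energy-proportional scaling described above carried out with simple digital signal processing:

```python
import numpy as np

def mark_block(audio_block, ident_signal, target_ratio_db=30.0):
    """Scale the predetermined speaker identification signal so that its energy is a
    fixed fraction of the energy found in this analysis block, then add it to the block.
    The 30 dB target ratio is an illustrative assumption, not a value from the application."""
    e_audio = float(np.dot(audio_block, audio_block))             # energy over the analysis period
    e_ident = float(np.dot(ident_signal, ident_signal)) + 1e-12   # guard against a silent identification signal
    gain = np.sqrt(e_audio * 10.0 ** (-target_ratio_db / 10.0) / e_ident)
    return audio_block + gain * ident_signal                      # the marked audio signal
```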
In some embodiments, the speaker identification signal comprises a digital watermark. A digital watermark has the advantage of being imperceptible to a person who listens to the marked audio signal—such as one or more of the participants to the conversation in embodiments where the marked audio signal is generated in real time, or someone who later listens to a recording of the multi-party conversation.
In some cases, the speaker identification signal is a pseudo-random bit sequence. The pseudo-random bit sequence can be derived from a maximal length code—this has the advantage of yielding an autocorrelation of +N for a shift of zero and −1 for all other integer shifts for a maximal length code of length N; shifted versions of a maximal length code may therefore be used to define a set of uncorrelated pseudo-random codes.
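The following sketch illustrates this autocorrelation property; the particular 5-stage linear feedback shift register, its taps and its seed are assumed choices made only for illustration:

```python
import numpy as np

def mls31(seed=0b00001, taps=(5, 2)):
    """Generate a 31-chip maximal length sequence (mapped to +1/-1) with a
    5-stage Fibonacci LFSR; this tap choice gives a maximal (period-31) sequence."""
    state = [(seed >> i) & 1 for i in range(5)]
    bits = []
    for _ in range(31):
        bits.append(state[-1])
        feedback = state[taps[0] - 1] ^ state[taps[1] - 1]
        state = [feedback] + state[:-1]
    return np.array([1.0 if b else -1.0 for b in bits])

ml = mls31()
# Circular autocorrelation: +31 at zero shift, -1 at every other shift.
for shift in range(31):
    r = int(round(np.dot(ml, np.roll(ml, shift))))
    assert r == (31 if shift == 0 else -1)
```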
Some embodiments further comprise finding the spectral shape of the audio signal over a spectral analysis time period, and then spectrally shaping the speaker-identification signal, or a portion thereof, to generate a speaker-dependent signal whose spectrum is similar to the spectrum of the audio signal representing the at least one participant's voice. This allows a speaker-identification signal with a greater energy to be added whilst remaining imperceptible. A speaker identification signal with greater energy can then be more reliably detected. One method of finding the spectral shape of the audio signal is to calculate linear predictive coding (LPC) coefficients for the audio signal; the speaker-identification signal can then be spectrally shaped by passing it through a linear prediction filter set up with those LPC coefficients.
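A sketch of that approach follows; the LPC order, block length and stand-in signals are assumptions made for illustration, and the Levinson-Durbin helper defined here is reused in later sketches:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coefficients(frame, order=10):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion. Returns the
    analysis polynomial A = [1, a1, ..., a_order]; the synthesis filter is 1/A(z)."""
    r = [float(np.dot(frame, frame))]
    for lag in range(1, order + 1):
        r.append(float(np.dot(frame[:-lag], frame[lag:])))
    a, err = [1.0], r[0] + 1e-12                      # guard against an all-zero frame
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        err *= (1.0 - k * k)
    return np.array(a)

# Colour a white, pseudo-random identification signal like the current speech block.
rng = np.random.default_rng(0)
speech_block = lfilter([1.0], [1.0, -0.9], rng.standard_normal(155))  # stand-in for a block of speech
ident = rng.choice([-1.0, 1.0], size=155)                             # stand-in identification signal
A = lpc_coefficients(speech_block, order=10)
shaped_ident = lfilter([1.0], A, ident)   # now has roughly the spectral shape of the speech block
```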
In order to allow analysis of the single-channel audio signal to be performed at a later time, in some embodiments, the single-channel audio signal is recorded in a persistent storage medium.
According to a second aspect of the present invention, there is provided a method of processing a single-channel audio signal representing a multiparty conversation to identify the current speaker, said single-channel audio signal having been generated using a method according to the first aspect of the present invention, said method comprising processing said signal to recognise the presence of a speaker-dependent signal based on a predetermined speaker identification signal in said single-channel audio signal.
By processing said single-channel recording to recognise the speaker-identification signal used in generating the speaker-dependent signal in the single-channel audio signal and thereby identify the current speaker, automatic analysis of the conversation is enabled whilst avoiding the need for finding and storing voiceprints of possible participants in the multi-party conversation.
There now follows, by way of example only, a description of one or more embodiments of the invention. This description is given with reference to the accompanying drawings.
In a first embodiment, an IP-based voice communications network is used to deploy and provide a contact centre for an enterprise.
The IP-based voice communications network includes a plurality of customer service agent computers (24A-24D), each of which is provided with a headset (26A-26D). A local area network 23 interconnects the customer service agent computers 24A-24D with a call control server computer 28, a call analysis server 30, the router 12 and the PSTN gateway 18.
Each of the customer service agents' personal computers comprises a central processing unit 40 which is communicatively coupled, via a communications bus 46, to the other components of the computer, including a hard disk 60.
Also communicatively coupled to the central processing unit 40 via the communications bus 46 are a network interface card 48 and a USB interface card 50. The network interface card 48 provides a communications interface between the customer service agent's computer 24A-24D and the local area network 23. The USB interface card 50 provides for communication with the headset 26A-26D used by the customer service agent in order to converse with customers of the enterprise who telephone the call centre (or who are called by a customer service agent; this embodiment can be used in both inbound and outbound contact centres).
The hard disk 60 of each customer service agent computer 24A-24D stores:
i) an operating system program 62;
ii) a speech codec 64;
iii) a watermark insertion module 66;
iv) an audio-channel mixer 68;
v) a media file recorder 70;
vi) a media file uploader 72;
vii) one or more media files 73 storing audio representing conversations involving the agent;
viii) a set of agent names and associated pseudo-random sequences 74; and
ix) a target signal-to-watermark ratio 76.
Some or all of the modules ii) to vi) might be provided by a VOIP telephony client program installed on each agent's laptop computer 24A-24D.
The call analysis server 30 comprises a central processing unit 80 which is communicatively coupled, via a communications bus 86, to the other components of the server, including a hard disk 90.
Also communicatively coupled to the central processing unit 80 via the communications bus 86 is a network interface card 88 which provides a communications interface between the call analysis server 30 and the local area network 23.
The hard disk 90 of the call analysis server 30 stores:
i) an operating system program 92,
ii) a call recording file store 94,
iii) a media file diarization module 96, and
iv) a media file diarization database 98 populated by the diarization module 96.
One or more of the modules ii) to iv) might be provided as part of a speech analytics software application installed on the call analysis server. The speech analytics software might be run on an application server, and communicate with a browser program on a personal computer via a web server, thereby allowing the remote analysis of the data stored at the call analysis server.
In order to allow subsequent processing to identify watermarked digital audio as representing the voice of a given agent, the media file diarization database comprises:
i) an agent table which associates each agent (identified by an Agent ID) with the pseudo-random index sequence allocated to that agent;
ii) an indexed maximal length sequence table which stores, against each index value k, the corresponding maximal length code; and
iii) a diarization table which is populated by the media file diarization module 96 as it identifies utterances in an audio file, and, where possible, attributes those utterances to a particular person. A row is created in the diarization table for each newly identified utterance, each row giving (where identified) the Agent ID of the agent who said the utterance, the name of the file in which the utterance was found, and the start time and end time of that utterance (typically expressed as positions in the named audio file).
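A simple sketch of these three structures follows; the field names and example values are hypothetical, chosen only to illustrate the relationships described above:

```python
import numpy as np
from dataclasses import dataclass
from typing import Dict, List, Optional

rng = np.random.default_rng(0)

# Agent table: Agent ID -> the pseudo-random sequence of twenty ML-code indices
# allocated to that agent (the values here are made up for illustration).
agent_table: Dict[str, List[int]] = {
    "ID1": list(map(int, rng.integers(1, 32, size=20))),
    "ID2": list(map(int, rng.integers(1, 32, size=20))),
}

# Indexed maximal length sequence table: index k -> 31-chip code. A real system would
# store shifted versions of one true maximal length code (see the earlier sketch);
# a random +/-1 placeholder keeps this snippet self-contained.
base_code = np.where(rng.integers(0, 2, size=31) == 1, 1.0, -1.0)
ml_table: Dict[int, np.ndarray] = {k: np.roll(base_code, -(k - 1)) for k in range(1, 32)}

# Diarization table: one row per utterance found in a recording.
@dataclass
class DiarizationRow:
    agent_id: Optional[str]   # None if the utterance could not be attributed
    file_name: str
    start_time: float         # position in the named audio file
    end_time: float

diarization_table: List[DiarizationRow] = []
```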
On an audio marking process being started at the agent's computer, the digitised audio received from the headset via the USB port is divided into sub-blocks of thirty-one samples, each of which is processed in turn as described below.
It is to be noted that the digitised audio received from the USB port will represent the voice of the customer service agent, and periods of silence or background noise at other times. In contact centre environments, the level of background noise can be quite high, so in the present embodiment, the headset is equipped with noise reduction technology.
In the present embodiment, the digitised audio is a signal generated by sampling the audio signal from the agent's microphone at an 8 kHz sampling rate. Each sample is a 16-bit signed integer.
The processing of each sub-block of digitised audio begins with the determination 116 of an index value k. The index is set to the value of the mth element of the downloaded unique pseudo-random sequence.
The kth maximal length code from the downloaded indexed maximal length sequence table is then selected 118. Alternatively the maximal length code could be automatically generated by applying k circular leftwards bit shifts to the first maximal length code.
The change in the maximal length code from sub-block to sub-block is used in the present embodiment to avoid the generation of an unwanted artefact in the watermarked digital audio which would otherwise be introduced owing to the periodicity that would be present in the watermark signal were the same watermark signal to be added to each sub-block.
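A sketch of this selection step is given below, assuming (as suggested above) that the kth code is derived by circularly shifting a first maximal length code; the indexing convention is an assumption:

```python
import numpy as np

def select_ml_code(base_code, pr_sequence, m):
    """Look up the index k for the m-th sub-block from the agent's pseudo-random
    sequence (which repeats every identification frame), then derive the k-th
    maximal length code by applying k circular leftwards shifts to the first code."""
    k = pr_sequence[m % len(pr_sequence)]
    return np.roll(base_code, -k)          # negative shift == leftwards rotation
```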
The sub-block is then processed to calculate 120 scaling and spectral shaping parameters to be applied to the selected maximal length sequence to generate the watermark to be added to the sub-block.
The calculation of the scaling and spectral shaping parameters begins with the calculation of linear predictive coding (LPC) coefficients A_{ID1,m} for the current frame of sub-blocks, those coefficients defining an LPC synthesis filter.
Those skilled in the art will understand that the LPC synthesis filter models the filtering provided by the vocal tract of the agent. In other words, the LPC filter models the spectral shape of the agent's voice during the current frame of sub-blocks.
In the present embodiment, the target signal-to-watermark ratio 76 (SWR, expressed in dB) is read from the hard disk 60. The target signal-to-watermark ratio is used, along with the LPC filter coefficients A_{ID1,m}, to determine the gain factor required for use in the scaling of the selected maximal length sequence. The required energy in the watermark signal is first calculated from the energy in the sub-block and the target signal-to-watermark ratio (Equation 1).
The subscripts 'm' and 'ID1' in the audio sample magnitudes Sp_{ID1,m} and the watermark signal values W_{ID1,m} indicate that the values relate to the mth sub-block of audio signal received from a given agent's (Agent ID1's) headset.
The energy of the signal resulting from passing the maximal length code ML_{ID1,m} through an LPC synthesis filter having coefficients A_{ID1,m} is then calculated, and the watermark gain G_{ID1,m} required to scale the energy of the filtered maximal length sequence to the required energy in the watermark signal is found. It will be appreciated that, given the constant ratio between the audio signal energy and the watermark energy, the gain will rise and fall monotonically as the energy in the audio signal sub-block rises and falls.
Returning to the audio marking process, the selected maximal length sequence is spectrally shaped by passing it through an LPC synthesis filter configured with the calculated LPC coefficients A_{ID1,m}.
The spectrally shaped maximal length sequence signal is then scaled 126 by the calculated watermark signal gain G_{ID1,m} to generate a watermark signal. The scaling is the second part of the calculation which provides a watermark signal containing as much power as possible whilst remaining imperceptible to a listener.
The combination of the scaling and spectral shaping of the maximal length sequence is thus in accordance with Equation 2 below.
W_{ID1,m}(n) = G_{ID1,m} · (ML_{ID1,m} Conv A_{ID1,m})(n)   (Equation 2)

where ML_{ID1,m} is the maximal length sequence selected for the mth sub-block (which will in turn depend on the index k for the mth sub-block from this agent), and Conv A_{ID1,m} represents a convolution with an LPC synthesis filter configured with the calculated LPC coefficients A_{ID1,m}.
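Pulling the gain calculation and Equation 2 together for one sub-block, a sketch follows. Since Equation 1 is not reproduced above, the energy relationship E_W = E_Sp · 10^(−SWR/10) used here is an assumption, and the lpc_coefficients helper from the earlier sketch would supply the coefficients A_{ID1,m} calculated for the current frame of sub-blocks:

```python
import numpy as np
from scipy.signal import lfilter

def watermark_subblock(subblock, ml_code, lpc_a, swr_db):
    """Generate W_{ID1,m} per Equation 2 and add it to the 31 audio samples:
    shape the selected ML code with the LPC synthesis filter 1/A(z), then scale
    it so that the watermark energy sits swr_db below the sub-block energy."""
    shaped = lfilter([1.0], lpc_a, ml_code)              # ML_{ID1,m} Conv A_{ID1,m}
    e_sp = float(np.dot(subblock, subblock))             # energy of the speech sub-block
    e_req = e_sp * 10.0 ** (-swr_db / 10.0)              # assumed form of Equation 1
    gain = np.sqrt(e_req / (float(np.dot(shaped, shaped)) + 1e-12))   # G_{ID1,m}
    return subblock + gain * shaped                      # watermarked sub-block (time-domain addition)
```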
The thirty-one values in the watermark signal are then added 128 to the respective thirty-one sample values found in the audio sub-block signal. In other words, the watermark signal is added in the time domain to the audio block signal to generate a watermarked audio block signal.
The watermarked signal sub-block is then sent 129 for VOIP transmission to the customer's telephone (possibly by way of a VOIP-to-PSTN gateway).
A local recording of the conversation between the call centre agent and the customer is then produced by first mixing 130 the watermarked signal sub-block with the digitised customer audio signal, and then storing 131 the resulting mixed digital signal in the local media file 73. It is to be noted that the combined effect of the pseudo-random sequence of twenty index values k and the thirty-one bit maximal length sequences is to produce a contiguous series of agent identification frames in the watermarked audio signal, each of which is six hundred and twenty samples long. In practice, the sequence added differs from one identification frame to the next because of the scaling and spectral shaping of the signal.
A functional block diagram of the agent computer 24A-24D is provided in the accompanying drawings.
With regard to the mixing which takes place at the mixer 160, the digitised customer audio signal will not be synchronized with the digitised agent audio signal. This lack of synchronization will be present at the sampling level (the instants at which the two audio signals are sampled will not necessarily be simultaneous), and, in situations where the customer's audio signal is watermarked, at the sub-block, block and identification frame level. Interpolation can be used to adjust the sample values of the digitised customer audio signal to reflect the likely amplitude of the analog customer audio signal at the sample instants used by the agent's headset. The resulting mixed signal SpC(n) is stored as a single-channel recording in a media file.
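A sketch of one way the interpolation described above could be performed (simple linear interpolation, with the fractional sample offset assumed to be known from elsewhere) is given below:

```python
import numpy as np

def align_customer_audio(customer, frac_offset, n_out):
    """Estimate the customer signal's amplitude at the agent headset's sampling
    instants, which are offset by a fraction of a sample, so that the two
    digitised signals can be mixed sample-by-sample into the single-channel recording."""
    src_times = np.arange(len(customer), dtype=float)
    dst_times = np.arange(n_out, dtype=float) + frac_offset
    return np.interp(dst_times, src_times, customer)

# Usage sketch: mixed = watermarked_agent[:n] + align_customer_audio(customer, 0.37, n)
```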
Returning once again to the audio marking process, the media file 73 storing the single-channel recording is subsequently uploaded by the media file uploader 72 to the call analysis server 30.
The operation of the media file diarization module 96 will now be described.
Following the uploading of the single-channel recordings of each conversation from the agents' computers 24A-24D to the call recording file store 94, a request may be made to diarize a given recording.
On such a request being made, a list of candidate agent IDs, along with the associated pseudo-random sequences, is read 202 from the agent table.
The digital audio samples from the media file to be diarized are then processed to first obtain 206 sub-block synchronisation. This will be described in more detail below.
The sub-block synchronization search begins with a sliding window being positioned at the start of the recorded single-channel audio signal.
Then, a sliding window correlation analysis 256-266 is carried out, with the sliding window having a length of twenty sub-blocks (one 620-sample identification frame) and being slid one sample at a time.
For each of the possible watermarked participants, a correlation measure is then calculated (256-262).
Each correlation measure calculation (256-262) begins with finding 256 the LPC coefficients for the first thirty-one samples of the sliding window. Those thirty-one samples are then passed through the inverse LPC filter 258 which, when the sliding window happens to be in synchrony with the sub-block boundaries used in the watermarking process, will remove the spectral shaping applied to the watermark when the speech of the participant was encoded.
The correlation between the inverse filtered sub-block and each of the thirty-one maximal length sequences is then found 260 using Equation (3) below.
Where k is the maximal length sequence index (working on the hypothesis that the 620-sample identification frame is aligned with the sliding window), ML31(k,n) represents the nth bit of the kth maximal length sequence and SpCRes_m(n) represents the residual signal resulting from passing the recorded single-channel audio signal SpC(n) through the inverse LPC filter.
Because of the working hypothesis that the 620-sample identification frame is aligned with the sliding window, the hypothetical index k of the maximal length sequence for the current sub-block m will be known. To give an example, if this is the first sub-block in the sliding window, and the current outer loop is calculating a correlation measure for Agent ID A, then the index k is the first element of the pseudo-random sequence associated with Agent ID A.
The sub-block correlation measures found in this way are then added to the cumulative correlation measure for the participant currently being considered—with the maximal length sequence selected according to the pseudo-random sequence associated with that participant. The cumulative correlation measure is calculated according to Equation 4 below:
Where the parameter k_m represents the index k for the mth element of the pseudo-random sequence associated with the participant.
Once the cumulative correlation over the twenty sub-blocks for the current participant has been found, the cumulative correlation measure for the current participant is stored 264, and the process is repeated for any remaining possible watermarked participants. Once a cumulative correlation measure has been found for each of the possible participants, a synchronization confidence measure is calculated 264 in accordance with Equation 5 below.
where Max1 is the highest cumulative correlation measure found, and Max2 is the second highest cumulative correlation measure found. On a confidence test 266 finding that Conf(m) is less than or equal to a threshold value, the identification frame synchronisation process is repeated with the sliding window moved on by one sample. On the test 266 finding that Conf(m) exceeds the threshold, identification frame synchronisation (and hence sub-block synchronisation) has been found, and the diarization process moves on to calculating a watermark detection confidence for each subsequent sub-block in the recording.
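Since Equations 3 to 5 are not reproduced above, the following sketch of the synchronization search makes assumptions about their exact form (a sum of products for the sub-block correlation, a sum over the twenty sub-blocks for the cumulative measure, and the ratio of the best to the second-best cumulative measure for the confidence); it reuses the lpc_coefficients helper from the earlier sketch and assumes at least two candidate agents:

```python
import numpy as np
from scipy.signal import lfilter

SUB = 31            # samples per sub-block
FRAME = 20 * SUB    # 620-sample identification frame

def frame_correlation(window, pr_sequence, ml_table, order=10):
    """Cumulative correlation for one candidate over the 20 sub-blocks of a
    hypothetical identification frame (assumed form of Equations 3 and 4)."""
    total = 0.0
    for m in range(20):
        sub = window[m * SUB:(m + 1) * SUB]
        a = lpc_coefficients(sub, order)        # helper from the earlier LPC sketch
        residual = lfilter(a, [1.0], sub)       # inverse (analysis) LPC filter removes the shaping
        k = pr_sequence[m]                      # hypothesised ML index for this sub-block
        total += float(np.dot(ml_table[k], residual))
    return total

def find_sync(recording, candidates, ml_table, threshold=2.0):
    """Slide a one-frame window one sample at a time until the best candidate's
    cumulative correlation exceeds the second best by the confidence threshold
    (Conf = Max1 / Max2 is an assumed form of Equation 5)."""
    for start in range(len(recording) - FRAME + 1):
        window = recording[start:start + FRAME]
        scores = {aid: frame_correlation(window, seq, ml_table)
                  for aid, seq in candidates.items()}
        best, second = sorted(scores.values(), reverse=True)[:2]
        if second > 0 and best / second > threshold:
            return start, max(scores, key=scores.get)
    return None, None        # no synchronization point found
```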
The sub-block watermark detection confidence calculation begins by fetching the next thirty-one samples from the media file. A sliding window correlation calculation (282-290) is then carried out for each possible participant in the conversation.
The sliding window correlation calculation begins with the calculation 282 of the LPC coefficients for the sub-block, and an analysis filter with those coefficients is then used to remove 284 the spectral shaping applied to the watermark in the watermarking process. The correlation of the filtered samples with each of the thirty-one maximal length sequences is then calculated 286 (using equation (3) above) and stored 284. The calculated correlation is added 286 to a sliding window total for the participant, whilst any correlation calculated for the participant for the sub-block immediately preceding the beginning of the sliding window is subtracted 288. The sliding window total for the participant is then stored 290.
Once the sub-block correlation calculation for a given sliding window position is complete, a confidence measure is calculated 292 in accordance with Equation 5 above (though in some embodiments, a different threshold is used). As explained above in relation to the synchronization search, the confidence measure compares the highest sliding-window correlation total with the second highest; where the confidence measure exceeds the threshold, the sub-block is attributed to the participant having the highest total, and the diarization table is updated accordingly.
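A sketch of the running-total bookkeeping this implies (the function and variable names are illustrative, not taken from the application):

```python
from collections import deque

def update_window_total(totals, history, agent_id, new_corr, window_len=20):
    """Maintain a 20-sub-block sliding-window correlation total for one participant:
    add the correlation for the newest sub-block, and subtract the correlation of
    the sub-block which has just fallen out of the window."""
    h = history.setdefault(agent_id, deque())
    totals[agent_id] = totals.get(agent_id, 0.0) + new_corr
    h.append(new_corr)
    if len(h) > window_len:
        totals[agent_id] -= h.popleft()
    return totals[agent_id]

# Usage sketch:
# totals, history = {}, {}
# update_window_total(totals, history, "ID1", corr_value)
```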
The effect of the above embodiment will now be illustrated.
The inventors have found that when a speech signal with a watermark is mixed with a speech signal without a watermark, the watermark detection confidence is affected by the relative energies of the signals being mixed. In one illustrative test, a 4 s recording of male speech without a watermark was mixed with a recording of female speech carrying a watermark, and the female watermark detection confidence was then measured over the mixed signal.
In order to derive binary signals (flags) which indicate who was speaking when, the following Boolean equations were used, where the SP1 flag indicates the presence of male active speech and the SP2 flag indicates the presence of female active speech:
SP1 flag: (Female watermark confidence<T2) AND (speech activity=TRUE)
SP2 flag: (Female watermark confidence>T1) AND (speech activity=TRUE)
The results show the flags correctly tracking the active talker for most of the recording, albeit with momentary errors.
Applying a combination of sliding-window median smoothing and pre- and post-hangover removes these momentary errors.
Where both participants' signals carry watermarks, the flags can instead be derived as follows:
SP1: (Male watermark confidence>T1) AND (speech activity=TRUE)
SP2: (Female watermark confidence>T1) AND (speech activity=TRUE)
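A sketch of how these flags, together with the smoothing and hangover mentioned above, could be computed per sub-block is given below; the threshold, smoothing window and hangover lengths are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import median_filter

def speaker_flags(male_conf, female_conf, vad, t1=2.0, smooth=9, hangover=3):
    """Derive SP1/SP2 flags from per-sub-block watermark confidences and a voice
    activity decision, then clean them up with sliding-window median smoothing
    and a simple pre/post hangover."""
    sp1 = (np.asarray(male_conf) > t1) & np.asarray(vad)
    sp2 = (np.asarray(female_conf) > t1) & np.asarray(vad)

    def clean(flag):
        f = median_filter(flag.astype(int), size=smooth) > 0   # remove momentary errors
        out = f.copy()
        for i in np.flatnonzero(f):                            # extend each active region
            out[max(0, i - hangover):i + hangover + 1] = True  # by the hangover, pre and post
        return out

    return clean(sp1), clean(sp2)
```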
These decision values still rely on the degradation of detection confidence for the watermarks in background noise mixed with the other active speech signals. Without this property, it would not be possible to identify which watermark was present in each segment of active speech, without recourse to some form of voice-activated watermark insertion in the encoder.
Possible variations on the above embodiments include (this list is by no means exhaustive):
i) whilst in the above embodiments, the contact centre was provided using an IP-based voice communications network, in other embodiments other technologies such as those used in Public Switched Telephone Networks, ATM networks or Integrated Services Digital Networks might be used instead;
ii) in other embodiments, the above techniques are used in conferencing products which rely on a mixed audio signal, and yet provide spatialized audio (so different participants sound as though they are in different positions relative to the speaker);
iii) in the above embodiments, a call recording computer was provided. In other embodiments, legacy call logging apparatus might provide the call recording instead. Because the watermark is added by the terminals in the system, the benefits of the above embodiment would still be achieved even in embodiments where legacy call logging apparatus is used in the contact centre;
iv) in the above embodiments, the media files recording the interactions between the customer service agents and customers were stored at the call analysis server. In other embodiments, the media file recording the interactions between the customer service agents and customers could be uploaded to and stored on a separate server, being transferred to the call analysis server temporarily in order to allow the call recording to be analysed and the results of that analysis to be stored at the call analysis server.
v) in the above embodiments, the watermark signal was given an energy which was proportional to the energy of the signal being watermarked. In other embodiments, the calculated LPC coefficients could be used to generate an LPC analysis filter and the energy in the residual obtained on applying that filter to the block could additionally be taken into account. In yet other embodiments, the watermark signal could be given an energy floor even in situations where very low energy is present in the signal being watermarked.
vi) in the above embodiments, the LPC coefficients were calculated for blocks containing 155 digital samples of the audio signal. Different block sizes could be used, provided that the block sizes are sufficiently short to mean that the spectrum of the speech signal is largely stationary for the duration of the block.
vii) in the above embodiments, the watermark was added to the audio being transmitted over the communications network to the customer. In other embodiments, the watermark might only be added to the recorded audio, and not added to a separate audio stream sent to the customer. This could alleviate any problems caused by delay being added to the transmission of the agent's voice to the customer.
viii) in other embodiments, the customer might additionally have a terminal which adds a watermark to the audio produced by their terminal. In yet other embodiments, the customer's terminal might add a watermark to the customer's audio and the customer service agent's terminal might not add a watermark to the agent's audio.
ix) in the above embodiments, each agent had a unique ID and could be separately identified from a recording. In other embodiments, all agents could be associated with the same watermark, or groups of agents could be associated with a common watermark.
x) in some embodiments, the mixed digitised audio signal could be converted to an analogue signal before recording, with the subsequent analysis of the recording involving converting the recorded analogue signal to a digital signal.
xi) in the above embodiments, the single-channel mixed signal was generated by summing the sample values of the digitised audio signals in the time domain. In alternative embodiments, the summation of the two signals might be done in the frequency domain, or any other digital signal processing technique which has a similar effect might be used.
xii) in the above embodiments, digital audio technology was used. However, in other embodiments, analog electronics might be used to generate and add analog signals corresponding to the digital signals described above.
xiii) the above embodiments relate to the recording and subsequent analysis of a voice conversation. In other embodiments, one or more of the participants has a terminal which also generates a video signal including an audio track, the audio track of the recorded video signal then being modified in the way described above.
xiv) in the above embodiments, linear predictive coding techniques were used to establish the current spectral shape of the customer service agent's voice, and to process the watermark to give it the same spectral shape before generating a single-channel audio signal by adding the watermark, the signal representing the customer service agent's voice, and the signal representing the audio from the customer (usually background noise when the customer service agent is speaking). In other embodiments, the linear predictive coding is avoided in the generation of the single-channel audio signal, so that the watermark is not spectrally shaped to match the voice signal. In some such embodiments, linear predictive coding can also be avoided in the analysis of the single-channel audio signal. The downside of avoiding the use of linear predictive coding is that the energy which can be included in the watermark signal whilst keeping it imperceptible is much lower relative to the energy in the single-channel audio signal, making the recovery of the watermark signal more challenging.
xv) in embodiments where a voice activity detector (VAD) is used, then the watermark can be applied only to the current speaker. If this were done centrally (for example at a conferencing bridge or media server), then the amount of processing required, and hence the cost of the system, would be reduced.
xvi) in the above embodiments, the audio signal from the agent's microphone was sampled at an 8 kHz sampling rate. In other embodiments, a higher sampling rate might be used (for example, one of the sampling rates (44.1, 48, 96, and 192 kHz) offered by USB audio). In the above embodiments, the sample size was 16 bits; in other embodiments, larger sample sizes (for example 32 bits) might be used instead. For higher sampling rates, the LPC shaping would be of even greater benefit, as the lack of high-frequency energy in speech would provide little masking for white-noise-like watermark signals.
xvii) in the first of the above embodiments, synchronization was achieved by performing a sliding window correlation analysis between the recorded digitised audio signal and each of the basic agent identification sequences. In other embodiments, the digital audio recording might include framing information which renders the synchronization process unnecessary.
xviii) in other embodiments, in the synchronization process, an additional timing-refinement step may be required to account for any possible sub-sample shifts in timing that may have occurred in any mixing and re-sampling processes carried out after the watermark has been applied; such a step would involve interpolation of either the analysis audio signal or the target watermark signals.
xix) in the above embodiments, there are 31 different ML codes, which form the basis of the watermark signalling. Each of the indices in the PR sequences references an ML code; the PR sequences allow averaging over time (over multiple ML codes) to be performed without the introduction of audible buzzy artifacts from repetitive ML codes. In embodiments where the artifact problem is ignored, each ID could be assigned just one of the 31 ML codes and the averaging length would be set according to the desired robustness.
In summary of the above disclosure, an enterprise voice system such as a contact centre is disclosed which provides a speech analytics capability. Whilst call recording is common in many contact centres, calls are normally recorded in single-channel audio files in order to save costs. Previous attempts to provide automatic diarization of those recorded calls have relied on training the system to recognise voiceprints of users of the system, and then comparing utterances within the recorded calls to those voiceprints in order to identify who was speaking at that time. In order to avoid the need to train the system to recognise voiceprints, an enterprise voice system is disclosed which inserts a mark into the audio signal from each user's microphone. By inserting the mark with an energy (and, in some cases, also with a spectrum) which matches the audio signal into which it is inserted, and by taking advantage of the fact that typically only one user speaks at a time, a mark is left in the recorded call which a speech analytics system can use to identify who was speaking at different times in the conversation.
Priority application: EP 15187782.6, filed Sep 2015.
International application: PCT/EP2016/073237, filed 9/29/2016 (WO).