The invention relates to a method of speech separation and recognition. Specifically, the present invention relates to a method for separating and recognizing the speech of service agents and customers, with semi-supervised training, in a call center, which enhances the automatic monitoring of customer service call centers.
Today, the number of customer service telephone calls is increasing rapidly in many fields such as telecommunications, finance, electricity, retail, etc. Therefore, knowing the concerns of customers, as well as whether service agents are giving accurate and proper advice, is an urgent need for managers. This can be done manually by having supervisors randomly listen to a number of telephone calls. However, this approach is costly in manpower and slow, and the information obtained depends on the subjectivity of the supervisors. Therefore, a method is needed to automatically separate and recognize the speech of service agents and customers in customer service telephone calls. In addition, an automatic training method is needed so that the system recognizes speech more and more accurately as it is used.
This present invention aims to provide a method for speech separation and recognition of agent, customer, and semi-supervised training in telephone call centers to automate the monitoring of customer service telephone calls.
Specifically, the present invention provides a method including:
Step 1: collect speech data of customer service telephone calls for analysis. This step can be done in different ways, such as retrieving audio files directly from storage devices (hard drives, magnetic tapes, etc.) or over a data network connection, where each file corresponds to one customer service telephone call; speech files can be read directly from the user's storage device or obtained using file transfer protocols such as FTP;
Step 2: separate and label text for speech files; at this step, the files from step 1 are fed into the labeling system, where transcribers listen to, separate, and transcribe the speech of the service agent and the customer; the output of this step is speech data classified and labeled separately into the service agent's speech set and the customer's speech set;
Step 3: create training and test sets; this is done once the speech data labeled in step 2 amounts to at least Hlabel_min data hours in each of the service agent's and customer's speech sets, where Hlabel_min ≥ 10 hours to ensure the data set is large enough; the administrator selects some of the speech files labeled in step 2 to create the training set, and the remaining files form the test set, with the requirement that the test set be larger than Htest_min data hours, where Htest_min ≥ 2 hours to ensure that the test set is large enough and reliable;
Step 4: build two language models, LMa for service agents and LMb for customers, based on the training set created in step 3, to capture spoken language features such as phrases frequently used by the service agent and by the customer, in order to distinguish the statements of the service agent from those of the customer in the following steps; the language models can be n-gram or neural network-based models;
Step 5: collect speech data of telephone calls that need processing for automatic speech separation and recognition. This step can be done in different ways, such as retrieving audio files directly from storage devices (hard drives, magnetic tapes, etc.) or over a data network connection, where each file corresponds to one customer service telephone call; speech files can be read directly from the user's storage device or obtained using file transfer protocols such as FTP;
Step 6: automatically cut speech files into small segments; each speech file obtained in step 5 is automatically cut into segments based on signal characteristics, using popular methods such as segmentation based on the average energy of the signal, or a speech recognition system;
Step 7: extract speaker feature vectors; all speech segments obtained in step 6 are extracted by a pre-trained feature extraction network such as a deep learning neural network (DNN) to obtain speaker feature vectors, wherein each speech segment will obtain a corresponding speaker feature vector;
Step 8: cluster speech segments; for each speech file, cluster the speech segments in step 6 into two clusters C1 and C2 based on the speaker feature vectors extracted in step 7;
Step 9: convert speech to text; all speech segments in step 6 are converted to text using a speech recognition system, with each speech segment obtaining a corresponding text and a recognition confidence score CS ranging from 0 to 1;
Step 10: select the speech segments satisfying the conditions as a basis for classification; for each speech file, select the speech segments from step 9 that have confidence score CS ≥ α, where 0.5 ≤ α ≤ 0.95, to eliminate segments with too low confidence, which are often segments of poor audio quality or from too noisy an environment, affecting the quality of the classification system; if no satisfactory segment is found, skip the current file and move to a new speech file;
Step 11: classify speech segments of service agents and customers;
with the speech segments selected in step 10 divided into the two clusters of step 8, compute the value w, where PPLa1, PPLa2, PPLb1, PPLb2 are the perplexities given by the language models LMa and LMb of step 4 on the text of the speech segments selected in step 10; PPLa1 and PPLb1 are computed for the segments in cluster C1; PPLa2 and PPLb2 for the segments in cluster C2; if w ≤ θ, all speech segments in cluster C1 are identified as the service agent's and all speech segments in cluster C2 as the customer's; conversely, if w > θ, all speech segments in cluster C2 are identified as the service agent's and all speech segments in cluster C1 as the customer's; the threshold θ has a value in the range from 0.5 to 2.0; after this step, speech separation and recognition for the service agent and the customer is complete; if semi-supervised training is needed to improve the quality of the system, proceed to step 12, otherwise stop;
Step 12: select speech segments satisfying the conditions to be included in the semi-supervised training set; select the speech segments from step 9 that have confidence score CS ≥ β, where 0.8 ≤ β ≤ 0.99, to keep only segments with a high recognition confidence for the semi-supervised training set; each such segment carries the service agent or customer label assigned in step 11;
Step 13: choose the time to update the language models; the update is performed when the training data in the semi-supervised set exceeds a threshold of Hsemi_min data hours and the administrator so decides, where Hsemi_min ≥ 10 hours so that the semi-supervised training data is large enough and reliable;
Step 14: build language models based on semi-supervised data; at this step, the data in the semi-supervised set is used to build two language models, LMa_semi from the service agent data and LMb_semi from the customer data; these are then combined with the two language models LMa and LMb of step 4 to create two language models LMa′ and LMb′, with combination coefficient k, where 0.1 ≤ k ≤ 0.8;
Step 15: update the language models; compute the value w0, where PPLa1, PPLa2, PPLb1, PPLb2 are the perplexities given by the language models LMa and LMb of step 4 on the text of the test sets created in step 3; PPLa1 and PPLb1 are computed on the test set consisting of the service agent's speech segments; PPLa2 and PPLb2 on the test set of the customer's speech segments; then compute w1 in the same way as w0, replacing the two language models of step 4 with LMa′ and LMb′ of step 14; if w0 > q·w1, replace LMa with LMa′ and LMb with LMb′, where q ≥ 1.0.
The invention is detailed below. Specifically, a method of speech separation and recognition of service agents and customers, with semi-supervised training, in customer service call centers comprises the following steps:
Step 1: collect speech data of customer service telephone calls for analysis;
Step 2: separate and label text for speech files;
Step 3: create training and test sets;
Step 4: build two language models;
Step 5: collect speech data of telephone calls that need processing for automatic speech separation and recognition;
Step 6: automatically cut speech files into small segments;
Step 7: extract speaker feature vectors;
Step 8: cluster speech segments;
Step 9: convert speech to text;
Step 10: select the speech segments satisfying the conditions as a basis for classification;
Step 11: classify speech segments of service agents and customers;
Step 12: select speech segments satisfying the conditions to be included in the semi-supervised training set;
Step 13: choose the time to update the language models;
Step 14: build language models based on semi-supervised data;
Step 15: update the language models.
The details of these steps are as follows:
Step 1: collect speech data of customer service telephone calls for analysis. This step can be done in different ways, such as retrieving audio files directly from storage devices (hard drives, magnetic tapes, etc.) or over a data network connection, where each file corresponds to one customer service telephone call; speech files can be read directly from the user's storage device or obtained using file transfer protocols such as FTP.
Step 2: separate and label text for speech files; at this step, the files from step 1 are fed into the labeling system, where transcribers listen to, separate, and transcribe the speech of the service agent and the customer; the output of this step is speech data classified and labeled separately into the service agent's speech set and the customer's speech set.
Step 3: create training and test sets; this is done once the speech data labeled in step 2 amounts to at least Hlabel_min data hours in each of the service agent's and customer's speech sets, where Hlabel_min ≥ 10 hours to ensure the data set is large enough; the administrator selects some of the speech files labeled in step 2 to create the training set, and the remaining files form the test set, with the requirement that the test set be larger than Htest_min data hours, where Htest_min ≥ 2 hours to ensure that the test set is large enough and reliable.
Step 4: build two language models, LMa for service agents and LMb for customers, based on the training set created in step 3, to capture spoken language features such as phrases frequently used by the service agent and by the customer, in order to distinguish the statements of the service agent from those of the customer in the following steps; the language models can be n-gram or neural network-based models.
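As an illustration of step 4, the sketch below trains a simple bigram language model with add-one smoothing and measures its perplexity on tokenized text. The function names and the smoothing choice are illustrative assumptions only; the method itself admits any n-gram or neural language model.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Train a bigram LM with add-one smoothing from tokenized sentences.

    Returns a function prob(prev, word) giving the smoothed bigram probability.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])            # history counts
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab = {w for s in sentences for w in s} | {"</s>"}
    V = len(vocab)

    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
    return prob

def perplexity(prob, sentences):
    """Per-token perplexity of the LM over tokenized sentences (lower = better fit)."""
    neg_log_sum, n = 0.0, 0
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(toks[:-1], toks[1:]):
            neg_log_sum += -math.log(prob(prev, word))
            n += 1
    return math.exp(neg_log_sum / n)
```

A model trained on agent transcripts is expected to give lower perplexity on agent-like text than on customer-like text, which is exactly the property the classification step relies on.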
Step 5: collect speech data of telephone calls that need processing for automatic speech separation and recognition. This step can be done in different ways, such as retrieving audio files directly from storage devices (hard drives, magnetic tapes, etc.) or over a data network connection, where each file corresponds to one customer service telephone call; speech files can be read directly from the user's storage device or obtained using file transfer protocols such as FTP.
Step 6: automatically cut speech files into small segments; each speech file obtained in step 5 is automatically cut into segments based on signal characteristics, using popular methods such as segmentation based on the average energy of the signal, or a speech recognition system.
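One common energy-based realization of step 6 can be sketched as follows; the frame size, hop size, threshold factor, and minimum-silence length are illustrative assumptions (here tuned for 16 kHz audio given as a list of samples), not values fixed by the method.

```python
def segment_by_energy(signal, frame=400, hop=160, min_silence_frames=20):
    """Cut a speech signal into segments at silent stretches, using
    short-time energy as the voicing criterion.

    Returns a list of (start_sample, end_sample) pairs.
    """
    n_frames = max(0, 1 + (len(signal) - frame) // hop)
    energies = [sum(x * x for x in signal[i * hop:i * hop + frame]) / frame
                for i in range(n_frames)]
    if not energies:
        return []
    threshold = 0.5 * sum(energies) / len(energies)  # heuristic: half the mean energy
    segments, start, silence = [], None, 0
    for i, e in enumerate(energies):
        if e > threshold:          # voiced frame
            if start is None:
                start = i
            silence = 0
        elif start is not None:    # unvoiced frame inside a segment
            silence += 1
            if silence >= min_silence_frames:
                segments.append((start * hop, (i - silence) * hop + frame))
                start, silence = None, 0
    if start is not None:          # close a segment running to end of file
        segments.append((start * hop, n_frames * hop))
    return segments
```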
Step 7: extract speaker feature vectors; all speech segments obtained in step 6 are passed through a pre-trained feature extraction network, such as a deep neural network (DNN), to obtain speaker feature vectors, wherein each speech segment yields a corresponding speaker feature vector.
Step 8: cluster speech segments; for each speech file, cluster the speech segments in step 6 into two clusters C1 and C2 based on the speaker feature vectors extracted in step 7.
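Step 8 can be realized with any two-way clustering of the speaker vectors; below is a minimal k-means sketch with k = 2, using farthest-point initialization as an illustrative choice (the method does not prescribe a specific clustering algorithm).

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(vectors, iters=20):
    """Cluster speaker feature vectors into two clusters C1/C2 (k-means, k=2).

    Returns a label (0 or 1) per input vector.
    """
    c1 = vectors[0]
    c2 = max(vectors, key=lambda v: dist2(v, c1))  # farthest point from c1
    for _ in range(iters):
        g1 = [v for v in vectors if dist2(v, c1) <= dist2(v, c2)]
        g2 = [v for v in vectors if dist2(v, c1) > dist2(v, c2)]
        if not g1 or not g2:
            break
        c1 = [sum(col) / len(g1) for col in zip(*g1)]  # recompute centroids
        c2 = [sum(col) / len(g2) for col in zip(*g2)]
    return [0 if dist2(v, c1) <= dist2(v, c2) else 1 for v in vectors]
```

With well-separated speaker embeddings, the two clusters correspond to the two speakers on the call; which cluster is the agent is decided later, in step 11.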
Step 9: convert speech to text; all speech segments in step 6 are converted to text using a speech recognition system, with each speech segment obtaining a corresponding text and a recognition confidence score CS ranging from 0 to 1.
Step 10: select the speech segments satisfying the conditions as a basis for classification; for each speech file, select the speech segments from step 9 that have confidence score CS ≥ α, where 0.5 ≤ α ≤ 0.95, to eliminate segments with too low confidence, which are often segments of poor audio quality or from too noisy an environment, affecting the quality of the classification system; if no satisfactory segment is found, skip the current file and move to a new speech file.
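Step 10 reduces to a simple threshold filter; the segment representation below (a dict with a `cs` field) is an assumption for illustration only.

```python
def select_confident(segments, alpha=0.7):
    """Keep only segments whose recognition confidence CS meets the threshold
    (0.5 <= alpha <= 0.95 per the method).

    An empty result means the current file is skipped entirely.
    """
    assert 0.5 <= alpha <= 0.95
    return [s for s in segments if s["cs"] >= alpha]
```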
Step 11: classify speech segments of service agents and customers;
with the speech segments selected in step 10 divided into the two clusters of step 8, compute the value w, where PPLa1, PPLa2, PPLb1, PPLb2 are the perplexities given by the language models LMa and LMb of step 4 on the text of the speech segments selected in step 10; PPLa1 and PPLb1 are computed for the segments in cluster C1; PPLa2 and PPLb2 for the segments in cluster C2; if w ≤ θ, all speech segments in cluster C1 are identified as the service agent's and all speech segments in cluster C2 as the customer's; conversely, if w > θ, all speech segments in cluster C2 are identified as the service agent's and all speech segments in cluster C1 as the customer's; the threshold θ has a value in the range from 0.5 to 2.0; after this step, speech separation and recognition for the service agent and the customer is complete; if semi-supervised training is needed to improve the quality of the system, proceed to step 12, otherwise stop.
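The text above does not reproduce the formula for w. One plausible form consistent with the decision rule (matched model/cluster pairs drive w below θ, with θ ranging around 1) is a perplexity ratio, sketched below purely as an assumption for illustration; the actual claimed formula may differ.

```python
def classify_clusters(ppl_a1, ppl_a2, ppl_b1, ppl_b2, theta=1.0):
    """Assign clusters C1/C2 to agent/customer from the four perplexities.

    Assumed form: w = (PPLa1 * PPLb2) / (PPLb1 * PPLa2). If the agent model
    LMa fits C1 and the customer model LMb fits C2, both factors are < 1,
    so w falls at or below theta and C1 is labeled as the agent.
    """
    w = (ppl_a1 * ppl_b2) / (ppl_b1 * ppl_a2)
    if w <= theta:
        return {"C1": "agent", "C2": "customer"}
    return {"C1": "customer", "C2": "agent"}
```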
Step 12: select speech segments satisfying the conditions to be included in the semi-supervised training set; select the speech segments from step 9 that have confidence score CS ≥ β, where 0.8 ≤ β ≤ 0.99, to keep only segments with a high recognition confidence for the semi-supervised training set; each such segment carries the service agent or customer label assigned in step 11.
Step 13: choose the time to update the language models; the update is performed when the training data in the semi-supervised set exceeds a threshold of Hsemi_min data hours and the administrator so decides, where Hsemi_min ≥ 10 hours so that the semi-supervised training data is large enough and reliable.
Step 14: build language models based on semi-supervised data; at this step, the data in the semi-supervised set is used to build two language models, LMa_semi from the service agent data and LMb_semi from the customer data; these are then combined with the two language models LMa and LMb of step 4 to create two language models LMa′ and LMb′, with combination coefficient k, where 0.1 ≤ k ≤ 0.8.
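The text leaves the combination operator unspecified; linear interpolation of the two models' probability estimates with coefficient k is a common choice and is sketched here as an assumption.

```python
def interpolate_lms(prob_base, prob_semi, k):
    """Combine a base LM and a semi-supervised LM into a new LM.

    Assumed combination: P'(w | prev) = (1 - k) * P_base + k * P_semi,
    with 0.1 <= k <= 0.8 per the method. Both inputs are functions
    prob(prev, word) -> probability.
    """
    assert 0.1 <= k <= 0.8

    def prob(prev, word):
        return (1 - k) * prob_base(prev, word) + k * prob_semi(prev, word)
    return prob
```

Since each component distribution sums to 1 over the vocabulary, the convex combination is itself a valid distribution for any k in the allowed range.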
Step 15: update the language models; compute the value w0, where PPLa1, PPLa2, PPLb1, PPLb2 are the perplexities given by the language models LMa and LMb of step 4 on the text of the test sets created in step 3; PPLa1 and PPLb1 are computed on the test set consisting of the service agent's speech segments; PPLa2 and PPLb2 on the test set of the customer's speech segments; then compute w1 in the same way as w0, replacing the two language models of step 4 with LMa′ and LMb′ of step 14; if w0 > q·w1, replace LMa with LMa′ and LMb with LMb′, where q ≥ 1.0.
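As with w in step 11, the formula for w0 and w1 is not reproduced in the text; the sketch below assumes the same ratio form, where a lower score means the model pair separates the agent and customer test sets better. Both the score formula and the safety factor q are assumptions for illustration.

```python
def discrimination_score(ppl_a_agent, ppl_a_cust, ppl_b_agent, ppl_b_cust):
    """Assumed test-set score: (PPLa1 * PPLb2) / (PPLa2 * PPLb1).

    Lower is better: matched perplexities (LMa on agent text, LMb on
    customer text) in the numerator, mismatched ones in the denominator.
    """
    return (ppl_a_agent * ppl_b_cust) / (ppl_a_cust * ppl_b_agent)

def should_update(old_ppls, new_ppls, q=1.05):
    """Adopt LMa', LMb' only if they beat the current models by factor q >= 1.

    Each argument is a 4-tuple (PPLa1, PPLa2, PPLb1, PPLb2) measured on the
    fixed test sets of step 3, for the old and new model pairs respectively.
    """
    w0 = discrimination_score(*old_ppls)
    w1 = discrimination_score(*new_ppls)
    return w0 > q * w1
```

Requiring q ≥ 1 guards against replacing the current models on the strength of marginal, possibly noisy, test-set improvements.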
The solution has been applied to build a method for separating and recognizing the speech of service agents and customers, with semi-supervised training, in Viettel's customer service call centers.
At Viettel customer service call centers, we use this method to separate and recognize the speech of service agents and customers into text. From there, it is possible to monitor and make statistics for the content of customer service telephone calls automatically and quickly. In addition, we can also know the thoughts and frustrations of the customers as well as whether the service agent's response is correct. The system is constantly updated based on the semi-supervised training mechanism, thereby helping to improve the accuracy of the system.
A special advantage of the present invention is a method of speech separation and recognition of service agents and customers, with semi-supervised training, in call centers. The method lets managers see what their service agents and customers say, and thus quickly and objectively learn the wishes and concerns of customers, as well as whether their service agents give accurate and correct advice. In addition, the system is constantly updated through the semi-supervised training mechanism, meaning that it can learn from actual data during operation, thereby improving its accuracy.
Although the above descriptions contain many specifics, they are not intended to limit the scope of the invention, but only to illustrate some preferred embodiments.
Number | Date | Country | Kind
---|---|---|---
1-2021-04220 | Jul 2021 | VN | national