The present invention relates to speech recognition in general, and to a method and apparatus for enhanced phonetic search, in particular.
Many organizations, such as commercial organizations, financial institutions, government agencies or public safety organizations conduct numerous interactions with customers, users, suppliers and the like on a daily basis. Many of these interactions are vocal, or at least comprise a vocal or audio component, for example, voices of participants of a phone call or the audio portion of a video or face-to-face interaction.
Many organizations record some or all of the interactions, whether it is required by law or regulations, for quality assurance or quality management purposes, or for any other reason.
Once the interactions are recorded, the organization may want to extract as much information as possible from the interactions. A common usage for such recorded interactions relates to speech recognition and in particular to searching for particular words pronounced by either side of the interaction, such as product or service name, a competitor name, competing product name, or the like.
Searching for words can be performed by phonetic indexing of the interaction's audio signal and then searching the index for words. Searching a single indexed interaction is quite fast, but when dealing with large numbers of indexed interactions the cumulative search speed may be very slow in terms of user response time. There is thus a need in the art for a method and apparatus for enhanced phonetic indexing and search, in order to enhance the speed of speech search systems that are based on phonetic indexing and search algorithms.
Searching for words in a large amount of recorded audio signals by using traditional phonetic indexing and search may result in a very slow cumulative search speed in terms of user response time. Search time in traditional phonetic search systems is a linear function of the number of indexes that are being searched.
An aspect of an embodiment of the disclosed subject matter relates to a system and method for improving phonetic search speed in terms of user response time. The search speed improvement is based on splitting the search task into two phases. The first search phase is a fast coarse search and the second phase is a slow fine search. The coarse search is based on inverted phonetic indexing and search. The fine search is based on traditional phonetic indexing and search. The method comprises: receiving a digital representation of an audio signal; producing a phonetic index of the audio signal; producing a phonetic N-gram sequence from the phonetic index by segmenting the phonetic index into a plurality of phonetic N-grams; and producing an inverted index of the plurality of phonetic N-grams.
The method can further comprise: obtaining a textual search term; converting the textual search term into a phonetic search term; and searching for the phonetic search term in the inverted index. The method can further comprise ranking two or more digital audio signals based on the searching of the inverted index and determining whether to perform a second search phase on the audio signals, based on the audio signals' determined rank. In some embodiments only audio signals that have a rank higher than a predefined threshold will be searched by the second search phase. In some embodiments the determination regarding performing the second search phase may be based on the accuracy estimation of the inverted index search. In some embodiments the determination regarding performing the second search phase may be based on other parameters, such as the load balance of a device performing the phonetic search.
By using this method, the search speed, in terms of user response time, of searching for words in a large amount of recorded audio signals is enhanced significantly. Additionally the system provides control over the tradeoff between search speed and search accuracy.
The present disclosure will be understood and appreciated more fully from the following detailed description taken in conjunction with the drawings in which corresponding or like numerals or characters indicate corresponding or like components. Unless indicated otherwise, the drawings provide exemplary embodiments or aspects of the disclosure and do not limit the scope of the disclosure. In the drawings:
Reference is made to
A typical environment where a system according to the invention may be deployed may be an interaction-rich organization, e.g., a call center, a bank, a trading floor, an insurance company or any applicable financial or other institute. Other environments may be a public safety contact center, an interception center of a law enforcement organization, a service provider, an internet content delivery company with multimedia search needs, a system for content delivery programs, or the like. Interactions captured and provided to system 100 may be any applicable interactions or transmissions, including broadcasts, interactions with customers or users or interactions involving organization members, suppliers or other parties.
Various data types may be provided as input to system 100. The information types optionally include auditory segments, video segments, textual interactions, and additional data. The capturing of voice interactions, or the vocal or auditory part of other interactions, such as video, may be of any form, format, and may be produced using various technologies, including trunk side, extension side, summed audio, separate audio, various encoding and decoding protocols such as G729, G726, G723.1, and the like. The digital representations of the audio signals of the interactions may be provided by telephone/VoIP module 112, the walk-in center 116, the video conference 124 and additional sources 128 and captured by the capturing and logging module 132. Vocal interactions may include telephone or voice over IP (VoIP) sessions, telephone calls of any kind that may be carried over landline, mobile, satellite phone or other technologies.
It will be appreciated that voice messages are optionally captured and processed as well, and that embodiments of the disclosed subject matter are not limited to two-sided conversations. Captured interactions may include face-to-face interactions, such as those recorded in a walk-in center, video conferences that include an audio component, or any additional sources of data as shown by 128. Additional sources 128 may include vocal sources such as microphone, intercom, vocal input by external systems, broadcasts, files, streams, or any other source.
Data from all the above-mentioned sources and others may be captured and/or logged by the capturing and logging module 132. The capturing and logging module 132 may include a computing platform that may execute one or more computer applications, e.g., as detailed below. The captured data may optionally be stored in storage which is preferably a mass storage device, for example an optical storage device such as a CD, a DVD, or a laser disk; a magnetic storage device such as a tape, a hard disk, Storage Area Network (SAN), a Network Attached Storage (NAS), or others; a semiconductor storage device such as Flash device, memory stick, or the like.
The storage may be common or separate for different types of captured segments of an interaction and different types of additional data. The storage may be located onsite where the segments or some of them are captured, or in a remote location. The capturing or the storage components can serve one or more sites of a multi-site organization.
Phonetic indexing and inverted indexing component 140 may produce a phonetic index and an inverted index for each interaction. The phonetic index is a sequence of phonemes representing the speech sounds of the interaction. The inverted index is a data structure that maps sub-sequences of the phonetic index to the locations in which they appear. Each location includes the unique identifier of the audio signal and the time tag, in milliseconds, within the audio signal. The inverted index data structure enables fast searching of phoneme sequences.
The storage device 144 may store the phonetic indexes 146 and the inverted phonetic indexes 148 of audio interactions that are produced by the phonetic indexing and inverted indexing component 140.
Two phase phonetic search component 150 may use inverted phonetic indexes 148 and/or phonetic indexes 146. The inverted indexes are used for the first phase search of search terms. The first phase search is a fast coarse search. The first phase search results are ranked. The ranked output of this first phase search may be used as input to a second phase search or used as is, without performing a second phase search. The second phase search may be performed on the top ranked interactions. The second phase search is a phonetic search that is performed using phonetic indexes 146. The top ranked interactions may be selected by using a predefined threshold, by selecting the N top ranked interactions, or by combining threshold selection with a limit on the number of selected interactions, in order to bound search time.
The output of the two phase phonetic search component 150 may preferably be sent to further analysis module 152. Further analysis may include, but is not limited to, emotion detection, speech to text, text analysis on the resulting text, call flow analysis, root cause analysis, link analysis, topic extraction, categorization, clustering, or the like. The further analysis may be based on the search results, for example, categorization process may use words or phrases that were detected by the two phase phonetic search component 150 for categorizing interactions.
The output of the two phase phonetic search component 150 may also be transferred to the playback and visualization module 154, if required. The search results can also be presented in any way the user prefers, including for example various graphic representations, textual presentation, table presentation, vocal representation, or the like, and can be transferred in any required method. The output can also be presented as a dedicated user interface or media player that provides the ability to examine and listen to certain areas of the interactions, for example: areas that include detected search results.
The output of the two phase phonetic search component 150 may also be transferred to the storage module 156 for storing search results. Search results storage may include the detected search term, the audio signal in which the term was detected, the time tag of the detected term within the audio signal and the certainty score of the detected search term.
System 100 may include one or more collections of computer instructions, such as libraries, executables, modules, or the like, programmed in any programming language such as C, C++, C#, Java or other programming languages, and/or developed under any development environment, such as .Net, J2EE or others.
Alternatively, methods described herein may be implemented as firmware ported for a specific processor such as digital signal processor (DSP) or microcontrollers, or may be implemented as hardware or configurable hardware such as field programmable gate array (FPGA) or application specific integrated circuit (ASIC). The software components may be executed on one platform or on multiple platforms wherein data may be transferred from one computing platform to another via a communication channel, such as the Internet, Intranet, Local area network (LAN), wide area network (WAN), or via a device such as CD-ROM, disk on key, portable disk or others.
Reference is made to
Audio signal 200 contains a digital representation of an audio signal and an audio signal ID. The audio signal ID is a unique identifier of the audio signal. The audio signal is captured and logged by capturing and logging module 132 of
Step 202 discloses applying a phonetic indexing algorithm to the audio signal 200. The audio is phonetically indexed, producing a sequence of pairs, where each pair includes a phoneme and its time tag. The phoneme is the smallest unit of sound of the spoken language. The time tag reflects the time, in milliseconds, from the beginning of the audio signal. The sequence of pairs is referred to herein as the phonetic index. Following is a phonetic index example: {[Ph(1),T(1)], [Ph(2),T(2)], [Ph(3),T(3)], [Ph(4),T(4)] . . . [Ph(n),T(n)]}, wherein n represents the ordinal number of the phoneme from the beginning of the phonetic index, Ph(n) represents the phoneme type of the nth phoneme, and T(n) represents the time interval from the beginning of the audio signal to the beginning of the nth phoneme.
Step 204 discloses storing the phonetic index. The phonetic index is stored in any permanent storage, such as phonetic indexes 146 of storage device 144 of
Step 206 discloses N-gram segmentation of the phonetic index. The phonetic index is segmented into non-overlapping or overlapping sequences of N consecutive phonemes. A sequence of N consecutive phonemes is referred to herein as a phonetic N-gram. A sequence of phonetic N-grams is referred to herein as a phonetic N-gram sequence. The number of consecutive phonemes in a phonetic N-gram is referred to herein as the N-gram length, and the number of phonemes shared by consecutive N-grams is referred to herein as the N-gram overlap. N-gram length is typically in the range of two to five and N-gram overlap is typically in the range of zero to two, where zero stands for no overlap. For example, for an N-gram length of three phonemes and an N-gram overlap of zero phonemes, the following phonetic N-gram sequence is generated: {[Ph(1) Ph(2) Ph(3), T(1)], [Ph(4) Ph(5) Ph(6), T(4)] . . . [Ph(n-5) Ph(n-4) Ph(n-3), T(n-5)], [Ph(n-2) Ph(n-1) Ph(n), T(n-2)]}. Another phonetic N-gram sequence example is shown for an N-gram length of three phonemes and an N-gram overlap of two phonemes: {[Ph(1) Ph(2) Ph(3), T(1)], [Ph(2) Ph(3) Ph(4), T(2)] . . . [Ph(n-3) Ph(n-2) Ph(n-1), T(n-3)], [Ph(n-2) Ph(n-1) Ph(n), T(n-2)]}.
Here n represents the ordinal number of the phoneme from the beginning of the phonetic index, Ph(n) represents the phoneme type of the nth phoneme, and T(n) represents the time interval from the beginning of the audio signal to the beginning of the nth phoneme.
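The segmentation of step 206 can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the function name segment_ngrams, the phoneme symbols, and the choice to drop trailing phonemes that do not fill a complete N-gram are assumptions made for illustration only.

```python
from typing import List, Tuple

Phoneme = Tuple[str, int]             # (phoneme symbol, time tag in ms)
NGram = Tuple[Tuple[str, ...], int]   # (N phoneme symbols, time tag of the first one)


def segment_ngrams(index: List[Phoneme], length: int = 3, overlap: int = 0) -> List[NGram]:
    """Segment a phonetic index into phonetic N-grams (hypothetical helper)."""
    if length < 1 or not 0 <= overlap < length:
        raise ValueError("require length >= 1 and 0 <= overlap < length")
    step = length - overlap  # how far the window advances between N-grams
    ngrams: List[NGram] = []
    # Assumption: a trailing partial window is dropped rather than padded.
    for start in range(0, len(index) - length + 1, step):
        window = index[start:start + length]
        symbols = tuple(symbol for symbol, _ in window)
        ngrams.append((symbols, window[0][1]))  # time tag of the first phoneme
    return ngrams


# Illustrative phonetic index; the symbols and time tags are made up.
example_index = [("k", 0), ("ae", 80), ("t", 150), ("s", 230), ("ih", 300), ("t", 360)]
no_overlap = segment_ngrams(example_index, length=3, overlap=0)
overlap_two = segment_ngrams(example_index, length=3, overlap=2)
```

With an overlap of zero the window advances by three phonemes per N-gram; with an overlap of two it advances by one, which matches the two example sequences given above.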
Step 208 discloses inverted indexing of the phonetic N-gram sequence. The inverted index is a data structure that maps each phonetic N-gram of the phonetic N-gram sequence to the audio signal ID that it appears in, its time tag (counted in milliseconds from the beginning of the audio signal), and its ordinal position within the phonetic N-gram sequence.
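One possible in-memory layout for such an inverted index is sketched below in Python. The dictionary-of-lists representation, the function name add_to_inverted_index, the audio signal IDs, and the zero-based ordinal positions are assumptions for illustration; the disclosure does not prescribe a concrete layout.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

NGramKey = Tuple[str, ...]
Posting = Tuple[str, int, int]  # (audio signal ID, time tag in ms, ordinal position)
InvertedIndex = DefaultDict[NGramKey, List[Posting]]


def add_to_inverted_index(
    inverted: InvertedIndex,
    audio_id: str,
    ngram_sequence: List[Tuple[NGramKey, int]],
) -> None:
    """Record where each phonetic N-gram of one audio signal occurs."""
    for position, (symbols, time_ms) in enumerate(ngram_sequence):
        # Assumption: ordinal positions are counted from zero.
        inverted[symbols].append((audio_id, time_ms, position))


inverted: InvertedIndex = defaultdict(list)
# Hypothetical N-gram sequences for two audio signals.
add_to_inverted_index(inverted, "call-001", [(("k", "ae", "t"), 0), (("s", "ih", "t"), 230)])
add_to_inverted_index(inverted, "call-002", [(("s", "ih", "t"), 40)])

# Looking up one N-gram returns every location at which it was indexed.
hits = inverted[("s", "ih", "t")]
```

Because a lookup touches only the postings of the queried N-gram, its cost does not grow with the total length of the indexed audio in the way a sequential scan of every phonetic index does.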
Step 210 discloses inverted index storing. The inverted index is stored along with its audio signal ID in any permanent storage, such as inverted phonetic indexes 148 of storage device 144 of
Reference is made to
Textual search term 300 is the obtained textual form of a word or sequence of words.
Step 302 discloses text to phoneme conversion of the textual search term 300. Each word within the textual search term is assigned a phonetic transcription, thus generating a sequence of phonemes which is the phonetic representation of the textual search term. The phonetic representation of the textual search term is referred to herein as the phonetic search term.
Following is an example of the phonetic search term structure: {Ph(1) Ph(2) Ph(3) Ph(4) . . . Ph(n)}, wherein n represents the ordinal number of the phoneme from the beginning of the phonetic search term and Ph(n) represents the phoneme type of the nth phoneme. The conversion may be performed by dictionary based methods. Those methods use dictionaries containing all the words of a language, in textual form, and their correct pronunciations, in phoneme sequence form. Step 304 discloses applying N-gram segmentation to the phonetic search term. The phonetic search term is segmented into non-overlapping or overlapping sequences of N consecutive phonemes, producing segmented phonetic search terms. The value of N is the same value as the N-gram length value that is used in step 206 of
Step 308 discloses inverted index search. Inverted index search includes the detection of matches between one or more segmented phonetic search terms that are included in the search query and areas in the inverted index or inverted indexes 306 that are generated on step 208 and stored on step 210 of
Step 310 discloses audio signal ranking. The audio signal ranking involves receiving two or more audio signal IDs that represent digital representations of audio signals, receiving a list of detected events per audio signal ID, receiving a search query that includes one or more segmented phonetic search terms, and ranking the two or more digital representations of audio signals based on the matching between the search query and the detected events.
The ranking process produces a ranked list of audio signal IDs that correspond to audio signals. Audio signal ranking may be based on the number of detected events, their confidence scores and their matching level to the search query. The rank reflects the probability that the textual search term is found within the audio signal. The rank of each audio signal is in the range of 0-100, where 100 represents high probability and 0 represents low probability, that the search term is found within the audio signal. In some embodiments the rank may be produced by counting the number of detected events in an audio signal that satisfy the query. In addition to counting the detected events, their scores can also affect the ranking. For example, the following function may be used for ranking of an audio signal:
Where:
J is the audio signal ID;
Ci is the confidence score of i-th detected event within the J-th signal ID;
N is the number of detected events within the J-th signal ID that conform with the search query; and
A is a predetermined constant (0.35 by default, or determined empirically through statistical experiments; other values may be used).
For example, assume that audio signal J includes the following three detected events that conform with the search query, with confidence scores C0=0.6, C1=0.75 and C2=0.8, and that A=0.35. The rank may be calculated as follows:
Step 312 discloses second phase search decision. A decision regarding whether to perform phonetic search, in addition to the inverted index search, is made. Performing phonetic search on top of the inverted index search output may yield more accurate search results but slower search speed than performing inverted index search only.
The decision whether to perform further search may be based on obtaining an enable/disable parameter 311. The enable/disable parameter 311 enables or disables the second phase search process. The parameter may be manually controlled by a system user, thus enabling control over the tradeoff between search speed and search accuracy. The decision may also be based on the estimated accuracy of the inverted index search. The accuracy estimation of the inverted index search may be based on the audio signal ranking disclosed in step 310. In some embodiments, if the average rank of the top N audio signals is above a predefined threshold then the second phase search is disabled; otherwise the second phase search is enabled. Phonetic search is thus performed only if the accuracy estimation of the inverted index search is lower than the threshold. The decision may also be based on the available processing resources of the system. In some embodiments performing phonetic search is disabled if the processing resources are below a predefined threshold.
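The decision criteria of step 312 can be sketched as follows in Python. The source presents the criteria as alternatives that different embodiments may use; combining all three in one function, the function and parameter names, and the treatment of an empty rank list are assumptions made for illustration.

```python
from typing import Sequence


def should_run_second_phase(
    enabled: bool,
    ranks: Sequence[float],
    top_n: int,
    rank_threshold: float,
    available_resources: float,
    resource_threshold: float,
) -> bool:
    """Decide whether to run the phonetic (second phase) search."""
    if not enabled:
        return False  # enable/disable parameter switched off
    if available_resources < resource_threshold:
        return False  # not enough processing resources
    top = sorted(ranks, reverse=True)[:top_n]
    if not top:
        return False  # assumption: nothing ranked, so nothing to refine
    average_rank = sum(top) / len(top)
    # A high average rank means the coarse search is already judged accurate.
    return average_rank <= rank_threshold


confident = should_run_second_phase(True, [90, 80, 20], 2, 70, 0.5, 0.2)
uncertain = should_run_second_phase(True, [60, 50, 20], 2, 70, 0.5, 0.2)
```

In this sketch the resource measure is an arbitrary number compared against a threshold; how processing resources are actually measured is not specified in the text.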
Step 314 discloses audio signal selection. The selection process selects the audio signals that will undergo phonetic search at step 318. The selection process may be based on comparing the rank of each audio signal to a threshold. Audio signals with a rank that is higher than the threshold may undergo phonetic search, and audio signals with a rank that is lower than the threshold may not be searched. In some embodiments the threshold may be predefined (e.g. threshold=50). In other embodiments audio signal selection may be based on selecting the N top ranked audio signals, or on a combination of threshold based selection with a limit on the number of selected audio signals, in order to bound search time. Limiting the number of audio signals that are used for phonetic search may be performed according to a predefined limitation parameter obtained by the system.
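A minimal Python sketch of the combined selection (threshold plus a cap on the number of selected signals) is given below. The function name, the parameter names, and the exclusion of signals whose rank equals the threshold are assumptions; the text does not say how ties with the threshold are handled.

```python
from typing import Dict, List, Optional


def select_for_phonetic_search(
    ranks: Dict[str, float],
    threshold: float = 50,
    max_signals: Optional[int] = None,
) -> List[str]:
    """Return audio signal IDs to pass to the second phase, best rank first."""
    # Assumption: a rank equal to the threshold is not selected.
    selected = [(audio_id, rank) for audio_id, rank in ranks.items() if rank > threshold]
    selected.sort(key=lambda item: item[1], reverse=True)
    if max_signals is not None:
        selected = selected[:max_signals]  # bound the second phase search time
    return [audio_id for audio_id, _ in selected]


# Hypothetical ranks in the 0-100 range produced by the first phase.
example_ranks = {"call-a": 90, "call-b": 55, "call-c": 50, "call-d": 10}
by_threshold = select_for_phonetic_search(example_ranks, threshold=50)
capped = select_for_phonetic_search(example_ranks, threshold=50, max_signals=1)
```

Since ranks lie in the range 0-100, passing a negative threshold together with max_signals gives a pure top-N selection in this sketch.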
Step 318 discloses phonetic search. The phonetic search term is searched over phonetic indexes 316 that are generated on step 202 and stored on step 204 of
Step 320 discloses search output storing. The search output is generated by the inverted index search step 308 or by the phonetic search step 318. The search output is a list of detected events. Each entry of the detected events list includes three parameters. The first parameter is the audio signal ID that the event was detected in. The second parameter is the time tag of the detected event within the audio signal. The third parameter is the certainty score of the detected event.
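An entry of the detected events list could be represented as in the short Python sketch below. The class name, the field names, and the field types are assumptions for illustration; only the three parameters themselves come from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DetectedEvent:
    audio_signal_id: str  # ID of the audio signal the event was detected in
    time_tag_ms: int      # time tag of the event within the audio signal
    certainty: float      # certainty score of the detected event


# Hypothetical search output containing two detected events.
search_output: List[DetectedEvent] = [
    DetectedEvent("call-001", 230, 0.8),
    DetectedEvent("call-002", 40, 0.6),
]
best = max(search_output, key=lambda event: event.certainty)
```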
Reference is made to
As indicated at step 400, the inverted indexing process is performed for each phonetic N-gram of the phonetic N-gram sequence that is generated by step 206 of
Reference is made to
Reference is made to
Where:
N is the number of met conditions and:
A is a predetermined constant (may be 0.35 by default, or may be empirically determined through statistical experiments. Other values may be used.);
Reference is made to
Reference is made to
Number | Date | Country |
---|---|---|
20140067373 A1 | Mar 2014 | US |