The present invention relates to phonetic searching, and in particular to associating source information with phonetic indices.
A vast portion of modern communications is provided through written text or speech. In many instances, such text and speech are captured in electronic form and stored for future reference. Given the volume of these communications, large libraries of text and audio-based communications are being amassed and efforts are being made to make these libraries more accessible. Although there is significant benefit gained from thoughtful organization, contextual searching is becoming a necessary supplement, if not a replacement, for traditional organizing techniques. Most document management systems for written documents allow keyword searching throughout any number of databases, regardless of how the documents are organized, to allow users to electronically sift through volumes of documents in an effective and efficient manner.
Text-based documents lend themselves well to electronic searching because the content is easily characterized, understood, and searched. In short, the words of a document are well defined and easily searched. However, speech-based media, such as speech recordings, dictation, telephone calls, multi-party conference calls, music, and the like, have traditionally been more difficult to analyze from a content perspective than text-based documents. Most speech-based media is characterized only in general terms and is organized and searched accordingly. The specific speech content is not known with any precision unless human or automated transcription is employed to provide an associated text-based document, and human transcription has proven time-consuming and expensive.
Over the past decade, significant efforts have been made to improve automated speech recognition. Unfortunately, most speech recognition techniques rely on creating large vocabularies of words, which are created based on linguistic modeling for cross-sections of the specific population in which the speech recognition system will be used. In essence, the vocabularies are filled with the many thousands of words that may be uttered during speech. Although such speech recognition has improved, the improvements have been incremental and remain error prone.
An evolving speech processing technology that shows significant promise is based on phonetics. In essence, speech is parsed into a series of discrete human sounds called phonemes. Phonemes are the smallest units of human speech, and most languages only have 30 to 40 phonemes. From this relatively small group of phonemes, all speech can be accurately defined. The series of phonemes created by this parsing process is readily searchable and referred to in general as a phonetic index of the speech. To search for the occurrence of a given term in the speech, the term is first transformed into its phonetic equivalent, which is provided in the form of a string of phonemes. The phonetic index is processed to identify whether the string of phonemes occurs within the phonetic index. If the string of phonemes for the search term occurs in the phonetic index, then the term occurs in the speech. If the phonetic index is time aligned with the speech, the location of the string of phonemes in the phonetic index will correspond to the location of the term in the speech. Notably, phonetic-based speech processing and searching techniques tend to be less complicated and more accurate than the traditional word-based speech recognition techniques. Further, the use of phonemes mitigates the impact of dialects, slang, and other language variations that make identifying a specific word difficult, but have much less impact on each individual phoneme that makes up the same word.
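By way of illustration only, the following sketch shows the basic search mechanics described above on a toy, time-aligned phonetic index. The ARPAbet-style phoneme labels, the tiny pronunciation table, and the function names are assumptions made for this example; they are not drawn from any particular phonetic engine.

```python
# Minimal sketch of phonetic search over a time-aligned phonetic index.
PRONUNCIATIONS = {                 # hypothetical keyword -> phoneme-string table
    "transfer": ["T", "R", "AE", "N", "S", "F", "ER"],
    "office": ["AO", "F", "AH", "S"],
}

def term_to_phonemes(term):
    """Transform a search term into its phonetic equivalent (a string of phonemes)."""
    return PRONUNCIATIONS[term.lower()]

def find_phoneme_string(index, query):
    """Return start times where the query phoneme string occurs in the index.

    `index` is a time-aligned phonetic index: a list of (time_seconds, phoneme)
    entries, one per unit of speech in the audio segment.
    """
    phonemes = [p for _, p in index]
    hits = []
    for i in range(len(phonemes) - len(query) + 1):
        if phonemes[i:i + len(query)] == query:
            hits.append(index[i][0])   # time alignment locates the term in the audio
    return hits

if __name__ == "__main__":
    index = [(10.0, "T"), (10.1, "R"), (10.2, "AE"), (10.3, "N"), (10.4, "S"),
             (10.5, "F"), (10.6, "ER"), (10.7, "M"), (10.8, "IY")]
    print(find_phoneme_string(index, term_to_phonemes("transfer")))  # -> [10.0]
```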
One drawback of phonetic-based speech processing is the inability to distinguish between speakers in multi-party speech, such as that found in telephone or conference calls. Although a particular term may be identified, there is no efficient and automated way to identify the speaker who uttered the term. The ability to associate portions of speech with the respective speakers in multi-party speech would add another dimension to the ability to process and analyze multi-party speech. As such, there is a need for an efficient and effective technique to identify and associate the source of speech in multi-party speech with the corresponding phonemes in a phonetic index that is derived from the multi-party speech.
The present invention relates to creating a phonetic index of phonemes from an audio segment that includes speech content from multiple sources. The phonemes in the phonetic index are directly or indirectly associated with the corresponding source of the speech from which the phonemes were derived. By associating the phonemes with a corresponding source, the phonetic index of speech content from multiple sources may be searched based on phonetic content as well as the corresponding source. In one embodiment, the audio segment is processed to identify phonemes for each unit of speech in the audio segment. A phonetic index of the phonemes is generated for the audio segment, wherein each phonetic entry in the phonetic index identifies a phoneme that is associated with a corresponding unit of speech in the audio segment. Next, each of the multiple sources is associated with corresponding phonetic entries in the phonetic index, wherein a source associated with a given phonetic entry corresponds to the source of the unit of speech from which the phoneme for the given phonetic entry was generated. Various techniques may be employed to associate the phonemes with their sources; however, once associated, the phonetic index may be searched based on phonetic content and source criteria.
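One possible data model for such a source-aware phonetic index is sketched below. The entry fields, type names, and the interval-based association step are illustrative assumptions, not a definition of the index format used by the invention.

```python
# Hypothetical data model: each phonetic entry carries its phoneme, its offset
# into the audio segment, and (once associated) the source of the unit of
# speech from which the phoneme was derived.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class PhoneticEntry:
    phoneme: str                    # e.g. an ARPAbet-style label
    offset_s: float                 # time alignment into the audio segment
    source: Optional[str] = None    # speaker/source identifier, filled in later

def associate_sources(entries: List[PhoneticEntry],
                      source_intervals: List[Tuple[float, float, str]]) -> List[PhoneticEntry]:
    """Label each entry with the source that was active at its offset.

    `source_intervals` lists (start_s, end_s, source) spans describing which
    source was speaking during which portion of the audio segment.
    """
    for entry in entries:
        for start, end, source in source_intervals:
            if start <= entry.offset_s < end:
                entry.source = source
                break
    return entries
```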
Such searches may entail searching the phonetic index based on phonetic content criteria to identify a source associated with the phonetic content, searching the phonetic index based on the phonetic content criteria as well as source criteria to identify a matching location in the phonetic index or corresponding audio segment, and the like. Accordingly, the source information that is associated with the phonetic index may be useful as a search criterion or a search result. In one embodiment, a phonetic index and any source information directly or indirectly associated therewith may be searched as follows.
Initially, content criteria bearing on the desired phonetic content and source criteria bearing on the desired source are obtained via an appropriate search query. The content criteria may include keywords, phoneme strings, or any combination thereof, alone or in the form of a Boolean function. If one or more keywords are used, each keyword is broken into its phonetic equivalent, which will provide a string of phonemes. Accordingly, the content criteria either is, or is converted into, phonetic search criteria comprising one or more strings of phonemes, which may be associated with one or more Boolean operators. The phonetic index and associated source information are then searched based on the phonetic content criteria and the source criteria to identify portions of the phonetic index that match the phonetic search criteria and correspond to the source or sources identified in the source criteria. Depending on the application, various actions may be taken in response to identifying those portions. Further, such processing and searching may be provided on existing media files or on streaming media that has speech content from one or more parties.
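The conversion of content criteria into phonetic search criteria might look like the following sketch; the pronunciation table, query format, and Boolean handling are simplified assumptions for illustration.

```python
# Hypothetical conversion of content criteria (keywords and/or phoneme strings,
# optionally combined by a Boolean operator) into phonetic search criteria.
PRONUNCIATIONS = {  # toy keyword -> phoneme-string table; a real system would use a lexicon
    "washington": ["W", "AA", "SH", "IH", "NG", "T", "AH", "N"],
    "office": ["AO", "F", "AH", "S"],
}

def to_phonetic_criteria(content_criteria):
    """Keywords are broken into their phonetic equivalents; phoneme strings pass through."""
    phoneme_strings = []
    for term in content_criteria["terms"]:
        if isinstance(term, list):                       # already a string of phonemes
            phoneme_strings.append(term)
        else:                                            # keyword
            phoneme_strings.append(PRONUNCIATIONS[term.lower()])
    return {"operator": content_criteria.get("operator", "AND"),
            "phoneme_strings": phoneme_strings}

print(to_phonetic_criteria({"terms": ["washington", "office"], "operator": "AND"}))
```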
Those skilled in the art will appreciate the scope of the present invention and realize additional aspects thereof after reading the following detailed description of the preferred embodiments in association with the accompanying drawing figures.
The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the invention, and together with the description serve to explain the principles of the invention.
The embodiments set forth below represent the necessary information to enable those skilled in the art to practice the invention and illustrate the best mode of practicing the invention. Upon reading the following description in light of the accompanying drawing figures, those skilled in the art will understand the concepts of the invention and will recognize applications of these concepts not particularly addressed herein. It should be understood that these concepts and applications fall within the scope of the disclosure and the accompanying claims.
The present invention relates to creating a phonetic index of phonemes from an audio segment that includes speech content from multiple sources. The phonemes in the phonetic index are directly or indirectly associated with the corresponding source of the speech from which the phonemes were derived. By associating the phonemes with a corresponding source, the phonetic index of speech content from multiple sources may be searched based on phonetic content as well as the corresponding source.
Initially, an audio segment including speech content from multiple sources is accessed for processing (Step 100). The audio segment may be provided in any type of media item, such as a media stream or stored media file that includes speech from two or more known sources. The media item may include graphics, images, or video in addition to the audio segment. The audio segment is then parsed to identify phonemes for each unit of speech in the audio segment (Step 102). The result of such processing is a sequence of phonemes, which represents the phonetic content of the audio segment. Based on this sequence of phonemes, a phonetic index of the phonemes is generated for the audio segment, wherein each phonetic entry in the phonetic index identifies a phoneme that is associated with a corresponding unit of speech in the audio segment (Step 104). Next, each of the multiple sources is associated with corresponding phonetic entries in the phonetic index, wherein a source associated with a given phonetic entry corresponds to the source of the unit of speech from which the phoneme for the given phonetic entry was generated (Step 106). Various techniques may be employed to associate the phonemes with their sources; however, once associated, the phonetic index may be searched based on phonetic content and source criteria.
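A compact sketch of Steps 100 through 106 follows. The phoneme recognizer is a stand-in stub (no particular speech engine is implied), and the interval-based source association mirrors the data-model sketch given earlier.

```python
# Illustrative pipeline for Steps 100-106: access audio, parse it into phonemes,
# build the phonetic index, and associate each entry with its source.
def recognize_phonemes(audio):
    """Stand-in for the phonetic parsing of Step 102; a real system would analyze
    the audio. Here a tiny fixed sequence is returned purely for illustration."""
    return [(0.0, "HH"), (0.1, "AH"), (0.2, "L"), (0.3, "OW")]

def build_source_aware_index(audio, source_intervals):
    """Steps 102-106: generate phonetic entries and attach the active source."""
    index = []
    for offset_s, phoneme in recognize_phonemes(audio):          # Steps 102/104
        source = next((s for start, end, s in source_intervals
                       if start <= offset_s < end), None)        # Step 106
        index.append({"phoneme": phoneme, "offset_s": offset_s, "source": source})
    return index

print(build_source_aware_index(audio=None, source_intervals=[(0.0, 1.0, "Source 1")]))
```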
A portion of the phonetic index 12 is illustrated, and the actual phonemes for the speech segment corresponding to “transfer me to the Washington office” are provided. Phonemes surrounding this speech segment are illustrated generically as PH. The string of phonemes for the speech segment “transfer me to the Washington office” is represented as a contiguous run of phonetic entries in the phonetic index 12.
Since the speech segment was uttered by the caller (Source 2), corresponding source information is provided in the phonetic index 12, wherein the phonemes uttered by the caller (Source 2) are associated with the caller (Source 2). Notably, this phonetic example is an over-simplified representation of a typical phonetic index. A phonetic index may also include or represent characteristics of the acoustic channel, which reflects the environment in which the speech was uttered and the transducer through which it was recorded, as well as characteristics of the natural language in which the speech is expressed. Acoustic channel characteristics include frequency response, background noise, and reverberation. Natural language characteristics include accent, dialect, and gender traits. For basic information on one technique for parsing speech into phonemes, please refer to the phonetic processing technology provided by Nexidia Inc., 3565 Piedmont Road NE, Building Two, Suite 400, Atlanta, Ga. 30305 (www.nexidia.com), and its white paper entitled Phonetic Search Technology, 2007, and the references cited therein, wherein the white paper and cited references are each incorporated herein by reference in their entireties.
As indicated, the phonetic index 12 and associated source information may be directly or indirectly associated with each other in a variety of ways. Regardless of how the source information is associated, the phonetic index 12 may also be maintained in a variety of ways. For example, the phonetic index 12 may be associated with the corresponding audio segment in a media item that includes both the audio segment and the phonetic index 12. In one embodiment, the phonetic index 12 may be maintained as metadata associated with the audio segment, wherein the phonetic index 12 is preferably, but need not be, synchronized (or time-aligned) with the audio segment. When synchronized, a particular phoneme is matched to a particular location in the audio segment where the unit of speech from which the phoneme was derived resides. Alternatively, the phonetic index 12 may be maintained in a separate file or stream, which may or may not be synchronized with the audio segment, depending on the application. When there is a need for synchronization, the phonetic index 12 may be associated with a time reference or other synchronization reference with respect to the audio segment or media item containing the audio segment. Notably, certain applications will not require the maintenance of an association between the phonetic index 12 and the audio segment from which the phonetic index 12 was derived. Similarly, the source information may be maintained with the phonetic index 12, in a separate file or stream, or in the media item containing the audio segment. Notably, certain applications will not require the maintenance of an association between the source information and the audio segment from which the phonetic index 12 was derived.
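As one example of the separate-file option mentioned above, the phonetic index 12 could be kept in a JSON sidecar that carries a time reference per entry so it remains synchronized with the audio segment. The file naming and schema here are assumptions for illustration only.

```python
# Hypothetical sidecar format: the phonetic index is stored next to the media
# item, with per-entry time references so it can be re-aligned with the audio.
import json

def save_index_sidecar(media_path, entries, sidecar_path=None):
    """Write entries of the form {'offset_s': ..., 'phoneme': ..., 'source': ...}."""
    sidecar = {"media_item": media_path, "time_aligned": True, "entries": entries}
    sidecar_path = sidecar_path or media_path + ".phonetic_index.json"
    with open(sidecar_path, "w") as f:
        json.dump(sidecar, f, indent=2)
    return sidecar_path

save_index_sidecar("call_recording.wav",
                   [{"offset_s": 10.0, "phoneme": "T", "source": "Source 2"}])
```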
By associating the phonemes with a corresponding source, the phonetic index 12 of speech content from multiple sources may be searched based on the phonetic content, corresponding source, or a combination thereof. Such searches may entail searching the phonetic index 12 based on phonetic content criteria to identify a source associated with the phonetic content, searching the phonetic index 12 based on the phonetic content criteria as well as source criteria to identify a matching location in the phonetic index or corresponding audio segment 10, and the like. Accordingly, the source information that is associated with the phonetic index 12 may be useful as a search criterion or a search result. In one embodiment, a phonetic index 12 and any source information directly or indirectly associated therewith may be searched as follows.
Initially, content criteria bearing on the desired phonetic content and source criteria bearing on the desired source are obtained via an appropriate search query, and any keywords in the content criteria are broken into their phonetic equivalents. Accordingly, the content criteria either is, or is converted into, phonetic search criteria comprising one or more strings of phonemes, which may be associated with one or more Boolean operators. The phonetic index and associated source information are then searched based on the phonetic content criteria and the source criteria (Step 206), and portions of the phonetic index that match the phonetic search criteria and correspond to the source or sources identified in the source criteria are identified (Step 208).
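For illustration, Steps 206 and 208 might be implemented along the lines of the following sketch, which scans a source-aware index for a phoneme string and keeps only matches attributable to a source named in the source criteria; the index layout matches the earlier sketches and is an assumption.

```python
# Hypothetical search over a source-aware phonetic index (Steps 206-208):
# find occurrences of a phoneme string and filter them by source criteria.
def search_index(index, phoneme_string, allowed_sources):
    """Return (offset_s, source) for each matching location.

    `index` is a list of dicts with 'phoneme', 'offset_s', and 'source' keys;
    a match is kept only if the source of its first phoneme is allowed.
    """
    n = len(phoneme_string)
    matches = []
    for i in range(len(index) - n + 1):
        window = index[i:i + n]
        if ([e["phoneme"] for e in window] == phoneme_string
                and window[0]["source"] in allowed_sources):
            matches.append((window[0]["offset_s"], window[0]["source"]))
    return matches
```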
Depending on the application, various actions may be taken in response to identifying these matching portions of the phonetic index 12. The actions taken may range from merely indicating that a match was found to identifying the locations in the phonetic index 12 or audio segment 10 wherein the matches were found. Accordingly, the speech surrounding the location of a phonetic match may be provided in a textual format, wherein the phonetic index 12 or other source is used to provide all or a portion of a transcript associated with the phonetic match. Alternatively, the portions of the audio segment that correspond to the phonetic match may be played, queued, or otherwise annotated or identified. In another example, multi-party telephone conversations may be monitored based on keywords alone, and when certain keywords are uttered, alerts are generated to indicate when the keywords were uttered and the party who uttered them. Alternatively, multi-party telephone conversations may be monitored based on keywords as well as source criteria, such that when certain keywords are uttered by an identified party or parties, alerts are generated to indicate when the keywords were uttered by the identified party or parties. The alerts may identify each time a keyword is uttered and identify the party uttering the keyword at any given time, wherein utterances of the keyword by parties that are not identified in the search criteria will not generate an alert. Those skilled in the art will recognize innumerable applications based on the teachings provided herein. Notably, the term “keyword” is generally used to identify any type of syllable, search term, sound, phrase, or utterance, as well as any series or string thereof that is associated directly or through Boolean logic.
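A minimal sketch of the alerting scenario described above is given below; the rule format and field names are assumptions chosen for the example.

```python
# Hypothetical alert generation: an alert fires only when a watched keyword is
# uttered by one of the parties identified in the rule's source criteria.
def generate_alerts(keyword_matches, rules):
    """`keyword_matches` holds (offset_s, source, keyword) tuples from the search stage."""
    alerts = []
    for offset_s, source, keyword in keyword_matches:
        for rule in rules:
            if keyword in rule["keywords"] and source in rule["sources"]:
                alerts.append({"notify": rule["notify"], "keyword": keyword,
                               "source": source, "offset_s": offset_s})
    return alerts

rules = [{"keywords": {"transfer"}, "sources": {"Source 2"}, "notify": "supervisor"}]
print(generate_alerts([(12.3, "Source 2", "transfer"),
                       (45.0, "Source 1", "transfer")], rules))   # only the first match alerts
```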
The phonetic processing system 28 may provide the functionality described above, and as such, may receive the source information and the multi-source audio content, generate a phonetic index for the multi-source audio content, and associate sources with the phonemes in the phonetic index based on the source information. Notably, the source information may be integrated with or provided separately from the multi-source audio content, depending on the application.
Also illustrated are a database 32, a search server 34, and a search terminal 36, which may also represent a communication terminal 30. The search server 34 will control searching of a phonetic index and any integrated or separate source information as described above, in response to search queries provided by the search terminal 36. Notably, the phonetic processing system 28 may be instructed by the search server 34 to search the incoming multi-source audio content in real time or access phonetic indices 12 that are stored in the database 32. When processing real-time or streaming information, the phonetic processing system 28 may generate the phonetic indices 12 and associated source information, as well as search the phonetic indices 12, the source information, or a combination thereof, in real time. Any search results may be reported to the search server 34 and on to the search terminal 36. Alternatively, the phonetic processing system 28 may generate the phonetic indices 12 and associated source information, and store the phonetic indices 12 and associated source information in the database 32, alone or in association with the multi-source audio content. If stored in the database 32, the search server 34 may access the phonetic indices 12 and associated source information for any number of multi-source audio content and provide search results to the search terminal 36.
The present invention is particularly useful in audio or video conferencing. An overview of a conference environment in which the present invention may be practiced is provided below.
The communication terminals are generally referenced with the numeral 30; however, the different types of communication terminals are specifically identified when desired with a letter V, D, or C. In particular, a voice communication terminal 30(V) is primarily configured for voice communications, communicates with the conference system 38 through an appropriate voice network 42, and generally has limited data processing capability. The voice communication terminal 30(V) may represent a wired, wireless, or cellular telephone or the like while the voice network 42 may be a cellular or public switched telephone network (PSTN).
A data communication terminal 30(D) may represent a computer, personal digital assistant, media player, or like processing device that communicates with the conference system 38 over a data network 44, such as a local area network, the Internet, or the like. In certain embodiments, certain users will have a data communication terminal 30(D) and an associated voice communication terminal 30(V). For example, a user may have an office or cellular telephone as well as a personal computer. Alternatively, a composite communication terminal 30(C) supports voice communications as well as sufficient control applications to facilitate interactions with the conference system 38 over the data network 44, as will be described further below. The composite communication terminal 30(C) may be a personal computer that is capable of supporting telephony applications or a telephone capable of supporting computing applications, such as a browser application.
In certain embodiments of the present invention, certain conference participants are either associated with a composite communication terminal 30(C) or both voice and data communication terminals 30(V), 30(D). For a conference call, each participant is engaged in a voice session, or call, which is connected to the conference bridge 40 of the conference system 38 via one or more network interfaces 46. Data or video capable terminals are used for application sharing or video presentation. A session function of the conference system 38 may be used to help facilitate establishment of the voice sessions for the conference call. In particular, the session function may represent call server functionality or like session signaling control function that participates in establishing, controlling, and breaking down the bearer paths or bearer channels for the voice sessions with the conference bridge 40.
In addition to a voice session, a control channel may also be established for each or certain participants. The control channel for each participant is provided between an associated communication terminal 30 and the conference system 38. The control channel may allow a corresponding participant to control various aspects of the conference call, receive information related to the conference call, provide information related to the conference call, and exchange information with other participants. The control channels may be established with a conference control function, which is operatively associated with the conference bridge 40 and the session control function. For participants using a composite communication terminal 30(C), control channels may be established between the composite communication terminal 30(C) and the conference control function while the voice session is established between the composite communication terminal 30(C) and the conference bridge 40. For participants using voice and data communication terminals 30(V), 30(D), control channels may be established between the data communication terminal 30(D) and the conference control function, while the corresponding voice sessions are established between the voice communication terminals 30(V) and the conference bridge 40.
Although the control channels may take any form, an exemplary control channel is provided by a web session wherein the conference control function runs a web server application and the composite communication terminal 30(C) runs a compatible browser application. The browser application provides a control interface for the associated participant, and the web server application will control certain operations of the conference system 38 based on participant input and facilitate interactions with and between the participants.
The conference bridge 40, including the session function and the conference control function, may be associated with the search server 34 and the phonetic processing system 28. As such, keyword or phonetic search queries may be received by the search server 34 from the participants via the control channels, and search results may be provided to the participants via the same control channels. The conference bridge 40 will be able to provide a conference output that represents multi-source audio content and associated source information to the phonetic processing system 28 to facilitate creation of a phonetic index 12 and associated source information for the conference output as well as searching of the phonetic index 12 and the associated source information.
As noted, the conference bridge 40 is used to facilitate a conference call between two or more conference participants who are in different locations. In operation, calls from each of the participants are connected to the conference bridge 40. The audio levels of the incoming audio signals from the different calls are monitored. One or more of the audio signals having the highest audio level are selected and provided to the participants as an output of the conference bridge. The audio signal with the highest audio level generally corresponds to the participant who is talking at any given time. If multiple participants are talking, audio signals for the participant or participants who are talking the loudest at any given time are selected.
The unselected audio signals are generally not provided by the conference bridge to conference participants. As such, the participants are only provided the selected audio signal or signals and will not receive the unselected audio signals of the other participants. To avoid distracting the conference participants who are providing the selected audio signals, the selected audio signals are generally not provided back to the corresponding conference participants. In other words, the active participant in the conference call is not fed back their own audio signal. Those skilled in the art will recognize various ways in which a conference bridge 40 may function to mix the audio signals from the different sources. As the audio levels of the different audio signals change, different ones of the audio signals are selected throughout the conference call and provided to the conference participants as the output of the conference bridge.
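The loudest-source selection and distribution behavior described above can be summarized in a few lines; the level estimate and frame format below are simplifications assumed for illustration.

```python
# Simplified conference-bridge mixing: select the loudest source for the current
# interval and distribute its audio to every participant except the active speaker.
def mix_conference(frames_by_source):
    """`frames_by_source` maps a source id to its PCM samples for this interval."""
    def level(samples):                                   # crude average-magnitude estimate
        return sum(abs(s) for s in samples) / max(len(samples), 1)
    selected = max(frames_by_source, key=lambda src: level(frames_by_source[src]))
    outputs = {src: (frames_by_source[selected] if src != selected else [])  # no self-feedback
               for src in frames_by_source}
    return selected, outputs

selected, outputs = mix_conference({"Source 1": [3, -2, 4], "Source 2": [90, -80, 70]})
print(selected, outputs)   # Source 2 is selected and is not fed back its own audio
```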
An exemplary embodiment of the conference bridge 40 is now described.
An exemplary architecture for the conference bridge 40 includes source ports, SOURCE 1-N, output ports, OUTPUT 1-N, signal normalization circuitry 48(1-N), level detection circuitry 54(1-N), an audio processing function 50, and a source selection function 52.
A source selection function 52 is used to select the source port, SOURCE 1-N, that is receiving the audio signals with the highest average level. The source selection function 52 provides a corresponding source selection signal to the audio processing function 50, and the source selection signal identifies the selected source port, SOURCE 1-N. The audio signals received at that port represent the selected audio signals to be output by the conference bridge 40. In response to the source selection signal, the audio processing function 50 will provide the selected audio signals from the selected source port, SOURCE 1-N, at all of the output ports, OUTPUT 1-N, except for the output port that is associated with the selected source port. The audio signals from the unselected source ports, SOURCE 1-N, are dropped, and therefore are not presented to any of the output ports, OUTPUT 1-N, in traditional fashion.
Preferably, the source port SOURCE 1-N providing the audio signals having the greatest average magnitude is selected at any given time. The source selection function 52 will continuously monitor the relative average magnitudes of the audio signals at each of the source ports, SOURCES 1-N, and select appropriate source ports, SOURCE 1-N, throughout the conference call. As such, the source selection function 52 will select different ones of the source ports, SOURCE 1-N, throughout the conference call based on the participation of the participants.
The source selection function 52 may work in cooperation with level detection circuitry 54(1-N) to monitor the levels of audio signals being received from the different source ports, SOURCE 1-N. After normalization by the signal normalization circuitry 48(1-N), the audio signals from the source ports, SOURCE 1-N, are provided to the corresponding level detection circuitry 54(1-N). Each level detection circuitry 54(1-N) will process corresponding audio signals to generate a level measurement signal, which is presented to the source selection function 52. The level measurement signal corresponds to a relative average magnitude of the audio signals that are received from a given source port, SOURCE 1-N. The level detection circuitry 54(1-N) may employ different techniques to generate a corresponding level measurement signal. In one embodiment, a power level derived from a running average of given audio signals, or an average power level of audio signals over a given period of time, is generated and represents the level measurement signal, which is provided by the level detection circuitry 54 to the source selection function 52. The source selection function 52 will continuously monitor the level measurement signals from the various level detection circuitry 54(1-N) and select one of the source ports, SOURCE 1-N, as a selected source port based thereon. As noted, the source selection function 52 will then provide a source selection signal to identify the selected source port, SOURCE 1-N, to the audio processing function 50, which will deliver the audio signals received at the selected source port, SOURCE 1-N, at the different output ports, OUTPUT 1-N, that are associated with the unselected source ports, SOURCE 1-N.
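One way to realize the running-average power measurement and the selection based on it is sketched below; the window length, power formula, and class names are assumptions, not the claimed circuitry.

```python
# Hypothetical level detection and source selection: each port keeps a running
# average of per-frame signal power, and the port with the highest average wins.
from collections import deque

class LevelDetector:
    def __init__(self, window=50):
        self.recent = deque(maxlen=window)     # recent per-frame power values

    def update(self, samples):
        power = sum(s * s for s in samples) / max(len(samples), 1)
        self.recent.append(power)
        return sum(self.recent) / len(self.recent)   # the level measurement signal

def select_source(detectors, latest_frames):
    """Return the source port whose running-average power is currently highest."""
    return max(latest_frames,
               key=lambda port: detectors[port].update(latest_frames[port]))

detectors = {"SOURCE 1": LevelDetector(), "SOURCE 2": LevelDetector()}
print(select_source(detectors, {"SOURCE 1": [1, -1, 2], "SOURCE 2": [50, -40, 60]}))
```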
The source selection function 52 may also provide source selection signals that identify the active source port, SOURCE 1-N, at any given time to the phonetic processing system 28. Further, the audio processing function 50 may provide the audio signals from the selected source port, SOURCE 1-N, to the phonetic processing system 28. The phonetic processing system may generate a phonetic index 12 of phonemes for the audio signals. As the selected source ports, SOURCE 1-N, change throughout the conference call, the sources of the audio signals provided to the phonetic processing system 28 change, wherein the audio signals provide multi-source audio content. The multi-source audio content effectively includes a series of speech segments from different source ports, SOURCE 1-N. Since the source selection signals identify the active source port, SOURCE 1-N, at any given time, the phonetic processing system 28 can associate a particular source port, SOURCE 1-N, with a corresponding speech segment in the multi-source audio content, and in particular, the particular section or phonemes of the phonetic index 12 that corresponds to the speech segment.
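The source selection signals could be reduced to the per-source time intervals used to label the phonetic index, for example as in the sketch below; the signal representation is an assumption for illustration.

```python
# Hypothetical conversion of source selection signals into (start, end, port)
# intervals: each change of the selected source port closes one interval.
def selection_signals_to_intervals(signals, end_of_call_s):
    """`signals` is a time-ordered list of (time_s, selected_port) changes."""
    intervals = []
    for (start, port), (end, _) in zip(signals, signals[1:] + [(end_of_call_s, None)]):
        intervals.append((start, end, port))
    return intervals

print(selection_signals_to_intervals([(0.0, "SOURCE 1"), (12.5, "SOURCE 3")], 30.0))
# -> [(0.0, 12.5, 'SOURCE 1'), (12.5, 30.0, 'SOURCE 3')]
```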
The phonetic index 12 and associated source information may be monitored in real time or may be stored for subsequent processing. When processed in real time, search queries may be employed to identify utterances by certain parties or sources, and appropriate action may be taken. The actions may include providing an alert via a control channel or other mechanism to the speaking party or the other parties, based on rules established by the speaking party or other parties. As such, the speaking party may establish rules to alert himself or other parties when the speaking party utters a certain keyword or phrase. Alternatively, a first party may establish criteria wherein they are alerted when one or more selected parties utter a certain keyword or phrase. Further, a person who is not a party to the conference call may monitor the conference call and receive alerts when a keyword is uttered. The alert may include the utterance and the source of the utterance. Alternatively, criteria may be employed wherein alerts are provided to a person who is not participating in the conference call when only selected parties utter certain keywords or phrases. Similar processing may be provided on audio files of a conference call, after the conference call has concluded. With the present invention, multiple conference calls may be analyzed at the same time, in real time or after the conference call has concluded. Accordingly, search criteria may be employed to search multiple media items based on content, source, or a combination thereof in an effective and efficient manner according to the present invention.
Those skilled in the art will recognize improvements and modifications to the preferred embodiments of the present invention. All such improvements and modifications are considered within the scope of the concepts disclosed herein and the claims that follow.