The present disclosure relates generally to the field of communications, and more particularly, to correlation of transcribed text with corresponding audio.
Transcription services are often used to convert audio communications into text. This may be used, for example, at call centers to document customer service issues, for medical or legal transcription, or for users that do not have the time to listen to voice mail messages and would rather read through the messages.
Speech recognition software may be used to transcribe audio; however, the quality is often not at an acceptable level. Another option is to send an audio file to a transcription service, at which a transcriber listens to the audio and provides a transcription. The quality of human generated transcription is generally better than that of computer generated transcription. A drawback with human generated transcription is that if the user wants to compare specific text in the transcription with the audio, there is no easy way for the user to identify the location of the text in the audio.
In one embodiment, a method generally comprises receiving at a communication device, an audio communication and a transcribed text created from the audio communication, and generating a mapping of the text to the audio communication, independent of transcribing the audio. The mapping identifies locations of portions of the text in the audio communication.
In another embodiment, an apparatus generally comprises memory for storing an audio communication and a transcribed text created from the audio communication, and a processor for generating a mapping of the text to the audio communication independent of transcribing the audio.
The following description is presented to enable one of ordinary skill in the art to make and use the embodiments. Descriptions of specific embodiments and applications are provided only as examples, and various modifications will be readily apparent to those skilled in the art. The general principles described herein may be applied to other embodiments and applications without departing from the scope of the embodiments. Thus, the embodiments are not to be limited to those shown, but are to be accorded the widest scope consistent with the principles and features described herein. For purposes of clarity, details relating to technical material that is known in the technical fields related to the embodiments have not been described in detail.
Transcription of audio into text is used in a variety of applications, including, for example, voice mail, business, legal, and medical applications. It is usually important that the audio be transcribed accurately. Software currently available to generate text based on audio generally does not provide transcriptions as accurate as human generated transcriptions. However, with conventional human generated transcription there is no way to correlate a word in the transcribed text with its location in the audio file. Thus, if a user is reading the transcribed text and wants to go back and check the audio, there is no easy way to identify the corresponding location in the audio file.
The embodiments described herein map an audio communication to a text transcription created from the audio communication. The mapping breaks down portions of the text (e.g., words, phrases, etc.) and identifies the corresponding locations (e.g., offset times) in the audio. The audio communication may be, for example, a voice mail message, recorded conversation between two or more people, medical description, legal description, or other audio requiring transcription. The transcribed text may be human generated, computer generated, or a combination thereof. In the case of computer generated transcription, the mapping is performed independent of the transcription. The mapping provides a user the ability to easily identify a point in the audio that correlates with a point in the transcribed text. The text to audio mapping may be used, for example, to check the transcribed text, fill in missing portions of the transcribed text, or confirm important information such as phone numbers, dates, or other data.
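A minimal sketch of such a mapping (hypothetical names; the disclosure does not prescribe a data structure) might pair each text portion with its character position in the transcription and its offset into the audio:

```python
from dataclasses import dataclass

@dataclass
class MappedPortion:
    text: str        # word or phrase from the transcribed text
    char_start: int  # position of the portion within the transcription
    offset_ms: int   # location of the portion in the audio communication

# Hypothetical mapping for a short voice mail message.
mapping = [
    MappedPortion("Hi", 0, 200),
    MappedPortion("this is John", 3, 450),
    MappedPortion("555-1234", 40, 4200),
]

def offset_for(mapping, word):
    """Return the audio offset (ms) of the first portion containing `word`."""
    for portion in mapping:
        if word in portion.text:
            return portion.offset_ms
    return None
```

A user reading the transcription could then jump from a word to its point in the audio; in this sketch, `offset_for(mapping, "John")` yields an offset of 450 ms.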
Referring now to the drawings and first to
In the example shown in
The LAN 12 couples multiple endpoints 24, 26 for the establishment of communication sessions between the endpoints and other endpoints distributed across multiple cities and geographic regions. The LAN 12 is coupled with the Internet 16, WAN 18, and PSTN 14 to allow communication with various devices located outside of the LAN. The LAN 12 provides for the communication of packets, cells, frames, or other portions of information between endpoints, such as computers 24 and telephones 26, which may include an IP (Internet Protocol) telephony device. IP telephony devices provide the capability of encapsulating a user's voice into IP packets so that the voice can be transmitted over the LAN 12 (as well as the Internet 16 and WAN 18). The LAN 12 may include any combination of network components, including, for example, gatekeepers, call managers, routers, hubs, switches, gateways, endpoints, or other network components that allow for the exchange of data within the network.
Endpoints 24, 26 within the LAN may also communicate with non-IP telephony devices, such as telephone 30 connected to PSTN 14. PSTN 14 includes switching stations, central offices, mobile telephone switching offices, remote terminals, and other related telecommunications equipment. Calls placed to endpoint 30 are made through gateway 28. The gateway 28 converts analog or digital circuit-switched data transmitted by PSTN 14 (or a PBX) to packet data transmitted by the LAN 12 and vice-versa. The gateway 28 also translates between a VoIP (Voice over IP) call control system and a Signaling System 7 (SS7) or other protocols used in the PSTN 14.
Calls may also be made between endpoints 24, 26 in the LAN 12 and other IP telephony devices located in the Internet 16 or WAN 18. A router 38 (or other network device such as a hub or bridge) directs the packets to the IP address of the receiving device.
In the example shown in
The voice mail system 20 operates in connection with the endpoints 24, 26 coupled to the LAN 12 to receive and store voice mail messages for users of endpoints 24, 26, as well as for certain remote devices located outside of the LAN. When a user is participating in a previous call or is otherwise unavailable to take the incoming call, the call may be forwarded to the voice mail system 20. The voice mail system 20 may answer the call and provide an appropriate message to the user requesting that the caller leave a voice mail message. The voice mail system 20 and call manager 22 may be located at separate devices as shown in
It is to be understood that the communication network shown in
In one embodiment, the device 50 is in communication with a transcription center 74. The transcription center 74 may be a voice message conversion service that provides human generated transcription, computer generated transcription, or a combination thereof. For example, the transcription services may be provided by a company such as SpinVox, a subsidiary of Nuance Communications of Marlow, UK.
As shown in
The processor 52 may be a microprocessor, controller, or any other suitable computing device. As described below, the processor 52 operates to receive and process voice mail messages intended for end users associated with the endpoints. During the mapping of text to audio, the processor 52 sends information to and receives information from the speech recognition engine 56. The processor 52 also operates to store information in and retrieve information from memory 54. Logic may be encoded in one or more tangible media for execution by the processor 52. For example, the processor 52 may execute codes stored in the memory 54. Program memory is one example of a computer-readable medium. Program memory 54 may be any form of volatile or non-volatile memory including, for example, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), removable media, or any other suitable local or remote memory component.
The speech recognition engine 56 may be any combination of hardware, software, or encoded logic that operates to receive and process speech signals from the processor 52. Where the received signals are analog signals, the speech recognition engine 56 may include a voice board that provides analog-to-digital conversion of the speech signals. A signal processing module may take the digitized samples and convert them into a series of patterns. The patterns may then be compared to a set of stored models that have been constructed from knowledge of acoustics, language, and dictionaries, for example.
The speech recognition engine 56 is configured to match words or phrases in the audio communication to words or phrases contained within the text transcribed from the audio communication. Therefore, the speech recognition engine 56 does not need to be a sophisticated engine operable to convert speech to text. Instead, the engine is only required to match words or phrases within the text to the corresponding words or phrases in the audio. Thus, the speech recognition engine 56 may be a low quality, low overhead speech recognition engine.
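Because the engine only has to locate words it is already given, the matching step can be sketched as a simple greedy alignment (a hypothetical helper under simplifying assumptions; a real engine would also handle misrecognitions and repeated words):

```python
def align(transcript_words, recognized):
    """Map transcript word indices to audio offsets.

    `recognized` is a list of (word, offset_ms) hits from a low-overhead
    recognizer that only needs to spot words already in the transcript.
    """
    mapping = {}
    i = 0
    for word, offset_ms in recognized:
        # Advance through the transcript until the spotted word is found.
        while i < len(transcript_words) and transcript_words[i].lower() != word.lower():
            i += 1
        if i < len(transcript_words):
            mapping[i] = offset_ms
            i += 1
    return mapping

# Hypothetical transcript and recognizer hits for a short message.
transcript = "Hi this is John please call me back".split()
hits = [("Hi", 200), ("John", 1100), ("call", 2400)]
word_offsets = align(transcript, hits)
```

Here `word_offsets` maps transcript positions 0, 3, and 5 to their audio offsets, illustrating why a low quality, low overhead engine suffices for the matching task.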
In one embodiment, the speech recognition engine 56 uses the audio communication recorded by the voice mail system 20 and the transcribed text received from the transcription center 74. In another embodiment, the communication device 50 (or other network device in communication with the device 50) is configured with a speech recognition engine operable to generate the text transcription of the audio communication. In this case, the speech recognition engine 56 uses the computer generated transcription, rather than a transcription from the transcription center 74, for correlation with the audio communication. As previously noted, the text to audio mapping is performed independent of transcribing the audio in the case of computer generated transcription. Thus, even if the same speech recognition engine is used for the transcription and the mapping, these steps are performed independently of one another.
In one embodiment, audio communication 62 and transcribed text 64 are stored in memory 54 along with the generated text to audio mapping 70 (
The portion of text mapped to the audio may be individual words, numbers, or other data; phrases (e.g., groups of words or numbers); key phrases (e.g., phone numbers, dates, locations); or any other identifiable sound or group of sounds. In one embodiment, the speech recognition engine 56 uses isolated word and phrase recognition to recognize a discrete set of command words, phrases, or patterns, or uses key word spotting to pick out key words and phrases from a sentence of extraneous words. For example, the speech recognition engine 56 may use key word spotting to identify strings of numerals, times, or dates and store the offset positions of these key words and phrases in memory.
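Key word spotting over the transcribed text can be approximated with pattern matching; the sketch below (an illustrative regular expression, not the engine's actual method) locates phone numbers, dates, and times together with their character offsets:

```python
import re

KEY_PHRASE = re.compile(
    r"\b(?:\d{3}[-.\s]?\d{4}"            # 7-digit phone numbers, e.g. 555-1234
    r"|\d{1,2}/\d{1,2}(?:/\d{2,4})?"     # dates, e.g. 3/15 or 3/15/2010
    r"|\d{1,2}:\d{2}\s?(?:am|pm)?)\b",   # times, e.g. 4:30 pm
    re.IGNORECASE,
)

def spot_key_phrases(text):
    """Return (phrase, char_offset) pairs for numerals, times, and dates."""
    return [(m.group(0), m.start()) for m in KEY_PHRASE.finditer(text)]
```

For instance, `spot_key_phrases("Call me at 555-1234 before 4:30 pm")` picks out the phone number and the time with their positions in the text, which could then be mapped to audio offsets.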
The text to audio mapping for audio A (
Upon receiving the voice mail message, the communication device 50 identifies the message with an audio ID A. The audio communication is then transcribed to provide text transcription A (shown in
In one embodiment, the location is an offset position (e.g., time in milliseconds) measured from the beginning of the audio communication. For example, the word ‘Hi’ in the text A is matched in the corresponding audio at an offset of 200 milliseconds from the beginning of the audio file (
The transcribed text is presented to the user in a graphical user interface (GUI) at the user device, with embedded links at tagged words or phrases.
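One way to embed such links (hypothetical URL scheme and helper name; the disclosure does not specify the GUI markup) is to render each mapped portion as a hyperlink carrying its audio offset:

```python
import html

def render_with_links(mapping, audio_id):
    """Render mapped text portions as HTML links that seek into the audio.

    `mapping` is a list of (text_portion, offset_ms) pairs; the /play URL
    is an assumed player endpoint, not part of the disclosure.
    """
    parts = [
        f'<a href="/play?audio={audio_id}&offset={offset_ms}">{html.escape(text)}</a>'
        for text, offset_ms in mapping
    ]
    return " ".join(parts)

# Hypothetical rendering for audio ID A.
page = render_with_links([("Hi", 200), ("this is John", 450)], "A")
```

Clicking a tagged word or phrase would then direct the player to the corresponding location in the stored audio communication.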
Referring again to
In the example shown in
In one embodiment, the text/audio mapping 70 is stored in memory 54 along with the recorded audio communication 62 and transcribed text 64 (
As previously noted, the endpoint may also be configured to correlate the audio and text and generate the text/audio mapping 70. In this embodiment, the voice mail system 20 may store the audio communication 62 and corresponding text 64, and upon receiving a request from the user, transmit both the audio and text to the user. The speech recognition engine at the user device then uses the audio and text files 62, 64 to generate the mapping 70.
Although the method and apparatus have been described in accordance with the embodiments shown, one of ordinary skill in the art will readily recognize that there could be variations made to the embodiments without departing from the scope of the embodiments. Accordingly, it is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.
Number | Name | Date | Kind |
---|---|---|---|
5568540 | Greco et al. | Oct 1996 | A |
6035017 | Fenton et al. | Mar 2000 | A |
6263308 | Heckerman et al. | Jul 2001 | B1 |
6687339 | Martin | Feb 2004 | B2 |
6775360 | Davidson et al. | Aug 2004 | B2 |
6850609 | Schrage | Feb 2005 | B1 |
7225126 | Hirschberg et al. | May 2007 | B2 |
7966181 | Hirschberg et al. | Jun 2011 | B1 |
8064576 | Skakkebaek et al. | Nov 2011 | B2 |
20020143533 | Lucas et al. | Oct 2002 | A1 |
20030220784 | Fellenstein et al. | Nov 2003 | A1 |
20050055213 | Claudatos et al. | Mar 2005 | A1 |
20060085186 | Ma et al. | Apr 2006 | A1 |
20060182232 | Kerr et al. | Aug 2006 | A1 |
20070081636 | Shaffer et al. | Apr 2007 | A1 |
20070106508 | Kahn et al. | May 2007 | A1 |
20070233487 | Cohen et al. | Oct 2007 | A1 |
20080037716 | Bran et al. | Feb 2008 | A1 |
20080065378 | Siminoff | Mar 2008 | A1 |
20080255837 | Kahn et al. | Oct 2008 | A1 |
20080294433 | Yeung et al. | Nov 2008 | A1 |
20080319743 | Faisman et al. | Dec 2008 | A1 |
20090099845 | George | Apr 2009 | A1 |
20100145703 | Park | Jun 2010 | A1 |
Entry |
---|
http://www.avid.com/US/solutions/workflow/Scriptbased-Editing. |
http://www.spinvox.com/how_it_works.html. |
Number | Date | Country | |
---|---|---|---|
20110231184 A1 | Sep 2011 | US |