The present embodiments relate to voice response and a voice server.
The development of voice recognition technology and Voice over IP (“VoIP”), together with the newly emerged advanced “voice server” (in contrast with keystroke-style menu selection), promotes fully automatic Interactive Voice Response (“IVR”) applications. A user may obtain end-to-end service from an enterprise or operator through such applications. For example, when a customer calls the service hotline of a household electrical appliance manufacturer and says “refrigerator”, the customer is connected to the relevant department, which reduces the calling time. In the field of telecom value-added services such as Number Best Tone, operators also provide a voice recognition user experience. In another application field, data entry, voice technology is significantly advantageous over keystroke-style IVR. For example, some U.S. airlines have recently promoted fully automatic systems for booking tickets by telephone. Such applications would be impossible with keystroke-style dial-up alone.
In voice technologies, a user interacts with a system by speaking and listening. The interface for this is known as a Voice User Interface (“VUI”). The VUI aims to present a correct result on the first interaction, so as to minimize the number of user confirmations and the number of error recoveries.
The following example shows an interaction between a user and a flight information system:
System: Hello, thanks for calling “Blue Sky” Airlines. Our newest automatic system may help you to inquire about flight information you need. Do you know the flight number?
User: Sorry, I don't know.
System: Never mind, tell me the departure city of the flight, please.
User: Beijing.
Referring to
A user dials a telephone number of the voice server via the telephone, and the exchange switches on a transmission channel between the telephone and the voice server.
The voice server plays a greeting or operation prompt. More specifically, the service control module obtains a text response from the service processing module, invokes (uses) the TTS (Text to Speech) technology of the voice processing module to transform the text response into speech, and sends the speech to the telephone via the exchange.
The user interacts with the voice server by voice, and the service control module forwards the voice from the telephone to the voice processing module. The voice processing module performs ASR (Automatic Speech Recognition) and returns text to the service control module. The service control module forwards the text to the service processing module.
If the voice is correctly recognized as text, the service processing module implements the service and reports the result to the user. If the voice is not recognized or is ambiguous, the service processing module prompts the user to confirm the result or reports an error.
The user continues to interact with the voice server via voices, or hangs up.
In
The IVR system shown in
The present embodiments may obviate one or more of the drawbacks or limitations inherent in the related art. For example, in one embodiment, a voice server provides a visual interface while providing a voice recognition interactive interface.
In one embodiment, a voice response method includes obtaining a voice service request; transforming (generating) the voice service request into a text service request; obtaining corresponding voice response data and visual response data according to the text service request; and transmitting the voice response data and the visual response data.
In one embodiment, a voice server includes a service processing module, a service control module and a voice processing module. The voice processing module transforms a received voice service request into a text service request. The service processing module obtains corresponding voice response data and visual response data according to the text service request. The service control module may transmit the voice response data and visual response data.
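The division of work among the three modules can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the class and method names, the `Response` type, the in-memory response table, and the stand-in recognizer (which merely decodes bytes instead of performing real speech recognition) are all hypothetical and do not come from the embodiments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Response:
    voice: bytes   # voice response data (e.g. encoded audio)
    visual: str    # visual response data (plain text here, for simplicity)


class VoiceProcessingModule:
    """Transforms a voice service request into a text service request."""

    def __init__(self, recognizer: Callable[[bytes], str]) -> None:
        self._recognizer = recognizer  # stand-in for a real ASR engine

    def to_text(self, voice_request: bytes) -> str:
        return self._recognizer(voice_request)


class ServiceProcessingModule:
    """Holds voice and visual response data keyed by text service request."""

    def __init__(self, responses: Dict[str, Response]) -> None:
        self._responses = responses

    def lookup(self, text_request: str) -> Optional[Response]:
        return self._responses.get(text_request.strip().lower())


class ServiceControlModule:
    """Coordinates the other two modules and transmits the result."""

    def __init__(
        self,
        voice_proc: VoiceProcessingModule,
        service_proc: ServiceProcessingModule,
        transmit: Callable[[bytes, str], None],
    ) -> None:
        self._voice_proc = voice_proc
        self._service_proc = service_proc
        self._transmit = transmit

    def handle(self, voice_request: bytes) -> bool:
        text_request = self._voice_proc.to_text(voice_request)
        response = self._service_proc.lookup(text_request)
        if response is None:
            return False  # nothing stored for this request
        self._transmit(response.voice, response.visual)
        return True


# Demo with a fake recognizer and a transmit function that records its calls.
sent: List[Tuple[bytes, str]] = []
control = ServiceControlModule(
    VoiceProcessingModule(lambda audio: audio.decode("utf-8")),
    ServiceProcessingModule(
        {"beijing": Response(b"<audio: flights from Beijing>",
                             "Flights departing from Beijing")}
    ),
    lambda voice, visual: sent.append((voice, visual)),
)
handled = control.handle(b"Beijing")
```

In this sketch the transmit step is injected as a callable, so the same control logic could sit in front of an exchange, a softswitch, or a test harness.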
In one embodiment, a man-machine interactive interface provides a combination of voices and visual response data. The interaction may provide a visual interface even when prompt tones are not recognizable. The user is allowed to interrupt by voice and respond to the result even before the prompt tones end, so as to speed up the voice interaction. The interactive interface need not repeatedly play prompt tones when a user does not understand or hear them.
Embodiments will be illustrated with reference to the accompanying drawings.
The voice server includes a service processing module, a service control module, and a voice processing module. The service control module is connected to the exchange. The voice processing module is adapted (operable) to transform a received voice service request into a text service request. The voice service request may be obtained from the service control module or directly through an interface. Voice response data and visual response data (such as text messages, images, and streaming media) associated with the text service request are stored in the service processing module. The service processing module obtains the corresponding voice response data and visual response data according to the text service request. The service control module is connected to the service processing module, and is adapted to control the service processing module and to return the voice response data and visual response data obtained by the service processing module to the telephone via the exchange, so as to provide them to the user.
The telephone includes a display module. The voice server transmits texts (text messages), images, or streaming media to the telephone while transmitting voices. The voice server may use a communication channel, an audio communication channel, and signaling for transmission. The telephone displays texts, images, or streaming media contents using the display module.
The IVR system may display a synthetic face (e.g., a virtual compere) while the user listens to the computer-generated voice. The synthetic face makes the man-machine interactive interface friendlier and more harmonious.
The voice server includes a transforming unit and a second voice processing module when the service processing module has text response data associated with a text service request. The transforming unit may be an independent module or in the service control module. The transforming unit is adapted to transform text response data into images and/or media streams. The second voice processing module is adapted to transform text response data into voice response data. The second voice processing module may be an independent module or set in the voice processing module. In this case, the service control module is adapted to control the service processing module to obtain text response data from the service processing module. The service control module may invoke (use) Text to Speech (“TTS”) technology of the second voice processing module to transform the text response data into voice response data. The service control module may control the transforming unit to invoke (use) Text-to-Visual Speech (“TTVS”) technology to transform text response data into images or streaming media.
The telephone voice system provides accompanying text, a graphic visual interface, or a video interface, in addition to (in combination with) a voice interactive interface. Combining voice and visual information improves the speed and efficiency of voice interaction. The man-machine interactive interface is friendly and harmonious.
The voice, text, image, and video data may be transmitted over any transport network or protocol. For example, the texts (text messages), images, and streaming media may be transferred through a Public Switched Telephone Network (“PSTN”), an Internet Protocol (IP)-based switched network, and IP-based protocols (such as the Session Initiation Protocol (“SIP”)). The telephone may be a VoIP telephone, a plain old telephone service (“POTS”) telephone, an intelligent terminal, or a mobile phone.
Obtaining corresponding voice response data and visual response data according to (associated with) the text service request may include obtaining the corresponding voice response data and/or visual response data directly according to the text service request if there is voice response data and/or visual response data associated with the text service request. Obtaining corresponding voice response data and visual response data according to (associated with) the text service request may include obtaining the corresponding text response data according to the text service request if there is text response data associated with the text service request.
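The two cases above (stored voice or visual data used directly, or stored text that is converted) can be sketched as a single lookup helper. This is a hedged illustration only: `StoredEntry`, `resolve_response`, and the fake `tts`/`ttvs` callables are hypothetical names, and letting stored data and converted text fill in for each other within one entry is an assumption of this sketch, not something the embodiments specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class StoredEntry:
    """What the service processing module holds for one text service request."""
    voice: Optional[bytes] = None    # ready-made voice response data
    visual: Optional[bytes] = None   # ready-made visual response data
    text: Optional[str] = None       # text response data to be converted


def resolve_response(
    entry: StoredEntry,
    tts: Callable[[str], bytes],     # stand-in for Text to Speech
    ttvs: Callable[[str], bytes],    # stand-in for Text-to-Visual Speech
) -> Tuple[Optional[bytes], Optional[bytes]]:
    """Return (voice, visual), using stored data first and text as a fallback."""
    voice, visual = entry.voice, entry.visual
    if entry.text is not None:
        if voice is None:
            voice = tts(entry.text)
        if visual is None:
            visual = ttvs(entry.text)
    return voice, visual


# Fake converters that only tag the text; real engines would return media.
def fake_tts(text: str) -> bytes:
    return b"tts:" + text.encode("utf-8")


def fake_ttvs(text: str) -> bytes:
    return b"ttvs:" + text.encode("utf-8")


direct = resolve_response(StoredEntry(voice=b"v", visual=b"i"), fake_tts, fake_ttvs)
converted = resolve_response(StoredEntry(text="hello"), fake_tts, fake_ttvs)
```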
If the visual response data is text or an image, the text or the image is transmitted to the user through signaling. If the visual response data are streaming media, a streaming media communication channel is established and the streaming media is transmitted to the user through the streaming media communication channel.
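The transport rule in the preceding paragraph amounts to a small mapping from the kind of visual response data to the path it travels on. The sketch below shows that mapping; the enum and function names are hypothetical and chosen only for illustration.

```python
from enum import Enum


class VisualKind(Enum):
    TEXT = "text"
    IMAGE = "image"
    STREAMING = "streaming media"


class Transport(Enum):
    SIGNALING = "signaling"
    MEDIA_CHANNEL = "streaming media communication channel"


def select_transport(kind: VisualKind) -> Transport:
    """Text and images ride on signaling; streaming media needs its own channel."""
    if kind in (VisualKind.TEXT, VisualKind.IMAGE):
        return Transport.SIGNALING
    return Transport.MEDIA_CHANNEL
```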
The method further includes determining the visual response data to be transmitted to the user. Determining the visual response data may include receiving information on service capability that the terminal supports reported by the user, and determining corresponding visual response data according to the information on service capability.
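One simple way to picture this determination step is as a filter: the server keeps only the kinds of visual response data that the terminal reported as supported. The following sketch assumes capabilities are reported as a set of kind names such as `"text"`, `"image"`, and `"streaming"`; that representation and the function name are hypothetical.

```python
from typing import Dict, Set


def determine_visual_data(
    available: Dict[str, bytes],   # visual response data keyed by kind
    supported: Set[str],           # kinds the terminal reported as supported
) -> Dict[str, bytes]:
    """Keep only the visual response data the terminal is able to present."""
    return {kind: data for kind, data in available.items() if kind in supported}


available_data = {"text": b"Flight list", "image": b"<png>", "streaming": b"<video>"}
for_text_only_phone = determine_visual_data(available_data, {"text"})
```

A terminal that reports no visual capability would receive an empty selection and fall back to voice alone.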
The telephone transmits an INVITE message to the voice server when a user dials the number. The voice server returns a 200 OK message. The INVITE message and the 200 OK message carry an identifier indicating whether the telephone supports text messages, images, or streaming media, and carry a Session Description Protocol (SDP) body that describes the media streams.
An audio communication channel is established between the telephone and the voice server after SDP negotiation using the INVITE and 200 OK messages that carry SDP. If it is determined that the telephone supports text messages, text messages are exchanged between the telephone and the voice server via signaling and voices are exchanged through the audio communication channel. If it is determined that the telephone supports streaming media, a video communication channel is established between the telephone and the voice server, and streaming media is exchanged through the video communication channel. If it is determined that the telephone supports image information, images are exchanged via signaling between the telephone and the voice server.
The following example illustrates using SIP to transmit voice response data and visual response data.
In the example, a user dials 911, and the telephone transmits an INVITE message as follows:
The telephone transmits an INVITE message to the voice server, indicating that it intends to establish a video channel and an audio channel, and informing the voice server that the telephone supports text messages (MESSAGE) and images (INFO). The voice server returns a 200 OK message as follows:
Using the MESSAGE and INFO entries of the Allow field in the INVITE message, the voice server determines that the telephone accepts text messages and images. Text messages are sent via the MESSAGE method and images are sent via the INFO method.
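The Allow-field check can be sketched as a small header parser. The INVITE below is a made-up, heavily abbreviated message written only for this illustration; it is not the message from the example above, and its addresses are placeholders. The function names are hypothetical, and the parser ignores details such as folded header lines.

```python
from typing import Dict, Set

# Hypothetical, abbreviated INVITE used only to exercise the parser.
EXAMPLE_INVITE = (
    "INVITE sip:ivr@example.com SIP/2.0\r\n"
    "From: <sip:user@example.com>;tag=1\r\n"
    "To: <sip:ivr@example.com>\r\n"
    "Allow: INVITE, ACK, BYE, CANCEL, MESSAGE, INFO\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def allowed_methods(sip_message: str) -> Set[str]:
    """Collect the methods listed in every Allow header of a SIP message."""
    methods: Set[str] = set()
    for line in sip_message.split("\r\n")[1:]:   # skip the start line
        if line == "":
            break                                # blank line ends the headers
        name, sep, value = line.partition(":")
        # Header names are case-insensitive; method names are kept as written.
        if sep and name.strip().lower() == "allow":
            methods.update(m.strip() for m in value.split(",") if m.strip())
    return methods


def visual_capabilities(sip_message: str) -> Dict[str, bool]:
    """Map Allow entries to the capabilities described in the text."""
    methods = allowed_methods(sip_message)
    return {"text": "MESSAGE" in methods, "image": "INFO" in methods}


capabilities = visual_capabilities(EXAMPLE_INVITE)
```

Streaming media support is not derived from the Allow field in this sketch, since the text ties it to the media description negotiated through SDP.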
The specific standards are as follows:
RFC 3261 describes SIP.
RFC 3264 describes SDP session negotiation (the offer/answer model).
RFC 3428 describes sending and receiving text with the MESSAGE method.
RFC 2976 describes the INFO method.
In a PSTN network, the above functions and standards may be implemented via an H.320 protocol.
In addition, the exchange in the IVR system may be substituted with a software-switching device or a router.
Those skilled in the art will understand that the present embodiments can be implemented with software and the necessary hardware, or entirely with hardware; in many cases, the former is preferred. Based on this understanding, the contribution of the solution of the invention to the prior art may be embodied entirely or partially in software. The software may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and includes instructions that cause a computer device (a personal computer, server, network device, etc.) to carry out the method of the embodiments or parts of an embodiment of the present invention.
The preferred embodiments of the present invention discussed above are in no way intended to limit the scope of the present invention. Any modifications, alterations, and improvements made within the spirit and principle of the present invention fall within the scope of the invention as defined by the accompanying claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 200610157787.8 | Dec 2006 | CN | national |
This application is a continuation application of international application No. PCT/CN2007/071104, filed Nov. 21, 2007, which claims the benefit of Chinese Patent Application No. 200610157787.8, filed Dec. 26, 2006, the contents of which are both incorporated in their entireties by reference.
| Relation | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/CN2007/071104 | Nov 2007 | US |
| Child | 12132185 | | US |