This invention relates generally to wireless telephony, and more particularly to push-to-talk (PTT) telephony.
In the prior art, push-to-talk (PTT) is used in wireless communications to select when a voice signal is transmitted. Frequently, a button on a microphone provides the PTT function. PTT is also useful when the voice signal is further processed by a speech recognition system. PTT reduces speech recognition errors due to noise.
Other related prior art is described in U.S. Patent Application 20040100987 “Method for managing two-way alternate communication in semi-duplex mode through a packet switching transport network,” and U.S. Pat. No. 6,741,952, “Instrument timing using synchronized clocks” and U.S. Pat. No. 6,748,053, “Relay for personal interpreter.”
The invention provides push-to-talk (PTT) for conventional wireless telephony devices, such as cellular telephones (cell phones) that are otherwise not designed to support PTT. The PTT according to the invention concurrently uses both a voice and data channel of a wireless network, such as a cellular telephone network.
In a cell phone, a microphone is always ‘on’ when the cell phone is in use. In conventional cell phones there is no capability to turn the microphone ‘off’ to provide a true PTT function. Consequently, the cell phone continuously transmits a voice signal on a voice channel when in use. This becomes a particular problem if the voice signal is to be processed by an automatic speech recognition system.
Therefore, the invention provides the cell phone with a PTT button that signals PTT ‘on’ and ‘off’ events. The events can be time-stamped according to a clock of the cell phone. Alternatively, the ‘on’ and ‘off’ events form a sequence of pairs without a time-stamp.
In any case, unlike the prior art, the events and optional time-stamps are transmitted as messages, e.g., data packets, on a data channel of the cellular telephone network, which is separate from the voice channel. It should be noted that the voice and data channels can have different bandwidth and latency characteristics.
A server connected to the network receives the voice and data signals. The server includes an automatic speech recognition (ASR) system and a clock. The clock can be synchronized with the clock of the cell phone, although this is not a requirement to work the invention.
The server correlates the received PTT ‘on’ and ‘off’ events with the buffered voice signal, selects for processing those segments of the voice buffer that fall inside PTT ‘on’ windows, and discards segments that fall inside ‘off’ windows.
The ASR receives and processes the ‘on’ segments and generates corresponding text. The text can be analyzed by a dialog manager, which also provides results in the form of text. The text results are converted to speech and sent back to the cell phone.
The server can also generate a short “tone” to indicate when the user is to speak. This prevents distractions and miscommunications because the cell phone and the ASR system and application operate asynchronously, and the voice and data channels have different latencies.
Network
The cellular network 150 supports a wireless voice channel 151 and a wireless data channel 152. The wireless voice channel provides a relatively low, fixed data rate connection with a low latency, usually considerably less than a second. This is necessary for two-way voice communications.
The wireless data channel provides a relatively high, variable data rate connection with a potentially high latency, as much as tens of seconds. The latency of the wireless data channel varies because of communications with other devices in a cell, devices on the IP network, or server load.
The connections 153 and 154 between the cellular network and the server are usually wired. The public switched telephone network (PSTN) voice channel 153 is also low latency. Latencies on the wired data channel 154, e.g., the Internet, also vary due to changing routing and traffic conditions and server load. Therefore, various buffers of the server, described below, are sized according to inherent and unavoidable delays 155 in the data channels 152-154.
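By way of illustration only, the buffer sizing described above can be sketched as follows. The sample rate, sample width, and latency figures are illustrative assumptions, not part of the description; the sketch merely shows that the voice buffer must cover the worst-case data-channel delay plus a margin.

```python
# Minimal sketch: size the server-side voice buffer to cover the
# worst-case delay on the data channels.  All names and values below
# are illustrative assumptions.

SAMPLE_RATE_HZ = 8000        # typical narrowband telephony rate
BYTES_PER_SAMPLE = 2         # 16-bit linear PCM
MAX_DATA_LATENCY_S = 30.0    # "as much as tens of seconds"
SAFETY_MARGIN_S = 5.0        # headroom for routing and load variation

def voice_buffer_bytes(sample_rate=SAMPLE_RATE_HZ,
                       bytes_per_sample=BYTES_PER_SAMPLE,
                       max_latency=MAX_DATA_LATENCY_S,
                       margin=SAFETY_MARGIN_S):
    """Bytes of audio the buffer must hold so that a PTT event
    arriving up to (max_latency + margin) seconds late can still be
    matched against the corresponding voice samples."""
    return int(sample_rate * bytes_per_sample * (max_latency + margin))
```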
Cell Phone
The cell phone 110 includes a microphone 121 for speech input 101, a speaker 122 for audio output 102, a push-to-talk (PTT) button 123, and a clock 124.
The PTT button can be implemented as a ‘soft’ button, a ‘touch’ panel button, and the like. The cell phone also includes other buttons, such as alpha-numeric keys and control buttons. A selected one of these buttons can be designated the PTT button by programming the cell phone accordingly, perhaps using the conventional user interface provided with most cell phones for setting up user preferences; alternatively, the button can be selected by the server.
However, the cell phone 110 lacks the ability to provide the PTT function itself, and the cell phone does not provide an application programming interface (API) for cell phone application programs to intermittently enable and disable the voice channel as is done in conventional PTT devices. Typically, the cell phone microphone is always ‘on.’
PTT Events
Instead, pushing and releasing the PTT button 123 causes the cell phone to generate PTT ‘on’ and ‘off’ events. The PTT events are time-stamped according to a value of the clock 124 when the PTT button is pressed or released. The events and time-stamps are transmitted from the client to the server as data messages on the data channel.
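By way of illustration only, a PTT event message as described above can be sketched as follows. The on-the-wire field names (`event`, `timestamp`) and the JSON encoding are illustrative assumptions; the description only requires that each event carry its type and a value of the clock 124.

```python
import json
import time

def make_ptt_event(event_type, clock=time.time):
    """Build a time-stamped PTT 'on' or 'off' event, serialized as a
    data message to be transmitted on the data channel.  The field
    names and JSON encoding are assumptions for illustration."""
    if event_type not in ("on", "off"):
        raise ValueError("event_type must be 'on' or 'off'")
    return json.dumps({"event": event_type, "timestamp": clock()})
```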
Server
Voice signals and data messages received by the server from the client via the voice and data channels are processed by the speech server 160. The voice signal is stored in a voice buffer 180. Each audio sample of the voice signal in the voice buffer can be related to the time the sample was received.
Selected segments of the voice signal are processed by an automated speech recognition (ASR) system 182. Only those segments of the voice signal between a pair of ‘on’ and ‘off’ events are processed. The time-stamps can be used to accurately locate such segments.
However, it should be noted that other speech processing techniques can be used instead or in addition to the time-stamp events. For example, an ASR end pointing system can detect the start of speech that is substantially concurrent with the ‘on’ event, and the end of speech that is substantially concurrent with the immediately following ‘off’ event.
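By way of illustration only, the time-stamp-based segment selection described above can be sketched as follows. The function name and the handling of a trailing unmatched ‘on’ event are illustrative assumptions; the description only requires that segments between each ‘on’/‘off’ pair be passed to the ASR system.

```python
def select_on_segments(events, buffer_duration):
    """Pair consecutive ('on', t) / ('off', t) events, assumed sorted
    by time-stamp, and return the (start, end) windows, in seconds,
    of voice-buffer audio to pass to the ASR system.  A trailing
    unmatched 'on' is closed at the end of the buffer (an assumption
    made for this sketch)."""
    windows = []
    t_on = None
    for kind, t in events:
        if kind == "on" and t_on is None:
            t_on = t
        elif kind == "off" and t_on is not None:
            windows.append((t_on, t))
            t_on = None
    if t_on is not None:
        windows.append((t_on, buffer_duration))
    return windows
```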
Segments of recognized speech are stored in a text buffer 183. The recognized text can have any known format. An application 190, under control of a selector 170, can process the text. For example, the application is a dialog manager of a voice query system.
The selector 170 receives the PTT events and synchronizes the events according to the server clock 171. Techniques for remotely synchronizing clocks and for correcting for clock drift are well known.
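By way of illustration only, one well-known synchronization technique is the two-way time-transfer offset estimate used by protocols such as NTP; the sketch below shows how such an offset could map a client-side PTT time-stamp onto the server clock 171. The function names are illustrative assumptions.

```python
def clock_offset(t0, t1, t2, t3):
    """Classic two-way offset estimate (as in NTP): t0 and t3 are the
    client's send and receive times for a probe, t1 and t2 the
    server's receive and send times.  Returns the estimated
    server-minus-client clock offset."""
    return ((t1 - t0) + (t2 - t3)) / 2.0

def to_server_time(client_timestamp, offset):
    """Map a client-side PTT event time-stamp onto the server clock."""
    return client_timestamp + offset
```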
The events are used to access segments of the voice signal in the voice buffer 180 to be used by the ASR system 182, or to select text from the text buffer 183, which preserves the time-stamps.
The application 190 also receives the PTT events, which the application uses to control the operation of a text-to-speech engine (TTS) 191. The output speech signal of the application is in response to the input voice signal.
In this case, the PTT button can be used to control the feedback from the server. The button can also activate a short tone. This prevents distractions and miscommunications that might otherwise result because the cell phone and the ASR system and application operate asynchronously, and the voice and data channels have different latencies.
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.