The present invention relates generally to telecommunication and in particular to hybrid audio-visual communication.
Visual communication over traditionally voice centric communication systems, such as Push-To-Talk (PTT) radio systems and cellular telephone systems, is highly desirable because facial expressions and head/body gestures play a very important role in face-to-face human communications. Video communication is an example of natural visual communication, whereas avatar based communication is an example of synthetic visual communication. In the latter case, an avatar representing a user is animated at the receiving terminal. The term avatar generally refers to a model that can be animated to generate a sequence of synthetic images.
Push-to-talk (PTT) is a half-duplex communication control scheme that is very cost effective for group communications. It is still popular after several decades of deployment. Visual communication over traditionally voice centric PTT is highly desirable because facial expressions and head/body gestures play a very important role in face-to-face human communication. In other words, visual communication over PTT makes communication between individuals more effective. Video communication is just one type of visual communication; avatar based communication is another. In the latter case, an avatar representing a user is animated at the receiving terminals. The sender controls the avatar using facial animation parameters (FAPs) and body animation parameters (BAPs). It is widely recognized that users can express themselves better by choosing the appropriate avatars and exaggerating/distorting emotions.
Solutions already exist for push-to-talk, push-to-view (images), and push-to-video. In the case of push-to-video, the sender's video is transmitted in real time to all receiving terminals. However, these solutions built on top of PTT do not solve the more general issue of allowing heterogeneous PTT phones to operate together seamlessly for visual communications, with minimum user setup and maximum flexibility for self-expression, including the use of avatars.
The support of natural and/or synthetic visual communications is problematic because user equipment has a variety of multimedia capabilities. PTT phones generally fall into the following categories that are capable of:
One problem is how to animate an avatar on a user terminal that can decode video but cannot render synthetic images. Another problem is how to allow a user to select between video and avatar images if the user terminal supports both capabilities. Another problem is how to adapt to fluctuation of channel capacity so that when QoS degrades, video can be switched to avatar communication (which usually requires much less channel bandwidth than video). A still further problem is how, when and where to perform necessary transcoding in order to bridge terminals having different capabilities. For example, how is a voice call from a voice-only sending terminal to be visualized on a receiving terminal that is video or avatar capable?
Techniques are known for viewing images (push-to-view) or video (push-to-video) over push-to-talk systems. In addition, a receiving terminal may select an avatar to be displayed using the caller's ID. Avatar-assisted affective voice calls and the use of avatars as an alternative to low-bandwidth video communication are also known.
An apparatus has been disclosed for offering a service for distinguishing callers, so that when a mobile terminal has an incoming call, information (avatar, ring tone, etc.) related to the caller is searched for in a database, and the results are transmitted to the recipient's mobile terminal. The user can request that the database provide the list of available images from which to choose.
A telephone number management service and avatar providing apparatus has also been disclosed. In this approach, a user can register with the apparatus and create his, or her, own avatar. When a mobile communication device has an incoming call, it checks with the management service using the caller's ID. If an avatar exists in the database for the caller, the avatar is transmitted to the mobile terminal and displayed.
Methods have also been disclosed for associating an avatar with a caller's ID (CID) and for efficient animation of realistic, speaking 3D characters in real time. This is achieved by defining a behavior database. Specific cases include real-time avatar animation driven by a text source, an audio source, or user input through a User Interface (UI).
Use of an avatar that is transmitted along with audio and is initiated through a single button press has been disclosed.
A method has been disclosed for assisting voice conversations through affective messaging. When a telephone call is established, an avatar of the user's choice is downloaded to the recipient's device for display. During conversation, the avatar is animated and controlled by affective messages received from the owner. These affective messages are generated by participants using various implicit user inputs, such as gestures, tones of voice, etc. Since these messages typically occur at a low rate, they can be sent using a short message service (SMS). The affective messages transmitted between parties can either be encoded into special code for privacy or be sent via plain text for simplicity.
It is known that extreme video compression may be achieved by utilizing an avatar reference. By utilizing a convenient set of avatars to represent the basic categories of a human's appearance, each person whose image is being transmitted is represented by the one avatar of the set of avatars that is closest to the person involved.
Avatars may be used as a lower-bandwidth alternative to video conferencing. An animation of a face can be controlled through speech processing so that the mouth moves in synchrony with the speech. Keypad buttons of a phone may be used to express emotional state during a call. In an “avatar” telephone call, each call participant is allowed to press the buttons to indicate their desired facial expression.
Avatar images may be controlled remotely using a mobile phone.
In summary, the prior techniques address how to make multimedia over PTT more efficient at a network level, how to adapt video transmission to maintain quality of service or adapt to terminal capabilities, and how to drive avatar animation.
The accompanying figures, in which like reference numerals refer to identical or functionally similar elements throughout the separate views and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the present invention.
FIGS. 3-6 show an exemplary server consistent with some embodiments of the invention.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
Before describing in detail embodiments that are in accordance with the present invention, it should be observed that the embodiments reside primarily in combinations of method steps and apparatus components related to hybrid audio-visual communication. Accordingly, the apparatus components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element that is preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated that embodiments of the invention described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of hybrid audio-visual communication described herein. The non-processor circuits may include, but are not limited to, a radio receiver, a radio transmitter, signal drivers, clock circuits, power source circuits, and user input devices. As such, these functions may be interpreted as a method to perform hybrid audio-visual communication. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used. Thus, methods and means for these functions have been described herein. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation.
One embodiment of the invention relates to a method for providing communication between a sending terminal and a receiving terminal in a communication network. Communication is provided by detecting the media content of a signal transmitted by the sending terminal, generating, from the media content, a voice stream, an avatar control parameter stream and a video stream, selecting, as output, at least one of the voice stream, the avatar control parameter stream and the video stream; and transmitting the selected output to the receiving terminal.
The method may be implemented in a network server that includes a viseme detector operable to receive a voice component of an incoming communication stream from the sending terminal and generate first avatar control parameters therefrom, a video tracker operable to receive a video component of the incoming communication stream and generate second avatar control parameters therefrom, an avatar rendering engine, operable to render avatar images dependent upon at least one of the first avatar control parameters, second avatar control parameters and avatar control parameters in the incoming communication stream, a video encoder, operable to encode the rendered avatar images to produce a synthetic video stream, and an adaptation decision unit. The adaptation decision unit receives as input one or more of: the voice component of the incoming communication stream, avatar control parameters in the incoming communication stream or generated from elements at the server, a natural video component of the incoming communication stream, and the synthetic video stream, and is operable to select at least one of the inputs as an output to be transmitted to the receiving terminal.
In addition to the differing capabilities of the user terminal, the communication channels (112, 114, 116 and 118 in
To enable effective audio/visual communication between the user terminals, the server must adapt to both channel variations and variations in user equipment.
The present invention relates to hybrid natural and synthetic visual communication over communication networks. The communication network may be, for example, a push-to-talk (PTT) infrastructure that uses PTT telephones that have various multimedia processing capabilities. In one embodiment, communication is facilitated through media adaptation and transcoding decisions at a server within the network. The adaptation is dependent upon network terminal capability, user preference and network QoS. Other factors may be taken into consideration. The invention has application to various communication networks including, but not limited to, cellular wireless networks and PTT infrastructure.
In one embodiment, the receiving terminal adapts to the type of the transmitted media. In this embodiment, the receiving terminal checks a header of the incoming system level communication stream to determine whether it is an avatar animation stream or a video stream, and delegates the stream to either an avatar render engine or a video decoding engine for presentation.
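By way of illustration only, the header check and delegation described above might be sketched as follows. The one-byte type header, its values, and the names StreamType, classify_stream and dispatch are assumptions made for this sketch; the specification does not define a header layout.

```python
from enum import Enum


class StreamType(Enum):
    # Hypothetical header values; the actual stream format is not specified.
    AVATAR = 0x01
    VIDEO = 0x02


def classify_stream(packet: bytes) -> StreamType:
    """Read the assumed one-byte type header at the start of a packet."""
    if not packet:
        raise ValueError("empty packet")
    # Raises ValueError for header values that are neither avatar nor video.
    return StreamType(packet[0])


def dispatch(packet: bytes, avatar_engine, video_decoder):
    """Delegate the payload to the engine that matches the header."""
    stream_type = classify_stream(packet)
    payload = packet[1:]
    if stream_type is StreamType.AVATAR:
        return avatar_engine(payload)
    return video_decoder(payload)
```

In this sketch the two engines are passed in as callables, so the same dispatch logic can be exercised with stand-in functions before real rendering or decoding engines are available.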
The audio communication signal may be used to drive an avatar on the terminal. For example, if a sending terminal is only capable of voice transmission, the receiving terminal can generate an animated avatar with lip movement synchronized to the audio signal. The avatar may be generated at all receiving terminals that have the capability of avatar rendering or video playback. To generate the avatar synthetic images from the audio content, the audio content is passed to a viseme decoder 216. A viseme is a generic facial image, or a sequence of images, that can be used to describe a particular sound. A viseme is the visual equivalent of a phoneme or unit of sound in spoken language. The viseme decoder 216 recognizes phonemes or other speech components in the audio signal and generates a signal 218 representative of a corresponding viseme. The viseme signal 218 is passed to an avatar animation unit 220 that is operable to generate avatars that display the corresponding viseme. In addition to enhancing communication for a hearing user, visemes allow hearing-impaired users to view sounds visually and facilitate “lip-reading” of the entire human face.
The de-multiplexer 204 is operable to detect whether the visual content of the incoming communication stream 202 relates to a synthetic image (an avatar) or a natural image and generate a switch signal 222 that is used to control switch 224. The switch 224 directs the visual content 208 to either the avatar rendering unit 220 or a video playback unit 226. The video playback unit 226 is operable to decode the video content of the signal.
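As a minimal sketch of the phoneme-to-viseme step, assuming that phoneme labels and durations have already been recognized from the audio signal, a lookup table can map several phonemes onto one shared mouth shape. The table contents, the viseme names and the function phonemes_to_visemes are illustrative assumptions and are not taken from the specification.

```python
# Illustrative mapping only: several phonemes share one mouth shape.
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_to_teeth", "v": "lip_to_teeth",
    "aa": "open_wide",
    "uw": "rounded", "ow": "rounded",
    "sil": "neutral",
}


def phonemes_to_visemes(phonemes):
    """Map (phoneme, duration_ms) pairs to (viseme, duration_ms) runs.

    Consecutive phonemes that share a viseme are merged into a single run,
    and unknown phonemes fall back to the neutral mouth shape.
    """
    runs = []
    for phoneme, duration_ms in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        if runs and runs[-1][0] == viseme:
            runs[-1] = (viseme, runs[-1][1] + duration_ms)
        else:
            runs.append((viseme, duration_ms))
    return runs
```

Merging consecutive identical visemes keeps the resulting signal compact, which suits the low-rate control stream that drives the avatar animation unit.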
A display driver 228 receives either the generated avatar or the decoded video and generates a signal 230 to drive a display 232.
In a further embodiment, the media type is adapted based upon user preference of media type for visual communication. For video enabled terminals, a user can choose either video communication or avatar communication; the selection can be changed during the communication.
The receiving terminal may also include a means for disabling one or more of the processing units (that is, at least one of the viseme detector, the video tracker, the avatar rendering engine, the video encoder, and the adaptation decision unit). The choice of which processing unit to disable may be dependent upon the input media modality or user selection, or a combination thereof.
In a still further embodiment, the network is adapted for visual communication. In this embodiment, the network is operable to switch between video communication and avatar usage.
Table 2, below, summarizes the transcoding tasks that enable the server to bridge between two different types of sending and receiving terminals.
The outputs of the viseme detector 305, the behavior generator 306 and the video tracker 308, and the avatar control parameter stream 303 are used to control an avatar rendering engine 310. The avatar rendering engine 310 accesses a database 312 of avatar images and renders animated avatars dependent upon the incoming avatar control stream or features identified in the incoming voice and/or images. The avatars are passed to a video encoder 314, which generates an avatar video stream 316 of synthetic images. The animation parameters can be encoded in a number of ways. One way is to pack the animation parameters into the video streams; the other way is to use standardized system streams, such as the MPEG-4 system framework.
The avatar parameters output from the viseme detector 305, the behavior generator 306, and the video tracker 308, together with the received avatar control parameter stream 303, may be passed to a multiplexer 318 and multiplexed into a single avatar parameter stream 320. This may be a stream of facial animation parameters (FAPs) and/or body animation parameters (BAPs) that describe how to render an avatar.
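One possible way to combine several parameter sources into a single stream is a timestamp-ordered merge, sketched below. The ParamFrame structure, its field names and the example parameter names are hypothetical; the specification does not state how the multiplexer 318 orders or frames its inputs, and each input is assumed here to be already sorted by time.

```python
import heapq
from typing import Iterable, List, NamedTuple


class ParamFrame(NamedTuple):
    # Hypothetical frame layout used only for this sketch.
    timestamp_ms: int
    source: str    # e.g. "viseme", "tracker", "behavior", "incoming"
    params: dict   # animation parameter name -> value


def multiplex(*streams: Iterable[ParamFrame]) -> List[ParamFrame]:
    """Merge time-sorted parameter streams into one time-sorted stream."""
    # The key restricts comparison to timestamps, so the dict payloads
    # are never compared with each other.
    return list(heapq.merge(*streams, key=lambda frame: frame.timestamp_ms))
```

For example, merging a viseme stream with frames at 0 ms and 40 ms and a tracker stream with one frame at 20 ms yields frames ordered 0, 20, 40 ms.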
An adaptation decision unit 322 receives the voice input 302, the avatar parameter stream 320, the avatar video stream 316, and the natural video stream 304 and selects which modalities (voice, video, avatar, etc) are to be included in the output 324. The decision as to the type of modality output from the server can be based upon a number of criteria. This can be done using a rule based approach, a heuristic approach, or a graph based decision mechanism.
The selection may be dependent upon a quality of service (QoS) measure 326. For example, if the communication bandwidth is insufficient to support good video quality, a symbol may be shown at the sender's terminal to suggest using an avatar. Alternatively, the server can automatically use video-to-avatar transcoding in order to meet a QoS requirement.
Further, the selection may be dependent upon a user preference 328, a server load status 330 and/or a terminal capability 332.
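A rule based selection of the kind mentioned above might, purely as a sketch, look like the following. The Context fields, the bandwidth threshold VIDEO_MIN_KBPS, the rule ordering and the modality names are all assumptions made for illustration; an actual embodiment could weigh the QoS measure 326, user preference 328, server load status 330 and terminal capability 332 quite differently.

```python
from dataclasses import dataclass

# Hypothetical threshold below which natural video is not attempted.
VIDEO_MIN_KBPS = 128.0


@dataclass
class Context:
    bandwidth_kbps: float      # stand-in for the QoS measure
    prefers_avatar: bool       # stand-in for the user preference
    server_busy: bool          # stand-in for the server load status
    can_decode_video: bool     # receiving terminal capability
    can_render_avatar: bool    # receiving terminal capability


def select_output(ctx: Context) -> set:
    """Return the set of modalities to send to the receiving terminal."""
    if not (ctx.can_decode_video or ctx.can_render_avatar):
        return {"voice"}
    wants_video = ctx.can_decode_video and not ctx.prefers_avatar
    if wants_video and ctx.bandwidth_kbps >= VIDEO_MIN_KBPS:
        return {"voice", "natural_video"}
    if ctx.can_render_avatar:
        # Avatar parameters need far less bandwidth than video.
        return {"voice", "avatar_params"}
    # Video-only terminal: the server renders the avatar into video,
    # unless it is too loaded to do so (simplification in this sketch).
    if ctx.server_busy:
        return {"voice"}
    return {"voice", "synthetic_video"}
```

The sketch ignores the bandwidth needed by the synthetic video itself, which a fuller rule set would also have to check.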
The selection may be used to control the other components of the server, disabling components that are not required to produce the selected output.
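The following sketch shows one way the selected output could determine which server components remain enabled. The component and modality names, and the assumption that the viseme detector consumes voice while the video tracker consumes video, follow the description above; the function names and the exact dependency rules are hypothetical.

```python
ALL_COMPONENTS = {
    "viseme_detector",
    "video_tracker",
    "avatar_rendering_engine",
    "video_encoder",
}


def required_components(inputs: set, outputs: set) -> set:
    """Return the components needed to produce `outputs` from `inputs`."""
    needed = set()
    needs_params = bool(outputs & {"avatar_params", "synthetic_video"})
    # Parameters must be derived only if the sender did not supply them.
    if needs_params and "avatar_params" not in inputs:
        if "voice" in inputs:
            needed.add("viseme_detector")
        if "video" in inputs:
            needed.add("video_tracker")
    # Synthetic video requires rendering the avatar and encoding the result.
    if "synthetic_video" in outputs:
        needed |= {"avatar_rendering_engine", "video_encoder"}
    return needed


def disabled_components(inputs: set, outputs: set) -> set:
    """Return the components that can be switched off for this selection."""
    return ALL_COMPONENTS - required_components(inputs, outputs)
```

With a voice-only sender and natural video unavailable, for instance, the video tracker is the only component this sketch would disable when synthetic video is selected.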
The incoming video stream 304 and the voice signal 302 are passed directly to the adaptation decision unit 322. The incoming video stream 304 is also passed to the video tracker 308, which identifies features such as facial expressions or body gestures in the video images. The features are encoded and passed to the rendering engine 310, which renders images and enables the video encoder 314 to generate a video stream 316. Thus, the adaptation decision unit 322 receives a voice signal 302, an avatar control parameter stream 320, a synthetic video stream 316 and the incoming video stream 304 and may select between these modalities.
The methods and apparatus described above, with reference to certain embodiments, enable a communication system to adapt automatically to different terminal types, media types, network conditions and user preferences. This automatic adaptation minimizes user setup requirements and still provides flexibility for the user to choose between natural and synthetic media types. In particular, the approach enables flexible choice for the user's self-expression.
When an avatar is used, depending on the capability of the sending terminal, the user may select whether emotions, facial expressions and/or body animations are used.
The approach enables visual communication over a voice channel, without increasing the bandwidth requirement for voice communication, for legacy PTT phones or other user equipment with limited capability.
A mechanism for exchanging terminal capability at the server is provided, so that different actions can be taken according to inbound terminal type and outbound terminal type. For example, for a legacy PTT phone that does not support metadata exchange, the terminal type can be inferred from other signaling or network configurations.
Terminal capability exchange may be used, allowing the server to know whether a terminal has the capability for video, avatar, or both, or none (voice only).
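As one illustrative representation, the four capability classes (video, avatar, both, or voice only) can be held as bit flags. The metadata field names and the fallback argument used for terminals that send no capability metadata are assumptions of this sketch; how a legacy terminal's type is actually inferred is not specified here.

```python
from enum import Flag
from typing import Optional


class Capability(Flag):
    VOICE_ONLY = 0
    VIDEO = 1
    AVATAR = 2


def parse_capability(metadata: Optional[dict],
                     inferred: Capability = Capability.VOICE_ONLY) -> Capability:
    """Build a capability value from exchanged metadata.

    `metadata` uses hypothetical boolean fields "video" and "avatar".
    When a terminal sends no metadata at all, the value inferred from
    other signaling (voice only by default) is used instead.
    """
    if metadata is None:
        return inferred
    capability = Capability.VOICE_ONLY
    if metadata.get("video"):
        capability |= Capability.VIDEO
    if metadata.get("avatar"):
        capability |= Capability.AVATAR
    return capability
```

A terminal reporting both fields as true yields a value for which both `Capability.VIDEO in value` and `Capability.AVATAR in value` hold.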
In one embodiment, a user needs only to select his/her own avatar, and to push another button before talking to select video or avatar.
In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The invention is defined solely by the appended claims including any amendments made during the pendency of this application and all equivalents of those claims as issued.