This invention relates to real-time multimedia communication and messaging, and in particular, to instant messaging, and audio and audio-video communication over the Internet.
Instant Messaging has become one of the most popular forms of communication over the Internet, rivaling E-mail. Its history can be traced from several sources: the UNIX TALK application developed during the late 1970s, the popularity of CHAT services during the early 1990s, and the various experiments with MUDs (Multi-user domains), MOOs (MUD Object-Oriented), and MUSHs (Multi-user shared hallucinations) during the 1980s and 1990s.
Chat Services: Chat services (such as the service provided by America-On-Line) allow two or more users to converse in real time through separate connections to a chat server, and are thus similar to Instant Messaging. Prior to 2000, most of these services were predominately text-based. With greater broadband access, audio-visual variants have gained popularity. However, the standard form continued to be text-based.
Standard chat services allow text-based communication: a user types text into an entry field and submits the text. The entered text appears in a separate field along with all of the previously entered text segments. The text messages from different users are interleaved in a list. A new text message is entered into the list as soon as it is received by the server and is published to all of the chat clients (the timestamp can also be displayed). In the context of the present disclosure, the important principle governing the display and use of messages is:
Although all of the chat services we know of follow the “display immediately” principle, most services also allow users to scroll back and forth in the list, so that they can see a history of the conversation. Some variants of the service allow the history (the list of text messages) to be preserved from one session to the next, and some services (especially some MOOs) allow new users to see what happened before they entered the chat.
Although audio-video chat is modeled on text-based chat, the different medium places additional constraints on the way the service operates. As in text-based services, in audio-video chat when a new message is published to the chat client, the message is immediately played, i.e., the “display immediately” principle is maintained. However, audio and audio-video messages are not interleaved like text messages. Most audio-video chat services are full-duplex and allow participants to speak at the same time. Thus, audio and audio-video signals are summed rather than interleaved. Some audio-video conference services store the audio-video interactions and allow participants (and others) to replay the conference.
Many conference applications allow participants to “whisper” to one another, thus creating a person-to-person chat. Thus, users could simultaneously participate in a multi-user chat and several person-to-person whispers. In audio chats, the multi-user chat was conveyed through audio signaling, and users could “whisper” to one another using text. Experimental versions allowed users to “whisper” using audio signaling, while muting the multi-user chat.
Instant Messaging: Instant Messaging had its origins in the UNIX Talk application, which permitted real-time exchange of text messages. Users, in this case, were required to be online on the same server at the same time. The Instant Messaging that was popularized in the mid-1990s generalized the concept of UNIX Talk to Internet communication. Today, instant messaging applications, e.g., ICQ and AOL's AIM, comprise the following basic components:
Notably, separate IM windows exist for each IM session, and each IM session typically represents the exchange of messages between two users. Thus, a user can be, and often is, engaged in several concurrent IM sessions. In most IM services, a message is not displayed until the author of the message presses a send button or presses the return key on the keyboard. Thus, messages tend to be sentence length. Messages are displayed in an interleaved fashion, as a function of when the message was received. UNIX Talk and ICQ allow transmission on every key press. In this way, partial messages from both participants can overlap, mimicking the ability to talk while listening.
The underlying IM system typically uses either a peer-to-peer connection or store-and-forward server to route messages, or a combination of the two. If a message is sent to a user (the intended recipient) who is not logged in, many IM applications will route the message to a store-and-forward server.
The intended recipient will be notified of the message when that recipient logs into the service. If a message is sent to a logged-in recipient, the message appears on the recipient's window “almost instantly”, i.e., as quickly as network and PC resources will allow. Thus, text messages will appear in open IM windows even if the window is not the active window on a terminal screen. This has several benefits:
In practice, a first user might send an IM to a second user, who is logged into an IM service but who is engaged in some other activity. The second user might be editing a document on the same PC that is executing the IM application, or eating lunch far away from this PC, or sleeping. Nonetheless, the IM window on the second user's PC will display the last received IM. Concurrently, another IM window on the second user's PC terminal might represent a conversation between a third user and the second user and might also display a last received message from the third user, or if none were received, then it would display the last message sent by the second user to the third user.
As in Chat, the presentation of IM text messages is governed by the “display immediately” principle. Text messages can be viewed and/or responded to quickly or after many hours, as long as the session continues. Thus, instant messaging blurs the line between synchronous and asynchronous messaging.
Recently, IM has been extended to allow exchange of graphics, real time video, and real time voice. Using the aforementioned principle, a received graphic will be displayed in near real time, but because it is a static image or a continuously-repeating, short-duration moving image, the visual, like text, does not need to be viewed at the time it is received. It will persist as long as the IM session continues. However, audio and audio-video IM must be viewed when received, because audio and audio-video are time-varying signals, unlike text.
Thus, IM applications that permit audio and audio-visual messages have the following limitations:
The “display immediately” principle is sensible, because all heretofore known audio and audio-video IM applications provide synchronous, peer-to-peer communication. Even if multiple audio and audio-video IM sessions were permitted, a user could not be engaged in multiple concurrent, ongoing conversations. Call-waiting, a telephone service, provides a good analogy. The service allows two calls to exist on the same phone at the same time, but only one call can be active. If a call participant on the inactive line were to talk, the call-waiting subscriber would never hear it. In the same way, audio or audio-video IM transmissions are immediately presented in an active IM session. If the intended recipient is not paying attention, the message is lost.
Some IM services allow messages to be sent to users who are not logged in to the service or who are logged in but identified as “away”. In these cases, when users log in or indicate that they are “available”, all of the text messages received while they were logged out or away are immediately displayed. However, due to storage constraints, most current services do not store audio-video messages when the intended recipients are away or not logged in.
To sum up, current IM art uses the “display immediately” principle: When an IM session is active, messages are presented as soon as possible. When the messages are static, as in text, the recipient can read the message at any subsequent time, as long as the session is still active. When the messages are dynamic, as in audio or audio-visual, the recipient must view them when they arrive. This principle implies that only one IM session at a time can be used for audio or audio-visual communication. This is a serious limitation, and negates one major advantage of IM: multiple, on-going asynchronous conversations. The text format of real-time, text-based IM allows IM messages from multiple people to be managed concurrently by a single user (IM users are known to easily handle two to four concurrent IM conversations). In contrast, multiple real-time audio and audio-video conversations are difficult to manage concurrently.
Audio and Audio-video conversations that don't use IM infrastructure use the same “display immediately” principle, e.g., Skype Internet Telephony. This also occurs in circuit-switched telephony (e.g., public-switched telephone network). In all cases, when two or more participants are communicating in real time, their utterances are presented to the other conversational participants as quickly as possible. Indeed, delays in transmission can adversely affect the quality of the conversation.
One apparent exception to the “display immediately” principle is voice mail and multi-media mail services. However, voice mail, e-mail and similar services differ from real-time communication in that the conversational participants do not expect to be logged into the service at the same time. Instead, electronic mail services typically rely on store-and-forward servers. The asynchronous nature of voice mail and multi-media email allows mail from multiple people to be managed concurrently by a single user.
Public-switched telephone services often offer call-waiting and call-hold features, in which a subscriber can place one call on hold while speaking on a second call (and can switch back and forth between calls or join them into a single conference call). However, in such cases, the call participant who is placed on hold cannot continue speaking to the subscriber until that subscriber switches back to that call; there is no conversational activity while the call is on hold. In some services, call participants who have been placed on hold (or on a telephone queue) can signal that they wish to be transferred to voice mail; but this ends the conversation.
The push-to-talk feature found on some cellular phones (such as some Nextel and Nokia cellular phones) allows users to quickly broadcast audio information to others in a group (which can vary in size from one person to many). Thus, users can quickly switch among different conversational sessions. However, push-to-talk does not provide the “play-when-requested” capability of the present invention; it does not play back the content that was missed while the user was engaged in other conversations.
Thus, unlike text-based instant messaging, all of the heretofore known real-time communication systems which use time-varying media such as audio and video suffer from a number of disadvantages:
The present invention circumvents these limitations by relaxing the “display immediately” principle.
The present invention discloses a method and apparatus that relaxes the aforementioned “display immediately” principle and allows users to engage in multiple asynchronous, multimedia conversations. Instead of a single continuous presentation of audio or audio-video, the present invention relies on the collection of short clips of time-varying signals (typically the length of a sentence uttered in an audio or audio-video clip) and transmits each as a separate message, similar to the manner in which text messages are transmitted after pressing the enter/return key. On the recipient side, the invention replaces the traditional principle “display immediately” with a new principle “play-when-requested”. With this new combination of presentation principles, the recipient sees that a new message has arrived right away; however, information packaged as an audiovisual message is not played until the recipient requests it (or the system decides the recipient is ready to receive it; for example, users might elect to have new messages played whenever they finish replying to another message).
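By way of illustration only, the “play-when-requested” principle described above might be sketched as follows. The class and method names (ClipMessage, Conversation, receive, play_next) are hypothetical and are not taken from the disclosure; the sketch assumes that text is presented immediately while time-varying clips are merely indicated until the recipient asks for them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClipMessage:
    """One short clip (roughly a sentence) sent as its own message."""
    sender: str
    media_type: str          # e.g. "text", "audio", "audio-video"
    payload: bytes
    played: bool = False


@dataclass
class Conversation:
    """Per-sender log; time-varying clips wait until the user asks for them."""
    log: List[ClipMessage] = field(default_factory=list)

    def receive(self, msg: ClipMessage) -> str:
        # The arrival is indicated right away, but only static media
        # (text) is presented immediately; audio/video is held back.
        self.log.append(msg)
        if msg.media_type == "text":
            msg.played = True
            return "displayed"
        return "indicator shown"

    def play_next(self) -> Optional[ClipMessage]:
        # "Play-when-requested": hand back the oldest unplayed clip, if any.
        for msg in self.log:
            if not msg.played:
                msg.played = True
                return msg
        return None


# Hypothetical usage: user B holds one conversation each with A and C.
inbox: Dict[str, Conversation] = {"A": Conversation(), "C": Conversation()}
status_a = inbox["A"].receive(ClipMessage("A", "audio-video", b"clip-1"))
status_c = inbox["C"].receive(ClipMessage("C", "text", b"hi B"))
first = inbox["A"].play_next()   # played only because B requested it
```

Keeping one log per conversation is a design choice of this sketch; it lets a recipient leave a clip from one sender unplayed while attending to another conversation.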
This invention represents an advance in IM technology and allows audio and audio-visual IM participants to delay playing and responding to audio and audio-video messages. Thus, with this new technology, audio and audiovisual conversations can gracefully move between synchronous and asynchronous modes. This method can be extended to telephony to allow multiple, asynchronous telephone conversations. The method can be further generalized to allow any MIME (Multipurpose Internet Mail Extensions) type or combination of MIME types over any communication channel.
One novel extension of this new combination of presentation methods is text-audio and text-video IM, in which a sender types a message while receiving audio, video, or both. The transmitted message can contain text or audio/video or both. This overcomes one limitation of audio-video communication: In audio and audio-video communication, the person creating the message can be overheard. In the present invention, the person who is creating a message can speak the message or type the message within the same communication session.
The present invention also allows each party in a chat to participate without having identical media capabilities (e.g., recipients can read the text of a text-video even if they can't play video, and when a user does not have a keyboard, they can speak an audio IM).
The invention also has the advantage of supporting simple forms of archiving. Rather than store a long extended video or audio recording, the collection of audiovisual messages eliminates unnecessary content, and allows for more efficient methods for archiving and retrieving messages.
The present invention permits users to engage in multiple real-time audio and audio-video communication sessions. In accordance with the invention:
Thus, the present invention replaces the aforementioned “display immediately” principle used in all heretofore known audio and audio-video communication methods with the “play-when-requested” principle: In accordance with the present invention, audio and audio-video content are played only when the user is ready to receive them.
The following example uses an audio-visual communication example, but it should be obvious to anyone skilled in the art that the method can be used in an equivalent manner to support audio and text-video communication. Using the present invention:
An IM user (A) sends an audio-visual message to another IM user (B). Several outcomes are possible:
The invention discloses a novel method that allows users to engage in several concurrent streaming conversation sessions. These conversation sessions can utilize common communication services such as those offered over the Internet or the public-switched telephone network (PSTN). Examples of such Internet services include multi-media IM (e.g., Apple's iChat AV) and voice-over-IP (e.g., Skype). Examples of telephone service include standard cellular and landline voice services as well as audio-video services that operate over the PSTN, e.g., 8×8's Picturephone, DV324 Desktop Videophone, which uses the H.324 communication standard for transmitting audio and video over standard analog phone lines. All of these services are examples of concurrent streaming conversation sessions. They are transmitted as time-varying signals (unlike text), and all use the “display immediately” principle.
The present invention discloses a set of novel methods, which combine state of the art techniques for segmenting, exchanging, and presenting real-time multimedia streams. Support for multiple concurrent conversations that include multimedia streams relies on the “play-when-requested” principle. A scenario using this principle is illustrated in
In Step 104, Participant B's message has been added to IM log 112, and Participant B also sees an audio-video message indicator 150 from Participant A in IM log 112. In addition, Participant B responds to Participant C by typing, “hi back C” into input window 115.
In Step 106, Participant B selects audio message indicator 150 for playback and while listening, Participant B also sees an audio message indicator 152 from Participant C.
In Step 108, while listening to the audio message from Participant A, Participant B notices that an additional message has been received from Participant C, indicated by audio message indicator 154. Participant B decides not to respond immediately to the message associated with audio message indicator 150 and instead selects audio message indicator 152 to listen to both of Participant C's audio messages. While listening to the messages associated with indicators 152 and 154, Participant B types a message to Participant A in input window 113.
In Step 110, Participant B responds to Participant C's audio message by recording a new audio message 156, while also noticing a second audio-video message indicator 158 from Participant A. In this way, Participant B can concurrently converse with both Participants A and C without ostensibly putting either on “hold”. Both Participants A and C can record new messages at the same time that Participant B is listening or responding to one of their prior messages.
The example in
Other types of user interface styles can be accommodated. It is possible to map the functions to different keys. For example, “1” could be used to replay the current message, “2” could be used to start and stop a recording, and “#” could be used to play the next message. After discussing
In Step 201, Participant A calls Participant B's voice IM service with a message 250, “hi this is A.” Participant A hears an initial system message which may include message 252 informing Participant A that the message is being delivered and to wait for a response or to add an additional message. In Step 203, Participant B answers the telephone and receives the message 250 from Participant A. After hearing the message, the end of which might be indicated by a tone, Participant B records message 254 and presses the “#” on the telephone pad to indicate the end of the reply. In Step 205, Participant A listens to Participant B's reply, and responds with Message 256.
In Step 207, while Participant A is listening and responding to Participant B with Message 256, Participant C calls Participant B and records an initial message 258. In Step 209, Participant B listens to Message 256 and while listening notices that another caller has sent an Instant Message (258). Notification might occur through a tone heard while listening to Message 256, or through a visual indicator if the phone set contains a visual display. In Step 211, concurrent with Participant B's listening to Message 258, Participant C records an additional message 260. In Step 213, Participant B responds to Participant A's Message 256 by pressing “*” and creating Message 262. At the end of recording this new message, Participant B presses “#” and hears the next message on Participant B's input queue, in this instance, Participant B hears both of the messages left by Participant C, 258 and 260. Concurrent with these activities, Participant A listens to Message 262 and records a new message 264 in Step 215. While Participant A is recording Message 264, Participant B responds to Messages 258 and 260 from Participant C with Message 266, in Step 217. Also in Step 217, after completing Message 266, Participant B listens to the next message on the input queue, Message 264. In Step 219, Participant C hears Message 266 and records a final message 268, “well got to go, take care.” Participant C disconnects from the phone service. In Step 221, Participant B responds to Participant A's Message 264 by recording a Message 270. After completing the response, signified by pressing “#”, Participant B hears Participant C's final message 268. Participant B could choose to ignore the message by moving to the next message in the queue, to save the message for a later playback and response, or to respond to the message. 
In this illustration, in Step 225, Participant B decides to respond with Message 274, and is informed by the System Message 276 that Participant C is no longer logged into the service, and Participant B can either delete the response 274 or add an additional message. Participant B chooses neither and presses “#” for the next message in the queue. Participant C will hear Message 274 when Participant C logs back into the server. Concurrent with Steps 221 and 225, Participant A listens to Message 270 and responds with Message 272. In Step 227, Participant B listens to Message 272 and responds with final message 278. In Step 229, Participant A hears final message 278 from Participant B and ends the call. If Participant A had responded, the service would have delivered that message when Participant B logs back into the service.
Notably, Participant B as a subscriber could receive these messages using a phone or a network-enabled computing device, such as a handheld computer, a laptop, or a desktop computer.
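The queueing behavior in the telephone scenario above can be illustrated with a minimal sketch. The class VoiceIMService and its methods are hypothetical names introduced only for this illustration. The sketch assumes that each subscriber has one input queue, that pressing “#” plays the oldest waiting message together with any later waiting messages from the same caller (as when Participant B hears both 258 and 260), and that messages for a logged-out participant are held until that participant logs back in.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple

Message = Tuple[str, int]   # (sender, message number as used in the scenario)


class VoiceIMService:
    """Hypothetical per-user input queues for a telephone-based voice IM."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[Message]] = {}
        self.logged_in: Set[str] = set()

    def login(self, user: str) -> None:
        self.logged_in.add(user)
        self.queues.setdefault(user, deque())   # held messages are preserved

    def logout(self, user: str) -> None:
        self.logged_in.discard(user)

    def send(self, sender: str, recipient: str, number: int) -> str:
        # Messages are always queued; the status tells the sender whether
        # the recipient is currently reachable.
        self.queues.setdefault(recipient, deque()).append((sender, number))
        return "delivered" if recipient in self.logged_in else "held until login"

    def next_messages(self, user: str) -> List[Message]:
        # "#" key: play the oldest waiting message plus any later waiting
        # messages from the same sender, then remove them from the queue.
        queue = self.queues.get(user, deque())
        if not queue:
            return []
        sender = queue[0][0]
        batch = [m for m in queue if m[0] == sender]
        self.queues[user] = deque(m for m in queue if m[0] != sender)
        return batch


service = VoiceIMService()
for user in ("A", "B", "C"):
    service.login(user)
service.send("A", "B", 256)
service.send("C", "B", 258)
service.send("C", "B", 260)
first = service.next_messages("B")     # Message 256 from A
second = service.next_messages("B")    # Messages 258 and 260 from C
service.logout("C")
status = service.send("B", "C", 274)   # C has left; 274 waits for C's return
```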
The example in
A mapping of key presses (or spoken key words) to functions in this more flexible service might be as follows:
This list is not meant to provide the complete user interface, but rather to be illustrative of the kinds of functions that could be provided. The methods and apparatus required to implement these functions (such as forwarding a message) and to do so using touchtone, speech recognition, visual and multimodal interfaces are well known in the art.
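As one illustration of how key presses might be mapped to functions, the following sketch dispatches the example keys mentioned earlier in this description (“1” to replay the current message, “2” to start and stop a recording, “#” to play the next message). The PhoneSession class and its event strings are hypothetical and stand in for whatever playback and recording apparatus an embodiment actually uses.

```python
from typing import Callable, Dict, List, Optional


class PhoneSession:
    """Hypothetical keypad front end for one subscriber's message queue."""

    def __init__(self, queue: List[str]) -> None:
        self.queue = list(queue)            # waiting message labels, oldest first
        self.current: Optional[str] = None  # message most recently played
        self.recording = False
        self.events: List[str] = []         # stands in for audio output

    def replay_current(self) -> None:
        if self.current is not None:
            self.events.append(f"play {self.current}")

    def toggle_recording(self) -> None:
        self.recording = not self.recording
        self.events.append("record start" if self.recording else "record stop")

    def play_next(self) -> None:
        if self.queue:
            self.current = self.queue.pop(0)
            self.events.append(f"play {self.current}")
        else:
            self.events.append("no more messages")

    def press(self, key: str) -> None:
        # Key-to-function table; other keys or spoken key words could be
        # added to the table without changing the dispatch logic.
        handlers: Dict[str, Callable[[], None]] = {
            "1": self.replay_current,
            "2": self.toggle_recording,
            "#": self.play_next,
        }
        handler = handlers.get(key)
        if handler is None:
            self.events.append(f"ignored {key}")
        else:
            handler()


session = PhoneSession(["256", "258"])
for key in ["#", "1", "2", "2", "#"]:
    session.press(key)
```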
Although
Process for managing concurrent streaming conversations:
The login process from an instant messaging client to a server establishes a TCP/IP connection for relaying events through the service. These connections provide a means for transmitting signals between processes that are used for initiating and controlling said concurrent, streaming conversation sessions among a plurality of users who communicate with one another in groups of two or more individuals, said system allowing each user to concurrently receive and individually respond to separate streaming messages from said plurality of users. These connections also provide a means for transmitting at least one of said streaming messages. Further information regarding TCP/IP socket connections can be found in: “The Protocols (TCP/IP Illustrated, Volume 1)” by Richard Stevens, Addison-Wesley, first edition, (January 1994). When peer-to-peer connections are needed, TCP/IP connections are established directly between terminal devices. Network streaming of content can be implemented through any standard means, including RTP/RTSP protocols. For more information on the RTP protocol see the IETF document: http://www.ietf.org/rfc/rfc1889.txt. For more information on RTSP see the IETF document: http://www.ietf.org/rfc/rfc2326.txt.
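A minimal sketch of relaying an event over such a stream connection is given below. The disclosure does not specify a wire format; newline-delimited JSON, the field names (type, message_id, sender, delivery_method), and the helper names encode_event and read_event are all assumptions made for illustration. A local socket pair stands in for the TCP/IP connection between client and service.

```python
import json
import socket
from typing import Any, Dict, TextIO


def encode_event(event_type: str, **fields: Any) -> bytes:
    """Serialize one event as a single JSON line (assumed wire format)."""
    return (json.dumps({"type": event_type, **fields}) + "\n").encode("utf-8")


def read_event(stream: TextIO) -> Dict[str, Any]:
    """Read exactly one event; one line of the stream is one event."""
    line = stream.readline()
    if not line:
        raise ConnectionError("connection closed before an event arrived")
    return json.loads(line)


# A connected socket pair stands in for the client/service connection.
client, server = socket.socketpair()
try:
    client.sendall(encode_event("NewReady", message_id="m1",
                                sender="A", delivery_method=1))
    with server.makefile("r", encoding="utf-8") as stream:
        event = read_event(stream)
finally:
    client.close()
    server.close()
```

Line-delimited framing is used here only because a stream connection has no message boundaries of its own; any other framing would serve equally well.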
Key user interface and communication events:
Establishing delivery and storage methods for messages with streamed content:
In
Starting an IM conversation session and setting up peer-to-peer connections:
Sending a message with streamed content:
In
Message delivery sub-process and the playing of streams:
The message delivery sub-process starts by getting the destination list for the new message for testing delivery types 346. Until condition 358 indicates there are no more destinations, a delivery method is chosen for each destination. If condition 348 chooses delivery method 3, sub-process 350 will create a message containing the stream segments and a NewReady event, and send these through a peer-to-peer connection. If condition 352 chooses delivery method 2, sub-process 354 will store the streams locally as needed, and send only the NewReady event through a peer-to-peer connection. If condition 352 chooses delivery method 1, sub-process 356 will create a message containing the stream segments and a NewReady event, and send it to the IM service, where it will store the message and its stream segments on the appropriate content server, and forward the NewReady event to the designated destinations. All NewReady events contain enough information for the devices receiving it to identify where to send RequestPlay events. When an event is determined by condition 310 to be NewReady, sub-process 312 will indicate or display the NewReady status. Sub-process 312 can also control speech recognition algorithms which could be used to label audio messages and to place the label with an audio message indicator on the recipient's user interface. If the event payload includes a message with stream segments, as a consequence of delivery method 3, sub-process 312 will store the payload in local storage until the user selects it for playing. When user input is determined by condition 326 to be SelectPlay, sub-process 328 will test the message delivery method. If condition 330 selects delivery method 3, sub-process 316 will find the message with content in its local storage, will select the appropriate playing method for each stream and will start playing those stream segments. 
If condition 332 selects delivery method 2, sub-process 334 will send a RequestPlay event through a peer-to-peer connection to the terminal device that is storing the message. If condition 332 selects method 1, sub-process 336 will send a RequestPlay event to the service, and the service will attempt to resolve this by returning a ResponsePlay event that includes the message with stream segments. When an event is determined by condition 318 to be RequestPlay with delivery method 2, sub-process 320 finds the message matching that RequestPlay event in local storage. Then sub-process 322 sends back a ResponsePlay event with the appropriate stream segments through the peer-to-peer connection. When an event is determined by condition 314 to be ResponsePlay, sub-process 316 will find the message with content in its local storage, will select the appropriate playing method for each stream and will start playing those stream segments.
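The three delivery methods and the NewReady, RequestPlay and ResponsePlay events described above may be summarized in the following sketch. It is a simplified model under stated assumptions, not the disclosed implementation: the names Event, Node, deliver, on_new_ready and select_play and the three constants are hypothetical, network transport is replaced by direct function calls, and the numbered conditions and sub-processes of the flowchart are collapsed into ordinary branches.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Descriptive labels for delivery methods 1, 2 and 3 (labels are assumptions).
METHOD_SERVICE_STORES = 1    # service stores segments, forwards NewReady
METHOD_SENDER_STORES = 2     # sender stores segments, sends only NewReady
METHOD_RECEIVER_STORES = 3   # segments travel with NewReady to the receiver


@dataclass
class Event:
    kind: str                 # "NewReady", "RequestPlay" or "ResponsePlay"
    message_id: str
    method: int
    segments: Optional[List[bytes]] = None


@dataclass
class Node:
    """A terminal device or the IM service, each with its own local storage."""
    name: str
    store: Dict[str, List[bytes]] = field(default_factory=dict)
    played: List[str] = field(default_factory=list)


def deliver(sender: Node, service: Node, message_id: str,
            segments: List[bytes], method: int) -> Event:
    """Sender side: build the NewReady event for one destination."""
    if method == METHOD_RECEIVER_STORES:
        return Event("NewReady", message_id, method, segments)
    holder = sender if method == METHOD_SENDER_STORES else service
    holder.store[message_id] = segments
    return Event("NewReady", message_id, method)


def on_new_ready(receiver: Node, event: Event) -> None:
    """Receiver side: keep any attached payload until the user selects it."""
    if event.segments is not None:
        receiver.store[event.message_id] = event.segments


def select_play(receiver: Node, sender: Node, service: Node,
                event: Event) -> List[bytes]:
    """Receiver side: the user selected the message indicator for playback."""
    if event.method == METHOD_RECEIVER_STORES:
        segments = receiver.store[event.message_id]
    else:
        # RequestPlay goes to whichever node holds the content; that node
        # answers with ResponsePlay carrying the stream segments.
        holder = sender if event.method == METHOD_SENDER_STORES else service
        request = Event("RequestPlay", event.message_id, event.method)
        response = Event("ResponsePlay", request.message_id, request.method,
                         holder.store[request.message_id])
        segments = response.segments or []
    receiver.played.append(event.message_id)
    return segments


node_a, node_b, im_service = Node("A"), Node("B"), Node("service")
results: Dict[int, List[bytes]] = {}
for method in (1, 2, 3):
    message_id = f"m{method}"
    ready = deliver(node_a, im_service, message_id, [b"seg"], method)
    on_new_ready(node_b, ready)
    results[method] = select_play(node_b, node_a, im_service, ready)
```

In all three cases the recipient receives the NewReady indication at once, and the stream segments are fetched or read from local storage only when playback is selected; the methods differ only in where the segments wait in the meantime.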
Network-based concurrent streaming conversation session instant messaging is performed when a user creates a session with one or more IM buddies by logging onto the instant messaging service 416 (see
Service 416 may be embodied in software utilizing a CPU and storage media on a single network server, such as a Power Mac G5 server running Mac OS X Server v10.3 or v10.4 (see http://www.apple.com/powermac/ for more information on Power Mac G5 server). The server would also run server software for transmitting and storing IM messages and streams, and would be capable of streaming audio and audio-video streams to clients that have limited storage capabilities using Apple's QuickTime Streaming Server 5. Other software running on the server might include MySQL database software, FTPS and HTTPS server software, and an IM server like Jabber, which uses the XMPP protocol (see http://www.jabber.org/ for more information on Jabber and see http://www.xmpp.org/specs/ for more information on the XMPP protocol that Jabber uses). Alternatively, Service 416 may execute across a network of servers in which account management, session management, and content management are each controlled by one or more separate hardware devices. Further information about MySQL, FTPS and HTTPS can be found at http://www.mysql.com/, http://www.ford-hutchinson.com/~fh-1-pfh/ftps-ext.html, and http://wp.netscape.com/eng/ssl3/draft302.txt.
Thick sender concurrent streaming conversation session instant messaging is performed when a user creates a session with one or more IM buddies by logging onto the instant messaging service 416 (see
An example of computer hardware and software capable of supporting the preferred embodiment for a terminal device is an Apple Macintosh G4 laptop computer with its internal random access memory and hard disk. An iSight digital video camera with built-in microphone captures video and speech. The audio/visual stream is compressed using a suitable codec like H.264 in Apple QuickTime 7 and a controlling script assembles audio-visual message segments that are stored in local random access memory as well as on the local hard disk. The audio-visual segments are streamed on the Internet to other users on the IM session using the Apple OS X QuickTime Streaming Server and RTP/RTSP transport and control protocols. The received audio-visual content is stored on the random access memory and the hard disk on the user's Apple Macintosh G4 laptop computer terminal, and is played using the Apple QuickTime 7 media player to the laptop's LCD screen and internal speakers as directed by a controlling script operated by the user.
Thick receiver concurrent streaming conversation session instant messaging is performed when a user creates a session with one or more IM buddies by logging onto the instant messaging service 416 (see
The user controlling a terminal device without local memory 410 (e.g., a cellular phone) may redirect the audio-visual content to another terminal device 410 (e.g., a local set top box) by directing the network content server 436 to stream directly to the other device using the IM play-when-requested process 412. Similarly, a thick receiver terminal device 406 may be directed to redirect audio-visual content to another terminal device 410 using the content server 434 and IM play-when-requested process 408, and a thick sender terminal device 404 may be directed to redirect audio-visual content to another terminal device 410 using the content server 434 and IM play-when-requested process 404.
Conclusion, Ramifications and Scope
Accordingly, the reader will see that the apparatus and operation of the invention allows users to participate in multiple, concurrent audio and audio-video conversations. Although the invention can be used advantageously in a variety of contexts, it is especially useful in business and military situations that require a high degree of real-time coordination among many individuals.
While the above description contains many specificities, these should not be construed as limitations on the scope of the invention, but rather as an exemplification of one preferred embodiment thereof. Many other variations are possible. For example, the present invention could operate using various voice signaling protocols, such as General Mobile Family Radio Service, and the methods and communication features disclosed above could be advantageously combined with other communication features such as the buddy lists found in most IM applications and the push-to-talk feature found in cellular communication devices, such as Nokia Phones with PoC (Push-to-talk over cellular). Also, the functions of the Instant Messaging Service 416 may be distributed to multiple servers across one or more of the included networks.
Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their legal equivalents.
This patent application claims the benefit of the filing date of our earlier filed provisional patent application, 60/553,046, filed on Mar. 16, 2004.
Number | Date | Country
---|---|---
60553046 | Mar 2004 | US