The present invention relates generally to the fields of data networks and communication systems; more specifically, to systems and methods for performing video conferencing over a communications network.
Conferencing systems and methods, in which participants communicate in a conference session or meeting over existing voice and data networks, have been in existence for some time. Examples of conference calling systems include U.S. Pat. No. 6,865,540, which teaches a method and apparatus for providing group calls via the Internet; U.S. Pat. No. 6,876,734, which teaches an Internet-enabled conferencing system accommodating public switched telephone network (PSTN) and Internet Protocol (IP) traffic; U.S. Pat. No. 6,931,001, which discloses a system for interconnecting packet-switched and circuit-switched voice communications; and U.S. Pat. No. 6,671,262, which teaches a system with conference servers for combining IP packet streams in a conference call into combined packet streams, such that the combined IP packet stream utilizes no more bandwidth than each of the original packet streams. A voice conferencing system that uses a packet based conference bridge that receives speech indication signals from individual terminals and then uses those signals to select talkers within the conference is disclosed in U.S. Pat. No. 6,956,828.
In any conferencing system, the usage of network resources is a function of the number of participants. Especially in the case of a video conference, the audio and video media streams normally consume an enormous amount of network bandwidth, and the sheer amount of data involved can easily overwhelm the data processing capacity of the conferencing system. For the same reasons, video conferencing also presents problems with respect to scalability of the conferencing/network infrastructure.
Various proposals have been made to optimize bandwidth usage and data management in an audio/video conferencing environment. For example, U.S. Pat. No. 6,989,856 teaches a distributed video conferencing system in which all video streams, except for the video stream associated with the active speaker, are suppressed at one or more of the media switches that provide an interface from the edge of the network to the core of the network. Although this approach alleviates some of the processing overhead in the core network, the task of handling the large amount of data associated with the video streams arriving from the various end users/end points (EPs) falls to the media switches in the edge network, i.e., between the end user and the media switch. In other words, it is still necessary for the media switches, which are part of the infrastructure, to process the incoming video streams. Additionally, bandwidth consumption remains a problem because the unwanted video streams from end users who are not active speakers are still transmitted over the network before suppression occurs at the media switches. The bandwidth problem is especially acute in wireless networks, where bandwidth between the end user and the media switch is at a premium.
In another approach, U.S. Pat. No. 6,332,153 teaches relaying active speaker information to the EPs so that all of the EPs except for the one designated as the active speaker can suppress their audio streams. That is, audio suppression occurs at the end point source. A major drawback of this approach is that it requires a change in the end point devices in order to process messages/events carrying active speaker information. Moreover, the approach described in the above patent is primarily aimed at suppression of audio packets. A similar approach can be found in commercially-available conferencing software products (see e.g., http://www.arelcom.com/bandwidth.html) which attempt to minimize audio data packet transmission during periods of end user silence.
What is needed, therefore, is a mechanism that overcomes the drawbacks of the prior art and optimizes the consumption of network bandwidth and conference bridge resources in a video conferencing system.
By way of further background, U.S. Pat. No. 5,963,217 teaches a network conferencing system that encodes media using text in order to conserve network bandwidth. This text is subsequently translated to speech and video at the endpoint using an appropriate mapping function. Additionally, U.S. Pat. No. 6,925,068 teaches a method for bandwidth savings based on allocation of channels in a wireless physical media environment.
The present invention will be understood more fully from the detailed description that follows and from the accompanying drawings, which, however, should not be taken to limit the invention to the specific embodiments shown, but are for explanation and understanding only.
A mechanism is described that optimizes the consumption of network bandwidth and conference bridge resources by ensuring that only those video endpoints that are actively contributing to the conference session at any given instant transmit a video stream to the conference bridge. In the following description, specific details are set forth, such as device types, system configurations, protocols, applications, methods, etc., in order to provide a thorough understanding of the present invention. However, persons having ordinary skill in the relevant arts will appreciate that these specific details may not be needed to practice the present invention.
According to one embodiment of the present invention, a mechanism is provided for optimizing the usage of network bandwidth and conference bridge resources by keeping bidirectional only those media flows that are active at a given instant in time. The media streams of inactive participants to a conference session are set to receive-only (i.e., unidirectional), with the streams ordinarily sent from the video endpoints being suppressed or switched off using standards-based signaling mechanisms and/or media negotiation primitives. Because the activity of the conference participants typically changes throughout the session, the media channel characteristics of the participants are dynamically renegotiated based on various triggering conditions. As a result, network bandwidth consumption is drastically reduced, because only a handful of active participants transmit video at any time, thereby significantly increasing the network throughput. Additionally, the mechanism of the present invention facilitates an increase in utilization of conference resources by eliminating the redundant processing of the inactive media streams originating from each of the remaining endpoints.
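By way of illustration only, the bandwidth effect described above can be sketched as simple arithmetic. The function name, the participant counts, and the per-stream bit rate below are hypothetical values chosen for the example; they are not taken from the specification.

```python
def upstream_video_kbps(participants, active, per_stream_kbps):
    """Compare total endpoint-to-bridge video bandwidth in two modes.

    All inputs are illustrative assumptions: `participants` endpoints,
    of which `active` are allowed to send video, each stream consuming
    `per_stream_kbps` kilobits per second.
    """
    if not 0 <= active <= participants:
        raise ValueError("active must be between 0 and participants")
    return {
        # every endpoint sends video to the bridge
        "all_send": participants * per_stream_kbps,
        # only active endpoints send; the rest are receive-only
        "active_only": active * per_stream_kbps,
    }


if __name__ == "__main__":
    print(upstream_video_kbps(participants=20, active=2, per_stream_kbps=384))
```

Under these assumed numbers, the upstream video load at the bridge scales with the number of active endpoints rather than with the total number of participants.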
In the context of the present application, active participants or endpoints are defined as those that are in one of the following categories. First, an endpoint that is currently, actively speaking in a conference session. Alternatively, this may be an endpoint that has most recently spoken in the conference session, e.g., the last speaker. The conference bridge may obtain this information periodically based on standard algorithms for determining the loudest speaker or event. Secondly, an active endpoint may be defined as an endpoint that contributes continuously to a video composition. Another category of active endpoint is one that has been locked onto by one or more users as a fixed transmission source. Basically, any endpoint whose audio and/or video stream has an interested receiver is defined as an active endpoint.
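The categories of active endpoint listed above may be expressed as a simple predicate. The following is a minimal sketch; the `EndpointState` record and its field names are hypothetical and are introduced only to illustrate the definition.

```python
from dataclasses import dataclass


@dataclass
class EndpointState:
    """Hypothetical per-endpoint state kept by the conference bridge."""
    name: str
    is_speaking: bool = False      # currently the detected speaker
    is_last_speaker: bool = False  # most recently spoke in the session
    in_composition: bool = False   # contributes to a video composition
    locked_by: int = 0             # users locked onto this endpoint as a source


def is_active(ep: EndpointState) -> bool:
    """An endpoint is active if any of its streams has an interested receiver."""
    return (ep.is_speaking or ep.is_last_speaker
            or ep.in_composition or ep.locked_by > 0)


def active_endpoints(endpoints):
    """Return the names of all active endpoints, preserving input order."""
    return [ep.name for ep in endpoints if is_active(ep)]
```

For example, an endpoint that is silent but has been locked onto by two users would still be treated as active under this predicate.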
As can be seen, each of the endpoints shown in
Practitioners in the arts will understand that there exist multiple alternative ways of aggregating/disaggregating the conferencing and mixing resources within the conferencing system “cloud” 10. In other words, the details of conferencing system 10 can vary greatly depending upon application, available resources, network usage, and other particular configuration considerations. For example, the various embodiments described herein are equally applicable to stand-alone, centralized multipoint control units (MCUs) as well as to distributed video conferencing architectures.
In accordance with one embodiment, a conference moderator acts as a trigger to cause the conference bridge to dynamically re-negotiate the media channel directionality of various endpoint devices during a conference session. This moderator function may be facilitated through the use of a graphical user interface (GUI) or a telephony user interface (TUI) running on the moderator's endpoint device. The basic idea is that the conference moderator grants floor control to a conference participant who has requested access to the floor or who has otherwise been waiting in a floor request queue. When a participant receives the floor from the moderator, the conferencing server automatically renegotiates that participant's media channel characteristics, changing the media channel characteristics of that endpoint from receiver-only to send & receive, i.e., from unidirectional to bidirectional transmission. In other words, the media characteristics of the endpoint device are attached to the floor control grant such that only the active speaker endpoint sends video packets to the media mixer—all the remaining endpoints have their video streams turned off or suppressed.
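The coupling of floor control to media direction described above may be sketched as follows. This is an illustrative model only: the `FloorControl` class, its method names, and the direction labels are assumptions, and the actual renegotiation signaling toward the endpoint is not shown.

```python
from collections import deque


class FloorControl:
    """Hypothetical model of a moderator-driven floor request queue."""

    def __init__(self, endpoints):
        # Every endpoint starts in receive-only video mode.
        self.direction = {ep: "recvonly" for ep in endpoints}
        self.queue = deque()
        self.holder = None

    def request_floor(self, ep):
        """Queue a participant's floor request (ignoring duplicates)."""
        if ep in self.direction and ep != self.holder and ep not in self.queue:
            self.queue.append(ep)

    def grant_next(self):
        """Moderator grants the floor to the next queued participant.

        The previous holder reverts to receive-only and the new holder
        becomes send/receive. Returns the new holder, or None if no
        request is pending.
        """
        if not self.queue:
            return None
        if self.holder is not None:
            self.direction[self.holder] = "recvonly"
        self.holder = self.queue.popleft()
        self.direction[self.holder] = "sendrecv"
        return self.holder
```

In this sketch the direction change is attached to the grant itself, so that at most one endpoint is in send/receive mode at a time.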
The next event in the method of
Practitioners in the art will appreciate that the method described above may be completely automated by the conferencing system in accordance with a floor control algorithm or floor control access system. In other words, it is not necessary that a conference moderator act to grant individual floor control access to participants on a continual basis.
By way of further example,
In accordance with another embodiment of the present invention, the conference server, upon detecting an active participant in the conference session, signals all non-active endpoints to suppress their video transmission towards the conference bridge (mixer) by setting the media direction parameter of those endpoints to receive-only. The mixer detects the one or more loudest speakers in the conference session and designates them the active speaker(s). The conferencing server then signals the non-active endpoints to suppress their video streaming output to the mixer or conference bridge. As the active speaker status dynamically changes during a conference session, the media channel characteristics of the various endpoints are appropriately renegotiated.
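A minimal sketch of the selection step in this embodiment follows. The audio level values, the `max_active` parameter, and both function names are hypothetical; the specification does not prescribe a particular loudest-speaker algorithm.

```python
def select_video_directions(audio_levels, max_active=1):
    """Map each endpoint to a video direction based on loudness.

    `audio_levels` is a dict of endpoint name -> recent audio level
    (arbitrary illustrative units). The `max_active` loudest endpoints
    are set to send/receive; all others are set to receive-only.
    Ties are broken by endpoint name so the result is deterministic.
    """
    ranked = sorted(audio_levels, key=lambda ep: (-audio_levels[ep], ep))
    active = set(ranked[:max_active])
    return {ep: ("sendrecv" if ep in active else "recvonly")
            for ep in audio_levels}


def direction_changes(previous, current):
    """Return only the endpoints whose direction must be renegotiated."""
    return {ep: d for ep, d in current.items() if previous.get(ep) != d}
```

Computing only the changed endpoints, as in `direction_changes`, is one possible way to avoid re-signaling endpoints whose mode is unaffected when the active speaker changes.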
Note that in this embodiment the audio streams from each of the endpoints to the mixer are bidirectional, but the video streams are selectively controlled via signaling of the conferencing server such that each endpoint operates in either a receive-only or a send/receive video streaming directional mode. Practitioners in the art will appreciate that the change in video transmission directional mode (e.g., unidirectional or bidirectional) for the involved participants may take place in accordance with a variety of different protocols and different signaling mechanisms. This may simply involve the conference server sending a request message to the endpoint device to stop sending Real-Time Transport Protocol (RTP) packets. For instance, in a Session Initiation Protocol (SIP) environment, a reINVITE or UPDATE message may be sent to an endpoint device to suppress and re-enable video transmission. In SIP, a method for suppression and enablement of a video stream may include a MIME-encoded body part containing Session Description Protocol (SDP). The SDP, in turn, contains information about each media stream. One item of the media stream description concerns the directionality of the media. Therefore, by changing a video media stream description marked as “send/recv” to one marked as “send-only” or “recv-only”, either the server or the endpoint can convert a bidirectional video stream to a unidirectional one. Similarly, by changing the video media stream description back to “send/recv”, bidirectional flow of video can be restored.
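The SDP change described above may be sketched as a rewrite of the direction attribute in the video media section. In SDP the direction labels are written as the attribute lines `a=sendrecv`, `a=sendonly`, `a=recvonly`, and `a=inactive`. The helper below is a hypothetical illustration, not a complete SDP parser; in a real offer/answer exchange the attribute is expressed from the point of view of the party sending the SDP, which the caller of this sketch would need to account for.

```python
DIRECTIONS = {"sendrecv", "sendonly", "recvonly", "inactive"}


def set_video_direction(sdp: str, direction: str) -> str:
    """Return a copy of `sdp` with the video section's direction replaced.

    Only lines inside an "m=video" section are touched; audio and
    session-level lines pass through unchanged. If the video section
    has no direction attribute, one is appended to that section.
    """
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    out = []
    in_video = False
    replaced = False
    for line in sdp.splitlines():
        if line.startswith("m="):
            # Leaving a video section that had no direction attribute.
            if in_video and not replaced:
                out.append(f"a={direction}")
            in_video = line.startswith("m=video ")
            replaced = False
        elif in_video and line.startswith("a=") and line[2:] in DIRECTIONS:
            line = f"a={direction}"
            replaced = True
        out.append(line)
    if in_video and not replaced:
        out.append(f"a={direction}")
    return "\r\n".join(out) + "\r\n"
```

A server might, for instance, call `set_video_direction(offer, "recvonly")` to build the body of a reINVITE that suppresses video, and later call it with `"sendrecv"` to restore bidirectional flow. The addresses and ports in any such body are deployment-specific and are not modeled here.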
In yet another embodiment of the present invention, each endpoint device includes a voice activity detection (VAD) enabled device or module that can distinguish between silence, breathing, wind, noise, etc., and ordinary speech. In operation, the VAD device triggers video transmission to the mixer only when it detects someone talking. Basically, when speech or voice activity is detected, the endpoint, rather than the conference system, quickly negotiates (with the conferencing server) a change in the media channel characteristics from a receiver-only video transmission mode to a send/receive video transmission mode. In all other cases (e.g., silence, breathing, wind, noise, etc.) video streaming to the mixer is suppressed or turned off.
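The endpoint-driven behavior of this embodiment may be sketched as a small state machine. The `speech_score` input stands in for the output of a VAD module and is an assumption, as are the threshold and the `hangover_frames` parameter, which is added here only to avoid toggling on brief pauses and is not part of the description above.

```python
class VadVideoGate:
    """Hypothetical endpoint-side gate that ties video direction to VAD."""

    def __init__(self, threshold=0.5, hangover_frames=3):
        self.threshold = threshold              # score at or above this counts as speech
        self.hangover_frames = hangover_frames  # non-speech frames tolerated before muting
        self.silent_frames = 0
        self.direction = "recvonly"             # video starts suppressed

    def on_frame(self, speech_score):
        """Process one VAD result.

        Returns the new direction when a renegotiation with the
        conferencing server would be needed, otherwise None.
        """
        if speech_score >= self.threshold:
            self.silent_frames = 0
            if self.direction != "sendrecv":
                self.direction = "sendrecv"
                return self.direction
        else:
            self.silent_frames += 1
            if (self.direction == "sendrecv"
                    and self.silent_frames >= self.hangover_frames):
                self.direction = "recvonly"
                return self.direction
        return None
```

Each non-None return value corresponds to a point at which the endpoint, rather than the conference system, would initiate the change in media channel characteristics.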
In a slight variation of the above embodiment, instead of immediately changing to a bidirectional video transmission mode upon detection of voice activity, the endpoint may first use existing floor control mechanisms and algorithms to request control of the floor from the conference moderator or conferencing server. Only after the endpoint has been granted control of the floor would the conferencing server renegotiate the media channel characteristics of the endpoint to allow the endpoint to begin sending video packets to the mixer.
In still another embodiment of the present invention, an in-band signaling mechanism, such as Named Signaling Event (NSE), may be utilized to indicate to the endpoint device to switch off video transmission when there is no audio being received at the mixer from the endpoint device. This approach is similar to the previously described embodiments in that the conferencing bridge indicates to an endpoint device that it should stop/start video transmission, but in this case the payload (RTP) itself is utilized as the command transmission medium instead of signaling, making this embodiment protocol-independent and codec-independent.
In yet another embodiment of the present invention, a video encoding scheme may be enhanced to signal to the endpoint device whether it should transmit or not, in a manner similar to the “freeze picture” control command in H.26x video codecs. This may be accomplished in an H.323 network using H.245 media control primitives. (H.245 is a control signaling protocol in the H.323 multimedia communication architecture, and is used for the exchange of end-to-end H.245 messages between communicating H.323 endpoints/terminals.) A video stream in a given direction can be terminated by sending a Close Logical Channel (CLC) command, which has the effect of closing the communication channel between two endpoints. Similarly, the communication channel can be reopened for transport of audiovisual and data information by sending an Open Logical Channel (OLC) command.
In another embodiment, the H.245 FlowControl command with a bit rate of zero can be used to leave the video channel established but unable to transmit any data. When video is again required of the channel, a second FlowControl command with the original video bit rate can be sent, allowing video to flow once again.
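The effect of the FlowControl approach may be modeled abstractly as shown below. This sketch does not encode actual H.245 messages; the `VideoChannel` class, its fields, and the bit rate value used in any example are hypothetical and serve only to show that the channel stays established while transmission is blocked.

```python
class VideoChannel:
    """Hypothetical model of an established video logical channel."""

    def __init__(self, negotiated_bitrate):
        self.negotiated_bitrate = negotiated_bitrate  # original video bit rate
        self.max_bitrate = negotiated_bitrate         # current flow-control limit
        self.is_open = True                           # channel remains established

    def flow_control(self, max_bitrate):
        """Apply a flow-control limit to the open channel."""
        if not self.is_open:
            raise RuntimeError("channel is closed")
        if max_bitrate < 0:
            raise ValueError("bit rate must be non-negative")
        self.max_bitrate = max_bitrate

    def can_send(self):
        """Video may flow only when the channel is open and the limit is non-zero."""
        return self.is_open and self.max_bitrate > 0


def pause_video(channel):
    """Leave the channel established but unable to transmit any data."""
    channel.flow_control(0)


def resume_video(channel):
    """Restore the original video bit rate so that video flows again."""
    channel.flow_control(channel.negotiated_bitrate)
```

In contrast to closing and reopening the logical channel, `is_open` never changes in this model, so no channel setup is repeated when video is again required.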
It should be understood that elements of the present invention may also be provided as a computer program product which may include a machine-readable medium having stored thereon instructions which may be used to program a computer (e.g., a processor or other electronic device) to perform a sequence of operations. Alternatively, the operations may be performed by a combination of hardware and software. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media or other type of media/machine-readable medium suitable for storing electronic instructions. For example, elements of the present invention may be downloaded as a computer program product, wherein the program may be transferred from a remote computer or telephonic device to a requesting process by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
Additionally, although the present invention has been described in conjunction with specific embodiments, numerous modifications and alterations are well within the scope of the present invention. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.