A videoconferencing system can employ a Multipoint Control Unit (MCU) to connect multiple endpoints in a single conference or meeting. The MCU is generally responsible for combining video streams from multiple participants into a single video stream which can be sent to an individual participant in the conference. The combined video stream from an MCU generally represents a composited view of multiple video images from various endpoints, so that a participant viewing the single video stream can see many participants or views. In general, a videoconference may include participants at endpoints that are on multiple networks or that use different videoconferencing systems, and each network or videoconferencing system may employ one or more MCUs. If a conference topology includes more than one MCU, an MCU may composite video streams including one or more video streams that have previously been composited by other MCUs. The result of this ‘multi-stage’ compositing can place images of some conference participants in small areas of a video screen while the images of other participants are given an inordinate amount of screen space. This can result in a poor user experience during a videoconference using multi-stage compositing.
Use of the same reference symbols in different figures may indicate similar or identical items.
A videoconferencing system that creates a composited video stream from multiple input video streams can analyze the input video streams to determine whether any of the input video streams was previously composited or contains filler areas. A set of video images associated with endpoints can thus be generated from the input video streams, and the number of video images generated will generally be greater than or equal to the number of input video streams. A compositing operation for a videoconference can then act on the video images in a user-specifiable manner to construct a composited video stream representing a composite of the video images. A video stream composited in this manner may improve a videoconferencing experience by providing a more logical, more useful, or more aesthetically desirable video presentation. For example, the compositing operation can devote equal area to each of the separated video images, even when some of the video images in the input streams are smaller than others. Filler areas from the input video streams can also be removed to make more screen space available to the video images. A multi-stage compositing process can thus give each participant or view in a videoconference an appropriately sized screen area and appropriate position even when the participant or view was previously incorporated in a composited video image.
Each of networks 110, 120, and 130 in system 100 further provides separate videoconferencing capabilities (e.g., a videoconferencing subsystem) that can be separately employed on network 110, 120, or 130 for a videoconference having participants on only the one network 110, 120, or 130. The videoconferencing subsystems associated with networks 110, 120, and 130 can alternatively be used cooperatively for a videoconference involving participants on multiple networks. The videoconferencing systems associated with individual networks 110, 120, and 130 may be the same or may differ. For example, the separate videoconferencing systems may implement different protocols or have different manufacturers or providers. In general, even when different providers implement videoconferencing systems based on the same protocol, e.g., the H.323 standard, the providers of the videoconferencing systems often provide different implementations of such standards, which may necessitate the use of a gateway device to translate the call signaling and data streams between endpoints of videoconferencing systems of different providers. In the embodiment of
A videoconferencing subsystem associated with network 110 contains multiple videoconferencing sites or endpoints 112. Each videoconferencing site 112 may be, for example, a conference room containing dedicated videoconferencing equipment, a workstation containing a general purpose computer, or a portable computing device such as a laptop computer, a pad computer, or a smartphone. For ease of illustration,
Each conferencing site 112 further includes a computing system 156 containing hardware such as a processor 157 and hardware portions of a network interface 158 that enables videoconference site 112 to communicate via network 110. Computing system 156, in general, may further include software or firmware that processor 157 can execute. In particular, network interface 158 may include software or firmware components. Conferencing control software 159 executed by processor 157 may be adapted for the videoconferencing subsystem on network 110. For example, processor 157 may execute routines from conferencing control software 159 to produce one or more audio-video data streams including a video image from video system 152 and to transmit the audio-video data streams. Similarly, processor 157 may execute routines from software 159 to receive an audio-video data stream associated with a videoconference and to produce video on display 154 and sound through an audio system (not shown).
The videoconferencing subsystem associated with network 110 also includes a multipoint control unit (MCU) 114 that communicates with videoconference sites 112. MCU 114 can be implemented in many different ways.
MCU 114 may combine video streams from videoconference sites 112 (and optionally video streams that may be received through gateway system 140) into a composited video stream. The composited video stream that MCU 114 produces can be a single video stream representing a composite of multiple video images from endpoints 112 and possibly video streams received through gateway system 140. In general, MCU 114 may produce different composited video streams for different endpoints 112 or for transmission to another videoconference subsystem. For example, one common feature of MCUs is to remove a participant's own image from the composited image sent to that participant. Thus, each endpoint 112 on network 110 could have a different composited video stream. MCU 114 could also vary the composited video streams for different endpoints 112 to change characteristics such as the number of participants shown in the composited video or the aspect ratio or resolution of the composited video. In particular, MCU 114 may take into account the capabilities of each endpoint 112 or other MCU 124 or 134 when composing an image for that endpoint 112 or remote MCU.
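The per-recipient variation described above can be illustrated with a minimal sketch. The function name and the endpoint identifiers are hypothetical and are used only for illustration; an actual MCU may select sources in other ways.

```python
def images_for_recipient(sources, recipient):
    """Return the sources to composite for one recipient, omitting its own image."""
    return [source for source in sources if source != recipient]


# Hypothetical endpoint identifiers; each recipient gets its own source list.
endpoints = ["ep_a", "ep_b", "ep_c"]
per_endpoint = {ep: images_for_recipient(endpoints, ep) for ep in endpoints}
```

Each entry of per_endpoint would then drive a separate compositing operation, so that the composited stream sent to one endpoint may differ from the stream sent to another.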
A videoconferencing subsystem associated with MCU 124 operates on network 120 of
A videoconferencing subsystem associated with MCU 134 operates on network 130 of
MCUs 114, 124, and 134 may create respective composited video streams representing composite video images 210, 220, and 230 for transmission to external videoconference systems as described above. In the example of
Some MCUs allow compositing operations using video streams that may have been composited by another MCU, but the resulting image may have individual streams at varying sizes without a good cause. For example,
Process 400 begins with a process 410 of analyzing the input video streams to determine the number of video images or sub-streams composited in each input video stream and the respective areas corresponding to the video images. In particular, each video stream coming into a compositing stage can be evaluated to determine whether the video stream is a composited stream. The analysis can consider the content of the video stream as well as other factors. For example, the source of the video stream can be considered if particular sources are known to provide a composited video stream or known not to provide a composited video stream. In some videoconferencing systems, the video streams received directly from at least some endpoints 132 may be known to represent a single video image, while video streams received from other MCUs may or may not be composited video streams. Video streams that are known not to be composited do not need to be further evaluated and can be assumed to contain a single video image occupying the entire area of each frame of video.
With process 400, an MCU generating a composited video stream may add flags or auxiliary data to the video stream to identify the video stream as being composited and even identifying the number of video images and the areas assigned to the video images in each composited frame. In step 412, MCU 134 can check for auxiliary data that MCU 114 or 124 may have added to an input video stream to indicate that the video stream is a composited video stream. Similarly, in some configurations of videoconferencing system 100, MCU 134 and MCU 114 or 124 may be able to communicate via a proprietary application program interface (API) to specify the compositing layout in the previous stage, which could remove the need to do sophisticated analysis of a composited video stream because the sub-streams are known. A videoconferencing standard may also provide commands associated with choosing particular configurations that MCU 134 could send to MCU 114 or 124 to define the previous stage compositing behavior in MCU 114 or 124. This could allow MCU 134 to identify the video images or sub-streams without additional analysis of the incoming stream from MCU 114 or 124. In other configurations, MCU 114 or 124 may be a legacy MCU that is unable to include auxiliary data when a video image is composited, unable to communicate layout information through an API, and unable to receive compositing commands from MCU 134.
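One possible ordering of these sources of layout information can be sketched as follows. This is only an illustration under stated assumptions: the InputStream fields, the aux_layout representation, and the query_api and analyze callables are hypothetical names and do not correspond to any particular videoconferencing standard or MCU interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


@dataclass
class InputStream:
    width: int
    height: int
    known_single_image: bool = False         # e.g. received directly from an endpoint
    aux_layout: Optional[List[Rect]] = None  # hypothetical auxiliary data in the stream


def resolve_areas(stream: InputStream,
                  query_api: Optional[Callable[[InputStream], Optional[List[Rect]]]] = None,
                  analyze: Optional[Callable[[InputStream], List[Rect]]] = None) -> List[Rect]:
    """Return the image areas of a stream, preferring the cheapest information."""
    full_frame = [(0, 0, stream.width, stream.height)]
    if stream.known_single_image:
        return full_frame                    # no further evaluation needed
    if stream.aux_layout is not None:
        return stream.aux_layout             # flags or auxiliary data from the prior stage
    if query_api is not None:
        areas = query_api(stream)            # layout reported by the prior-stage MCU
        if areas is not None:
            return areas
    if analyze is not None:
        return analyze(stream)               # legacy source: fall back to image analysis
    return full_frame
```

Under this ordering, image analysis runs only for a stream from a legacy MCU that provides neither auxiliary data nor an API.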
A composited video stream can be identified from the image content of the video stream. For example, a composited video data stream will generally include edges that correspond to a transition from an area corresponding to one video image to an area corresponding to another video image or a filler area, and in step 414, MCU 134 can employ image processing techniques to identify edges in frames represented by an input video stream. The edges corresponding to the edges of video images may be persistent and may occur in most or every frame of a composited video stream. Further, the edges may be characteristically horizontal or vertical (not at an angle) and in predictable locations such as lines that divide an image into halves, thirds, or fourths, which may simplify edge identification. In step 414, MCU 134 may, for example, scan each frame for horizontal lines that extend from the far left of a frame to the far right of the frame and then scan for vertical lines that extend from the top to the bottom of the frame. Horizontal and vertical lines can thus identify a simple grid containing separate image areas. More complex arrangements of image areas could be identified from horizontal or vertical lines that do not extend across a frame but instead end at other vertical or horizontal lines. A recursive analysis of image areas thus identified could further detect images in a composited image resulting from multiple compositing operations, e.g., if image 300 of
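The scan of step 414 for lines that span a frame can be sketched as follows. This is a simplified illustration, not a definitive implementation: it assumes grayscale frames stored as lists of rows of integers, and the threshold and coverage parameters and all function names are hypothetical. A practical implementation would likely use a more robust edge detector.

```python
def full_span_edges(frame, threshold=40, coverage=0.95):
    """Return (row_edges, col_edges) where intensity jumps across nearly the whole frame."""
    height, width = len(frame), len(frame[0])
    rows = [y for y in range(1, height)
            if sum(abs(frame[y][x] - frame[y - 1][x]) > threshold
                   for x in range(width)) >= coverage * width]
    cols = [x for x in range(1, width)
            if sum(abs(frame[y][x] - frame[y][x - 1]) > threshold
                   for y in range(height)) >= coverage * height]
    return rows, cols


def persistent_edges(frames, **kwargs):
    """Keep only the edges found at the same position in every sampled frame."""
    per_frame = [full_span_edges(frame, **kwargs) for frame in frames]
    rows = set.intersection(*(set(r) for r, _ in per_frame))
    cols = set.intersection(*(set(c) for _, c in per_frame))
    return sorted(rows), sorted(cols)


def grid_areas(rows, cols, width, height):
    """Turn full-span edges into (x, y, w, h) cells of a simple grid."""
    ys = [0] + rows + [height]
    xs = [0] + cols + [width]
    return [(xs[i], ys[j], xs[i + 1] - xs[i], ys[j + 1] - ys[j])
            for j in range(len(ys) - 1) for i in range(len(xs) - 1)]
```

Applying the same functions to the pixels of each resulting cell would be one way to perform the recursive analysis mentioned above.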
MCU 134 in step 415 also checks the current video stream for filler areas. The filler areas may, for example, be areas of constant color that do not change over time. Such filler areas may be relatively large, e.g., covering an area comparable or equal to the area of a video image, or may be a border frame that MCU 114 or 124 adds around each video image when compositing video images. Such border frames can have consistent characteristics such as a characteristic width in pixels or a characteristic color, and MCU 134 can use such known characteristics of the border frames to simplify identification of separate video images. Further, a convention can be adopted by MCUs 114, 124, and 134 to use specific types of border frames to intentionally simplify the task of identifying areas associated with separate video images in a composited video stream.
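The filler test of step 415 can be sketched as a check that a candidate area has a single color that does not change across several sampled frames. The sketch assumes grayscale frames stored as lists of rows of integers; the function name and the tolerance parameter are hypothetical.

```python
def is_filler(frames, area, tolerance=2):
    """Return True if every pixel of the area stays near one value in all frames."""
    x, y, w, h = area
    reference = frames[0][y][x]  # value of the area's top-left pixel in the first frame
    for frame in frames:
        for row in frame[y:y + h]:
            for value in row[x:x + w]:
                if abs(value - reference) > tolerance:
                    return False
    return True
```

A small tolerance allows for compression noise in an otherwise constant area. An area that passes the test can be dropped from the set of video images so that its screen space is available to the remaining images.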
MCU 134 in step 416 can use the information regarding the locations of edges or filler areas to identify separate image areas in a composited input stream. For example, analysis of one or more frames representing a composite video image 210 of
Repeating analysis process 410 for each input video stream determines the total number of video images represented by all of the input video streams. In particular, each composited video stream may represent multiple video images or sub-streams. MCU 134 in step 420 can use the total number of video images and other information about the composited video stream or streams to determine an optimal layout for the current compositing stage performed by MCU 134 in process 400. An optimal layout may, for example, give each participant in a meeting an equal area in the output composited image.
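An equal-area layout of the kind mentioned above can be sketched as a near-square grid sized for the total number of video images. The function names are hypothetical, and the grid rule (columns equal to the ceiling of the square root of the image count) is only one reasonable choice among many.

```python
import math


def total_images(areas_per_stream):
    """Sum the image areas found in each input stream by the analysis."""
    return sum(len(areas) for areas in areas_per_stream)


def equal_grid_layout(count, width, height):
    """Return `count` equal (x, y, w, h) cells arranged in a near-square grid."""
    if count <= 0:
        return []
    cols = math.ceil(math.sqrt(count))
    rows = math.ceil(count / cols)
    cell_w, cell_h = width // cols, height // rows
    return [((i % cols) * cell_w, (i // cols) * cell_h, cell_w, cell_h)
            for i in range(count)]
```

When the count does not fill the grid, the unused cells in the last row remain empty in this sketch; a layout that centers or enlarges the last row could be substituted.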
The layout selected in step 420 may further depend on user preferences and other information such as the content or a classification of the video images or the capabilities of the endpoint 132 receiving the composited video stream. For example, a user preference may allot more area of a composited image to the video image of a current speaker at the videoconference, a whiteboard, or a slide in a presentation. The selection of the layout may define areas in an output video frame and map the video images to respective areas in the output frame.
Compositing process 400 uses the selected layout and the identified video images or sub-streams in a process 430 that constructs each frame of an output composited video stream. Process 430 in step 432 identifies an area that the selected layout defines in each new composited frame. Step 434 further uses the layout to identify an input data stream and possibly an area in the input data stream that is mapped to the identified area of the layout. If the input data stream is not composited, the input area may be the entire area represented by the input data stream. If the input data stream is a composited video stream, the input area corresponds to a sub-stream of the input data stream. In general, the input area will differ in size from the assigned area in the layout, and step 435 can scale the image area from the input data stream to fit properly in an assigned area of the layout. The scaling can increase or decrease the size of the input image and may preserve the aspect ratio of the input area or stretch, distort, fill, or crop the image from the input area if the aspect ratios of the input area and the assigned layout area are different. In step 436, the scaled image data generated from the input area or video sub-stream can be added to a bit map of the current frame being composited, and step 438 can determine whether the composited frame is complete or whether there are areas in the layout for which image data has not been added. When an output frame is finished, MCU 134 in step 440 can encode the new composite frame as part of a composited video stream in compliance with the videoconferencing protocol being employed.
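Steps 435 and 436 can be sketched for the aspect-preserving case as follows. The sketch assumes frames stored as lists of rows of pixel values and uses nearest-neighbor sampling for brevity; the function names are hypothetical, and stretching, filling, or cropping would be equally valid choices under the description above.

```python
def fit_preserving_aspect(src_w, src_h, dst_area):
    """Return the largest centered (x, y, w, h) inside dst_area with the source aspect ratio."""
    dx, dy, dw, dh = dst_area
    if src_w * dh >= dw * src_h:  # source is relatively wider: width is the limit
        out_w, out_h = dw, max(1, src_h * dw // src_w)
    else:                         # source is relatively taller: height is the limit
        out_w, out_h = max(1, src_w * dh // src_h), dh
    return (dx + (dw - out_w) // 2, dy + (dh - out_h) // 2, out_w, out_h)


def blit_scaled(src_frame, src_area, out_frame, dst_area):
    """Scale src_area of src_frame into dst_area of out_frame (nearest neighbor)."""
    sx, sy, sw, sh = src_area
    tx, ty, tw, th = fit_preserving_aspect(sw, sh, dst_area)
    for j in range(th):
        src_row = src_frame[sy + j * sh // th]
        out_row = out_frame[ty + j]
        for i in range(tw):
            out_row[tx + i] = src_row[sx + i * sw // tw]
```

Calling blit_scaled once per area of the layout fills the bit map of the output frame; the parts of a layout area not covered by the scaled image keep whatever background value the bit map was initialized with.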
The areas associated with video images or sub-streams in the input video streams may remain constant over time unless a participant joins or leaves a videoconference. In a step 450, MCU 134 decides whether one or more of the input data streams should be analyzed to detect changes, and if so, process 400 branches back to analysis process 410. Such analysis can be performed periodically or in response to an indication of a change in the videoconference, e.g., termination of an input video stream or a change in videoconference information. A change in user preference from a recipient of the output composited video stream from MCU 134 might also trigger analysis of input video streams in process 410 or selection of a new layout in step 420. Additionally, videoconferencing events such as a change in the speaker or presenter may occur that trigger a change in the layout or a change in the assignment of video images to areas in the layout. If such an event occurs, process 400 may branch back to layout selection step 420 or back to analysis process 410. If new analysis is not performed and the layout is not changed, process 400 can execute step 460 and repeat process 430 to generate the next composited frame using the previously determined analysis of the input video streams and the selected layout of video images.
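The branching after each output frame can be sketched as a small decision function. The names and the periodic-analysis parameter are hypothetical. The sketch also makes one specific choice that the description leaves open: preference and speaker changes lead to layout selection step 420 rather than to analysis process 410.

```python
from enum import Enum


class NextStep(Enum):
    ANALYZE = "process 410"
    SELECT_LAYOUT = "step 420"
    COMPOSE = "process 430"


def next_step(frame_index, analysis_period, stream_changed=False,
              preference_changed=False, speaker_changed=False):
    """Decide where process 400 continues after a frame has been composited."""
    periodic = analysis_period > 0 and frame_index % analysis_period == 0
    if stream_changed or periodic:
        return NextStep.ANALYZE          # input areas may have changed
    if preference_changed or speaker_changed:
        return NextStep.SELECT_LAYOUT    # same images, new arrangement
    return NextStep.COMPOSE              # reuse prior analysis and layout
```

With this rule, the common case of an unchanged conference costs only the compositing of the next frame.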
Implementations may include computer-readable media, e.g., non-transient media such as an optical or magnetic disk, a memory card, or other solid-state storage, storing instructions that a computing device can execute to perform specific processes that are described herein. Such media may be or may be contained in a server or other device connected to a network such as the Internet that provides for the downloading of data and executable instructions.
Although particular implementations have been disclosed, these implementations are only examples and should not be taken as limitations. Various adaptations and combinations of features of the implementations disclosed are within the scope of the following claims.