The present disclosure relates generally to video teleconferencing, and more specifically to synchronizing a plurality of video streams at a composite video distribution device.
In certain video teleconferencing environments, each of a plurality of individuals has a camera, a microphone, and a display, the combination of which is referred to herein as a teleconference endpoint. The video and audio from each endpoint are streamed to a central location where a video processing device, e.g., a Multi-point Control Unit (MCU), takes the video (and audio) from the various endpoints and redistributes the video to other endpoints involved in a conference session.
In some forms, the MCU acts as a video compositor and reformats the video by combining several video images onto a single screen image, thereby forming a “composite” image. The combination of various video feeds onto a single screen requires the reception of one whole frame from each video source in order to create the output frame. When the sources are asynchronous, each source uses a frame buffer. The average latency of these frame buffers is one-half a frame, or approximately 16.7 milliseconds (ms) at a standard frame rate of 30 frames per second (fps), while the maximum latency is a full frame, or approximately 33 ms. Latency may cause undesirable video and audio effects for those participating in the video teleconference.
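The frame-buffer latency figures above follow directly from the frame period. The following is a minimal sketch of that arithmetic, assuming the arrival phase of each asynchronous source is uniformly distributed over one frame period; the function name is illustrative only.

```python
from typing import Tuple


def frame_buffer_latency_ms(fps: float) -> Tuple[float, float]:
    """Return (average, maximum) frame-buffer latency in milliseconds.

    Assumes the arrival phase is uniformly distributed over one frame
    period, so the average wait is half a frame and the worst case is
    a full frame.
    """
    frame_period_ms = 1000.0 / fps
    return frame_period_ms / 2.0, frame_period_ms


average_ms, maximum_ms = frame_buffer_latency_ms(30.0)
print(f"average {average_ms:.1f} ms, maximum {maximum_ms:.1f} ms")
# prints "average 16.7 ms, maximum 33.3 ms"
```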
FIGS. 2a and 2b are diagrams showing an example of the mechanics of video generation for individual video streams of the conference endpoints in which a frame capture rate is adjusted by a video compositor.
a is an example block diagram of a network device that is configured to generate a composite video of the conference participants from a plurality of video streams.
b is an example block diagram of an endpoint device that is configured to generate individual video and audio streams for the conference participants.
Overview
Techniques are provided for synchronizing upstream video sources in vertical synchronization time and in frame rate, so that a downstream device can create a composite image with low latency. At a video compositor device, a plurality of video streams are received that comprise at least first and second video streams. First and second vertical synchronization points associated with the first and second video streams are determined. A difference in time or timing offset between the first and second vertical synchronization points is determined. At least one message is generated that is configured to change a video capture frame rate associated with one or both of the first and second video streams to reduce the difference in time (timing offset) and the message is sent to video capture devices for one or both of the first and second video streams.
Techniques are also provided for upstream video sources, e.g., video capture devices or cameras, to receive a message configured to indicate an adjustment to a video capture frame rate. The video capture frame rate is adjusted in response to the message to advance or retard a vertical synchronization point of the video signal produced by the video capture device.
Referring first to
Each of the media encoders/decoders 130(1)-130(4) encodes audio and video from cameras 120(1)-120(4) into transport streams 160(1)-160(4). The transport streams 160(1)-160(4) are transmitted to the compositor 150 via network 140. At the compositor 150, the video from transport streams 160(1)-160(4) is encoded into a composite video stream 170. A single composite frame of the composite video stream is shown at 175. The composite video stream 170 is sent to each of the media encoders/decoders 130(1)-130(4), which decode it for display on a corresponding one of the displays 180(1)-180(4) at the endpoints 105(1)-105(4) so that each participant may see each of the other participants during a conference session, as shown. The network 140 may be an intranet or campus network, the Internet, or other Wide Area Network (WAN), or combinations thereof.
Each of the video streams produced by the cameras 120(1)-120(4) is generated according to each camera's internal electronics, and as such, video frames are generated that start at different times, i.e., the video streams from cameras 120(1)-120(4) are not necessarily synchronized with each other. The video streams may become further shifted in time relative to one another by any network latency that may be introduced during transit via the network 140 to the compositor 150. In order to produce a composite video, the transport streams 160(1)-160(4) are buffered and then decoded. The buffering allows the compositor 150 to align the video from each of the cameras 120(1)-120(4) in order to remove any time differences or latencies between corresponding video frames. The composite image frames are then encoded for transport in composite video stream 170. Buffering, however, also introduces additional undesired latency in the system 100. According to the techniques described herein, latency is reduced by sending a control message or signal back to each of the cameras 120(1)-120(4) to adjust, i.e., advance or retard, their video capture frame rate so that eventually the vertical synchronization (“sync”) points or “syncs”/“V-syncs” of each video frame arrive at the compositor at roughly the same time.
Referring to
At 210, as the video is transmitted from the camera to the media encoder/decoder, e.g., one of media encoders/decoders 130(1)-130(4), it is encoded for transport, e.g., in a Moving Picture Experts Group (MPEG)-2 Transport Stream (TS). The encoded video stream may be encapsulated into MPEG packets, subsequently encapsulated into IP packets, and further encapsulated using Real-time Transport Protocol (RTP) for transport over network 140. Each video frame is compressed and packed into approximately 5 to 10 packets. The packets are emitted from the video encoder as soon as they are created, as shown. One packet is emitted approximately every 4 ms. Thus, the techniques described herein can optimize the timing of frames at a subframe or video slice level.
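The packet spacing quoted above can be checked with a short calculation. This sketch assumes the packets of a frame are spread evenly over one frame period at 30 fps, which is a simplification; real encoders emit packets as slices complete.

```python
def packet_interval_ms(fps: float, packets_per_frame: int) -> float:
    """Average spacing between packets, assuming even emission over a frame."""
    frame_period_ms = 1000.0 / fps
    return frame_period_ms / packets_per_frame


# The 5-to-10 packet range from the text, plus a mid-range value of 8.
intervals_ms = {n: packet_interval_ms(30.0, n) for n in (5, 8, 10)}
for n, interval in intervals_ms.items():
    print(f"{n} packets per frame: one packet every {interval:.2f} ms")
```

Under this assumption, roughly 8 packets per frame corresponds to the approximately 4 ms spacing, which is the granularity at which arrival timing can be observed.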
In
At 220, a scan line timing diagram for a first camera, e.g., camera 120(1), is shown with an earliest vertical sync at line 0. At 230, camera 120(2) has a vertical sync that starts later than the vertical sync for camera 120(1). At 240, camera 120(3) has a vertical sync that starts later than the vertical syncs for cameras 120(1) and 120(2), and at 250, camera 120(4) has a vertical sync that starts later than the vertical sync for camera 120(2), but earlier than the vertical sync for camera 120(3). The relative timing of the vertical syncs shown at 220-250 may also indicate the relative arrival times of the vertical syncs at the compositor 150. According to the techniques described herein, the compositor 150 sends a control signal back to each of the cameras 120(1)-120(4) to adjust their video capture frame rates in order to advance or retard the vertical syncs (as needed) such that each vertical sync arrives at the compositor at roughly the same time. This concept is shown in greater detail in
Referring to
As shown in
In this example, the reference time 340 is set up based on transport stream 160(1), with a Δt1 of zero. The remaining time differences Δt2, Δt3, and Δt4 are shown with approximately the same delays that are shown in
The compositor 150 generates a composite image frame for all the participants in the video teleconference, e.g., participants 110(1)-110(4) as shown in
The compositor 150 also generates and sends feedback signals or control messages 360(1)-360(4) to corresponding cameras 120(1)-120(4) to advance or retard their respective video capture frame rates so that eventually the vertical syncs of each video frame arrive at the compositor at approximately the same time. The compositor 150 uses information from each of the sync delay units 320(1)-320(4) and generates the feedback signals 360(1)-360(4) such that the delay through the sync delay units 320(1)-320(4) is minimized. The process for synchronizing the vertical syncs at the compositor 150 has been briefly described in connection with
Referring now to
The processor 410 is a data processing device, e.g., a microprocessor, microcontroller, system on a chip (SoC), or other fixed or programmable logic. The processor 410 interfaces with the memory 450 to execute instructions stored therein. Memory 450 may be any form of random access memory (RAM) or other tangible (non-transitory) memory media that stores data used for the techniques described herein. The memory 450 may be separate from or part of the processor 410. Instructions for performing the frame capture rate adjustment computation and signaling process logic 500 may be stored or encoded in the memory 450 for execution by the processor 410.
The functions of the processor 410 may be implemented by a processor or computer readable tangible (non-transitory) medium encoded with instructions or by logic encoded in one or more tangible media (e.g., embedded logic such as an application specific integrated circuit (ASIC), digital signal processor (DSP) instructions, software that is executed by a processor, etc.), wherein the memory 450 stores data used for the computations or functions described herein (and/or to store software or processor instructions that are executed to carry out the computations or functions described herein). Thus, the process 500 may be implemented with fixed logic or programmable logic (e.g., software/computer instructions executed by a processor or field programmable gate array (FPGA)), or the processor or computer readable tangible medium may be encoded with instructions that, when executed by a processor, cause the processor to execute the process 500.
Referring to
Turning now to
At 540, at least one control message is generated that is configured to adjust (change) a video capture frame rate associated with one or both of the first and second video streams to reduce the difference in time between the first and second vertical synchronization points. At 550, a control message is sent to the video capture device for one or both of the first and second video streams, e.g., the control message(s) could be sent to one or both of the cameras 120(1) and 120(2) that generate video frames for transport streams 160(1) and 160(2). Over time the first and second vertical synchronization points will converge. As the first and second vertical synchronization points converge, control messages may be generated to dynamically adjust the corresponding video capture frame rates or to maintain a current frame rate. If it is determined at 530 that there is not a significant difference in time, then control messages for one or both of the first and second video streams may be configured to maintain the first and second vertical synchronization points.
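The determination and messaging steps at 530-550 can be sketched as follows. This is an illustrative sketch only: the `ControlMessage` fields, the 1 ms significance threshold, and the choice to split the correction evenly between the two cameras are assumptions, not details given in this description.

```python
from dataclasses import dataclass
from typing import List

FRAME_PERIOD_MS = 1000.0 / 30.0  # nominal 30 fps


@dataclass
class ControlMessage:
    stream_id: int      # 1 = first video stream, 2 = second video stream
    action: str         # "advance", "retard", or "maintain"
    offset_ms: float    # share of the timing offset this camera should remove


def wrapped_offset_ms(first_vsync_ms: float, second_vsync_ms: float,
                      period_ms: float = FRAME_PERIOD_MS) -> float:
    """Signed offset of the second V-sync relative to the first.

    Wrapped into [-period/2, period/2) so that a sync arriving almost a
    full frame late is treated as slightly early for the next frame.
    """
    half = period_ms / 2.0
    return (second_vsync_ms - first_vsync_ms + half) % period_ms - half


def build_messages(first_vsync_ms: float, second_vsync_ms: float,
                   threshold_ms: float = 1.0) -> List[ControlMessage]:
    offset = wrapped_offset_ms(first_vsync_ms, second_vsync_ms)
    if abs(offset) < threshold_ms:
        # No significant difference: hold the current frame rates.
        return [ControlMessage(1, "maintain", 0.0),
                ControlMessage(2, "maintain", 0.0)]
    share = abs(offset) / 2.0
    if offset > 0:
        # Second V-sync arrives later: slow the first, speed up the second.
        return [ControlMessage(1, "retard", share),
                ControlMessage(2, "advance", share)]
    return [ControlMessage(1, "advance", share),
            ControlMessage(2, "retard", share)]


messages = build_messages(first_vsync_ms=0.0, second_vsync_ms=10.0)
for message in messages:
    print(message)
```

Repeating this computation as new frames arrive lets the messages shrink toward "maintain" as the two synchronization points converge.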
Depending on the magnitude of the adjustment rates for the various vertical syncs, updates are periodically computed to the time differences between the various vertical syncs and the video capture frame rates are dynamically adjusted accordingly. Thus, the control message may be configured to dynamically adjust the video capture frame rate for one or both of the first and second video streams based on a rate of convergence between the first and second vertical synchronization points. The control message generated for the first video capture device may be different than the control message generated for the second video capture device. For example, the control message for the first video capture device may be configured to cause the first video capture device to advance its V-sync and the control message for the second video capture device may be configured to cause the second video capture device to retard its V-sync so that the V-sync of the first video stream and the V-sync of the second video stream converge to align with each other. Moreover, the first and second video streams are only an example and, as depicted in
Said another way, the process 500 involves a network element, e.g., MCU 300 or other element configured to perform compositor operations, receiving a plurality of video streams (at least first and second video streams) over a packet-based network. The network element records the arrival time of each packet, and determines which packets correspond to the start and end of each video frame, e.g., by examining the RTP headers. The network element combines the video streams, one frame from each video stream, to produce a composite video frame as part of a composite video stream. The network element is configured to minimize the latency of the system by minimizing the time that packets sit in their respective buffers before they are combined. To do this, the downstream element synchronizes the streams to a video frame capture rate. The vertical sync or SOF is timed so that the last line of that video source arrives “just-in-time” to be combined into the output composite video frame.
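One way the network element might find frame boundaries from RTP headers is sketched below. It assumes a payload format in which all packets of a video frame carry the same RTP timestamp and the marker bit is set on the last packet of the frame, which holds for many RTP video payload formats but should be verified for the format in use. The `FrameTiming` record and the `make_packet` helper are hypothetical and exist only for this illustration.

```python
import struct
from typing import Dict, Iterable, List, NamedTuple, Tuple


class FrameTiming(NamedTuple):
    rtp_timestamp: int
    first_arrival_ms: float   # arrival of the earliest packet of the frame
    last_arrival_ms: float    # arrival of the latest packet of the frame
    complete: bool            # True once a packet with the marker bit is seen


def parse_rtp_header(packet: bytes) -> Tuple[bool, int]:
    """Return (marker bit, timestamp) from the fixed 12-byte RTP header."""
    _vpxcc, m_pt, _seq, timestamp, _ssrc = struct.unpack("!BBHII", packet[:12])
    return bool(m_pt & 0x80), timestamp


def frame_timings(arrivals: Iterable[Tuple[float, bytes]]) -> List[FrameTiming]:
    """Group (arrival_ms, packet) pairs into frames keyed by RTP timestamp."""
    frames: Dict[int, FrameTiming] = {}
    for arrival_ms, packet in arrivals:
        marker, ts = parse_rtp_header(packet)
        prev = frames.get(ts, FrameTiming(ts, arrival_ms, arrival_ms, False))
        frames[ts] = FrameTiming(ts,
                                 min(prev.first_arrival_ms, arrival_ms),
                                 max(prev.last_arrival_ms, arrival_ms),
                                 prev.complete or marker)
    return list(frames.values())


def make_packet(seq: int, timestamp: int, marker: bool) -> bytes:
    """Hypothetical helper: build a bare RTP header (version 2, payload type 96)."""
    return struct.pack("!BBHII", 0x80, 96 | (0x80 if marker else 0),
                       seq, timestamp, 1)


# Two frames, 3000 ticks apart (90 kHz clock at 30 fps), with made-up arrivals.
arrivals = [(0.0, make_packet(1, 0, False)),
            (4.0, make_packet(2, 0, False)),
            (8.0, make_packet(3, 0, True)),
            (33.0, make_packet(4, 3000, False)),
            (37.0, make_packet(5, 3000, True))]
timings = frame_timings(arrivals)
for timing in timings:
    print(timing)
```

The `first_arrival_ms` values give the observed start-of-frame times whose differences across streams drive the timing adjustment, and `last_arrival_ms` indicates when a frame is ready to be combined.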
Referring to
The upstream video sources, i.e., video cameras, include a means to control the raster scan rate of the image sensor within. One advantage provided by the techniques described herein is that a multi-megahertz clock or hard synchronization signal does not need to be fed to the camera from the downstream element. The camera uses its own crystal controlled pixel clock and only the start of the video frame needs to be synchronized.
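As an illustration of controlling the raster scan rate from a fixed pixel clock, the sketch below assumes the camera changes its frame rate by varying the total number of lines per frame (i.e., the vertical blanking interval). This mechanism is an assumption and is not stated in this description. The example timing values (74.25 MHz pixel clock, 2200 total pixels per line, 1125 total lines) are the common 1080-line, 30 fps raster and are used only as an example.

```python
PIXEL_CLOCK_HZ = 74_250_000   # assumed crystal-controlled pixel clock
PIXELS_PER_LINE = 2200        # assumed total pixels per line, including blanking
NOMINAL_LINES = 1125          # assumed total lines per frame at 30 fps


def frame_rate_hz(total_lines: int) -> float:
    """Frame rate produced by the fixed pixel clock for a given line count."""
    return PIXEL_CLOCK_HZ / (PIXELS_PER_LINE * total_lines)


def lines_for_rate(target_hz: float) -> int:
    """Nearest whole number of lines per frame that approximates target_hz."""
    return round(PIXEL_CLOCK_HZ / (PIXELS_PER_LINE * target_hz))


# Request a 0.1% faster capture rate than the nominal 30 Hz.
requested_hz = frame_rate_hz(NOMINAL_LINES) * 1.001
adjusted_lines = lines_for_rate(requested_hz)
achieved_hz = frame_rate_hz(adjusted_lines)
print(adjusted_lines, round(achieved_hz, 4))
```

Under these assumptions the adjustment is quantized to whole lines, so a camera would approximate a requested rate and rely on subsequent control messages to correct the residual.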
A video teleconference has just begun and cameras 120(1)-120(4) are streaming video frames that have been encoded into transport streams 160(1)-160(4) destined for compositor 150, as shown. A baseline or reference time is shown at 340 that represents the earliest arrival time at the compositor 150 from among the vertical syncs for video frames within transport streams 160(1)-160(4). The reference time or timing point 340 is the same timing point 340 shown in
The compositor 150 computes the delays between each of the vertical syncs in transport streams 160(1)-160(4). In this example, a target timing point 710 is generated by compositor 150. The target timing point 710 is a convergence target for all of the vertical syncs, as will be described hereinafter. The target timing point may also be considered as a target video frame capture rate. The target timing point 710 may be an average or weighted average of the delays, a root-mean-square (RMS) time based on the relative delays, a value based on known statistical, linear, or non-linear characteristics of the system 100, and the like. In another example, the target timing point 710 may be eliminated and the vertical syncs may be adjusted so that they converge to each other.
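The candidate computations for the target timing point 710 can be sketched as follows. The delay values are made up for illustration (they follow the ordering of the example, with the first stream at the reference time), and the function and parameter names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def target_timing_point_ms(delays_ms: Sequence[float], method: str = "mean",
                           weights: Optional[Sequence[float]] = None) -> float:
    """Compute a convergence target from V-sync delays relative to the reference time."""
    if method == "mean":
        return sum(delays_ms) / len(delays_ms)
    if method == "weighted":
        if weights is None or len(weights) != len(delays_ms):
            raise ValueError("weighted mean needs one weight per delay")
        return sum(w * d for w, d in zip(weights, delays_ms)) / sum(weights)
    if method == "rms":
        return math.sqrt(sum(d * d for d in delays_ms) / len(delays_ms))
    raise ValueError(f"unknown method: {method}")


def directions(delays_ms: Sequence[float], target_ms: float) -> List[str]:
    """A sync earlier than the target is retarded; a later one is advanced."""
    return ["retard" if d < target_ms else "advance" if d > target_ms else "maintain"
            for d in delays_ms]


# Hypothetical delays (ms) for the four streams relative to the reference time.
delays = [0.0, 10.0, 16.0, 13.0]
mean_target = target_timing_point_ms(delays)
rms_target = target_timing_point_ms(delays, method="rms")
print(mean_target, directions(delays, mean_target))
```

With these example values the mean-based target falls between the first and second streams, so the first camera is retarded and the remaining cameras are advanced; a different statistic moves the target and can change which cameras are sped up or slowed down.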
In this example, the compositor 150 may send control message/signal 360(1) to retard the video frame capture rate for camera 120(1) so the vertical sync starts to arrive at the compositor 150 later in time relative to timing point 340 and move toward target timing point 710. Similarly, the compositor 150 may send control messages/signals 360(2)-360(4) to advance the video frame capture rate for cameras 120(2)-120(4), respectively, so that the vertical syncs start to arrive at the compositor 150 earlier in time relative to timing point 340 and progress toward target timing point 710. The control message to the respective cameras may contain an adjustment on a percentage basis, a video frame capture frequency basis, or on an incremental time basis, thereby forming a closed loop that does not require clock or timing signals from the compositor, i.e., an absolute frame timing lock is not required. For example, the message may indicate that a camera should increase the video frame capture rate by 0.1% (percentage basis), change from 30 Hz to 30.2 Hz (frequency basis), or capture a frame in 32.5 ms instead of 33.3 ms (incremental time basis).
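The three adjustment formats can each be reduced to a new frame period, from which the per-frame shift of the vertical sync follows. The sketch below is illustrative; the function names and the 5 ms offset used for the convergence estimate are assumptions.

```python
NOMINAL_HZ = 30.0
NOMINAL_PERIOD_MS = 1000.0 / NOMINAL_HZ


def period_from_percent(percent: float, nominal_hz: float = NOMINAL_HZ) -> float:
    """Frame period (ms) after changing the capture rate by the given percentage."""
    return 1000.0 / (nominal_hz * (1.0 + percent / 100.0))


def period_from_frequency(new_hz: float) -> float:
    """Frame period (ms) for an explicitly requested capture frequency."""
    return 1000.0 / new_hz


def shift_per_frame_ms(new_period_ms: float) -> float:
    """How much earlier (positive) the V-sync arrives with each captured frame."""
    return NOMINAL_PERIOD_MS - new_period_ms


def frames_to_close(offset_ms: float, new_period_ms: float) -> float:
    """Frames needed to remove offset_ms at a constant adjusted period."""
    return offset_ms / shift_per_frame_ms(new_period_ms)


examples = {
    "0.1% faster": period_from_percent(0.1),
    "30.2 Hz": period_from_frequency(30.2),
    "32.5 ms period": 32.5,
}
for label, period in examples.items():
    print(f"{label}: period {period:.4f} ms, shift {shift_per_frame_ms(period):.4f} ms/frame, "
          f"about {frames_to_close(5.0, period):.0f} frames to remove a 5 ms offset")
```

The three example values in the text correspond to different adjustment magnitudes, so they trade convergence time against how far the capture rate departs from nominal.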
Referring to
The control messages 360(1)-360(4) may be configured to dynamically adjust a video capture frame rate for the transport streams 160(1)-160(4) generated by cameras 120(1)-120(4), respectively, based on differences in time between the vertical synchronization points and the vertical synchronization reference or target timing point 710. Alternatively, the control messages 360(1)-360(4) may be configured to dynamically adjust a video capture frame rate for the transport streams 160(1)-160(4) generated by cameras 120(1)-120(4), respectively, based on rates of convergence between the vertical synchronization points and the vertical synchronization reference or target timing point 710. The target timing point 710 may also be dynamically adjusted based on differences in time between the vertical synchronization points and/or rates of convergence of the vertical synchronization points.
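A difference-based closed loop of this kind can be simulated with a simple proportional rule. The sketch below is one possible control law under stated assumptions, not the method of this description: the gain, the 0.5% limit on rate change, the fixed target, and the starting offsets are all made up for illustration.

```python
from typing import List, Sequence

NOMINAL_PERIOD_MS = 1000.0 / 30.0


def rate_adjustment_percent(offset_ms: float, target_ms: float,
                            gain: float = 0.05, limit_percent: float = 0.5) -> float:
    """Percentage change to the capture rate; positive means capture faster.

    The change is proportional to the timing error and clamped so the
    camera never departs far from its nominal rate.
    """
    error_ms = offset_ms - target_ms          # positive: V-sync arrives late
    percent = gain * error_ms / NOMINAL_PERIOD_MS * 100.0
    return max(-limit_percent, min(limit_percent, percent))


def simulate(offsets_ms: Sequence[float], target_ms: float, frames: int) -> List[float]:
    """Apply the control rule once per frame and return the final offsets."""
    offsets = list(offsets_ms)
    for _ in range(frames):
        for i, offset in enumerate(offsets):
            percent = rate_adjustment_percent(offset, target_ms)
            new_period_ms = NOMINAL_PERIOD_MS / (1.0 + percent / 100.0)
            # A shorter period makes the next V-sync arrive earlier.
            offsets[i] = offset + (new_period_ms - NOMINAL_PERIOD_MS)
    return offsets


start = [0.0, 10.0, 16.0, 13.0]     # hypothetical V-sync offsets in ms
final = simulate(start, target_ms=9.75, frames=600)
print([round(offset, 3) for offset in final])
```

In this simulation the offsets approach the target within a few hundred frames; a rate-of-convergence variant would additionally scale the adjustment by how quickly the error has been shrinking between updates.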
Referring to
Techniques have been described for upstream video sources to be synchronized in V-sync time and in frame rate, so that a downstream device can create a composite image with low latency. At a video compositor device, a plurality of video streams are received that comprise at least first and second video streams. First and second vertical synchronization points associated with the first and second video streams are determined. A difference in time between the first and second vertical synchronization points is determined. At least one message is generated that is configured to change a video capture frame rate associated with one or both of the first and second video streams to reduce the difference in time and the message is sent to video capture devices for one or both of the first and second video streams.
Techniques also have been described for upstream video sources, e.g., video capture devices or cameras, to receive a message configured to indicate an adjustment to a video capture frame rate. The video capture frame rate is adjusted in response to the message to advance or retard a vertical synchronization point.
In summary, a downstream video sink sends messages to upstream video sources to adjust their video scan timing. The system achieves low latency from the camera to the combining video output by not using buffer delays to synchronize two pictures headed for the same display. Thus, extremely low latency video communication is achieved when two or more video streams are combined into a single video stream.
The above description is intended by way of example only.