Embodiments of the present disclosure are generally directed to video conferencing systems and related video conferencing methods. More specifically, embodiments relate to a system and method for generating and delivering a video stream containing desired regions of a video conferencing environment to a local and/or a remote location.
The recent increase in remote work has highlighted the need for improvements in video conferencing. In a completely remote video conferencing environment, multiple remote users interact with a central video conferencing graphical user interface (GUI) through their own individual cameras. These individual camera environments typically create a standard portrait view of each person. Occasionally, the video conferencing environment consists of a hybrid setup with multiple remote people and one or more people in a (non-remote) conference room. Generally, most conference rooms have a front-of-room camera that captures the entire conference room and all of the people present. However, in certain conference room seating situations, it is nearly impossible for a front-of-room camera to capture the front of each speaker. For example, in scenarios where the conference room consists of a circular table, with people situated in a circular setup, the front-of-room camera only captures a front-facing profile view of the people facing the camera. Furthermore, in this scenario, the front-of-room camera only captures a side profile of the users sitting perpendicularly to a full field of view (FFOV) of the front-of-room camera.
Accordingly, designers of video-conferencing systems seek to provide desirably oriented portrait views of people in a conference room without altering the seating arrangement. One solution is to use a wide-angle camera to capture more of the conference room. However, the larger FFOV also captures many unnecessary aspects of the room (windows, walls, etc.). Another solution is to use multiple cameras. However, multiple-camera configurations struggle to display a profile view of each person, as no single camera FFOV can capture a perfect portrait view of each person from every angle. Furthermore, traditional multi-camera front-of-room video conferencing systems switch between the multiple cameras, providing only a single and often incomplete view based on one camera angle or another, but not both. These multi-camera front-of-room video conferencing systems cannot blend the video streams to incorporate an FFOV of all of the cameras in the system. Moreover, as the number of cameras and input video streams increases, the strain on the processing components of the image system processor increases, resulting in a need to decrease video picture quality to avoid latency issues. An additional deficiency with multiple cameras is that each camera may have a different delay, leading to the audio and video of a particular camera being out of sync or the video streams of the cameras being out of sync with each other. Maintaining synchronization is possible by adding delays, but the delays make the already unacceptable latencies worse.
If the latency in the delivery of images and audio data during a video conference is too large, the ability of a video conferencing system to effectively carry out the video conference is affected since the lag in the presentation to a remote viewer caused by the large latency is annoying to the remote viewer. Also, with a video conferencing participant's typical desire to deliver and present high-quality videos, many manufacturers have pushed to use higher and higher resolution cameras. Multiple high-resolution cameras often increase the lag in delivering the combined video stream data to a remote location.
There is a need for a multi-camera video conferencing system and video conferencing method that allows each speaker to be viewed in a forward facing manner regardless of their seating or speaking position, and also a need for a multi-camera video conferencing system that has a low latency, while also maintaining or providing high-quality images to one or more video conference locations.
Embodiments provide a method for video conferencing. The method includes processing a video stream received from each of a plurality of sensors, where each video stream includes a first version of video data having a first resolution. Processing the video stream includes sampling the first version of the video data to form a second version of the video data, where the second version of the video data has a second resolution that is less than the first resolution, determining one or more regions of interest within the second version of the video data, and generating metadata for each of the one or more regions of interest. The method further includes generating cropping instructions for each of the one or more regions of interest based on the metadata, removing portions of the first version of the video data based on the cropping instructions, and generating a composite display that includes portions of the first version of the video data remaining after removing portions from the first version of video data.
Embodiments of the disclosure provide a method of video conferencing that includes processing a video stream received from each of a plurality of sensors, wherein each video stream includes a first version of video data having a first resolution, and processing the video stream comprises: sampling the first version of the video data to form a second version of the video data, wherein the second version of the video data has a second resolution that is less than the first resolution; determining one or more regions of interest within the second version of the video data, and generating metadata for each of the one or more regions of interest; generating cropping instructions for each of the one or more regions of interest based on the metadata; removing portions of the first version of the video data based on the cropping instructions; and generating a composite display that includes portions of the first version of the video data remaining after removing portions from the first version of video data.
Embodiments of the disclosure provide a video conferencing system that includes: a plurality of sensors, an image signal processor, a computer vision processor, a virtual cinematographer, and a video composer. The plurality of sensors are each configured to generate a video stream that comprises a first version of video data that has a first resolution. The image signal processor is configured to downscale the first version of the video data to form a second version of the video data that has a second resolution that is less than the first resolution. The computer vision processor is configured to determine two or more regions of interest and generate metadata for each of the two or more regions of interest using the second version of the video data. The virtual cinematographer is configured to: create a ranking of the determined two or more regions of interest based on the metadata; generate crop instructions for each of the two or more regions of interest based on the metadata and the ranking of each of the two or more regions of interest; and crop at least two or more portions of the first version of the video data to form at least two or more presentation regions of interest. The video composer is configured to compile the at least two or more presentation regions of interest.
Embodiments of the disclosure provide a method of video conferencing that includes processing a video stream received from each of a plurality of sensors, wherein each video stream includes a first version of video data having a first resolution, and processing the video stream comprises: a) sampling the first version of the video data to form a second version of the video data, wherein the second version of the video data has a second resolution that is less than the first resolution; b) determining one or more regions of interest within the second version of the video data, c) generating metadata for each of the one or more regions of interest; d) selecting one or more regions of interest as best by ranking each of the one or more regions of interest in the second version of the video data based on the metadata; e) generating cropping instructions for each of the one or more regions of interest selected as best based on the metadata; f) cropping at least two or more portions of video data from the first versions of the video to form two or more presentation regions of interest based on the cropping instructions; and g) generating a composite scene video data that includes the two or more presentation regions of interest.
Embodiments also provide a system performing one or more aspects of the above method.
So that the manner in which the above-recited features of the present disclosure can be understood in detail, a more particular description of the disclosure, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. However, it is to be noted that the appended drawings illustrate only exemplary embodiments and are therefore not to be considered limiting of its scope, for the disclosure may admit to other equally effective embodiments.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.
Numerous specific details are set forth in the following description to provide a more thorough understanding of the embodiments of the present disclosure. However, it will be apparent to one of skill in the art that one or more of the embodiments of the present disclosure may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring aspects of the present disclosure.
Traditional video conferencing systems typically include one camera placed at the front of the room to capture a full field of view (FFOV) of the entire conference room. A single front-of-room camera can generally capture all of the persons in the FFOV but can often not provide an optimal view of each participant in the room. In response, some video-conferencing systems have added a second front-of-room camera. The problem with a second front-of-room camera is that the front-of-room cameras are typically unable to capture an optimal view of the people in the room since their viewpoint is limited to a position at the front of the room. In addition, when typical multi-camera front-of-room video conferencing systems desire only one preferable view of a speaker, they are forced to repeatedly switch between a first camera and a second camera, providing a single view based on the first camera angle or the second camera angle. In addition, these multi-camera front-of-room video conferencing systems cannot blend the video stream to incorporate at least portions of an FFOV of the first camera and second camera. Thus, a traditional multi-camera video conferencing system's controller's decision results solely in cycling between the views provided by the one or more cameras within the video conferencing environment.
Cycling between cameras becomes especially problematic when many people are in the FFOV. The controller must decide which camera shows the most optimal view. Thus, the final product is not an optimized video of both people but a one camera view that is slightly better than the other. This problem is compounded as more and more people join the video conference environment. Furthermore, selecting only one FFOV of one camera can exclude other potentially more beneficial camera views for a viewer. Thus, there is a need for a multi-camera system that can provide an optimal view of participants in the conference without having to cycle through video streams.
As briefly discussed above, because latency in the delivery of images and audio data during a video conference affects the ability of the video conferencing system to effectively carry out the video conference without annoying its participants due to a lag in the presentation of the images and audio data relative to the video conference going on at a viewer's location, there is a need for a system that has a desirable low latency, while also maintaining or providing high quality images to one or more video conference locations. Typically, latencies exceeding a 160 millisecond (ms) delay lead to noticeable and undesirable delays in the delivery and presentation of the images and audio data; thus, latency delays less than 130 ms are preferred. As is discussed further below, embodiments of the disclosure provided herein also provide a system and methods of delivering high-resolution images with low latency by at least defining one or more regions of interest (ROI) within the images found within a video stream, extracting the ROIs from a high-resolution stream of images, and transmitting the extracted ROIs to a desired location. The system and methods provided herein thus reduce the amount of unnecessary information that is provided from the video conferencing system to the “pipe-line” that is delivering the video images and audio data to other video conferencing locations, and allow the multiple ROIs to be transferred through a “pipe-line,” each having a high resolution and/or image quality.
The video conferencing system 100 described herein is generally able to combine and alter a video stream generated by an image signal processor from a video generating device (e.g., camera) and thus, continuously use any advantageous camera angle from any camera in the multi-camera video system. Furthermore, because the image signal processing system gathers the video streams from the cameras and combines them, the system can actively combine a defined ROI that contains a first portion of one person from a first camera FFOV with a defined ROI that contains a second portion of the same person from a second camera FFOV. This combining is especially beneficial in situations where the most optimal view of the person may be in a partial blind spot for either the first or second camera or may be in a position where the best view for the person is in a position where the FFOV of the first camera overlaps with the FFOV of the second camera, and selecting only one FFOV from one camera would not be the optimal view.
The plurality of sensors 105A-105C includes a first sensor 105A, a second sensor 105B, and a third sensor 105C, which, in some embodiments, are high-resolution cameras, such as cameras that have a resolution greater than 8 megapixels (MP), or greater than 12 MP, or greater than 16 MP, or greater than 20 MP, or even greater than MP that are capable of delivering video at a resolution of at least Full HD (1080p). In some embodiments, the cameras are capable of delivering video at a 2K video resolution, or UHD (2160p) video resolution, or DCI 4K (i.e., 4K) video resolution, or 8K or greater video resolution. The terms “camera” and “sensor” are generally used interchangeably throughout the disclosure provided herein, and neither term is intended to be limiting as to the scope of the disclosure provided herein since, in either case, these terms are intended to generally describe a device that is at least able to generate a stream of visual images (e.g., frames) based on a field-of-view of one or more optical components (e.g., lenses), and an image sensor (e.g., CCD, CMOS sensor, etc.) disposed within the “camera” or “sensor.”
The plurality of sensors 105A-105C captures a video stream based on a full field-of-view (FFOV) of the sensor. Typically, the FFOV environment includes one or more subjects, a foreground, a background, and a surrounding area. In some embodiments, a video stream includes video frames and audio data (e.g., audio packets). The video frames, and optional audio data, are hereafter referred to generally as “video data.” In some embodiments, at least the video stream portion of the video data is shared across all nodes within the system.
The first ISP pass 107 receives the video data 106 captured by the plurality of sensors 105A-105C. In some configurations, the first ISP pass 107 includes a plurality of individual ISP passes 107A-107C, each associated with a respective sensor 105A-105C. The first ISP pass 107 receives the video data 106 from a sensor and resizes the FFOV video data by downscaling the video data gathered by each sensor 105A-105C from a higher resolution (e.g., 4K) to a lower resolution (e.g., 720p).
The first ISP pass 107 processes the video data from the downscaled version and then sends a version of the video data, referred to herein as the downscaled video data, to a memory location 110A, 110C, or 110E via a path 108A. The first ISP pass 107 can downscale the video data to any useful megapixel (MP) value to allow for the rapid analysis of the downscaled data by the CVP 115. In one embodiment, the video data is downscaled to a resolution of about 1 megapixel (MP).
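As a non-limiting illustration, the downscaling performed by the first ISP pass 107 can be sketched in software as follows. The sketch assumes a frame held as nested lists of grayscale values, an integer scale factor, and simple block averaging; the function name `downscale` is hypothetical, and an actual ISP would typically use hardware scalers. The pixel-count arithmetic assumes a 3840×2160 source, which yields a 720p frame of roughly 0.92 MP, in line with the "about 1 MP" target noted above.

```python
def downscale(frame, factor):
    """Block-average a 2D grayscale frame by an integer factor (illustrative only)."""
    height, width = len(frame), len(frame[0])
    out_h, out_w = height // factor, width // factor
    out = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            total = 0
            # Sum every source pixel that falls inside this output pixel's block.
            for dy in range(factor):
                for dx in range(factor):
                    total += frame[oy * factor + dy][ox * factor + dx]
            row.append(total // (factor * factor))
        out.append(row)
    return out


# Pixel counts for the assumed 3840x2160 source and a 1280x720 analysis frame.
full_pixels = 3840 * 2160
low_pixels = 1280 * 720
reduction = full_pixels / low_pixels  # pixels the CVP avoids touching, as a ratio

tiny = [
    [0, 2, 4, 6],
    [2, 4, 6, 8],
    [10, 10, 20, 20],
    [10, 10, 20, 20],
]
small = downscale(tiny, 2)
```

Under these assumptions, the analysis frame holds one ninth of the pixels of the original frame, which is the source of the processing savings described for the CVP 115.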
The memory, such as memory 110A, 110C, and 110E, stores the processed video data from the first ISP pass 107. In some embodiments, the memory, such as memory 110B, 110C, and 110E, stores unprocessed video data (i.e., original video data) in memory locations 110B, 110D, and 110F via paths 109A, 109B, 109C. In addition, in some embodiments, the memory stores regions of interest (ROIs) for later processing by the second ISP pass 145.
Computer vision processor (CVP) 115 processes the downscaled video data, which is made available to the virtual cinematographer (VCT) 120.
The CVP 115 gathers video data from memory 110. The CVP 115 performs three functions. It functions as a classifier to determine whether or not the video data contains one or more ROIs. It generates metadata associated with each ROI, such as metadata relating to a person within the video conferencing environment and/or ROI within which the person is captured. Finally, it serves to determine “who is who.”
The CVP 115 defines one or more ROI within the frames of the video data. The low-resolution video data decreases processing latency as only a small portion of the video data is passed through and analyzed by the VCT 120. Furthermore, as further explained in relation to
The metadata includes one or more video stream characteristics, such as date, time stamp information, sensor 105 information, attributes of one or more people in the ROI, video attributes, and other useful video conferencing information. The classifier constructs a region of interest (ROI) that includes at least a portion of one or more people to determine whether or not the video data captures one or more people. In some embodiments, each speaker within the video conferencing environment is assigned his/her own ROI. The ROI is defined by a bounding box. In one configuration, the bounding box is created to include the subject's face and a portion of their body. However, the bounding box of the ROI can be any arbitrary size and location within the FFOV of any of the one or more cameras.
The CVP 115 also generates metadata that includes a plurality of characteristics for portions of each ROI. The characteristics can further include a color histogram, vertical motion history, history of speech, polar coordinate history of attributes in the ROI, and person's face direction history. The characteristics in one configuration include head segmentation and head pose estimation. Each characteristic is assigned a value by the virtual cinematographer and ranked to determine the “best view” of each person and/or speaker. Here, the CVP 115 generates metadata that includes characteristic information concurrently with creating the metadata that includes the bounding box and its instructions. In another configuration, the CVP 115 sequentially generates the metadata that includes characteristic information after creating the metadata that includes the bounding box instructions. The CVP 115 sends the metadata to a VCT 120.
Each ROI is tagged with a unique ID tag, allowing the CVP 115 to track inferences of specific video characteristics for each person. For example, the CVP 115 makes a plurality of ID tags to identify a first ROI for a first person, a second ROI for a second person from a first camera, a third ROI for a third person, a fourth ROI for a fourth person, and a fifth ROI for a fifth person from a second camera.
Because the CVP 115 is shared across all nodes when there is a single device, multiple sensors, and one SoC within the video conferencing system 100, each subject is individually evaluated at the virtual cinematographer 120 irrespective of the origin of the video stream. This evaluation is especially valuable in configurations where a first portion of an ROI from a first camera FFOV is stitched with a second portion of an ROI from a second camera FFOV to create a full ROI of a person. By not requiring the CVP 115 to be individually associated with each camera, the CVP 115 can individually determine ROI regardless of the origin of the video stream.
The VCT 120 operates on the ROIs 203 and the ROI metadata to generate a ranking on which the best view is decided. In one embodiment, the “best view” is formed using artificial intelligence (AI) software running on the video conferencing system 100. The best view is based on (1) who is speaking, (2) the camera people are facing, and (3) the camera people are closest to (using polar coordinates). The person speaking identifies candidate persons for streaming. The camera people are facing gives information about the camera that best captures the person's face and is more important than the camera people are closest to, which gives information about the camera with the best image quality. If there are multiple cameras that best capture the person's face, the camera with the higher quality image is selected. In addition, the items such as a person's face size, face angle, color histogram, vertical motion history, history of speech, polar coordinate history, and face direction history, all of which are stored in memory, can be used to help select the best view. Once the virtual cinematographer (VCT) determines the “best view” of each ROI, the VCT 120 generates cropping instructions based on the “best view” as determined by software running therein (e.g., artificial intelligence (AI) software).
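As a non-limiting illustration, the priority order described above (speaking first, then the camera the person faces, then the closest camera) can be sketched as a lexicographic comparison. The `View` record, its field names, and the 10-degree tolerance used to treat two face angles as equally good are all assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    person_id: str
    camera_id: str
    speaking: bool
    face_angle_deg: float  # 0 means the person faces the camera directly
    distance_m: float      # distance from the person to the camera


def best_views(views, angle_bucket_deg=10.0):
    """Return {person_id: camera_id} for speaking persons (illustrative only)."""
    best = {}
    for view in views:
        # (1) Only persons who are speaking are candidates for streaming.
        if not view.speaking:
            continue
        # (2) Face angle dominates; (3) distance breaks ties within a bucket.
        key = (abs(view.face_angle_deg) // angle_bucket_deg, view.distance_m)
        current = best.get(view.person_id)
        if current is None or key < current[0]:
            best[view.person_id] = (key, view)
    return {pid: view.camera_id for pid, (_, view) in best.items()}


views = [
    View("p1", "camA", True, 80.0, 1.0),  # closest, but a side profile
    View("p1", "camB", True, 5.0, 3.0),   # frontal, far away
    View("p1", "camC", True, 8.0, 2.0),   # frontal, closer than camB
    View("p2", "camA", False, 0.0, 1.0),  # not speaking
]
selection = best_views(views)
```

In this example the side-profile view from the nearest camera loses to both frontal views, and the tie between the two frontal views goes to the closer camera.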
Video composer (VCMP) 130 executes the crop instruction on the original video data to remove regions in the FFOV outside the defined bounding box or boxes. Removing irrelevant video data reduces the amount of video data that has to be processed by VCMP 130 and any downstream processor such as ISP-pass2. The instructions for creating an ROI and removing video data outside the bounding box are hereafter referred to as “bounding box instructions.”
The video composer (VCMP) 130 receives the cropping instructions for the best views of the ROIs where multiple re-sizers in conjunction with the hardware-accelerated video composer 130 create presentation ROIs. The original video data outside of the best view ROI is removed and discarded. Removing unused original video data increases the efficiency of the video composer 130 by requiring the video composer 130 to generate only one ROI per person. In some cases, an ROI may be referred to as a “people frame” since the ROI primarily contains portions of a person within the video conference environment versus ROIs that contain portions of other areas of interest (e.g., whiteboard surfaces).
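As a non-limiting illustration, applying crop instructions to the original video data can be sketched as follows. Because the ROIs are determined on the downscaled video data, the sketch assumes that the bounding box coordinates are rescaled to the original resolution before cropping; this rescaling step, the tuple layout `(x0, y0, x1, y1)`, and the function names are assumptions made for this sketch.

```python
def scale_box(box, low_res, full_res):
    """Map a box from downscaled pixel coordinates to original pixel coordinates."""
    x0, y0, x1, y1 = box
    sx = full_res[0] / low_res[0]
    sy = full_res[1] / low_res[1]
    return (int(x0 * sx), int(y0 * sy), int(x1 * sx), int(y1 * sy))


def crop(frame, box):
    """Keep only the pixels inside the box; everything else is discarded."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in frame[y0:y1]]


# A box found on a 1280x720 analysis frame, mapped onto a 3840x2160 original.
full_box = scale_box((550, 170, 750, 450), (1280, 720), (3840, 2160))

# Small worked example: a 6x4 "original" frame analyzed at 3x2.
original = [[r * 10 + c for c in range(6)] for r in range(4)]
small_box = scale_box((1, 0, 3, 1), (3, 2), (6, 4))
presentation_roi = crop(original, small_box)
```

Only the retained pixels are passed on, so downstream stages operate on the ROI at the original pixel density rather than on the full FFOV.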
In one configuration, the second ISP-pass 145 uses a plug-in re-sizer (for example, software that upscales the video data to a higher resolution). In one configuration, the second ISP pass 145 resizes the person frame 203 by increasing the zoom level of the person frame 203 while maintaining quality within an allotted picture quality (e.g., resolution) threshold value that is stored in memory. In some embodiments, the output of the second ISP pass is sent to memory, skipping the need for SVPP or CPU, thereby reducing latency.
VCMP 130, with the aid of SVPP and possibly a GPU, merges individual ROIs into a “single composite frame” (graphical user interface (GUI)). As previously mentioned, downscaling original video data reduces the latency in creating individual ROIs. In some embodiments, the composite frame is handled, without using additional hardware-accelerated elements, by allocating memory of the same size as the outbound video stream (1920×1080) and writing the individual ROIs to the memory.
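As a non-limiting illustration, composing a frame by allocating a buffer the size of the outbound video stream and writing each ROI into it can be sketched as follows. The sketch assumes one byte per pixel (grayscale) and a side-by-side layout of two 960×1080 ROIs; the function names and the layout are assumptions made for this sketch.

```python
OUT_W, OUT_H = 1920, 1080  # size of the outbound video stream


def new_canvas():
    """Allocate one buffer the size of the outbound frame (1 byte per pixel)."""
    return bytearray(OUT_W * OUT_H)


def write_roi(canvas, roi, roi_w, roi_h, dst_x, dst_y):
    """Copy an ROI into the canvas row by row at the given destination."""
    if dst_x < 0 or dst_y < 0 or dst_x + roi_w > OUT_W or dst_y + roi_h > OUT_H:
        raise ValueError("ROI does not fit in the composite frame")
    for row in range(roi_h):
        src = row * roi_w
        dst = (dst_y + row) * OUT_W + dst_x
        canvas[dst:dst + roi_w] = roi[src:src + roi_w]


canvas = new_canvas()
left_roi = bytes([100]) * (960 * 1080)   # stand-in pixels for a first person
right_roi = bytes([200]) * (960 * 1080)  # stand-in pixels for a second person
write_roi(canvas, left_roi, 960, 1080, 0, 0)
write_roi(canvas, right_roi, 960, 1080, 960, 0)
```

Because each ROI is written directly into its destination rows, no intermediate blending pass is needed for non-overlapping layouts.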
During steps 234, 236, and 238, the CVP 115 uses the downscaled video data stored in the one or more first memory locations. Using only some pixels allows the CVP 115 to gather information about the original video data 109 without having to process every pixel, which increases the processing speed of this activity by decreasing the computing demands on the processing device (e.g., a processor within a SoC). Using the downscaled video data, the CVP 115 acts as a classifier and can determine whether or not one or more people are in a video frame. The classifier includes one or more algorithms that are configured to review and analyze the sampled data found in the video frames to detect whether a person is present in the video data. The classifier has rules and/or software instructions stored therein that are able to detect and isolate a person from all of the other elements (e.g., chairs, tables, paintings, windows, etc.) in the video data.
Specifically, during step 234, the CVP 115 determines one or more ROIs using the downscaled video data retrieved from memory in step 232. In one example, to determine whether or not the video data contains an ROI, such as a region including one or more people, the CVP 115 typically uses a classifier (e.g., face detection application programming interface (API)) to determine the face and/or upper body of each person. Generally, face detection APIs recognize facial features in video data through face mapping software. Face mapping software superimposes a grid onto the speaker to measure contours (e.g., biometrics) associated with facial features and can be used to train neural networks to automatically detect similar facial features in the future. In another example, the classifier is able to detect or isolate a person by detecting changes in their position or portion thereof, their shape or outline relative to a background, or other useful metric.
Once the CVP 115 determines that one or more people are present in the video data, the CVP 115 constructs a region-of-interest (ROI) 203 for each person around at least their face and upper body. In some embodiments, each speaker can have his/her own ROI assigned to them. The ROI is defined by a bounding box with a desired size defined by one or more bounding box attributes stored in memory. In one configuration, the bounding box is created to include the subject's face and portions of their body. The edges and/or corners of the bounding box include X coordinates and Y coordinates (e.g., pixel coordinates) and are determined using offsets defined by the attributes stored in memory. The amount of one or more offsets in any direction can be determined based on the resolution of the video data that is to be later cropped, such as adjusted so that a controlled number of pixels are found within one or more ROIs. The bounding box coordinates as part of the metadata generated by the CVP 115 are then sent in step 240 to VCT 120 as bounding box instructions. The bounding box instructions can include data relating to the portion of the video data to be processed using the bounding box. The bounding box instructions thus include information relating to at least one bounding box that is used to define one or more ROIs.
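As a non-limiting illustration, building a bounding box from a detected face and stored offsets can be sketched as follows. The `Box` record, the offset values (expressed here as fractions of the detected face size), and the function name are assumptions made for this sketch; the disclosure states only that the offsets are defined by attributes stored in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: int  # left edge, in pixels
    y0: int  # top edge, in pixels
    x1: int  # right edge, in pixels
    y1: int  # bottom edge, in pixels


def expand_face_box(face, frame_w, frame_h, side=0.5, top=0.3, bottom=1.5):
    """Grow a face box to include part of the upper body, clamped to the frame."""
    w = face.x1 - face.x0
    h = face.y1 - face.y0
    return Box(
        max(0, face.x0 - int(side * w)),
        max(0, face.y0 - int(top * h)),
        min(frame_w, face.x1 + int(side * w)),
        min(frame_h, face.y1 + int(bottom * h)),
    )


# A 100x100 face detected on a 1280x720 downscaled frame.
roi = expand_face_box(Box(600, 200, 700, 300), 1280, 720)
# A face at the frame corner, where the offsets are clamped.
corner_roi = expand_face_box(Box(0, 0, 100, 100), 1280, 720)
```

The larger bottom offset reflects the stated goal of including the face and a portion of the body; the resulting corner coordinates are the kind of X and Y values carried in the bounding box instructions.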
While concurrently performing step 234, at step 238, the CVP 115 generates ROI characteristics as part of the metadata based on an analysis of the downscaled video data retrieved from memory during step 232. During this step, the CVP 115 generates data that contains a plurality of characteristics based on one or more attributes or properties of portions of the video data within each of the ROIs. The characteristics include color histogram, vertical motion history, history of speech, polar coordinate history, and face direction history. The stream characteristics further include head segmentation and head pose estimation.
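As a non-limiting illustration, one of the listed characteristics, the color histogram, can be sketched as follows. The sketch assumes that ROI pixels are available as (R, G, B) tuples with 8-bit channels and uses 4 bins per channel; the bin count, the function names, and the use of histogram intersection to compare two ROIs are assumptions made for this sketch rather than details of the disclosure.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram of an ROI (illustrative only)."""
    step = 256 / bins_per_channel

    def bin_of(value):
        # Clamp so that 255 always lands in the last bin.
        return min(int(value / step), bins_per_channel - 1)

    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        index = (bin_of(r) * bins_per_channel + bin_of(g)) * bins_per_channel + bin_of(b)
        hist[index] += 1
    total = len(pixels)
    return [count / total for count in hist] if total else hist


def intersection(hist_a, hist_b):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


roi_pixels = [(255, 0, 0)] * 3 + [(0, 0, 255)]
hist = color_histogram(roi_pixels)
similarity = intersection(hist, hist)
```

A compact descriptor of this kind is inexpensive to compute on the downscaled video data and can be stored alongside the other per-ROI characteristics.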
Head segmentation, which estimates the current face direction, gathers data on the angle of the person's head and face relative to the camera. For example, in one configuration, a person may be speaking to a person next to them and facing away from the nearest camera. Because the speaking person is facing away from the camera, the head segmentation would consist of a side profile view of the speaker's head and likely upper body. Thus, the side profile view is where the speaker's head and face are facing away from each camera creating a perpendicular angle between a line emanating from the center of the speaking person's face and a line tracking the shortest distance between each camera and the speaking person. Furthermore, in this configuration, the head segmentation data would likely result in the video-conferencing system 100 determining that the side profile view of the speaker is not the “best view” of the speaker. The head segmentation process is further discussed with respect to
Head pose estimation is a process by which a system estimates the future angle and location of a person's head and face in relation to the camera during normal conversational human behavior (for example, speaking, twitching, moving, or signaling). For example, a speaking person can be speaking to more than one person in the conference room, thus requiring the speaker to pan back and forth across the room. In this situation, the system would gather data determining the future angle and location of the person's head and face relative to each camera. As previously mentioned, the head and face angle of each person relative to each camera is based on a first line extending outwardly from a point in the center of each person's face (e.g., nose), with the intersection of a second line that tracks along the shortest path between each camera and the speaker. Examples of face angles can be found in
In one embodiment, head pose estimation is performed using a neural network that processes images of a single head and returns estimated yaw, pitch, and roll values. In another embodiment, a neural network that processes FFOV images and produces segmentation masks of the upper body, head, and nose is later processed to obtain pitch and yaw for each person's head. In yet another embodiment, a history of head-pose estimation is used to generate angular velocity curves for yaw, pitch, and roll and predict future head poses using a Kalman filter or the Hungarian algorithm.
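As a non-limiting illustration, the face angle geometry and a simple form of pose prediction can be sketched as follows. The first function measures the angle between a line extending from the face and the line from the person to a camera, in a 2D floor plane. The second function extrapolates yaw from the two most recent samples at constant angular velocity; this is a deliberately simpler stand-in for the Kalman filter mentioned above, not an implementation of it. The function names and the 2D simplification are assumptions made for this sketch.

```python
import math


def face_angle_deg(person_xy, facing_xy, camera_xy):
    """Angle between the facing direction and the person-to-camera line.

    0 means the person faces the camera; about 90 is a side profile.
    """
    to_cam = (camera_xy[0] - person_xy[0], camera_xy[1] - person_xy[1])
    dot = facing_xy[0] * to_cam[0] + facing_xy[1] * to_cam[1]
    cross = facing_xy[0] * to_cam[1] - facing_xy[1] * to_cam[0]
    return math.degrees(math.atan2(abs(cross), dot))


def predict_yaw(history, horizon_s):
    """Extrapolate yaw from the last two (timestamp_s, yaw_deg) samples."""
    (t0, yaw0), (t1, yaw1) = history[-2], history[-1]
    angular_velocity = (yaw1 - yaw0) / (t1 - t0)  # degrees per second
    return yaw1 + angular_velocity * horizon_s


frontal = face_angle_deg((0.0, 0.0), (1.0, 0.0), (5.0, 0.0))
profile = face_angle_deg((0.0, 0.0), (1.0, 0.0), (0.0, 3.0))
behind = face_angle_deg((0.0, 0.0), (1.0, 0.0), (-2.0, 0.0))
future_yaw = predict_yaw([(0.0, 10.0), (0.5, 20.0)], 0.25)
```

A side profile as described above corresponds to an angle near 90 degrees, which a ranking step would typically score below a near-zero angle from another camera.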
At step 240, system 100 sends metadata for each of the determined ROIs from the CVP 115 to the virtual cinematographer (VCT) 120, as shown by path 111 in
In step 244, system 100 generates a ranking of the ROIs by the VCT 120 based on the video characteristic information found in the received ROI metadata. Generating a ranking of the ROIs may include determining a best view for each person within a conferencing environment. The best view is based on characteristics indicative of a person participating in a conversation. The characteristic information can include a conference room participant's face size, face angle relative to a reference frame, vertical motion history, history of speech, and/or face direction history relative to a reference frame.
In step 248, system 100 generates and delivers cropping instructions based on the outcome of the steps shown in
At step 248, system 100 generates cropping instructions. Here, the cropping instructions are generated by the VCT 120 and generally include cropping data relating to the portion of the video data that is to be cropped and video data that is to be retained after performing activity 266 by use of the received ROI metadata. As previously mentioned, the cropping instructions are based on the ROI with the “best view,” obtained in step 246 based on the sampled and downscaled video data. The cropping instructions include X and Y coordinates of the edges and/or corners of each of the bounding boxes. Typically, when the ROI includes individual “people frames,” a bounding box encapsulates the person, including their face and portions of their upper body. One example of typical bounding box dimensions is the ROIs, as depicted in
At step 250, system 100 includes sending the crop instructions to the video composer 130 (path 112 in
Video Conferencing Process Flow Examples
As previously mentioned, original video data 106A and 106B is captured by the plurality of sensors 105A-105B. Each sensor 105A-105B is communicatively coupled to the first ISP pass 107 of the image processing system 301. The original video data 106A, 106B is thus sent to and received by the first ISP pass 107. In this example, the video data is generated from a 12-megapixel (MP) camera capable of delivering video at a 4K resolution. Each sensor 105A-105B is not limited by the size of the video data that it gathers. A first version of the original video data 106A, 106B is then downscaled and stored in memory 110 as sampled video data 108A. A second version of the original video data 106A is stored in memory 110. In this example, the sampled video data 108A is downscaled to a video resolution of 720p or less.
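The two stored versions can be illustrated with a small sketch. It assumes a frame is a list of pixel rows and uses plain nearest-neighbour striding; a real ISP would use a hardware scaler with filtering. The names `downsample` and `store_frame` and the dictionary used as "memory" are hypothetical.

```python
def downsample(frame, factor):
    """Keep every `factor`-th pixel in both directions (nearest neighbour)."""
    return [row[::factor] for row in frame[::factor]]


def store_frame(memory, sensor_id, frame, factor=3):
    """Store the untouched frame and a downscaled copy side by side.

    A factor of 3 corresponds to going from 2160 rows to 720 rows.
    """
    memory[("original", sensor_id)] = frame                   # kept at full size
    memory[("sampled", sensor_id)] = downsample(frame, factor)  # analysis copy


if __name__ == "__main__":
    # Tiny 6x9 stand-in for a full-resolution frame; each pixel is (row, col).
    frame = [[(r, c) for c in range(9)] for r in range(6)]
    memory = {}
    store_frame(memory, "105A", frame)
    sampled = memory[("sampled", "105A")]
    print(len(sampled), len(sampled[0]))
```

The point of keeping both entries is that later analysis can run on the small copy while the final crops are still cut from the full-size data.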
The sampled video data 108A is retrieved from memory and sent to the CVP 115. Once the VCT 120 has determined the best ROI, the VCT 120 sends crop instructions to the first ISP pass 107, where multiple re-sizers in conjunction with the video composer 130 create individual presentation ROIs from the original video data 106A, 106B. The original video data is retrieved from memory 110 by the first ISP pass 107, retrieved from memory 110 by the video composer 130, and passed between the video composer 130 and the first ISP pass 107. The video data is first altered by the first ISP pass 107 to create a downsized portion of the ROIs and is then sent to the video composer 130 as video data 116 to finalize the eventual creation of the presentation ROIs. In this example, the ROIs are downscaled to a 720p resolution or less so as not to overwhelm the processing capabilities of the second ISP pass 145, and to ensure a deterministic latency for the completion of the second ISP pass 145, since the previously performed processing steps are not a significant rate-limiting sequence in this process.
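Because the ROIs are determined on the downscaled data but the crops are taken from the original data, the bounding-box coordinates have to be mapped between the two resolutions. The sketch below shows one way to do that mapping; the function name `scale_crop`, the (left, top, right, bottom) box layout, and the 1280x720 and 3840x2160 sizes are assumptions for illustration.

```python
import math


def scale_crop(box, analysis_size, original_size):
    """Map a box found on the downscaled frame onto the original frame.

    box: (left, top, right, bottom) in analysis-frame pixels.
    Sizes are (width, height). The result is rounded outward and clamped
    to the original frame so the crop never loses part of the ROI.
    """
    a_w, a_h = analysis_size
    o_w, o_h = original_size
    s_x, s_y = o_w / a_w, o_h / a_h
    left, top, right, bottom = box
    return (
        max(0, math.floor(left * s_x)),
        max(0, math.floor(top * s_y)),
        min(o_w, math.ceil(right * s_x)),
        min(o_h, math.ceil(bottom * s_y)),
    )


if __name__ == "__main__":
    # Box detected on a 1280x720 analysis frame, applied to a 3840x2160 original.
    print(scale_crop((100, 50, 300, 250), (1280, 720), (3840, 2160)))
```

Rounding outward is a design choice in this sketch: it trades a few extra pixels for the guarantee that the person framed at low resolution is fully contained in the high-resolution crop.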
As depicted in
Typically the final composite display 305 includes the one or more presentation ROIs 303A-303D. As depicted in
Each person 402-408 in
As depicted in
Using the FFOV of the second sensor 415B, the video-conferencing system relieves the user of having to face a particular sensor (e.g., front-of-room camera). Turning and speaking to a sensor with an FFOV perpendicular to the speaker's face creates an unnatural speaking environment in which the speaker has to choose between facing the people in the conference room or the main sensor (e.g., front-of-room camera). By using all of the video data gathered by the first sensor 415A, the second sensor 415B, and the third sensor 415C, the video-conferencing system allows the speaker (e.g., person 403) to have a more natural conversation with other people (e.g., person 402 and people 404-408) in the room without having to turn away to face one specific sensor. The video-conferencing system 100 is especially helpful in allowing the speaker (person 403) to communicate directly with each member of the conference room by maintaining a traditional line of sight with each person and not turning away to face a particular sensor. Traditional video-conferencing systems that use one or more sensors require the user to continuously adjust to the sensor if the user is interested in being seen at other remote locations. The only view optimization the traditional systems typically conduct is flip-flopping between video feeds. However, as noted above, the video conferencing system 100 disclosed herein uses all of the video data gathered from the plurality of sensors 415A-415C, and determines the best view using an ROI generated by CVPs for each sensor at the ISP from the aggregate video data, thus seamlessly creating the best view of the speaker without requiring the speaker to adjust their position relative to any one sensor. Furthermore, as seen by the position of the sensors on the conference room table, the sensors are positioned such that the FFOV of at least one of them can continuously capture the face of the speaker no matter which way the speaker is facing.
In traditional video conferencing setups, the person speaking can turn away from the one or more sensors, and thus the traditional system has no optimal view (or an optimal view that consists solely of a side profile view).
In addition, because the video conferencing system described herein aggregates the video data gathered by the sensors 415A, 415B, and 415C, the video conferencing system allows the remote viewers to view the speaker in a front profile view of their physical orientation in the room. For example, as depicted in
As previously mentioned, gathering the original video data from each sensor and determining the best view using the original video data, as commonly done in conventional systems, increases latency due to the large amount of received data that needs to be analyzed to determine the best view. Thus, the process flow described herein is needed to reduce the computing power required to continuously determine the best views of each person 402-405. Furthermore, in configurations where only the speaker is shown in the final composite video 900, the processing speed becomes increasingly significant. The speaker can change more than once within a period of time, requiring the video-conferencing system to alter the displayed video to a new speaker seamlessly and without delay. Traditional process flows, which lack downscaling and do not use a CVP 115 and a VCT 120, exhibit video lag or a complete slowdown as the processing power required to process each ROI of each speaker overwhelms traditional hardware capabilities. However, as described herein, a CVP unit on each device processes the video stream, generates metadata about the people in the video streams, shares that metadata across all nodes within the system, and agrees on “who is who.” Each node in the system then sub-samples/crops the original video stream to be processed into a single stream. By sub-sampling/cropping the video stream, the number of pixels that have to be processed by the ISP/GPU/composing elements is reduced, reducing overall latency. In addition, because 415A may have only one camera sensor, its glass-to-glass latency would be lower than that of 415B if 415B were to process full-resolution (4K) images from its camera modules. If this were done, there would be noticeable video de-synchronization between the ROIs from 415A and the ROIs from 415B. To manage this, the above-described process sub-samples to reduce the glass-to-glass latency of 415B so that it is comparable to that of 415A.
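The pixel reduction argument can be made concrete with a short calculation. It assumes that "4K" means 3840x2160 and "720p" means 1280x720, which the disclosure does not state explicitly; under those assumptions a 720p frame holds one ninth of the pixels of a 4K frame. The helper names are hypothetical.

```python
UHD_4K = (3840, 2160)   # assumed meaning of "4K"
HD_720P = (1280, 720)   # assumed meaning of "720p"


def pixels_per_frame(resolution):
    """Number of pixels in one frame of the given (width, height)."""
    width, height = resolution
    return width * height


def pixel_load(resolution, streams):
    """Pixels that must be touched per frame interval across all streams."""
    return pixels_per_frame(resolution) * streams


if __name__ == "__main__":
    full = pixel_load(UHD_4K, streams=3)      # three sensors analyzed at 4K
    sampled = pixel_load(HD_720P, streams=3)  # same sensors analyzed at 720p
    print(full, sampled, full // sampled)
```

Pixel count is only a proxy for latency, since actual glass-to-glass delay also depends on the hardware and on memory bandwidth, but it shows why analysis on the sub-sampled data is much cheaper per frame.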
In traditional video conferencing configurations, the FFOV 751 captured by a single front-of-room sensor is the only view of the conference room environment. The traditional FFOV 751 of the conference room typically does not include the second sensor 415B and the third sensor 415C. In most cases, traditional front-of-room cameras only capture a front portrait view of the people facing the sensor. As depicted in the FFOV 751, only persons 405 and 404 face the sensor 415A. Thus, in this position, only the two forward-facing people 404 and 405 have a front portrait view (e.g., a forward-facing view of the speaker's head and upper body). People 402-403 and 406-408 have only a side profile view. In this seating arrangement of people 402-408, there is no way for the front-of-room sensor to capture a portrait view of people 402-403 and people 406-408. In
In this example, sensors 415A-415C capture original video data 106A-106C. Original video data 106A is associated with sensor 415A. Original video data 106B is associated with sensor 415B. Original video data 106C is associated with sensor 415C. The original video data 106A-106C captured from each sensor 415A-415C is passed to the first ISP pass 107. The video data 106A-106C is sampled (i.e., downscaled) at the first ISP pass 107. The downscaled video data 108A-108C is stored in memory 110. The CVP 115 then retrieves the downscaled video data 108A-108C from memory. As previously mentioned, the CVP 115 determines the ROI (bounded by a bounding box) for each person and generates metadata associated with each ROI. The metadata, including bounding box instructions, is sent to the VCT 120. The metadata is shown as 111 in
The image processing system 101, which is configured to implement the various methods described above, can include or utilize a personal computing device, e.g., a desktop or laptop computer, configured with hardware and software that a user may employ to engage in routine computer-related activities, such as video conferencing activities. In some embodiments, the image processing system 101 can include a SoC that generally includes a processor, memory, and a peripherals interface. The processor may be any one or combination of a programmable central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a video signal processor (VSP) that is a specialized DSP used for video processing, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network coprocessor, or other hardware implementation(s) suitable for performing the methods set forth herein, or portions thereof.
The memory, coupled to the processor, is non-transitory and represents any non-volatile memory of a size suitable for storing one or more software applications, video data, metadata, and other types of data described herein. Examples of suitable memory may include readily available memory devices, such as random access memory (RAM), flash memory, a hard disk, or a combination of different hardware devices configured to store data.
The peripherals interface is configured to facilitate the transfer of data between the various video conferencing devices and one or more of the plurality of peripheral devices that are integrated with or are disposed in wired or wireless communication with the sensors and other connected electronic devices within the video conferencing system. The peripherals interface may include one or more USB controllers and/or may be configured to facilitate one or more wireless communication protocols, which may include, but are not limited to, Bluetooth, Bluetooth low energy (BLE), Infrastructure Wireless Fidelity (Wi-Fi), Soft Access Point (AP), WiFi-Direct, Address Resolution Protocol (ARP), ANT, UWB, ZigBee, Wireless USB, or other useful personal area network (PAN), wide area network (WAN), local area network (LAN), wireless sensor network (WSN/WSAN), near field communication (NFC), or cellular network communication protocols.
It is believed that the system and methods provided herein provide significant advantages over traditional methods that include numerous process bottlenecks that increase the latency of a video stream. Traditional methods do not sample video stream data to determine which portions of the video data should be kept (and subsequently compiled) and which portions should be discarded. By discarding the irrelevant portions of the original video data, the systems described herein can increase processing speed by requiring the system to perform less work. Specifically, by downscaling the original video data and analyzing the downscaled video data using the CVP 115 and the VCT 120, the systems described herein determine which portions of the original video data contain the best view of each speaker with minimal latency and desirable video quality. In general, all other unwanted or unnecessary video data is removed prior to starting the heavy computing task of compiling (e.g., composing) the video. In addition, the computing requirements of other image processing tasks, such as noise reduction and color tuning on the video, increase as the number of pixels increases.
Unlike the disclosed system and methods provided herein, traditional systems and methods downscale the entire video data before determining what portion of the video data needs to be further processed before the processed video data is delivered to a desired location within a video conferencing environment. Furthermore, traditional methods do not have parallel processing schemes, where a small portion of the original video data is downscaled, and the original video data is unaltered until later in the process of generating the desired video stream. Thus, traditional methods include bottlenecks, where the original video data is downscaled to a new downscaled video data size, then analyzed as downscaled video data, then upscaled back to an original (or greater) video data, and then compiled to form an altered set of video data.
Any downscaled megapixels need to be subsequently upscaled if any regions of interest are found. This upscaling is especially computationally intensive since all data from a camera is compressed without any vetting to find the desired ROIs. The downscaling and upscaling associated with this traditional process flow result in a larger latency. Downscaling the video data without affecting the original video data removes the need for the processing elements within a system-on-a-chip (SoC) to downscale the original video data only to later re-scale the video data after the ROIs have been determined.
The system and methods described herein, therefore, remove the need for downscaling all of the original video data to perform various “best” or “preferred” view types of analyses on the video data, saving computing power and instead only altering the original video data to remove unwanted portions of video data thus speeding up the process of compiling multiple high-resolution images. The downscaling allows the system to determine the ROI and create cropping instructions for altering the original high-resolution video stream data received from the sensors 105A-105C and stored in memory (i.e., 110B, 110D, and 110F) without ever having to re-scale the original video stream data. Furthermore, in some configurations, the original video data need not be re-scaled after the individual ROIs are formed.
The methods, systems, and devices described herein collectively provide a multi-camera video conferencing system and video conferencing method that allows each individual speaker to be viewed in a desired manner regardless of their seating or speaking position, and also a multi-camera video conferencing system that has a low latency, while also maintaining or providing high-quality images to one or more video conference locations.
While the foregoing is directed to embodiments of the present disclosure, other and further embodiments of the disclosure may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.