In a videoconference, framing refers to the view or frame provided to the far end or far site. Originally, framing was performed manually, first locally and then remotely. The pan, tilt, and zoom (PTZ) of the camera was manually controlled to provide a desired picture. Generally, the camera was set to show all participants present in the meeting room and was not moved as people entered or left or as different people spoke. An improvement over the manual approach was systems that determined the current talker and then automatically directed the camera to that talker. This usually involved moving the camera, which was disorienting to viewers at the far end. In some cases, the last image before movement started was simply displayed until movement was completed. In a further improvement, two cameras were used, one to frame all the participants or the whole room and one for talker focus. The transmitted image would change from the talker view to the room view, or all-participants view, when the talker changed, so that a live view was always available but camera motion was not shown.
While these improvements provided a better experience than manual framing, they were still limited to framing either all participants or a single talker. In practice, there are many more situations than these two, such as multiple talkers, and those situations were not handled smoothly. When those situations occurred, the viewers at the far end had a less pleasant experience, as either some of the talkers were not shown or there were an excessive number of framing changes.
The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features. The figures are not necessarily drawn to scale, and in some figures, the proportions or other aspects may be exaggerated to facilitate comprehension of particular aspects.
While implementations are described herein by way of example, those skilled in the art will recognize that the implementations are not limited to the examples or figures described. It should be understood that the figures and detailed description thereto are not intended to limit implementations to the particular form disclosed but, on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean “including, but not limited to”.
A videoconferencing system may include multiple cameras with different fields-of-view (FOVs) of a conference room. These cameras may be operated to frame the conference room, a person in the conference room, multiple persons in the conference room, and so forth, depending on the FOV of each camera and default settings. For example, frames of images from a first camera with a first FOV may be sent to a far site to zoom in on a single person who is talking, while frames of images from a second camera with a wider FOV may be sent to the far site to frame multiple persons talking or to frame all persons in the conference room. When frames from different cameras are used, the resulting transition from one frame to the next may not be smooth, which results in an unpleasant viewing experience. By analyzing frames to determine a region of interest and determining whether a change in camera source is involved, a system may apply one or more transition effects that help improve the presentation of frames.
Frames, such as the still images that make up a video stream, are acquired by cameras in a videoconferencing system. These frames are processed and analyzed by one or more computer vision systems to extract numerical or symbolic information. In some embodiments, these computer vision systems include neural networks that help identify persons in a conference room and their corresponding locations. In particular, when frames are input to a neural network, the output of the neural network may provide information about features of persons detected in the frames, such as faces, bodies, or heads of those persons. In one example, bounding boxes may be output by the neural network to identify the faces of persons in the conference room.
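For illustration only, the detection step might be sketched as follows. This is a minimal sketch, not the actual detector used by the system: the `BoundingBox` type and the stubbed `detect_faces` function are assumptions standing in for the output of a trained neural network.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BoundingBox:
    # Normalized coordinates of a detected face within the frame (0.0 to 1.0).
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    confidence: float


def detect_faces(frame) -> List[BoundingBox]:
    """Placeholder for the neural-network face detector described above.

    A real implementation would run the frame through a trained detection
    model and return one bounding box per detected face.
    """
    # Stubbed output: two participants detected near the center of the room.
    return [
        BoundingBox(0.30, 0.40, 0.38, 0.55, confidence=0.97),
        BoundingBox(0.55, 0.42, 0.63, 0.57, confidence=0.94),
    ]


if __name__ == "__main__":
    boxes = detect_faces(frame=None)  # "frame" would be an image from a camera
    for box in boxes:
        center_x = (box.x_min + box.x_max) / 2
        print(f"face at horizontal position {center_x:.2f}, confidence {box.confidence:.2f}")
```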
In a similar manner, audio acquired from the microphones in a videoconferencing system may also be analyzed to determine the source location of sound detected by the microphones. In some embodiments, these sounds may correspond to the voices of persons talking during a videoconference. An audio system may process this audio data to determine a horizontal or vertical position of the source location of the detected sound.
By combining the feature data and the sound source data, a state may be determined for frames of images acquired by the cameras of a videoconferencing system. For example, the combination of bounding boxes and the horizontal or vertical position of a talking voice may help determine that one person of a group of persons in a conference room is talking at a particular time. Over time, such a state may change if other persons start talking, if the current talker stops talking, and so forth.
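One way such a combination might work is sketched below. The `match_talker` function and its tolerance threshold are illustrative assumptions: it simply associates the horizontal position reported by the audio locator with the nearest detected face.

```python
from typing import List, Optional


def match_talker(face_centers_x: List[float],
                 sound_position_x: Optional[float],
                 tolerance: float = 0.05) -> Optional[int]:
    """Return the index of the face closest to the detected sound source.

    face_centers_x   -- normalized horizontal centers of detected faces (0.0 to 1.0)
    sound_position_x -- normalized horizontal position reported by the audio
                        locator, or None if no speech is detected
    tolerance        -- maximum horizontal distance for a face to be treated
                        as the talker (an assumed threshold)
    """
    if sound_position_x is None or not face_centers_x:
        return None
    best = min(range(len(face_centers_x)),
               key=lambda i: abs(face_centers_x[i] - sound_position_x))
    if abs(face_centers_x[best] - sound_position_x) <= tolerance:
        return best
    return None


if __name__ == "__main__":
    faces = [0.34, 0.59, 0.81]          # three participants detected by the camera
    talker = match_talker(faces, 0.60)  # audio locator places a voice near 0.60
    print(f"talker index: {talker}")    # -> talker index: 1
```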
The state of the frames associated with a videoconference may help identify the type of framing, and corresponding regions of interest, that are desired. For example, a state of one person talking may correspond to framing the current talker, while a different state involving multiple persons talking may correspond to group framing. A state and a desired type of framing may help identify which of the multiple cameras in a videoconferencing system is best suited to present frames associated with the current state. Given that each camera has a corresponding FOV, the state and desired framing may be used to determine the camera with the best FOV to implement the desired framing.
In some embodiments, a first camera may be used with a FOV that is centered in the conference room but does not span the entire conference room. For example, the first camera may have a focused center that includes a table and the few persons typically sitting at the table and participating in a videoconference. In some scenarios, this first camera may serve as the default camera and may have a high resolution. A second camera, such as a wide-angle camera, may also be used in a conference room with a FOV that captures the entire conference room (and thus the entire FOV of the first camera). This wide-angle camera may be used for scenarios where the focus is all of the persons in the room, several talkers who are far apart, and so forth. In some scenarios, this wide-angle camera is not considered the default camera and thus may have a lower resolution than the first camera. In other embodiments, there may be three or more cameras, each with a different FOV of the conference room. Some of these cameras may be part of the same structure or individually positioned within the conference room.
When switching from one state to another state, frames with corresponding regions of interest to be sent to the far site may be acquired from the same camera. In other scenarios, one frame with a first region of interest to be sent to the far site may be acquired by a first camera, while the next frame with a second region of interest may be acquired from a second camera. The transitions between these frames and regions of interest (which are due to changed states during the videoconference) may be unpleasant, particularly when a change in a camera source is involved.
To prevent these unpleasant experiences and to make framing decisions more automatic, an analysis is performed to determine if the different frames involved, and thus the different regions of interest to be used, require a change in a camera source. If no change in camera source is involved, a first frame having a first region of interest may be replaced with the second frame having a second region of interest using an ease transition, as both frames are acquired by the same camera. By contrast, if a change in camera source is involved, a dissolve transition is used instead. The dissolve transition may phase out the presentation of a first frame having the first region of interest, while phasing in the presentation of a second frame having the second region of interest. In some scenarios, this may involve phasing from the presentation of the first frame having the first region of interest, to a white or blank screen or frame, and then from the white or blank screen or frame to the second frame having the second region of interest. These two transition effects, which are dependent upon the change in states, help improve the experience for the viewer. In addition, the framing decisions become more automatic while providing a pleasant experience for the far site.
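The selection between the two transition effects described above can be summarized with a small sketch; the camera identifiers and the function name are assumptions for illustration only.

```python
def select_transition(first_camera_id: str, second_camera_id: str) -> str:
    """Choose the transition effect based on whether the camera source changes."""
    # Same camera source: the new region of interest simply replaces the old one.
    if first_camera_id == second_camera_id:
        return "ease"
    # Different camera source: fade out the old frame and fade in the new one.
    return "dissolve"


if __name__ == "__main__":
    print(select_transition("telephoto", "telephoto"))   # -> ease
    print(select_transition("telephoto", "wide_angle"))  # -> dissolve
```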
During a videoconference, one or more cameras (e.g., camera 118 and camera 120) capture video and provide the captured video to the video module 110 and codec 112 for processing. Some embodiments may include two cameras, while other embodiments include three or more cameras. In one example, one camera (e.g., 118) is a smart camera and one camera (e.g., 120) is not a smart camera. In some examples, two or more cameras (e.g., camera 118 and camera 120) are cascaded such that one camera controls some or all operations of the other camera. In some examples, two or more cameras (e.g., camera 118 and camera 120) are cascaded such that data captured by one camera is used (e.g., by the control module 114) to control some or all operations of the other camera.
In another example, a setup for a conference room may include three or more cameras with different FOVs or partially overlapping FOVs that are positioned within the conference room. These cameras may be included individually or in sets near or around a display screen, on a table, within various corners of the room, and so forth. A location and FOV provided by each camera may be evaluated to determine the camera that could provide the most appropriate face forward view of participants and talkers during a videoconference. In a scenario where a videoconference participant is moving, a first camera with a first FOV may be most appropriate to provide a view of the face of the participant at a first time, while a second camera with a second FOV may be most appropriate to provide a view of the face of the participant at a second time, after the participant has moved.
In yet another example, the endpoint 100 may include only a single camera, such as camera 118, and that camera is a wide angle electronic-pan-tilt-zoom camera. In some examples, when a view subject is zoomed in upon, a sub-portion of the captured image containing the subject is rendered, whereas other portions of the image are not. Additionally, one or more microphones 122 capture audio and provide the audio to the audio module 106 and codec 108 for processing. These microphones 122 can be table or ceiling microphones, or they can be part of a microphone pod or the like. In one or more examples, the microphones 122 are tightly coupled with one or more cameras (e.g., camera 118 and camera 120). The endpoint 100 uses the audio captured with these microphones 122 primarily for the conference audio.
After capturing audio and video, the endpoint 100 encodes the audio and video in accordance with an encoding standard, such as MPEG-4, H.263, H.264 and H.265. Then, the network module 116 outputs the encoded audio and video streams to the remote endpoints 102 via the network 104 using an appropriate protocol. Similarly, the network module 116 receives conference audio and video through the network 104 from the remote endpoints 102 and transmits the received audio and video to their respective codecs 108/112 for processing. The endpoint 100 also includes a loudspeaker 130 which outputs conference audio, and a display 132 that outputs conference video.
In at least one example of this disclosure, the endpoint 100 uses the two or more cameras 118, 120 in an automated and coordinated manner to handle video and views of the videoconference environment dynamically. In some examples, the first camera (e.g. 118) is a fixed electronic pan-tilt-zoom (EPTZ) wide-angle camera, and the second camera 120 is a fixed EPTZ telephoto camera. In other examples, the first camera 118 or the second camera 120 may be manual or EPTZ cameras that are not fixed. In even further examples, the field of view of the telephoto camera 120 is approximately centered on the field of view of the wide-angle camera 118. This centered configuration allows higher resolution images for the central area of the conference room, where the endpoint 100 is generally directed and the participants usually sit. Using the wide-angle camera (e.g. 118), the endpoint 100 captures video of the room or at least a wide or zoomed-out view of the room that would typically include all the videoconference participants 121 as well as some of their surroundings.
According to some examples, the endpoint 100 uses the telephoto camera (e.g., 120) to capture video of one or more participants, including one or more current talkers, in a tight or zoomed-in view.
In some examples, the endpoint 100 alternates between tight views of a talker and wide views of a room. In some examples, the endpoint 100 alternates between two different tight views of the same or different talkers. In some examples, the endpoint 100 will capture a first view of a person with one camera and a second view of the same person with another camera and determine which view is better for sharing with a remote endpoint 102.
In at least one example of this disclosure, the endpoint 100 outputs video from only one of the two cameras 118, 120 at any given time. As the videoconference proceeds, the output video from the endpoint 100 can switch from the view of one camera to another. In accordance with some examples, the endpoint 100 outputs a room-view when there is no participant talking and a people-view when one or more participants 121 are talking.
In one or more examples, the endpoint 100 uses an audio-based locator 134 and a video-based locator 136 to determine locations of participants 121 and frame views of the environment and participants 121. A framing module 142 in the control module 114 uses audio and/or video information from these locators 134, 136 to perform framing operations, such as cropping one or more captured views, such that one or more subsections of a captured view are displayed on a display 132 and/or transmitted to a far site or remote endpoint 102.
In some examples, transitions between the two views from the cameras 118, 120 can be faded and blended to avoid sharp cutaways when switching between camera views. Other types of video transitions, such as dissolves, cuts, wipes, slides, pushes, splits, and the like, can be used to switch between camera views. The specific transitions that are used may be varied as well. In some examples, a switch from a first view to a second view for transmission to a remote endpoint 102 will not occur until an active participant 121 has been present in the second view for a minimum amount of time. In at least one example of this disclosure, the minimum amount of time is one second. In at least one example, the minimum amount of time is two seconds. In at least one example, the minimum amount of time is three seconds. In at least one example, the minimum amount of time is four seconds. In at least one example, the minimum amount of time is five seconds. In other examples, other minima (e.g., 0.5-7.0 seconds) are used, depending on such factors as the size of a conference room, the number of participants 121 at an endpoint 100, the cultural niceties of the participants 140 at the remote endpoint 102, and the sizes of one or more displays 132 displaying captured views.
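A minimal sketch of such a minimum-hold rule is shown below, assuming a simple timer keyed off a monotonic clock; the class name and the default hold time are illustrative assumptions, chosen from within the 0.5-7.0 second range mentioned above.

```python
import time
from typing import Optional


class ViewSwitchGate:
    """Delay a view switch until the candidate view has been stable for a
    minimum amount of time (e.g., 0.5-7.0 seconds, per the examples above)."""

    def __init__(self, min_hold_seconds: float = 3.0):
        self.min_hold_seconds = min_hold_seconds
        self._candidate: Optional[str] = None
        self._candidate_since: Optional[float] = None

    def update(self, current_view: str, candidate_view: str,
               now: Optional[float] = None) -> str:
        """Return the view to transmit; switch only after the hold time elapses."""
        now = time.monotonic() if now is None else now
        if candidate_view == current_view:
            # No pending switch; clear any partially elapsed timer.
            self._candidate = None
            return current_view
        if candidate_view != self._candidate:
            # New candidate view: start the hold timer and keep the current view.
            self._candidate = candidate_view
            self._candidate_since = now
            return current_view
        if now - self._candidate_since >= self.min_hold_seconds:
            # Candidate has persisted long enough; commit the switch.
            self._candidate = None
            return candidate_view
        return current_view
```

The caller would update its notion of the current view with whatever this method returns on each pass through the framing loop.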
In examples where only a single camera 118 is present and that camera is a wide-angle, high-definition EPTZ camera, the above-discussed framing options of room or participant views and talker views are developed from the single camera. In such examples, transitions are preferably performed as described in U.S. Pat. No. 10,778,941, which is hereby incorporated by reference. All of these decisions on the particular views to be provided are made by the framing module 142.
The processor 206 can include digital signal processors (DSPs), central processing units (CPUs), graphics processing units (GPUs), dedicated hardware elements, such as neural network accelerators and hardware codecs, and the like in any desired combination.
The memory 210 can be any conventional memory or combination of types of conventional memory, such as SDRAM and flash memory, and can store modules 216 in the form of software and firmware, or generic programs, for controlling the endpoint 200. In addition to software and firmware portions of the audio and video codecs 108, 112, the audio and video based locators 134, 136, framing module 142 and other modules discussed previously, the modules 216 can include operating systems, a graphical user interface (GUI) that enables users to control the endpoint 200 such as by selecting to mute the endpoint 200, and algorithms for processing audio/video signals and controlling the cameras 202. SDRAM can be used for storing video images of video streams and audio samples of audio streams and can be used for scratchpad operation of the processor 206. In at least one example of this disclosure, one or more of the cameras 202 can be a panoramic camera.
The network interface 208 enables communications between the endpoint 200 and remote endpoints (102). In one or more examples, the general interface 212 provides data transmission with local devices such as a keyboard, mouse, printer, overhead projector, display, external loudspeakers, additional cameras, and microphone pods.
The cameras 202 and the microphones 204 capture video and audio, respectively, in the videoconference environment and produce video and audio streams or signals transmitted through the bus 214 to the processor 206. In at least one example of this disclosure, the processor 206 processes the video and audio using algorithms in the modules 216. For example, the endpoint 200 processes the audio captured by the microphones 204 as well as the video captured by the cameras 202 to determine the location of participants 121 and to control and select from the views of the cameras 202. Processed audio and video streams may be sent to remote devices coupled to the network interface 208 and to devices coupled to the general interface 212. This is just one example of the configuration of an endpoint, and other configurations are well known.
In step 306, the audio streams are used in combination with the video streams to find talkers. Examples of talker localization are described in U.S. Pat. Nos. 9,030,520; 9,542,603; 9,723,260; 10,091,412; and 10,122,972, which are hereby incorporated by reference. An audio-visual frame may refer to one or more blocks of data that include computer vision information and audio processing information generated at, or corresponding to, a specific moment in time. A talker is a person who becomes a target or a subject of interest being tracked using an audio-visual map.
After the talkers are found in step 306, the parties are framed as desired in step 308. Examples of framing decisions are described in U.S. Pat. Nos. 9,800,835; 10,187,579; and 10,778,941, which are hereby incorporated by reference. Further improvements in framing decisions are discussed below.
Of note, the conference room 400 may use more than two cameras. In that case, the FOV of each corresponding camera may be exclusive, partially overlapping, or fully overlapping with other FOVs of other cameras. For example, conference room 400 may include three cameras each with a 60 degree FOV that together cover a 150 degree FOV of the conference room. In yet another example, the various cameras may include a FOV for a front of the conference room, a back of the conference room, a side of the conference room, and so forth.
When in empty room state 502, a transition can occur to group framing state 504 or can remain in empty room state 502. In group framing state 504, transitions can occur to empty room state 502, any talker state 506, or remain in group framing state 504. In any talker state 506, transitions can occur to conversation mode state 508, group framing state 504 or unambiguous talker state 510. In conversation mode state 508, transitions can occur to unambiguous talker state 510, group framing state 504 or remain in conversation mode state 508. In unambiguous talker state 510, transitions can occur to conversation mode state 508, group framing state 504 or remain in unambiguous talker state 510.
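The permitted transitions described above can be captured in a small lookup table, sketched here for illustration; the enum and dictionary names are assumptions and are not part of the described system.

```python
from enum import Enum, auto


class FramingState(Enum):
    EMPTY_ROOM = auto()          # state 502
    GROUP_FRAMING = auto()       # state 504
    ANY_TALKER = auto()          # state 506
    CONVERSATION_MODE = auto()   # state 508
    UNAMBIGUOUS_TALKER = auto()  # state 510


# Allowed transitions as described above; a transition not listed here is rejected.
ALLOWED_TRANSITIONS = {
    FramingState.EMPTY_ROOM: {
        FramingState.EMPTY_ROOM, FramingState.GROUP_FRAMING},
    FramingState.GROUP_FRAMING: {
        FramingState.EMPTY_ROOM, FramingState.ANY_TALKER, FramingState.GROUP_FRAMING},
    FramingState.ANY_TALKER: {
        FramingState.CONVERSATION_MODE, FramingState.GROUP_FRAMING,
        FramingState.UNAMBIGUOUS_TALKER},
    FramingState.CONVERSATION_MODE: {
        FramingState.UNAMBIGUOUS_TALKER, FramingState.GROUP_FRAMING,
        FramingState.CONVERSATION_MODE},
    FramingState.UNAMBIGUOUS_TALKER: {
        FramingState.CONVERSATION_MODE, FramingState.GROUP_FRAMING,
        FramingState.UNAMBIGUOUS_TALKER},
}


def can_transition(current: FramingState, target: FramingState) -> bool:
    """Check whether the framing state machine permits a transition."""
    return target in ALLOWED_TRANSITIONS[current]
```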
At step 516, a second frame or set of frames can be acquired from the first camera 118 or 120 or the second camera 120 or 118. In one embodiment, the first frame(s) acquired at 514 and the second frame(s) acquired at 516 may both be acquired by the first camera 118 or 120. In another embodiment, the first frame(s) acquired at 514 and the second frame(s) acquired at 516 may both be acquired by the second camera 120 or 118. In yet another embodiment, the first frame(s) acquired at 514 may be acquired by the first camera 118 or 120, while the second frame(s) acquired at 516 may be acquired by the second camera 120 or 118, or vice versa. A second region of interest or view may also be identified from the second frame(s). At step 518, a second state associated with the second frame(s) is determined. This second state may be any of the empty room state 502, the group framing state 504, the any talker state 506, the conversation mode state 508, or the unambiguous talker state 510 described above.
At step 520, a determination of change data associated with the cameras is made. This determination is based on a comparison of the first state determined at 515 and the second state determined at 518.
At step 522, a decision is made whether the change data determined at 520 is indicative of a change in camera source. If the change data is not indicative of a change in camera source, the process continues to step 524. At step 524, output data comprising an ease transition is determined. The ease transition indicates that the first frame is to be replaced with the second frame, as both frames are acquired by the same camera source. In one embodiment, the ease transition is performed by ending the first frame and beginning the next frame with no overlap and no gap between the frames. After the ease transition is performed, the designated camera (which is selected as part of the change data) continues to send frames until the next state change. Upon detecting another state change, the transition process 512 is repeated to determine whether a change in camera source is involved.
If at 522 a decision is made that the change data indicates a change in camera source, the process continues to step 526. At step 526, output data is determined which comprises a dissolve transition. A dissolve transition may comprise fading out the first frame to a white screen or a black screen and then fading in from the white or black screen to the next frame. This type of transition improves the overall viewing experience compared to performing an ease operation, given that the change in camera source would otherwise prevent the change between frames from appearing smooth or pleasant. After the dissolve transition is performed, the designated camera (which is selected as part of the change data) continues to send frames until the next state change. Upon detecting a state change, the transition process is repeated to determine whether a change in camera source is involved.
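A dissolve of this kind might be sketched as follows, assuming both frames have already been scaled to a common output resolution. The NumPy-based blending, the fade to black (a fade to white would be analogous), and the number of steps are illustrative assumptions rather than the codec-level implementation.

```python
import numpy as np


def dissolve_frames(first_frame: np.ndarray,
                    second_frame: np.ndarray,
                    steps: int = 15):
    """Yield intermediate frames that fade the first frame out to black and
    then fade the second frame in from black.

    Both frames are assumed to share the same shape and dtype (e.g., scaled
    to the output resolution before the transition is applied).
    """
    black = np.zeros_like(first_frame)
    # Fade out: blend the first frame toward the black frame.
    for i in range(1, steps + 1):
        alpha = i / steps
        yield ((1.0 - alpha) * first_frame + alpha * black).astype(first_frame.dtype)
    # Fade in: blend from the black frame toward the second frame.
    for i in range(1, steps + 1):
        alpha = i / steps
        yield ((1.0 - alpha) * black + alpha * second_frame).astype(second_frame.dtype)
```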
After performing either of 524 or 526, the process continues to step 528. At 528, the output data determined at 524 or 526 is sent to the far site. The frames may then be presented at the far site using the designated type of transition. Doing so improves the transition effect that is applied between frames, based on the state changes detected and any change of camera source needed. As a result, the user experience is substantially improved. In addition, transition process 512 makes the framing decisions more automatic.
In some embodiments, the first camera (which could represent telephoto camera 120) may be designated as a preferred camera, given its higher resolution and its focus on the center of the room where a conference table 402 and most participants may be located. Thus, if the desired framing or views associated with the two states can be accomplished using only the first camera 120 or only the second camera 118, the framing would be implemented using that single camera and an ease transition, with the first camera 120 preferred whenever possible. However, if the desired framing or views associated with the first state and the second state involve a change in camera source, then the transition would be a dissolve transition using frames from the two camera sources, as needed. By employing the dissolve transition during a camera change, the user experience between the two frames or views is greatly improved.
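The camera preference described here can be sketched as a containment check in a shared, normalized room coordinate system; the coordinate convention, identifiers, and function name are assumptions for illustration.

```python
from typing import Tuple

Region = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized


def choose_camera(roi: Region, preferred_fov: Region, wide_fov: Region) -> str:
    """Pick the preferred (e.g., telephoto) camera whenever the region of
    interest fits inside its field of view; otherwise fall back to the
    wide-angle camera."""
    def contains(outer: Region, inner: Region) -> bool:
        return (outer[0] <= inner[0] and outer[1] <= inner[1] and
                outer[2] >= inner[2] and outer[3] >= inner[3])

    if contains(preferred_fov, roi):
        return "first_camera"
    if contains(wide_fov, roi):
        return "second_camera"
    # Region of interest exceeds both FOVs; use the widest available view.
    return "second_camera"


if __name__ == "__main__":
    telephoto_fov = (0.25, 0.0, 0.75, 1.0)   # centered, narrower FOV
    wide_fov = (0.0, 0.0, 1.0, 1.0)          # covers the whole room
    print(choose_camera((0.30, 0.4, 0.6, 0.9), telephoto_fov, wide_fov))  # first_camera
    print(choose_camera((0.05, 0.4, 0.9, 0.9), telephoto_fov, wide_fov))  # second_camera
```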
Of note, the transition process 512 may be repeated continuously during a videoconference. As frames continue to be acquired by the different cameras, state changes may be detected and the transition process 512 may be used to determine if and when changes in camera sources are involved and the type of transition to be applied for such frames or views. In even further embodiments, the transition process 512 may be expanded to include the acquisition of frames from three or more cameras (with differing FOVs covering the same conference room) and the analysis of such frames to determine state changes, change data, and types of transitions to be applied to such frames or views.
In some examples, when two talkers are widely separated, split screen operation 1020 is used, where each talker is individually framed and the two frames are combined for transmission to the far site.
In some embodiments, split screen may also be performed for scenarios with three or more talkers. With three or more talkers, three or more frames may be developed in step 1028 for split screen operation. Alternatively, the talkers may be grouped into one or more groups and the groups may be compared to determine if the groups are widely separated at 1026. For example, if there are three talkers and two of the three talkers are close to each other, the two talkers that are close to each other are grouped and that group is compared to the third talker to check if the two are widely separated.
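A simple way to form such groups is to cluster talkers by horizontal separation, as in the sketch below; the threshold value and function name are assumptions for illustration only.

```python
from typing import List


def group_talkers(talker_positions_x: List[float],
                  separation_threshold: float = 0.4) -> List[List[float]]:
    """Group talkers whose normalized horizontal positions are close together.

    If the resulting groups are widely separated, a split-screen layout with
    one frame per group can be used; otherwise a single group frame suffices.
    """
    groups: List[List[float]] = []
    for pos in sorted(talker_positions_x):
        if groups and pos - groups[-1][-1] <= separation_threshold:
            groups[-1].append(pos)   # close to the previous talker: same group
        else:
            groups.append([pos])     # far from the previous talker: new group
    return groups


if __name__ == "__main__":
    # Two talkers near the left of the room and one on the far right: the first
    # two are grouped together, producing two frames for split-screen operation.
    print(group_talkers([0.15, 0.25, 0.85]))  # -> [[0.15, 0.25], [0.85]]
```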
If participant 406 were to commence talking for greater than three seconds, condition 1002 is met and the frame 1102 would change to include both participants 404 and 406. If participant 408 then proceeds to start talking, the frame 1102 returns to the framing shown previously.
Therefore, the use of the described framing states, framing conditions, and framing transitions provides pleasant transitions between the various situations, from empty rooms, to nonspeaking participants, to single talkers, through multiple talkers, to a presenter mode, back to multiple talkers in a conversation mode, and so on. When the cameras are a wide-angle camera and a telephoto camera having approximately the same centers for their fields of view, transitions are performed using easing or dissolving, based on the camera of the starting frame and the camera of the ending frame. The conversation states and conditions and the framing decisions provide a fully automated framing mechanism that produces pleasant framing of the individuals at the near site under any of the conditions relating to the number of talkers, participants, and the like. The far site now sees the appropriate individuals: a single talker in focus if only one person is talking, or multiple talkers framed together if several individuals are talking. If no individuals are talking, or the far site is doing the talking, the natural default of framing the group is used. These framing decisions are performed automatically, without requiring input from any participant or administrator, to provide a pleasant experience for the far site.
While the description has focused on use of a wide-angle camera and a telephoto camera, any two cameras with differing fields of view can be used and transitions would occur under the same principles.
The various examples described are provided by way of illustration and should not be construed to limit the scope of the disclosure. Various modifications and changes can be made to the principles and examples described herein without departing from the scope of the disclosure and without departing from the claims which follow.
This application claims priority to U.S. Provisional Patent Application No. 63/202,527, filed on Jun. 15, 2021, entitled “Telephoto and Wide-Angle Automatic Framing”, which is hereby incorporated by reference in its entirety.