Speaker anticipation

Information

  • Patent Grant
  • 11019308
  • Patent Number
    11,019,308
  • Date Filed
    Friday, November 8, 2019
  • Date Issued
    Tuesday, May 25, 2021
Abstract
Systems and methods are disclosed for anticipating a video switch to accommodate a new speaker in a video conference. A real time video stream captured by a camera local to a first videoconference endpoint is analyzed according to at least one speaker anticipation model. The speaker anticipation model predicts that a new speaker is about to speak. Video of the anticipated new speaker is sent to the conferencing server in response to a request for the video of the anticipated new speaker from the conferencing server. Video of the anticipated new speaker is distributed to at least a second videoconference endpoint.
Description
TECHNICAL FIELD

The present disclosure pertains to a videoconferencing system, and more specifically to anticipating a video switch to accommodate a new speaker.


BACKGROUND

Multi-endpoint videoconferencing allows participants from multiple locations to collaborate in a meeting. For example, participants from multiple geographic locations can join a meeting and communicate with each other to discuss issues, share ideas, etc. These collaborative meetings often include a videoconference system with two-way audio-video transmissions. Thus, virtual meetings using a videoconference system can simulate in-person interactions between people.


However, videoconferencing consumes a large amount of both computational and bandwidth resources. In order to conserve those resources, many videoconferencing systems devote resources depending on how much the videoconference needs to use each video source. For example, the videoconference system will expend more resources for a participant who is actively speaking than a participant who is listening or not directly engaged in the conversation, oftentimes by using low resolution video for the non-speaking participant and high resolution video for the actively speaking participant. When the participant who is speaking changes, the videoconferencing server will switch from the first speaker to the current speaker's video source, and/or will increase the prominence of the new speaker in the videoconference display.


However, current methods of speaker detection and video switching are slow and depend on detecting a participant who is already speaking. For example, attention delay due to the time for processing the active speakers, confusion in audio sources (e.g., mistakenly identifying a closing door or voices from another room as a speaking participant), and/or not picking up on other cues (e.g., the speaker pauses to draw on a whiteboard) are common problems. Thus, there is a need to improve the accuracy and speed of in-room speaker detection and switching.





BRIEF DESCRIPTION OF THE DRAWINGS

The above-recited and other advantages and features of the present technology will become apparent by reference to specific implementations illustrated in the appended drawings. A person of ordinary skill in the art will understand that these drawings only show some examples of the present technology and would not limit the scope of the present technology to these examples. Furthermore, the skilled artisan will appreciate the principles of the present technology as described and explained with additional specificity and detail through the use of the accompanying drawings in which:



FIG. 1 shows an example block diagram illustrating an example environment for a videoconference system providing speaker anticipation capabilities, in accordance with various embodiments of the subject technology;



FIG. 2 is a flowchart illustrating an exemplary method for anticipating a video switch to accommodate a new speaker in a videoconference;



FIG. 3 is a flowchart illustrating an exemplary method for accommodating a new speaker in a videoconference;



FIG. 4 is an illustration of a videoconference endpoint, conferencing service, and remote videoconference endpoint(s) used together in a multi-endpoint videoconference meeting interaction, in accordance with various embodiments;



FIG. 5A shows an example training model in accordance with various embodiments;



FIG. 5B shows an example conference room for determining a semantics model in accordance with various embodiments;



FIG. 6 shows an example diagram of a universal speaker anticipation model;



FIG. 7 shows an example diagram of a personalized speaker anticipation model; and



FIG. 8 shows an example of a system for implementing certain aspects of the present technology.





OVERVIEW

Various examples of the present technology are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without departing from the spirit and scope of the present technology.


In some embodiments, the disclosed technology addresses the need in the art for improving the accuracy and speed of in-room speaker detection and switching. Making selections of newly active speakers as early as possible is advantageous for a number of reasons, since computational and bandwidth resources are conserved when each video source contributes to the videoconference proportional to its use, such that less prominent sources are sent in a form suitable for use at small scale, and only the most prominent sources are sent at high bandwidth.


Thus, in some embodiments, once it is determined that one of multiple contributing endpoints is about to begin speaking, the bandwidth can be uprated on that contribution link to an operating point suitable for prominent display, so that the transition in operating point is not visible and does not delay the switch to prominence. Additionally and/or alternatively, the determination of a new speaker affects the content that is sent in the videoconference stream. For example, the cameras may begin to find, focus, and/or center on the predicted new speaker. The cameras may also frame one or multiple new speakers based on the prediction. For instance, if two conference participants sitting next to each other are preparing to speak at similar times (e.g., take frequent turns speaking or simultaneously speak), cameras in the conference room can be controlled to focus on both of the speakers and frame them in the same frame.


In some embodiments, the present technology is a videoconference system for anticipating a video switch to accommodate a new speaker in a videoconference. Anticipation is based on a model, which can have multiple inputs. Since there may be no single indicator to predict the next speaker, a multimodal architecture is more likely to produce stronger predictions that reduce the delay in selecting new speakers, both in the room and among discrete contributing participants in separate rooms. Video can be used in conjunction with audio and other in-room metrics (including audio/visual data collected by an application on a participant's mobile phone) to anticipate a participant's speech. This anticipation optimizes both the switching between speakers within the room and the transmitted bandwidths of participants contributing to a videoconference meeting.


The videoconference system includes multiple endpoints across multiple geographic locations, with a videoconference server configured to host a multi-endpoint meeting amongst the multiple endpoints. The videoconference includes at least one videoconference endpoint and at least one endpoint remote from the videoconference endpoint, although a meeting can be any combination of local and remote videoconference endpoints.


The videoconference system predicts a need to switch video by anticipating a new speaker through a predictive model that uses behavioral analytics, which is done by analyzing a real time video stream captured by a camera located at a videoconference endpoint. The real time video stream is analyzed according to one or more speaker anticipation models that predict whether a participant is preparing to speak.


A videoconference server receives, from the videoconference endpoint participating in the meeting or videoconference, a prediction of a new speaker. The server receives new speaker predictions from all or a portion of the endpoints participating in the video conference, both local and remote, such that it receives one or more predictions of new speakers at each of those endpoints.


Based on the received prediction, the videoconference server determines an allocation of media bandwidth that will be distributed to the participating endpoints, including both local and remote endpoint(s). Default allocations may be low resolution/low bandwidth video or audio unless a participant is speaking or is determined to be likely to speak soon, in which case the bandwidth allocation may be increased. If the bandwidth allocation is increased based on a prediction that a participant is preparing to speak at the videoconference endpoint, the videoconference server will request upgraded video of that participant from the videoconference endpoint according to the allocation that has been determined or allotted.


In embodiments, the allocation of media bandwidth is determined based on the score of the prediction. For example, the allocation of bandwidth to a videoconference endpoint can be increased based on the strength of the prediction being high or above a predetermined threshold. The allocation of bandwidth may also and/or alternatively be based on comparing the score of the videoconference endpoint's prediction with other endpoints participating in the videoconference.


Once the videoconference endpoint receives the videoconference server's request for upgraded video of the anticipated speaker, the videoconference endpoint transmits video of the anticipated speaker in accordance with the request. The videoconference system then distributes the video of the anticipated new speaker to at least one other videoconference endpoint participating in the videoconference.


DETAILED DESCRIPTION


FIG. 1 shows an example block diagram illustrating an example environment for a videoconference system providing speaker anticipation capabilities, in accordance with various embodiments of the subject technology. In some embodiments the disclosed technology is deployed in the context of a conferencing service system having content item synchronization capabilities and collaboration features, among others. An example videoconference system configuration 100 is shown in FIG. 1, which depicts conferencing service 110 interacting with videoconference endpoint 112 and remote videoconference endpoint(s) 114.



FIG. 1 shows an embodiment in which conferencing service 110 is in communication with one or more videoconference endpoints (e.g., videoconference endpoint 112 and remote videoconference endpoint 114). Videoconference endpoints are any devices that are in communication with conferencing service 110, such as mobile phones, laptops, desktops, tablets, conferencing devices installed in a conference room, etc. In some embodiments, videoconference endpoint 112 is specific to a single conference participant, and the videoconference is a web meeting. In other embodiments, videoconference endpoint 112 is part of a video conference room system that includes a number of conference participants.


Real time video is generated at videoconference endpoint 112 in response to the initiation of a videoconference meeting, so that the participants at each videoconference endpoint can view and/or hear participants at other videoconference endpoints. As used herein, “real time” or “near real time” refers to relatively short periods of time. “Real time” does not imply instantaneous, but is often measured in fractions of a second, or seconds. “Near real time” is often measured in seconds to minutes.


Videoconference endpoint 112 comprises camera 116 that captures real time video at videoconference endpoint 112, such as real time video of at least one conference participant participating in the videoconference meeting. The system uses camera 116 to capture real time video to monitor conferencing participants in the videoconference, which can then be provided as input into behavioral analytics which form a predictive model for who is likely to speak next. The media channels of camera 116 can contribute a number of factors, including gaze change, head movement, a participant inhaling (an indicator of speaking), a hand raise or other hand gesture, sitting up straight, etc. Conferencing service 110 receives the real time video stream from videoconference endpoint 112 from all or a portion of that endpoint's participants, which is then distributed to remote videoconference endpoint 114.


Video distribution service 118 determines how the real time video stream is distributed to remote videoconference endpoint 114. Video distribution service 118 can determine, for example, the bandwidth that is devoted to downloading from videoconference endpoint 112, and/or the quality of video from videoconference endpoint 112. Accordingly, if a participant is preparing to speak at videoconference endpoint 112, video distribution service 118 will send a request for upgraded or high resolution video from videoconference endpoint 112. In some embodiments, the upgraded/high resolution video of videoconference endpoint 112 comprises only a single conference participant. In other embodiments, the upgraded/high resolution video will be a single conference participant among multiple conference participants within a videoconference room system. Video distribution service 118 can request video at low resolutions for participants who are not speaking and are not likely to speak, or the system can default to low resolution video unless it is determined that a participant is speaking or is likely to speak.


Endpoint interface 120 distributes the real time video stream from videoconference endpoint 112 to remote endpoint 114 based on the determinations from video distribution service 118. Each endpoint, such as videoconference endpoint 112, interfaces with conferencing service 110 through its respective conferencing service interface 122. Conferencing service interface 122 receives conferencing service 110 requests and transmits the real time video stream for distribution to remote endpoint 114 in accordance with those requests (e.g., transmitting upgraded/higher resolution video or downgraded/low resolution/default video).


New speakers are anticipated based at least in part on analyzing real time video stream from videoconference endpoint 112. In reference to FIG. 2, which shows a flowchart illustrating an exemplary method for anticipating a video switch in order to accommodate a new speaker, videoconference endpoint 112 analyzes the real time video stream captured by camera 116 according to at least one speaker anticipation model (step 210).


Speaker anticipation service 124 determines and/or applies at least one speaker anticipation model to the real time video stream and/or data derived from the real time video stream. For example, the speaker anticipation model may be one or a combination of machine learned models that predict, based on video images and/or data, that a participant is about to speak (step 212a). Examples of neural networks are Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM).


As a complementary embodiment, the speaker anticipation model can comprise one or more semantics-based (Semantics) models (step 212b), the details of which will be discussed more fully herein. After a new speaker has been predicted, and as the system switches to the new speaker, videoconference endpoint 112 can continue to monitor the conference room using a diverse array of sensory inputs. The sensory inputs can be sensory information collected from a number of sensors within the conference room, such as one or more cameras, microphones, motion detectors, ultrasonic devices that pair to mobile devices (in order to receive sensor data collected on the mobile devices), and/or any other sensors capable of collecting sensory information relevant to building a cognitive representation of the videoconference endpoint's environment for in-room activities. Accordingly, the semantics models may be determined by receiving the sensory information from diverse sensory inputs (step 224) and then providing that sensory information to a cognitive architecture (e.g., an example cognitive architecture can be Bayesian paradigms or similar).


Based on the speaker anticipation model applied, speaker anticipation service 124 determines a prediction that a participant is about to speak (step 214). In some embodiments, the prediction can be a binary prediction. For example, the prediction may only have one of two values (e.g., 1=participant is likely to speak; 0=participant will not speak).


In other embodiments, the prediction may be more detailed. For example, the speaker anticipation model can determine if the prediction is statistically significant (step 216), and may only transmit the prediction to conferencing service 110 if the prediction is sufficiently significant. For example, speaker anticipation service 124 may have a cutoff threshold, wherein the prediction fails to be transmitted if its prediction score is less than a predetermined percentage or value. Additionally and/or alternatively, if the prediction is not statistically significant, speaker anticipation service 124 may end there and/or continue analyzing the real time video stream at videoconference endpoint 112 until there is a significant prediction (or the videoconference ends). In other embodiments, speaker anticipation service 124 may transmit the prediction to conferencing service 110 regardless of the prediction's score.
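
The endpoint-side gating described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Prediction` record, the `should_transmit` helper, and the 0.75 cutoff are hypothetical names and values chosen only for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record produced by a speaker anticipation model."""
    participant_id: str
    score: float  # estimated likelihood in [0, 1] that the participant is about to speak


def should_transmit(prediction: Prediction, cutoff: Optional[float] = 0.75) -> bool:
    """Decide whether a prediction is sent to the conferencing service.

    A prediction whose score is below the cutoff is not transmitted.
    Passing cutoff=None models the embodiment in which every prediction
    is transmitted regardless of its score.
    """
    if cutoff is None:
        return True
    return prediction.score >= cutoff
```

When the function returns False, the endpoint would simply keep analyzing the real time video stream until a later prediction passes the check or the videoconference ends.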


In reference to FIG. 3, which shows a flowchart illustrating an exemplary method for accommodating a new speaker in a videoconference, conferencing service 110 receives the prediction from videoconference endpoint 112 (step 310). Conferencing service 110 determines a prediction score (step 320), either through receiving the prediction score from videoconference endpoint 112, by comparison to predictions from remote videoconference endpoint 114, or both.


For example, conferencing service 110 can determine whether the prediction score exceeds a cutoff threshold—e.g., the prediction must be at least 75% likely that the participant will begin speaking, prediction error must be below 1%, noise in the prediction must be below 0.25%, the confidence interval must be above 1.96, etc.


Additionally and/or alternatively, conferencing service 110 can determine prediction scores based on comparisons to predictions from at least one remote videoconference endpoint 114 by ranking the predictions against each other. The highest ranked prediction, for example, can be designated as the new anticipated speaker. As an example, for videoconference displays that enable more than one speaker to be displayed prominently, the top three predictions can be designated as new speakers. In some embodiments, the top ranked predictions may also need to meet a certain cutoff threshold (e.g., only the top 3 predictions above 75% likelihood will be distributed, or are even available to be ranked).
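
The ranking with a cutoff can be sketched as below. The function name `select_anticipated_speakers`, the endpoint identifiers, and the default values (three speakers, 0.75 likelihood) are assumptions made for illustration; the disclosure gives them only as examples.

```python
from typing import Dict, List


def select_anticipated_speakers(
    scores: Dict[str, float],
    max_speakers: int = 3,
    min_score: float = 0.75,
) -> List[str]:
    """Rank endpoint predictions against each other and keep the top ones.

    Only predictions at or above min_score are eligible to be ranked;
    at most max_speakers endpoint ids are returned, highest score first.
    """
    eligible = [(endpoint, score) for endpoint, score in scores.items() if score >= min_score]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [endpoint for endpoint, _ in eligible[:max_speakers]]
```

Setting `max_speakers=1` corresponds to designating only the highest ranked prediction as the new anticipated speaker.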


If the prediction score fails to meet the predetermined threshold and/or ranking, then the video is distributed among all videoconference endpoints at its default bandwidth or is downgraded to a lower resolution (e.g., videoconference endpoint 112 transmits a real time video stream at its default quality or at a downgraded/lower resolution video) (step 322). However, if the prediction score meets or exceeds the predetermined threshold and/or ranking, conferencing service 110 then modifies or determines a new allocation of the media bandwidth for the videoconference endpoints (step 324). The allocation of the media bandwidth for videoconference endpoint 112, in particular, is increased based on the prediction and/or prediction score associated with videoconference endpoint 112.
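
A minimal sketch of the branch between steps 322 and 324 follows. The bit rates, the threshold, and the helper names are hypothetical; the disclosure does not specify concrete values or an allocation algorithm.

```python
from typing import Dict, List

DEFAULT_KBPS = 300    # assumed default (low resolution) rate, illustrative only
UPGRADED_KBPS = 2500  # assumed rate for prominent display, illustrative only


def allocate_media_bandwidth(scores: Dict[str, float], threshold: float = 0.75) -> Dict[str, int]:
    """Map each endpoint id to a bandwidth allocation based on its prediction score."""
    allocation = {}
    for endpoint_id, score in scores.items():
        if score >= threshold:
            allocation[endpoint_id] = UPGRADED_KBPS  # increased allocation (step 324)
        else:
            allocation[endpoint_id] = DEFAULT_KBPS   # default allocation (step 322)
    return allocation


def endpoints_to_upgrade(allocation: Dict[str, int]) -> List[str]:
    """Endpoints from which upgraded video would be requested."""
    return sorted(ep for ep, kbps in allocation.items() if kbps > DEFAULT_KBPS)
```

For example, `allocate_media_bandwidth({"endpoint-112": 0.9, "endpoint-114": 0.2})` assigns the upgraded rate only to `endpoint-112`, which is then the only endpoint returned by `endpoints_to_upgrade`.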


Conferencing service 110 requests the upgraded video of the anticipated speaker from videoconference endpoint 112 according to the allocation determined (step 326). Referring back to FIGS. 1 and 2, videoconference endpoint 112 receives the request from conferencing service 110 (step 218), such as at conferencing service interface 122. Based on the request or accompanying instructions of the request, conferencing service interface 122 communicates with camera 116 in order to gather a real time video stream or transmit the real time video stream to conferencing service 110 in accordance with the allocation. As a result, videoconference endpoint 112 sends the upgraded video of the anticipated speaker to conferencing service 110 (step 220). The real time, upgraded video is distributed to remote endpoint 114 in accordance with the determined allocation (steps 222, 328).


TRAINING THE MODEL


FIG. 4 shows a detailed illustration of videoconference endpoint 112, conferencing service 110, and remote videoconference endpoint(s) 114 that are used together in a multi-endpoint videoconference meeting interaction, in accordance with various embodiments that train one or more speaker anticipation models. In embodiments represented by FIG. 4, videoconference system configuration 100 is enabled to determine a predictive, speaker anticipation model from a guided learning dataset from historical video feeds.


In some embodiments, videoconference endpoint 112 is specific to a single conference participant in a web meeting. Accordingly, the speaker anticipation model is trained by a guided learning dataset from historical video feeds including a single conference participant. In other embodiments, however, videoconference endpoint 112 is a video conference room system that includes multiple conference participants. The guided learning dataset is then historical video feeds that comprise all or a portion of the entire video conference room.


The guided learning dataset is derived from a series of annotated video frames from historical video feeds. No assumptions are made in advance with regard to features predictive of a participant preparing to speak; any incoming information is potentially important. Accordingly, video frames are labeled as speaking frames when the video is accompanied by audio from the same endpoint, and are labeled as non-speaking or pre-speaking (e.g., is preparing to speak) for video frames that are not accompanied by audio. The frames and audio are collected from camera 116 at videoconference endpoint 112, which generates audio, visual, and/or multimodal data used to generate video feeds and the training models that are based on those video feeds. FIG. 5A, for example, shows just such an example training model in accordance with various embodiments.


Referring to the embodiments described by FIG. 5A, incoming audio 512 (e.g., audio signals transmitted to videoconference endpoint 112) and outgoing audio 514 (e.g., audio detected at videoconference endpoint 112 and transmitted to conferencing service 110) are measured and collected in relation to time 510. Thus, incoming audio 512 and outgoing audio 514 during a period of time are matched to video generated during the same time period.


The guided learning dataset is derived from a series of labeled video frames from the historical video feeds. Each frame of input video 518 that corresponds to a certain time is a data point that is manually annotated. For example, label 516 of input video 518 frames can refer to speaking, pre-speaking, or non-speaking frames. “Speak” or “Speaking” is a label created for speaking frames that occur when the video is accompanied by audio from the same endpoint (e.g., during time periods where outgoing audio 514 generates an audio signal concurrently, which signifies that the participant is speaking during that time period). However, for video occurring at predetermined amounts of time preceding the audio signals (say, for example, 2-3 minutes before the “Speak” frames), the video frames can be labeled “Intend to Speak” or “Pre-Speak.” The “Pre-Speak” label signifies the frames in which the participant is preparing to speak but has not yet uttered audio recognizable as speech, such as video frames that are not accompanied by audio from the same endpoint but precede the frames labeled as speaking frames. Examples include the participant clearing their throat, changing their pattern of eye movement or focus, changing posture, etc. A “Listen” or “Non-Speaking” label is selected for all other, non-speaking frames (or, alternatively, all frames default to the “Listen” or “Non-Speaking” label unless designated otherwise).
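
The labeling rule can be illustrated with a short sketch. The disclosure describes manual annotation; the automated helper below, its name `label_frames`, and its per-frame boolean audio flag are assumptions used only to make the rule concrete. The length of the pre-speaking window is left as a parameter expressed in frames.

```python
from typing import List


def label_frames(outgoing_audio_active: List[bool], pre_speak_frames: int) -> List[str]:
    """Label each video frame as "Speak", "Pre-Speak", or "Listen".

    outgoing_audio_active[i] is True when outgoing audio from the same
    endpoint accompanies frame i. Frames default to "Listen"; frames with
    audio become "Speak"; silent frames within pre_speak_frames before a
    speech onset become "Pre-Speak".
    """
    labels = ["Speak" if active else "Listen" for active in outgoing_audio_active]
    for i, active in enumerate(outgoing_audio_active):
        # A speech onset is a speaking frame that is first or follows a silent frame.
        is_onset = active and (i == 0 or not outgoing_audio_active[i - 1])
        if not is_onset:
            continue
        for j in range(max(0, i - pre_speak_frames), i):
            if labels[j] == "Listen":
                labels[j] = "Pre-Speak"
    return labels
```

Frames already labeled "Speak" are never overwritten, so a short pause between two utterances is relabeled "Pre-Speak" only for its silent frames.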


Referring to FIG. 6, which shows an example diagram of training and developing a universal speaker anticipation model, historical video feeds are labeled (610) on conferencing service 110. The labeled historical feeds can be stored in historical data feeds store 410, which provides a guided data set for training to speaker anticipation modeling service 420.


The speaker anticipation model is derived by training a deep learning CNN architecture on the guided learning dataset, which analyzes static frames to identify visual speaker anticipation cues. Since convolutional networks pass many filters over a single image (the filter corresponding to a feature potentially predictive of a participant preparing to speak), each time a match is found between the filter and portions of the image, the match is mapped onto a feature space particular to that visual element.
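
To make the filter-matching idea concrete, the following sketch slides one small filter over a single-channel image and records the response at every position. It is a plain-Python illustration of the general convolutional operation (implemented, as in most CNN libraries, as cross-correlation), not CNN Model 520 itself; the function name and the toy edge filter are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def convolve2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide kernel over image (stride 1, no padding) and return the feature map.

    Each output value is the sum of element-wise products between the
    kernel and the image patch under it; large values mark positions
    where the patch matches the filter.
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map
```

With a filter such as `[[-1, 1], [-1, 1]]`, the feature map is nonzero only where the image changes from dark to bright along a row, which is the sense in which a match is mapped onto a feature space for that visual element.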


Speaker anticipation modeling service 420, for example, provides the labeled historical video feeds to CNN Service 422 (612). CNN Service 422 then applies CNN Model 520 to each of the labeled frames, which extracts and transforms features that help distinguish the participant's facial expression in each frame. Once CNN Service 422 has applied CNN Model 520 to some number of frames needed to determine a model (e.g., such that the model does not suffer from small number statistics), conferencing service 110 distributes (614) the determined CNN model to videoconference endpoint 112 by providing the trained CNN model to CNN real time service 424.


In some embodiments, the speaker anticipation model is further developed or derived by providing the output from the CNN architecture as input to a Long Short-Term Memory network (LSTM). An LSTM network is well-suited to learn from experience to classify, process, and predict time series when there are time lags of unknown size and bound between important events. Thus, it takes as its input not just the current input example, but also what it perceived one or more steps back in time, since there is information in the sequence itself.


Thus, the LSTM network uses the features extracted from CNN Model 520 to analyze sequences of frames, and sequences of the visual speaker anticipation cues. The multiple view frames enable the speaker anticipation model to account for dynamics and/or changes between frames that signify a participant is preparing to speak. Thus, LSTM Model 522 accounts for temporal dynamics in videos (i.e., the order of frames or facial expressions), which is effective in detecting subtle facial changes right before the participant utters the first word. Speaker anticipation modeling service 420, for example, provides the output from CNN Service 422 to LSTM Service 426 (616). LSTM Service 426 applies LSTM Model 522 to a series of the labeled frames, either before or after CNN Model 520 has been applied. Once LSTM Service 426 has applied LSTM Model 522 to a sufficient and representative number of frames needed to determine a reasonably accurate LSTM model, conferencing service 110 distributes (618) the determined LSTM model to videoconference endpoint 112 (e.g., provides the trained LSTM model to LSTM real time service 428).
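
How an LSTM carries information across a sequence of per-frame feature vectors can be sketched with the standard LSTM cell equations. This is a generic textbook cell written in plain Python, not LSTM Model 522; the parameter layout (`params` keyed by gate name) and the function names are assumptions, and real weights would come from training.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
GateParams = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (weights, bias)


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _affine(weights: Sequence[Sequence[float]], inputs: Sequence[float], bias: Sequence[float]) -> Vector:
    """Compute weights @ inputs + bias, one output per hidden unit."""
    return [sum(w * v for w, v in zip(row, inputs)) + b for row, b in zip(weights, bias)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector, params: Dict[str, GateParams]) -> Tuple[Vector, Vector]:
    """One LSTM time step over the concatenation of frame features x and h_prev.

    params holds the input ("i"), forget ("f"), and output ("o") gates and
    the candidate cell values ("g").
    """
    z = list(x) + list(h_prev)
    gate = {}
    for name in ("i", "f", "o", "g"):
        weights, bias = params[name]
        activation = math.tanh if name == "g" else _sigmoid
        gate[name] = [activation(v) for v in _affine(weights, z, bias)]
    # The cell state mixes remembered state with the new candidate values.
    c = [f * cp + i * g for f, cp, i, g in zip(gate["f"], c_prev, gate["i"], gate["g"])]
    h = [o * math.tanh(cv) for o, cv in zip(gate["o"], c)]
    return h, c


def run_sequence(features: Sequence[Vector], params: Dict[str, GateParams], hidden_size: int) -> Vector:
    """Feed per-frame feature vectors in order and return the final hidden state."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in features:
        h, c = lstm_step(x, h, c, params)
    return h
```

Because the hidden and cell states are passed from one frame to the next, the final state depends on the order of the frames, which is what lets a trained network respond to changes between frames rather than to any single frame.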


In some embodiments, the speaker anticipation model is determined based on a combination of CNN, LSTM, and the semantic representation model. In other embodiments, only the output of LSTM is the speaker anticipation model, or, alternatively, only the output of CNN determines the speaker anticipation model. Regardless of the combination used, the derived speaker anticipation model is provided to videoconference endpoint 112 as a trained model.


Additionally, the speaker anticipation model is further complemented by a semantic representation model. The semantic representation model provides the ability to focus on specific sensory data in the presence of distractions and background noise while still staying alert to relevant and/or important information that unexpectedly appears in the background. This ability implies the simultaneous operation of a selective filter and a deliberate steering mechanism that, together, perform efficient allocation of cognitive resources. For example, the cameras can change directionality, focus, and/or who is centered based on the deliberate steering mechanism. A camera attention mechanism for a principled Artificial General Intelligence (AGI) architecture can be built on top of Cisco Spatial Predictive Analytics DL pipelines for Deep Fusion.


Turning to FIG. 5B, an example conference room for determining a semantics model in accordance with various embodiments is shown. Multiple conference participants 530 can be located within an immersive collaboration room (e.g., conference room 526) that is covered by a set of cameras 528. Cameras 528 can be multiple in number, and can be located in various positions throughout conference room 526 (e.g., on a conference table, mounted to a wall or ceiling, etc.), such that the environment of conference room 526 and/or conference participants 530 can be covered sufficiently. Conference room 526 can comprise conference assistant device 532 that controls the videoconference session, including one or more displays 534 that show remote endpoints participating in the conference.


As the system switches to one or more anticipated new speakers, the system can continue to monitor conference room 526. A cognitive architecture can be developed from diverse sensory inputs, such as inputs from one or more cameras 528, sensors included within conference assistant device 532, directional microphones 536, motion devices 538, time of flight detectors 540, ultrasonic devices 542 for pairing with conference participants' 530 mobile devices (e.g., microphone, accelerometer, and/or gyroscope data collected by the mobile device itself that could be useful in detecting aspects of conference room 526 environment), and/or any other device 544 capable of collecting sensory information relevant to building a cognitive representation of the videoconference endpoint's environment and/or in room activities.


The video conferencing system remains reactive to unexpected events while keeping the focus on the main speaker. This requires both deliberate top-down attention (i.e., information relevant to current goals receives processing) and reactive bottom-up attention (i.e., information relevant to other events that might be important for current goals receives processing). For example, at least one camera 528 is always focused on the entire room and detects movements and/or gestures of the room occupants. At least one other camera 528 can detect and track movements of at least one conference participant 530 that is of significance, and can furthermore be used to predict future actions of conference participants 530 based on previous actions. No assumptions can be made in advance with regard to the environment and system tasks; any incoming information is potentially important.


While primitive physical features and signal characteristics may give rough clues to the importance of information, this information is insufficient for focusing attention decisions. However, a cognitive AGI architecture adaptively trained on footage from several prior meeting room recordings and enactments (e.g., scripted meetings meant for training purposes) will provide sufficient information.


Once the system has built a sufficiently rich cognitive representation of its own environment for several such in-room activities, and that representation has been built into the collaboration room systems, it continues to acquire the video (together with any other available sensory input) that is being used for video conferencing, and then uses that video to train on and detect more such actions and behavior for that particular room (e.g., conference room 526) based on a cognitive architecture. An example of a cognitive architecture is a self-learning Bayesian paradigm, although similar paradigms can be used. In this way, the system deployed in the immersive room becomes more accurate over time at ignoring trivial gestures, tightly focusing on actions that are significant, and evaluating the importance of incoming information. This defers processing decisions to actual execution time, at which point resource availability is fully known and information competes for processing based on attention-steered priority evaluation.
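The idea of deferring processing decisions to execution time, when the available resources are known, can be illustrated with a small priority selection. This is only a sketch under stated assumptions: the event names, the priority values, and the notion of a fixed processing budget are hypothetical and are not taken from the disclosure.

```python
import heapq
from typing import List, Tuple


def select_for_processing(events: List[Tuple[float, str]], budget: int) -> List[str]:
    """Return the names of the highest-priority events that fit the budget.

    `events` holds (priority, name) pairs, where a higher priority means the
    attention mechanism considers the event more important. `budget` is the
    number of events that can be processed right now, which is only known at
    execution time.
    """
    top = heapq.nlargest(budget, events, key=lambda event: event[0])
    return [name for _, name in top]


# Hypothetical in-room events competing for processing.
observed = [(0.2, "fidgeting"), (0.9, "stands up"), (0.6, "raises hand")]
selected = select_for_processing(observed, budget=2)
```

With a budget of two, the low-priority fidgeting event is left unprocessed, while the two more significant actions are selected in priority order.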


Accordingly, the semantic representation model analyzes multiple signals, including any combination of video data, audio data, and conference participant in-room location data, from historical videoconferences taking place in a specific meeting room.


In some embodiments, the semantic representation model complements and/or takes as its input the speaker anticipation model. The semantic representation model, in some embodiments, can be developed by providing annotated frames (raw input video 518 frames, frames from CNN model 520, frames from LSTM model 522, or frames from a combination of CNN and LSTM models) to semantic modeling service 430. Once semantic modeling service 430 has applied the semantic representation model to the number of frames needed to determine a model with sufficient confidence, conferencing service 110 distributes the determined semantic representation model to videoconference endpoint 112 (e.g., provides the trained semantic representation model to semantic real time modeling service 432).


UNIVERSAL SPEAKER ANTICIPATION MODEL

After providing the trained speaker anticipation model to videoconference endpoint 112, the speaker anticipation model can now be used in its real time, predictive context.


In FIGS. 4 and 6, speaker anticipation service 124 receives real time video frames 434 generated by camera 116. This collects real time data, such as audio, visual, and/or multi-modal data (620). The real time video frames are then analyzed according to the derived/trained speaker anticipation model in order to predict whether the participant in the frames is about to speak.


Real time frames 434 are provided to speaker anticipation service 124, which analyzes real time frames 434 according to the provided speaker anticipation model. Frames are predictively labeled based on the speaker anticipation model (e.g., labeled by real time labeling service 436) and are transmitted to conferencing service 110. In some embodiments, the predictive labels are passed through conferencing service interface 122 to endpoint interface 120, which communicates with video distribution service 118. Allocation service 438 then uses the predictive labels to determine an allocation of media bandwidth distributed to videoconference endpoint 112 and remote videoconference endpoint 114, where the allocation of media bandwidth of the videoconference endpoint 112 is increased based on the strength of the predictive labels.


For example, in the embodiments shown in FIG. 6, a universal model is applied in its predictive capacity by applying a combination of CNN model 520 and LSTM model 522 to real time data. Real time frames 434, for example, are provided to CNN real time service 424 (622), which then applies CNN Model 520 to each real time frame. CNN Model 520 extracts and transforms features in real time that are predictive of the participant preparing to speak in the videoconference, such as features related to the participant's real time facial expression in each frame.


Once CNN Model 520 has been applied to the real time frames, the analyzed real time frames are provided (624) to LSTM real time service 428. LSTM Model 522 continues to be a machine learning algorithm that analyzes sequences of frames and sequences of the visual speaker anticipation cues, but in real time. LSTM real time service 428, for example, applies LSTM Model 522 to a series of the real time frames, which accounts for temporal dynamics in the real time videos (i.e., the order of frames or facial expressions), effective in detecting subtle facial changes right before the participant utters the first word.


Once LSTM real time service 428 has applied LSTM Model 522 to the real time frame sequences, a predicted label 524 is generated for each real time frame at real time labeling service 436 (626). Predicted label 524 can label real time frames as speaking, pre-speaking, or non-speaking frames based on the models applied. Predicted labels 524 for each frame and/or one or more predictions that a participant is about to speak are then sent to conferencing service 110 (628).
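The two-stage flow described above (per-frame analysis, then sequence analysis, then labeling) can be sketched as follows. This is a minimal illustration and not the disclosed implementation: `frame_cue_score` is a hand-written placeholder for CNN Model 520, `sequence_score` is a plain moving average standing in for LSTM Model 522, and the frame fields, weights, window size, and label thresholds are all assumptions chosen for the example.

```python
from collections import deque
from typing import Deque, Dict, List

SPEAKING, PRE_SPEAKING, NON_SPEAKING = "speaking", "pre-speaking", "non-speaking"


def frame_cue_score(frame: Dict[str, float]) -> float:
    """Placeholder for the per-frame stage.

    Assumes each frame has already been reduced to two hypothetical cues in
    [0, 1]: 'mouth_open' and 'lean_forward'.
    """
    return 0.6 * frame["mouth_open"] + 0.4 * frame["lean_forward"]


def sequence_score(window: Deque[float]) -> float:
    """Placeholder for the sequence stage: a moving average of frame scores.

    A real sequence model would also account for the order of the frames.
    """
    return sum(window) / len(window)


def label_frames(frames: List[Dict[str, float]], window_size: int = 3) -> List[str]:
    """Assign one label per frame; the 0.8 and 0.5 cutoffs are assumed values."""
    window: Deque[float] = deque(maxlen=window_size)
    labels = []
    for frame in frames:
        window.append(frame_cue_score(frame))
        score = sequence_score(window)
        if score >= 0.8:
            labels.append(SPEAKING)
        elif score >= 0.5:
            labels.append(PRE_SPEAKING)
        else:
            labels.append(NON_SPEAKING)
    return labels


idle = {"mouth_open": 0.0, "lean_forward": 0.0}
active = {"mouth_open": 1.0, "lean_forward": 1.0}
labels = label_frames([idle, idle, active, active, active])
```

Because the sequence stage looks at a window of frames rather than a single frame, one isolated cue does not immediately flip the label; the label changes only once the cue persists across several frames.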


Additionally and/or alternatively, where videoconference endpoint 112 is a video conference room system that includes a plurality of conference participants, predicted labels 524 are furthermore generated based fully or in part on the semantic representation model being applied to real time video frames or sequences. The semantic representation model can make predictions based on semantic information, such as the context of the conference room. For example, cultural cues or cues personalized to the specific conference room can be indicative of who the next speaker will be. Some examples may include detecting where an authority figure sits in the conference room (e.g., a CEO or practice group leader), and whether other participants within the room turn to face them. Other examples may include distinguishing participants who sit away from the main conference room table, such that participants at the table are treated as more likely to speak than those seated quietly in the room's periphery. Participants who are standing, begin to stand, walk towards a portion of the room reserved for speaking (e.g., podium, in front of a screen, etc.), and/or otherwise make more movements may also have an increased probability of speaking. The semantic model should also distinguish participants who are merely fidgeting or leaving, since such movement does not necessarily signify that a participant wishes to speak.


In some embodiments, the speaker anticipation model is updated continuously. The models can be updated during the training process, as the model is applied to the real time video stream, or both. For example, speaker anticipation service 124 can send the predicted label 524 and/or features from analyzed real time video frames as input to update service 440. Update service 440 updates the speaker anticipation model based on the analyzed labeled real time video frames, which can be used to update one or more of CNN service 422, LSTM service 426, or semantic modeling service 430. Speaker anticipation service 124 (e.g., CNN real time service 424, LSTM real time service 428, and/or semantic real time modeling service 432) can receive the updated models upon completion, or can periodically request updated models. After the model is sufficiently trained, it can predict whether the participant will speak a few seconds before the first utterance. If there is audio from the participant, the predicted label 524 can provide corroboration to the speaker anticipation model. If there is no audio from the participant, update service 440 can modify and/or correct the speaker anticipation model.
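The corroborate-or-correct step at the end of the paragraph above can be sketched as a function that turns each anticipation prediction into a training signal, depending on whether audio from the participant followed. The `Prediction` record, the `build_feedback` function, and the status strings are hypothetical names introduced for illustration; the disclosure does not specify the data format exchanged with update service 440.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Prediction:
    participant_id: str          # hypothetical identifier for the participant
    label: str                   # "speaking", "pre-speaking" or "non-speaking"
    features: Tuple[float, ...]  # features from the analyzed real time frames


def build_feedback(
    predictions: List[Prediction], spoke_afterwards: Set[str]
) -> List[Tuple[Tuple[float, ...], int, str]]:
    """Return (features, target, status) tuples for pre-speaking predictions.

    target is 1 when audio from the participant followed the prediction
    (the prediction is corroborated) and 0 when it did not (the model
    should be corrected). Other labels are skipped in this sketch.
    """
    feedback = []
    for prediction in predictions:
        if prediction.label != "pre-speaking":
            continue
        if prediction.participant_id in spoke_afterwards:
            feedback.append((prediction.features, 1, "corroborated"))
        else:
            feedback.append((prediction.features, 0, "corrected"))
    return feedback


predictions = [
    Prediction("alice", "pre-speaking", (0.9,)),
    Prediction("bob", "pre-speaking", (0.7,)),
    Prediction("carol", "non-speaking", (0.1,)),
]
feedback = build_feedback(predictions, spoke_afterwards={"alice"})
```

The resulting tuples could then serve as labeled examples for whichever training procedure the update step uses.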


Referring to FIG. 4, once a prediction is received by conferencing service 110, allocation service 438 determines an allocation of media bandwidth to be distributed to videoconference endpoint 112 and remote videoconference endpoint 114, where the allocation of media bandwidth of videoconference endpoint 112 is increased based on the strength of the received real time prediction.


In some embodiments, allocation service 438 ignores signals having a prediction below a threshold. In other embodiments, allocation service 438 ranks the received predictions according to one or more of the CNN model, LSTM model, or the semantic representation model. Rankings below a threshold may be ignored. For rankings above a threshold, endpoint interface 120 sends a request to videoconference endpoint 112 for high resolution video and/or upgraded video of the participant according to the allocation determined at allocation service 438. This prepares conferencing service 110 for switching video to the new participant who is expected to speak.
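One possible reading of the thresholding and strength-based allocation described above is sketched below. The function name, the baseline-plus-proportional-share policy, and every numeric value are assumptions made for illustration; the disclosure does not prescribe a particular allocation formula.

```python
from typing import Dict


def allocate_bandwidth(
    predictions: Dict[str, float],
    total_kbps: float,
    threshold: float = 0.6,
    base_kbps: float = 200.0,
) -> Dict[str, float]:
    """Split total_kbps across endpoints based on prediction strength.

    Every endpoint keeps a low-resolution baseline. Predictions below the
    threshold are ignored; the remaining bandwidth is shared among the other
    endpoints in proportion to their prediction strength.
    """
    if base_kbps * len(predictions) > total_kbps:
        raise ValueError("total_kbps is too small to give every endpoint base_kbps")

    allocation = {endpoint: base_kbps for endpoint in predictions}

    strong = {e: s for e, s in predictions.items() if s >= threshold}
    total_strength = sum(strong.values())
    if total_strength <= 0:
        return allocation  # nobody is expected to speak; keep the baseline only

    remaining = total_kbps - base_kbps * len(predictions)
    for endpoint, strength in strong.items():
        allocation[endpoint] += remaining * (strength / total_strength)
    return allocation


# Hypothetical prediction strengths for two endpoints.
result = allocate_bandwidth(
    {"endpoint_112": 0.9, "endpoint_114": 0.3}, total_kbps=2000.0
)
```

In this example only the first endpoint exceeds the threshold, so it receives the baseline plus all of the remaining bandwidth, while the second endpoint stays at the baseline.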


In response to receiving the request for video on the anticipated new speaker, videoconference endpoint 112 sends video of the anticipated new speaker to conferencing service 110. Video of the anticipated new speaker is distributed to remote videoconference endpoint 114 in accordance with allocation service 438. High resolution video, for example, can be sent to all other videoconference endpoints after detecting that the high resolution video from videoconference endpoint 112 includes a speaker.


PERSONALIZED SPEAKER ANTICIPATION MODEL


FIG. 7 shows an example diagram of a personalized speaker anticipation model, in accordance with several embodiments. Like the embodiments described above, annotated historical data is collected, provided to one or more models for training, and distributed to videoconference endpoint 112 (710, 712, 714). Videoconference endpoint 112 collects real time data, provides it to the speaker anticipation model for analysis, and then generates predicted label 524 (720, 722, 724).


In order to personalize a speaker anticipation model, videoconference endpoint 112 performs online machine learning to extract detectable features that are uniquely correlated with the particular participant preparing to speak (726). The features videoconference endpoint 112 initially looks for may be based, at least partly, on features extracted by CNN model 520, LSTM model 522, and/or semantic representation model, but the features may be updated, modified, and/or added as videoconference endpoint 112 continues to train with the participant (728). Once videoconference endpoint 112 develops a model unique to the participant, it can apply that personalized speaker anticipation model to real time data in order to generate personalized predictions (730). Those personalized predictions are then sent to conferencing service 110 for allocation determinations.
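The disclosure does not name a specific online learning algorithm. As one common possibility, the sketch below uses logistic regression updated by stochastic gradient descent, one observation at a time, to learn which cues precede speech for a particular participant. The class name, the two-cue feature vector, and the learning rate are all hypothetical.

```python
import math
from typing import List, Sequence


class PersonalizedCueModel:
    """Online logistic regression over per-participant cue features."""

    def __init__(self, n_features: int, learning_rate: float = 0.5) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features: Sequence[float]) -> float:
        """Return the estimated probability that the participant is about to speak."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features: Sequence[float], spoke: bool) -> None:
        """Apply one gradient step using a single observed outcome."""
        error = self.predict(features) - (1.0 if spoke else 0.0)
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


# Hypothetical cues: [adjusts glasses, looks at phone].
model = PersonalizedCueModel(n_features=2)
initial = model.predict([1.0, 0.0])
for _ in range(50):
    model.update([1.0, 0.0], spoke=True)   # this participant speaks after cue 0
    model.update([0.0, 1.0], spoke=False)  # but not after cue 1
```

Before any updates the model is undecided about both cues; after repeated observations it assigns a higher probability to the cue that has preceded speech for this participant.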


COMPUTING MACHINE ARCHITECTURE


FIG. 8 shows an example of computing system 800 in which the components of the system are in communication with each other using connection 805. Connection 805 can be a physical connection via a bus, or a direct connection into processor 810, such as in a chipset architecture. Connection 805 can also be a virtual connection, networked connection, or logical connection.


In some embodiments computing system 800 is a distributed system in which the functions described in this disclosure can be distributed within a datacenter, multiple datacenters, a peer network, etc. In some embodiments, one or more of the described system components represents many such components each performing some or all of the function for which the component is described. In some embodiments, the components can be physical or virtual devices.


Example system 800 includes at least one processing unit (CPU or processor) 810 and connection 805 that couples various system components including system memory 815, such as read only memory (ROM) and random access memory (RAM) to processor 810. Computing system 800 can include a cache of high-speed memory connected directly with, in close proximity to, or integrated as part of processor 810.


Processor 810 can include any general purpose processor and a hardware service or software service, such as services 832, 834, and 836 stored in storage device 830, configured to control processor 810 as well as a special-purpose processor where software instructions are incorporated into the actual processor design. Processor 810 may essentially be a completely self-contained computing system, containing multiple cores or processors, a bus, memory controller, cache, etc. A multi-core processor may be symmetric or asymmetric.


To enable user interaction, computing system 800 includes an input device 845, which can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech, etc. Computing system 800 can also include output device 835, which can be one or more of a number of output mechanisms known to those of skill in the art. In some instances, multimodal systems can enable a user to provide multiple types of input/output to communicate with computing system 800. Computing system 800 can include communications interface 840, which can generally govern and manage the user input and system output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.


Storage device 830 can be a non-volatile memory device and can be a hard disk or other types of computer readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, random access memories (RAMs), read only memory (ROM), and/or some combination of these devices.


Storage device 830 can include software services, servers, services, etc., such that when the code that defines such software is executed by processor 810, it causes the system to perform a function. In some embodiments, a hardware service that performs a particular function can include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as processor 810, connection 805, output device 835, etc., to carry out the function.


For clarity of explanation, in some instances the present technology may be presented as including individual functional blocks including functional blocks comprising devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software.


Any of the steps, operations, functions, or processes described herein may be performed or implemented by a combination of hardware and software services, alone or in combination with other devices. In some embodiments, a service can be software that resides in memory of a client device and/or one or more servers of a content management system and performs one or more functions when a processor executes the software associated with the service. In some embodiments, a service is a program, or a collection of programs, that carries out a specific function. In some embodiments, a service can be considered a server. The memory can be a non-transitory computer-readable medium.


In some embodiments the computer-readable storage devices, mediums, and memories can include a cable or wireless signal containing a bit stream and the like. However, when mentioned, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.


Methods according to the above-described examples can be implemented using computer-executable instructions that are stored or otherwise available from computer readable media. Such instructions can comprise, for example, instructions and data which cause or otherwise configure a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Portions of computer resources used can be accessible over a network. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, or source code. Examples of computer-readable media that may be used to store instructions, information used, and/or information created during methods according to described examples include magnetic or optical disks, solid state memory devices, flash memory, USB devices provided with non-volatile memory, networked storage devices, and so on.


Devices implementing methods according to these disclosures can comprise hardware, firmware and/or software, and can take any of a variety of form factors. Typical examples of such form factors include servers, laptops, smart phones, small form factor personal computers, personal digital assistants, and so on. Functionality described herein also can be embodied in peripherals or add-in cards. Such functionality can also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example.


The instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are means for providing the functions described in these disclosures.


Although a variety of examples and other information was used to explain aspects within the scope of the appended claims, no limitation of the claims should be implied based on particular features or arrangements in such examples, as one of ordinary skill would be able to use these examples to derive a wide variety of implementations. Further and although some subject matter may have been described in language specific to examples of structural features and/or method steps, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to these described features or acts. For example, such functionality can be distributed differently or performed in components other than those identified herein. Rather, the described features and steps are disclosed as examples of components of systems and methods within the scope of the appended claims.

Claims
  • 1. A video conference method comprising: receiving sensory input from at least one sensor local to a participant of a videoconference; analyzing the sensory input based on historical data to yield a prediction score that the participant is about to participate in the videoconference, the prediction score defined based on a range, the range including a point indicating the participant is equally likely to participate and not participate; distributing video of the participant, via the videoconference, when the prediction score exceeds a threshold defined within the range, the threshold being a predetermined percentage or a predetermined value, the threshold displaced from the point of the range.
  • 2. The method of claim 1, wherein the historical data is a historical video feed including the participant.
  • 3. The method of claim 1, wherein, the sensory input is a real time video stream captured by a camera local to a first videoconference endpoint, and the analyzing of the sensory input is performed by at least one speaker anticipation model to yield the prediction score.
  • 4. The method of claim 3, wherein, the at least one speaker anticipation model is derived by: training a first machine learning algorithm on a guided learning dataset, wherein the first machine learning algorithm analyzes static frames to identify visual speaker anticipation cues, and providing an output of the first machine learning algorithm as an input to a second machine learning algorithm, the second machine learning algorithm analyzes sequences of the static frames and sequences of the visual speaker anticipation cues, and the output of the second machine learning algorithm is the at least one speaker anticipation model.
  • 5. The method of claim 3, further comprising: analyzing video data consisting of the real time video stream captured by the camera local to the first videoconference endpoint to generate labeled real time video frames, the labeled real time video frames labeled as real time video speaking frames or real time video pre-speaking frames; and applying a machine learning algorithm to analyze the labeled real time video frames and update the at least one speaker anticipation model.
  • 6. The method of claim 3, wherein, the videoconference includes a plurality of conference participants, and the at least one speaker anticipation model is a semantic representation model created by a machine learning algorithm that has analyzed a plurality of signals including at least one of video data, audio data, and conference participant in-room location data, from historical videoconferences taking place in a specific meeting room.
  • 7. The method of claim 6, further comprising: ranking the plurality of signals captured during a real time video conference according to the semantic representation model; and ignoring signals having a ranking below the threshold.
  • 8. A videoconference system comprising: a processor; a storage device in communication with the processor; and a videoconference endpoint configured, via the storage device and the processor, to: receive sensory input from at least one sensor local to a participant of a videoconference; analyze the sensory input based on historical data to yield a prediction score that the participant is about to participate in the videoconference, the prediction score defined based on a range, the range including a point indicating the participant is equally likely to participate and not participate; and distribute video of the participant, via the videoconference, when the prediction score exceeds a threshold defined within the range, the threshold being a predetermined percentage or a predetermined value, the threshold displaced from the point of the range.
  • 9. The videoconference system of claim 8, wherein the historical data is a historical video feed including the participant.
  • 10. The videoconference system of claim 8, wherein, the sensory input is a real time video stream captured by a camera local to a first videoconference endpoint, and the sensory input is analyzed using at least one speaker anticipation model to yield the prediction score.
  • 11. The videoconference system of claim 10, wherein the at least one speaker anticipation model is derived by: training a first machine learning algorithm on a guided learning dataset; and providing an output of the first machine learning algorithm as an input to a second machine learning algorithm, the output of the second machine learning algorithm being the at least one speaker anticipation model.
  • 12. The videoconference system of claim 11, wherein the first machine learning algorithm analyzes static frames to identify visual speaker anticipation cues.
  • 13. The videoconference system of claim 12, wherein the second machine learning algorithm analyzes sequences of the static frames and sequences of visual speaker anticipation cues.
  • 14. The videoconference system of claim 8, further comprising: an attention mechanism configured to: collect video, audio, and gesture recognition to perform a predictive analysis; allow camera feed focus in real time; and learn which actions are significant to focus camera attention on.
  • 15. At least one non-transitory computer readable medium comprising instructions that, when executed, cause at least one computing device to perform operations comprising: receiving sensory input from at least one sensor local to a participant of a videoconference; analyzing the sensory input based on historical data to yield a prediction score that the participant is about to participate in the videoconference, the prediction score defined based on a range, the range including a point indicating the participant is equally likely to participate and not participate; and distributing video of the participant, via the videoconference, when the prediction score exceeds a threshold defined within the range, the threshold being a predetermined percentage or a predetermined value, the threshold displaced from the point of the range.
  • 16. The at least one non-transitory computer readable medium of claim 15, wherein the historical data is a historical video feed including the participant.
  • 17. The at least one non-transitory computer readable medium of claim 16, wherein the operations include analyzing static frames of the historical video feed to identify visual speaker anticipation cues.
  • 18. The at least one non-transitory computer readable medium of claim 17, wherein the operations include analyzing sequences of the static frames and sequences of the visual speaker anticipation cues.
  • 19. The at least one non-transitory computer readable medium of claim 15, wherein, the sensory input is a real time video stream captured by a camera local to a first videoconference endpoint, and the sensory input is analyzed using at least one speaker anticipation model to yield the prediction score.
  • 20. The at least one non-transitory computer readable medium of claim 19, wherein the at least one speaker anticipation model is a semantic representation model created by a machine learning algorithm after analysis of a plurality of signals including video data, audio data, and conference participant in-room location data from historical videoconferences taking place in a meeting room.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 15/646,470 filed on Jul. 11, 2017, which claims the benefit of U.S. Provisional Patent Application Ser. No. 62/524,014 filed on Jun. 23, 2017, the contents of which are incorporated by reference in their entireties.

20050031136 Du et al. Feb 2005 A1
20050048916 Suh Mar 2005 A1
20050055405 Kaminsky et al. Mar 2005 A1
20050055412 Kaminsky et al. Mar 2005 A1
20050085243 Boyer et al. Apr 2005 A1
20050099492 Orr May 2005 A1
20050108328 Berkeland et al. May 2005 A1
20050111487 Matta et al. May 2005 A1
20050114532 Chess et al. May 2005 A1
20050131774 Huxter Jun 2005 A1
20050143979 Lee et al. Jun 2005 A1
20050175208 Shaw et al. Aug 2005 A1
20050215229 Cheng Sep 2005 A1
20050226511 Short Oct 2005 A1
20050231588 Yang et al. Oct 2005 A1
20050286711 Lee et al. Dec 2005 A1
20060004911 Becker et al. Jan 2006 A1
20060020697 Kelso et al. Jan 2006 A1
20060026255 Malamud et al. Feb 2006 A1
20060072471 Shiozawa Apr 2006 A1
20060083193 Womack et al. Apr 2006 A1
20060083305 Dougherty et al. Apr 2006 A1
20060084471 Walter Apr 2006 A1
20060116146 Herrod et al. Jun 2006 A1
20060133404 Zuniga et al. Jun 2006 A1
20060164552 Cutler Jul 2006 A1
20060224430 Butt Oct 2006 A1
20060250987 White et al. Nov 2006 A1
20060271624 Lyle et al. Nov 2006 A1
20060274647 Wang et al. Dec 2006 A1
20070005752 Chawla et al. Jan 2007 A1
20070021973 Stremler Jan 2007 A1
20070025576 Wen Feb 2007 A1
20070041366 Vugenfirer et al. Feb 2007 A1
20070047707 Mayer et al. Mar 2007 A1
20070058842 Vallone et al. Mar 2007 A1
20070067387 Jain et al. Mar 2007 A1
20070071030 Lee Mar 2007 A1
20070083650 Collomb et al. Apr 2007 A1
20070091831 Croy et al. Apr 2007 A1
20070100986 Bagley et al. May 2007 A1
20070106747 Singh et al. May 2007 A1
20070116225 Zhao et al. May 2007 A1
20070120966 Murai May 2007 A1
20070139626 Saleh et al. Jun 2007 A1
20070149249 Chen et al. Jun 2007 A1
20070150453 Morita Jun 2007 A1
20070168444 Chen et al. Jul 2007 A1
20070192065 Riggs et al. Aug 2007 A1
20070198637 Deboy et al. Aug 2007 A1
20070208590 Dorricott et al. Sep 2007 A1
20070248244 Sato et al. Oct 2007 A1
20070250567 Graham et al. Oct 2007 A1
20080049622 Previdi et al. Feb 2008 A1
20080059986 Kalinowski et al. Mar 2008 A1
20080068447 Mattila et al. Mar 2008 A1
20080071868 Arenburg et al. Mar 2008 A1
20080080532 O'Sullivan et al. Apr 2008 A1
20080089246 Ghanwani et al. Apr 2008 A1
20080107255 Geva et al. May 2008 A1
20080133663 Lentz Jun 2008 A1
20080140817 Agarwal et al. Jun 2008 A1
20080154863 Goldstein Jun 2008 A1
20080159151 Datz et al. Jul 2008 A1
20080181259 Andreev et al. Jul 2008 A1
20080192651 Gibbings Aug 2008 A1
20080209452 Ebert et al. Aug 2008 A1
20080270211 Vander Veen et al. Oct 2008 A1
20080278894 Chen et al. Nov 2008 A1
20080293353 Mody et al. Nov 2008 A1
20090003232 Vaswani et al. Jan 2009 A1
20090010264 Zhang Jan 2009 A1
20090012963 Johnson et al. Jan 2009 A1
20090019374 Logan et al. Jan 2009 A1
20090049151 Pagan Feb 2009 A1
20090064245 Facemire et al. Mar 2009 A1
20090073988 Ghodrat et al. Mar 2009 A1
20090075633 Lee et al. Mar 2009 A1
20090089822 Wada Apr 2009 A1
20090094088 Chen et al. Apr 2009 A1
20090100142 Stern et al. Apr 2009 A1
20090119373 Denner et al. May 2009 A1
20090129316 Ramanathan et al. May 2009 A1
20090132949 Bosarge May 2009 A1
20090147714 Jain et al. Jun 2009 A1
20090147737 Tacconi et al. Jun 2009 A1
20090168653 St. Pierre et al. Jul 2009 A1
20090193327 Roychoudhuri et al. Jul 2009 A1
20090234667 Thayne Sep 2009 A1
20090254619 Kho et al. Oct 2009 A1
20090256901 Mauchly et al. Oct 2009 A1
20090271467 Boers et al. Oct 2009 A1
20090278851 Ach et al. Nov 2009 A1
20090282104 O'Sullivan et al. Nov 2009 A1
20090292999 LaBine et al. Nov 2009 A1
20090296908 Lee et al. Dec 2009 A1
20090303908 Deb et al. Dec 2009 A1
20090306981 Cromack et al. Dec 2009 A1
20090309846 Trachtenberg et al. Dec 2009 A1
20090313334 Seacat et al. Dec 2009 A1
20100005142 Xiao et al. Jan 2010 A1
20100005402 George et al. Jan 2010 A1
20100031192 Kong Feb 2010 A1
20100046504 Hill Feb 2010 A1
20100061538 Coleman et al. Mar 2010 A1
20100070640 Allen, Jr. et al. Mar 2010 A1
20100073454 Lovhaugen et al. Mar 2010 A1
20100077109 Yan et al. Mar 2010 A1
20100094867 Badros et al. Apr 2010 A1
20100095327 Fujinaka et al. Apr 2010 A1
20100121959 Lin et al. May 2010 A1
20100131856 Kalbfleisch et al. May 2010 A1
20100157978 Robbins et al. Jun 2010 A1
20100162170 Johns et al. Jun 2010 A1
20100165863 Nakata Jul 2010 A1
20100183179 Griffin, Jr. et al. Jul 2010 A1
20100211872 Rolston et al. Aug 2010 A1
20100215334 Miyagi Aug 2010 A1
20100220615 Enstrom et al. Sep 2010 A1
20100241691 Savitzky et al. Sep 2010 A1
20100245535 Mauchly Sep 2010 A1
20100250817 Collopy et al. Sep 2010 A1
20100262266 Chang et al. Oct 2010 A1
20100262925 Liu et al. Oct 2010 A1
20100275164 Morikawa Oct 2010 A1
20100302033 Devenyi et al. Dec 2010 A1
20100303227 Gupta Dec 2010 A1
20100316207 Brunson Dec 2010 A1
20100318399 Li et al. Dec 2010 A1
20110072037 Lotzer Mar 2011 A1
20110075830 Dreher et al. Mar 2011 A1
20110082596 Meagher et al. Apr 2011 A1
20110087745 O'Sullivan et al. Apr 2011 A1
20110116389 Tao et al. May 2011 A1
20110117535 Benko et al. May 2011 A1
20110131498 Chao et al. Jun 2011 A1
20110149759 Jollota Jun 2011 A1
20110154427 Wei Jun 2011 A1
20110228696 Agarwal et al. Sep 2011 A1
20110230209 Kilian Sep 2011 A1
20110255570 Fujiwara Oct 2011 A1
20110264928 Hinckley Oct 2011 A1
20110267962 J S A et al. Nov 2011 A1
20110270609 Jones et al. Nov 2011 A1
20110271211 Jones et al. Nov 2011 A1
20110274283 Athanas Nov 2011 A1
20110283226 Basson et al. Nov 2011 A1
20110314139 Song et al. Dec 2011 A1
20120009890 Curcio et al. Jan 2012 A1
20120013704 Sawayanagi et al. Jan 2012 A1
20120013768 Zurek et al. Jan 2012 A1
20120026279 Kato Feb 2012 A1
20120054288 Wiese et al. Mar 2012 A1
20120072364 Ho Mar 2012 A1
20120075999 Ko et al. Mar 2012 A1
20120084714 Sirpal et al. Apr 2012 A1
20120092436 Pahud et al. Apr 2012 A1
20120140970 Kim et al. Jun 2012 A1
20120163177 Vaswani et al. Jun 2012 A1
20120179502 Farooq et al. Jul 2012 A1
20120190386 Anderson Jul 2012 A1
20120192075 Ebtekar et al. Jul 2012 A1
20120213062 Liang et al. Aug 2012 A1
20120213124 Vasseur et al. Aug 2012 A1
20120233020 Eberstadt et al. Sep 2012 A1
20120246229 Carr et al. Sep 2012 A1
20120246596 Ording et al. Sep 2012 A1
20120284635 Sitrick et al. Nov 2012 A1
20120296957 Stinson et al. Nov 2012 A1
20120303476 Krzyzanowski et al. Nov 2012 A1
20120306757 Keist et al. Dec 2012 A1
20120306993 Sellers-Blais Dec 2012 A1
20120307629 Vasseur et al. Dec 2012 A1
20120308202 Murata et al. Dec 2012 A1
20120313971 Murata et al. Dec 2012 A1
20120315011 Messmer et al. Dec 2012 A1
20120321058 Eng et al. Dec 2012 A1
20120323645 Spiegel et al. Dec 2012 A1
20120324512 Cahnbley et al. Dec 2012 A1
20130003542 Catovic et al. Jan 2013 A1
20130010610 Karthikeyan et al. Jan 2013 A1
20130027425 Yuan Jan 2013 A1
20130028073 Tatipamula et al. Jan 2013 A1
20130038675 Malik Feb 2013 A1
20130047093 Reuschel et al. Feb 2013 A1
20130050398 Krans et al. Feb 2013 A1
20130055112 Joseph et al. Feb 2013 A1
20130061054 Niccolai Mar 2013 A1
20130063542 Bhat et al. Mar 2013 A1
20130070755 Trabelsi et al. Mar 2013 A1
20130086633 Schultz Apr 2013 A1
20130090065 Fisunenko et al. Apr 2013 A1
20130091205 Kotler et al. Apr 2013 A1
20130091440 Kotler et al. Apr 2013 A1
20130094647 Mauro et al. Apr 2013 A1
20130113602 Gilbertson et al. May 2013 A1
20130113827 Forutanpour et al. May 2013 A1
20130120522 Lian et al. May 2013 A1
20130124551 Foo May 2013 A1
20130128720 Kim et al. May 2013 A1
20130129252 Lauper et al. May 2013 A1
20130135837 Kemppinen May 2013 A1
20130141371 Hallford et al. Jun 2013 A1
20130148789 Hillier et al. Jun 2013 A1
20130177305 Prakash et al. Jul 2013 A1
20130182063 Jaiswal et al. Jul 2013 A1
20130185672 McCormick et al. Jul 2013 A1
20130198629 Tandon et al. Aug 2013 A1
20130210496 Zakarias et al. Aug 2013 A1
20130211826 Mannby Aug 2013 A1
20130212202 Lee Aug 2013 A1
20130215215 Gage et al. Aug 2013 A1
20130219278 Rosenberg Aug 2013 A1
20130222246 Booms et al. Aug 2013 A1
20130225080 Doss et al. Aug 2013 A1
20130227433 Doray et al. Aug 2013 A1
20130235866 Tian et al. Sep 2013 A1
20130242030 Kato et al. Sep 2013 A1
20130243213 Moquin Sep 2013 A1
20130250754 Vasseur et al. Sep 2013 A1
20130252669 Nhiayi Sep 2013 A1
20130263020 Heiferman et al. Oct 2013 A1
20130275589 Karthikeyan et al. Oct 2013 A1
20130290421 Benson et al. Oct 2013 A1
20130297704 Alberth, Jr. et al. Nov 2013 A1
20130300637 Smits et al. Nov 2013 A1
20130311673 Karthikeyan et al. Nov 2013 A1
20130325970 Roberts et al. Dec 2013 A1
20130329865 Ristock et al. Dec 2013 A1
20130335507 Aarrestad et al. Dec 2013 A1
20140012990 Ko Jan 2014 A1
20140028781 MacDonald Jan 2014 A1
20140040404 Pujare et al. Feb 2014 A1
20140040819 Duffy Feb 2014 A1
20140049595 Feng et al. Feb 2014 A1
20140063174 Junuzovic et al. Mar 2014 A1
20140068452 Joseph et al. Mar 2014 A1
20140068670 Timmermann et al. Mar 2014 A1
20140078182 Utsunomiya Mar 2014 A1
20140108486 Borzycki et al. Apr 2014 A1
20140111597 Anderson et al. Apr 2014 A1
20140126423 Vasseur et al. May 2014 A1
20140133327 Miyauchi May 2014 A1
20140136630 Siegel et al. May 2014 A1
20140157338 Pearce Jun 2014 A1
20140161243 Contreras et al. Jun 2014 A1
20140195557 Oztaskent et al. Jul 2014 A1
20140198175 Shaffer et al. Jul 2014 A1
20140204759 Guo et al. Jul 2014 A1
20140207945 Galloway et al. Jul 2014 A1
20140215077 Soudan et al. Jul 2014 A1
20140219103 Vasseur et al. Aug 2014 A1
20140237371 Klemm et al. Aug 2014 A1
20140253671 Bentley et al. Sep 2014 A1
20140280595 Mani et al. Sep 2014 A1
20140282213 Musa et al. Sep 2014 A1
20140293955 Keerthi Oct 2014 A1
20140296112 O'Driscoll et al. Oct 2014 A1
20140298210 Park et al. Oct 2014 A1
20140317561 Robinson et al. Oct 2014 A1
20140337840 Hyde et al. Nov 2014 A1
20140358264 Long et al. Dec 2014 A1
20140372908 Kashi et al. Dec 2014 A1
20150004571 Ironside et al. Jan 2015 A1
20150009278 Modai et al. Jan 2015 A1
20150023174 Dasgupta et al. Jan 2015 A1
20150029301 Nakatomi et al. Jan 2015 A1
20150052095 Yang et al. Feb 2015 A1
20150067552 Leorin et al. Mar 2015 A1
20150070835 Mclean Mar 2015 A1
20150074189 Cox et al. Mar 2015 A1
20150081885 Thomas et al. Mar 2015 A1
20150082350 Ogasawara et al. Mar 2015 A1
20150085060 Fish et al. Mar 2015 A1
20150088575 Asli et al. Mar 2015 A1
20150089393 Zhang et al. Mar 2015 A1
20150089394 Chen et al. Mar 2015 A1
20150113050 Stahl Apr 2015 A1
20150113369 Chan et al. Apr 2015 A1
20150128068 Kim May 2015 A1
20150142702 Nilsson et al. May 2015 A1
20150172120 Dwarampudi et al. Jun 2015 A1
20150178626 Pielot et al. Jun 2015 A1
20150215365 Shaffer et al. Jul 2015 A1
20150254760 Pepper Sep 2015 A1
20150288774 Larabie-Belanger Oct 2015 A1
20150301691 Qin Oct 2015 A1
20150304120 Xiao et al. Oct 2015 A1
20150304366 Bader-Natal et al. Oct 2015 A1
20150319113 Gunderson et al. Nov 2015 A1
20150324689 Wierzynski et al. Nov 2015 A1
20150350126 Xue Dec 2015 A1
20150358248 Saha et al. Dec 2015 A1
20150365725 Belyaev et al. Dec 2015 A1
20150373063 Vashishtha et al. Dec 2015 A1
20150373414 Kinoshita Dec 2015 A1
20160037304 Dunkin et al. Feb 2016 A1
20160043986 Ronkainen Feb 2016 A1
20160044159 Wolff et al. Feb 2016 A1
20160044380 Barrett Feb 2016 A1
20160050079 Martin De Nicolas et al. Feb 2016 A1
20160050160 Li et al. Feb 2016 A1
20160050175 Chaudhry et al. Feb 2016 A1
20160070758 Thomson et al. Mar 2016 A1
20160071056 Ellison et al. Mar 2016 A1
20160072862 Bader-Natal et al. Mar 2016 A1
20160094593 Priya Mar 2016 A1
20160105345 Kim et al. Apr 2016 A1
20160110056 Hong et al. Apr 2016 A1
20160165056 Bargetzi et al. Jun 2016 A1
20160173537 Kumar et al. Jun 2016 A1
20160182580 Nayak Jun 2016 A1
20160203404 Cherkasova et al. Jul 2016 A1
20160266609 McCracken Sep 2016 A1
20160269411 Malachi Sep 2016 A1
20160277461 Sun et al. Sep 2016 A1
20160283909 Adiga Sep 2016 A1
20160307165 Grodum et al. Oct 2016 A1
20160309037 Rosenberg et al. Oct 2016 A1
20160315802 Wei et al. Oct 2016 A1
20160321347 Zhou et al. Nov 2016 A1
20160335111 Bruun et al. Nov 2016 A1
20170006162 Bargetzi et al. Jan 2017 A1
20170006446 Harris et al. Jan 2017 A1
20170070706 Ursin et al. Mar 2017 A1
20170093874 Uthe Mar 2017 A1
20170104961 Pan et al. Apr 2017 A1
20170150399 Kedalagudde et al. May 2017 A1
20170171260 Jerrard-Dunne et al. Jun 2017 A1
20170228251 Yang et al. Aug 2017 A1
20170289033 Singh et al. Oct 2017 A1
20170324850 Snyder et al. Nov 2017 A1
20170347308 Chou et al. Nov 2017 A1
20170353361 Chopra et al. Dec 2017 A1
20180013656 Chen Jan 2018 A1
20180166066 Dimitriadis et al. Jun 2018 A1
Foreign Referenced Citations (16)
Number Date Country
101055561 Oct 2007 CN
101076060 Nov 2007 CN
102572370 Jul 2012 CN
102655583 Sep 2012 CN
101729528 Nov 2012 CN
102938834 Feb 2013 CN
103141086 Jun 2013 CN
204331453 May 2015 CN
3843033 Sep 1991 DE
959585 Nov 1999 EP
2773131 Sep 2014 EP
2341686 Aug 2016 EP
WO 9855903 Dec 1998 WO
WO 2008139269 Nov 2008 WO
WO 2012167262 Dec 2012 WO
WO 2014118736 Aug 2014 WO
Non-Patent Literature Citations (42)
Entry
Hradis et al., “Voice Activity Detection from Gaze in Video Mediated Communication,” ACM, Mar. 28-30, 2012 http://medusa.fit.vutbr.cz/TA2/TA2., pp. 1-4.
Author Unknown, “A Primer on the H.323 Series Standard,” Version 2.0, available at http://www.packetizer.com/voip/h323/papers/primer/, retrieved on Dec. 20, 2006, 17 pages.
Author Unknown, ““I can see the future” 10 predictions concerning cell-phones,” Surveillance Camera Players, http://www.notbored.org/cell-phones.html, Jun. 21, 2003, 2 pages.
Author Unknown, “Active screen follows mouse and dual monitors,” KDE Community Forums, Apr. 13, 2010, 3 pages.
Author Unknown, “Implementing Media Gateway Control Protocols” A RADVision White Paper, Jan. 27, 2002, 16 pages.
Author Unknown, “Manage Meeting Rooms in Real Time,” Jan. 23, 2017, door-tablet.com, 7 pages.
AverUSA, “Interactive Video Conferencing K-12 applications,” copyright 2012, http://www.averusa.com/education/downloads/hvc brochure goved.pdf (last accessed Oct. 11, 2013).
Choi, Jae Young, et al., “Towards an Automatic Face Indexing System for Actor-based Video Services in an IPTV Environment,” IEEE Transactions on 56, No. 1 (2010): 147-155.
Cisco Systems, Inc., “VXLAN Network with MP-BGP EVPN Control Plane Design Guide,” Mar. 21, 2016, 44 pages.
Cisco Systems, Inc., “Cisco webex: WebEx Meeting Center User Guide for Hosts, Presenters, and Participants” © 1997-2013, pp. 1-394 plus table of contents.
Cisco Systems, Inc., “Cisco Webex Meetings for iPad and iPhone Release Notes,” Version 5.0, Oct. 2013, 5 pages.
Cisco Systems, Inc., “Cisco WebEx Meetings Server System Requirements release 1.5.” 30 pages, Aug. 14, 2013.
Cisco Systems, Inc., “Cisco Unified Personal Communicator 8.5”, 2011, 9 pages.
Cisco White Paper, “Web Conferencing: Unleash the Power of Secure, Real-Time Collaboration,” pp. 1-8, 2014.
Clarke, Brant, “Polycom Announces RealPresence Group Series,” dated Oct. 8, 2012, available at http://www.323.tv/news/polycom-realpresence-group-series (last accessed Oct. 11, 2013).
Clauser, Grant, et al., “Is the Google Home the voice-controlled speaker for you?,” The Wire Cutter, Nov. 22, 2016, pp. 1-15.
Cole, Camille, et al., “Videoconferencing for K-12 Classrooms,” Second Edition (excerpt), http://www.iste.org/docs/excerpts/VIDCO2-excerpt.pdf (last accessed Oct. 11, 2013), 2009.
Eichen, Elliot, et al., “Smartphone Docking Stations and Strongly Converged VoIP Clients for Fixed-Mobile convergence,” IEEE Wireless Communications and Networking Conference: Services, Applications and Business, 2012, pp. 3140-3144.
Epson, “BrightLink Pro Projector,” http://www.epson.com/cgi-bin/Store/jsp/Landing/brightlink-pro-interactive-projectors.do?ref=van brightlink-pro, dated 2013 (last accessed Oct. 11, 2013).
Grothaus, Michael, “How Interactive Product Placements Could Save Television,” Jul. 25, 2013, 4 pages.
Hannigan, Nancy Kruse, et al., “The IBM Lotus Sametime V8 Family Extending the IBM Unified Communications and Collaboration Strategy” (2007), available at http://www.ibm.com/developerworks/lotus/library/sametime8-new/, 10 pages.
Hirschmann, Kenny, “TWIDDLA: Smarter Than the Average Whiteboard,” Apr. 17, 2014, 2 pages.
InFocus, “Mondopad,” http://www.infocus.com/sites/default/files/InFocus-Mondopad-INF5520a-INF7021-Datasheet-EN.pdf (last accessed Oct. 11, 2013), 2013.
MacCormick, John, “Video Chat with Multiple Cameras,” CSCW '13, Proceedings of the 2013 conference on Computer supported cooperative work companion, pp. 195-198, ACM, New York, NY, USA, 2013.
Microsoft, “Positioning Objects on Multiple Display Monitors,” Aug. 12, 2012, 2 pages.
Mullins, Robert, “Polycom Adds Tablet Videoconferencing,” available at http://www.informationweek.com/telecom/unified-communications/polycom-adds-tablet-videoconferencing/231900680, dated Oct. 12, 2011 (last accessed Oct. 11, 2013).
Nu-Star Technologies, “Interactive Whiteboard Conferencing,” http://www.nu-star.com/interactive-conf.php, dated 2013 (last accessed Oct. 11, 2013).
Nyamgondalu, Nagendra, “Lotus Notes Calendar and Scheduling Explained!” IBM, Oct. 18, 2004, 10 pages.
Polycom, “Polycom RealPresence Mobile: Mobile Telepresence & Video Conferencing,” http://www.polycom.com/products-services/hd-telepresence-video-conferencing/realpresence-mobile.html#stab1 (last accessed Oct. 11, 2013), 2013.
Polycom, “Polycom Turns Video Display Screens into Virtual Whiteboards with First Integrated Whiteboard Solution for Video Collaboration,” http://www.polycom.com/company/news/press-releases/2011/20111027 2.html, dated Oct. 27, 2011.
Polycom, “Polycom UC Board, Transforming ordinary surfaces into virtual whiteboards,” 2012, Polycom, Inc., San Jose, CA, http://www.uatg.com/pdf/polycom/polycom-uc-board-datasheet.pdf (last accessed Oct. 11, 2013).
Schreiber, Danny, “The Missing Guide for Google Hangout Video Calls,” Jun. 5, 2014, 6 pages.
Shervington, Martin, “Complete Guide to Google Hangouts for Businesses and Individuals,” Mar. 20, 2014, 15 pages.
Shi, Saiqi, et al., “Notification That a Mobile Meeting Attendee Is Driving,” May 20, 2013, 13 pages.
Stevenson, Nancy, “Webex Web Meetings for Dummies” 2005, Wiley Publishing Inc., Indianapolis, Indiana, USA, 339 pages.
Stodle, Daniel, et al., “Gesture-Based, Touch-Free Multi-User Gaming on Wall-Sized, High-Resolution Tiled Displays,” 2008, 13 pages.
Thompson, Phil, et al., “Agent Based Ontology Driven Virtual Meeting Assistant,” Future Generation Information Technology, Springer Berlin Heidelberg, 2010, 4 pages.
TNO, “Multi-Touch Interaction Overview,” Dec. 1, 2009, 12 pages.
Toga, James, et al., “Demystifying Multimedia Conferencing Over the Internet Using the H.323 Set of Standards,” Intel Technology Journal Q2, 1998, 11 pages.
Ubuntu, “Force Unity to open new window on the screen where the cursor is?” Sep. 16, 2013, 1 page.
VB Forums, “Pointapi,” Aug. 8, 2001, 3 pages.
Vidyo, “VidyoPanorama,” http://www.vidvo.com/products/vidyopanorama/, dated 2013 (last accessed Oct. 11, 2013).
Related Publications (1)
Number Date Country
20200077049 A1 Mar 2020 US
Provisional Applications (1)
Number Date Country
62524014 Jun 2017 US
Continuations (1)
Number Date Country
Parent 15646470 Jul 2017 US
Child 16678729 US