A video conferencing system can include a number of electronic devices that exchange information among a group of participants. Examples of electronic devices include, but are not limited to, mobile phones, tablets, base stations, network access points, customer-premises equipment (CPE), laptops, cameras, and wearable electronics. In some scenarios, electronic devices can include input devices, output devices, or a combination thereof that allows for the exchange of content in the form of audio data, video data, or a combination of audio and video data. The exchange of content may be facilitated through software applications (generally referred to as conference applications), communication networks, and network services.
Some video conferencing devices can incorporate artificial intelligence (AI)-based systems for video conference functionality. For example, a video conferencing system can include functionality for object detection, movement processing, face detection, and the like.
Various features will now be described with reference to the following drawings. Throughout the drawings, reference numbers may be re-used to indicate correspondence between referenced elements. The drawings are provided to illustrate examples described herein and are not intended to limit the scope of the disclosure.
Certain examples described herein provide a system and method for identifying instances of people within a field of view of a conferencing device and removing instances of detected objects determined not to be a person. Generally, aspects relate to the utilization of instance segmentation and onboard AI algorithms to identify characterized attributes of detected objects, as well as object identification and count. Such aspects can include, but are not limited to, using various sensors, such as image capture sensors, video capture sensors, distance sensors, etc., for object detection. Other aspects further include using the various sensors to determine characterized attributes of detected objects and, based on those characterized attributes, removing objects from display during a conference call.
With reference to the general example, when objects with characterized attributes are detected, the video conferencing device determines whether an individual object satisfies a specified requirement based on the characterized attributes. The objects that do not satisfy the requirements are removed from the conference call for an optimized video conferencing experience.
Video conferencing devices are widely used electronic devices that are typically deployed within a conferencing room and connect to a host application, where the host application communicates with other participants around the world via the internet. The host application is displayed on a digital monitor and comprises a user interface that displays the participants' video streams, audio streams, and a presentation of any participant who has opted to present during the video conference. Video conferencing devices typically display everything in the field of view (FOV) of the camera sensors on the host application. As applied to video conferencing applications, many of the objects detected within the FOV of the image sensors correspond to organic beings (e.g., individuals) that are participating in the video conference and are intending to be captured by the camera sensors. However, non-organic or inanimate objects may also be captured and represented in the FOV of the camera sensors.
Many of the detected non-organic objects, such as furniture or fixtures, are also intended to be captured as part of the video conference application. However, some subsets of non-organic objects that are also capable of or configured for generating content, such as television monitors, computing device screens, etc., may also be captured in the FOV. If such a subset of non-organic objects is also presenting content from the video conference application, representation of such non-organic detected objects will typically generate video looping scenarios and may cause the video conference application to produce dual-person detections. The dual-person detections can come from a digital representation of a person participating in a conferencing event, where the conferencing device captures an image of the person twice and presents it on a digital monitor, creating a looping effect. Unintended video looping can tend to interrupt or disrupt video conferencing application functionality, such as causing errors associated with dual individual recognition. For example, a video conferencing application may have difficulty focusing on an individual speaking if face mapping functionality determines that two individuals (e.g., the organic being and the non-organic representation of the being) are both present in the FOV. This can lead to errors in the video conferencing application or incorrect focus on the non-organic representation. In another example, the execution of a video conferencing application may require additional computing device resources to process video imaging data, including increased processing resources, memory resources, etc., utilized to process the looping video feed (e.g., a video of the video being generated).
The aforementioned inadequacy of the typical video conferencing device, among others, is addressed in some examples by the disclosed technique of a video conferencing device (also sometimes referred to in this description as an electronic device) processing additional sensor data.
Examples described herein provide an electronic device capable of processing image sensor data for detecting objects within a FOV of the electronic device. The electronic device can be fitted with a plurality of image-capturing sensors housed locally to the electronic device. The plurality of image-capturing sensors can also be located remotely to the electronic device and communicate with the electronic device via a wired or wireless communication system.
Based on the processed image-capturing sensor data, the electronic device can then attempt to characterize detected objects, such as those related to organic beings or non-organic representations of organic beings (e.g., non-organic objects that may otherwise be represented/detected as organic beings). The electronic device processes the image sensor data and detects all objects within a FOV of the electronic device. Illustratively, based on the detection of an object, the electronic device assigns a unique identifier to each identified object, creating an instance segmentation. The electronic device is also configured for facial recognition of an object with respect to facial tracking and produces a unique facial identifier of an organic being. Facial recognition can utilize lip and eye movement as well as variable focus analysis for determining the presence of a face.
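As a minimal, illustrative sketch of how such a unique facial identifier might be produced, the following assumes a hypothetical face-embedding model whose output vectors can be compared by distance; the registry class, the match threshold, and the field names are assumptions for illustration, not a defined part of the device.

```python
# Minimal sketch of producing a stable facial identifier for detected faces.
# The face-embedding model, the distance metric, and the match threshold are
# hypothetical placeholders.
from dataclasses import dataclass
from typing import List


@dataclass
class FaceRecord:
    face_id: int
    embedding: List[float]  # feature vector from a face-embedding model


class FacialIdentifierRegistry:
    def __init__(self, match_threshold: float = 0.6):
        self.match_threshold = match_threshold
        self.records = []  # list of FaceRecord entries
        self._next_id = 0

    @staticmethod
    def _distance(a: List[float], b: List[float]) -> float:
        # Euclidean distance between two embeddings.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def identify(self, embedding: List[float]) -> int:
        # Reuse an existing facial identifier when a close match exists;
        # otherwise register a new identifier for a newly seen face.
        for record in self.records:
            if self._distance(record.embedding, embedding) < self.match_threshold:
                return record.face_id
        record = FaceRecord(self._next_id, embedding)
        self.records.append(record)
        self._next_id += 1
        return record.face_id
```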
As an object traverses across the FOV of the electronic device, the processor maintains the unique identifier with respect to the object. Facial recognition can determine whether a facial identifier, with respect to a detected face, matches the facial identifier of the same face detected previously. In an instance where the processor determines that a unique identifier matches a second unique identifier (i.e., a duplicate facial identifier is present), the processor removes the duplicate facial identifier from the video conference framing. Object detection is also utilized to differentiate objects within a FOV of the electronic device that are inanimate objects, such as a digital screen that is projecting images of organic beings. The processor is configured to remove sensor data of a captured digital display to prevent looping of the video conference, preventing the video conference from displaying any presentation on a screen that is within the FOV of the electronic device.
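A minimal sketch of removing duplicate facial identifiers from the framing might look like the following; treating the later, lower-confidence detection as the reproduction to drop is an assumption for illustration, as are the dictionary field names.

```python
# Sketch of filtering duplicate facial identifiers out of conference framing.
# When two tracked objects resolve to the same facial identifier, the extra
# detection (e.g., a reproduction on a digital screen) is excluded.

def filter_duplicate_faces(detections):
    """detections: dicts with 'object_id' and 'face_id' keys, ordered with
    the most confident detection of each face first."""
    seen_face_ids = set()
    framed = []
    for detection in detections:
        if detection["face_id"] in seen_face_ids:
            continue  # duplicate facial identifier: remove from framing
        seen_face_ids.add(detection["face_id"])
        framed.append(detection)
    return framed
```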
Responsive to processed image capture data (e.g., detected objects), sometimes also referred to as video sensor data, the electronic device initiates a sequence of tests on the detected objects to determine whether each detected object is an organic being, retrieving sensor data from a plurality of distance sensors, image sensors, thermal sensors, and various other sensors. The processor begins processing sensor data and determines whether the detected object is a two-dimensional (2D) or three-dimensional (3D) model by utilizing a plurality of Indirect Time of Flight (IToF) sensors, depth cameras, and image-capturing sensors. If an object is determined to be represented in a 2D model, the processor removes sensor data of the 2D model from the video conference framing, and the object may be removed from further tests for determining whether it is an organic being. If an object is determined to be a 3D model, the processor can include the object in the video conference framing and/or perform further tests to determine whether the object is an organic being. The plurality of IToF sensors, depth cameras, and image-capturing sensors can be housed within the electronic device, or located remotely and communicate with the electronic device via a wired or wireless communication system.
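The overall sequence can be pictured as a short early-exit pipeline; the following sketch assumes hypothetical predicate functions standing in for the sensor-driven tests described above and below.

```python
# Sketch of the sequence of tests applied to each detected object. An object
# classified as 2D is excluded immediately and skips the remaining tests;
# the predicate functions are placeholders for the sensor-driven checks.

def classify_object(obj, is_3d, is_live, in_region_of_interest):
    if not is_3d(obj):
        return "exclude"  # 2D model: remove from the video conference framing
    if not is_live(obj):
        return "exclude"  # fails liveness (e.g., mannequin, image on a wall)
    if not in_region_of_interest(obj):
        return "exclude"  # outside a defined region of interest
    return "include"      # treated as an organic being and framed
```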
Furthermore, the processor begins processing sensor data to determine the liveness of the detected object utilizing the plurality of image-capturing sensors and/or infrared (IR) cameras. The liveness test examines the facial characteristics of an object and can discern that a detected object comprising a face, such as an image on a wall, a digital representation of an organic being, or a mannequin, is not an organic being. The liveness test further uses heat signatures from attributed temperatures, captured from an infrared camera, to distinguish between an organic being and a mannequin. The heat signatures of the organic beings, as part of an environmental attribute, are further compared with those of other organic beings (e.g., a dog) within the FOV of the electronic device. The liveness test processes the detected heat signatures and identifies the heat signatures that correspond to that of a person.
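A minimal sketch of the thermal portion of such a liveness test follows; the assumption that the IR camera yields a mean face-region temperature in degrees Celsius, and the human temperature band itself, are illustrative rather than specified values.

```python
# Sketch of a heat-signature liveness check. Screens, photographs, and
# mannequins typically sit near ambient temperature, and other organic
# beings (e.g., a dog) show different signatures, so only readings inside
# an assumed human band pass the test.
HUMAN_TEMP_RANGE_C = (30.0, 38.0)  # illustrative band for human skin


def passes_thermal_liveness(mean_face_temp_c: float) -> bool:
    low, high = HUMAN_TEMP_RANGE_C
    return low <= mean_face_temp_c <= high
```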
Furthermore, the processor begins processing sensor data to determine the position of an object within the FOV of the electronic device. The position of the object can be sensed by distance sensors (e.g., infrared, LIDAR, ultrasonic) to determine the relative position of the object and whether that object is located within a defined region of interest. The position of an object of the plurality of objects can also be determined by bounding box object detection and pixel information from the instance segmentation. Based on any individual determination, or any combination of the 2D/3D, liveness, and position determinations, the processor rules to include or exclude a detected object from the video conference framing. The video conference framing creates a bounding box on the host application and displays the determined organic being within an individual bounding box during the video conference.
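A sketch of a bounding-box position check is below; the (x1, y1, x2, y2) pixel-coordinate box format and the centre-point test are assumptions for illustration.

```python
# Sketch of a position determination: a detection is kept only if the centre
# of its bounding box falls inside a configured region of interest.

def center_in_region(box, region):
    """box and region are (x1, y1, x2, y2) tuples in pixel coordinates."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    rx1, ry1, rx2, ry2 = region
    return rx1 <= cx <= rx2 and ry1 <= cy <= ry2
```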
Another example described herein provides an electronic device capable of processing additional sensor data to detect an organic being present within a conferencing room during a live conference call. In this example, the processor utilized to determine whether an object is an organic being is a network-based processing component or service, where the electronic device communicates, via a network protocol, the sensor data from the plurality of sensors to a cloud processor. Cloud processing can be utilized for any of the aforementioned processing completed by the onboard processor of the electronic device.
By way of illustration, as previously described, in a typical video conferencing system, the video conferencing device 108 creates an infinite loop of the image captured within the FOV 107. The loop framing 101 depicts this undesirable, infinitely repeating looping effect.
At block 201, the conferencing device begins by detecting objects within the FOV for a subsequent determination. The image sensors are used for detecting a plurality of objects within the FOV of the conferencing device by way of analyzing data collected from the image sensors. The processor identifies the plurality of objects within the FOV of the conferencing device by an algorithmic process applied to the captured data and assigns a unique identifier to each object within the plurality of objects.
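As a minimal illustration of block 201, the following sketch assigns a unique identifier to each mask produced by an instance-segmentation pass; the detector itself and the dictionary shape are hypothetical placeholders.

```python
# Sketch of block 201: tagging each segmented object with a unique
# identifier so it can be tracked across the FOV.
from itertools import count

_id_counter = count()  # monotonically increasing identifiers


def tag_detections(masks):
    """masks: per-object segmentation masks from an instance-segmentation
    model (placeholder for the device's onboard detector)."""
    return [{"object_id": next(_id_counter), "mask": mask} for mask in masks]
```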
At block 202, the conferencing device performs a test on the detected objects within the FOV of the conferencing device. Specifically, the conferencing device can determine whether the characteristics of the detected object are characteristic of a two-dimensional (2D) object or of a three-dimensional (3D) object, which may be generally referred to as a 2D/3D test. Illustratively, the 2D/3D tests can be performed by the conferencing device utilizing inputs from a plurality of Indirect Time of Flight (IToF) sensors, depth cameras, and image-capturing sensors. The conferencing device can process the sensor data to determine whether a detected object has characteristics or attributes that correspond to length, width, and height, in which case the object may be characterized as three-dimensional. Alternatively, if the detected object has characteristics of only length and width (and not height), the detected object may be characterized as two-dimensional (e.g., not likely an organic object).
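One way such a 2D/3D test could be sketched is as a depth-relief check over an IToF depth map: a flat object such as a wall-mounted screen shows little depth variation across its surface, while a person shows measurable relief. The threshold below is an illustrative assumption.

```python
# Sketch of a 2D/3D test over depth readings sampled inside an object's
# segmentation mask. A near-zero depth spread suggests a flat (2D) surface.
DEPTH_RELIEF_THRESHOLD_M = 0.03  # assumed minimum relief for a 3D object


def is_three_dimensional(depth_samples_m):
    """depth_samples_m: depth readings (in metres) from an IToF sensor,
    sampled across the detected object's surface."""
    spread = max(depth_samples_m) - min(depth_samples_m)
    return spread > DEPTH_RELIEF_THRESHOLD_M
```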
By way of example, the 2D/3D test on the detected person(s) (block 202) is completed by way of utilizing an IToF sensor included in the electronic device 311. As depicted in the
At block 203, the conferencing device performs a liveness test on the plurality of detected objects. The conferencing device can be configured to perform the liveness test on objects that have been detected to be 3D, or on all the objects that have been detected within the FOV of the conferencing device. Generally described, a liveness test can correspond to one or more tests or techniques that evaluate whether characteristics of a detected object correspond to real or actual signals generated by the detected object, as opposed to reproduced signals, such as those presented via a display mechanism.
By way of example, the liveness test on the detected person(s) (block 203) is conducted utilizing the image-capturing sensor, as depicted in
At block 204, the conferencing device performs a position determination of the plurality of objects detected and determines whether the liveness test results should be excluded from or included in the determination process based on the location of the objects within the FOV. Illustratively, the FOV of the conferencing device can be organized into one or more zones, such as according to defined geometric shapes, physical boundaries, non-geometric shapes, etc. In accordance with illustrative embodiments, individual zones or regions may be associated with a likelihood of whether a detected object will correspond to an organic or non-organic object. In some embodiments, individual zones or regions may be associated with a likelihood that an organic object may be present. For example, a zone or region associated with a speaking dais may be more likely to correspond to an organic object. In other embodiments, individual zones or regions may be associated with a likelihood that an organic object may not be present. For example, a zone or region associated with a display screen that is fixed to a wall may be associated with a lower or low likelihood that an organic object is present. In some embodiments, the zones may be configured utilizing some form of graphical user interface to elicit user inputs. In other embodiments, at least some portion of the zones may be predefined or pre-configured, such as during the installation or configuration of the conferencing device. In still other embodiments, at least some portion of the zones may be dynamically determined or dynamically updated based on the implementation of the processor determination routine 200 and consistent determination of non-organic objects or a lack of liveness (or low liveness).
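A minimal sketch of such zone priors, including the dynamic-update idea, might look like the following; the zone names, prior values, and update rate are illustrative assumptions.

```python
# Sketch of zone-based priors for block 204. Each configured zone carries a
# likelihood that an organic being appears there; repeated determinations
# can drift a zone's prior up or down over time.
ZONE_PRIORS = {
    "speaking_dais": 0.9,  # organic beings likely
    "wall_display": 0.1,   # fixed screen: organic beings unlikely
    "default": 0.5,
}


def zone_prior(zone_name: str) -> float:
    return ZONE_PRIORS.get(zone_name, ZONE_PRIORS["default"])


def update_zone_prior(zone_name: str, was_organic: bool, rate: float = 0.05):
    # Dynamic update: nudge the prior toward the observed outcome.
    prior = zone_prior(zone_name)
    target = 1.0 if was_organic else 0.0
    ZONE_PRIORS[zone_name] = prior + rate * (target - prior)
```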
By way of example, the position determination on the detected person(s) (block 204) is conducted utilizing a plurality of image-capturing sensors and/or a plurality of various sensors, as depicted in the
At block 205, the processor makes a determination of whether a detected object of the plurality of objects should be included within a video conference call, or whether the object should be excluded based on the characterized attributes. In some embodiments, a single characteristic or attribute (e.g., a characterization of a 2D object) may be controlling. In other embodiments, a combination of characteristics may be processed. For example, the conferencing device may utilize a weighting algorithm and thresholds to determine whether consideration of the characterizations or attributes as a whole is sufficient to characterize the object for exclusion or inclusion (or both). At block 206, the determined data is transmitted to a host conferencing application, where the objects that were determined to be an organic being are included within a framing feature. The determined data may be transmitted in accordance with established application programming interface (API) structures. The determined data can include various identifiers. Additionally, the determined data can include additional metadata, such as determined characteristics, data values, and the like.
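One way to sketch such a weighting algorithm, together with the kind of payload that might be handed to the host application, is shown below; the weights, threshold, and payload fields are illustrative assumptions rather than a defined API.

```python
# Sketch of the block 205 decision: combine the individual test results with
# a weighted score against a threshold, then package the outcome for the
# host conferencing application (block 206).
WEIGHTS = {"is_3d": 0.4, "is_live": 0.4, "in_zone": 0.2}
INCLUDE_THRESHOLD = 0.6


def decide_inclusion(object_id, results):
    """results: dict mapping test names (matching WEIGHTS) to booleans."""
    score = sum(WEIGHTS[name] for name, passed in results.items() if passed)
    return {
        "object_id": object_id,
        "include": score >= INCLUDE_THRESHOLD,
        "score": score,
        "metadata": results,  # determined characteristics for the host app
    }
```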
The electronic device, as mentioned previously, may comprise a plurality of image-capturing sensors and a plurality of various sensors. The plurality of image-capturing sensors and the plurality of various sensors are utilized for the determination of whether a detected object is an organic being, by way of the processor processing the sensor data captured by the plurality of image-capturing sensors and various sensors comprised in the electronic device. The determination of whether the detected object is an organic being, as mentioned previously, is made by way of the electronic device performing a 2D/3D test on the detected person(s) (as illustrated at block 202), performing a liveness test on the detected person(s) (as illustrated at block 203), performing a position determination on the detected person(s) (as illustrated at block 204), and the like. The processor can make the determination of whether an object is an organic being based on any individual one of these tests or on any combination of them. Accordingly, the order and sequential nature illustrated in processor determination routine 200 is illustrative and should not be construed as limiting.
When the processor determines that the detected person is an organic being and not a representation of an organic being, the processor includes the detected person(s) (at block 205) in the teleconference framing. An example of the teleconference framing can be observed as depicted in
The framing of an organic being in the FOV 310 of the electronic device 311 is contrasted by the lack of framing of the various objects 307, 308, and 309. The sensor data utilized to determine the presence of an organic being is also utilized to remove from the video framing objects that are not determined to be an organic being.
The digital screen 400 also displays a FOV 412 representation 402 of the electronic device 423. However, as observed in the FOV 412 representation 402 displayed on the digital screen 400, all displayed matter is removed from the representation of the digital screen 400. The object detection of the electronic device 423 determines that the object within the FOV 412 is the digital screen 400, and based on the captured sensor data, the processor can be configured to remove any displayed image from the FOV 412 representation 402 of the digital screen 400.
The video conferencing device 512 is also unable to prevent the continuous loop, as depicted on the digital screen 500, of the FOV 511 representation 502 of the conferencing device. As the digital screen 500 is within the FOV 511 of the conferencing device, the continuous capturing of the digital screen 500 within the FOV 511 creates an infinite loop of the FOV 511 representation 502.
The processor, as previously mentioned, can determine that the object within the FOV 611 is a digital screen and, based on the captured sensor data, the processor can be configured to remove any displayed image from the representation of the digital screen 602. As shown, Person 4 606 remains in the representation of the digital screen 602 and is determined to be an organic being, as indicated by the framing of Person 4 621. The processor, through the processing of the sensor data, frames the organic beings within the FOV 611 of the electronic device 613 by framing Person 1 618, framing Person 2 619, framing Person 3 620, and framing Person 4 621.
The input interface 709 provides the processor 707 with sensor data captured from the distance sensors 701 and 702, the image capturing sensors 703 and 704, and the various sensors 710. The input interface 709 can also accept input from an optional input device, such as a keyboard, mouse, digital pen, etc. In some cases, the electronic device 311 may include more (or fewer) components than those shown in
The output interface 705 can provide connectivity to a display configured to present the video conferencing event, where the electronic device is configured to provide a live video and audio stream to a host application, where the host application is presented on the screen.
The network interface 708 can provide connectivity to one or more networks or computing systems. The processor 707 can thus receive information and instructions from other computing systems or services via a network. The processor 707 can also communicate to and from memory 706 and further provide output information for the display via the output interface 705.
The memory 706 can correspond to a non-transitory computer-readable medium that includes computer program instructions that the processor 707 executes in order to implement one or more examples of the electronic device system. The memory 706 generally includes RAM, ROM, or other persistent or non-transitory memory. The memory 706 can store an operating system that provides computer program instructions for use by the processor 707. The memory 706 can further include computer program instructions and other information for implementing aspects of the electronic device system. For example, the memory 706 includes host application software for communicating with the computing devices or the conferencing services by the network interface 708.
The input interface 809 provides the processor 807 with sensor data captured from the distance sensors 801 and 802, the image capturing sensors 803 and 804, and the plurality of various sensors 811. The input interface 809 can also accept input from an optional input device, such as a keyboard, mouse, digital pen, etc. In some cases, the electronic device 800 may include more (or fewer) components than those shown in
The output interface 805 can provide connectivity to a display configured to present the video conferencing event, where the electronic device is configured to provide a live video and audio stream to a host application, where the host application is presented on the screen.
The network interface 808 can provide connectivity to one or more networks or computing systems. The processor 807 can thus receive information and instructions from other computing systems or services via a network. The processor 807 can also communicate to and from a memory 806 and further provide output information for the display via the output interface 805.
The memory 806 can correspond to non-transitory computer-readable media that includes computer program instructions that the processor 807 executes in order to implement one or more examples of the electronic device system. The memory 806 generally includes RAM, ROM, or other persistent or non-transitory memory. The memory 806 can store an operating system that provides computer program instructions for use by the processor 807. The memory 806 can further include computer program instructions and other information for implementing aspects of the electronic device system. For example, the memory 806 includes host application software for communicating with the electronic device 800 or a conferencing service by the network interface 808.
The electronic device can be configured to communicate with a cloud computer 810, where the cloud computer 810 processes data received from the network interface 808 and is utilized to offload the previously described processing requirements of the electronic device to the cloud processor. In an example of the electronic device offloading processing to a cloud processor, the processor transmits sensor data to the cloud processor, and the cloud processor makes a determination to remove the content of a digital screen as shown on the representation of the live FOV on the digital screen 602 of
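A minimal sketch of such an offload path follows; the endpoint URL, payload shape, and response format are hypothetical placeholders for whatever network protocol a given deployment uses.

```python
# Sketch of offloading the organic-being determination to a cloud processor.
# The device serialises its sensor data, posts it to a (hypothetical)
# endpoint, and applies the returned exclusion decisions to its framing.
import json
import urllib.request

CLOUD_ENDPOINT = "https://example.com/conference/determine"  # placeholder


def offload_determination(sensor_frames):
    """sensor_frames: JSON-serialisable dict of captured sensor data."""
    body = json.dumps(sensor_frames).encode("utf-8")
    request = urllib.request.Request(
        CLOUD_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # Assumed response: object identifiers to exclude from framing.
        return json.loads(response.read())
```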
Conditional language such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, is otherwise understood within the context as used in general to convey that certain examples include, while other examples do not include, certain features, elements, and/or blocks. Thus, such conditional language is not generally intended to imply that features, elements, and/or blocks are in any way required for any examples or that any example necessarily includes logic for deciding, with or without user input or prompting, whether these features, elements, and/or blocks are included or are to be performed in any particular example.
Disjunctive language such as the phrase “at least one of X, Y, or Z,” unless specifically stated otherwise, is otherwise understood within the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain examples require at least one of X, at least one of Y, or at least one of Z to each be present.
Any process descriptions, elements, or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code that include computer-executable instructions for implementing specific logical functions or elements in the process. Alternate implementations are included within the scope of the examples described herein, in which elements or functions may be deleted or executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.
Unless otherwise explicitly stated, articles such as “a” or “an” should generally be interpreted to include one or more described items. Accordingly, phrases such as “a device configured to” are intended to include one or more recited devices. Such one or more recited devices can also be collectively configured to carry out the stated recitations. For example, “a processor configured to carry out recitations A, B, and C” can include a first processor configured to carry out recitation A working in conjunction with a second processor configured to carry out recitations B and C.