This disclosure relates generally to attention monitoring of individuals participating in a meeting or videoconference.
Monitoring the attention of a group of listeners, be they students in a classroom or attendees of a meeting or seminar, is always challenging. Adding in a virtual element only makes the problem more difficult.
A method, apparatus, non-transitory processor readable memory, and system are provided for indicating session participant attention by determining the attention level of each participant in a session, providing a display of each participant, and providing an attention level indicator on the display of each participant indicating the attention level of the respective participant. In selected embodiments, the attention level of each participant is determined by determining a gaze direction of each participant. In other embodiments, the gaze direction is determined by using a neural network to develop facial keypoint values for each participant. In still other embodiments, the gaze direction is determined by using a neural network that detects a 3-D orientation of a head for each participant. In selected embodiments, the attention level indicator is provided on the display by performing one or more of the following: displaying a video stream of each participant in a frame that is color coded to indicate the attention level of the participant displayed in said frame; displaying first and second multi-color attention bars with a video stream of the participant, each multi-color attention bar representing a different period of time, where each multi-color attention bar includes a plurality of color sections indicating a plurality of different attention levels for the participant during the session, and where each color section has a length indicating a percentage of time at each respective attention level; displaying a video stream of each participant in a window that is tinted with a color indicating the attention level of the participant displayed in said window; displaying a video stream of each participant in a window that is blurred with a blurriness amount indicating the attention level of the participant displayed in said window; displaying a video stream of each participant in a window that is saturated with a saturation amount indicating the attention level of the participant displayed in said window; and displaying a video stream of each participant in a window that is sized with a relative window size indicating the attention level of the participant displayed in said window.
The present invention may be understood, and its numerous objects, features and advantages obtained, when the following detailed description of a preferred embodiment is considered in conjunction with the following drawings.
A system, apparatus, methodology, and computer program product are described for detecting or measuring the attention level of meeting participants, and then displaying the measured attention level as a value on a display of the meeting participants. In some examples, the participants, whether local or remote or mixed, are presented in a gallery view layout. The frame of each participant is colored, such as red, yellow or green, to indicate the attention level. In some examples the entire window is tinted in colors representing the attention level. In some examples the blurriness of the participant indicates attention level, the blurrier, the more attentive. In some examples, the saturation of the participant indicates attention level, the less saturated, the more attentive. In some examples, the window sizes vary based on attention level, the larger the window, the less attentive the participant. In some examples, color bars are added to provide indications of percentages of attention level over differing time periods. All of these displays allow the instructor or presenter to quickly determine the attention level of the participants and take appropriate actions.
In some examples neural networks are used to find the faces of the participants and then develop facial keypoint values. The facial keypoint values are used to determine gaze direction. The gaze direction is then used to develop an attention score. The attention score is then used to determine the settings of the layout in use, as described above, such as red, yellow or green frames.
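As a non-limiting illustration, this pipeline might be sketched as follows in Python, where find_faces, find_keypoints, and classify_gaze are hypothetical placeholders for the neural network stages and gaze logic, and the scoring rule is purely illustrative:

```python
def process_frame(video_frame, find_faces, find_keypoints, classify_gaze):
    """One pass of the pipeline: faces -> keypoints -> gaze -> attention score.

    find_faces, find_keypoints, and classify_gaze are hypothetical callables
    standing in for the neural network stages and gaze logic of this disclosure.
    """
    results = []
    for box in find_faces(video_frame):                # face finding neural network
        keypoints = find_keypoints(video_frame, box)   # facial keypoint values
        gaze = classify_gaze(keypoints)                # e.g., "POSE_CENTER", "POSE_LEFT", ...
        score = 1.0 if gaze == "POSE_CENTER" else 0.3  # simplest illustrative scoring rule
        results.append({"box": box, "score": score})
    return results
```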
In the drawings and the description of the drawings herein, certain terminology is used for convenience only and is not to be taken as limiting the examples of the present disclosure. In the drawings and the description below, like numerals indicate like elements throughout.
Throughout this disclosure, terms are used in a manner consistent with their use by those of skill in the art, for example:
Computer vision is an interdisciplinary scientific field that deals with how computers can be made to gain high-level understanding from digital images or videos. Computer vision seeks to automate tasks imitative of the human visual system. Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world to produce numerical or symbolic information. Computer vision is concerned with artificial systems that extract information from images. Computer vision includes algorithms which receive a video frame as input and produce data detailing the visual characteristics that a system has been trained to detect.
Machine learning includes neural networks. A convolutional neural network is a class of deep neural network which can be applied to analyzing visual imagery. A deep neural network is an artificial neural network with multiple layers between the input and output layers.
Artificial neural networks are computing systems inspired by the biological neural networks that constitute animal brains. Artificial neural networks exist as code being executed on one or more processors. An artificial neural network is based on a collection of connected units or nodes called artificial neurons, which mimic the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a ‘signal’ to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons connected to it. The signal at a connection is a real number, and the output of each neuron is computed by some nonlinear function of the sum of its inputs. The connections are called edges. Neurons and edges have weights, the value of which is adjusted as ‘learning’ proceeds and/or as new data is received by a state system. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold.
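As a simple illustration of the neuron computation described above (a nonlinear function of the weighted sum of the inputs), the following sketch uses arbitrary example values and a sigmoid nonlinearity:

```python
import math

def artificial_neuron(inputs, weights, bias):
    """Compute a neuron output: a nonlinear function of the weighted sum of its inputs."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid nonlinearity

# Example: three input signals with illustrative weights and bias.
output = artificial_neuron([0.5, 0.2, 0.9], [0.8, -0.4, 0.3], bias=0.1)
```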
Examples discussed in the present disclosure are applicable to online learning, but embodiments of the present invention can be applied to any scenario where the participants' attention is one of the important parameters for a meeting, conference, seminar, or workshop. In one example, teachers in an online learning system can work with students much more effectively and derive complete statistics of each student's attention. Those statistics can be utilized by teachers to improve the online education experience.
Examples according to this invention use head pose estimation or gaze detection to determine the attention level of video conference participants and use visual effects overlaid on transmitted video or received video to identify the attention level of each participant. The attention level can be indicated as a score, development of which is discussed below, with ranges of scores indicating high, medium and low levels of attention.
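As one hedged illustration, a numeric attention score might be bucketed into high, medium and low levels as sketched below; the specific thresholds are assumptions, not values prescribed by this disclosure:

```python
def attention_level(score):
    """Bucket an attention score in [0.0, 1.0] into high/medium/low (illustrative thresholds)."""
    if score >= 0.7:
        return "high"    # e.g., green frame
    if score >= 0.4:
        return "medium"  # e.g., yellow frame
    return "low"         # e.g., red frame
```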
In particular, examples according to this invention are applicable to the online education or meeting industry in general for use with participants having a personal camera with or without a codec. The camera is placed in front of the participants when the participants are engaged in the online education/seminar. Examples according to the invention are also applicable to group online education or meeting sessions where a high resolution camera is viewing the participants from the front of the room and has a clear view of each participant.
Face finding neural network operations are preferably performed on the view of the classroom setting. The neural network operations produce bounding boxes 21A-25A for each participant 11-15 as shown in the group view 2 of
The faces inside the bounding boxes 21A-25A are then obtained and placed in a gallery view format 3, where each individual 11-15 has a separate window or frame 31A-35A.
With the gallery view format 3, the teacher or speaker can quickly detect how well students or participants are paying attention to the lecture. If most of the rectangles 31A-37A are red, then the teacher might pause the lecture and ask the students to pay attention through the preferred method of communication. If all of the rectangles 31A-37A are green, then all the students are paying good attention.
In some examples, the colors used to indicate attention level span a spectrum from red to green, rather than discrete levels. For example, darker red indicates lesser amounts of attention. Other color ranges may be used.
The colors may also be used to indicate how long someone has lacked attention, rather than the level of the lack of attention. In such an approach, darker red may indicate that the participant has not been paying attention for a longer time, rather than a more severe lack of attention.
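A minimal sketch of such a continuous red-to-green mapping is shown below; the linear interpolation and the [0, 1] input range are assumptions for illustration, and the input may be an attention score or an inverted, normalized inattention duration:

```python
def spectrum_color(value):
    """Map a value in [0.0, 1.0] to an RGB color from red (0.0) to green (1.0).

    'value' can be an attention score, or (1 - normalized inattention duration)
    when the color encodes how long a participant has been inattentive.
    """
    v = max(0.0, min(1.0, value))
    red = int(255 * (1.0 - v))
    green = int(255 * v)
    return (red, green, 0)
```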
Colored frames may be used with the examples of
In
The values used to develop the color bars are the participant attention levels determined at periodic intervals and stored to develop the participant attention level over time.
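As an illustrative sketch, the stored periodic attention levels might be converted into the relative lengths of the color sections of an attention bar as follows; the level names and sampling cadence are assumptions:

```python
from collections import Counter

def attention_bar_sections(samples):
    """Given per-interval attention levels (e.g., "high"/"medium"/"low"), return
    the fraction of time spent at each level, i.e., the relative length of each
    color section of the multi-color attention bar."""
    if not samples:
        return {}
    counts = Counter(samples)
    return {level: count / len(samples) for level, count in counts.items()}

# Example: a short-term bar for the last few minutes and a long-term bar for the session.
recent_bar = attention_bar_sections(["high", "high", "medium", "low", "high"])
session_bar = attention_bar_sections(["high"] * 40 + ["medium"] * 15 + ["low"] * 5)
```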
In some examples, other cues are used to change scores to indicate a higher need for intervention by the presenter. For example, recognizing gestures that indicate a need for help, such as raised hands, a shaking head, and confused micro expressions, may indicate that the participant needs attention from the presenter, even if their “attention” score is high. These gestures can be determined by a neural network programmed to detect a given set of gestures used to indicate a help request. In such cases, the weights discussed below that are used to develop an attention score may be adjusted such that the presentation mechanisms emphasize the participant in the same manner as if the participant were not paying attention. In another example, a separate score is used to indicate the level of need for help, allowing the system to call out those asking for assistance to the presenter as potentially needing intervention. This second score can lead to different colors being used on the frame than the attention colors. In one example, blue or purple is used to indicate someone is using gestures to ask for help, even if they are otherwise paying attention. In another example, color intensity or saturation is used to indicate how long that participant has been indicating a need for special help. In other examples, particularly if separate scores are being developed, two colored frames, an inner frame and an outer frame, can be used, one indicating attention level and the other indicating requested assistance level. In some examples, scores for both intensity of distraction, the inverse of attention, and duration of distraction may be combined using various weighting approaches to develop the attention level score.
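One possible sketch, among the various approaches mentioned above, of combining distraction intensity and duration into an attention level score and keeping a separate help-request indication is shown below; the weights, thresholds and colors are illustrative assumptions:

```python
def combined_attention_score(intensity_of_distraction, duration_of_distraction,
                             w_intensity=0.6, w_duration=0.4):
    """Combine distraction intensity and duration (both normalized to [0, 1])
    into an attention score, where 1.0 means fully attentive."""
    distraction = w_intensity * intensity_of_distraction + w_duration * duration_of_distraction
    return 1.0 - min(1.0, distraction)

def frame_colors(attention_score, help_requested):
    """Return (inner, outer) frame colors: one for attention, one for requested assistance."""
    inner = "green" if attention_score >= 0.7 else "yellow" if attention_score >= 0.4 else "red"
    outer = "blue" if help_requested else None  # e.g., blue outer frame flags a help request
    return inner, outer
```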
As described above, neural networks develop bounding boxes of the participants. Software operations combined with further neural network operations are used to develop attention scores.
The student or attendee attention algorithm includes the following steps, with a minimal code sketch provided after the list.
At step 10-1, the method starts.
At step 10-2, the gaze direction is detected by using facial keypoints.
At step 10-3, the attention score is calculated based on the detected gaze direction.
At step 10-4, the image buffer is updated and the attention statistics are dumped.
At step 10-5, the method ends.
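A minimal sketch of this per-interval loop is shown below, where capture_frame, detect_gaze, and score_from_gaze are hypothetical placeholders for the camera capture, keypoint-based gaze detection, and scoring steps, and the measurement interval and output file are assumptions:

```python
import json
import time

def run_attention_loop(capture_frame, detect_gaze, score_from_gaze,
                       interval_seconds=5.0, stats_path="attention_stats.json"):
    """Steps 10-1 through 10-5: detect gaze from facial keypoints, compute the
    attention score, update the buffers, and dump the attention statistics."""
    statistics = []                                    # step 10-1: start
    try:
        while True:
            frame = capture_frame()                    # acquire the current image
            gaze = detect_gaze(frame)                  # step 10-2: gaze from facial keypoints
            score = score_from_gaze(gaze)              # step 10-3: attention score
            statistics.append({"time": time.time(), "gaze": gaze, "score": score})
            with open(stats_path, "w") as f:           # step 10-4: update buffer, dump statistics
                json.dump(statistics, f)
            time.sleep(interval_seconds)
    except KeyboardInterrupt:
        pass                                           # step 10-5: end
```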
To provide additional details for an improved understanding of selected embodiments of the present disclosure, reference is now made to
POSE_CENTER: when the participant faces toward the camera
POSE_LEFT: when the participant faces to the left of the camera
POSE_RIGHT: when the participant faces to the right of the camera
POSE_UP: when the participant faces upward relative to the camera
POSE_DOWN: when the participant faces downward relative to the camera
The gaze direction detection is mainly divided into two steps. The first step is to decide whether the gaze is in a center direction or a side direction (left/right). To check that, the logic processing step 11-2 computes the min score for the left side and the right side. The min left score is the min of leftEyeScore and leftEarScore (e.g., leftScore=min(leftEyeScore, leftEarScore)). The min right score is the min of rightEyeScore and rightEarScore (e.g., rightScore=min(rightEyeScore, rightEarScore)). Then, the logic processing step 11-2 computes the ratio Scale value from these two scores (e.g., Scale=max(leftScore, rightScore)/min(leftScore, rightScore)). If the logic processing step 11-3 determines that the ratio Scale value is bigger than SideThreshold (negative outcome from step 11-3), then the gaze direction is a side view, not a center view, and at logic processing step 11-4, the side view gaze direction is evaluated by comparing the leftScore and rightScore. If (leftScore<rightScore), then the gaze direction is POSE_LEFT (outcome step 11-6); otherwise the gaze direction is POSE_RIGHT (outcome step 11-7).

Referring back to the logic processing step 11-3, if the ratio Scale value is less than the SideThreshold value (affirmative outcome from step 11-3), then the gaze direction is a center view (outcome step 11-5), at which point additional logic processing steps check further whether the direction is Up, Down, or Frontal. To check if the gaze direction is up, the logic processing step 11-5 computes the maximum of the leftEyeScore and rightEyeScore (e.g., maxEyeScore=max(leftEyeScore, rightEyeScore)). If the logic processing step 11-8 determines that the maxEyeScore is greater than the upper threshold (UpThreshold) (e.g., maxEyeScore>UpThreshold) (affirmative outcome from step 11-8), then the gaze direction is POSE_UP (outcome step 11-9). However, if the logic processing step 11-8 determines that the maxEyeScore is not greater than the upper threshold (e.g., maxEyeScore<UpThreshold) (negative outcome from step 11-8), then logic processing steps check if the gaze direction is down. To check the down gaze direction, the logic processing step 11-10 computes the maxEye value (e.g., maxEye=max(leftEye.y, rightEye.y)) and the maxEar value (e.g., maxEar=max(leftEar.y, rightEar.y)). If the logic processing step 11-11 determines that (maxEar<maxEye && maxEye<nose.y) (affirmative outcome from step 11-11), then the gaze direction is POSE_DOWN (outcome step 11-12). But if the logic processing step 11-11 does not determine that maxEar<maxEye && maxEye<nose.y (negative outcome from step 11-11), then the gaze direction is Frontal or centered, and the processing logic steps end (step 11-13).
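The same logic can be sketched in code as follows; the keypoint data format and the SideThreshold and UpThreshold default values are assumptions for illustration:

```python
def classify_gaze(kp, side_threshold=1.5, up_threshold=0.9):
    """Classify gaze direction from facial keypoints (steps 11-2 through 11-13).

    'kp' is assumed to map keypoint names ("leftEye", "rightEye", "leftEar",
    "rightEar", "nose") to dicts with a detection "score" and an image "y"
    coordinate; the threshold defaults are illustrative."""
    left_score = min(kp["leftEye"]["score"], kp["leftEar"]["score"])
    right_score = min(kp["rightEye"]["score"], kp["rightEar"]["score"])
    scale = max(left_score, right_score) / max(min(left_score, right_score), 1e-6)

    if scale > side_threshold:                                 # side view (step 11-4)
        return "POSE_LEFT" if left_score < right_score else "POSE_RIGHT"

    max_eye_score = max(kp["leftEye"]["score"], kp["rightEye"]["score"])
    if max_eye_score > up_threshold:                           # step 11-8
        return "POSE_UP"

    max_eye_y = max(kp["leftEye"]["y"], kp["rightEye"]["y"])
    max_ear_y = max(kp["leftEar"]["y"], kp["rightEar"]["y"])
    if max_ear_y < max_eye_y and max_eye_y < kp["nose"]["y"]:  # step 11-11
        return "POSE_DOWN"
    return "POSE_CENTER"                                       # frontal or centered
```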
To provide additional details for an improved understanding of selected embodiments of the present disclosure, reference is now made to
To provide additional details for an improved understanding of selected embodiments of the present disclosure, reference is now made to
To provide additional details for an improved understanding of selected embodiments of the present disclosure, reference is now made to
To provide additional details for an improved understanding of selected embodiments of the present disclosure, reference is now made to
As will be appreciated, additional or different threshold values can be used. For displays like blurring, saturation or window size, the attention score can be used directly in determining the amount of effect applied.
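As a hedged sketch, the attention score might be mapped directly to effect amounts as follows, consistent with the examples above in which a more attentive participant is rendered blurrier, less saturated, and in a smaller window; the specific ranges are illustrative:

```python
def effect_amounts(score, max_blur_radius=12, min_window_scale=0.6):
    """Map an attention score in [0.0, 1.0] directly to display-effect amounts.

    More attentive participants are de-emphasized (blurrier, less saturated,
    smaller window) so that inattentive participants stand out to the presenter.
    The parameter ranges are assumptions."""
    s = max(0.0, min(1.0, score))
    return {
        "blur_radius": max_blur_radius * s,                  # more attentive -> blurrier
        "saturation": 1.0 - s,                               # more attentive -> less saturated
        "window_scale": 1.0 - (1.0 - min_window_scale) * s,  # more attentive -> smaller window
    }
```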
To provide additional details for an improved understanding of selected embodiments of the present disclosure, reference is now made to
At the end of the session, the session statistics file can also be sent to the teacher or presenter for further analysis and to generate valuable information about students/participants. As disclosed herein, the attention statistics in the session statistics file can include information identifying the student(s)/participant(s), the session start and end dates/times, and a listing of individual and cumulative attention scores from a plurality of attention measurement intervals for each student/participant.
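A hedged example of what such a session statistics file might contain is shown below; the layout and field names are assumptions rather than a format prescribed by this disclosure:

```python
# Illustrative structure for a session statistics file (values are made up).
session_statistics = {
    "session": {"start": "2021-09-01T09:00:00Z", "end": "2021-09-01T09:50:00Z"},
    "participants": [
        {
            "id": "student-11",
            "interval_scores": [0.9, 0.8, 0.4, 0.2, 0.7],  # one score per measurement interval
            "cumulative_score": 0.6,
        },
    ],
}
```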
In this description, head pose has been used as the metric for determining attention, but other metrics (for example, eye gaze position, eye gaze dwell time, eye movement, micro expressions, non-visual cues (e.g., audio), etc.) may be used in the attention determination calculation. And in other embodiments, the attention determination calculation may weigh several of these metrics. In addition, the attention determination calculation may incorporate time-based metrics (for example, sliding windows of attention, etc.) to determine a score that will be used for the attention level of the participant, and to determine how that participant is presented, allowing the presenter or instructor to know which participants may require intervention.
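A minimal sketch of weighting several metrics and applying a sliding window is shown below; the metric names, weights and window length are illustrative assumptions:

```python
from collections import deque

def weighted_attention(metrics, weights=None):
    """Combine several normalized metrics (head pose, eye gaze, audio cues, ...)
    into one score; the metric names and weights are illustrative."""
    weights = weights or {"head_pose": 0.5, "eye_gaze": 0.3, "audio": 0.2}
    return sum(weights[name] * metrics.get(name, 0.0) for name in weights)

class SlidingAttention:
    """Average the last N interval scores so the displayed level reflects recent behavior."""
    def __init__(self, window=6):
        self.scores = deque(maxlen=window)

    def update(self, score):
        self.scores.append(score)
        return sum(self.scores) / len(self.scores)
```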
Here is a description of a few typical use cases:
School Online Courses—the software on the teacher side receives student session data for each class. The software generates an overall summary of attention statistics for each student in each class. This information should be helpful for the teacher in building student report cards.
Meeting participant attention analysis—senior-level managers spend much of their time in online video meetings, and the efficiency of those meetings is very important. The technique described in this disclosure can be applied to analyze participant attention in the meetings and help improve their efficiency. For example, consider the case of a recurring meeting with 12 participants. The organizer collects the attention report for each participant after the meeting. If some participants never show good attention, the organizer might remove those participants from the meeting since they are not interested anyway.
To provide additional details for an improved understanding of selected embodiments of the present disclosure, reference is now made to
Referring now to
The processing unit 1102 can include digital signal processors (DSPs), central processing units (CPUs), graphics processing units (GPUs), dedicated hardware elements, such as neural network accelerators and hardware codecs, and the like in any desired combination.
The flash memory 1104 stores modules of varying functionality in the form of software and firmware, generically programs, for controlling the codec 1100. Illustrated modules include a video codec 1150, camera control 1152, face and body finding 1153, neural network models 1155, framing 1154, other video processing 1156, attention processing 1157, audio codec 1158, audio processing 1160, network operations 1166, user interface 1168 and operating system and various other modules 1170. The RAM 1105 is used for storing any of the modules in the flash memory 1104 when the module is executing, storing video images of video streams and audio samples of audio streams and can be used for scratchpad operation of the processing unit 1102. The face and body finding 1153 and neural network models 1155 are used in the various operations of the codec 1100, such as the face detection and gaze detection. The attention processing module 1157 performs the operations of
The network interface 1108 enables communications between the codec 1100 and other devices and can be wired, wireless or a combination. In one example, the network interface 1108 is connected or coupled to the Internet 1130 to communicate with remote endpoints 1140 in a videoconference. In one or more examples, the general interface 1110 provides data transmission with local devices such as a keyboard, mouse, printer, projector, display, external loudspeakers, additional cameras, and microphone pods, etc.
In one example, the cameras 1116A, 1116B, 1116C and the microphones 1114 capture video and audio, respectively, in the videoconference environment and produce video and audio streams or signals transmitted through the bus 1115 to the processing unit 1102. In at least one example of this disclosure, the processing unit 1102 processes the video and audio using algorithms in the modules stored in the flash memory 1104. Processed audio and video streams can be sent to and received from remote devices coupled to network interface 1108 and devices coupled to general interface 1110. This is just one example of the configuration of a codec 1100.
Referring now to
The processing unit 1202 can include digital signal processors (DSPs), central processing units (CPUs), graphics processing units (GPUs), dedicated hardware elements, such as neural network accelerators and hardware codecs, and the like in any desired combination.
The flash memory 1204 stores modules of varying functionality in the form of software and firmware, generically programs, for controlling the camera 1200. Illustrated modules include camera control 1252, sound source localization 1260 and operating system and various other modules 1270. The RAM 1205 is used for storing any of the modules in the flash memory 1204 when the module is executing, storing video images of video streams and audio samples of audio streams and can be used for scratchpad operation of the processing unit 1202.
In a second configuration, only the main camera 1116B includes the microphone array 1214 and the sound source localization module 1260. Cameras 1116A, 1116C are then just simple cameras. In a third configuration, the main camera 1116B is built into the codec 1100, so that the processing unit 1202, the flash memory 1204, RAM 1205 and I/O interface 1210 are those of the codec 1100, with the imager interface 1218 and A/D 1212 connected to the bus 1115.
Other configurations, with differing components and arrangement of components, are well known for both videoconferencing endpoints and for devices used in other manners.
Referring now to
A graphics acceleration module 1324 is connected to the high speed interconnect 1308. A display subsystem 1326 is connected to the high speed interconnect 1308 to allow operation with and connection to various video monitors. A system services block 1332, which includes items such as DMA controllers, memory management units, general purpose I/O's, mailboxes and the like, is provided for normal SoC 1300 operation. A serial connectivity module 1334 is connected to the high speed interconnect 1308 and includes modules as normal in an SoC. A vehicle connectivity module 1336 provides interconnects for external communication interfaces, such as PCIe block 1338, USB block 1340 and an Ethernet switch 1342. A capture/MIPI module 1344 includes a four-lane CSI-2 compliant transmit block 1346 and a four-lane CSI-2 receive module and hub.
An MCU island 1360 is provided as a secondary subsystem and handles operation of the integrated SoC 1300 when the other components are powered down to save energy. An MCU ARM processor 1362, such as one or more ARM R5F cores, operates as a master and is coupled to the high speed interconnect 1308 through an isolation interface 1361. An MCU general purpose I/O (GPIO) block 1364 operates as a slave. MCU RAM 1366 is provided to act as local memory for the MCU ARM processor 1362. A CAN bus block 1368, an additional external communication interface, is connected to allow operation with a conventional CAN bus environment in a vehicle. An Ethernet MAC (media access control) block 1370 is provided for further connectivity. External memory, generally non-volatile memory (NVM) such as flash memory 104, is connected to the MCU ARM processor 1362 via an external memory interface 1369 to store instructions loaded into the various other memories for execution by the various appropriate processors. The MCU ARM processor 1362 operates as a safety processor, monitoring operations of the SoC 1300 to ensure proper operation of the SoC 1300.
It is understood that this is one example of an SoC provided for explanation and many other SoC examples are possible, with varying numbers of processors, DSPs, accelerators and the like.
By using face finding and gaze detection, attention levels of conference or class participants are developed. The attention levels for each participant are provided on a display for the use of the teacher, instructor or presenter. Numerous options are described for providing the display, one being providing the participants in a gallery view format and color coding the frame of each participant window to the appropriate attention level. Tinting, blurring or saturating the participant window can be used to display the participant attention level. Window size can be varied based on attention level. Multiple color bars can be used to provide the attention level percentages for different time periods for each participant. All of these alternative display formats provide feedback to the instructor, teacher or presenter of the attention level of the participants, allowing the instructor, teacher or presenter to take remedial action as needed.
Computer program instructions may be stored in a non-transitory processor readable memory that can direct a computer or other programmable data processing apparatus, processor or processors, to function in a particular manner, such that the instructions stored in the non-transitory processor readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The present invention is well adapted to attain the advantages mentioned as well as others inherent therein. While the present invention has been depicted, described, and is defined by reference to particular embodiments of the invention, such references do not imply a limitation on the invention, and no such limitation is to be inferred. The invention is capable of considerable modification, alteration, and equivalents in form and function, as will occur to those ordinarily skilled in the pertinent arts. The depicted and described embodiments are examples only and are not exhaustive of the scope of the invention.
Consequently, the invention is intended to be limited only by the spirit and scope of the appended claims, giving full cognizance to equivalents in all respects.
The various examples described are provided by way of illustration and should not be construed to limit the scope of the disclosure. Various modifications and changes can be made to the principles and examples described herein without departing from the scope of the disclosure and without departing from the claims which follow.
This application claims the benefit of U.S. Provisional Patent Application No. 63/260,564 entitled “System and Method for Attention Detection and Visualization,” filed Aug. 25, 2021, which is incorporated by reference in its entirety as is fully set forth herein.