This application claims priority to Swedish Application No. 1950727-6, filed Jun. 14, 2019; the content of which is hereby incorporated by reference.
The present application relates to user gaze detection systems and methods, in particular user gaze detection systems configured to receive user input. In an example, such systems and methods use trained models, such as neural networks, to identify a space that a user of a gaze tracking system is viewing.
Interaction with computing devices is a fundamental action in today's world. Computing devices, such as personal computers, tablets, and smartphones, are found throughout daily life. The systems and methods for interacting with such devices define how they are used and what they are used for.
Advances in eye/gaze tracking technology have made it possible to interact with a computer/computing device using a person's gaze information. For example, the location on a display that the user is gazing at may be used as input to the computing device. This input can be used for interaction on its own, or in combination with a contact-based interaction technique (e.g., using a user input device, such as a keyboard, a mouse, a touch screen, or another input/output interface).
The accuracy of a gaze tracking system is highly dependent on the individual using the system. A system may perform extraordinarily well for most users, but for some individuals it may have a hard time even getting the gaze roughly right.
Attempts have been made to expand existing gaze tracking techniques to rely on trained models, e.g. neural networks, to perform gaze tracking. However, the accuracy of the gaze tracking varies, and such models may perform poorly for some specific individuals. The trained model may have a hard time tracking their gaze, and may not even get the gaze estimate roughly right.
A drawback of such conventional gaze tracking systems is that a gaze signal is always output, no matter how poor it is. In other words, a gaze signal or estimate will be provided even when the quality or confidence level of the gaze signal or estimate is so low that it is close to a uniformly random estimate of the gaze. A computer/computing device using the gaze signal or estimate has no means of knowing that the provided gaze signal or estimate is not to be trusted, which may lead to unwanted results.
Thus, there is a need for an improved method for performing gaze tracking.
An objective of embodiments of the present invention is to provide a solution which mitigates or solves the drawbacks described above.
The above objective is achieved by the subject matter described herein. Further advantageous implementation forms of the invention are described herein.
According to a first aspect of the invention, the objective of the invention is achieved by a method performed by a computer for identifying a space that a user of a gaze tracking system is viewing, the method comprising: obtaining gaze tracking sensor data, generating gaze data comprising a probability distribution by processing the sensor data with a trained model, and identifying a space that the user is viewing using the probability distribution.
At least one advantage of the first aspect of the invention is that the reliability of user input can be improved by providing gaze tracking applications with a gaze estimate and an associated confidence level.
In a first embodiment of the first aspect, the space comprises a region, wherein the probability distribution is indicative of a plurality of regions, each region having related confidence data indicative of a confidence level that the user is viewing the region.
In a second embodiment according to the first embodiment, the plurality of regions forms a grid representing a display the user is viewing.
In a third embodiment according to the first or second embodiment, identifying the space the user is viewing comprises selecting a region, from the plurality of regions, having a highest confidence level.
In a fourth embodiment according to the second or third embodiment, the method further comprises determining a gaze point using the selected region.
In a fifth embodiment according to the first embodiment, each region of the plurality of regions is arranged spatially separate and representing an object that the user is potentially viewing, wherein said object is a real object and/or a virtual object.
In a sixth embodiment according to the fifth embodiment, identifying the region the user is viewing comprises selecting a region of the plurality of regions having a highest confidence level.
In a seventh embodiment according to the sixth embodiment, the method further comprises selecting an object using the selected region.
In an eighth embodiment according to the seventh embodiment, the method further comprises determining a gaze point using the selected region and/or the selected object.
In a ninth embodiment according to any of the first to the eighth embodiment, the objects are displays and/or input devices, such as a mouse or a keyboard.
In a tenth embodiment according to any of the first to eighth embodiment, the objects are different interaction objects comprised in a car, such as mirrors, a center console and a dashboard.
In an eleventh embodiment according to any of the preceding embodiments, the space comprises a gaze point, wherein the probability distribution is indicative of a plurality of gaze points, each gaze point having related confidence data indicative of a confidence level that the user is viewing the gaze point.
In a twelfth embodiment according to the eleventh embodiment, identifying the space the user is viewing comprises selecting a gaze point of the plurality of gaze points having a highest confidence level.
In a thirteenth embodiment according to any of the preceding embodiments, the space comprises a three-dimensional gaze ray defined by a gaze origin and a gaze direction, wherein the probability distribution is indicative of a plurality of gaze rays, each gaze ray having related confidence data indicative of a confidence level that the direction the user is viewing coincides with the gaze direction of a respective gaze ray.
In a fourteenth embodiment according to the thirteenth embodiment, identifying the space the user is viewing comprises selecting a gaze ray of the plurality of gaze rays having a highest confidence level.
In a fifteenth embodiment according to the fourteenth embodiment, the method further comprises determining a gaze point using the selected gaze ray and a surface.
In a sixteenth embodiment according to any of the preceding embodiments, the trained model comprises any one of a neural network, a boosting-based regressor, a support vector machine, a linear regressor and/or a random forest.
In a seventeenth embodiment according to any of the preceding embodiments, the probability distribution comprised by the trained model is selected from any one of a Gaussian distribution, a mixture of Gaussian distributions, a von Mises distribution, a histogram and/or an array of confidence values.
According to a second aspect of the invention, the objective of the invention is achieved by a computer, the computer comprising:
an interface to one or more image sensors, a processor; and
a memory, said memory containing instructions executable by said processor, whereby said computer is operative to perform the method according to the first aspect.
According to a third aspect of the invention, the objective of the invention is achieved by a computer program comprising computer-executable instructions for causing a computer, when the computer-executable instructions are executed on processing circuitry comprised in the computer, to perform any of the method steps according to the first aspect.
According to a fourth aspect of the invention, the objective of the invention is achieved by a computer program product comprising a computer-readable storage medium, the computer-readable storage medium having the computer program according to the third aspect embodied therein.
The scope of the invention is defined by the claims, which are incorporated into this section by reference. A more complete understanding of embodiments of the invention will be afforded to those skilled in the art, as well as a realization of additional advantages thereof, by a consideration of the following detailed description of one or more embodiments. Reference will be made to the appended sheets of drawings that will first be described briefly.
It should be appreciated that like reference numerals are used to identify like elements illustrated in one or more of the figures.
An “or” in this description and the corresponding claims is to be understood as a mathematical OR which covers both “and” and “or”, and is not to be understood as an XOR (exclusive OR). The indefinite article “a” in this disclosure and claims is not limited to “one” and can also be understood as “one or more”, i.e., plural.
In the example shown in
The illuminators 211 and 212 may, for example, be light emitting diodes emitting light in the infrared frequency band or in the near-infrared frequency band. The image sensor 213 may, for example, be a camera, such as a complementary metal oxide semiconductor (CMOS) camera or a charge-coupled device (CCD) camera. The camera is not limited to an IR camera, a depth camera or a light-field camera. The shutter mechanism of the image sensor can be either a rolling shutter or a global shutter.
The first illuminator 211 may be arranged coaxially with (or close to) the image sensor 213 so that the image sensor 213 may capture bright pupil images of the user's eyes. Due to the coaxial arrangement of the first illuminator 211 and the image sensor 213, light reflected from the retina of an eye returns back out through the pupil towards the image sensor 213, so that the pupil appears brighter than the iris surrounding it in images where the first illuminator 211 illuminates the eye. The second illuminator 212 is arranged non-coaxially with (or further away from) the image sensor 213 for capturing dark pupil images. Due to the non-coaxial arrangement of the second illuminator 212 and the image sensor 213, light reflected from the retina of an eye does not reach the image sensor 213 and the pupil appears darker than the iris surrounding it in images where the second illuminator 212 illuminates the eye. The illuminators 211 and 212 may for example, take turns to illuminate the eye, so that every first image is a bright pupil image, and every second image is a dark pupil image.
The eye tracking system 200 also comprises processing circuitry 221 (for example including one or more processors) for processing the images captured by the image sensor 213. The circuitry 221 may for example, be connected/communicatively coupled to the image sensor 213 and the illuminators 211 and 212 via a wired or a wireless connection. In another example, the processing circuitry 221 is in the form of one or more processors and may be provided in one or more stacked layers below the light sensitive surface of the image sensor 213.
The computer 220 may further comprise a communications interface 224, e.g. a wireless transceiver 224 and/or a wired/wireless communications network adapter, which is configured to send and/or receive data values or parameters as a signal between the processing circuitry 221 and other computers and/or other communication network nodes or units, e.g. to/from the at least one image sensor 213 and/or to/from a server. In an embodiment, the communications interface 224 communicates directly with control units, sensors and other communication network nodes, or via a communications network. The communications interface 224, such as a transceiver, may be configured for wired and/or wireless communication. In embodiments, the communications interface 224 communicates using wired and/or wireless communication techniques. The wired or wireless communication techniques may comprise any of a CAN bus, Bluetooth, WiFi, GSM, UMTS, LTE or LTE Advanced communications network, or any other wired or wireless communication network known in the art.
In one or more embodiments, the computer 220 may further comprise a dedicated sensor interface 223, e.g. a wireless transceiver and/or a wired/wireless communications network adapter, which is configured to send and/or receive data values or parameters as a signal to or from the processing circuitry 221, e.g. gaze signals to/from the at least one image sensor 213.
Further, the communications interface 224 may comprise at least one optional antenna (not shown in the figure). The antenna may be coupled to the communications interface 224 and is configured to transmit and/or emit and/or receive wireless signals in a wireless communication system, e.g. to send/receive control signals to/from the one or more sensors or any other control unit or sensor. In embodiments including the sensor interface 223, at least one optional antenna (not shown in the figure) may be coupled to the sensor interface 223 and configured to transmit and/or emit and/or receive wireless signals in a wireless communication system.
In one example, the processing circuitry 221 may be any of a selection of a processor and/or a central processing unit and/or processor modules and/or multiple processors configured to cooperate with each other. Further, the computer 220 may comprise a memory 222.
In one example, the one or more memory 222 may comprise a selection of a RAM, a hard disk drive, a floppy disk drive, a magnetic tape drive, an optical disk drive, a CD or DVD drive (R or RW), or other removable or fixed media drive. The memory 222 may contain instructions executable by the processing circuitry to perform any of the methods and/or method steps described herein.
In one or more embodiments the computer 220 may further comprise an input device 227, configured to receive input or indications from a user and send a user-input signal indicative of the user input or indications to the processing circuitry 221.
In one or more embodiments the computer 220 may further comprise a display 228 configured to receive a display signal indicative of rendered objects, such as text or graphical user input objects, from the processing circuitry 221 and to display the received signal as objects, such as text or graphical user input objects.
In one embodiment the display 228 is integrated with the user input device 227 and is configured to receive a display signal indicative of rendered objects, such as text or graphical user input objects, from the processing circuitry 221 and to display the received signal as objects, such as text or graphical user input objects, and/or configured to receive input or indications from a user and send a user-input signal indicative of the user input or indications to the processing circuitry 221.
In embodiments, the processing circuitry 221 is communicatively coupled to the memory 222 and/or the sensor interface 223 and/or the communications interface 224 and/or the input device 227 and/or the display 228 and/or the at least one image sensor 213. The computer 220 may be configured to receive the sensor data directly from the at least one image sensor 213 or via the wired and/or wireless communications network.
In a further embodiment, the computer 220 may further comprise and/or be coupled to one or more additional sensors (not shown) configured to receive and/or obtain and/or measure physical properties pertaining to the user or environment of the user and send one or more sensor signals indicative of the physical properties to the processing circuitry 221, e.g. sensor data indicative of ambient light.
The computer 760, described herein may comprise all or a selection of the features described in relation to
The server 770, described herein may comprise all or a selection of the features described in relation to
In one embodiment, a computer 220 is provided. The computer 220 comprises an interface 223, 224 to one or more image sensors 213, a processor 221, and a memory 222, said memory 222 containing instructions executable by said processor 221, whereby said computer is operative to perform any method steps of the method described herein.
Step 310: obtaining gaze tracking sensor data.
The image or gaze tracking sensor data may be received, comprised in signals or gaze signals, e.g. wireless signals, from the at least one image sensor 213 of the eye tracking unit 210.
Additionally or alternatively, the gaze tracking sensor data may be received from another node or communications node, e.g. from the computer 220. Additionally or alternatively, the gaze tracking sensor data may be retrieved from memory.
Step 320: generating gaze data comprising a probability distribution using the sensor data by processing the sensor data by a trained model.
In one embodiment, the trained model comprises a selection of any of a neural network (such as a convolutional neural network, CNN), a boosting-based regressor (such as a gradient boosted regressor, gentle boost or adaptive boost), a support vector machine, a linear regressor and/or a random forest.
In one embodiment, the probability distribution comprises a selection of any of a Gaussian distribution, a mixture of Gaussian distributions, a von Mises distribution, a histogram and/or an array of confidence values.
In one example, the wired/wireless signals from the at least one image sensor 213 are received by the computer 220 and the gaze tracking sensor data is extracted from the signals, e.g. by demodulating/decoding an image depicting a user's eyes from the signals. The gaze tracking sensor data is then fed to and processed by the trained model, e.g. a convolutional neural network. The trained model then outputs gaze data, e.g. a two-dimensional isotropic Gaussian probability distribution of gaze positions, as illustrated by the non-limiting sketch below.
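By way of a non-limiting illustration only, the following Python sketch shows one possible way such a trained model could be structured, here assuming a PyTorch-style convolutional network whose head outputs a two-dimensional mean vector and a standard deviation; the class name GazeNet, the layer sizes and the input format are illustrative assumptions and not part of the disclosure.

```python
# Minimal sketch (assumption): a PyTorch-style network whose head outputs the
# parameters of a two-dimensional isotropic Gaussian over gaze positions,
# i.e. a mean vector (mu_x, mu_y) and a single standard deviation sigma.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeNet(nn.Module):  # hypothetical name, not from the disclosure
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 3)  # (mu_x, mu_y, raw_sigma)

    def forward(self, eye_image):
        h = self.features(eye_image)
        out = self.head(h)
        mu = out[:, :2]                       # gaze estimate (mean vector)
        sigma = F.softplus(out[:, 2]) + 1e-6  # positive standard deviation
        return mu, sigma

# Usage: a batch of grayscale eye images -> gaze estimates and confidence proxies.
model = GazeNet()
mu, sigma = model(torch.randn(4, 1, 64, 64))
```

The softplus activation in the sketch merely keeps the standard deviation positive; any equivalent parameterization could be used.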
This probabilistic output is in contrast to conventional systems that typically provide a single point. Practically this means that, instead of letting the trained model output a two-dimensional vector for each gaze point (x, y), it outputs a two-dimensional mean vector (μ) and a standard deviation (σ).
The probability distribution over y can then be described according to the relation:
p(y|x,θ) = N(y|μ_θ(x), σ_θ(x))
where x is the input, y are the labels (stimulus points) of the trained model, N denotes the normal (Gaussian) distribution, and θ denotes the trained model parameters. By imposing a prior on the model parameters θ, the Maximum A-Posteriori (MAP) loss function can be formulated as
ℒ(x,y) = −λ p(y|x,θ) p(θ),
where λ is an arbitrary scale parameter. Minimizing this loss function is equivalent to maximizing the mode of the posterior distribution over the model parameters. When deploying the network one can use the outputted mean vector as the gaze signal, and the standard deviation as a measure of confidence.
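As a non-limiting numerical illustration of the relations above, the following sketch evaluates the Gaussian likelihood and the MAP loss ℒ(x,y) = −λ p(y|x,θ) p(θ) for a toy parameter vector. The Gaussian prior p(θ) used here is an assumption made for the purpose of the example, and in practice an equivalent negative log-posterior is typically minimized for numerical stability.

```python
# Sketch (assumption): evaluating L(x, y) = -lambda * p(y|x, theta) * p(theta)
# with an isotropic 2D Gaussian likelihood N(y | mu_theta(x), sigma_theta(x))
# and an assumed Gaussian prior on the flattened model parameters.
import numpy as np

def gaussian_2d_isotropic(y, mu, sigma):
    """Density of a 2D isotropic Gaussian N(y | mu, sigma^2 * I)."""
    d2 = np.sum((np.asarray(y) - np.asarray(mu)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)

def gaussian_prior(theta, prior_sigma=1.0):
    """Assumed Gaussian prior p(theta) over the model parameters."""
    theta = np.asarray(theta).ravel()
    norm = (2.0 * np.pi * prior_sigma ** 2) ** (len(theta) / 2.0)
    return np.exp(-np.sum(theta ** 2) / (2.0 * prior_sigma ** 2)) / norm

def map_loss(y, mu, sigma, theta, lam=1.0):
    """L(x, y) = -lambda * p(y | x, theta) * p(theta), as stated in the text."""
    return -lam * gaussian_2d_isotropic(y, mu, sigma) * gaussian_prior(theta)

# Usage: a label y (stimulus point) and the model outputs mu_theta(x), sigma_theta(x).
y = np.array([0.40, 0.55])           # known stimulus point (normalized coordinates)
mu, sigma = np.array([0.42, 0.50]), 0.05
theta = np.array([0.1, -0.3, 0.7])   # toy stand-in for the trained model parameters
print(map_loss(y, mu, sigma, theta))
```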
Step 330: identifying a space that the user is viewing using the probability distribution. In one example, the gaze data comprises a Gaussian probability distribution of gaze positions, where each gaze position comprises a mean position vector (μ) and a standard deviation (σ) indicative of a confidence level.
In some embodiments, the space is represented as a region. A typical example is a scenario when a user is viewing a screen, and the screen is at least partially split into a plurality of adjacent non-overlapping regions.
Additionally or alternatively, the space of the method 300 comprises a region, the probability distribution is then indicative of a plurality of regions, each region having related confidence data indicative of a confidence level that the user is viewing the region.
The trained model may be obtained or trained by providing training or calibration data, typically comprising 2D images and corresponding verified gaze data.
In one embodiment, a selection of the method steps described above is performed by a computer, such as a laptop.
In one embodiment, a selection of the method steps described above is performed by a server, such as a cloud server.
In one embodiment, a selection of the method steps described above is performed by a computer 760, such as a laptop, and the remaining steps are performed by the server 770. Data, such as gaze tracking sensor data or gaze data may be exchanged over a communications network 780.
In one embodiment, the space comprises a region. The probability distribution of the gaze data is indicative of a plurality of regions 410, each region having related confidence data indicative of a confidence level that the user is viewing the region.
Additionally or alternatively, the plurality of regions 410 forms a grid representing a display 228 the user is viewing.
Additionally or alternatively, the step 330 of identifying the space the user is viewing comprises selecting a region, from the plurality of regions 410, having a highest confidence level.
In one example, the wired/wireless signals, e.g. sensor signals, from the at least one image sensor 213 are received by the computer 220 and the gaze tracking sensor data is extracted from the signals, e.g. by demodulating/decoding/processing an image depicting a user's eyes from the signals. The gaze tracking sensor data is then fed to and processed by the trained model, e.g. a convolutional neural network. The trained model then outputs gaze data, e.g. a two-dimensional isotropic Gaussian probability distribution of regions in a similar fashion as described in the example above, in relation to step 320, for gaze positions. In other words, a probability distribution is provided comprising associated or aggregated data identifying a region and the confidence level that a user is viewing that region.
Additionally or alternatively, the method further comprises determining a gaze point using the selected region. The gaze point may, e.g., be determined as the geometric center or the center of mass of the selected region.
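A non-limiting sketch of selecting the grid region 410 with the highest confidence level and deriving a gaze point from its geometric center is given below; the grid size, confidence values and display resolution are illustrative assumptions.

```python
# Sketch (assumption): the display is split into a grid of regions and the trained
# model outputs one confidence value per region; the region with the highest
# confidence is selected and its geometric center is returned as the gaze point.
import numpy as np

def select_region_and_gaze_point(confidences, screen_w, screen_h):
    """confidences: 2D array (rows x cols) of per-region confidence levels."""
    rows, cols = confidences.shape
    r, c = np.unravel_index(np.argmax(confidences), confidences.shape)
    cell_w, cell_h = screen_w / cols, screen_h / rows
    gaze_point = ((c + 0.5) * cell_w, (r + 0.5) * cell_h)  # geometric center of the cell
    return (r, c), confidences[r, c], gaze_point

# Usage with a hypothetical 3x4 grid over a 1920x1080 display.
conf = np.array([[0.01, 0.02, 0.05, 0.01],
                 [0.03, 0.10, 0.60, 0.08],
                 [0.02, 0.04, 0.03, 0.01]])
print(select_region_and_gaze_point(conf, 1920, 1080))
```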
In some embodiments, the plurality of regions are not adjacent but rather arranged spatially separate. This may e.g. be the case in some augmented reality applications or in vehicle related applications of the method described herein.
Additionally or alternatively, each region of the plurality of regions 421-425 may be arranged spatially separate and represent an object 411-415 that the user is potentially viewing. The object 411-415 may be a real object and/or a virtual object or a mixture of real and virtual objects.
Additionally or alternatively, the step 330 of identifying the region the user is viewing may comprise selecting a region of the plurality of regions 421-425 having a highest confidence level.
Additionally or alternatively, the method may further comprise identifying or selecting an object using the selected region. E.g. by selecting the object enclosed by the selected region.
Additionally or alternatively, the method further comprises determining a gaze point or gaze position using the selected region and/or the selected object. The gaze point or gaze position may, e.g., be determined as the geometric center or the center of mass of the selected region and/or the selected object.
Additionally or alternatively, the objects may be displays and/or input devices, such as a mouse or a keyboard.
Additionally or alternatively, the objects are different interaction objects comprised in a car, such as mirrors 411, a center console 413 and a dashboard with dials 414, 415 and an information field 412.
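The following non-limiting sketch illustrates selecting an interaction object 411-415 from spatially separate regions 421-425 by highest confidence level and deriving a gaze point from the geometric center of the selected region; the object names, rectangles and confidence values are illustrative assumptions only.

```python
# Sketch (assumption): spatially separate regions, each representing an interaction
# object in a car, with one confidence value per region; the object whose region has
# the highest confidence is selected, and the region center is used as the gaze point.
def select_object(region_confidences, region_bounds):
    """region_confidences: {name: confidence}; region_bounds: {name: (x0, y0, x1, y1)}."""
    name = max(region_confidences, key=region_confidences.get)
    x0, y0, x1, y1 = region_bounds[name]
    gaze_point = ((x0 + x1) / 2.0, (y0 + y1) / 2.0)  # geometric center of the region
    return name, region_confidences[name], gaze_point

confidences = {"left_mirror": 0.05, "rear_view_mirror": 0.10,
               "center_console": 0.70, "dashboard_dials": 0.10, "info_field": 0.05}
bounds = {"left_mirror": (0, 300, 100, 380), "rear_view_mirror": (500, 0, 700, 60),
          "center_console": (450, 400, 750, 700), "dashboard_dials": (200, 350, 440, 500),
          "info_field": (460, 350, 740, 395)}
print(select_object(confidences, bounds))
```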
In one embodiment, the identified space comprises or is a gaze point 640, wherein the probability distribution is indicative of a plurality of gaze points 610, each gaze point having related confidence data indicative of a confidence level that the user is viewing the gaze point.
Additionally or alternatively, identifying the space the user is viewing comprises selecting a gaze point 640 of the plurality of gaze points 610 having a highest confidence level.
In one example, the wired/wireless signals, e.g. sensor signals, from the at least one image sensor 213 are received by the computer 220 and the gaze tracking sensor data is extracted from the signals, e.g. by demodulating/decoding/processing an image depicting a user's eyes from the signals. The gaze tracking sensor data is then fed to and processed by the trained model, e.g. a convolutional neural network. The trained model then outputs gaze data, e.g. a two-dimensional isotropic Gaussian probability distribution of gaze points in a similar fashion as described in the example above, in relation to step 320. In other words, a probability distribution is provided comprising associated or aggregated data identifying a gaze point and the confidence level that a user is viewing that gaze point.
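A minimal, non-limiting sketch of identifying the gaze point 640 among a plurality of candidate gaze points 610 by highest confidence level could look as follows; the candidate coordinates and confidence values are illustrative assumptions.

```python
# Sketch (assumption): the probability distribution is represented as a list of
# candidate gaze points, each with an associated confidence value; the candidate
# with the highest confidence is identified as the gaze point 640.
candidates = [((120, 340), 0.10), ((640, 360), 0.65), ((900, 180), 0.25)]  # ((x, y), confidence)
gaze_point, confidence = max(candidates, key=lambda pc: pc[1])
print(gaze_point, confidence)
```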
Typically, 2D gaze data refers to an X, Y gaze position 730 on a 2D plane 740, e.g. a 2D plane formed by a computer screen viewed by the user 750. In comparison, 3D gaze data refers to not only the X, Y gaze position, but also the Z gaze position. In an example, the gaze ray 710 can be characterized by gaze origin 720 or an eye position in 3D space as the origin and a direction of the 3D gaze from the origin.
As illustrated in
In one embodiment, the space of the method 300 comprises a three-dimensional gaze ray 710. The gaze ray 710 may be defined by a gaze origin 720, e.g. the center of the user's eye, and a gaze direction. The probability distribution may then be indicative of a plurality of gaze rays, each gaze ray having related confidence data indicative of a confidence level that the direction the user is viewing coincides with the gaze direction of a respective gaze ray.
Additionally or alternatively, identifying the space the user is viewing comprises selecting a gaze ray 710 of the plurality of gaze rays having a highest corresponding confidence level. In other words, gaze data comprising a probability distribution is provided, comprising associated or aggregated data identifying a gaze ray and a corresponding confidence level that the user is gazing along that gaze ray.
Additionally or alternatively, the method 300 further comprises determining a gaze point using the selected gaze ray and a surface, e.g. the 2D surface formed by the screen of the computer or computing device 760. Any other surface, such as a 3D surface, could be used to determine the gaze point as an intersection point of the surface and the gaze ray.
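A non-limiting sketch of determining a gaze point as the intersection of a selected gaze ray 710 (gaze origin 720 plus gaze direction) with a planar surface is given below; the coordinate values are illustrative assumptions, and the same approach generalizes to other surfaces.

```python
# Sketch (assumption): intersecting the gaze ray origin + t * direction (t >= 0)
# with a plane, e.g. the plane of a computer screen, to obtain the gaze point.
import numpy as np

def gaze_point_on_plane(gaze_origin, gaze_direction, plane_point, plane_normal):
    """Returns the intersection point, or None if the ray is parallel to the
    plane or points away from it."""
    o = np.asarray(gaze_origin, dtype=float)
    d = np.asarray(gaze_direction, dtype=float)
    d = d / np.linalg.norm(d)
    n = np.asarray(plane_normal, dtype=float)
    denom = np.dot(n, d)
    if abs(denom) < 1e-9:
        return None  # ray parallel to the surface
    t = np.dot(n, np.asarray(plane_point, dtype=float) - o) / denom
    return None if t < 0 else o + t * d

# Usage: eye at z = 600 mm looking roughly towards a screen lying in the z = 0 plane.
print(gaze_point_on_plane([30.0, 20.0, 600.0], [-0.05, -0.02, -1.0],
                          plane_point=[0.0, 0.0, 0.0], plane_normal=[0.0, 0.0, 1.0]))
```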
In one example, the wired/wireless signals, e.g. sensor signals, from the at least one image sensor 213 are received by the computer 220 and the gaze tracking sensor data is extracted from the signals, e.g. by demodulating/decoding/processing an image depicting a user's eyes from the signals. The gaze tracking sensor data is then fed to and processed by the trained model, e.g. a convolutional neural network. The trained model then outputs gaze data, e.g. a two-dimensional isotropic Gaussian probability distribution of gaze rays in a similar fashion as described in the example above, in relation to step 320, for gaze positions.
The server 770 may send information about the gaze data comprising the probability distribution over the communications network 780 to the computer 760. The computer or computing device 760 uses this information to execute a gaze application that provides a gaze-based computing service to the user 750, e.g. obtaining user input of selecting a visualized object.
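By way of a non-limiting illustration of how a gaze application on the computer 760 may use both the gaze estimate and its confidence level, the sketch below ignores estimates whose standard deviation exceeds a threshold; the threshold value, object names and coordinates are illustrative assumptions.

```python
# Sketch (assumption): a gaze application consuming the gaze data (mean gaze position
# plus standard deviation) and only acting on it when the confidence is sufficient;
# the threshold is an arbitrary illustrative choice.
def handle_gaze_sample(mu, sigma, objects, max_sigma=0.08):
    """mu: (x, y) gaze estimate in normalized screen coordinates; sigma: standard
    deviation reported by the trained model; objects: {name: (x0, y0, x1, y1)}."""
    if sigma > max_sigma:
        return None  # estimate too uncertain: ignore it instead of acting on noise
    for name, (x0, y0, x1, y1) in objects.items():
        if x0 <= mu[0] <= x1 and y0 <= mu[1] <= y1:
            return name  # the visualized object the user is deemed to be selecting
    return None

ui = {"ok_button": (0.40, 0.80, 0.60, 0.90), "cancel_button": (0.65, 0.80, 0.85, 0.90)}
print(handle_gaze_sample((0.52, 0.86), 0.03, ui))   # confident -> selects "ok_button"
print(handle_gaze_sample((0.52, 0.86), 0.25, ui))   # low confidence -> ignored
```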
Although
In a further example, the computer 760 includes a camera, a screen, and a 3D gaze application. The camera generates gaze tracking sensor data in the form of a 2D image that is a 2D representation of the user's face. This 2D image shows the user's eyes while gazing into 3D space. A 3D coordinate system can be defined in association with the camera. For example, the camera is at the origin of this 3D coordinate system. The corresponding X and Y axes can span a plane perpendicular to the camera's line-of-sight center direction/main direction. In comparison, the 2D image has a 2D plane that can be defined around a 2D coordinate system local to the 2D representation of the user's face. The camera is associated with a mapping between the 2D space and the 3D space (e.g., between the two coordinate systems formed by the camera and the 2D representation of the user's face). In an example, this mapping includes the camera's back-projection matrix and is stored locally at the computing device 760 (e.g., in a storage location associated with the 3D gaze application). The computing device's 760 display may be, but need not be, in the X, Y plane of the camera (if not, the relative position between the two is determined based on the configuration of the computing device 760). The 3D gaze application can process the 2D image for inputting to the trained model (whether remote or local to the computing device 760) and can process the information about the gaze ray 710 to support stereoscopic displays (if also supported by the computing device's 760 display) and 3D applications (e.g., 3D controls and manipulations of displayed objects on the computing device's 760 display based on the tracking sensor data).
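The mapping between the 2D image space and the camera's 3D coordinate system mentioned above may, by way of a non-limiting illustration, be realized as a pinhole-camera back-projection as sketched below; the intrinsic parameter values are illustrative assumptions and not those of any particular device.

```python
# Sketch (assumption): a pixel (u, v) in the 2D image is mapped to a ray direction
# in the camera's 3D coordinate system using the inverse of an intrinsic matrix K.
import numpy as np

K = np.array([[800.0,   0.0, 320.0],   # fx, skew, cx  (assumed intrinsics, pixels)
              [  0.0, 800.0, 240.0],   # fy, cy
              [  0.0,   0.0,   1.0]])

def back_project(u, v, K):
    """Return a unit direction vector in the camera coordinate system for pixel (u, v)."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray / np.linalg.norm(ray)

# Usage: the gaze ray 710 could then be formed from an estimated 3D gaze origin 720
# and a direction expressed in this camera coordinate system.
print(back_project(400.0, 260.0, K))
```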
In one embodiment, a computer program is provided, comprising computer-executable instructions for causing the computer 220, when the computer-executable instructions are executed on processing circuitry comprised in the computer 220, to perform any of the method steps of the method described herein.
In one embodiment, a computer program product is provided, comprising a computer-readable storage medium, the computer-readable storage medium having the computer program above embodied therein.
In embodiments, the communications network 780 communicates using wired or wireless communication techniques that may include at least one of a Local Area Network (LAN), Metropolitan Area Network (MAN), Global System for Mobile Network (GSM), Enhanced Data GSM Environment (EDGE), Universal Mobile Telecommunications System, Long Term Evolution, High Speed Downlink Packet Access (HSDPA), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth®, Zigbee®, Wi-Fi, Voice over Internet Protocol (VoIP), LTE Advanced, IEEE 802.16m, WirelessMAN-Advanced, Evolved High-Speed Packet Access (HSPA+), 3GPP Long Term Evolution (LTE), Mobile WiMAX (IEEE 802.16e), Ultra Mobile Broadband (UMB) (formerly Evolution-Data Optimized (EV-DO) Rev. C), Fast Low-latency Access with Seamless Handoff Orthogonal Frequency Division Multiplexing (Flash-OFDM), High Capacity Spatial Division Multiple Access (iBurst®) and Mobile Broadband Wireless Access (MBWA) (IEEE 802.20) systems, High Performance Radio Metropolitan Area Network (HIPERMAN), Beam-Division Multiple Access (BDMA), World Interoperability for Microwave Access (Wi-MAX) and ultrasonic communication, etc., but is not limited thereto.
Moreover, it is realized by the skilled person that the computer 220 may comprise the necessary communication capabilities in the form of e.g., functions, means, units, elements, etc., for performing the present solution. Examples of other such means, units, elements and functions are: processors, memory, buffers, control logic, encoders, decoders, rate matchers, de-rate matchers, mapping units, multipliers, decision units, selecting units, switches, interleavers, de-interleavers, modulators, demodulators, inputs, outputs, antennas, amplifiers, receiver units, transmitter units, DSPs, MSDs, encoder, decoder, power supply units, power feeders, communication interfaces, communication protocols, etc. which are suitably arranged together for performing the present solution.
Especially, the processing circuitry 221 of the present disclosure may comprise one or more instances of a processor and/or processing means, processor modules and multiple processors configured to cooperate with each other, a Central Processing Unit (CPU), a processing unit, a processing circuit, a processor, an Application Specific Integrated Circuit (ASIC), a microprocessor, a Field-Programmable Gate Array (FPGA) or other processing logic that may interpret and execute instructions. The expression “processing circuitry” may thus represent a processing circuitry comprising a plurality of processing circuits, such as, e.g., any, some or all of the ones mentioned above. The processing means may further perform data processing functions for inputting, outputting, and processing of data.
Finally, it should be understood that the invention is not limited to the embodiments described above, but also relates to and incorporates all embodiments within the scope of the appended independent claims.