The present disclosure relates to hearing devices, e.g. hearing aids, and in particular to a method to estimate a hearing aid (HA) user head orientation using inertial sensors and eye gaze data.
In an aspect of the present application, a hearing aid adapted for orientation is provided.
The hearing aid may be adapted to provide a frequency dependent gain and/or a level dependent compression and/or a transposition (with or without frequency compression) of one or more frequency ranges to one or more other frequency ranges, e.g. to compensate for a hearing impairment of a user. The hearing aid may comprise a signal processor for enhancing the input signals and providing a processed output signal.
The hearing aid may comprise an output unit for providing a stimulus perceived by the user as an acoustic signal based on a processed electric signal. The output unit may comprise a number of electrodes of a cochlear implant (for a CI type hearing aid) or a vibrator of a bone conducting hearing aid. The output unit may comprise an output transducer. The output transducer may comprise a receiver (loudspeaker) for providing the stimulus as an acoustic signal to the user (e.g. in an acoustic (air conduction based) hearing aid). The output transducer may comprise a vibrator for providing the stimulus as mechanical vibration of a skull bone to the user (e.g. in a bone-attached or bone-anchored hearing aid).
The hearing aid may comprise an input unit for providing an electric input signal representing sound. The input unit may comprise an input transducer, e.g. a microphone, for converting an input sound to an electric input signal. The input unit may comprise a wireless receiver for receiving a wireless signal comprising or representing sound and for providing an electric input signal representing said sound. The wireless receiver may e.g. be configured to receive an electromagnetic signal in the radio frequency range (3 kHz to 300 GHz). The wireless receiver may e.g. be configured to receive an electromagnetic signal in a frequency range of light (e.g. infrared light 300 GHz to 430 THz, or visible light, e.g. 430 THz to 770 THz).
The hearing aid may comprise a directional microphone system adapted to spatially filter sounds from the environment, and thereby enhance a target acoustic source among a multitude of acoustic sources in the local environment of the user wearing the hearing aid. The directional system is adapted to detect (such as adaptively detect) from which direction a particular part of the microphone signal originates. This can be achieved in various different ways as e.g. described in the prior art. In hearing aids, a microphone array beamformer is often used for spatially attenuating background noise sources. Many beamformer variants can be found in literature. The minimum variance distortionless response (MVDR) beamformer is widely used in microphone array signal processing. Ideally the MVDR beamformer keeps the signals from the target direction (also referred to as the look direction) unchanged, while attenuating sound signals from other directions maximally. The generalized sidelobe canceller (GSC) structure is an equivalent representation of the MVDR beamformer offering computational and numerical advantages over a direct implementation in its original form.
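By way of a non-limiting illustration, the distortionless property of the MVDR beamformer may be sketched as below for a single frequency bin. The closed form w = R⁻¹d / (dᴴR⁻¹d) is the standard MVDR solution. The two-microphone uniform linear array, the 12 mm spacing, the 2 kHz frequency, the noise powers and all function names are assumptions made for this example only and are not taken from the disclosure.

```python
import cmath
import math


def solve(A, b):
    """Solve A x = b for a small complex system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented copy; inputs untouched
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0j] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def steering_vector(n_mics, spacing_m, angle_rad, freq_hz, c=343.0):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    delay = spacing_m * math.cos(angle_rad) / c
    return [cmath.exp(-2j * math.pi * freq_hz * m * delay) for m in range(n_mics)]


def mvdr_weights(R, d):
    """MVDR weights w = R^-1 d / (d^H R^-1 d) for noise covariance R and look vector d."""
    r_inv_d = solve(R, d)
    denom = sum(di.conjugate() * xi for di, xi in zip(d, r_inv_d))
    return [xi / denom for xi in r_inv_d]


def response(w, d):
    """Beamformer response w^H d to a unit source with steering vector d."""
    return sum(wi.conjugate() * di for wi, di in zip(w, d))


FREQ_HZ = 2000.0                                          # assumed analysis frequency
d_target = steering_vector(2, 0.012, 0.0, FREQ_HZ)        # look direction: front
d_noise = steering_vector(2, 0.012, math.pi, FREQ_HZ)     # interferer: rear
# Noise covariance: one directional interferer (power 1.0) plus sensor noise (power 0.01).
R = [[d_noise[i] * d_noise[j].conjugate() + (0.01 if i == j else 0.0)
      for j in range(2)] for i in range(2)]
w = mvdr_weights(R, d_target)
print(abs(response(w, d_target)), abs(response(w, d_noise)))
```

In this sketch the response in the look direction stays at unity while the response towards the interferer is strongly reduced, which is the behaviour described above for the ideal MVDR beamformer.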
The hearing aid may comprise antenna and transceiver circuitry (e.g. a wireless receiver) for wirelessly receiving a direct electric input signal from another device, e.g. from an entertainment device (e.g. a TV-set), a communication device, a wireless microphone, or another hearing aid. The direct electric input signal may represent or comprise an audio signal and/or a control signal and/or an information signal. The hearing aid may comprise demodulation circuitry for demodulating the received direct electric input to provide the direct electric input signal representing an audio signal and/or a control signal e.g. for setting an operational parameter (e.g. volume) and/or a processing parameter of the hearing aid. In general, a wireless link established by antenna and transceiver circuitry of the hearing aid can be of any type. The wireless link may be established between two devices, e.g. between an entertainment device (e.g. a TV) and the hearing aid, or between two hearing aids, e.g. via a third, intermediate device (e.g. a processing device, such as a remote control device, a smartphone, etc.). The wireless link may be used under power constraints, e.g. in that the hearing aid may be constituted by or comprise a portable (typically battery driven) device. The wireless link may be a link based on near-field communication, e.g. an inductive link based on an inductive coupling between antenna coils of transmitter and receiver parts. Alternatively, the wireless link may be based on far-field, electromagnetic radiation. The communication via the wireless link may be arranged according to a specific modulation scheme, e.g. an analogue modulation scheme, such as FM (frequency modulation) or AM (amplitude modulation) or PM (phase modulation), or a digital modulation scheme, such as ASK (amplitude shift keying), e.g. On-Off keying, FSK (frequency shift keying), PSK (phase shift keying), e.g. MSK (minimum shift keying), or QAM (quadrature amplitude modulation), etc.
The communication between the hearing aid and the other device may be in the base band (audio frequency range, e.g. between 0 and 20 kHz). Preferably, communication between the hearing aid and the other device is based on some sort of modulation at frequencies above 100 kHz. Preferably, frequencies used to establish a communication link between the hearing aid and the other device are below 70 GHz, e.g. located in a range from 50 MHz to 70 GHz, e.g. above 300 MHz, e.g. in an ISM range above 300 MHz, e.g. in the 900 MHz range or in the 2.4 GHz range or in the 5.8 GHz range or in the 60 GHz range (ISM=Industrial, Scientific and Medical, such standardized ranges being e.g. defined by the International Telecommunication Union, ITU). The wireless link may be based on a standardized or proprietary technology. The wireless link may be based on Bluetooth technology (e.g. Bluetooth Low-Energy technology).
The hearing aid and/or the communication device may comprise an electrically small antenna. An ‘electrically small antenna’ is in the present context taken to mean that the spatial extension of the antenna (e.g. the maximum physical dimension in any direction) is much smaller than the wavelength λTx of the transmitted electric signal. The spatial extension of the antenna may be a factor of 10, or 50 or 100 or more, or a factor of 1 000 or more, smaller than the carrier wavelength λTx of the transmitted signal. The hearing aid is a relatively small device. The term ‘a relatively small device’ is in the present context taken to mean a device whose maximum physical dimension (and thus of an antenna for providing a wireless interface to the device) is smaller than 10 cm, such as smaller than 5 cm. In the present context, ‘a relatively small device’ may be a device whose maximum physical dimension is much smaller (e.g. more than 3 times, such as more than 10 times smaller, such as more than 20 times smaller) than the operating wavelength of a wireless interface to which the antenna is intended (ideally an antenna for radiation of electromagnetic waves at a given frequency should be larger than or equal to half the wavelength of the radiated waves at that frequency). At 860 MHz, the wavelength in vacuum is around 35 cm. At 2.4 GHz, the wavelength in vacuum is around 12 cm. The hearing aid may have a maximum outer dimension of the order of 0.15 m (e.g. a handheld mobile telephone). The hearing aid may have a maximum outer dimension of the order of 0.08 m (e.g. a headset). The hearing aid may have a maximum outer dimension of the order of 0.04 m (e.g. a hearing instrument).
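The wavelength figures quoted above follow from λ = c/f. The short sketch below reproduces them and computes how many times smaller a device of a given size is than the carrier wavelength. The function names are chosen for this illustration only.

```python
C_VACUUM_M_PER_S = 299_792_458.0  # speed of light in vacuum


def wavelength_m(freq_hz):
    """Vacuum wavelength of an electromagnetic wave at the given frequency."""
    return C_VACUUM_M_PER_S / freq_hz


def size_ratio(max_dimension_m, freq_hz):
    """Factor by which the device dimension is smaller than the carrier wavelength."""
    return wavelength_m(freq_hz) / max_dimension_m


for f_hz in (860e6, 2.4e9):
    print(f"{f_hz / 1e6:.0f} MHz: wavelength {wavelength_m(f_hz) * 100:.1f} cm, "
          f"0.04 m device is {size_ratio(0.04, f_hz):.1f} times smaller")
```

For a hearing instrument with a maximum outer dimension of the order of 0.04 m, the ratio is roughly 9 at 860 MHz and roughly 3 at 2.4 GHz, so the ‘more than 3 times smaller’ criterion is only just met at 2.4 GHz.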
The hearing aid may be or form part of a portable (i.e. configured to be wearable) device, e.g. a device comprising a local energy source, e.g. a battery, e.g. a rechargeable battery. The hearing aid may e.g. be a low weight, easily wearable, device, e.g. having a total weight less than 100 g.
The hearing aid may comprise a forward or signal path between an input unit (e.g. an input transducer, such as a microphone or a microphone system and/or direct electric input (e.g. a wireless receiver)) and an output unit, e.g. an output transducer. The signal processor may be located in the forward path. The signal processor may be adapted to provide a frequency dependent gain according to a user's particular needs. The hearing aid may comprise an analysis path comprising functional components for analyzing the input signal (e.g. determining a level, a modulation, a type of signal, an acoustic feedback estimate, etc.). Some or all signal processing of the analysis path and/or the signal path may be conducted in the frequency domain. Some or all signal processing of the analysis path and/or the signal path may be conducted in the time domain.
An analogue electric signal representing an acoustic signal may be converted to a digital audio signal in an analogue-to-digital (AD) conversion process, where the analogue signal is sampled with a predefined sampling frequency or rate fs, fs being e.g. in the range from 8 kHz to 48 kHz (adapted to the particular needs of the application) to provide digital samples xn (or x[n]) at discrete points in time tn (or n), each audio sample representing the value of the acoustic signal at tn by a predefined number Nb of bits, Nb being e.g. in the range from 1 to 48 bits, e.g. 24 bits. Each audio sample is hence quantized using Nb bits (resulting in 2^Nb different possible values of the audio sample). A digital sample x has a length in time of 1/fs, e.g. 50 μs, for fs=20 kHz. A number of audio samples may be arranged in a time frame. A time frame may comprise 64 or 128 audio data samples. Other frame lengths may be used depending on the practical application.
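The numerical relations of the AD conversion described above (sample length 1/fs, 2^Nb quantization levels, frame duration) may be checked with the following sketch. The uniform quantizer over the range [-1, 1) and all function names are assumptions made for illustration and are not prescribed by the disclosure.

```python
import math


def sample_period_s(fs_hz):
    """Length in time of one sample, 1/fs."""
    return 1.0 / fs_hz


def quantization_levels(n_bits):
    """Number of distinct values representable with n_bits, i.e. 2^Nb."""
    return 2 ** n_bits


def frame_duration_s(n_samples, fs_hz):
    """Duration of a time frame holding n_samples audio samples."""
    return n_samples / fs_hz


def quantize(x, n_bits):
    """Assumed uniform quantizer for x in [-1, 1); returns the reconstructed value."""
    step = 2.0 / quantization_levels(n_bits)
    code = math.floor(x / step)
    lowest, highest = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    code = min(max(code, lowest), highest)  # clip out-of-range input
    return code * step


print(sample_period_s(20_000))        # sample length at fs = 20 kHz
print(quantization_levels(24))        # possible values for Nb = 24
print(frame_duration_s(64, 20_000))   # duration of a 64-sample frame
print(quantize(0.3, 3))               # 3-bit example
```

At fs=20 kHz, a frame of 64 samples thus spans 3.2 ms and a frame of 128 samples spans 6.4 ms.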
The hearing aid may comprise an analogue-to-digital (AD) converter to digitize an analogue input (e.g. from an input transducer, such as a microphone) with a predefined sampling rate, e.g. 20 kHz. The hearing aid may comprise a digital-to-analogue (DA) converter to convert a digital signal to an analogue output signal, e.g. for being presented to a user via an output transducer.
The hearing aid, e.g. the input unit, and/or the antenna and transceiver circuitry may comprise a TF-conversion unit for providing a time-frequency representation of an input signal. The time-frequency representation may comprise an array or map of corresponding complex or real values of the signal in question in a particular time and frequency range. The TF conversion unit may comprise a filter bank for filtering a (time varying) input signal and providing a number of (time varying) output signals each comprising a distinct frequency range of the input signal. The TF conversion unit may comprise a Fourier transformation unit for converting a time variant input signal to a (time variant) signal in the (time-)frequency domain. The frequency range considered by the hearing aid from a minimum frequency fmin to a maximum frequency fmax may comprise a part of the typical human audible frequency range from 20 Hz to 20 kHz, e.g. a part of the range from 20 Hz to 12 kHz. Typically, a sample rate fs is larger than or equal to twice the maximum frequency fmax, fs≥2fmax. A signal of the forward and/or analysis path of the hearing aid may be split into a number NI of frequency bands (e.g. of uniform width), where NI is e.g. larger than 5, such as larger than 10, such as larger than 50, such as larger than 100, such as larger than 500, at least some of which are processed individually. The hearing aid may be adapted to process a signal of the forward and/or analysis path in a number NP of different frequency channels (NP≤NI). The frequency channels may be uniform or non-uniform in width (e.g. increasing in width with frequency), overlapping or non-overlapping.
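One possible Fourier-transform based TF conversion may be sketched as follows: the input is cut into overlapping, windowed frames and each frame is transformed into uniform frequency bands. The Hann window, the 64-sample frame, the 50% overlap and the function names are assumptions for this example. A practical device would typically use an FFT or a dedicated filter bank rather than the direct DFT shown here.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft_frame(frame):
    """Direct DFT of one real frame; returns bins 0..n/2 (up to fs/2)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def time_frequency_map(signal, frame_len=64, hop=32):
    """Map of complex values indexed by [time frame][frequency band]."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append(dft_frame(frame))
    return frames


def band_centre_hz(k, frame_len, fs_hz):
    """Centre frequency of uniform band k."""
    return k * fs_hz / frame_len


FS_HZ = 20_000
tone = [math.sin(2 * math.pi * 2500 * n / FS_HZ) for n in range(256)]
tf_map = time_frequency_map(tone)
magnitudes = [abs(v) for v in tf_map[0]]
peak_band = magnitudes.index(max(magnitudes))
print(len(tf_map), len(tf_map[0]), peak_band, band_centre_hz(peak_band, 64, FS_HZ))
```

With fs=20 kHz and 64-sample frames, the bands are 312.5 Hz wide and cover 0 Hz to fs/2=10 kHz, consistent with fs≥2fmax.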
The hearing aid may be configured to operate in different modes, e.g. a normal mode and one or more specific modes, e.g. selectable by a user, or automatically selectable. A mode of operation may be optimized to a specific acoustic situation or environment. A mode of operation may include a low-power mode, where functionality of the hearing aid is reduced (e.g. to save power), e.g. to disable wireless communication, and/or to disable specific features of the hearing aid.
The hearing aid may comprise a number of detectors configured to provide status signals relating to a current physical environment of the hearing aid (e.g. the current acoustic environment), and/or to a current state of the user wearing the hearing aid, and/or to a current state or mode of operation of the hearing aid. Alternatively or additionally, one or more detectors may form part of an external device in communication (e.g. wirelessly) with the hearing aid. An external device may e.g. comprise another hearing aid, a remote control, an audio delivery device, a telephone (e.g. a smartphone), an external sensor, etc.
One or more of the number of detectors may operate on the full band signal (time domain). One or more of the number of detectors may operate on band split signals ((time-) frequency domain), e.g. in a limited number of frequency bands.
The number of detectors may comprise a level detector for estimating a current level of a signal of the forward path. The detector may be configured to decide whether the current level of a signal of the forward path is above or below a given (L-)threshold value. The level detector may operate on the full band signal (time domain). Alternatively or additionally, the level detector may operate on band split signals ((time-) frequency domain).
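A level detector of the kind described above may, as one simple possibility, estimate the mean power of a block of samples in dB and compare it with the threshold. The block-wise mean-square estimate, the dB reference of 1.0 (full scale) and the function names are assumptions made for this sketch. The same functions may be applied to a full band signal or to the samples of a single frequency band.

```python
import math


def level_db(samples, floor=1e-12):
    """Mean-square level of a block in dB relative to an amplitude of 1.0."""
    power = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(power, floor))  # floor avoids log10(0) on silence


def above_threshold(samples, threshold_db):
    """True if the current block level exceeds the given (L-)threshold."""
    return level_db(samples) > threshold_db


loud = [1.0, -1.0] * 32    # full-scale square wave
quiet = [0.1, -0.1] * 32   # same waveform, 20 dB lower
print(level_db(loud), level_db(quiet))
print(above_threshold(loud, -10.0), above_threshold(quiet, -10.0))
```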
The hearing aid may comprise a voice activity detector (VAD) for estimating whether or not (or with what probability) an input signal comprises a voice signal (at a given point in time). A voice signal is in the present context taken to include a speech signal from a human being. It may also include other forms of utterances generated by the human speech system (e.g. singing). The voice activity detector unit is adapted to classify a current acoustic environment of the user as a VOICE or NO-VOICE environment. This has the advantage that time segments of the electric microphone signal comprising human utterances (e.g. speech) in the user's environment can be identified, and thus separated from time segments only (or mainly) comprising other sound sources (e.g. artificially generated noise). The voice activity detector may be adapted to detect as a VOICE also the user's own voice. Alternatively, the voice activity detector may be adapted to exclude a user's own voice from the detection of a VOICE.
The hearing aid may comprise an own voice detector for estimating whether or not (or with what probability) a given input sound (e.g. a voice, e.g. speech) originates from the voice of the user of the system. A microphone system of the hearing aid may be adapted to be able to differentiate between a user's own voice and another person's voice and possibly from NON-voice sounds.
The number of detectors may comprise a movement detector, e.g. an acceleration sensor. The movement detector is configured to detect movement of the user's facial muscles and/or bones, e.g. due to speech or chewing (e.g. jaw movement) and to provide a detector signal indicative thereof.
The hearing aid may comprise a classification unit configured to classify the current situation based on input signals from (at least some of) the detectors, and possibly other inputs as well. In the present context ‘a current situation’ is taken to be defined by one or more of
a) the physical environment (e.g. including the current electromagnetic environment, e.g. the occurrence of electromagnetic signals (e.g. comprising audio and/or control signals) intended or not intended for reception by the hearing aid, or other properties of the current environment than acoustic);
b) the current acoustic situation (input level, feedback, etc.);
c) the current mode or state of the user (movement, temperature, cognitive load, etc.); and
d) the current mode or state of the hearing aid (program selected, time elapsed since last user interaction, etc.) and/or of another device in communication with the hearing aid.
The classification unit may be based on or comprise a neural network, e.g. a trained neural network.
The hearing aid may comprise an acoustic (and/or mechanical) feedback control (e.g. suppression) or echo-cancelling system. Acoustic feedback occurs because the output loudspeaker signal from an audio system providing amplification of a signal picked up by a microphone is partly returned to the microphone via an acoustic coupling through the air or other media. The part of the loudspeaker signal returned to the microphone is then re-amplified by the system before it is re-presented at the loudspeaker, and again returned to the microphone. As this cycle continues, the effect of acoustic feedback becomes audible as artifacts or even worse, howling, when the system becomes unstable. The problem appears typically when the microphone and the loudspeaker are placed closely together, as e.g. in hearing aids or other audio systems. Some other classic situations with feedback problems are telephony, public address systems, headsets, audio conference systems, etc. Adaptive feedback cancellation has the ability to track feedback path changes over time. It is based on a linear time invariant filter to estimate the feedback path but its filter weights are updated over time. The filter update may be calculated using stochastic gradient algorithms, including some form of the Least Mean Square (LMS) or the Normalized LMS (NLMS) algorithms. They both have the property to minimize the error signal in the mean square sense with the NLMS additionally normalizing the filter update with respect to the squared Euclidean norm of some reference signal.
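The NLMS update mentioned above may be sketched as follows. For simplicity, the example identifies a fixed feedback path in an open-loop setting with a white-noise loudspeaker signal and no external sound. This is a deliberate simplification of the closed-loop hearing aid case, and the path coefficients, step size and function names are assumptions made for illustration.

```python
import random


def fir(signal, taps):
    """Filter a signal with an FIR path (models the acoustic feedback path here)."""
    return [sum(taps[k] * signal[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(signal))]


def nlms_identify(reference, observed, n_taps, mu=0.5, eps=1e-8):
    """Estimate the path from reference (loudspeaker) to observed (microphone) signal."""
    w = [0.0] * n_taps
    buf = [0.0] * n_taps          # most recent reference samples, newest first
    errors = []
    for u, y in zip(reference, observed):
        buf = [u] + buf[:-1]
        y_hat = sum(wi * bi for wi, bi in zip(w, buf))    # feedback estimate
        e = y - y_hat                                     # feedback-corrected signal
        norm = sum(b * b for b in buf) + eps              # squared Euclidean norm
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        errors.append(e)
    return w, errors


rng = random.Random(0)
true_path = [0.0, 0.5, -0.3, 0.1]                  # hypothetical feedback path
loudspeaker = [rng.gauss(0.0, 1.0) for _ in range(2000)]
microphone = fir(loudspeaker, true_path)           # feedback only, no external sound
estimate, errors = nlms_identify(loudspeaker, microphone, n_taps=4)
print([round(c, 3) for c in estimate])
```

The division by the squared norm of the reference buffer is what distinguishes NLMS from plain LMS, and it makes the effective step size independent of the loudspeaker signal level.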
The feedback control system may comprise a feedback estimation unit for providing a feedback signal representative of an estimate of the acoustic feedback path, and a combination unit, e.g. a subtraction unit, for subtracting the feedback signal from a signal of the forward path (e.g. as picked up by an input transducer of the hearing aid). The feedback estimation unit may comprise an update part comprising an adaptive algorithm and a variable filter part for filtering an input signal according to variable filter coefficients determined by said adaptive algorithm, wherein the update part is configured to update said filter coefficients of the variable filter part with a configurable update frequency fupd. The hearing aid is configured to provide that the configurable update frequency fupd has a maximum value fupd,max. The maximum value fupd,max is a fraction of a sampling frequency fs of an AD converter of the hearing aid (fupd,max=fs/D).
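The relation fupd,max=fs/D may be illustrated as below. The text does not define D further. It is read here as an integer decimation factor, i.e. the filter coefficients are updated at most once every D samples. This interpretation and the function names are assumptions.

```python
def max_update_rate_hz(fs_hz, decimation):
    """Maximum coefficient update frequency, fupd,max = fs / D."""
    return fs_hz / decimation


def update_sample_indices(n_samples, fs_hz, f_upd_hz, decimation):
    """Sample indices at which the update part transfers new coefficients."""
    f_upd = min(f_upd_hz, max_update_rate_hz(fs_hz, decimation))  # cap at fupd,max
    interval = max(1, round(fs_hz / f_upd))
    return list(range(0, n_samples, interval))


print(max_update_rate_hz(20_000, 4))
print(update_sample_indices(20, 20_000, 10_000, 4))  # request above the cap
print(update_sample_indices(20, 20_000, 2_000, 4))   # request below the cap
```

Lowering the configured update frequency below the cap reduces the computational load of the update part at the cost of slower tracking of feedback path changes.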
The update part of the adaptive filter may comprise an adaptive algorithm for calculating updated filter coefficients for being transferred to the variable filter part of the adaptive filter. The timing of calculation and/or transfer of updated filter coefficients from the update part to the variable filter part may be controlled by the activation control unit. The timing of the update (e.g. its specific point in time, and/or its update frequency) may preferably be influenced by various properties of the signal of the forward path. The update control scheme is preferably supported by one or more detectors of the hearing aid, preferably included in a predefined criterion comprising the detector signals.
The hearing aid may further comprise other relevant functionality for the application in question, e.g. compression, noise reduction, etc.
The hearing aid may comprise a hearing instrument, e.g. a hearing instrument adapted for being located at the ear or fully or partially in the ear canal of a user, e.g. a headset, an earphone, an ear protection device or a combination thereof. The hearing assistance system may comprise a speakerphone (comprising a number of input transducers and a number of output transducers, e.g. for use in an audio conference situation), e.g. comprising a beamformer filtering unit, e.g. providing multiple beamforming capabilities.
In an aspect, use of a hearing aid as described above, in the ‘detailed description of embodiments’ and in the claims, is moreover provided. Use may be provided in a system comprising audio distribution, e.g. a system comprising a microphone and a loudspeaker in sufficiently close proximity of each other to cause feedback from the loudspeaker to the microphone during operation by a user. Use may be provided in a system comprising one or more hearing aids (e.g. hearing instruments), headsets, ear phones, active ear protection systems, etc., e.g. in handsfree telephone systems, teleconferencing systems (e.g. including a speakerphone), public address systems, karaoke systems, classroom amplification systems, etc.
In an aspect, a tangible computer-readable medium (a data carrier) storing a computer program comprising program code means (instructions) for causing a data processing system (a computer) to perform (carry out) at least some (such as a majority or all) of the (steps of the) method described above, in the ‘detailed description of embodiments’ and in the claims, when said computer program is executed on the data processing system is furthermore provided by the present application.
By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Other storage media include storage in DNA (e.g. in synthesized DNA strands). Combinations of the above should also be included within the scope of computer-readable media. In addition to being stored on a tangible medium, the computer program can also be transmitted via a transmission medium such as a wired or wireless link or a network, e.g. the Internet, and loaded into a data processing system for being executed at a location different from that of the tangible medium.
A computer program (product) comprising instructions which, when the program is executed by a computer, cause the computer to carry out (steps of) the method described above, in the ‘detailed description of embodiments’ and in the claims is furthermore provided by the present application.
In an aspect, a data processing system comprising a processor and program code means for causing the processor to perform at least some (such as a majority or all) of the steps of the method described above, in the ‘detailed description of embodiments’ and in the claims is furthermore provided by the present application.
In a further aspect, a hearing system comprising a hearing aid as described above, in the ‘detailed description of embodiments’, and in the claims, and an auxiliary device is moreover provided.
The hearing system is adapted to establish a communication link between the hearing aid and the auxiliary device to provide that information (e.g. control and status signals, possibly audio signals) can be exchanged or forwarded from one to the other.
The auxiliary device may comprise a remote control, a smartphone, or other portable or wearable electronic device, such as a smartwatch or the like.
The auxiliary device may be constituted by or comprise a remote control for controlling functionality and operation of the hearing aid(s). The function of a remote control may be implemented in a smartphone, the smartphone possibly running an APP allowing the user to control the functionality of the audio processing device via the smartphone (the hearing aid(s) comprising an appropriate wireless interface to the smartphone, e.g. based on Bluetooth or some other standardized or proprietary scheme).
The auxiliary device may be constituted by or comprise an audio gateway device adapted for receiving a multitude of audio signals (e.g. from an entertainment device, e.g. a TV or a music player, a telephone apparatus, e.g. a mobile telephone or a computer, e.g. a PC) and adapted for selecting and/or combining an appropriate one of the received audio signals (or combination of signals) for transmission to the hearing aid.
The auxiliary device may be constituted by or comprise another hearing aid. The hearing system may comprise two hearing aids adapted to implement a binaural hearing system, e.g. a binaural hearing aid system.
In a further aspect, a non-transitory application, termed an APP, is furthermore provided by the present disclosure. The APP comprises executable instructions configured to be executed on an auxiliary device to implement a user interface for a hearing aid or a hearing system described above in the ‘detailed description of embodiments’, and in the claims. The APP is configured to run on a cellular phone, e.g. a smartphone, or on another portable device allowing communication with said hearing aid or said hearing system.
In the present context, a hearing aid, e.g. a hearing instrument, refers to a device, which is adapted to improve, augment and/or protect the hearing capability of a user by receiving acoustic signals from the user's surroundings, generating corresponding audio signals, possibly modifying the audio signals and providing the possibly modified audio signals as audible signals to at least one of the user's ears. Such audible signals may e.g. be provided in the form of acoustic signals radiated into the user's outer ears, acoustic signals transferred as mechanical vibrations to the user's inner ears through the bone structure of the user's head and/or through parts of the middle ear as well as electric signals transferred directly or indirectly to the cochlear nerve of the user.
The hearing aid may be configured to be worn in any known way, e.g. as a unit arranged behind the ear with a tube leading radiated acoustic signals into the ear canal or with an output transducer, e.g. a loudspeaker, arranged close to or in the ear canal, as a unit entirely or partly arranged in the pinna and/or in the ear canal, as a unit, e.g. a vibrator, attached to a fixture implanted into the skull bone, as an attachable, or entirely or partly implanted, unit, etc. The hearing aid may comprise a single unit or several units communicating (e.g. acoustically, electrically or optically) with each other. The loudspeaker may be arranged in a housing together with other components of the hearing aid or may be an external unit in itself (possibly in combination with a flexible guiding element, e.g. a dome-like element).
More generally, a hearing aid comprises an input transducer for receiving an acoustic signal from a user's surroundings and providing a corresponding input audio signal and/or a receiver for electronically (i.e. wired or wirelessly) receiving an input audio signal, a (typically configurable) signal processing circuit (e.g. a signal processor, e.g. comprising a configurable (programmable) processor, e.g. a digital signal processor) for processing the input audio signal and an output unit for providing an audible signal to the user in dependence on the processed audio signal. The signal processor may be adapted to process the input signal in the time domain or in a number of frequency bands. In some hearing aids, an amplifier and/or compressor may constitute the signal processing circuit. The signal processing circuit typically comprises one or more (integrated or separate) memory elements for executing programs and/or for storing parameters used (or potentially used) in the processing and/or for storing information relevant for the function of the hearing aid and/or for storing information (e.g. processed information, e.g. provided by the signal processing circuit), e.g. for use in connection with an interface to a user and/or an interface to a programming device. In some hearing aids, the output unit may comprise an output transducer, such as e.g. a loudspeaker for providing an air-borne acoustic signal or a vibrator for providing a structure-borne or liquid-borne acoustic signal. In some hearing aids, the output unit may comprise one or more output electrodes for providing electric signals (e.g. to a multi-electrode array) for electrically stimulating the cochlear nerve (cochlear implant type hearing aid).
In some hearing aids, the vibrator may be adapted to provide a structure-borne acoustic signal transcutaneously or percutaneously to the skull bone. In some hearing aids, the vibrator may be implanted in the middle ear and/or in the inner ear. In some hearing aids, the vibrator may be adapted to provide a structure-borne acoustic signal to a middle-ear bone and/or to the cochlea. In some hearing aids, the vibrator may be adapted to provide a liquid-borne acoustic signal to the cochlear liquid, e.g. through the oval window. In some hearing aids, the output electrodes may be implanted in the cochlea or on the inside of the skull bone and may be adapted to provide the electric signals to the hair cells of the cochlea, to one or more hearing nerves, to the auditory brainstem, to the auditory midbrain, to the auditory cortex and/or to other parts of the cerebral cortex.
A hearing aid may be adapted to a particular user's needs, e.g. a hearing impairment. A configurable signal processing circuit of the hearing aid may be adapted to apply a frequency and level dependent compressive amplification of an input signal. A customized frequency and level dependent gain (amplification or compression) may be determined in a fitting process by a fitting system based on a user's hearing data, e.g. an audiogram, using a fitting rationale (e.g. adapted to speech). The frequency and level dependent gain may e.g. be embodied in processing parameters, e.g. uploaded to the hearing aid via an interface to a programming device (fitting system) and used by a processing algorithm executed by the configurable signal processing circuit of the hearing aid.
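A frequency and level dependent gain of the kind described above may be sketched as a per-band compression rule: below a knee point the band receives a fixed gain, and above it the gain is reduced so that the output level grows more slowly than the input level. The band centre frequencies, gains, knee points and compression ratios below are purely hypothetical values chosen for illustration. They do not represent a fitting rationale or the result of an actual fitting.

```python
def compressive_gain_db(input_level_db, linear_gain_db, knee_db, ratio):
    """Gain in dB for one band: linear below the knee, compressed above it."""
    if input_level_db <= knee_db:
        return linear_gain_db
    # Above the knee, each extra input dB yields only 1/ratio dB at the output.
    return linear_gain_db - (input_level_db - knee_db) * (1.0 - 1.0 / ratio)


# Hypothetical processing parameters: band centre (Hz) -> (gain dB, knee dB, ratio).
BAND_PARAMETERS = {
    500: (10.0, 50.0, 1.5),
    2000: (20.0, 50.0, 2.0),
    4000: (30.0, 45.0, 3.0),
}


def band_gains_db(band_levels_db):
    """Apply the per-band rule to a mapping of band centre (Hz) -> input level (dB)."""
    return {band: compressive_gain_db(level, *BAND_PARAMETERS[band])
            for band, level in band_levels_db.items()}


print(band_gains_db({500: 40.0, 2000: 70.0, 4000: 75.0}))
```

In this sketch the processing parameters play the role of the values uploaded to the hearing aid from the fitting system, and `compressive_gain_db` plays the role of the processing algorithm that uses them.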
A ‘hearing system’ refers to a system comprising one or two hearing aids, and a ‘binaural hearing system’ refers to a system comprising two hearing aids and being adapted to cooperatively provide audible signals to both of the user's ears. Hearing systems or binaural hearing systems may further comprise one or more ‘auxiliary devices’, which communicate with the hearing aid(s) and affect and/or benefit from the function of the hearing aid(s). Such auxiliary devices may include at least one of a remote control, a remote microphone, an audio gateway device, an entertainment device, e.g. a music player, a wireless communication device, e.g. a mobile phone (such as a smartphone) or a tablet or another device, e.g. comprising a graphical interface. Hearing aids, hearing systems or binaural hearing systems may e.g. be used for compensating for a hearing-impaired person's loss of hearing capability, augmenting or protecting a normal-hearing person's hearing capability and/or conveying electronic audio signals to a person. Hearing aids or hearing systems may e.g. form part of or interact with public-address systems, active ear protection systems, handsfree telephone systems, car audio systems, entertainment (e.g. TV, music playing or karaoke) systems, teleconferencing systems, classroom amplification systems, etc.
Embodiments of the disclosure may e.g. be useful in audiological applications such as CI rehabilitation, eye steering (combination with EarEEG), sound source location and balance monitoring.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing will be provided by the USPTO upon request and payment of the necessary fee.
The aspects of the disclosure may be best understood from the following detailed description taken in conjunction with the accompanying figures. The figures are schematic and simplified for clarity, and they just show details to improve the understanding of the claims, while other details are left out. Throughout, the same reference numerals are used for identical or corresponding parts. The individual features of each aspect may each be combined with any or all features of the other aspects. These and other aspects, features and/or technical effect will be apparent from and elucidated with reference to the illustrations described hereinafter in which:
The figures are schematic and simplified for clarity, and they just show details which are essential to the understanding of the disclosure, while other details are left out. Throughout, the same reference signs are used for identical or corresponding parts.
Further scope of applicability of the present disclosure will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the disclosure, are given by way of illustration only. Other embodiments may become apparent to those skilled in the art from the following detailed description.
The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it will be apparent to those skilled in the art that these concepts may be practiced without these specific details. Several aspects of the apparatus and methods are described by various blocks, functional units, modules, components, circuits, steps, processes, algorithms, etc. (collectively referred to as “elements”). Depending upon particular application, design constraints or other reasons, these elements may be implemented using electronic hardware, computer program, or any combination thereof.
The electronic hardware may include micro-electronic-mechanical systems (MEMS), integrated circuits (e.g. application specific), microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), gated logic, discrete hardware circuits, printed circuit boards (PCB) (e.g. flexible PCBs), and other suitable hardware configured to perform the various functionality described throughout this disclosure, e.g. sensors, e.g. for sensing and/or registering physical properties of the environment, the device, the user, etc. Computer program shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
The present application relates to the field of hearing aids. Further, it relates to a method of estimating the head orientation of a hearing aid (HA) user using inertial sensors and eye gaze data.
Current hearing aids do not have access to the orientation of the head. Head orientation can be extremely valuable in certain use cases: balance monitoring/prediction, rehabilitation for CI users, eye-steering of beamformers (EarEEG applications) and noise suppression.
Inertial sensors (an inertial measurement unit, IMU), i.e. 3-axis accelerometers and gyroscopes, located either within a HA or in an external device mounted on the user's head, allow estimation of head linear acceleration, head attitude (pitch and roll), head rotational velocity and short-time head orientation in the horizontal plane (yaw). Since gyroscope measurements contain bias and noise, yaw cannot be estimated over longer periods by integration alone, and therefore other supporting data is needed. Both gyroscopes and accelerometers are small devices with low power consumption.
By means of two accelerometers (one for each ear) and some suitable signal processing a virtual gyroscope is provided.
Eye gaze data, providing measures of the user's gaze angle in the horizontal plane with respect to the head, can be obtained by eye-tracking glasses or EarEEG measurements.
Utilization of eye gaze patterns, such as fixations (periods of no, or very small, eye movements) and the vestibulo-ocular reflex (VOR), enables bias estimation and drift correction in yaw estimation. The VOR is the counter-movement of the eyes when the head is rotating, which is used to produce a stable image on the retina of the part of the visual scene at which the eyes are pointing. See e.g. (Holmqvist, 2015) for more information.
Gyroscopic measurements are defined as
where ω is the angular velocity of the user head, egyr the measurement noise and bgyr the gyroscopic bias. If the bias is not known and unaccounted for, the estimate of orientation will quickly deteriorate. An example of this is shown in
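The effect of an unaccounted bias can be illustrated with a minimal sketch. It assumes the additive measurement model suggested by the symbol definitions above (measurement = angular velocity + bias + noise); the sample rate, bias and noise levels are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def integrate_yaw(gyro_z, dt):
    """Integrate yaw-rate samples (rad/s) into a yaw angle trajectory (rad)."""
    return np.cumsum(gyro_z) * dt

# All numbers below are assumed for illustration only.
rng = np.random.default_rng(0)
dt = 0.01                                     # sample period [s] (100 Hz)
n = 6000                                      # 60 s of data
omega_true = np.zeros(n)                      # the head is actually still
b_gyr = np.deg2rad(1.0)                       # constant gyro bias of 1 deg/s
e_gyr = rng.normal(0.0, np.deg2rad(0.5), n)   # white measurement noise

y_gyr = omega_true + b_gyr + e_gyr            # assumed additive measurement model
yaw = integrate_yaw(y_gyr, dt)
drift_deg = np.rad2deg(yaw[-1])               # accumulated yaw error after 60 s
print(f"yaw error after 60 s: {drift_deg:.1f} deg")
```

Although the head never moves in this simulation, the integrated yaw grows roughly linearly with time at the rate of the bias, which is why the bias has to be estimated from other data.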
A fixed gaze direction indicates that the head is stationary in the global coordinate system, provided that the user is not following a moving object with synchronized eye and head movements, which is highly unlikely. Such a measurement model would be expressed
where bgyr would be the gyroscope measurements when gaze direction is fixed and egyr is the corresponding measurement noise. The important part here is that the angular velocity is zero at fixation, and therefore the gyroscope measurement then consists only of the gyroscopic bias and noise. This bias estimate can then be used to correct the gyro measurements when the head is rotating.
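A minimal sketch of this idea is given below. It assumes that a separate fixation detector supplies a boolean mask (here derived from the simulated ground truth, which is a hypothetical stand-in), and it estimates the bias as the mean gyroscope reading over the fixation samples. The function name and all numeric values are illustrative assumptions.

```python
import numpy as np

def estimate_gyro_bias(gyro, fixation_mask):
    """Mean gyro reading over samples flagged as fixation (omega assumed zero)."""
    gyro = np.asarray(gyro, dtype=float)
    mask = np.asarray(fixation_mask, dtype=bool)
    if not mask.any():
        raise ValueError("no fixation samples available")
    return gyro[mask].mean(axis=0)

# Simulated yaw-rate data; all values are assumptions for illustration.
rng = np.random.default_rng(1)
dt, n = 0.01, 2000
true_bias = 0.02                              # rad/s
omega = np.zeros(n)
omega[1000:1500] = 0.5                        # the head turns for 5 s
fixating = omega == 0.0                       # stand-in for a fixation detector
y_gyr = omega + true_bias + rng.normal(0.0, 0.005, n)

b_hat = estimate_gyro_bias(y_gyr, fixating)
true_yaw = np.sum(omega) * dt                 # actual head rotation
yaw_raw = np.sum(y_gyr) * dt                  # integration without bias correction
yaw_corrected = np.sum(y_gyr - b_hat) * dt    # integration with bias removed
```

In practice the bias varies over time, so a running estimate (for example as a state in a Kalman filter) would likely be preferred over a single mean.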
An example of the method with estimation results are shown in
Assume that the distance between the origin of the gaze vector and the origin of the body coordinate system, as well as the translation of the body coordinate system, is negligible. A model based on the vestibulo-ocular reflex could then be used. The eyes would be modelled to move in the opposite direction of the head. The angular velocity estimates about the body y- and z-axes (pitch and yaw rates) would be the input signals to such a system and the model would be
The angle between the gaze direction vector and the body xy-plane is denoted β and the angle between the gaze vector and the body xz-plane is denoted γ. The input ueye=[ωy ωz]T contains the angular velocity of the glasses about its y- and z-axes. The process noises affecting the gaze direction are denoted wβ and wγ. In
The cocktail party problem, introduced in 1953, describes the ability to focus auditory attention in a noisy environment, epitomized by a cocktail party. An individual with normal hearing uses several cues to unmask talkers of interest; such cues are often lacking for people with hearing loss. This thesis explores the possibility of using a pair of glasses equipped with an inertial measurement unit (IMU), a monocular camera and an eye tracker to estimate an auditory scene and the attention of the person wearing the glasses. Three main areas of interest have been investigated: estimating the head orientation of the user, tracking faces in the scene, and determining the talker of interest using gaze. Implemented on a hearing aid, this solution could be used to artificially unmask talkers in a noisy environment.
The head orientation of the user has been estimated with an extended Kalman filter (EKF) algorithm, with a constant velocity model and different sets of measurements: accelerometer; gyroscope; monocular visual odometry (MVO); gaze estimated bias (GEB). An intrinsic property of IMU sensors is a drift in yaw. A method using eye data and gyroscope measurements to estimate gyroscope bias has been investigated and is called GEB. The MVO methods investigated use either optical flow to track features in succeeding frames or a key frame approach to match features over multiple frames. Using estimated head orientation and face detection software, faces have been tracked since they can be assumed as regions of interest in a cocktail party environment. A constant position EKF with a nearest neighbor approach has been used for tracking. Further, eye data retrieved from the glasses has been analyzed to investigate the relation between gaze direction and current talker during conversations.
Experiments have been carried out where a person wearing eye tracking glasses has listened to, or taken part in, a discussion with three people. The different experiments excited the system in different ways. Results show that the solution performed well in estimating orientation during low angular rates but deteriorated during higher accelerations. During these experiments, the drift in yaw was reduced from 100°/min to approximately ±20°/min using GEB and fully mitigated during small movements using key frames. The tracker performs well in most cases, but during larger dynamics or when detections are too scarce, multiple tracks might occur due to errors in the orientation estimate. The results from the experiments show that tracked faces combined with gaze direction from the eye tracker can help in estimating the attention of the wearer of the glasses.
The cocktail party (CtP) effect, introduced by Cherry in 1953 [8], describes the ability to focus one's auditory attention in a noisy environment, such as a multitalker cocktail party. This is a complex issue and a wide research area. A healthy person uses a plethora of different cues to segment an auditory scene of multiple talkers. Spatial and spectral differences between talkers of interest and masking sound highly influence the intelligibility [6]. Visual stimulus of the face of a speaker also significantly improves hearing capability. This is particularly important under noisy conditions [61] such as in a CtP environment. The art of ventriloquism is a classic example of when visual stimulus heavily influences auditory perception [1].
According to the World Health Organization (WHO), approximately 466 million people suffer from hearing loss, with a projected 900 million by the year 2050 [38]. A common complaint among people seeking help for deficient hearing is difficulty understanding speech. The difficulty often occurs in noisy conditions, such as in a cafe or restaurant with multiple talkers. Since the aforementioned auditory cues used to process a CtP environment are often lacking for people with hearing loss [35], a traditional hearing aid does not solve this problem satisfactorily, and some people stop using their hearing aid because of the amplified background noise. In [27], Kochin explains that one of the prevalent reasons for people not wearing a hearing aid is that background noise is annoying or distracting.
The objective is to map an auditory scene. This is to be achieved using eye tracking glasses with a front camera to detect and track faces in the environment and identify whether the user is attending any of the faces. If so, which face the user is attending should be determined.
A pair of eye tracking glasses will be used to gather measurements.
With the objective in mind, three research questions are put forth, each with a couple of follow-up questions.
For the orientation estimate, the goal is that the error in yaw should be minimized.
Furthermore, the tracking software should enable tracking of at least three faces simultaneously in an indoor environment on distances that can be expected in a general conversation.
A possible scenario is depicted in
The hardware used in this thesis is a pair of Tobii Pro 2 glasses, further referred to as the glasses. They are equipped with sensors for eye tracking and orientation estimation, a camera and a microphone. The wearer of the glasses is assumed to be stationary and is only allowed to rotate their head.
The translation of the glasses due to rotation is neglected as well as translational movement of faces in the scene. A direction of where to steer a beamformer will be estimated, but technical performance of the beamformer will not be considered.
For monocular odometry, existing functions and algorithms available in OpenCV for Python will be used.
The CtP problem has been under extensive research since it was introduced.
Within the field of hearing aids a multitude of approaches aiming to solve the CtP problem exist, all with the intent of amplifying a target talker. One is to use directional microphones controlled by head direction [22], another to manually input the direction via a remote, either by pointing in the desired direction or button input [22].
A third approach tested in [22] and [13] is to use eye gaze direction to estimate a desired direction.
Results for eye gaze direction are promising, with faster response times, better recounts of conversations and easier use compared to the alternative methods [22]. In [36], two ways to use gaze data for sound source selection are analyzed: a “hard steering”, in which the talker being looked at at any given moment is amplified while the amplification of other talkers is reduced, and a “soft steering”, which, with a Bayesian approach explained in [24], can amplify several sources depending on the latest couple of seconds of gaze data. Results from [36] indicate that hard steering is preferred. However, more experiments in more varied situations might be needed to better understand when each kind of steering is preferable.
Conversation dynamics are intrinsically fast [46] and a steered hearing aid must be able to, in real time, follow the dynamics and amplify a talker of interest. Consequently, a natural extension to gaze steering is to predict listener focus using more information than just the gaze data. For the CtP problem, talkers are assumed to be of interest and thus face detection and tracking can be used.
Object detection is an extensively researched subject, of which face detection is a subfield. Some of the most popular detection algorithms are based on convolutional neural networks (CNN), such as R-CNN [17], Fast R-CNN [16] and Faster R-CNN [45], versions of you only look once (YOLO) [42-44] and MobileFaceNets (MFN) [7]. MFN is developed as a real-time face detector for mobile use [7], whilst the other mentioned methods are general object detectors that can be trained to detect faces.
To be able to steer efficiently, the direction to sound sources out of sight can be tracked. In a general setting, this requires that the pose of the glasses is estimated, but due to the limitation of no translational movement, only rotation is of interest for this thesis. Prior work in full pose estimation can nevertheless be used. Since both visual and inertial measurements are available, they can be fused to improve pose estimation compared to using only visual or inertial measurements. Multiple solutions to fuse these kinds of measurements exist; in [11], six visual-inertial algorithms are evaluated in terms of how well they can estimate the pose of a flying robot.
Three of the algorithms are based on Kalman filters and three of them are optimization based. Results in [11] show that tightly coupled solutions perform best with the cost of a higher computational burden. A loosely coupled Kalman filter approach was most efficient in terms of low computational power, but had the lowest accuracy among the evaluated algorithms. In [60], combination of visual and inertial measurements from sensors worn by a human to track their motion is performed. In the mentioned study, movements are classified as combined translation and rotation or only rotation.
One way to represent orientation is the unit quaternion. The quaternion representation was first introduced in [52]. In [30], orientation is described using the quaternion vector q=[q0; q1; q2; q3]T where q0 is scalar and q1; q2; q3 are complex with one imaginary axis each. One strength of this representation compared to the commonly used Euler representation is that it is not affected by gimbal lock, which is a phenomenon where a degree of freedom is lost.
In [51], the time derivative of orientation expressed in unit quaternion given the angular velocity w=[wx; wy; wz]T is given by
The rotation matrix expressed in q is
Let s=q0 and v=[q1, q2, q3], then an orientation in unit quaternion [s1, v1] can be rotated by a rotation expressed in the unit quaternion [s2, v2] with
A downside of using unit quaternions instead of Euler angles for orientation is that they are not as intuitive. Thus, within this thesis, the orientation will be visualized using Euler angles where roll, pitch and yaw, denoted ϕ, θ and ψ, are positive rotations around the x-, y- and z-axis respectively.
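The quaternion operations referred to above can be sketched as follows. This is a minimal illustration using the standard scalar-first Hamilton product, the standard rotation matrix of a unit quaternion and a body-frame angular velocity convention; these conventions and the function names are assumptions, and the expressions in [30] and [51] may differ by a transpose or by the order of multiplication.

```python
import numpy as np

def quat_multiply(q1, q2):
    """Hamilton product of q1 = [s1, v1] and q2 = [s2, v2] (scalar first)."""
    s1, v1 = q1[0], np.asarray(q1[1:], dtype=float)
    s2, v2 = q2[0], np.asarray(q2[1:], dtype=float)
    s = s1 * s2 - v1 @ v2
    v = s1 * v2 + s2 * v1 + np.cross(v1, v2)
    return np.concatenate(([s], v))

def quat_to_rotmat(q):
    """Rotation matrix of a quaternion (normalized first to unit length)."""
    q = np.asarray(q, dtype=float)
    q0, q1, q2, q3 = q / np.linalg.norm(q)
    return np.array([
        [q0*q0 + q1*q1 - q2*q2 - q3*q3, 2*(q1*q2 - q0*q3), 2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3), q0*q0 - q1*q1 + q2*q2 - q3*q3, 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2), 2*(q2*q3 + q0*q1), q0*q0 - q1*q1 - q2*q2 + q3*q3],
    ])

def quat_derivative(q, omega):
    """dq/dt = 0.5 * q (x) [0, omega], with omega in the body frame (assumed)."""
    return 0.5 * quat_multiply(q, np.concatenate(([0.0], omega)))

# A 90 degree yaw rotation applied twice gives a 180 degree yaw rotation.
q_z90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
q_z180 = quat_multiply(q_z90, q_z90)
x_rotated = quat_to_rotmat(q_z90) @ np.array([1.0, 0.0, 0.0])
q_dot = quat_derivative(np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]))
```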
An inertial measurement unit (IMU) is a set of sensors comprising at least one accelerometer and/or at least one gyroscope. The IMU is often complemented with a magnetometer, which allows estimation of a full 3D orientation. The accelerometer measures proper acceleration, the gyroscope measures angular velocity and the magnetometer measures the magnetic field. To estimate the orientation of the IMU relative to an earth reference frame, two linearly independent vectors, known in both the earth and IMU coordinate systems, have to be identified. Using the accelerometer, the gravity vector can be identified and using the magnetometer, the magnetic field of earth can be identified. Knowing these two vectors, the orientation of the IMU relative to earth can be derived [15]. The IMU can be connected to the hearing aid's noise reduction system. An advantage of using accelerometer input(s) to the noise reduction system is that if the user is moving (e.g. walking, running etc.), the settings of the noise reduction system can be adapted accordingly. For example, a single sound source can be selected based on the accelerometer or IMU input in the hearing aid system.
The IMU measurements contain errors which, for simplicity, can be split into two parts, one independent white noise part and one bias part [57]. For the accelerometer, the bias is assumed to be constant and would lead to an offset in the orientation estimate. The gyroscope bias is assumed to vary and since the angular velocity from the gyroscope is integrated to estimate orientation, the gyroscope bias leads to a drift in orientation. This drift can be compensated for with the absolute orientation estimates retrievable using accelerometer and magnetometer [34]. If using an IMU only, some drift in yaw will occur if no additional measurements can be used.
The Kalman filter (KF), introduced in 1960 [26], is used to optimally estimate states in a linear model by minimizing the estimation error. Real processes are seldom linear; therefore, some modifications to the original KF are needed. A nonlinear state-space model for a system without input signals and with additive noise can be described by
where f is the dynamic model and h relates the states to the measurements. N is a linear matrix relating the process noises and states. Time is indicated with subscript k and the states, xk, are quantities to be estimated. Measurements are denoted yk, wk are process noises and ek are measurement noises. The noises are assumed to be Gaussian, i.e., wk∼N(0, Q) and ek∼N(0, R), for a KF. In 1962, Smith et al. [53] introduced the extended Kalman filter (EKF) for nonlinear models. An EKF implementation requires a linearization of the nonlinear model for each instance of time.
The EKF algorithm consists of a prediction and a measurement update. The prediction step is
where a circumflex (hat) over a quantity indicates that the value is estimated. Pk+1|k and Pk|k are the covariances of the prediction and the estimate, respectively. The subscript k1|k0 indicates that the value at time k1 is evaluated based on values at time k0.
The measurement update step is performed by
where R is the measurement covariance matrix, yk is the vector containing measured signals and
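The prediction and measurement update can be sketched in code as follows. This is a textbook EKF formulation under the additive-noise model described above, not necessarily the exact expressions of the original; the function names, the two-state constant velocity example (yaw and yaw rate, with a yaw-rate measurement) and all numeric values are assumptions for illustration.

```python
import numpy as np

def ekf_predict(x, P, f, F_jac, N, Q):
    """Time update: propagate the estimate and its covariance through f."""
    F = F_jac(x)                          # Jacobian of f at the current estimate
    x_pred = f(x)
    P_pred = F @ P @ F.T + N @ Q @ N.T
    return x_pred, P_pred

def ekf_update(x_pred, P_pred, y, h, H_jac, R):
    """Measurement update: correct the prediction with the measurement y."""
    H = H_jac(x_pred)                     # Jacobian of h at the prediction
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x_pred + K @ (y - h(x_pred))
    P = P_pred - K @ S @ K.T
    return x, P

# Hypothetical example: state [yaw, yaw rate], constant velocity model.
T = 0.1
A = np.array([[1.0, T], [0.0, 1.0]])
N = np.array([[T**2 / 2], [T]])
Q = np.array([[0.01]])
R = np.array([[0.01]])

x0, P0 = np.zeros(2), np.eye(2)
x_pred, P_pred = ekf_predict(x0, P0, lambda x: A @ x, lambda x: A, N, Q)
x_est, P_est = ekf_update(x_pred, P_pred, np.array([1.0]),
                          lambda x: x[1:2], lambda x: np.array([[0.0, 1.0]]), R)
```

Because the measurement noise is small compared to the prior uncertainty in this example, the estimated yaw rate moves most of the way towards the measured value.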
Monocular visual odometry (MVO) is a collective term for methods to estimate translation and rotation using measurements from a monocular camera. Intrinsic parameters of the camera, obtained through calibration, and image correspondences are used to estimate the translation vector t=[tx; ty; tz]T and the rotation matrix R between frames. The translation can only be extracted up to an unknown scale through monocular odometry [23]. Calibrated cameras are primarily used to reduce the complexity of the problem. Seven point correspondences are needed to obtain a relative pose from two uncalibrated images, leading to up to three solutions. As stated by Kruppa in [29] (translated from German to English in [14]), the use of camera intrinsic parameters introduces two constraints, reducing the number of points needed to five. Kruppa [29] also proved that up to eleven different solutions can be obtained from the five point problem, which was later reduced to ten [37]. The primary steps in estimating the orientation between two frames are shown below and theory for each step will be presented later in the section.
1. Detect features in first frame.
2. Find matching features in the subsequent frame.
3. Estimate the essential matrix using the matched features.
4. Decompose the essential matrix.
The steps are similar to those mentioned in [50] but simplified since only the rotation is of interest.
In the scope of this thesis, a feature is defined as a local pattern distinguishable from its immediate neighbors. Image properties often used to extract features are texture, color and intensity [56]. There exists a multitude of different feature detectors. Some of the more popular detection algorithms, included in Open source computer vision (OpenCV), are
Since features are to be compared between frames, the ability to repeatably detect the same features is one of the most important properties of a feature detector. One parameter influencing the repeatability is the feature invariance [56]. Within mathematics, an invariant is a property that is unchanged when a specific transformation or operation is performed; the opposite is called covariant. For features, this is important in order to know whether a feature will be detectable after a change in pose. Typical transformations that occur between frames, in a static environment, are rotational and translational, leading to scale and perspective changes in the image. A rotation of a 2D surface does not make it any smaller or bigger; thus, feature detectors can be assumed to be rotationally invariant. In contrast, if a translational transformation is applied, a 2D surface is scaled; thus, all features are scale covariant. A scale invariant detector provides scale invariance by normalizing the features with a descriptor. A scale invariant detector is more generic compared to a detector which is only rotationally invariant. Therefore, scale invariant detectors should be used where large movements might occur, but rotationally invariant detectors might be enough for applications with smaller movements [56]. Of the mentioned detectors, SIFT, SURF and ORB have a descriptor that normalizes the features, thus making them scale invariant [48, 56]. The detectors Harris, Shi-Tomasi and FAST do not have any descriptor [21, 25, 47], thus making them invariant to rotation only.
After features have been extracted in the first frame the corresponding features should be found in subsequent frames. This can be done either by tracking or matching features. Feature matching uses the descriptions of features in two frames to extract matches between the features, thus feature matching needs descriptions of the features in each frame, implying that non-descriptor-based detectors cannot be used directly without an external descriptor. The computation of a feature descriptor can be computationally expensive [9].
Another method for finding the primary features in the subsequent frame is to track the features. Unlike feature matching, as described in Section 2.4.1, for which features need to be detected and described in each frame, tracking of features only requires detection when the number of tracked features is below a certain threshold. This occurs when too many features move out of frame or are obscured. One method of visual tracking of features is to use optical flow, which is defined as the pattern of apparent motion. The underlying assumption for the use of optical flow is that the pixel intensities do not change between consecutive frames [33].
The problem formulation for optical flow is as follows. I(x, y, t) is the intensity of an arbitrary pixel in an image at time t. The pixel moves a distance of (dx, dy) to the next frame at time t+dt [33]. Under the assumption of constant intensity, the following holds
A Taylor series expansion of the right side of (2.10) results in
Insertion of (2.11) in (2.10)
which can be written as
Redefining (2.13) as
the (x, y) components of the optical flow defined as
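The derivation outlined in the preceding steps can be written out as the following sketch. It is the standard brightness constancy derivation and uses the symbols introduced above; the subscript notation for partial derivatives and the omission of equation numbers are assumptions.

```latex
\begin{align*}
  I(x, y, t) &= I(x + dx,\; y + dy,\; t + dt)
    && \text{(constant intensity)} \\
  I(x + dx,\; y + dy,\; t + dt) &\approx I(x, y, t)
    + \frac{\partial I}{\partial x}\,dx
    + \frac{\partial I}{\partial y}\,dy
    + \frac{\partial I}{\partial t}\,dt
    && \text{(first-order Taylor expansion)} \\
  \frac{\partial I}{\partial x}\,dx
    + \frac{\partial I}{\partial y}\,dy
    + \frac{\partial I}{\partial t}\,dt &= 0
    && \text{(insertion)} \\
  \frac{\partial I}{\partial x}\frac{dx}{dt}
    + \frac{\partial I}{\partial y}\frac{dy}{dt}
    + \frac{\partial I}{\partial t} &= 0
    && \text{(division by } dt\text{)} \\
  I_x u + I_y v + I_t &= 0,
    \qquad u = \frac{dx}{dt},\quad v = \frac{dy}{dt}
\end{align*}
```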
One equation with two unknowns, (u, v), is obtained, which gives an underdetermined system. There exists a multitude of methods to solve this problem. One, introduced by Bruce D. Lucas and Takeo Kanade in [33], assumes an equal flow of the pixels within an m×m window, where each pixel is numbered. The assumption of an equal flow limits the method to cases where movements between frames are small. The resulting system of equations is
for pixel In, n∈[1, 2, . . . , N], N=m×m, within the window. The result of the assumption of neighbouring pixels is an overdetermined system that can be solved using the least squares approach
for the searched window. Thus, (2.15) is a solution for solving the optical flow problem given the image derivatives in x, y and t [5]. Using the Lucas-Kanade (LK) method for optical flow, a feature can be tracked in subsequent frames given two images and feature points of the first frame.
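A minimal numerical sketch of the least squares step for a single window is given below. It is not the pyramidal implementation available in OpenCV; the function name, the synthetic image and the window size are assumptions chosen so that the true flow is known.

```python
import numpy as np

def lucas_kanade_window(I1, I2, rows, cols):
    """Least-squares flow (u, v) for one window, assuming equal flow of its pixels."""
    Iy, Ix = np.gradient(I1)         # derivatives along axis 0 (y) and axis 1 (x)
    It = I2 - I1                     # temporal derivative for dt = 1 frame
    A = np.column_stack((Ix[rows, cols].ravel(), Iy[rows, cols].ravel()))
    b = -It[rows, cols].ravel()
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow                      # [u, v] in pixels per frame

def pattern(x, y):
    """Smooth synthetic intensity pattern (assumed test data)."""
    return np.sin(0.2 * x) + np.cos(0.3 * y)

# The second frame is the first frame shifted by a small, known amount.
y, x = np.mgrid[0:64, 0:64].astype(float)
u_true, v_true = 0.3, -0.2
I1 = pattern(x, y)
I2 = pattern(x - u_true, y - v_true)

u, v = lucas_kanade_window(I1, I2, slice(10, 50), slice(10, 50))
print(f"estimated flow: u={u:.3f}, v={v:.3f}")
```

The estimate is only approximate because the image derivatives are finite differences and the Taylor expansion is truncated, which is the reason the method is limited to small movements between frames.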
A natural interpretation of a feature could be a point P=[X; Y; Z]T in 3D space projected on an image as p=[u; v] and the essential matrix relates 3D points projected on two images using epipolar geometry [23]. The essential matrix is expressed
where R is the orientation of the camera and [t]x is the skew-symmetric matrix. The skew-symmetric matrix is defined as
and is a result of a property of the cross product between two vectors. An example with vectors a=[ax ay az]T and b=[bx by bz]T is
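For reference, the expressions discussed above can be sketched in standard notation as follows. The essential matrix form follows the text (the product of the skew-symmetric matrix of t and R); the exact layout of the original expressions is an assumption.

```latex
\begin{align*}
  E &= [t]_\times R, &
  [t]_\times &=
  \begin{bmatrix}
    0    & -t_z & t_y  \\
    t_z  & 0    & -t_x \\
    -t_y & t_x  & 0
  \end{bmatrix}, \\[4pt]
  a \times b &= [a]_\times\, b =
  \begin{bmatrix}
    0    & -a_z & a_y  \\
    a_z  & 0    & -a_x \\
    -a_y & a_x  & 0
  \end{bmatrix}
  \begin{bmatrix} b_x \\ b_y \\ b_z \end{bmatrix}
  &&=
  \begin{bmatrix}
    a_y b_z - a_z b_y \\
    a_z b_x - a_x b_z \\
    a_x b_y - a_y b_x
  \end{bmatrix}
\end{align*}
```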
Below is a derivation and explanation of the essential matrix.
Use extended vectors
where K is the pinhole camera intrinsic matrix defined using the focal lengths (fx, fy) and the optic center (cx, cy) as
Furthermore, t=[tx, ty, tz]T is the translation vector up to an unknown scale and λ is the scale factor. Additionally, M=K[R|t] is called the camera projection matrix [23] where [R|t] is the column stacked 3×4 matrix of R and t as
With a known camera intrinsic matrix, the projection (xx) can be expressed in normalized camera coordinates by multiplication with the inverse of K from the left, resulting in
with a normalized projection matrix {tilde over (M)}=[R|t]. Given a point correspondence in two images, the epipolar geometry can be expressed, visualized in
The plane spanned by the two camera centers (O1, O2) and the point P is called the epipolar plane. The line defined by (O1, O2) is called the baseline, and the points (e1, e2) where the baseline and the image planes intersect are called the epipoles [23].
Let {tilde over (M)}1=[I|0] and {tilde over (M)}2=[R|t] be the normalized projection matrices for the subsequent frames with
{tilde over (p)}2 is expressed in the first camera coordinate system, i.e., the global coordinate system and can be written
{tilde over (p)}2g and
Which can be written as
which is called the epipolar constraint equation where [t]xR is the sought essential matrix. To estimate the essential matrix, the five point problem mentioned in Section 2.4 needs to be solved. In [37], Nistér introduced an efficient way of solving the five point problem using a RANdom SAmple Consensus (RANSAC) scheme [12].
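The epipolar constraint can be checked numerically with a small sketch. It assumes the camera convention stated above (first camera [I|0], second camera [R|t], so that a point P in the first frame has coordinates RP + t in the second frame) and uses synthetic points; the rotation, translation and point cloud are illustrative assumptions.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]x such that skew(t) @ b equals np.cross(t, b)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]])

def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Assumed relative pose between the two frames.
R = rot_z(np.deg2rad(10.0))
t = np.array([0.1, 0.0, 0.02])
E = skew(t) @ R                       # essential matrix

# Synthetic 3D points in front of the first camera.
rng = np.random.default_rng(2)
P = rng.uniform([-1, -1, 2], [1, 1, 5], size=(20, 3))
P2 = P @ R.T + t                      # the same points in the second camera frame

p1 = P / P[:, 2:3]                    # normalized image coordinates, frame 1
p2 = P2 / P2[:, 2:3]                  # normalized image coordinates, frame 2

# Epipolar constraint p2^T E p1 = 0 for every correspondence.
residuals = np.einsum("ni,ij,nj->n", p2, E, p1)
print(np.max(np.abs(residuals)))
```

With noise-free correspondences the residuals vanish up to rounding error; with real, noisy matches they do not, which is why a robust scheme such as RANSAC is used when estimating the essential matrix.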
In the RANSAC scheme, multiple five point samples of tracked points are randomly extracted and each sample yields a set of hypothetical orientation estimates. Each hypothesis is then statistically tested and scored over all matched points, and the best scoring hypothesis is further improved by iterative refinement.
From an essential matrix, four different compositions of rotation matrix and translation vector can be extracted [23]. Assuming that {tilde over (M)}1=[I|0] is the first camera matrix and {tilde over (M)}2 the second camera matrix, the translation and rotation to the second frame can be expressed as one of the following
Where {tilde over (M)}2=[R|t] is the true rotation and translation. {tilde over (M)}2=[R|−t] has reversed translation vector compared to the true, {tilde over (M)}2=[Rb|t] and {tilde over (M)}2=[Rb|−t] are called the “twisted pair” solutions for {tilde over (M)}2=[R|t], and {tilde over (M)}2=[R|−t], respectively. The twisted pair solutions have 180° rotation about the line joining the two camera centers [23].
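The four candidates can be obtained with a singular value decomposition, as in the following sketch. It follows the standard textbook decomposition and is not taken from the original; the function names and the test pose are assumptions. Selecting the true candidate requires an additional test that is not shown here.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]x of a 3-vector t."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]])

def rot_y(angle):
    """Rotation matrix for a rotation of `angle` radians about the y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def decompose_essential(E):
    """Return the four (R, t) candidates of E, with t of unit length."""
    U, _, Vt = np.linalg.svd(E)
    # Force proper rotations; E is only defined up to sign and scale.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    Ra = U @ W @ Vt
    Rb = U @ W.T @ Vt
    t = U[:, 2]                       # translation direction, sign unknown
    return [(Ra, t), (Ra, -t), (Rb, t), (Rb, -t)]

# Assumed true pose used to build a test essential matrix.
R_true = rot_y(np.deg2rad(15.0))
t_true = np.array([0.3, 0.0, 0.1])
candidates = decompose_essential(skew(t_true) @ R_true)
best = min(np.linalg.norm(Rc - R_true) for Rc, _ in candidates)
```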
In this section, theory behind eye movements and gaze tracking is explained. Eye movement theory is presented to get an understanding of how eyes move. A short background to gaze tracking is included to give an overview of how it can be performed.
Movements of the eye can generally be divided into four different types: saccades, smooth pursuit movements, vergence movements and vestibulo-ocular movements [10]. Saccades are rapid, ballistic movements of the gaze between points and can be both voluntary and involuntary. Both the velocity and the duration of a saccade are highly dependent on the distance covered: a 2° saccade, typical for reading, lasts about 30 ms, whereas a 5° saccade, typical for scene perception, lasts about 30-40 ms [39]. Smooth pursuit movements are voluntary movements to fixate on and follow objects. Vergence movement is the fixation of both eyes based on distance, i.e., the disjunctive movement to fixate objects closer to or further away from the observer. The vestibulo-ocular movements are a reflex to stabilize the eyes due to head movements [31]. The effect results in eye movement in the opposite direction of head movement. Fixation on a point is the most common state for the eyes and thus, knowledge of when one fixates is important for accurate classification of eye movements.
To determine which kind of eye movement an individual is performing there are several solutions available. A commonly used method is velocity threshold identification (I-VT) [49]. In [28], several methods to determine eye movement based on gaze data are evaluated and it is concluded that I-VT is performing well in terms of saccade identification. The threshold used significantly affects the performance of the classification and can be varied depending on hardware and situation. A threshold somewhere between 30°/s and 70°/s performs well in terms of identifying saccades in [28].
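A minimal sketch of velocity threshold identification is shown below. It treats the gaze as a single horizontal angle and uses a threshold of 50°/s, chosen from within the range reported above; the function name, the threshold and the sample data are assumptions for illustration.

```python
def classify_ivt(gaze_deg, timestamps_s, threshold_deg_per_s=50.0):
    """Label each sample-to-sample interval as 'saccade' or 'fixation'."""
    labels = []
    for i in range(1, len(gaze_deg)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        velocity = abs(gaze_deg[i] - gaze_deg[i - 1]) / dt   # deg/s
        labels.append("saccade" if velocity > threshold_deg_per_s else "fixation")
    return labels

# Horizontal gaze angle sampled at 100 Hz (assumed data): a slow movement,
# a rapid 10 degree jump over two samples, then a slow movement again.
gaze = [0.0, 0.1, 0.2, 5.2, 10.2, 10.3]
times = [i * 0.01 for i in range(len(gaze))]
labels = classify_ivt(gaze, times)
print(labels)
```

Intervals labelled as fixation are the ones during which the gyroscope bias could be observed according to the fixation model described earlier.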
To measure eye movements in wearable eye trackers, video-oculography (VOG) is often used. In most VOG applications, infrared light is used to provide contrast between the pupil and the rest of the eye and to enable tracking in most light conditions [18]. There are two main methods for eye tracking using infrared light: dark pupil and bright pupil tracking. For dark pupil tracking, the camera and light source are offset in angle, so that none of the light passing through the pupil is reflected back to the camera. With bright pupil tracking, the infrared light source is placed coaxially with the camera, causing much of the light passing through the pupil to be reflected into the camera [20]. Both methods aim to measure the position of the pupil, which is further used to estimate gaze direction.
When the position of the pupil is known, parameters which differ between individuals are needed to estimate gaze direction. These are often obtained through a calibration procedure where the user focuses their gaze to at least one point [58].
The system to be implemented can briefly be described by
Several coordinate systems are used to represent the different entities of the system.
The origin of the c-frame and IMU-frame coincide with the b-frame, thus, tcb=[0; 0; 0]T in
and the gaze and IMU data is rotated to the b-frame using the rotational matrix
The relationship between the g-frame and the b-frame is defined by the rotational matrix R and the translation vector t. Since the offset between b-frame and g-frame is neglected, the origin of the two coordinate systems is assumed to coincide, thus t=[0; 0; 0]T.
A solution based on MVO processes the visual information from the camera to retrieve orientation measurements and pixel coordinates for faces.
The pipeline for obtaining the rotational matrix uses the OpenCV API and follows the general steps described in Section 2.4. The “true” and the twisted pair rotational matrices are retrieved as described in Section 2.4.4 but the hypothesis testing performed is described in Section 3.3.1. Two different methods were considered for estimating rotation using the camera.
1. Use LK optical flow for tracking features between consecutive frames.
2. Iteratively match descriptors in each frame with a key frame until the number of matches to the key frame is smaller than a certain threshold, whereupon the most recent frame is used as the new key frame. In the new key frame, new features have to be found and described.
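The key frame logic of the second method can be sketched as follows. This is only an outline of the replacement rule: `count_matches` is a hypothetical callable standing in for descriptor matching between two frames, and `min_matches` is an assumed threshold; neither name comes from the source or from OpenCV.

```python
def run_key_frame_loop(frames, count_matches, min_matches):
    """Return the indices of the frames that were selected as key frames.

    count_matches(key_frame, frame) is a hypothetical stand-in for matching
    feature descriptors between the key frame and the current frame.
    """
    key_frame = None
    key_indices = []
    for i, frame in enumerate(frames):
        # Replace the key frame when too few matches to it remain.
        if key_frame is None or count_matches(key_frame, frame) < min_matches:
            key_frame = frame
            key_indices.append(i)
            # In a full pipeline, features would be detected and described
            # anew in the new key frame at this point.
    return key_indices
```

In the toy usage below, frames are modelled as sets of feature identifiers and the number of matches is the size of the set intersection.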
The primary reason for using LK optical flow is the computational cost. The optical flow approach does not need a descriptor-based detector. Moreover, small translational movement can be assumed since the features are tracked between subsequent frames, reducing the need for scale invariant features. Due to the computational cost of describing features, only the three rotation invariant detectors mentioned in Section 2.4.1 are considered with the optical flow method. According to [2], the FAST detector is sensitive to noise and is therefore excluded. For the two remaining detectors, Harris and Shi-Tomasi, [3] describes the Shi-Tomasi detector as a modified and improved Harris detector; therefore, the Shi-Tomasi detector is used. The algorithm used for pose estimation using optical flow is described as pseudo code in Algorithm 1.
The second method implemented requires a descriptor-based detector. This reduces the number of choices to three: SIFT, SURF and ORB. Of these, both SIFT and SURF are patented and not included in the specific OpenCV package used; therefore, they are not considered any further. Algorithm 2 describes the key frame based method in pseudo code.
Compared to the optical flow approach, this is much more computationally expensive, primarily because features need to be detected in each frame and those features require a description. One advantage of a description based approach is that it is more robust, in that larger movements can be handled, and thus a lower sampling rate can be used than with optical flow. A combination of the two might therefore be preferred. Combining both is investigated in [9], but due to time constraints it is not investigated in this thesis.
Feature detection using Shi-Tomasi corner detection and tracking features using LK optical flow is visualized in
Feature detection and description using ORB and matching the descriptors for each frame with a key frame is visualized with one frame as an example in
This thesis is not a survey of different face detectors; thus, not much focus has been placed on finding the optimal face detector for the task, but several detectors have been considered, mainly those described in Section 1.5. The main parameters considered when choosing a face detector were speed and accuracy. In [59], several face detectors were tested for speed and accuracy. Two of the detectors in the test were the MFN and a version of the YOLO detector. MFN was faster by a factor of 10 compared to YOLO but had lower accuracy. Despite its lower accuracy, MFN was picked due to the significant speed difference. The output from the MFN detector is a bounding box. In this thesis, the center pixel coordinates (u; v) of the bounding box are used as a measurement of the position of the face. An example frame where three faces are detected is shown in
To filter measurements, two EKFs are implemented to estimate orientation and gaze direction. Their measurements are signals from the eye tracker and IMU and the estimated rotation from the computer vision module. The outputs are estimates of the orientation and angular velocity of the glasses and the direction and angular velocity of the gaze. Wherever quaternions are modified, e.g., in the measurement update, they are normalized to represent a proper orientation.
To estimate the orientation of the glasses, a nearly constant angular velocity model is used. A constant angular velocity model is also used in [54], where wearable sensors are used to estimate pose. The model is extended with a constant gyroscope bias model,
In (3.1), the state vector consists of the unit quaternion qk=[q0 q1 q2 q3]T representing the orientation of the b-frame relative to the g-frame, the angular velocity ωk=[ωx ωy ωz]T, in radians per second of the b-frame, and the gyroscope bias bkgyr=[bgyr
The process noises wkω=[wω
The IMU placement is visualized in
where R(qk) is the rotational matrix from the g-frame to the b-frame, parametrized using the unit quaternion. Furthermore, ak defines the acceleration of the glasses, g is the gravitation and ekacc the measurement noise, distributed ekacc˜N(0, Racc). Since the IMU is used to estimate the orientation only, ∥a∥<<g is assumed, and the measurement model for the accelerometer is reduced to
Furthermore, the influence of large accelerations is mitigated by using only accelerometer measurements satisfying |g−∥yacc∥|<ϵa, where ϵa is a threshold. The gyroscope measurements are defined as
where ωk is the angular velocity of the glasses, bkgyr the gyroscope bias and ekgyr the measurement noise which is distributed ekgyr˜N(0, Rgyr).
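The accelerometer gating described above, where a sample is used only if the magnitude of the measured acceleration is close to gravity, can be sketched as below. The value of `G` and the name `eps_a` for the threshold ϵa are assumptions for the example; the source does not state the numbers used.

```python
import math

G = 9.81  # assumed magnitude of gravity in m/s^2

def accept_accelerometer_sample(y_acc, eps_a, g=G):
    """Return True if the sample satisfies | g - ||y_acc|| | < eps_a."""
    norm = math.sqrt(sum(c * c for c in y_acc))
    return abs(g - norm) < eps_a
```

Samples that fail the test would simply be skipped in the accelerometer measurement update, so that only the gyroscope drives the orientation estimate during large accelerations.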
The use of gaze data to estimate gyroscope bias (GEB) is investigated. Measurements from the gyroscope taken when the gaze vector is assumed stationary in the b-frame, i.e., when the gaze direction is fixed relative to the head, are used as bias measurements. A gaze direction that is fixed in the b-frame indicates that the head is stationary in the g-frame, if it is assumed that one does not follow a moving object with synchronised eye and head movements. Such a scenario is assumed rare enough to be disregarded. A measurement model for gyroscope bias would be expressed
where ykbias consists of the gyroscope measurements, bkgyr is the gyroscope bias and ekGEB is the corresponding measurement noise, distributed ekGEB˜N(0, RGEB). Measurement updates are performed after each gaze sample that indicates a fixed head.
To determine that the gaze is fixed in relation to the b-frame, the angular velocity of the gaze vector between every two consecutive eye samples is calculated. If this velocity is below a threshold, ϵGEB, the head is assumed to be stationary and the average of the gyroscope measurements between the samples is used as a bias measurement. This method is similar to I-VT presented in Section 2.5.1 and a threshold has to be chosen. It is important that small eye movements are identified, and thus this threshold has to be chosen low in comparison to when saccades are to be identified, as is the case in Section 2.5.1.
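A minimal sketch of how a bias measurement could be formed from two consecutive gaze samples and the gyroscope samples recorded between them is given below. The function names, the use of degrees per second and the name `eps_geb` for the threshold are assumptions; the subsequent EKF measurement update is not shown.

```python
import math

def gaze_angular_speed_deg(v0, v1, dt):
    """Angular speed in deg/s of the gaze vector between two eye samples."""
    dot = sum(a * b for a, b in zip(v0, v1))
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    cos_angle = max(-1.0, min(1.0, dot / (n0 * n1)))
    return math.degrees(math.acos(cos_angle)) / dt

def bias_measurement(gaze_prev, gaze_curr, dt, gyro_samples, eps_geb):
    """Return the averaged gyroscope sample as a bias measurement, or None.

    None is returned when the gaze moves faster than eps_geb in the b-frame
    (head not assumed stationary) or when no gyroscope samples are available.
    """
    if gaze_angular_speed_deg(gaze_prev, gaze_curr, dt) >= eps_geb:
        return None
    if not gyro_samples:
        return None
    n = len(gyro_samples)
    return tuple(sum(s[i] for s in gyro_samples) / n for i in range(3))
```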
Section 3.2 describes the method used for retrieving the two hypotheses to estimate rotation between frames. Let δqa and δqb be the hypotheses expressed as unit quaternions and q̂−1 be the estimated orientation at the time of the first frame. Each measurement is generated by rotating q̂−1 with (δqa, δqb) using (2.5), resulting in two hypotheses of the current rotation as measurements, denoted qa and qb respectively. Hypothesis testing is performed within the EKF to decide which, if any, of the measurements should be used.
The hypothesis test is conducted by performing the prediction step in (2.7) and comparing q̂k|k-1 with both hypotheses
If ∥yMVO−q̂k|k-1∥<ϵMVO, where ϵMVO is a threshold, a measurement update is performed. Otherwise only the prediction step is performed. The resulting measurement model is
where ekMVO is camera measurement noise which is distributed ekMVO˜N(0, RMVO).
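The hypothesis test can be sketched as below. Since the selection equation is not reproduced in the text, the sketch assumes that the hypothesis closest to the predicted quaternion is chosen; treating q and −q as the same rotation when measuring distance is also an assumption, as are all function and parameter names.

```python
import math

def quat_distance(q, p):
    """Euclidean distance between two unit quaternions.

    Assumption: q and -q are treated as the same rotation, so the smaller
    of ||q - p|| and ||q + p|| is returned.
    """
    d_pos = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p)))
    d_neg = math.sqrt(sum((a + b) ** 2 for a, b in zip(q, p)))
    return min(d_pos, d_neg)

def select_mvo_measurement(q_pred, q_a, q_b, eps_mvo):
    """Pick the hypothesis nearest to the prediction, or None if both fail."""
    best = min((q_a, q_b), key=lambda q: quat_distance(q, q_pred))
    if quat_distance(best, q_pred) < eps_mvo:
        return best  # a measurement update would be performed with this
    return None      # only the prediction step is performed
```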
A nearly constant angular velocity model is used to estimate gaze angle and velocity of the gaze vector in the b-frame,
The angle between the gaze direction vector and the b-frame xy-plane is denoted α (alpha) and the angle between the gaze vector and the b-frame xz-plane is denoted β (beta).
The velocity of α is denoted γ (gamma) and the velocity of β is denoted δ (delta). Physical limits restrict gaze direction, thus α and β are limited to values between +/−90°. The process noises are distributed,
wkα˜N(0,Qα) and wkβ˜N(0,Qβ).
Since gaze direction is highly unpredictable and the velocity can vary quickly, a constant velocity model might not be the optimal dynamical model to predict gaze. With this in mind, the process noise of the model is set high in comparison to the measurement noise.
As measurements in the gaze model, eye angles are used. The directions α and β are calculated from the gaze direction vector (gv), depicted as gaze position 3D in
The measurement model is
Measurements will be restricted to less than ±90° by physical limits. The measurement noise is distributed ekeye˜N (0, Reye).
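The conversion from a gaze direction vector to the two eye angles follows from the geometric definition of the angle between a vector and a plane. The sketch below assumes the gaze vector is expressed in the b-frame as (x, y, z) components; the sign conventions and the function name are assumptions.

```python
import math

def gaze_angles_deg(gv):
    """Return (alpha, beta) in degrees for a gaze direction vector gv.

    alpha: angle between gv and the xy-plane (from the z component).
    beta:  angle between gv and the xz-plane (from the y component).
    Signs follow the component signs, which is an assumed convention.
    """
    gx, gy, gz = gv
    norm = math.sqrt(gx * gx + gy * gy + gz * gz)
    alpha = math.degrees(math.asin(gz / norm))
    beta = math.degrees(math.asin(gy / norm))
    return alpha, beta
```

Because `asin` returns values in [−90°, 90°], the result respects the physical limits stated above by construction.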
To be able to analyse and possibly predict the gaze patterns of a user, it is advantageous to know the type of eye movement they perform. To classify whether the user is in a fixation or in a saccade, the I-VT filter described in Section 2.5.1 is used and a threshold on gaze velocity in the g-frame has to be set. If the threshold is exceeded, the movement is classified as a saccade, otherwise it is classified as a fixation. The velocity of the eyes in the g-frame is divided into one horizontal and one vertical angular velocity. The vertical velocity is calculated as the difference between γ and ωy and the horizontal velocity is calculated as the difference between δ and ωz. It is assumed that ωx does not affect either γ or δ significantly.
The tracking module estimates the position of faces in the g-frame using an EKF given the estimated head orientation from Section 3.3.1 and the position of detected faces obtained as described in Section 3.2.2.
The output from Section 3.2.2 is an image projection of a 3D point. Since no depth data is available and the origin of the g-frame and c-frame are assumed to coincide, a face position is parameterised as a unit vector, f=[fx; fy; fz], in the g-frame. Each face is assumed to be moving at speeds low enough for a constant position model described by
with the process noise wf distributed wf˜N(0, Qf).
A calibrated camera with camera intrinsic matrix K will be used. Using a calibrated camera, normalized camera coordinates mc, defined as
can be used, where u and v are pixel coordinates of a detected face. From this, a three dimensional unit vector can be obtained as
and the corresponding measurement is
This results in a measurement model for a face as
where R(qk) is the rotational matrix from the g-frame to the b-frame, Rcb the rotational matrix from the c-frame to the b-frame and ekf the camera measurement noise. The measurement noise is distributed ekf˜N(0, Rf).
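As a hedged illustration of the step from pixel coordinates to a unit vector, the sketch below assumes a distortion-free pinhole camera whose intrinsic matrix K has focal lengths `fx`, `fy`, principal point (`cx`, `cy`) and zero skew, with the camera z-axis along the optical axis. These parameter names and assumptions are not from the source, and the rotation into the g-frame is left out.

```python
import math

def pixel_to_unit_vector(u, v, fx, fy, cx, cy):
    """Map a pixel (u, v) to a unit direction vector in the camera frame."""
    # Normalized camera coordinates, i.e. K^-1 [u, v, 1]^T for zero skew.
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)
```

A pixel at the principal point maps to the optical axis, and a pixel one focal length to the side maps to a direction 45° off the axis.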
All object detection software will have some degree of false detection. To suppress the impact of these, a couple of data association methods were implemented. The tracking solution was derived in a pragmatic way until it was considered good enough for the situations in which it was to be used. For each detected face in a frame, a measurement yf is generated. Linking yf to a face is done using the nearest neighbor method where the angle
is calculated for all currently tracked faces. Nearest neighbor is one of the simplest ways of associating measurements with tracks [19] and is assumed to be sufficient for the application. αf is used as a distance measure, and if αf>ϵf for all tracked faces a new track is initiated. Otherwise, the measurement update is performed for the nearest neighbor, i.e., the track with the smallest αf. Furthermore, to reduce the number of false detections tracked, a counter for each new track is introduced. For each frame in which a track does not get any associated measurement, the counter for that track ticks down. If the counter decreases below zero, the track is deleted, and if the counter increases to a threshold the track is confirmed. Tracks are also deleted if no measurements can be associated with the track during a set time.
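The association and track management logic can be sketched as follows. The class and function names are hypothetical, the counter is assumed to tick up by one on each association, the EKF measurement update is replaced by simply overwriting the stored direction, and the time-based deletion is omitted.

```python
import math

def angle_between_deg(a, b):
    """Angle in degrees between two 3D direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (na * nb)))))

class Track:
    def __init__(self, direction):
        self.direction = direction  # unit vector towards the face
        self.counter = 0
        self.confirmed = False

def associate(tracks, y_f, eps_f_deg, confirm_at):
    """Associate measurement y_f with the nearest track or start a new one."""
    nearest = min(tracks, key=lambda t: angle_between_deg(t.direction, y_f),
                  default=None)
    if nearest is None or angle_between_deg(nearest.direction, y_f) > eps_f_deg:
        nearest = Track(y_f)
        tracks.append(nearest)
        return nearest
    nearest.direction = y_f   # stand-in for the EKF measurement update
    nearest.counter += 1      # assumption: counter ticks up when associated
    if nearest.counter >= confirm_at:
        nearest.confirmed = True
    return nearest

def age_unmatched(tracks, matched):
    """Tick down tracks without a measurement in this frame; delete below zero."""
    matched_ids = {id(t) for t in matched}
    for t in list(tracks):
        if id(t) not in matched_ids:
            t.counter -= 1
            if t.counter < 0:
                tracks.remove(t)
```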
The glasses used were a pair of Tobii Pro 2 glasses, seen in
They are equipped with one front facing monocular camera, eye tracking sensors to record the direction of the eye gaze, an inertial measurement unit (IMU) and a microphone. The scene camera is of type OV2722, a 1080p HD camera from OmniVision. The IMU consists of a gyroscope and accelerometer which are of type L3GD20 and LIS3DH from STMicroelectronics. The eye tracker uses the dark pupil method described in [40]. The glasses provide data using the data structure described in [55].
For ground truth, a Qualisys motion capture (mocap) system was used. The mocap system determines the position of reflective markers using cameras. If a rigid body is defined using several markers, the position and orientation of the object can be calculated if at least three markers can be located. The Qualisys setup in the Visionen laboratory at Linköping University was used. This setup contains twelve cameras covering a room with dimensions 10 m × 10 m × 8 m. For synchronisation between the glasses and Qualisys, a hardware synchronisation message was sent to the glasses via a sync cable when the Qualisys recording was started.
For sound recording, hand held microphones were used where each talker had one microphone each. Sound was also recorded with the video from the glasses. For synchronization between the glasses and the microphones, cross-correlation between the recorded audio from the video and the microphones was performed to the extent that was possible. If the cross-correlation sync failed, manual synchronization was used.
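Synchronization by cross-correlation amounts to finding the lag at which two recordings of the same sound line up best. The brute-force sketch below works on short lists of samples and is meant only to show the idea; the function name and the `max_lag` search window are assumptions, and a practical implementation would typically use an FFT-based correlation.

```python
def best_lag(reference, other, max_lag):
    """Return the lag in samples by which `other` is delayed relative to
    `reference`, chosen to maximize the cross-correlation."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                score += r * other[j]
        if score > best_score:
            best, best_score = lag, score
    return best
```

A sharp, loud event such as the clap used to start and end each experiment gives a distinct correlation peak, which makes this kind of alignment easier.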
As ground truth of the position and orientation of the glasses, six markers placed as in
In the mocap system, the coordinate system of the glasses was defined from the position from where the user was sitting, hence, constant errors might have occurred compared to the estimates if the g-frame and b-frame were not completely aligned when the body was defined in Qualisys.
The position and orientation of the faces were tracked using three different caps with three markers placed on each. For experiments where the subjects were sitting the caps were associated with a certain chair as seen in
Note that the tracking performance of Qualisys varied considerably and the rigid bodies sometimes had to be redefined; the ground truth should therefore be used conservatively.
In this section, procedures of the performed experiments are explained. For the experiments described in Section 4.4.1 to 4.4.3, four test subjects were used where the one with the glasses will be referred to as the user. The experiments were performed as listed below.
1. Calibrate glasses and start recording on glasses.
2. Start Qualisys recording.
3. Start sound recording.
4. Get into position and start experiment with a clap.
5. Perform experiment.
6. End experiment with a clap.
7. End sound and Qualisys recording.
8. Stop recording on glasses.
The first two experiments consisted of a passive user following a two minute conversation between three subjects as seen in
The second experiment (PSV2) was almost identical to PSV1 with the exception that the user was allowed to rotate their head. This is a more natural way of attending a conversation, and the subjects were not in the FOV at all times, which challenged the tracking solution. Both PSV1 and PSV2 were performed twice for each subject.
The third experiment (Q&A) consisted of questions and answers in which the subjects asked the user questions from a quiz game. Each subject had five question cards and the user did not know who would ask the next question. The subjects were seated as in
During a normal conversation experiment (NormSp), the subjects and the user were standing and held a normal conversation for a non-specified time, once for each subject. This tests the whole system on the CtP problem in the most realistic environment among the tests performed. The user could attend a conversation with one subject while the other two might be having another conversation. From these experiments, data about how often a user is looking at different subjects could be extracted.
An experiment to excite VOR eye movements (ExpVOR) was performed. The user focused on a point for the whole duration of the experiment while rotating his head back and forth horizontally. The experiment was performed with two distances to the fixation point, one short of about 0.2 m and one longer of about 1.5 m. This experiment was performed to clarify how much the difference between eye and head velocity varied during VOR eye movements and how it is affected by the distance to the point of fixation.
An experiment where the user followed a dot stimulus with their gaze (DotSac) was performed. The stimulus was a red dot which induced horizontal saccades by changing position instantaneously. The dot stimuli were run on a laptop screen and could be set to excite either only long saccades, of more than 3°, or both long and short saccades. This experiment was used to investigate eye movement classification. Three experiments were performed with the dot stimuli. In DotSac1 the stimulus which only induced long saccades was used and the user followed the dot with both gaze and head movements. The goal of this experiment was to get information on how well saccades could be identified and separated from VOR eye movements. In DotSac2 the long saccade stimulus was used, but the user rotated his head back and forth for the full duration of the experiment. In DotSac3 the short saccade stimulus was used and the user kept his head still. This experiment was performed to get information on the approximate minimum angle of saccades one could expect to be able to identify.
The results of the head orientation estimation are presented in this section. First, orientation estimates and ground truth for the different experiments are presented; then the dynamic response and the errors it leads to are discussed. Last, it is shown how well yaw drift was mitigated by estimating gyroscope bias. Throughout Sections 5.1.1 to 5.1.3, the resulting plots for each experiment are from the same test unless otherwise mentioned. For a simpler analysis, ground truth and yaw estimates are set to zero at the start of each test. The experiments presented are PSV1, PSV2 and NormSp since they excited the system in different ways. The base EKF uses IMU measurements only and no estimated bias; extensions to the base EKF where different measurements are included are presented with notation in Table 5.1 and described further down.
From PSV1, the performance of the different methods of mitigating drift is in focus. For reference, the estimated roll, pitch and yaw from the EKF using only the IMU and no bias estimates during an experiment are shown in
The plot shows a test of approximately two minutes where roll and pitch followed ground truth well but yaw drifted more than 200°. A straightforward way to reduce drift would be to use a constant bias (cEKF). From a simple test with stationary glasses, the constant gyroscope biases were estimated to bgyr=[4.066; 1.430; 0.9093]T °/s. With this constant bias, estimates and ground truth are shown in
With a constant bias there were no significant differences in roll and pitch compared to without bias compensation, but the drift in yaw was reduced to approximately 40° in total for the two minute test. For succeeding estimations the constant bias bgyr=[4.066; 1.430; 0.9093]T °/s is used when initiating the EKF unless otherwise mentioned.
The use of a constant bias was an improvement compared to not estimating bias, but showed that the gyroscope bias varied over time since drift was still significant. The constant bias was estimated a different day than the experiments were performed which might be the reason for the poor performance. This strengthens the assumption that the variation of gyroscope bias significantly worsens the orientation estimate, thus some way for continuous estimation of bias was desired.
A method investigated for estimating the gyroscope biases was to use eye gaze data to estimate when the user was stationary (gEKF) as described in Section 3.3.1. The resulting roll, pitch and yaw estimates from the EKF are shown in
The drift in yaw seemed to decrease throughout the whole test, indicating that the bias estimation had not fully converged at the end of the test. To visualize the bias estimation, data from PSV1 was used to estimate biases initialized at bgyr=[0; 0; 0]T °/s. The GEB was started ten seconds before the start clap and the result is shown in
With the front camera, MVO measurements were obtained and used to estimate orientation. Two different methods for MVO measurements were investigated, OF (ofEKF) and key frame (kyEKF). Using OF for MVO measurements, the resulting roll, pitch and yaw estimates are shown in
No large difference in roll and pitch can be seen, but the drift in yaw clearly varies and the total estimation error was reduced to less than 10°. The varying drift could be due to movement in the scene or poor camera calibration. The second method matches ORB features against a key frame; the resulting estimates from the kyEKF are shown in
With a kyEKF, the estimation error in yaw seemed to be fully mitigated and the dynamics were similar to the other estimation methods.
During the PSV2 tests, the effect of dynamics could be seen more distinctly compared to PSV1. For reference, cEKF with bgyr=[4.066; 1.430; 0.9093]T °/s was used and is shown in
The result of gEKF is seen in
During the NormSp experiments, the subjects and the user were standing, which might have induced other dynamics compared to PSV1 and PSV2. All estimation methods, cEKF, gEKF, ofEKF and kyEKF, performed similarly in roll and pitch, following the ground truth slightly poorly, which is seen in
As the results presented in Sections 5.1.1, 5.1.2 and 5.1.3 indicate, including MVO measurements improved the orientation estimate, reducing the error in yaw. Two methods based on MVO were investigated, ofEKF and kyEKF. Using kyEKF resulted in the best estimate for all tests, but the performance varied depending on experiment type. For stationary tests with kyEKF, no noticeable estimation error could be seen, while for experiments where the user was free to rotate their head, the yaw estimate deteriorated at large dynamical events. Such events resulted in loss of key frames, and any new key frame might have been poor due to blurry images. Without a key frame, no orientation estimate could be obtained. However, if a good key frame was available, no significant error in yaw should occur, since the camera measurements are absolute relative to the state in which the key frame was obtained. Performance using ofEKF seemed to be more sensitive to external disturbances, which was visible during the PSV1,
The benefit of using OF is the lower computational power needed. The preferred method would probably be a combination of both key frames and OF, as mentioned in Section 3.2.1. One approach could be to downsample the key frame loop and use OF for the other frames. Furthermore, the current solution using kyEKF only keeps one key frame, which could be improved by using multiple key frames.
When the assumption ∥a∥<<g (∥a∥ is illustrated by the orange dotted line in the lowermost plot of the figure) was not true, the estimates in roll and yaw deteriorated. Roll eventually returned to a more accurate estimate, but the loss in yaw was permanent since no absolute measurement of yaw was available.
The assumption ∥a∥<<g, i.e., neglecting accelerations, results in centripetal and acceleration forces being identified as gravity. This affects roll and pitch estimates since angular rate in z in the b-frame is projected to x and y in the g-frame, thereby leading to errors. To some extent, these disturbances are mitigated by setting a threshold for the normalized accelerometer vector, but too small a threshold would reduce the number of samples too much, impairing the estimate and leaving it more sensitive to accelerometer calibration errors. Another aspect which would speak against the use of a threshold was that the accelerometer measurements seemed to depend on the battery voltage, but this has to be further investigated. Another method of mitigating the impact of large dynamical events could be to include acceleration in the model. Additionally, if the rotational center and angular acceleration were to be estimated, the influence of centripetal forces and acceleration due to rotation could be mitigated.
In this section, results on how the yaw drift was affected by GEB, described in Section 3.3.1, are presented. Results from PSV1, PSV2 and NormSp are depicted to visualize how the bias estimate performed under varying conditions. PSV1, where the user was stationary and kept their head still, should be less challenging than PSV2 and NormSp where the user was allowed to rotate their head. The drift in all plots in this section was calculated as
where the time window (t1−t0) was set to four seconds. With this way of calculating drift, model errors as described in Section 5.1.5 will show up as peaks in drift. Thus, when analyzing the following plots one should keep in mind that a peak in drift without a corresponding change in bias estimate is probably not due to a poor bias estimate.
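Since the drift equation is not reproduced in the text, the sketch below shows one plausible reading only: drift as the change in yaw error over a sliding window of four seconds, scaled to degrees per minute. The definition, the function name and the parameter names are all assumptions.

```python
def drift_deg_per_min(times_s, yaw_error_deg, window_s=4.0):
    """Assumed drift definition: (e(t1) - e(t0)) / (t1 - t0) in deg/min,
    where t1 is the first sample at least window_s seconds after t0."""
    drift = []
    j = 0
    for i, t0 in enumerate(times_s):
        # Advance j to the first sample that closes the time window.
        while j < len(times_s) and times_s[j] - t0 < window_s:
            j += 1
        if j == len(times_s):
            break  # no complete window left
        delta_err = yaw_error_deg[j] - yaw_error_deg[i]
        drift.append(delta_err / (times_s[j] - t0) * 60.0)
    return drift
```

With such a definition, a short-lived model error inside the window produces a peak in the drift curve even when the bias estimate itself is unchanged, which matches the caveat given above.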
The same appearance as in earlier figures can be seen in
[Figure: drift [°/min] (upper panel) and gyroscope bias [°/s] (lower panel) for tests 1 to 4, plotted against ybias [−].]
From
The performance of the bias estimate varied depending on experiment type and the initialisation of the filter. For most experiments, the drift was reduced below 20°/min with GEB, which suggests that this might be a promising way of estimating bias. The drift generally decreased throughout the tests, indicating that the bias estimation had not fully converged at the end of the tests. Longer tests would be needed to see whether the drift stagnates between +/−20°/min or converges towards lower drift. Results show that the gaze cannot be used to determine that the head is still with complete accuracy, since the estimated bias varied significantly more during tests with a moving head. However, the bias seemed to possibly converge to some value. With a lower process noise on the bias in the dynamic model, the variance could possibly be reduced and a more steady drift might be achieved.
In this section the results regarding tracking are presented. All plots show confirmed tracks as described in Section 3.3.2 and are taken from experiments accentuating different behaviors; the influence of the different orientation estimation methods is also presented. No ground truth of the tracking is included in the plots. As can be seen later in Section 5.4, the face detections using the camera can be assumed to be accurate since gaze direction and a track frequently coincide. The detection and tracking of subjects were initiated before the tests began. For a PSV1 experiment using a cEKF, the tracker was able to keep tracks during the whole duration of the experiment, even though there was significant drift in yaw, which can be seen in
For a NormSp test,
The use of a constant position model seemed to be enough to keep track of the faces when the movements of both user and subjects were small enough. The angular dispersion between faces was large enough to associate detections with the correct face for the most part. If larger movements of user and talkers were allowed, or if subjects were closer to each other, there would probably be a need for a more stringent tracking solution. For those situations a constant position model might not be enough. Also, other solutions such as face recognition software would probably simplify the data association step. Regarding false detections, the use of a counter seemed to serve its purpose. The presented results show no tracks initiated where there was no face.
This section presents the results regarding gaze data. Results from ExpVOR and DotSac are displayed. These results show how eye movements and head movements correlated as well as how eye and head velocity varied during saccades and fixations.
From
In terms of saccade/fixation classification,
The results from the ExpVOR experiments indicate that eye gaze data could be used to support the yaw estimate using VOR, since gaze direction and yaw correlated well during fixation. They also show a dependency of depth of gaze which would be expected. Due to translational movement of the eyes while the head rotates an overestimation of the yaw angle occurred. This overestimation depended on the depth of gaze and was smaller at longer distances. If the rotational center of the head and the depth of gaze could be estimated, a better measurement model could be created. Otherwise, only measurements where the depth of gaze is large enough would be preferred. If gaze is to be used to support the yaw estimate, an accurate fixation classifier is crucial since head movements and gaze only correlate during fixation as mentioned in Section 2.5.1.
The results from DotSac1 showed that saccades to follow the dot stimulus were performed significantly faster than the rotation of the head. This implies that using eye data, compared to only using head direction for steering, could notably improve performance in terms of speed, which aligns with previous results from, among others, [22] and [13]. The detection of saccadic movements was not as clear since it was highly dependent on the amplitude of the saccade. As mentioned in Section 2.5.1, a threshold between 30°/s and 70°/s has performed well in other studies. This would coincide well with results in
With an accurate head orientation estimation, one could have some measure of how much the gaze point moves over a longer time interval to possibly determine if the focus of the wearer is directed to some delimited area instead of accurately trying to classify every saccade and fixation.
Results on the system which aim to illustrate how it can be used to estimate the attention of the user are presented below. Plots show tracked faces along with gaze and head direction. The orientation was estimated with gKyEKF for all plots in this section. To have some kind of reference for where the user might have directed their attention, it is indicated when the sound level from each subject's microphone exceeded a threshold.
In NormSp, where the subjects were positioned such that the angles between them were considerably large, a significantly long time could pass without any measurements from the face detector associated with a certain face. This can be seen in
The assumption that gaze data would be beneficial to steer with in terms of speed seemed to hold for most cases. However, as can be seen in
Based on the presented results, it should be possible to determine the attention of a user with the system developed for situations like those in the experiments. To get an accurate determination of direction, a well performing head orientation estimation would be beneficial. For most of the experiments, the system at hand seemed to perform well enough. However, this depended on no translational movements of either user or subjects being allowed. As can be seen in the results, the user tended to direct their gaze towards a face; thus, solutions to steer beamformers like the ones described in [24] could possibly be implemented with the system for further evaluation.
A satisfactory face tracking result was achieved using an extended Kalman filter with a constant position model to track faces parametrized as a unit vector to the track. Measurements consisted of face detections retrieved using a MobileFaceNets face detector. The system was used to track up to three faces, but no upper limit on the number of faces possible to track was investigated. The sensitivity to orientation estimate errors was highly dependent on the frequency of the detections, resulting in multiple tracks of the same face being created during experiments with less frequent detections and varying orientation errors. False detections were handled successfully, meaning that no tracks were initiated on false detections.
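The idea of a constant position model for a direction-valued state can be illustrated with a deliberately simplified filter. The sketch below is not the extended Kalman filter of the described system. It assumes that each detection has already been converted to a unit vector in the same frame as the track, and it uses a single scalar (isotropic) variance in place of a full covariance matrix, so the update reduces to a linear Kalman step followed by renormalization. The class name and all noise values are hypothetical.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def angle_between_deg(u, v):
    """Angle in degrees between two direction vectors."""
    dot = sum(a * b for a, b in zip(normalize(u), normalize(v)))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


class FaceTrack:
    """Direction to one face as a unit vector with a scalar variance."""

    def __init__(self, direction, var=0.1):
        self.x = normalize(direction)
        self.p = var

    def predict(self, q=1e-3):
        # Constant position model: the state is unchanged, the uncertainty grows.
        self.p += q

    def update(self, z, r=1e-2):
        # Assumed measurement: a unit vector toward the detected face (H = I).
        k = self.p / (self.p + r)
        z = normalize(z)
        self.x = normalize([xi + k * (zi - xi) for xi, zi in zip(self.x, z)])
        self.p *= (1.0 - k)
```

In such a scheme, a long interval without detections only increases the variance through repeated calls to `predict`. A helper such as `angle_between_deg` could be used to gate detections against existing tracks, which is one conceivable way to limit duplicate tracks when the orientation error varies between infrequent detections.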
The use of eye data to support the yaw estimate seems promising, and two methods have been investigated. Using eye gaze to estimate gyroscope bias has been implemented and reduced yaw drift; furthermore, the use of VOR movements as yaw measurements also appears to have some potential. An important requirement for these methods to work is a robust classifier of eye movements. The use of an I-VT filter to classify eye movements was investigated. The results indicate that such a filter could perform well in identifying saccades. However, the result is highly dependent on the threshold, and a choice has to be made as to how short the saccades of interest are. Moreover, gaze direction seemed to be a good indication of whether someone was attending to a talker, since gaze direction and current talker frequently coincided.
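A velocity-threshold (I-VT) classification can be sketched in a few lines. The example below is a minimal illustration, assuming a one-dimensional gaze angle in degrees sampled at a fixed rate and a sample-to-sample difference as the velocity estimate. The function name is hypothetical, and the default threshold of 50°/s is merely a value inside the 30°/s to 70°/s range referred to earlier, not a tuned parameter of the described system.

```python
def ivt_classify(gaze_deg, sample_rate_hz, threshold_deg_per_s=50.0):
    """Label each gaze sample 'saccade' or 'fixation' from its angular speed."""
    labels = []
    for i in range(len(gaze_deg)):
        if i == 0:
            # No previous sample is available, so the speed is taken as zero.
            speed = 0.0
        else:
            speed = abs(gaze_deg[i] - gaze_deg[i - 1]) * sample_rate_hz
        labels.append("saccade" if speed > threshold_deg_per_s else "fixation")
    return labels
```

With this formulation, lowering the threshold makes shorter saccades detectable but also makes the labels more sensitive to measurement noise, which is consistent with the threshold dependence noted above. A plain velocity threshold of this kind does not by itself separate VOR or smooth pursuit movements from fixations.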
The overall objective given the stated limitations has been achieved. The system can be used to track faces in the environment of a user and gaze direction can be used to estimate attention.
It is intended that the structural features of the devices described above, either in the detailed description and/or in the claims, may be combined with steps of the method, when appropriately substituted by a corresponding process.
As used, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well (i.e. to have the meaning “at least one”), unless expressly stated otherwise. It will be further understood that the terms “includes,” “comprises,” “including,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element but an intervening element may also be present, unless expressly stated otherwise. Furthermore, “connected” or “coupled” as used herein may include wirelessly connected or coupled. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. The steps of any disclosed method are not limited to the exact order stated herein, unless expressly stated otherwise.
It should be appreciated that reference throughout this specification to “one embodiment” or “an embodiment” or “an aspect” or features included as “may” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the disclosure. The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects.
The claims are not intended to be limited to the aspects shown herein but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more.
Accordingly, the scope should be judged in terms of the claims that follow.
Number | Date | Country | Kind |
---|---|---|---|
20193482.5 | Aug 2020 | EP | regional |