Disclosed are embodiments related to adapting sense datastreams in augmented reality and virtual reality environments. Certain embodiments relate to the internet of senses, machine learning, and reinforcement learning.
The internet of senses promises experiences utilizing all senses in virtual or augmented reality (VR/AR) environments, for example as described in Reference [1]. In such environments, users can interact with digital objects using one or more of their senses. From a network perspective, sensory input to a user's VR/AR device is a set of synchronized datastreams (i.e., sound, vision, smell, taste, and touch). In a VR/AR environment, the end-user device feeds the data to the sensory actuators for human users to experience.
Considering sensory perception, however, people perceive things differently. For example, hyperesthesia is a condition that involves an abnormal increase in sensitivity to stimuli of a particular sense (e.g., touch or sound). Hyperesthesia may have unwanted side-effects; for example, some people may hear painfully loud noises where in reality the volume is not that high. Another condition, called misophonia, triggers unreasonable physiological or psychological responses as a reaction to certain sounds (for example, food munching). Finally, hypoesthesia, being the reverse of hyperesthesia, is a condition where sense perception is reduced. For example, a certain smell or sound may not be perceived as strongly. In addition to medical conditions, there are also personal preferences on taste, smell, touch, sound, and/or view that may trigger a positive or a negative response.
Further to the above, Reference [9] uses sensory data from users themselves and from users' environment to generate user experiences. Reference [10] is targeted at removing noise from images captured by a camera and correlating haptic feedback with the image content.
When it comes to gauging the emotional state of users, there exist several approaches, from invasive to non-invasive. A comprehensive review is discussed in Reference [2]. The following list is approximately ranked from least to most invasive given the current state of the art; potentially, in the future, more invasive approaches will become easier to attain, or such implants may be considered the norm.
Heart Rate Variability (HRV) measures the time between heartbeats (the heart rate), which may require fewer sensors than other approaches. For example, wearables such as the Apple Watch can measure HRV.
Skin temperature measurements (SKT) measure the temperature at the skin (e.g., a sweating person would have higher skin temperature) and relate it to human emotional state.
Respiration Rate Analysis (RRA) measures the respiration velocity and depth. It is possible to implement RRA with non-contact measurement methods such as video cameras and/or thermal cameras.
Facial Expressions (FE), body posture (BP), and gesture analysis (GA) are non-invasive methods that have become increasingly popular in recent years due to advances in computer vision, specifically visual object detection and the use of convolutional neural networks. In these methods, images of the face and body are used, and algorithms detect and analyze different types of expressions and postures and correlate them with emotional states.
Electrooculography (EOG) uses either electrodes placed above and below the eye and to its left and right, or special video cameras (such as video oculography and infrared oculography systems), to measure eye movement that could indicate an emotional state.
Electroencephalography (EEG) uses a special device called an electroencephalograph to collect EEG signals. This device contains electrodes attached to the human scalp using an adhesive material or a headset. A subsequent analysis of frequency ranges generated from the EEG signals may identify different emotional states. Cutting-edge EEG devices may even be portable.
Electrocardiography (ECG) uses fewer sensors (compared to EEG), positioned on the human body—instead of measuring brain waves this method measures the electrical activity of the heart.
Galvanic Skin Response (GSR) measures electrical parameters of the human skin and requires several sensors placed in different parts of the body.
Electromyogram (EMG) uses electrodes to measure neuromuscular abnormalities, which could be triggered as an emotional reaction.
In Reference [9], there is no monitoring of the emotional state of the consumer/user to alter the intensity of an existing datastream. Instead, contextual information such as foot-related pressure readings, as well as physiological characteristics such as shoulder width, arm length, muscle tone, vein signature, style of movement, etc., is obtained in order to perform access control, i.e., permit or block datastreams from reaching the user.
In Reference [10], a marker indicator is embedded in an image frame (and not the emotional state of the user) to remove image noise and render haptic feedback to the user. In addition, haptic feedback is generated and not adjusted.
Accordingly, there is a deficiency in the state of the art when it comes to adapting sensory input media streams (also referred to henceforth as datastreams) in response to the emotional state of the consumer. Every consumer of the datastreams has different sensitivities and personal preferences; embodiments disclosed herein learn those sensitivities and preferences over time in an automated, unsupervised way (i.e., without user feedback).
According to some embodiments, a system and method for learning to enhance or subdue sense-related data contained in datastreams based on the reactions of the user is provided. The embodiments disclosed herein provide an intuitive way of adapting the datastreams, which can improve the overall user experience of AR/VR and mixed reality applications.
In one aspect, a computer-implemented method of processing a stream of sensory data is provided. The method includes obtaining an input stream of sensory data from a source, wherein the input stream of sensory data comprises input for a sensory actuator of a user device. The method includes obtaining state information, wherein the state information comprises information indicating a first state of a user. The method includes determining, using a machine learning model, a desired second state of the user based on the obtained state information. The method includes determining an action to process the input stream of sensory data based on the desired second state of the user. The method includes generating an output stream of sensory data by processing the input stream of sensory data in accordance with the determined action. The method includes rendering the output stream of sensory data to the sensory actuator of the user device.
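The sequence of method steps above can be illustrated with a minimal, runnable sketch. Every class and function here is a hypothetical placeholder, including the simple rule-based stand-ins for the machine learning model and the action policy; the disclosed embodiments use a learned model instead:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UserState:
    valence: float  # how pleasant the experience feels, in [-1, 1]
    arousal: float  # how energized the experience feels, in [0, 1]

def predict_desired_state(state: UserState) -> UserState:
    # Placeholder for the ML model: nudge the user toward a calmer,
    # more pleasant state.
    return UserState(valence=min(state.valence + 0.2, 1.0),
                     arousal=max(state.arousal - 0.2, 0.0))

def determine_action(state: UserState, desired: UserState) -> float:
    # Placeholder policy: lower the stream intensity when arousal should drop.
    return -0.5 if desired.arousal < state.arousal else 0.0

def process_stream(samples: List[float], action: float) -> List[float]:
    # Generate the output stream by scaling intensity (action in [-1, 1]).
    return [s * (1.0 + action) for s in samples]

input_stream = [0.2, 0.8, -0.4]                       # obtained from the source
state = UserState(valence=-0.4, arousal=0.7)          # obtained state information
desired = predict_desired_state(state)                # desired second state
action = determine_action(state, desired)             # determined action
output_stream = process_stream(input_stream, action)  # sent to the renderer
```

The sketch keeps each claimed step as one line of the final block, so the mapping from claim language to data flow stays visible.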
In another aspect there is provided a device adapted to perform the method. In another aspect there is provided a computer program comprising instructions which when executed by processing circuitry of a device causes the device to perform the methods. In another aspect there is provided a carrier containing the computer program, where the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
One of the advantages made possible by the embodiments disclosed herein is the adaptation of the emotional effect that is produced by a datastream to a desired effect as learned from experience by way of reinforcement learning. This enhances the user Quality of Experience (QoE). Another advantage is improved privacy for the user, since the learning/adaptation may be done on the User Equipment (UE) side. Another advantage is that the embodiments allow growth of Internet of Senses applications to an audience that may be hyper- or hypo-sensitive to certain or all senses.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
Embodiments disclosed herein relate to systems and methods for learning to enhance or subdue sense-related data contained in datastreams based on the reactions of the user. The embodiments disclosed herein provide an intuitive way of adapting the data streams, which can improve the overall user experience of mixed reality applications.
One of the advantages made possible by the embodiments disclosed herein is the adaptation of the emotional effect that is produced by a datastream to a desired effect as learned from experience by way of reinforcement learning. This enhances the user Quality of Experience (QoE). Another advantage is improved privacy for the user, since the learning/adaptation may be done on the User Equipment (UE) side. Another advantage is that the embodiments allow growth of Internet of Senses applications to an audience that may be hyper- or hypo-sensitive to certain or all senses.
Raw datastreams that contain unprocessed levels of sense intensity may be provided by a source 108. For example, in a third-generation partnership project (3GPP) network, this source could be a camera, a speaker/headphone, or another party providing data via an eNB/gNB.
The data processing agent 202 may include a set of components used to learn, based on a user's emotional state and personal preferences, how to adjust the intensity of different senses and modify the raw datastreams from source 108 accordingly. In some embodiments, the set of components are logical. They may be in the user's device 100 or can be hosted by a third-party service that is reachable by a user device 100. In some embodiments, data processing agent 202 may utilize machine learning techniques as described herein, and may include, for example, a neural network.
Processed datastreams may be sent from data processing agent 202 to renderer 204, e.g., to control a sensory actuator in accordance with the processed datastreams. For example, renderer 204 may be VR goggles, glasses, a display device, a phone, or another device.
Depending on the technique used to gauge users' reactions, different sensors 104 can also be used and even placed on the user's body. The reaction receptor 206 may measure a user's emotional state and/or measure environmental qualities and provide such information to data processing agent 202. In some embodiments, reaction receptor 206 may aggregate information from one or more sensors 104.
According to some embodiments, the problem may be formulated as a reinforcement learning problem (i.e., as a Markov Decision Process—MDP with unknown transition probabilities). For reasons of simplicity, a finite MDP is used in the definition, where state and action spaces are finite (see below), however continuous state and action spaces may also be used.
An action space may define the possible set of actions to be taken on the raw datastream. These actions may be discrete, and they indicate the level of intensity above or below a reference intensity to which a datastream should be adjusted. For example, considering an audio datastream, the possible action space for that audio datastream could be:
Accordingly, the audio stream level adjustment can be set from completely muted (−1) to double the current level (1). The raw datastream can also remain unchanged, i.e., at the value 0. For audio datastreams, the level may mean the amplitude of the audio wave, which could be increased or decreased by a percentage indicated by the action. In another embodiment, the level may indicate the pitch of the audio wave (i.e., the frequency); a higher pitch would indicate that the period of the wave is reduced, as per the action space above. In yet another embodiment, both pitch and amplitude may be adjusted; in this case, the action space would be double the size of the one indicated in set 1 above, with one set of actions for the frequency and one for the amplitude of the sound wave. Depending on the use case, there can be more granular sets than set 1 with smaller steps between adjacent values, or even continuous sets of values.
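As an illustration, the amplitude adjustment described above can be sketched as follows. The nine-level discretization shown is only an assumed instance of set 1, since finer, coarser, or continuous granularities are also contemplated:

```python
import numpy as np

# Assumed discrete action space: adjustment levels from -1 (completely
# muted) to +1 (double the current level), in steps of 0.25.
ACTIONS = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]

def apply_audio_action(samples: np.ndarray, action: float) -> np.ndarray:
    """Scale the amplitude of raw audio samples by the selected action:
    -1 mutes the stream, 0 leaves it unchanged, +1 doubles the level."""
    return samples * (1.0 + action)

tone = np.sin(np.linspace(0.0, 2.0 * np.pi, 100))  # one cycle of a test tone
louder = apply_audio_action(tone, 0.5)   # 50% above the reference level
muted = apply_audio_action(tone, -1.0)   # completely muted
```

A pitch adjustment would instead resample or frequency-shift the wave, which is why adjusting both pitch and amplitude doubles the action space.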
Following the same example of set 1, adjustments can also be made to images, for example video frames of a VR/AR world. Hue, brightness, and saturation are all parameters that can be modified to adjust colors in the image (e.g., to a lighter color scheme) and can all be considered levels of intensity for visual datastreams.
When it comes to taste, an indication of intensity of bitter, sour, sweet, salty, and umami can be used, as these five are considered basic tastes from which all other tastes are generated. Again, the indications can use set 1 or another set with finer/coarser granularity, which, in the case of taste, may be five-fold in size (based on the currently commonly agreed five basic tastes).
For odors and smell, there may be no consensus yet as to what primary odors exist. One categorization considers the following odors, which at various levels of intensity may synthesize any smell: musky, putrid, pungent, camphoraceous, ethereal, floral, and pepperminty (Reference [3]). Accordingly, there may be seven sets similar to set 1, each characterizing the intensity of one basic constituent of smell.
Finally, for touch, according to Reference [4], there are four basic senses: pressure, hot, cold, and pain. Other sensations are created by combinations of these four. For instance, the experience of a tickle is caused by the stimulation of neighboring pressure receptors. The experience of heat is caused by the stimulation of hot and cold receptors. The experience of itching is caused by repeated stimulation of pain receptors. The experience of wetness is caused by repeated stimulation of cold and pressure receptors. Again, as in the previous cases, a set similar to set 1 can be used for each of the four basic constituents.
The state space contains a description of the current emotional state of the person. For characterizing the state space, the broadness of emotional states that can be detected by one or an ensemble of the techniques referenced above may be delineated. According to psychologist Paul Ekman, there exist six types of basic emotions: happiness, sadness, fear, disgust, anger, and surprise. This categorization may be used as an example to illustrate how the state description can be serialized and presented to the agent:
Set 2 illustrates an example state report, which may be the average of an ensemble of techniques referenced above. Applications in controlled environments might be able to choose more accurate but invasive techniques, whereas applications in public environments might want to use non-invasive techniques.
If automated identification of the basic emotions using sensors is impractical (e.g. visual sensors for facial expressions, body posture and gestures are unavailable), a multi-dimensional analysis of emotional states could be used instead. Multi-dimensional analysis pertains to mapping emotions to a limited set of measurable dimensions, for instance valence and arousal. Valence refers to how positive/pleasant or negative/unpleasant a given experience feels, and arousal refers to how energized/activated the experience feels.
An overview of approaches to emotion recognition and evaluation using techniques such as GSR, HRV, SKT, ECG, and EEG is described in Reference [2]. More or less invasive sensors can be used to measure emotional states along the dimensions: e.g. if the arousal level is increased, the conductance of the skin also increases, the heart rate increases etc.; the latter can be measured using various wearable sensors. What is more, such dimensions and thereby emotional states can be captured using even typical devices such as smartphones via direct user input, for example, using the Mood Meter App.
In addition to leading to more accurate determination of emotional states, multi-dimensional analysis of emotions also potentially reduces the dimensionality of the state space, thus making processing of emotional states computationally cheaper.
The final constituent of the state space, in addition to the emotional states of the user, is a set of environmental qualities that affect the calculation of the reward function. These qualities can be, for example, the level of ambient noise and lighting, the current temperature, wearables configuration such as the sound volume of the headset and the brightness of the screen, etc.
For RL models, one design decision is to choose a reward function that directs RL agents towards preferred target states by rewarding those states more. However, assessing in advance what the user's target emotional state is for a given current emotional state and environmental state is hard. For example, different people have different emotional reactions to the same environment state. Furthermore, sometimes people want to be more excited, while at other times they want to feel more relaxed, depending on the environmental and current emotional state.
According to some embodiments, a supporting ML model may be trained to try to learn a user's desired next emotional state for a given current emotional and environmental state. The environmental state can be observed by the user's wearable devices (e.g., a headset) or using other devices such as a mobile phone. The aforementioned ML model may be called the Desired State ML model (DSML).
According to some embodiments, the input vector to the DSML model is a concatenation of the user's current emotional state vector and a vector of the environment state. For instance, if the user emotional state is s_user = [valence=−0.4, arousal=0.7] and the environment state is s_env = [noise=65 dB, light=1000 lux, temp=21 C, headset_sound_volume=20%, screen_brightness=70%], then the input vector to the DSML is s_current = [noise=65 dB, light=1000 lux, temp=21 C, headset_sound_volume=20%, screen_brightness=70%, valence=−0.4, arousal=0.7].
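Assembling the DSML input vector from the example values above can be sketched as follows; the dictionary keys and the use of NumPy are illustrative choices, not part of the disclosure:

```python
import numpy as np

# Emotional state measured for the user (valence/arousal example above).
s_user = {"valence": -0.4, "arousal": 0.7}
# Environment state observed via wearables and the user device.
s_env = {"noise_db": 65.0, "light_lux": 1000.0, "temp_c": 21.0,
         "headset_sound_volume": 0.20, "screen_brightness": 0.70}

# The DSML input is the concatenation of the environment and emotional vectors.
s_current = np.array(list(s_env.values()) + list(s_user.values()))
```

Keeping the environment components first matches the ordering of the example input vector above.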
The output of the DSML is an emotional state desired by the user, represented as a vector of its components. The training data for the DSML is collected from the actions taken by the user to adjust parameters of an AR/VR/MR experience during use. When a user makes an adjustment, the parameters of the equipment and possibly the content are recorded as a vector of the current environment state, and the characteristics of the emotional state are measured before and after the adjustment took place. The measured emotional states are recorded as the current and desired emotional states. Thus, each action taken by the user creates one input/output tuple for the DSML training.
Note that there could be fluctuations in the emotional state measurements, so references herein to the emotional state before and after an action denote averaged values measured over a period of time before and after the action.
The reward for an action taken by the RL agent could be computed using results of the DSML model in the following way. For any given current state s_current, the RL agent needs to suggest an action. The reward each action gets depends on how close the resulting next state s_next is to the desired state s_desired predicted by the DSML for s_current. Note that the s_current and s_next states contain both emotional and environmental components, while the s_desired state has only emotional components. Let a corresponding component of s_current, s_next, and s_desired be denoted c_current, c_next, and c_desired, respectively. Then for that component the penalty p_c is computed as: p_c = |c_desired − c_next| / |c_current − c_desired|
Then the total penalty p for the action is the sum of p_c over all components of the emotional state vector. Finally, the reward for the action is the reciprocal of the penalty, i.e., r = 1/p.
While the disclosed embodiments target sense perception, they can also find application in countering motion sickness that people may experience, for example, in passenger cars. For example, a study from the University of Michigan showed a correlation between physiological measurements (head position relative to the torso, heart rate, skin temperature, etc.) and motion sickness of car passengers.
To counter motion sickness, studies identified that triggering of certain sounds, such as listening to pleasant music (Reference [5]) and correlating engine sounds and vibration with visual flow speed (Reference [6]) could reduce motion sickness. Therefore, the embodiments disclosed herein can be extended with a mechanism to reduce motion sickness.
Specifically, the state space in case of motion sickness mitigation may consist of physiological measurements that indicate the severity of motion sickness. These may include movement of the head (also known as head roll), which is proportional to the severity of the sickness (observed in Reference [5] and Reference [7]). Head roll and pitch can be measured using accelerometer and gyroscope sensors on a wearable such as AR/VR glasses. An increase in tonic and phasic GSR has also been found to contribute to motion sickness in Reference [7]. Another study also identified that changes in the blinking behavior of the eyes and breathing/respiration also suggest an uncomfortable situation that may be linked to motion sickness (Reference [8]).
A serialization of the state space therefore may include one or more of the following, depending also on the type of sensors present:
In terms of the action space, in the simplest case, an audio stream will be transmitted playing back pre-recorded pleasant music (as observed in Reference [5]). In some embodiments, an audio stream will be transmitted playing back music with a tempo correlating to a speed of a vehicle. In another case, a video stream can be created and correlated with the sounds of the engine and vibration of the car, as well as the vehicle's actual speed. The sounds of the engine, the vibration, and the speed can be retrieved by the datastream processing agent 202 from the headset's microphone, accelerometer, and GPS receiver, respectively, and then a video can be synthesized to provide a sense of speed and flow matching those readings. Additional or alternative actions may be taken to counteract detected motion sickness as well, such as, for example, lowering a window, changing a position of a seat, reducing an experience of content (e.g., from three dimensions to two dimensions), adjusting an air conditioner, and/or adjusting an air recycler.
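As one illustration of correlating music tempo with vehicle speed, a simple linear mapping could be used. The base speed and the clamping bounds below are assumptions, since the disclosure only states that the tempo correlates with the speed:

```python
def tempo_factor(vehicle_speed_kmh: float,
                 base_speed_kmh: float = 50.0,
                 min_factor: float = 0.8,
                 max_factor: float = 1.4) -> float:
    """Map the vehicle speed (e.g., from a GPS receiver) to a playback-tempo
    factor, clamped so the music stays pleasant at extreme speeds."""
    factor = vehicle_speed_kmh / base_speed_kmh
    return max(min_factor, min(max_factor, factor))
```

At the assumed base speed of 50 km/h, the music plays at its original tempo; slower or faster driving shifts the tempo within the clamped range.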
At 401, agent 202 initializes a target network tn and a deep q network dqn.
At 403, agent 202 initializes an experience cycle buffer B. Steps 401 and 403 may be used to train a machine learning model as discussed above using observations of an old and new state.
At 405, a raw datastream is received at agent 202 from source 108.
At 407, an action a is selected using a selection policy (e.g., ε-greedy). The action may be based on the machine learning techniques of exploration and exploitation. In exploration, a random action may be selected; in exploitation, the action with the highest estimated value may be selected.
At 409, the raw datastream is processed based on the action selected at step 407.
At 411, the processed datastream is provided to renderer 204. The processed datastream may control an actuator of a user device 100 to provide a sensory output to a user.
At 413, the reaction receptor 206 observes a reward r(i) and new state s(i+1) and provides the observations to agent 202.
At 415, the agent 202 stores <s(i+1), s(i), a, r(i)> in the buffer B.
At predetermined iterations, steps 417, 419, and 421 may be performed by agent 202. In some embodiments, steps 417, 419, and 421 are performed using a convolutional neural network and/or a mean-square error algorithm in agent 202. At step 417, a random minibatch of experiences <s(j+1), s(j), a, r(j)> is selected from the buffer B.
At step 419, y(j) is set equal to r(j) + γ max_a Q(s(j+1), a; tn).
At step 421, a gradient descent step is performed on (y(j) − Q(s(j), a(j); dqn))², updating the parameters of dqn.
Example code for generating the flow in
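A simplified, runnable sketch of steps 401 through 421 follows, with a linear Q-function standing in for the deep Q-network and simulated observations standing in for the reaction receptor; all dimensions, hyperparameters, and the dummy reward are illustrative assumptions:

```python
import random
import numpy as np

STATE_DIM, N_ACTIONS = 4, 3          # illustrative sizes
GAMMA, EPSILON, LR = 0.9, 0.1, 0.01  # discount, exploration rate, step size

random.seed(0)
rng = np.random.default_rng(0)
dqn = rng.normal(size=(N_ACTIONS, STATE_DIM)) * 0.1  # step 401: online network
tn = dqn.copy()                                      # step 401: target network
buffer = []                                          # step 403: experience buffer

def q_values(w, s):
    return w @ s  # linear stand-in for Q(s, .)

def select_action(s):
    # Step 407: epsilon-greedy selection (explore with probability EPSILON).
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(q_values(dqn, s)))

s = rng.normal(size=STATE_DIM)
for i in range(200):
    a = select_action(s)                    # step 407
    # Steps 409-413: process and render the datastream, then observe the
    # reward and new state via the reaction receptor (simulated here).
    r = -abs(a - 1)                         # dummy reward favoring action 1
    s_next = rng.normal(size=STATE_DIM)
    buffer.append((s, a, r, s_next))        # step 415
    if len(buffer) >= 32 and i % 4 == 0:
        for sj, aj, rj, sj1 in random.sample(buffer, 32):  # step 417
            y = rj + GAMMA * np.max(q_values(tn, sj1))     # step 419
            td_error = y - q_values(dqn, sj)[aj]           # step 421: gradient
            dqn[aj] += LR * td_error * sj                  # descent on (y - Q)^2
    if i % 50 == 0:
        tn = dqn.copy()                     # periodic target-network sync
    s = s_next
```

In a full implementation, the simulated reward and next state would come from the reaction receptor 206, and the linear Q-function would be replaced by a deep Q-network trained with the same target and loss.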
While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2021/071735 | 8/4/2021 | WO |