This application claims the priority benefit of Italian Application for Patent No. 102021000008915, filed on Apr. 9, 2021, the content of which is hereby incorporated by reference in its entirety to the maximum extent allowable by law.
The description relates to methods and systems for detecting a state of a driver of a vehicle.
One or more embodiments may be used to generate alert signals and/or activate safety procedures (e.g., taking over control of the vehicle) based on the detected state, for instance within the framework of an advanced driver-assistance system (ADAS) or a driver alcohol detection system for safety (DADSS).
A reduced level of attention (e.g., drowsiness) of a driver of a vehicle, before and during driving, may adversely affect driving safety. Driver drowsiness may lead to road traffic accidents involving vehicles. The ability to detect an attention state of a driver may facilitate evaluating his/her fitness to drive a vehicle, facilitating prevention of road accidents.
Existing systems for driver attention monitoring are based on recorded images of a driver, in particular a driver's face, during driving.
These “face analysis” solutions based on image data may suffer from drawbacks such as, for instance: visual noise (e.g., glasses worn on the driver's face), which may hinder data processing; and high complexity and difficulty in calibrating and adapting the system to different car drivers.
Alternatively, existing systems may rely on electrophysiological signal processing.
It is known that a correlation exists between attention levels and heart rate variability (HRV), so that estimating HRV of a human can be indicative of drowsiness.
HRV is the physiological signal of variation of the time intervals between heartbeats. Thus, HRV is indicative of the activity state of the autonomic nervous system, which is responsible for operating automatic, unconscious and involuntary body activities, such as heartbeat activity.
An HRV value may be obtained via processing of measured electrophysiological signals related to heartbeats, e.g., ElectroCardioGraphy (ECG) and/or PhotoPlethysmoGraphy (PPG) signals.
These “physiological analysis” solutions based on electrophysiological signals may suffer from drawbacks such as, for instance: complexity in embedding electrophysiological sensing devices in the vehicle (for instance, installing ECG detectors on the steering wheel of a vehicle would require both of the driver's hands to be steadily placed on the steering wheel at the positions where the ECG detectors are located); lengthy data buffering (for instance, approximately 8 to 10 minutes of detected ECG time series would be used to provide a robust measure of driver drowsiness), leading to low-dynamic (e.g., slow) change-of-status/alert signaling; and complex frequency-domain signal processing involved in HRV computation (since HRV is linked to the frequency content of the detected ECG/PPG signals), implying costly/slow CPU-intensive computation.
Extensive activity is carried out in this field and several approaches have been proposed in the literature, as discussed in the following documents (each of which is incorporated herein by reference):
As mentioned, various solutions proposed in the literature may be exposed to one or more of the following drawbacks: reduced performance when adapted to different people in the driver's seat; use of CPU-intensive and time-consuming methods; challenges in embedding these solutions in the vehicle space, for instance due to the difficulty of providing complex hardware architectures onboard a car; reduced performance in low-light conditions or with visual noise (e.g., glasses worn by the driver); and high latency in providing an output, which is hardly compatible with the fast reactions of a safety system.
Existing solutions hence suffer from low-speed detection of a change in the state of an attention level of, e.g., a driver of a vehicle, especially while employing relatively cheap and low complexity components.
There is a need in the art to contribute to overcoming the aforementioned drawbacks.
One or more embodiments may relate to a method.
One or more embodiments may relate to a corresponding (processing) system.
An advanced driver assistance system configured to perform the signal processing method as per the present disclosure may be exemplary of such a system.
One or more embodiments may relate to a vehicle equipped with the system according to embodiments.
One or more embodiments combine image processing and electrophysiological signal processing using innovative deep learning and data processing techniques, in a synergistic way. For instance, this facilitates overcoming individual drawbacks of separate processing pipelines, providing a more robust and efficient overall system.
One or more embodiments may facilitate continuous driver drowsiness detection/monitoring without the use of frequency-domain computations and without lengthy data buffering.
One or more embodiments may comprise an ad-hoc (hyper-)filtering pipeline facilitating extraction, e.g., concurrently, of various PPG signal features/dynamics.
One or more embodiments may facilitate providing one or more of the following advantages: increased speed; reduced costs and computational complexity; dispensing from fully labelling every object of a training set; dispensing from use of complex systems; streamlining of multiple processing pipelines; innovative domain-adaptation thanks to a self-attention deep neural network; easy adaptability to different car driving scenarios; adaptability to low-light scenarios; and robustness against possible temporary unavailability of the signals of one of the two pipelines, e.g., due to a misaligned position of the driver with respect to the electrophysiological signal sensors.
One or more embodiments will now be described, by way of non-limiting example only, with reference to the annexed Figures, wherein:
In the ensuing description, one or more specific details are illustrated, aimed at providing an in-depth understanding of examples of embodiments of this description. The embodiments may be obtained without one or more of the specific details, or with other methods, components, materials, etc. In other cases, known structures, materials, or operations are not illustrated or described in detail so that certain aspects of embodiments will not be obscured.
Reference to “an embodiment” or “one embodiment” in the framework of the present description is intended to indicate that a particular configuration, structure, or characteristic described in relation to the embodiment is comprised in at least one embodiment. Hence, phrases such as “in an embodiment” or “in one embodiment” that may be present in one or more points of the present description do not necessarily refer to one and the same embodiment.
Moreover, particular conformations, structures, or characteristics may be combined in any adequate way in one or more embodiments.
The references used herein are provided merely for convenience and hence do not define the extent of protection or the scope of the embodiments.
The drawings are in simplified form and are not drawn to precise scale.
Throughout the figures annexed herein, like parts or elements are indicated with like references/numerals; for brevity a corresponding description will not be repeated for each and every figure.
Also, throughout this description, the wording “neural network (processing)” as used, for instance, in expressions like artificial neural network (ANN) processing or convolutional neural network (CNN) processing, is intended to designate machine-implemented processing of signals performed via hardware (HW) and/or software (SW) tools.
By way of general reference,
As exemplified in
The signal processing circuitry 10 is coupled to the signal acquisition stage VD, PD and configured to receive the at least one of the driver signals S, P therefrom. The signal processing circuitry 10 is configured to apply artificial neural network (ANN) processing to the at least one driver signal S, P, providing an indicator signal T to user circuits A as a result of the ANN processing.
In one or more embodiments, the user circuits A may comprise an advanced driver assistance system (ADAS), configured to receive the indicator T and to use it in assisting with car driving operations, for instance providing an alert to the driver D of the vehicle V as a result of the indicator T being below or above a certain threshold and/or taking control over the vehicle V in case a drowsy driver state is detected, potentially improving driving safety.
For instance, the results produced by the system can be presented on a display unit A to an operator, e.g., a medical practitioner, with the capability of supporting his activity, e.g., for diagnostic purposes.
As exemplified in
In one or more embodiments, a PPG signal P may be simpler to process according to a method as disclosed herein, as it may be easier to sample in an automotive environment with respect to an ECG signal, due to a reduced invasiveness of the hardware in the limited volume of the vehicle V. For instance, PPG probe circuitry PD may be embedded in the steering wheel of the vehicle V.
Thus, for the sake of simplicity, embodiments are discussed in the following mainly in relation to the processing of a PPG signal as electrophysiological signal P, being otherwise understood that such an electrophysiological signal type is purely exemplary and in no way limiting.
As exemplified in
The signal processing circuitry 10 further comprises a classifier stage 16 coupled to the first 11, 12 processing pipeline and to the second 14, 15 processing pipeline to receive the first Ts indicator signal and/or second Tp indicator signal therefrom. The classifier stage 16 is configured to classify an attention level of the driver D based on the first Ts indicator signal and second Tp indicator signal received (e.g., based on a weighted combination of the first Ts indicator signal and second Tp indicator signal), producing a global indicator signal T indicative of the attention level of the driver D, e.g., a driver drowsiness risk indicator.
For instance, the indicator signal T may be expressed as:
T = k1·φ(Ts) + k2·ψ(Tp)
where k1 and k2 are scaling weights and φ, ψ are weight functions.
For instance, the system may trigger an alert message/signal to user circuits A based on the indicator signal T reaching or failing to reach one or more attention level thresholds T1, T2.
For instance, the risk signal may be a message displayed on a screen A onboard the vehicle V, the message being indicative of a confidence interval or probability of the detected drowsiness state:
For the sake of simplicity, the case above exemplifies static threshold values. It is noted that this is just one of the possible ways to set threshold values, as in one or more embodiments one or more threshold values T1, T2 may be adjusted “online” or may be set adaptively.
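Purely by way of non-limiting illustration, the weighted combination and thresholding discussed above may be sketched as follows in Python, where the identity weight functions, the scaling weights and the numeric threshold values are illustrative assumptions rather than values mandated by the embodiments:

```python
# Illustrative sketch only: combine the per-pipeline indicators Ts, Tp
# into a global indicator T = k1*phi(Ts) + k2*psi(Tp) and map it to an
# alert level. Weights, weight functions and thresholds are assumptions.

def global_indicator(ts, tp, k1=0.5, k2=0.5, phi=lambda x: x, psi=lambda x: x):
    return k1 * phi(ts) + k2 * psi(tp)

def alert_level(t, t1=0.4, t2=0.7):
    # Static thresholds T1 < T2; as noted above, thresholds may instead
    # be adjusted "online" or set adaptively.
    if t < t1:
        return "alert"
    if t < t2:
        return "warning"      # e.g., display a message on screen A
    return "take_over"        # e.g., ADAS takes control of the vehicle V

print(alert_level(global_indicator(ts=0.3, tp=0.9)))  # -> "warning"
```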
As exemplified in
Object detection processing as discussed in the reference Viola, et al., “Rapid object detection using a boosted cascade of simple features,” Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, Kauai, HI, USA, 2001, pp. I-I, doi: 10.1109/CVPR.2001.990517 (incorporated by reference), is suitable for use in the first processing stage 11 to identify the landmark points O. This envisages a machine learning approach for visual object detection comprising: an image representation called the “integral image”, which allows the features used by the detector to be computed very quickly; a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set; and a method for combining increasingly more complex classifiers in a “cascade”. The cascade can be viewed as an object-specific focus-of-attention mechanism which, unlike previous approaches, provides statistical guarantees that discarded regions are unlikely to contain the object of interest. This method may be adapted to run on a central processing core even in the absence of a dedicated graphical processing core, thanks to the reduced number of landmark points being selected.
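By way of non-limiting illustration, a boosted cascade of this kind is available, e.g., in the OpenCV library; in the brief sketch below, the cascade file and the detection parameters are illustrative assumptions:

```python
# Illustrative sketch: locate the face region with a Viola-Jones boosted
# cascade (OpenCV implementation); landmark points O may then be
# selected within the returned rectangles.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # scaleFactor/minNeighbors are illustrative tuning parameters
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```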
Such object detection processing facilitates revealing variations over time of the luminance at the selected set of landmark points, where these variations are indicative of heart pulsatile activity occurring with a variable heart rate and, in turn, of an attention level of the driver D of the vehicle V.
For instance, the pre-processed signals Lk can comprise time-series of luminance/intensity data for a respective face landmark point, for instance to obtain a sequence of values of intensity variations frame-by-frame, e.g., relative variation of intensity in an image frame with respect to the preceding image frame.
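For instance, a minimal sketch of building such per-landmark time series may read as follows, assuming that landmark coordinates (away from the image border) are available for each frame and using an illustrative 3×3 sampling window:

```python
# Illustrative sketch: time series Lk of relative frame-to-frame
# luminance variation around one landmark point.
import numpy as np

def luminance_series(frames_gray, landmark_xy, win=1):
    x, y = landmark_xy
    vals = np.array([f[y - win:y + win + 1, x - win:x + win + 1].mean()
                     for f in frames_gray])
    # relative variation of intensity with respect to the preceding frame
    return (vals[1:] - vals[:-1]) / (vals[:-1] + 1e-8)
```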
As exemplified in
As appreciable to those of skill in the art, the encoder-decoder layers may form together a stacked autoencoder (SAE) neural network.
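Purely by way of illustration, such a stacked autoencoder may be sketched as follows (in PyTorch), the layer sizes being illustrative assumptions:

```python
# Illustrative sketch: symmetric encoder-decoder layers forming a
# stacked autoencoder (SAE) trained to reconstruct its input.
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, n_in=64, n_mid=32, n_code=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, n_mid), nn.ReLU(),
            nn.Linear(n_mid, n_code), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(n_code, n_mid), nn.ReLU(),
            nn.Linear(n_mid, n_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))  # reconstruction of x
```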
As known to those of skill in the art, a long short-term memory (LSTM) neural network is an artificial recurrent neural network (RNN) architecture having feedback connections among cells therein.
As exemplified in
For the i-th LSTM cell 30 as exemplified in
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t * C_{t−1} + i_t * C̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
where: W_f is a respective set of weights of the first gate 30a of the LSTM cell 30; b_f is a respective set of bias values of the first gate 30a of the LSTM cell 30; W_i is a respective first sub-set of weights of the second gate 30b of the LSTM cell 30; b_i is a respective first sub-set of bias values of the second gate 30b of the LSTM cell 30; W_C is a respective second sub-set of weights of the second gate 30b of the LSTM cell 30; b_C is a respective second sub-set of bias values of the second gate 30b of the LSTM cell 30; W_o is a respective third sub-set of weights of the third gate 30c of the LSTM cell 30; b_o is a respective third sub-set of bias values of the third gate 30c of the LSTM cell 30; x_{t−1} is a first input; h_{t−1} is a first output; C_{t−1} is a first cell state; x_t is a second input; h_t is a second output; and C_t is a second cell state.
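By way of non-limiting illustration, the cell equations above may be transcribed directly into tensor operations as in the following didactic sketch; in practice, library primitives (e.g., torch.nn.LSTMCell) implement the same gates:

```python
# Illustrative sketch: one LSTM cell step, mirroring the equations above.
# Each weight matrix W_* has shape (hidden, hidden + input); each bias
# b_* has shape (hidden,).
import torch

def lstm_cell_step(x_t, h_prev, c_prev,
                   W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = torch.cat([h_prev, x_t], dim=-1)     # [h_{t-1}, x_t]
    f_t = torch.sigmoid(z @ W_f.T + b_f)     # forget gate (first gate 30a)
    i_t = torch.sigmoid(z @ W_i.T + b_i)     # input gate (second gate 30b)
    c_tilde = torch.tanh(z @ W_C.T + b_C)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # second cell state C_t
    o_t = torch.sigmoid(z @ W_o.T + b_o)     # output gate (third gate 30c)
    h_t = o_t * torch.tanh(c_t)              # second output h_t
    return h_t, c_t
```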
As known to those of skill in the art, CNN processing 124 is a kind of deep neural network (DNN) processing suitable for analyzing images. The name “convolutional neural network” indicates that the network employs convolution operations in place of general matrix multiplication in at least one of its layers.
As exemplified in
As illustrated, the processing layers 300, 302, 304, 306, 308 may be configured to produce respective feature maps F1, F2, F3, F4. Each such feature map may have a size given by a feature map width L1, L2, L3, L4 times a feature map height (which may be equal to the width L1, L2, L3, L4), times the feature map channels (e.g., three channels for an RGB image having red, green and blue colors), times a number of maps.
In one or more embodiments, the processing layers 302, 304, 306, 308 may have a multi-layer perceptron (MLP) architecture, comprising a plurality of processing units indicated as perceptrons.
A single i-th perceptron in the plurality of perceptrons may be identified by a tuple of values comprising weight values wi, offset values bi and an activation function ρi.
As exemplified in
w_i ∈ R^(C×H×T), i = 1, . . ., K
where: H represents kernel height; T represents kernel width; K represents number of kernels, e.g., K=1; and C represents a number of input channels, which may be equal to a number of (image color) channels of the input feature map F1.
The output layer 310 may comprise a fully connected layer, that is a type of convolutional layer having connections to all activations in the previous layer.
A convolutional layer such as 302 (again taken as a possible example) may be configured to apply an activation function to a sliding dot product.
Such an operation may be expressed as, for instance:
b = ρ(w_i^T · a)
where: w_i^T is a transposed version of the weight vector w_i (corresponding to the kernel); a is the input feature vector, e.g., computed by the processing layer (for instance, 300) preceding the considered one (for instance, 302); ρ is the activation function of the layer; and b is the output resulting from applying the activation function ρ to the product of the kernel and the input feature vector.
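For instance, such a convolutional layer applying an activation function ρ to a sliding dot product may be sketched as follows, with illustrative values of C, H, T and K:

```python
# Illustrative sketch: K kernels of size C x H x T slide over the input
# feature map; the activation function rho is applied to each dot product.
import torch
import torch.nn as nn

C, H, T, K = 3, 3, 3, 1                 # input channels, kernel h/w, kernels
conv = nn.Conv2d(in_channels=C, out_channels=K, kernel_size=(H, T))
rho = nn.ReLU()                         # activation function rho

a = torch.randn(1, C, 32, 32)           # input feature map (batch of 1)
b = rho(conv(a))                        # b = rho(w_i^T . a), position-wise
```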
The CNN processing 124 as exemplified in
As exemplified in
In one or more embodiments, a known neural network processing configuration currently denominated “ResNet50” may be suitable for use in the ANN processing stage 40. This is a fifty-layer-deep neural network which can be loaded (pre-)trained on more than a million images from a database currently denominated “ImageNet”. The pretrained network can classify images into 1000 object categories.
As exemplified in
where a_ij is an attention score, which may be expressed as the softmax normalization of the alignment scores:

a_ij = exp(e_ij) / Σ_k exp(e_ik)

and where e_ij is an alignment score which may be expressed as:

e_ij = a(s_{i−1}, h_j)
In one or more embodiments, enhancement processing 127 may be based on the observation that, to improve the “learning” of the ANN processing, it can be advantageous to focus the “attention” of the processing power of the ANN more onto some data points rather than others.
In this application context, “attention” refers to a technique for attending to different parts of an input vector to capture long-term dependencies. This may be seen as analogous to a “human” learning process where more important concepts can be highlighted.
Thus, the enhancement processing 127 may use special weights to enhance some of the features in the set of features F, providing the set of enhanced features F′ where these are “highlighted”, that is weighted differently.
As exemplified in
A concatenating layer 56 is configured to receive the context vector F′ and the embedded signal Yt and to apply concatenating processing thereto, for instance appending one signal to another. A pattern recognition processing 57, for instance a further LSTM neural network processing, receives the concatenated signal. One or more fully connected layers 58, 59 are configured to receive the output from the LSTM network 57 and to output enhanced features Yt+1 of the set of enhanced features F′.
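Purely by way of illustration, an attention mechanism consistent with the alignment scores e_ij and attention scores a_ij discussed above may be sketched as follows, the dimensions and the additive scoring network being illustrative assumptions:

```python
# Illustrative sketch: additive attention producing a context vector F'
# as an attention-weighted sum of encoder features h_j, given the
# previous state s_{i-1}.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, d_state, d_feat, d_attn=64):
        super().__init__()
        self.W_s = nn.Linear(d_state, d_attn)
        self.W_h = nn.Linear(d_feat, d_attn)
        self.v = nn.Linear(d_attn, 1)

    def forward(self, s_prev, h):             # h: (seq_len, d_feat)
        e = self.v(torch.tanh(self.W_s(s_prev) + self.W_h(h)))  # e_ij
        a = F.softmax(e, dim=0)                # attention scores a_ij
        return (a * h).sum(dim=0)              # context vector F'
```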
As exemplified in
In one or more embodiments, a (hyper-filtering) method as discussed in United States Patent Application Publication Nos. 2020/330020 A1 and 2021/068739 A1 (incorporated herein by reference) may be suitable for use in the filtering stage 14 of the second processing pipeline 14, 15, providing the set of filtered signals Pf.
As exemplified in
As exemplified in
For instance, the residual block may comprise a convolutional processing stage with an ad-hoc kernel selected to adjust size of input xi, e.g., making it equal to the size of output xi+1.
For instance, the i-th output xi may be scaled by a scaling factor before being added to the (i+1)-th output xi+1. For instance, such a scaling factor is equal to or a multiple of two, preferably being increased (e.g., doubled) for each processing stage 150.
It is noted that while nine ANN stages 150 are represented in
As exemplified in
As discussed herein, a “convolution” is a kind of matrix operation, comprising a kernel (that is a small matrix of weight values) that slides over input data performing element-wise multiplication with the part of the input it is on, then summing the results into an output.
As discussed herein, a “causal convolution” is a kind of convolution used for temporal data where the prediction emitted by the model at timestep t does not depend on any future timestep; it can also be described as equivalent to a masked convolution, which can be implemented by constructing a mask tensor and performing an element-wise multiplication of this mask with the convolution kernel before applying it.
As discussed herein, a “dilated causal convolution” is a causal convolution where the mask or filter is applied over an area larger than its length by skipping input values with a certain step.
As appreciable by direct visual comparison of
In particular, using the dilated mask 80B it is possible to dispense from using so-called “padding” (indicated with dotted lines in
One or more embodiments may present a kernel size of 3×3 and a batch size of 32, for which dilated causal convolution processing 70 facilitates fast processing.
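By way of non-limiting illustration, a dilated causal convolution is often realized in software by left-padding the input so that the output at a given timestep depends only on past samples; the sketch below follows this common formulation (the masked-kernel variant discussed above achieves causality without padding), with an illustrative dilation value:

```python
# Illustrative sketch: 1-D dilated causal convolution via left padding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # pad the past only
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))  # no look-ahead
```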
As exemplified in
A third layer 72 is coupled to the second layer 71 to receive normalized data therefrom and configured to apply a second normalization processing 72 thereto. In one or more embodiments, a (spatial) dropout regularization technique as, e.g., that discussed in Srivastava, et al., “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research 15(56):1929-1958, 2014 (incorporated by reference), may be suitable for use in the third layer 72. The dropout regularization drops a unit (along with its connections) at training time with a specified probability (e.g., probability p=0.5) in order to prevent co-adaptation, where the neural network becomes too reliant on particular connections, as this could lead to overfitting.
A fourth layer 73 is coupled to the third layer 72 to receive normalized data therefrom and configured to apply to the normalized data a further dilated causal convolution operation.
A fifth layer 74 is coupled to the fourth layer 73 to receive dilated data therefrom and configured to apply a third normalization processing thereto, e.g., instance or contrast normalization processing.
A sixth layer 75 is coupled to the fifth layer 74 to receive normalized data therefrom and configured to apply an activation function thereto, e.g., a ReLU function.
A seventh layer 76 is coupled to the sixth layer 75 to receive the activated data therefrom and configured to apply a further normalization processing thereto, e.g., a (spatial) dropout regularization technique for neural networks.
An optional adjustment layer 77 is configured to receive the i-th input data xi and to apply dimensionality reduction thereto, preferably via one-by-one convolution, that is by applying a 1×1 convolutional layer to provide feature map pooling or a projection layer, decreasing the number of feature maps while retaining their salient features. This layer 77 facilitates managing the number of feature maps, which often increases proportionally to the depth of the network.
A superposition layer 78 is configured to receive the normalized data from the last normalization stage 76 and the input data xi from the input or the optional adjustment layer and to superimpose the input xi to the output, providing an enhanced output value xi+1.
In one or more embodiments, a final layer of the cascade as exemplified in
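Purely by way of illustration, the layer cascade 70-78 described above may be sketched as a residual block as follows, where the specific normalization operators, the dropout probability and the channel sizes are illustrative assumptions:

```python
# Illustrative sketch of the residual block: dilated causal convolution,
# normalization, dropout, a second dilated causal convolution,
# normalization, ReLU, dropout, an optional 1x1 convolution on the skip
# path (layer 77) and the final superposition (layer 78).
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c_in, c_out, kernel_size=3, dilation=1, p_drop=0.5):
        super().__init__()
        pad = (kernel_size - 1) * dilation
        self.net = nn.Sequential(
            nn.ConstantPad1d((pad, 0), 0.0),                          # causal left pad
            nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation),   # layer 70
            nn.BatchNorm1d(c_out),                                    # normalization (71)
            nn.Dropout(p_drop),                                       # dropout (72)
            nn.ConstantPad1d((pad, 0), 0.0),
            nn.Conv1d(c_out, c_out, kernel_size, dilation=dilation),  # layer 73
            nn.InstanceNorm1d(c_out),                                 # normalization (74)
            nn.ReLU(),                                                # activation (75)
            nn.Dropout(p_drop))                                       # dropout (76)
        self.adjust = (nn.Conv1d(c_in, c_out, 1)                      # 1x1 conv (77)
                       if c_in != c_out else nn.Identity())

    def forward(self, x):
        return self.net(x) + self.adjust(x)                           # superposition (78)
```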
In one or more embodiments, each processing pipeline may be trained to perform ANN processing using respective training datasets and using respective training methods.
For instance, ANNs may be trained, in a manner per se known, using a stochastic gradient descent (SGD) iterative method: the weights are initialized and the weight vector is then updated incrementally, using one data point (or a small batch of data points) at a time, following the steepest-descent direction of the error each time an error calculation is completed. Compared with full-batch gradient descent, this reduces the cost of each iteration and the time taken to process large quantities of data points, which makes SGD widely used for training neural networks on large-scale problems.
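For instance, a minimal SGD training loop may be sketched as follows, the model, data loader, loss and hyper-parameters being placeholders:

```python
# Illustrative sketch: stochastic gradient descent updates the weight
# vector one mini-batch at a time from the gradient of the error.
import torch

def train_sgd(model, loader, epochs=10, lr=0.01):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:            # one data point / batch at a time
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()            # gradient of the error
            opt.step()                 # incremental weight update
    return model
```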
At least one training dataset used to train ANNs as per the present disclosure may comprise images and electrophysiological signals captured while a same person drives a same vehicle.
For instance, the training set may comprise images captured in RGB 640×480 (VGA) format at a frame rate of 40 frames per second (fps), as well as PPG signals collected from a sensor PD in the steering wheel of the vehicle. For instance, the training dataset may be split into a first part, e.g., 70% of the total dataset, used for training and a second part, e.g., 30% of the total dataset, used for validation and testing.
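Purely by way of illustration, such a 70/30 split may be obtained, e.g., as follows, the splitting utility and the fixed random seed being illustrative assumptions:

```python
# Illustrative sketch: 70% training / 30% validation-and-testing split.
import torch
from torch.utils.data import random_split

def split_dataset(dataset, train_frac=0.7, seed=0):
    n_train = int(len(dataset) * train_frac)
    return random_split(dataset, [n_train, len(dataset) - n_train],
                        generator=torch.Generator().manual_seed(seed))
```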
A method of processing signals indicative of a level of attention of a human individual (for instance, D) during a (human) activity, as exemplified herein, comprises:
As exemplified herein, images of said face of the human in the time series of captured images comprise a set of landmark points of said face of the human (for instance, eyes, nose, ears, etc.).
As exemplified herein, applying said first ANN processing pipeline comprises:
As exemplified herein, applying said second ANN processing pipeline comprises:
applying filtering (preferably, hyper-filtering) to said sensed electrophysiological signal, producing a set of filtered signals (for instance, Pf) as a result;
As exemplified herein, at least one CNN processing stage (for instance, 150) in said cascade of CNN processing stages comprises applying a dilated causal convolution (for instance, 70).
A system for processing signals indicative of a level of attention of a human individual (for instance, D) during an activity (for instance, driving a vehicle, preferably a car), as exemplified herein, comprises:
As exemplified herein, the image capturing circuitry comprises a smart-phone having at least one camera, preferably a low frame-rate camera.
As exemplified herein, the sensing circuitry comprises PPG sensing circuitry configured to sense at least one PPG signal indicative of said level of attention of the human during said activity.
As exemplified herein, a vehicle (for instance, V) is equipped with a system as exemplified herein in combination with at least one driver assistance device (for instance, A), the driver assistance device configured to operate as a function of said risk indicator reaching or failing to reach at least one attention level threshold.
A computer program product as exemplified herein is loadable in the memory of at least one processing circuit and includes software code portions for executing the steps of the method as exemplified herein when the product is run on at least one processing circuit (for instance, 10).
It will be otherwise understood that the various individual implementing options exemplified throughout the figures accompanying this description are not necessarily intended to be adopted in the same combinations exemplified in the figures. One or more embodiments may thus adopt these (otherwise non-mandatory) options individually and/or in different combinations with respect to the combination exemplified in the accompanying figures.
The claims are an integral part of the technical teaching provided herein with reference to the embodiments.
Without prejudice to the underlying principles, the details and embodiments may vary, even significantly, with respect to what has been described by way of example only, without departing from the extent of protection. The extent of protection is defined by the annexed claims.
Number | Date | Country | Kind
102021000008915 | Apr 2021 | IT | national
Number | Name | Date | Kind
20190117096 | Rundo | Apr 2019 | A1
20190159735 | Rundo | May 2019 | A1
20200057487 | Sicconi | Feb 2020 | A1
20200214614 | Rundo et al. | Jul 2020 | A1
20200330020 | Rundo | Oct 2020 | A1
20210068739 | Rundo | Mar 2021 | A1
20210221396 | Awano | Jul 2021 | A1
20220327845 | Rundo | Oct 2022 | A1
Entry
IT Search Report and Written Opinion for priority application, IT Appl. No. 102021000008915, report dated Dec. 2, 2021, 8 pgs.
“Driver Drowsiness Detection based on Multimodal using Fusion of Visual-feature and Bio-signal”, 2018 International Conference on Information and Communication Technology Convergence (ICTC), IEEE, Oct. 17, 2018, pp. 1249-1251, XP033447928.
“Real-time physiological and vision monitoring of vehicle driver for non-intrusive drowsiness detection”, IET Communications, The Institution of Engineering and Technology, GB, vol. 5, No. 17, Nov. 25, 2011, pp. 2461-2469, XP006039467.
Awais, et al., “Automated eye blink detection and tracking using template matching,” 2013 IEEE Student Conference on Research and Development, Putrajaya, Malaysia, 2013, pp. 79-83, doi: 10.1109/SCOReD.2013.7002546.
Haq, et al., “Eye-blink rate detection for fatigue determination,” 2016 1st India International Conference on Information Processing (IICIP), Delhi, India, 2016, pp. 1-5, doi: 10.1109/IICIP.2016.7975348.
Kurylyak, et al., “Detection of the eye blinks for human's fatigue monitoring,” 2012 IEEE International Symposium on Medical Measurements and Applications Proceedings, Budapest, Hungary, 2012, pp. 1-4, doi: 10.1109/MeMeA.2012.6226666.
Nacer, et al., “Vigilance detection by analyzing eyes blinking,” 2014 World Symposium on Computer Applications & Research (WSCAR), Sousse, Tunisia, 2014, pp. 1-5, doi: 10.1109/WSCAR.2014.6916844.
Sanyal, et al., “Two Stream Deep Convolutional Neural Network for Eye State Recognition and Blink Detection,” 2019 3rd International Conference on Electronics, Materials Engineering & Nano-Technology (IEMENTech), Kolkata, India, 2019, pp. 1-8, doi: 10.1109/IEMENTech48150.2019.8981102.
Veena, “Efficient Method of Driver Alertness Using Hybrid Approach of Eye Movements and Bio-signals,” 2014 International Conference on Intelligent Computing Applications, Coimbatore, India, 2014, pp. 78-80, doi: 10.1109/ICICA.2014.25.
Number | Date | Country
20220327845 A1 | Oct 2022 | US