SYSTEM AND METHOD FOR USER RECOGNITION USING MOTION SENSOR DATA

TECHNICAL FIELD OF THE INVENTION

The present invention relates to systems and methods for capturing and characterizing motion sensor data. In particular, the present invention relates to systems and methods for capturing motion sensor data using motion sensors embedded in a mobile device and characterizing the motion sensor data into features for user recognition.

BACKGROUND OF THE INVENTION

Nowadays, common mobile device authentication mechanisms such as PINs, graphical passwords, and fingerprint scans offer limited security. These mechanisms are susceptible to guessing (or spoofing in the case of fingerprint scans) and to side channel attacks such as smudge, reflection, and video capture attacks. On top of this, a fundamental limitation of PINs, passwords, and fingerprint scans is that these mechanisms require explicit user interaction. Hence, these mechanisms are typically used for one-time authentication to authenticate users at login. This renders them ineffective in and of themselves when the smartphone is accessed by an adversary user after login.

Continuous authentication (or active authentication) addresses some of these challenges by periodically and unobtrusively authenticating the user via behavioral biometric signals, such as touchscreen interactions, hand movements, gait, voice, phone location, etc. The main advantage of continuous authentication mechanisms is that they do not require explicit user interaction.

One-time or continuous user identity verification (authentication) based on data collected by the motion sensors of a mobile device during the interaction of the user with the respective mobile device is a recently studied problem that emerged after the introduction of motion sensors into commonly used mobile devices. Samsung in 2005 and Apple Inc. in 2007 were among first companies to introduce hand-held mobile devices (smartphones) equipped with a sensor, more specifically an accelerometer, capable of recording motion data.

The earliest studies in continuous authentication of mobile phone users focused on keystroke dynamics, because these devices had a hardware keyboard to interface with the user. The first research article to propose the analysis of accelerometer data in order to recognize the gait of a mobile device user was in 2006. Since 2006, many other research works explored the task of user identity verification (authentication) based on data collected by the motion sensors. One commonly employed approach is to directly measure the similarity between the sample of signal recorded during authentication and previously-recorded sample of signal which is known to pertain to the user. The samples are compared based on statistical features extracted in time domain or frequency domain or both. Other works approach the task of user authentication based on motion sensors data as a classification problem. These works apply a standard machine learning methodology based on two steps: (i) extracting statistical features from the recorded motion signals in time domain or frequency domain or both and (ii) applying a standard machine learning classifier.

However, these methods which use continuous authentication for mobile devices have lower accuracy rates as compared with authentication methods that utilize PINs, passwords, fingerprints, and the like. As such, there is a need for user authentication methods and systems with improved accuracy and flexibility, and that address the issues of guessing, spoofing, and other types of presentation attacks associated with conventional authentication methods. These and other challenges (e.g., presentation attacks) are addressed by the systems and methods of the present application.

SUMMARY OF THE INVENTION

Technologies are presented herein in support of a system and method for user recognition using motion sensor data.

According to a first aspect, a method for user recognition by a mobile device using a motion signal of a user captured by at least one motion sensor is provided. The mobile device has a storage medium, instructions stored on the storage medium, and a processor configured by executing the instructions. In the method, the captured motion signal is partitioned, with the processor, into segments. One or more respective sets of features are then extracted from the segments, with the processor applying a plurality of feature extraction algorithms to the segments, wherein a given set of features includes features extracted from one or more segments by a respective feature extraction algorithm among the plurality of feature extraction algorithms. The plurality of feature extraction algorithms include Convolutional Vision Transformers (CVT) and convolutional gated recurrent units (convGRU) model algorithms. A subset of discriminative features is then selected, with the processor using a feature selection algorithm, from the one or more sets of extracted features, wherein the feature selection algorithm comprises a principal component analysis (PCA) algorithm. The user is then classified, with the processor using a classification algorithm, as a genuine user or an imposter user based on a classification score generated by the classification algorithm from an analysis of the subset of discriminative features.

In another aspect, each of the plurality of feature extraction algorithms is applied to each segment independently.

In another aspect, the motion signal is partitioned into 5 segments, and wherein the plurality of feature extraction algorithms include Mel Frequency Cepstral Coefficients (MFCC) and histogram of oriented gradients (HOG).

In another aspect, the motion signal is partitioned into 25 segments, and wherein the plurality of feature extraction algorithms include Convolutional Neural Networks (CNN).

In another aspect, the segments overlap.

In another aspect, the PCA algorithm is trained on individual users. In another aspect, the PCA algorithm comprises a first model configured to recognize shallow features and a second model configured to recognize deep features.

In another aspect, the classification algorithm comprises a stacked generalization technique, and wherein the stacked generalization technique utilizes one or more of the following classifiers: (1) Naïve Bayes classifier, (2) Support Vector Machine (SVM) classifier, (3) Multi-layer Perception classifier, (4) Random Forest classifier, (5) and Kernel Ridge Regression (KRR).

In another aspect, the at least one motion sensor comprises an accelerometer and a gyroscope.

In another aspect, a plurality of sets of features are extracted from the segments, and the sets of extracted features are combined or concatenated into feature matrices, wherein the subset of discriminative features is selected from the feature matrices.

In a second aspect, a system for analyzing a motion signal captured by a mobile device having at least one motion sensor is provided. The system includes a network communication interface, a computer-readable storage medium; and a processor configured to interact with the network communication interface and the computer readable storage medium and execute one or more software modules stored on the storage medium. The one or more software modules include a feature extraction module that when executed configures the processor to (i) partition the captured motion signal into segments, and (ii) extract one or more respective sets of features from the captured motion signal by applying a plurality of feature extraction algorithms to the segments, wherein a given set of features includes features extracted from one or more segments by a respective feature extraction algorithm among the plurality of feature extraction algorithms. The plurality of feature extraction algorithms include Convolutional Vision Transformers (CVT) and convolutional gated recurrent units (convGRU) model algorithms. The software modules also include a feature selection module that when executed configures the processor to select a subset of discriminative features from the one or more respective extracted sets of features, wherein the feature selection module comprises a principal component analysis (PCA) algorithm. The software modules further include a classification module that when executed configures the processor to classify a user as a genuine user or an imposter user based on a classification score generated by one or more classifiers of the classification module from an analysis of the subset of discriminative features.

In another aspect, the feature extraction module when executed configures the processor to apply each of the plurality of feature extraction algorithms to each segment independently.

In another aspect, the feature extraction module when executed configures the processor to partition the motion signal into 5 segments, and wherein the plurality of feature extraction algorithms include Mel Frequency Cepstral Coefficients (MFCC) and histogram of oriented gradients (HOG).

In another aspect, the feature extraction module when executed configures the processor to partition the motion signal into 25 segments, and wherein the plurality of feature extraction algorithms include Convolutional Neural Networks (CNN).

In another aspect, the feature selection module when executed configures the processor to retrain the PCA algorithm in response to a predetermined number of classifications of a specific user.

In another aspect, the classification module when executed configures the processor to classify the subset of discriminative features using a stacked generalization technique, and wherein the stacked generalization technique utilizes one or more of the following classifiers: (1) Naïve Bayes classifier, (2) Support Vector Machine (SVM) classifier, (3) Multi-layer Perception classifier, (4) Random Forest classifier, (5) and Kernel Ridge Regression (KRR).

In another aspect, the at least one motion sensor comprises an accelerometer and a gyroscope.

In another aspect, the feature extraction module when executed further configures the processor to extract a plurality of sets of features from the segments, and combine or concatenate the sets of extracted features into feature matrices, and wherein the features selection module when executes configures the processor to select the subset of discriminative features from feature matrices.

In a third aspect, a method for user recognition using a motion signal of a user captured by at least one motion sensor of a mobile device is provided. In the method, the captured motion signal is partitioned into segments using a processor of a computing device having a storage medium, instructions stored on the storage medium, and wherein the processor is configured by executing the instructions. A respective set of features is then extract from at least one segment among the segments with the processor applying a plurality of feature extraction algorithms to the at least one segment, wherein the feature extraction algorithms comprise Convolutional Vision Transformer (CVT) and convolutional gated recurrent units (convGRU) model algorithms. The user is then classified as either a genuine user or an imposter with the processor applying a classification algorithm to the one or more of the features in the respective set.

In another aspect, a subset of discriminative features is selected from the one or more sets of extracted features, with the processor using a feature selection algorithm, wherein the feature selection algorithm comprises a principal component analysis (PCA) algorithm, and wherein the classification algorithm is applied to the subset of discriminative features.

In a fourth aspect, a method for user recognition using a motion signal of a user captured by at least one motion sensor of a mobile device is provided. In the method, a segment of the motion signal is provided at a processor of a computing device having a storage medium, instructions stored on the storage medium, and wherein the processor is configured by executing the instructions. The segment of the motion signal is then converted into a format suitable for processing using an image-based feature extraction algorithm. A respective set of features is then extracted from converted segment with the processor applying image-based feature extraction algorithms, wherein the feature extraction algorithms comprise at least one of Convolutional Vision Transformers (CVT) and convolutional gated recurrent units (convGRU) model algorithms. The user is then classified as a genuine user or an imposter with the processor applying a classification algorithm to the one or more of the features in the respective set.

In another aspect, the converting step comprises building an input image from the motion signal.

In another aspect, the converting step comprises: modifying the motion signal to resemble a spatial structure of an image.

These and other aspects, features, and advantages can be appreciated from the accompanying description of certain embodiments of the invention and the accompanying drawing figures and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high-level machine learning pipeline for classification, which shows a routine for data collection, feature extraction, feature selection, and classification processes in accordance with at least one embodiment disclosed herein;

FIG. 2 is a block diagram showing a routine for a MFCC computation process in accordance with at least one embodiment disclosed herein;

FIG. 3 is a diagram depicting a computation of SDC feature vector at frame N for parameters N-d-P-k in accordance with at least one embodiment disclosed herein;

FIG. 4 is a block diagram showing a computation flow of HOG feature vector applied on a generic motion detect signal in accordance with at least one embodiment disclosed herein;

FIG. 5 displays an exemplary input image for convolutional neural networks constructed from motion signals recorded by two 3-axis mobile device sensors (accelerometer and gyroscope) in accordance with at least one embodiment herein;

FIG. 6 displays a table showing a residual block with maintaining depth dimension (ResNetBlockMaintain—RNBM) in accordance with at least one embodiment herein;

FIG. 7 displays a table showing a residual block with increasing depth dimension (ResNetBlocklncrease—RNBI) in accordance with at least one embodiment herein;

FIG. 8 displays a CNN architecture with residual blocks in accordance with at least one embodiment herein;

FIG. 9 is a diagram depicting a spatial pyramid technique applied on two-dimensional signal in accordance with at least one embodiment disclosed herein;

FIG. 10 is a diagram depicting a sliding window in accordance with at least one embodiment disclosed herein;

FIG. 11 is a block diagram showing a computation flow of a feature extraction method in accordance with at least one embodiment disclosed herein;

FIGS. 12A-12B are block diagrams showing a computation flow for verifying a user based on interaction with a mobile device measured through mobile sensors in accordance with at least one embodiment disclosed herein;

FIG. 13 discloses a high-level diagram of a system for user recognition using motion sensor data in accordance with at least one embodiment disclosed herein;

FIG. 14A is a block diagram of a computer system for user recognition using motion sensor data in accordance with at least one embodiment disclosed herein;

FIG. 14B is a block diagram of a software modules for user recognition using motion sensor data in accordance with at least one embodiment disclosed herein;

FIG. 14C is a block diagram of a computer system for user recognition using motion sensor data in accordance with at least one embodiment disclosed herein; and

FIGS. 15A-15B show a computation flow for verifying a user based on a motion signal measured through mobile sensors where the motion signal is partitioned or segmented in accordance with at least one embodiment disclosed herein.

DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS OF THE INVENTION

Disclosed herein are exemplary systems and methods for one-time or continuous user identity verification (authentication) by analyzing the data collected by the motion sensors (e.g. accelerometer and gyroscope) of a mobile device. Data collection can occur during a specific interaction of the user with the respective mobile device, e.g., during a biometric authentication, or during non-specific interactions. The exemplary systems and methods can be applied for both implicit and explicit interactions. Common approaches for user identification based on data collected using mobile device sensors are based on two steps: (i) extracting statistical features from the recorded signals and (ii) applying a standard machine learning classifier. In some embodiments disclosed herein, the disclosed method is based on three steps. In the first step (feature extraction), along with the commonly used statistical features, the system is configured to extract an extended and unique set of features which are typically used in other signal processing domains. For example, these are: Mel Frequency Cepstral Coefficients (usually applied in voice recognition), Shifted Delta Cepstrum Coefficients (usually applied in voice recognition), Histogram of Oriented Gradients (usually applied in object detection from images), Markov Transition Matrix, and deep embeddings learned with Convolutional Neural Networks (usually applied in computer vision), Convolutional Vision Transformers, and convolutional Gated Recurrent Units. In the end, the present system is configured to obtain a high-dimensional (e.g., large number of features) feature vector for each one-dimensional (single-axis) sample of a motion signal. All of these features have never been applied for user identification based on mobile device sensors. In the second step (feature selection), the system is configured to apply Principal Component Analysis to reduce the dimension of the feature space (i.e., to reduce the number of features) by keeping the more relevant (discriminative) features. In the third step (classification), the present system is configured to train a meta-classifier that uses as features the classification scores and the labels of several binary (two-class) classifiers (Support Vector Machines, Naive Bayes, Random Forests, Feed-forward Neural Networks, and Kernel Ridge Regression), as well as the classification scores and the labels of a one-class classifier (one-class Support Vector Machines). Employing a meta-classifier which uses the class labels and the score returned by both one-class and two-class classifiers is an original approach that improves the user identity verification accuracy. The present systems and methods achieve considerably higher accuracy in identifying the user compared to the common approach.

By way of example and for the purpose of overview and introduction, embodiments of the present invention are described below which concern systems and methods for user recognition using motion sensor data. In particular, the present application discloses systems and methods for analyzing user gestures or interactions with a computing device (e.g., mobile device) based on motion sensors on the computing device. This analysis can be performed in a manner that is agnostic to the context of the gesture or interaction (e.g., explicit or implicit interactions). The methods and systems of the present application are based in part on machine learning techniques, which identify characteristics relating to how a user interacts with a mobile device (e.g. movements of the device) using two multi-axis motion sensors—an accelerometer and a gyroscope.

By applying machine learning, the present systems and methods are configured to create and provide a general pipeline for verifying the identity of a person regardless of an explicit context (e.g. signature in air) or implicit context (e.g. phone tapping) of the interaction. For example, the methods and systems disclosed herein are configured to capture user-specific features, such as an involuntary hand shaking specific to the user or a particular way of holding the mobile device in the hand, without being specifically programmed to identify those particular types of features. In other words, the present systems and methods are designed to identify discriminative features in the motion sensor data of the user without regard to the corresponding interactions or gestures that the user is making. As such, the present systems and methods do not require the user to perform a specific gesture in order to verify the identity of the user, but rather can analyzes various interactions of the user (implicit or explicit or both) over a time period and identify the user on the basis of discriminative features extracted from the motion signals associated with the interactions and/or gesture(s).

In some implementations, the present system includes a cloud-based system server platform that communicates with fixed PCs, servers, and devices such as laptops, tablets and smartphones operated by users. As the user attempts to access a networked environment that is access controlled (for example, a website which requires a secure login), the user can be authenticated using the user's preregistered mobile device.

The present systems and methods are now described in further detail, along with practical applications of the techniques and other practical scenarios where the systems and methods can be applied for user verification by analyzing the gestures and/or movements captured by mobile motion sensors.

FIG. 1 presents a high-level diagram of a standard machine learning pipeline for classification, which shows a routine for data collection, feature extraction, feature selection, and classification in accordance with at least one embodiment disclosed herein. It should be understood that the exemplary systems and methods for performing user identity verification (authentication) from data collected by mobile device motion sensors, can be implemented using one or more data-processing and computing devices operating independently or in a coordinated fashion. Such computing devices can include for example, mobile devices (e.g., smartphones and tablets), laptops, work-stations and server computers devices. Exemplary systems and methods for user authentication based on biometrics and other sensor data collected using mobile devices are further described herein and in co-pending and commonly assigned U.S. patent application Ser. No. 15/006,234 entitled SYSTEM AND METHOD FOR GENERATING A BIOMETRIC IDENTIFIER filed on Jan. 26, 2016 and U.S. patent application Ser. No. 14/995,769 entitled “SYSTEM AND METHOD FOR AUTHORIZING ACCESS TO ACCESS-CONTROLLED ENVIRONMENTS” and filed on Jan. 14, 2016, each of which are hereby incorporated by reference as if set forth in their respective entireties herein.

With reference to FIG. 1, the process begins at step S105, where the processor of the mobile device is configured by executing one or more software modules to cause the one or more motion sensors (e.g., accelerometer, gyroscope) of the mobile device to collect (capture) data from the user in the form of one or more motion signals.

One of the problems that the present system is configured to address is a verification problem, and thus the system is configured to find features that are unique for an individual user to be verified. In the context of this problem, a goal is to identify users through their interaction with a device. The interaction, which is defined in a broad sense as a “gesture,” is a physical movement, e.g. finger tapping or hand shake, generated by the muscular system. To capture this physical phenomenon, the present system is configured to collect multi-axis signals (motion signals) corresponding to the physical movement of the user during a specified time domain from motion sensors (e.g. accelerometer and gyroscope) of the mobile device. In the present system, the mobile device can be configured to process these signals using a broad and diverse range of feature extraction techniques, as discussed in greater detail below. A goal of the present system is to obtain a rich feature set from motion signals from which the system can select discriminative features.

For example, the accelerometer and the gyroscope can collect motion signals corresponding to the movement, orientation, and acceleration of the mobile device as it is manipulated by the user. The motion sensors can also collect data (motion signals) corresponding to the user's explicit or implicit interactions with or around the mobile device. For example, the motion sensors can collect or capture motion signals corresponding to the user writing their signature in the air (explicit interaction) or the user tapping their phone (implicit interaction). In one or more embodiments, the collection of motion signals by the motion sensors of the mobile device can be performed during one or more predetermined time windows. The time windows are preferably short time windows, such as approximately 2 seconds. For instance, the mobile device can be configured to prompt a user via a user interface of the mobile device to make one or more explicit gestures in front of the motion sensors (e.g., draw the user's signature in the air). In one or more embodiments, the mobile device can be configured to collect (capture) motion signals from the user without prompting the user, such that the collected motion signals represent implicit gestures or interactions of the user with the mobile device.

Again, in contrast with prior systems and methods, the present systems and methods do not require the user to perform a specific gesture in order to verify the identity of the user, but rather can analyzes various interactions of the user (implicit or explicit or both) over a period of time and identify the user on the basis of discriminative features extracted from the motion signals associated with those user interactions.

In one or more embodiments, the processor of the mobile device can be configured to examine the collected motion signals and measure the quality of those signals. For example, for an explicit gesture or interaction, motion signals of the user corresponding to the explicit gesture can be measured against sample motion signals for that specific explicit gesture. If the quality of the motion signals collected from the user falls below a predetermined threshold, the user may be prompted via the user interface of the mobile device to repeat the collection step by performing another explicit gesture, for example.

After the collection of the data (motion signals), at step S110 the processor of the mobile device is configured by executing one or more software modules, including preferably the feature extraction module, to apply one or more feature extraction algorithms to the collection motion signal(s). As such, the processor, applying the feature extraction algorithms, is configured to extract one or more respective sets of features from the collected motion signals. The feature extraction module comprises one or more feature extraction algorithms. In one or more implementations, the processor of the mobile device is configured to extract a respective set of features for each of the feature algorithms, where the feature extraction algorithms (techniques) are chosen from the following: (1) a statistical analysis feature extraction technique, (2) a correlation features extraction technique, (3) Mel Frequency Cepstral Coefficients (MFCC), (4) Shifted Delta Cepstral (SDC), (5) Histogram of Oriented Gradients (HOG), (6) Markov Transition Matrix, (7) deep embeddings extracted with Convolutional Neural Networks (CNN), (8) Convolutional Vision Transformers (CVT), and (9) convolutional Gated Recurrent Units. The one or more feature extraction techniques or algorithms each operate on the same collected motion signals and are independently applied on the collected motion signals. In one or more embodiments, the one more respective set of features extracted from the motion signal(s) includes discriminative and non-discriminative features extracted using one or more of the above feature extraction algorithms.

The processor is configured to run the one or more feature extraction techniques or algorithms in parallel on the same set of collected motion signals. In at least one implementation, all of the above feature extraction techniques are utilized to extract respective sets of features for each technique from the collected motion signals. Thus, in this embodiment, seven respective sets of features are extracted, as each of the seven algorithms is independently applied in parallel on the set of collected motion signals. The implementations of these feature extraction techniques are explained in further detail below.

Feature Extraction

In some embodiments, the mobile device is configured to implement an approach for feature extraction that is based on statistical analysis (statistical analysis feature extraction technique), which tries to characterize the physical process. The statistical approaches that are used in one or more methods of the present application include but are not limited to the following: the mean of the signal, the minimum value of the signal, the maximum value of the signal, the variance of the signal, the length of the signal, the skewness of the signal, the kurtosis of the signal, the L₂-norm of the signal, and the quantiles of the distribution of signal values. Methods based on this statistical approach have good performance levels in the context of verifying a person who does the same gesture, e.g. signature in air, at different moments of time. Here, the disclosed embodiments provide a general approach suitable for different practical applications of user verification (authentication) while interacting with a mobile device, such continuous user authentication based on implicit and unconstrained interactions, i.e. multiple and different gestures. Statistical methods such as “G. Bailador, C. Sanchez-Avila, J. Guerra-Casanova, A. de Santos Sierra. Analysis of pattern recognition techniques for in-air signature biometrics. Pattern Recognition, vol. 44, no. 10-11, pp. 2468-2478, 2011” and “C. Shen, T. Yu, S. Yuan, S., Y Li, X. Guan. Performance analysis of motion-sensor behavior for user authentication on smartphones. Sensors, vol. 16, no. 3, pp. 345-365, 2016” are generally well-suited for user verification from a specific gesture. In some cases, however, the implementation of only one feature extraction technique, including statistical analysis feature extraction technique, is not discriminative enough on its own to be used in a more general context.

Another set of useful statics can be extracted by analyzing the correlation patterns among the motion signals corresponding to independent axes of the motion sensors (correlation features extraction technique). In one or more embodiments of the present application, to measure the correlation between every pair of motion signals, two correlation coefficients are employed: the Pearson correlation coefficient and the Kendall Tau correlation coefficient. The Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y, in our case two 1D signals. It is computed as the covariance of the two 1D signals divided by the product of their standard deviations. The Kendall Tau correlation coefficient is a statistic used to measure the ordinal association between two measured quantities. It is based on dividing the difference between the number of concordant pairs and the number of discordant pairs by the total number of pairs. A pair of observations are said to be concordant if the ranks for both elements agree (they are in the same order). A pair of observations are said to be discordant if the ranks for the elements disagree (they are not in the same order). It is noted that the Kendall Tau correlation coefficient has never been used to measure the correlation of 1D signals recorded by motion sensors.

Since a user can perform the same interaction (gesture) with a device in slightly different ways, there are unavoidable variations in the interaction. These variations are significant enough to pose a real problem for user verification. To address this issue, the system is configured to implement a variety of signal processing techniques from other technical domains that are specifically adapted to properly address the problem at hand. In some embodiments, the system and methods disclosed herein implement techniques adapted from the audio processing domain, more specifically the speech and voice recognition family of problems, achieving beneficial results that are unexpected. Modern state-of-the-art speaker recognition systems verify users by using short utterances and by applying the i-vector framework, as described in “Kanagasundaram, Ahilan, et al. I-vector based speaker recognition on short utterances. Proceedings of the 12th Annual Conference of the International Speech Communication Association. International Speech Communication Association (ISCA), 2011.”.

The goal of a speaker verification (voice recognition) system is to find discriminative characteristics of the human speech production system so that users can be verified. The system is by nature very flexible allowing production of several variants of neutral speech, as shown in “Kenny, Patrick, et al. A study of interspeaker variability in speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 16.5 (2008): 980-988”. In the real world, the system also needs to verify the speaker by having access to only limited duration speech data, thus short utterances being a key consideration for development.

By analogy to the speech production (vocal folds) system, with the (upper limb) muscular system involved with gestures, it can be assumed that a user's gesture, performed multiple times in the context of (implicitly) interacting with a mobile device, can have a similar degree of variation as short utterances produced by the vocal folds of a person (user) while pronouncing the same word multiple times. From a real-world perspective, as with the speaker recognition system, the general approach of the disclosed systems and methods preferably are configured to verify interactions than can have a limited duration, e.g., sometimes a gesture being performed by the user in a time window of, say, 2 seconds. In this context, feature extraction methods that are used in a speaker recognition system are adapted for use with the present systems and methods for the purpose of characterizing interactions of a user with the mobile device.

In some embodiments disclosed herein, the exemplary systems and methods implement a technique, which is a feature extraction approach first developed for automatic speech and speaker recognition systems namely, Mel Frequency Cepstral Coefficients (MFCC), which model the human hearing mechanism. MFCC were introduced in early 1980s for speech recognition and then adopted in speaker recognition systems. Even if various alternatives features have been developed, this feature extraction method is difficult to be outperformed in practice. A thorough study on different technique used in speaker recognition system can be found in “Kinnunen, Tomi, and Haizhou Li. An overview of text-independent speaker recognition: From features to supervectors. Speech communication, vol. 52, no. 1: pp. 12-40, 2010.”

In the MFCC computation process for speech signals, the speech signal is passed through several triangular filters which are spaced linearly in a perceptual Mel scale. The Mel filter log energy (MFLE) of each filter are calculated. The cepstral coefficients are computed using linear transformations of the log energy filters. These linear transformations are essential for characterizing the voice of a user. These linear transformations can also be used in our approach for characterizing gestures in different contexts, e.g. during implicit interactions. The major reasons for applying linear transformations are: (a) improving robustness of MFLE. The energy filters are susceptible to small changes in signal characteristics due to noise and other unwanted variabilities, (b) decorrelation: the log energy coefficients are highly correlated whereas uncorrelated features are preferred for pattern recognition systems.

From a physiological perspective, when the MFCC technique is used in a speaker recognition system, there is an implicit assumption that the human hearing mechanism is the optimal speaker recognizer. In contrast, in adapting this technique to gesture recognition as disclosed herein for user verification based on interactions with a mobile device, the MFCC technique can operate on an implicit assumption that the motion sensors (accelerometer and gyroscope) represent the optimal interaction recognizer.

In some embodiments of the disclosed method and system, the MFCC technique is tuned using several parameters: sample rate, window length, window shift size, minimum and maximum frequency rate, number of MLFE and so on. The first change that is implemented to adapt this technique to gesture signals captured with the mobile devices is related to the sample rate used to capture an interaction using the accelerometer and gyroscope mobile sensors. In comparison with the sampling rate used for speaker recognition systems, where signals are recorded at a 4, 8, 16 kHz, a standard sample rate used to develop real time mobile applications based on user device interactions is around 100 Hz, for example. Since the sampling rate is three orders of magnitude lower, for example, the features resulting from the motion signals are very different than those resulting from voice signals.

Secondly, the exemplary systems and methods are designed to take into consideration the dynamics of the signals. The voice signals have a high variation in a very short period of time, thus the window length configured to crop the signals and apply the MFCC technique is between 20 and 40 milliseconds. In this time frame the voice signal does not change its characteristics, the cropped signal being statistically stationary. For example, if a voice signal is recorded at a 16 kHz sample rate and the window length is configured to crop the signal with an interval of 25 milliseconds, the time frame on which MFCC is applied on has 400 sample points. In one or more embodiments, the variation of gesture signals is three orders of magnitude lower than the variation of voice signals and the sample rate at which the interaction is recorded as well, 100 Hz in comparison with 16 kHz. As such, the window length is adapted accordingly. For example, and without limitation, values of the window length, for which the cropped signals have presented good performance levels in terms of characterizing the signal properties, range between 1 and 2 seconds. This time frame, for a signal with a sample rate of 100 Hz, corresponds to a cropped signal ranging between 100 and 200 sample points.

The window shift size, which dictates the percentage of overlap between two consecutive windows, is also adapted as well. In the context of voice recorded signals, the window overlap percentage generally has values in the range of 40%-60% in one or more embodiments disclosed herein. For example, in case of a window length of 20 milliseconds used for voice signals, the window shift size is chosen to be 10 milliseconds. This value range is influenced in a certain amount by three factors: (1) sample rate, (2) high variation of voice signals, and also (3) by the practical performance levels. In contrast, signals recorded by the motion sensors during the interaction between a user and a mobile device do not present high variations over short periods of time (compared to the voice signals) and also the sample rate used to capture the gesture is significantly lower than in the case of recorded voice signals. Taking into consideration these two factors and measuring the performance levels in practical experimentation, for the present system, the window overlap percentage for gesture recorded signals has values in the range of 10%-40%.

The other configuration parameters of the MFCC technique have been used with standard values applied to develop speaker and voice recognition systems.

FIG. 2 presents the block diagram of the exemplary MFCC computation process in accordance with at least one embodiment disclosed herein. The signal goes through a pre-emphasis filter; then gets sliced into (overlapping) frames and a window function is applied on each frame. Next, a Discrete Fourier Transform is applied on each frame and the power spectrum is computed. This results in a Mel filter bank. To obtain the MFCC, a Discrete Cosine Transform is applied to the filter bank retaining a number of the resulting coefficients while the rest are discarded. Finally, the Delta Energy and the Spectrum are computed.

Another class of features that characterize speech is the prosodic features, which have been studied in “D. R. González, J. R. Calvo de Lara. Speaker verification with shifted delta cepstral features: Its Pseudo-Prosodic Behaviour. In: Proceedings of I Iberian SLTech, 2009”. Prosodic is a collective term used to describe variations found in human speech recordings, e.g. pitch, loudness, tempo, intonation. In our context, a user can perform the same interaction with a device in slightly different ways, e.g. movement speed, grip, tremor. These variations of a gesture performed by a user can be characterized by using same class of prosodic features.

In some speaker recognition systems, the prosodic features are extracted by using Shifted Delta Cepstral (SDC) technique. In comparison with MFCC, this method is applied on voice signals to incorporate additional temporal information the feature vector. For the present system, since the interaction of a user is recorded by using mobile motion sensors, accelerometer and gyroscope, which record the physical change of the gesture over time, the present systems and methods can be configured to similarly apply SDC techniques in the context of user identification based on sensor data to capture the temporal information.

The SDC technique is configured by a set of 4 parameters, (N, d, P, k), where:

- N—number of cepstral coefficients computed at each frame;
- d—time advance and delay for the delta computation;
- P—time shift between consecutive blocks; and
- k—number of blocks whose delta coefficients are concatenated to form the final feature vector.

In an exemplary approach to SDC feature extraction disclosed herein, the system can be configured to use SDC with the (N, d, P, k) parameter configuration (7, 1, 3, 7).

FIG. 3 presents an exemplary computation of the SDC feature vector at a frame N in accordance with at least one embodiment disclosed herein. First, an N-dimensional cepstral feature vector is computed in each frame t of the signal. Next, each coefficient c is subtracted using spaced td frames to obtain the delta features. Finally, k different delta features, spaced at P frames apart, are stacked to form a SDC feature vector for each frame. The SDC vector at frame t is given by the concatenation from i=0 to k−1 blocks of all the Δc(t+iP).

As shown in FIG. 1 and noted above, subsequent to the steps for feature extraction, the system can be further configured to perform steps for user identification based on data collected using mobile device sensors. In particular, as shown in FIG. 1 the system can be configured to perform the operation of feature selection (step S115), for instance, using Principal Component Analysis, so as to identify the discriminative feature information resulting from extraction. Furthermore, the system can then perform classification of the so processed data (step S120). For instance, classification can be performed using a meta-classifier that uses as features the classification scores and labels of several binary (two-class) classifiers (Support Vector Machines, Naive Bayes, Random Forrest, Feed-forward Neural Networks, Kernel Ridge Regression) and a one-class classifier (one-class Support Vector Machines).

Another goal of the approach as disclosed herein is to verify a user based on his or her interaction with a mobile device by using the device sensors to record the interaction. Up until now, the present disclosure has discussed the term “interaction” in a general sense. A “user interaction” as used in the systems and methods disclosed herein can be defined as: (1) in a one-time interaction context, e.g., a tap on the touchscreen, or (2) in a continuous verification context, e.g. a sequence of multiple and distinctive gestures, such as a tap on the touchscreen followed by a slide gesture on the touchscreen and a handshake. Furthermore, depending on the one-time verification process, a user can also perform a sequence of multiple and distinctive gestures with a device, for instance when the verification of a user is done by using multiple steps, such as biometric authentication followed by SMS code verification. Thus, a user interaction is defined as being composed of a sequence of one or multiple consecutive interactions with the mobile device measured by sensors e.g., accelerometer, gyroscope. The consecutive and typically shorter interactions that form a single interaction are called “local interactions.”

Analyzing the interactions of the same user in different contexts, the inventors have determined that a local interaction can be described by the variation of the measured signal during a period of time, e.g., one second for tapping. The signal variation can be characterized in terms of distribution of movement intensity or direction. The three feature extraction methods described above (statistical features, MFCC, SDC) are agnostic to the definition of interaction described above. Therefore, the systems and methods described herein utilize other domains in order to take into account this specific definition of interaction.

In accordance with at least one embodiment described herein, a feature extraction method that can be used to describe the dynamics of the “user interaction” is the histogram of oriented gradients (HOG), which it is used as a standard technique for object recognition systems in the computer vision field. The idea behind HOG is that local object appearance and shape within an image can be described by the distribution of intensity gradients or edge directions. To make an analogy, the local shape of an object can be viewed as a local interaction during a user verification session, where the intensity and direction can be used to describe the shape of the signal variation during the local interaction.

The HOG feature descriptor also presents a number of advantages, in comparison with other feature descriptors, those being: (1) invariant to some geometric transformations (e.g., translations) and (2) invariant to photometric transformations (e.g. noise, small distortions), except for object orientation. More details, comparisons with other descriptors and properties of the technique can be found in the study “N. Dalal, T. Bill. Histograms of oriented gradients for human detection. Computer Vision and Pattern Recognition, vol. 1, pp. 886-893, 2005”. When the HOG feature descriptor is used to describe the signal corresponding to a local interaction, its properties come in handy. Being invariant to noise transformations, the HOG descriptor can encode the generic trend of the local signal, while removing small noise variations that are introduced by sensors or by the user's hand tremor. The fact that HOG is not invariant to object orientations—in the case of the present systems and methods, the generic trend of the signal—is helpful. For example, if a user has higher intensity changes in the beginning of the motion signal recorded during a finger tap, it is preferable not to use a descriptor that provides the same encoding for a different signal with higher intensity changes near the end. In accordance with at least one embodiment described herein, the general processing flow for applying HOG as a feature descriptor on an image is:

- Calculate the horizontal and vertical gradients of the image. The gradients are generally computed using 2D filters, e.g. Sobel filters.
- Divide the image into cells of p×p pixels. The standard cell size is of 8×8 pixels.
- For each cell, calculate the intensity and orientation of the gradient in each pixel in the cell.
- For each cell, the orientation values are quantized into a n-bin histogram. The typical choice for n is 8 or 9.
- The next step is block normalization using a block size of m×m adjacent cells. The blocks are usually formed of 2×2 cells.
- For each block, the histograms of the corresponding cells are concatenated.
- For each block, calculate the L₂-norm of the concatenated histograms.
- The HOG descriptor is obtained by concatenating all blocks into one vector.
  
  FIG. 4 presents the processing flow of HOG feature extraction technique for one-dimension (single-axis) motion signals in accordance with at least one embodiment described herein.

In order to apply the HOG descriptor on time-domain signals recorded by motion sensors, the HOG approach is adapted from two-dimensional (2D) discrete signals (images) to one-dimensional (1D) discrete motion signals. It is noted that a 1D motion signal is used for axis of the motion sensors. In accordance with one or more embodiments disclosed herein, the present systems and methods make the following changes to the HOG approach in order to use it on motion signals:

- A 2D cell used in the image domain corresponds to a short 1D timeframe of the one-dimension signal, with the size of p elements, not p×p pixels.
- In the motion signal domain, a block is a group of m adjacent timeframes instead of m×m adjacent cells (as in the image domain).
- Gradients of the 1D signal (motion signal) are calculated only in one direction given by time axis, different from the image domain, in which gradients are computed in the two spatial directions (horizontal and vertical) of the image.
- For gradient calculation, a 1D filter is applied instead of two (vertical and horizontal) 2D filters. The resulted gradient vector is the first derivate of the 1D motion signal.

For an image domain, HOG is usually based on 8 or 9 gradient orientations. In contrast the HOG version adapted for the signal domain in the present systems and methods uses only two (2) gradient orientations. As described above, the present systems and methods employs multiple changes to HOG techniques to adapt the HOG feature extraction technique for motion signals.

One study that has applied HOG as feature extraction method in time-series classification is “J. Zhao, L. Itti. Classifying time series using local descriptors with hybrid sampling. IEEE Transactions on Knowledge and Data Engineering 28, no. 3, pp. 623-637, 2017”. Regarding this study, it should be noted that it presents an algorithm used for general time-series classification problem, not the usage of HOG in time-series classification. To our knowledge, HOG has not been applied as a feature extraction method in the context of user behavior verification.

The feature extraction methods described above characterize the interaction process of a user with a mobile device from two perspectives: (1) using statistical analysis and (2) signal processing. Both perspectives are based on interpreting the interaction process (e.g. movement) as a deterministic process, in which no randomness is involved in the evolution of the interaction. However, an interaction is not necessarily a deterministic process. For example, depending on the speed movement of a gesture at a certain moment of time t during the interaction, the user can accelerate or decelerate the movement at time t+1, e.g. putting the phone down on the table. Hence, it is more natural to take into consideration that the interaction process can be modeled as a stochastic process.

Based on this interpretation of the physical interaction process, in at least one embodiment described herein, the present systems and methods can characterize stochastic processes using discrete states. In this context, a discrete state is defined as a short interval in the amplitude of the signal. The model considered to be a good fit for describing the interaction is the Markov Chain process. The idea behind this modelling technique is to characterize changes between system's state as transitions. The model associates a probability for each possible transition from the current state to a future state. The probability values are stored in a probability transition matrix, which is termed as the Markov Transition Matrix. The transition matrix can naturally be interpreted as a finite state machine. By applying the Markov Chain process model in the context of the present systems and methods, the information given by the transition matrix can be used as features characterizing the stochastic component of the interaction process with a mobile device. More information regarding the Markov Chain process can be found in the study “S. Karlin. A first course in stochastic processes. Academic press, pp. 27-60, 2014”.

To calculate the Markov Transition Matrix, a transformation technique is applied to convert the discrete signals, resulted from the measurements of the mobile sensors, into a finite-state machine. In one or more embodiments, the conversion process is based on the following steps:

- Configure the number of discrete states q of the finite-state machine.
- For each discrete signal divide the amplitude into q quantiles.
- Set the range between each two consecutive quantiles as a discrete state.
- For each amplitude value of the signal, associate the state that corresponds to the respective value, i.e. if the amplitude value fits in the corresponding range.
- The corresponding states for each amplitude value are recorded in a state vector, keeping the temporal order of the signal readings provided by the motion sensors.
- The state vector is used to build the q×q Markov Transition Matrix by counting the changes between consecutive states.
- Each row in the Markov Transition Matrix is normalized to transform the count values into probabilities.
- The final feature vector is obtained by linearizing the Markov Transition Matrix.

For configuring the number of quantiles, no best practice or standard used in the research community was found, this value being dependent on the application and the shape of the signals. In one or more embodiments of the systems and methods described herein, it has been determined that a value of 16 states is a good choice for motion signals recorded by motion sensors. This value has been determined through experiments, starting from 2 quantiles up to 64, using powers of 2 as possible values. After transforming the discrete motion signal into a finite-state machine, the Markov Chain model algorithm has been applied to create the probability transition matrix.

Each of the features described so far are obtained through an engineered process that encapsulates knowledge and intuition gained in the field of machine learning and related fields of study. However, computer vision researchers have found that a different paradigm, in which features are not engineered but automatically learned from data into an end-to-end fashion, provides much better performance in object recognition from images and related tasks. Indeed, this paradigm, known as deep learning, has been widely adopted by the computer vision community in the recent years, due to their success in recognizing objects, as illustrated in “A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Proceedings of NIPS, pp. 1106-1114, 2012.” and in “K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of CVPR, pp. 770-778, 2016.”

The state-of-the-art approach in computer vision is represented by deep convolutional neural networks (CNN). Convolutional neural networks are a particular type of feed-forward neural networks that are designed to efficiently process images through the use of a special kind of layer inspired by the human visual cortex, namely the convolutional layer. The information through the network moves in only one direction, from the input layer, through the hidden layers and to the output layers, without forming any cycles. Convolutional neural network for multi-class image classification (a task also known as object recognition in images) are typically trained by using Stochastic Gradient Descent (SGD) or other variants of the Gradient Descent algorithm in order to minimize a loss function. The training process is based on alternating two steps, a forward pass and backward pass, until the model's prediction error is sufficiently low. The forward pass consists of passing the training data through the model in order to predict the class labels. In the backward pass, the error given by the current predictions is used to update the model in order to improve the model and reduce its error. In order to update the model's weights, the errors are back-propagated through the network using the back-propagation algorithm described in “D. E. Rumelhart, G. E. Hinton, R. J. Williams. Learning representations by back-propagating error. Nature, vol. 323, no. 9, pp. 533-536, 1986”.

After several iterations (epochs) over the training data, the algorithm is supposed to find the model's weights that minimize the prediction error on the training set. This is done by making small adjustments to the model's weights that move it along the gradient (slope) of the loss function down towards a minimum error value. If the loss function is non-convex, which is usually the case, the algorithm will only find a local minimum of the loss function. However, there are many practical tricks that help the network in avoiding local minima solutions. For example, one approach is to split the training set into small batches, called mini-batches, and execute the forward and backward steps on each mini-batch. As each and every mini-batch contains a different subset of training samples, the gradient directions will be different each time. Eventually, this variation can help the algorithm to escape local minima.

Convolutional neural networks have a specific architecture inspired by the human visual cortex, a resemblance that is confirmed by “S. Dehaene. Reading in the brain: The new science of how we read. Penguin, 2009”. In the former layers (closer to the input), the CNN model learns to detect low-level visual features such as edges, corners and contours. In the latter layers (closer to the output), these low-level features are combined into high-level features that resemble object parts such as car wheels, bird beaks, human legs, and so on. Hence, the model learns a hierarchy of features that helps to recognize objects in images. Such low-level or high-level features are encoded by convolutional filters that are automatically learned from data. The filters are organized into layers known as convolutional layers.

To use convolutional neural networks on a different data type (motion signals instead of images) in the present system and method, an input image is built from the motion signals recorded by the mobile device motion sensors. The present system adopts two strategies. The first strategy is to stack the recorded signals (represented as row vectors) vertically and obtain a matrix in which the number of rows coincides with the number of signals. For instance, in an embodiment in which there are 3-axis recordings of the accelerometer and the gyroscope sensors, then the corresponding matrix has 6 rows. The second strategy is based on stacking the recorded signals for multiple times, such that every two signals can be seen together in a vertical window of 2 rows. To generate the order in which the signals should be stacked, a de Bruijn sequence is used, as described in “N. G. de Bruijn, Acknowledgement of Priority to C. Flye Sainte-Marie on the counting of circular arrangements of 2n zeros and ones that show each n-letter word exactly once. T.H.-Report 75-WSK-06. Technological University Eindhoven, 1975”. The second strategy aims to ensure that the convolutional filters from the first convolutional layer can learn correlations between every possible pair of signals. For instance, in an embodiment in which there are 3-axis recordings of the accelerometer and the gyroscope sensors, then the corresponding matrix has 36 rows. For both strategies, the input signals are resampled to fixed length for each and every input example. The resampling is based on bilinear interpolation.

FIG. 5 illustrates an exemplary input image constructed by applying the second strategy of generating examples for the convolutional neural networks.

Most CNN architectures used in computer vision are based on several convolutional-transfer-pooling blocks followed by a few fully-connected (standard) layers and the softmax classification layer. Our CNN architecture is based on the same structure. The architecture described in “K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of CVPR, pp. 770-778, 2016” diverges from this approach by adding residual connections between blocks and by using batch normalization. A similar CNN architecture is adopted in the present method, which includes residual connects and batch normalization. Two types of blocks with residual connections are used, one that keeps the number filters (example depicted in FIG. 6) and one that doubles the number of filters (example depicted in FIG. 7). In both cases, the Exponential Linear Unit (ELU) transfer function and average pooling are used.

FIG. 8 presents an example generic architecture of the convolutional neural networks with residual connections in accordance with one or more embodiments. From this generic CNN architecture, 5 particular CNN architectures are derived that have slight variations, e.g. different kernel shapes (3×7 or 6×7), strides (3×2 or 2×2), number of residual blocks (from 3 to 5). Despite these variations, all CNN architectures are trained on multi-class motion signal classification task, using the classical softmax loss. Each network is trained on mini-batches of 80 examples for 50-100 epochs, using a learning rate of 0.005. The chosen optimization algorithm is SGD with momentum set to 0.9. After the training process is finished, the last three layers named Dropout2, Softmax and SoftmaxLoss are removed. The output of the last remaining layer (a fully-connected layer with 100 neurons named Embedding) is then used as a feature vector that is automatically learned from the input motion signals. Given that 5 CNN models are independently trained, a total of 500 deep features are obtained. These features can also be interpreted as an embedding of the motion signals into a 500-dimensional vector space, in which the users can be classified more easily.

To recap the feature extraction techniques of the present systems and methods disclosed herein, a broad diversity of techniques have been applied, ranging from standard techniques used for analyzing time-series, e.g., (1) statistical features and (2) correlation features, to feature extraction methods adapted from the speaker and voice recognition domain, e.g. (3) Mel Frequency Cepstral Coefficients and (4) Shifted Delta Cepstral, and feature extraction methods adapted from the computer vision domain, e.g. (5) Histogram of Oriented Gradients and (6) deep embeddings extracted with Convolutional Neural Networks. A feature extraction method adapted from stochastic process analysis has also been applied, namely the (7) Markov Transition Matrix.

Different from standard methods, another important and distinctive feature of the system and method disclosed herein is the use of such a broad and diverse set of features. To our knowledge, there are no methods or systems that incorporate such a broad set of features. A challenge in incorporating so many different features is to be able to effectively train a classification model with only a few examples, e.g. 10-100, per user. First, the feature values are in different ranges, which can negatively impact the classifier. To solve this problem, the present system independently normalizes each set of features listed above. Secondly, there are far more features (thousands) than the number of examples, and even a simple linear model can output multiple solutions that fit the data. To prevent this problem, the present system applies a feature selection technique, Principal Component Analysis, before the classification stage, as discussed in further detail below.

It should be noted that, in one or more embodiments disclosed herein, every feature extraction method is applied on the entire signal, in order to characterize the global features of the signals, and also on shorter timeframes of the signals, in order to characterize the local patterns in the signal. Depending on the feature set, two approaches are used for extracting shorter timeframes from the signal. One approach is based on recursively dividing the signal into bins, which generates a pyramid representation of the signal. In the first level of the pyramid, one bin that spans the entire signal is used. In the second level of the pyramid, the signal is divided into two bins. In the third level of the pyramid, each bin is divided from the second level into two other bins, resulting in a total of 4 bins. In the fourth level of the pyramid, the divisive process continues and 8 bins are obtained. This approach can be visualized using a pyramid representation with four levels, with 1, 2, 4, and 8 bins on each level, respectively. This process is inspired by the spatial pyramid representation presented in “S. Lazebnik, C. Schmid, J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169-2178, 2006”, which is commonly used in computer vision to recover spatial information in the bag-of-visual-words model, as illustrated in the paper “R. T. Ionescu, M. Popescu, C. Grozea. Local Learning to Improve Bag of Visual Words Model for Facial Expression Recognition. In Proceedings of Workshop on Challenges in Representation Learning, ICML, 2013”. The pyramid representation is used to extract statistical features, correlation features and Markov Transition Matrix features. On the other hand, a different approach is employed for computing shorter timeframes when the MFCC and SDC techniques are used to extract features. This approach is also inspired from the computer vision field, more specifically by the common sliding window approach used in object detection, which is presented in “C. Papageorgiou, T. Poggio. A trainable system for object detection. International Journal of Computer Vision, vol. 38, no. 1, pp. 15-33, 2000”. Instead of sliding a 2D window over an image, a 1D window is slid over the motion signal. For each window, the MFCC and the SDC features are extracted. In the sliding window approach, the windows can have significant amount of overlap. In at least one embodiment described herein, the overlap allows one to employ multiple and larger windows, which are necessary for the MFCC and SDC processing steps. Different from the sliding window approach, it is noted that the pyramid representation generates disjointed (non-overlapping) bins. Finally, it should be noted that the spatial pyramid representation or the sliding window algorithm has never been used in related art on biometric user authentication based on motion sensors.

FIG. 9 displays an exemplary spatial pyramid technique applied to on a 2D signal and

FIG. 10 displays an exemplary sliding window, in accordance with one or more embodiments described herein.

FIG. 11 presents the computation flow of the feature extraction step (S110) of the present method of verifying a user based on the interaction with a device by measuring it with mobile sensors, e.g. accelerometer and gyroscope, in accordance with one or more embodiments described herein. As shown in FIG. 11, in step S110 of the present method, the processor of the mobile device is configured by executing one or more software modules, including one or more feature extraction algorithms, to extract a one respective set of features from the collected motion signal(s) 1105. A given set of features can include discriminative and non-discriminative features extracted from the motion signal 1105 by a given feature extraction algorithm among the one or more algorithms. To extract the respective sets of features, the processor analyzes the motion signals using the one or more feature extraction algorithms, which are chosen from the following: statistical feature extraction technique 1110, correlation features extraction technique 1115, Mel Frequency Cepstral Coefficients (MFCC) 1120, Shifted Delta Cepstral (SDC) 1125, Histogram of Oriented Gradients (HOG) 1130, Markov Transition Matrix 1135 and deep embeddings extracted with Convolutional Neural Networks (CNN) 1140. The respective sets of features extracted from the collected motion signals can be in the form of concatenate feature vectors 1145. The processor can then be configured to select one or more subsets of features (features vectors 1145) from the respective sets of features as explained in further detail below.

Feature Selection

It is noted that the present systems and methods, at least in part, address the general problem of user verification based on the motion patterns recorded during a generic interaction with a mobile device. Accordingly, the present systems and methods use a general approach for verifying the user, which is independent of the verification context: explicit, implicit, one-time verification or continuous verification. The interaction is also defined as being composed of one or more different gestures, depending on the context. The types of gestures performed by the user and measured with the mobile phone sensors are not constrained and can vary in multiple ways. Therefore, the approaches of the present systems and methods have a high level of flexibility in characterizing the interaction of a user using the mobile device. For this reason, an extended set of features (feature vectors 1145, FIG. 11) are extracted that contains discriminative features for various types of gestures and hand movements. More precisely, it is noted that each feature extraction technique can provide a different type of information about the recorded signal, e.g. statistical information or frequency information. In a scenario where an interaction is composed of more than one gesture, the applied feature techniques will not have the same importance in characterizing each type of gesture. A gesture, in this case, can be characterized better by a combination of features which is a subset of the entire set of features, and this combination of features may not necessarily work best for another gesture.

To adapt the features of the present systems and methods for a more specific set of interactions, e.g. implicit one-time verification, a feature selection algorithm is employed. Specifically, referring again to FIG. 1, at step S115 the processor of the mobile device is configured by executing one or more software modules, including preferably a feature selection module, to select a subset of discriminative features from the set of extracted features of the user. In one or more embodiments, the feature selection module employs the feature selection algorithm. The role of the feature selection algorithm is to select the most representative features that characterize a specific set of interactions composed of multiple gesture and, in the same time, the most discriminative features used for verifying the actual user against different impersonators whom are replicating the interaction. In one or more embodiments of the present systems and methods, the technique that is incorporated in the feature selection algorithm is Principal Component Analysis (PCA), a feature selection approach used in the field of machine learning.

Principal Component Analysis performs dimensionality reduction by finding a projection matrix which embeds the original feature space, where the feature vectors reside, into a new feature space with less dimensions. The PCA algorithm has two properties that assist with the subsequent classification step: (1) the calculated dimensions are orthogonal and (2) the dimensions selected by the algorithm are ranked according to the variance of the original features, in descending order. The orthogonal property assures that the dimensions of the embedded feature space are independent of each other. For example, if in the original space the features have high covariance, meaning that the calculated features are correlated, then the system employs the algorithm to calculate the dimensions so that the features projected in the new space can be represented as a linear combination. Thus, the system, by way of the feature selection algorithm, eliminates any correlation between the features, e.g. one feature X will not influence another feature Y in the new space. The ranking according to the variance assures that the dimensions of the new space are the ones that can best describe the original data. The information quantity projected into the new space, measured in terms of variance, can vary depending on the number of dimensions selected to be calculated by the PCA algorithm. Thus, the number of dimensions has a direct influence on the quantity of information preserved in the new projected space. The second property allows one to find the number of dimensions that provides the most representative and discriminative features. This value has been determined through experimental runs, by starting from 50 dimensions, up to 300 dimensions, with a step of 50. The best results obtained are in the range of 100 to 250 dimensions, depending on the context of the interaction. In one or more embodiments, the number of dimensions that gives good results captures about 80% of the variability in the original space. The analysis denotes the fact the rest of 20% is provided by redundant features which are eliminated by PCA.

As such, in one or more embodiments, in the step of feature selection (S115) the processor of the mobile device is configured by executing the feature selection module to rank the extracted features based on the level of variability between users and to select the feature with the highest levels of variability to form the subset of discriminative features. A small and diverse (orthogonal) set of features with high variance can make the classification task less complex, i.e., the classifier selects the optimal weights for a smaller set of features, those that are more discriminative for the task at hand. The discriminative features are selected after combining each kind of features into a single set of features. In other words, PCA is not applied independently on each set of features from the respective feature extraction algorithms, but rather it is applied on a single set of features that made by combining the features from each feature extraction algorithm.

Classification

With continued reference to FIG. 1, at step S120 the processor of the mobile device is configured by executing one or more software modules, including preferably a classification module (classification algorithm(s)), to classify the user as a genuine user or an imposter user based on a classification score generated by the classification algorithm(s) (i.e., classifiers) from an analysis of the subset of discriminative features. In one or more embodiments, for step S120 an ensemble learning approach is used by combining different types of classifiers.

The technique used in certain biometric verification approaches is a meta-learning method known as stacked generalization. Stacked generalization (or stacking), as introduced in “D. H. Wolpert. Stacked generalization. Neural Networks, vol. 5, pp. 241-259, 1992”, is based on training a number of base learners (classifiers) on the same data set of samples. The outputs of base classifiers are subsequently being used for a higher-level learning problem, building a meta-learner that links the outcomes of the base learners to the target label. The meta-learner then produces the final target outcome. The method has been proven to be effective for many machine learning problems, especially in the case when the combined base learners are sufficiently different from each other and make distinct kinds of errors. Meta-learning aims to reduce the overall error by eliminating the specific errors of the individual (base) classifiers.

Due to the high level of generality desired by the present systems and methods in order to address a high variability of possible gestures, different types of base classifiers can be applied for modeling all the dynamics that a user interaction process has. The stacked generalization technique, a meta-classifier, improves generalization performance, and this represents an important criterion when modelling processes using machine learning techniques.

In at least one embodiment described herein, the meta-learning approach at step S120 is organized in two layers. The first layer provides multiple classifications of the user interaction using the features selected by the PCA algorithm, while the second layer classifies the user interaction using the information (output) given by the first layer. It should be note that, different from the standard approach, the features used in the second layer are composed of both the predicted labels (−1 or +1) and the classification scores (continuous real values) produced by the classifiers from the first layer. In prior approaches, the second layer received as features only the predicted labels of the base classifiers. In the present systems and methods, the classification scores are used as well, but they are interpreted as unnormalized log-probabilities and they are transformed as follows:

s*=(e^s/(e^s+e^−s))*2−1,

where e is the Euler number, s is the classification score of a base classifier and s* is the score normalized between −1 and 1.

In at least one embodiment disclosed herein, as classification techniques, the present systems and methods use binary classifiers that distinguish between two classes, a positive (+1) class corresponding to the Genuine User and a negative (−1) class corresponding to Impostor Users. The Genuine User class represents the user to be verified, while the Impostor User class represents the attackers who try to impersonate the actual user during the verification process.

For the first layer of the stacked generalization technique, the following classifiers can be used:

- Support Vector Machines (SVM)—Support Vector Machines try to find the vector of weights that defines the hyperplane that maximally separates the training examples belonging to the two classes. The training samples that fall inside the maximal margin are called support vectors.
- Naïve Bayes Classifier (NB)—The NB classification technique can be applied to binary classification problems as well as multi-class problems. The method is based on Bayes Theorem with an assumption of independence among predictors. A NB classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For some types of probability models, NB classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for NB models is based on the maximum likelihood method. Despite of its simplicity, Naïve Bayes can often outperform more sophisticated classification methods.
- Multi-Layer Perceptron (MLP)—The Multi-Layer Perceptron, also known as feed-forward neural networks, is organized into sequential layers of perceptron units. The information through the network moves in only one direction, from the input layer, through the hidden layers and to the output layers, without forming any cycles. Neural networks for multi-class classification problems can be trained using gradient descent or variants of the gradient descent algorithm in order to minimize a loss function. The training process is based on alternating two steps, a forward pass and backward pass, until the model's prediction error is sufficiently low. The forward pass consists of passing the training data through the model in order to predict the class labels. In the backward pass, the error given by the current predictions is used to update the model in order to improve the model and reduce its error. In order to update the model's weights, the errors are back-propagated through the network. After several iterations (epochs) over the training data, the algorithm finds the model's weights that minimize the prediction error on the training set. This is done by making small adjustments to the model's weights that move it along the gradient (slope) of the loss function down towards a minimum error value.
- Random Forest Classifier (RF)—The Random Forest Classifier is an ensemble learning method used for binary and multi-class classification problems, that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes. A decision tree (as a predictive model) goes from observations about an item (represented in the branches) to conclusions about the item's class label (represented in the leaves).
- Kernel Ridge Regression (KRR)—Kernel Ridge Regression is technique that combines Ridge Regression with the kernel trick, thus learning a linear function in the space induced by a kernel function. Kernel Ridge Regression selects the vector of weights that simultaneously has small empirical error and small norm in the Reproducing Kernel Hilbert Space generated by the kernel function.

As a meta-classifier, the present systems and methods can use Support Vector Machines classifiers in accordance with at least one embodiment. It yields good performance in terms of accuracy, False Acceptance Rate (FAR) and False Rejection Rate (FRR). It is noted that the stacked generalization technique boosts the accuracy by around 1-2% over the best base classifier. The base classifiers are trained independently, using specific optimization techniques. For training, a standard supervised learning process is used in which a classifier is trained on a set of feature vectors with corresponding labels (indicating the user that produced the motion signal from which the feature vector is obtained by feature extraction and selection) such that the classifier learns to predict, as accurately as possible, the target labels. In this regard, for example, the SVM classifier is trained using Sequential Minimal Optimization, the NB model is trained using Maximum Likelihood Estimation, the MLP is trained using Stochastic Gradient Descent with Momentum, the RF classifier is constructed based on Gini Impurity, and KRR is trained by Cholesky decomposition.

FIGS. 12A-12B presents the computation flow of the approach to verify a user based on interaction with a mobile device measuring it through mobile sensors in accordance with one or more embodiments disclosed herein. In particular, FIGS. 12A-12B display exemplary feature extraction (S110), feature selection (S115), and classification (S120) steps in accordance with one of more embodiments of the present method.

In FIG. 12A (as discussed above for FIG. 11), at step S110, the processor of the mobile device is configured by executing one or more software modules, including the feature extraction module, to extract a set of features from the collected motion signals 1105 using one or more of: statistical feature extraction technique 1110, correlation features extraction technique 1115, Mel Frequency Cepstral Coefficients (MFCC) 1120, Shifted Delta Cepstral (SDC) 1125, Histogram of Oriented Gradients (HOG) 1130, Markov Transition Matrix 1135 and deep embeddings extracted with Convolutional Neural Networks (CNN) 1140. The set of features extracted from the collected motion signals can be in the form of concatenate feature vectors 1145.

At step S115 the processor of the mobile device is configured by executing the feature selection module, to select a subset of discriminative features from the set of extracted features (feature vectors 1145) of the user. The feature selection module utilizes the Principal Component Analysis approach to rank the extracted features based on their respective levels of variability among users.

Turning now to FIG. 12B, once the subset of discriminative features has been selected, at step S120, the processor of the mobile device is configured by executing one or more software modules, including the classification module, to classify the user as a genuine user or an imposter user based on a classification score generated by the classification algorithm(s) from an analysis of the subset of discriminative features. One or more of the following classifiers are used as classification algorithms for step S120: Naïve Bayes classifier 1305, Support Vector Machine (SVM) classifier 1310, Multi-layer Perception classifier 1315, Random Forest classifier 1320, and Kernel Ridge Regression (KRR) 1325. The classification of subset of discriminative features results in the generation of a classification score 1330 for the user. This classification score is specific to the captured motion signals of the user. At step S120, the classification score 1330 can also be stored in the storage or database of the mobile device or a system server operatively connected to the mobile device via a network. In one or more embodiments, the classification score can be determined from via an analysis of one or more scores generated by each of the classification algorithms.

As discussed above with reference to FIG. 1 and FIGS. 12A-12B, steps S105-S120 can be performed in accordance with an enrollment stage and an authentication stage. Specifically, in the enrollment stage, motion sensor data of a particular user are collected by the user's mobile device. This motion sensor data is analyzed and processed to extract features (or characteristics) present in the data and to generate classification score 1330, which is later useable to authenticate the user in an authentication stage. For instance, in an authentication stage, steps S105-S120 can be performed again in order to determine, based on the classification score, whether the user is a genuine user or an imposter user.

As discussed above, the present methods can be implemented using one or more aspects of the present system as exemplified in FIG. 13. FIG. 13 discloses a high-level diagram of the present system 1400 for user recognition using motion sensor data in accordance with one or more embodiments. In some implementations, the system includes a cloud-based system server platform that communicates with fixed PC's, servers, and devices such as smartphones, tablets, and laptops operated by users. As the user attempts to access a networked environment that is access controlled, for example a website which requires a secure login, the user is prompted to authenticate using the user's mobile device. Authentication can then include verifying (authenticate) the user's identity based on the mobile sensor data captured by the mobile device.

In one arrangement, the system 1400 consists of a system server 1405 and user devices including a mobile device 1401a and a user computing device 1401b. The system 1400 can also include one or more remote computing devices 1402.

The system server 1405 can be practically any computing device and/or data processing apparatus capable of communicating with the user devices and remote computing devices and receiving, transmitting and storing electronic information and processing requests as further described herein. Similarly, the remote computing device 1402 can be practically any computing device and/or data processing apparatus capable of communicating with the system server and/or the user devices and receiving, transmitting and storing electronic information and processing requests as further described herein. It should also be understood that the system server and/or remote computing device can be a number of networked or cloud-based computing devices.

In one or more embodiments, the user devices—mobile device 1401a and user computing device 1401b—can be configured to communicate with one another, the system server 105 and/or remote computing device 102, transmitting electronic information thereto and receiving electronic information therefrom. The user devices can be configured capture and process motion signals from the user, for example, corresponding to one or more gestures (interactions) from a user 1424.

The mobile device 1401a can be any mobile computing devices and/or data processing apparatus capable of embodying the systems and/or methods described herein, including but not limited to a personal computer, tablet computer, personal digital assistant, mobile electronic device, cellular telephone or smart phone device and the like. The computing device 1401b is intended to represent various forms of computing devices that a user can interact with, such as workstations, a personal computer, laptop computer, access control devices or other appropriate digital computers.

It should be noted that while FIG. 13 depicts the system 1400 for user recognition with respect to a mobile device 1401a and a user computing device 1401b and a remote computing device 1402, it should be understood that any number of such devices can interact with the system in the manner described herein. It should also be noted that while FIG. 13 depicts a system 1400 for user recognition with respect to the user 1424, it should be understood that any number of users can interact with the system in the manner described herein.

It should be further understood that while the various computing devices and machines referenced herein, including but not limited to mobile device 1401a and system server 1405 and remote computing device 1402 are referred to herein as individual/single devices and/or machines, in certain implementations the referenced devices and machines, and their associated and/or accompanying operations, features, and/or functionalities can be combined or arranged or otherwise employed across a number of such devices and/or machines, such as over a network connection or wired connection, as is known to those of skill in the art.

It should also be understood that the exemplary systems and methods described herein in the context of the mobile device 1401a (also referred to as a smartphone) are not specifically limited to the mobile device and can be implemented using other enabled computing devices (e.g., the user computing device 1402b).

With reference now to FIG. 14A, mobile device 1401a of the system 1400, includes various hardware and software components that serve to enable operation of the system, including one or more processors 1410, a memory 1420, a microphone 1425, a display 1440, a camera 1445, an audio output 1455, a storage 1490 and a communication interface 1450. Processor 1410 serves to execute a client application in the form of software instructions that can be loaded into memory 1420. Processor 1410 can be a number of processors, a central processing unit CPU, a graphics processing unit GPU, a multi-processor core, or any other type of processor, depending on the particular implementation.

Preferably, the memory 1420 and/or the storage 1490 are accessible by the processor 1410, thereby enabling the processor to receive and execute instructions encoded in the memory and/or on the storage so as to cause the mobile device and its various hardware components to carry out operations for aspects of the systems and methods as will be described in greater detail below. Memory can be, for example, a random access memory (RAM) or any other suitable volatile or non-volatile computer readable storage medium. In addition, the memory can be fixed or removable. The storage 1490 can take various forms, depending on the particular implementation. For example, the storage can contain one or more components or devices such as a hard drive, a flash memory, a rewritable optical disk, a rewritable magnetic tape, or some combination of the above. Storage also can be fixed or removable.

One or more software modules 1430 are encoded in the storage 1490 and/or in the memory 1420. The software modules 1430 can comprise one or more software programs or applications having computer program code, or a set of instructions executed in the processor 1410. As depicted in FIG. 14B, preferably, included among the software modules 1430 is a user interface module 1470, a feature extraction module 1472, a feature selection module 1474, a classification module 1475 an enrollment module 1476, a database module 1478, a recognition module 1480 and a communication module 1482 that are executed by processor 1410. Such computer program code or instructions configure the processor 1410 to carry out operations of the systems and methods disclosed herein and can be written in any combination of one or more programming languages.

The program code can execute entirely on mobile device 1401a, as a stand-alone software package, partly on mobile device, partly on system server 1405, or entirely on system server or another remote computer/device. In the latter scenario, the remote computer can be connected to mobile device 1401a through any type of network, including a local area network (LAN) or a wide area network (WAN), mobile communications network, cellular network, or the connection can be made to an external computer (for example, through the Internet using an Internet Service Provider).

It can also be said that the program code of software modules 1430 and one or more computer readable storage devices (such as memory 1420 and/or storage 1490) form a computer program product that can be manufactured and/or distributed in accordance with the present invention, as is known to those of ordinary skill in the art.

It should be understood that in some illustrative embodiments, one or more of the software modules 1430 can be downloaded over a network to storage 1490 from another device or system via communication interface 1450 for use within the system 1400. In addition, it should be noted that other information and/or data relevant to the operation of the present systems and methods (such as database 1485) can also be stored on storage. Preferably, such information is stored on an encrypted data-store that is specifically allocated so as to securely store information collected or generated by the processor executing the secure authentication application. Preferably, encryption measures are used to store the information locally on the mobile device storage and transmit information to the system server 105. For example, such data can be encrypted using a 1024 bit polymorphic cipher, or, depending on the export controls, an AES 256 bit encryption method. Furthermore, encryption can be performed using remote key (seeds) or local keys (seeds). Alternative encryption methods can be used as would be understood by those skilled in the art, for example, SHA256.

In addition, data stored on the mobile device 1401a and/or system server 1405 can be encrypted using a user's motion sensor data or mobile device information as an encryption key. In some implementations, a combination of the foregoing can be used to create a complex unique key for the user that can be encrypted on the mobile device using Elliptic Curve Cryptography, preferably at least 384 bits in length. In addition, that key can be used to secure the user data stored on the mobile device and/or the system server.

Also, in one or more embodiments, a database 1485 is stored on storage 1490. As will be described in greater detail below, the database contains and/or maintains various data items and elements that are utilized throughout the various operations of the system and method 1400 for user recognition. The information stored in database can include but is not limited to user motion sensor data templates and profile information, as will be described in greater detail herein. It should be noted that although database is depicted as being configured locally to mobile device 1401a, in certain implementations the database and/or various of the data elements stored therein can, in addition or alternatively, be located remotely (such as on a remote device 1402 or system server 1405—not shown) and connected to mobile device through a network in a manner known to those of ordinary skill in the art.

A user interface 1415 is also operatively connected to the processor. The interface can be one or more input or output device(s) such as switch(es), button(s), key(s), a touch-screen, microphone, etc. as would be understood in the art of electronic computing devices. User interface 1415 serves to facilitate the capture of commands from the user such as an on-off commands or user information and settings related to operation of the system 1400 for user recognition. For example, in at least one embodiment, the interface 1415 can serves to facilitate the capture of certain information from the mobile device 1401a such as personal user information for enrolling with the system so as to create a user profile.

The computing device 1401a can also include a display 1440 which is also operatively connected to processor the processor 1410. The display includes a screen or any other such presentation device which enables the system to instruct or otherwise provide feedback to the user regarding the operation of the system 1400 for user recognition. By way of example, the display can be a digital display such as a dot matrix display or other 2-dimensional display.

By way of further example, the interface and the display can be integrated into a touch screen display. Accordingly, the display is also used to show a graphical user interface, which can display various data and provide “forms” that include fields that allow for the entry of information by the user. Touching the touch screen at locations corresponding to the display of a graphical user interface allows the person to interact with the device to enter data, change settings, control functions, etc. So, when the touch screen is touched, user interface communicates this change to processor, and settings can be changed, or user entered information can be captured and stored in the memory.

Mobile device 1401a also includes a camera 1445 capable of capturing digital images. The mobile device 1401a and/or the camera 1445 can also include one or more light or signal emitters (e.g., LEDs, not shown) for example, a visible light emitter and/or infrared light emitter and the like. The camera can be integrated into the mobile device, such as a front-facing camera or rear facing camera that incorporates a sensor, for example and without limitation a CCD or CMOS sensor. As would be understood by those in the art, camera 1445 can also include additional hardware such as lenses, light meters (e.g., lux meters) and other conventional hardware and software features that are useable to adjust image capture settings such as zoom, focus, aperture, exposure, shutter speed and the like. Alternatively, the camera can be external to the mobile device 1401a. The possible variations of the camera and light emitters would be understood by those skilled in the art. In addition, the mobile device can also include one or more microphones 1425 for capturing audio recordings as would be understood by those skilled in the art.

Audio output 1455 is also operatively connected to the processor 1410. Audio output can be any type of speaker system that is configured to play electronic audio files as would be understood by those skilled in the art. Audio output can be integrated into the mobile device 1401a or external to the mobile device 1401a.

Various hardware devices/sensors 1460 are also operatively connected to the processor. The sensors 1460 can include: an on-board clock to track time of day, etc.; a GPS enabled device to determine a location of the mobile device; Gravity magnetometer to detect the Earth's magnetic field to determine the 3-dimensional orientation of the mobile device; proximity sensors to detect a distance between the mobile device and other objects; RF radiation sensors to detect the RF radiation levels; and other such devices as would be understood by those skilled in the art.

As discussed above, the mobile device 1401a also comprises an accelerometer 1462 and a gyroscope 1464, which are configured to capture motion signals from the user 1424. In at least one embodiment, the accelerometer can also be configured to track the orientation and acceleration of the mobile device. The mobile device 1401a can be set (configured) to provide the accelerometer and gyroscope values to the processor 1410 executing the various software modules 1430, including the feature extraction module 1472, feature selection module 1474, and classification module 1475.

Communication interface 1450 is also operatively connected to the processor 1410 and can be any interface that enables communication between the mobile device 101a and external devices, machines and/or elements including system server 1405. Preferably, communication interface includes, but is not limited to, a modem, a Network Interface Card (NIC), an integrated network interface, a radio frequency transmitter/receiver (e.g., Bluetooth, cellular, NFC), a satellite communication transmitter/receiver, an infrared port, a USB connection, and/or any other such interfaces for connecting the mobile device to other computing devices and/or communication networks such as private networks and the Internet. Such connections can include a wired connection or a wireless connection (e.g. using the 802.11 standard) though it should be understood that communication interface can be practically any interface that enables communication to/from the mobile device.

At various points during the operation of the system 1400 for user recognition, the mobile device 1401a can communicate with one or more computing devices, such as system server 1405, user computing device 1401b and/or remote computing device 1402. Such computing devices transmit and/or receive data to/from mobile device 101a, thereby preferably initiating maintaining, and/or enhancing the operation of the system 1400, as will be described in greater detail below.

FIG. 14C is a block diagram illustrating an exemplary configuration of system server 1405. System server 1405 can include a processor 1510 which is operatively connected to various hardware and software components that serve to enable operation of the system 1400 for user recognition. The processor 1510 serves to execute instructions to perform various operations relating to user recognition as will be described in greater detail below. The processor 1510 can be a number of processors, a multi-processor core, or some other type of processor, depending on the particular implementation.

In certain implementations, a memory 1520 and/or a storage medium 290 are accessible by the processor 1510, thereby enabling the processor 1510 to receive and execute instructions stored on the memory 1520 and/or on the storage 1590. The memory 1520 can be, for example, a random access memory (RAM) or any other suitable volatile or non-volatile computer readable storage medium. In addition, the memory 1520 can be fixed or removable. The storage 1590 can take various forms, depending on the particular implementation. For example, the storage 1590 can contain one or more components or devices such as a hard drive, a flash memory, a rewritable optical disk, a rewritable magnetic tape, or some combination of the above. The storage 1590 also can be fixed or removable.

One or more of the software modules 1530 are encoded in the storage 1590 and/or in the memory 1520. One or more of the software modules 1530 can comprise one or more software programs or applications (collectively referred to as the “secure authentication server application”) having computer program code or a set of instructions executed in the processor 1510. Such computer program code or instructions for carrying out operations for aspects of the systems and methods disclosed herein can be written in any combination of one or more programming languages, as would be understood by those skilled in the art. The program code can execute entirely on the system server 1405 as a stand-alone software package, partly on the system server 1405 and partly on a remote computing device, such as a remote computing device 1402, mobile device 1401a and/or user computing device 1401b, or entirely on such remote computing devices. As depicted in FIG. 14B, preferably, included among the software modules 1530 are a feature selection module 1474, a classification module 1475 an enrollment module 1476, a database module 1478, a recognition module 1480 and a communication module 1482, that are executed by the system server's processor 1510.

Also preferably stored on the storage 1590 is a database 1580. As will be described in greater detail below, the database 1580 contains and/or maintains various data items and elements that are utilized throughout the various operations of the system 1400, including but not limited to, user profiles as will be described in greater detail herein. It should be noted that although the database 1580 is depicted as being configured locally to the computing device 1405, in certain implementations the database 1580 and/or various of the data elements stored therein can be stored on a computer readable memory or storage medium that is located remotely and connected to the system server 1405 through a network (not shown), in a manner known to those of ordinary skill in the art.

A communication interface 1550 is also operatively connected to the processor 1510. The communication interface 1550 can be any interface that enables communication between the system server 1405 and external devices, machines and/or elements. In certain implementations, the communication interface 1550 includes, but is not limited to, a modem, a Network Interface Card (NIC), an integrated network interface, a radio frequency transmitter/receiver (e.g., Bluetooth, cellular, NFC), a satellite communication transmitter/receiver, an infrared port, a USB connection, and/or any other such interfaces for connecting the computing device 1405 to other computing devices and/or communication networks, such as private networks and the Internet. Such connections can include a wired connection or a wireless connection (e.g., using the 802.11 standard) though it should be understood that communication interface 1550 can be practically any interface that enables communication to/from the processor 1510.

The operation of the system 1400 and its various elements and components can be further appreciated with reference to the methods for user recognition using motion sensor data as described above for FIGS. 1-12. The processes depicted herein are shown from the perspective of the mobile device 1401a and/or the system server 1405, however, it should be understood that the processes can be performed, in whole or in part, by the mobile device 1401a, the system server 1405 and/or other computing devices (e.g., remote computing device 1402 and/or user computing device 1401b) or any combination of the foregoing. It should be appreciated that more or fewer operations can be performed than shown in the figures and described herein. These operations can also be performed in a different order than those described herein. It should also be understood that one or more of the steps can be performed by the mobile device 1401a and/or on other computing devices (e.g. computing device 1401b, system server 1405 and remote computing device 1402).

The following discussion describes further exemplary approaches for analyzing motion signals for user recognition or authentication, in accordance with one or more disclosed embodiments, described here in the context of mobile devices with a motion sensor.

Specifically, in accordance with one or more embodiments, the captured motions signals can be partitioned into multiple “chunks” or segments, and then features of the user can be extracted from these chunks using one or more feature extraction algorithms as described throughout the present disclosure. In at least one embodiment, the feature extraction algorithms can include deep features extractors based on Convolutional Gated Recurrent Units (convGRU) and Convolutional Vision Transformers (CVT). In embodiments in which the motion signals are partitioned into chunks or segments, the feature extraction algorithms process the motion signal chunks instead of the entire captured motion signal (e.g., motion signals recorded during a complete authentication session). Additionally, in at least one embodiment, the feature selection algorithm used for selecting a subset of discriminative features from among the features extracted using the feature extraction algorithm is a principal component analysis (PCA) algorithm that has been trained on individual users such that it provides improved selection of discriminative features for use in identifying particular users.

Analysis Based on Shorter Signal Chunks

The systems and methods of the present application can easily recognize hand gestures or actions performed by the user in the time frame during which the motion signals are captured by the motion sensor(s) (e.g., gyroscope and accelerator). Thus, in certain embodiments, the present systems and methods recognizes or authenticates a user based on certain ample actions taken by the user.

In one or more embodiments, the present systems and methods further recognizes or authenticates a user with a greater focus on other user-specific patterns, such as a tremor of the hand, as compared with ample actions. More specifically, as described in further detail below, the present systems and methods diminishes the emphasis on features that correspond to ample gestures by partitioning the captured motion signal into multiple “chunks” or segments. In one or more embodiments, the motion signal can be partitioned during an authentication session. In one or more embodiments, the motion signal can be partitioned before the step of feature extraction in the present methods.

For example, FIGS. 15A-15B display block diagrams showing a computation flow 200 for verifying a user based on interaction with a mobile device measured through mobile sensors in accordance with at least one embodiment disclosed herein. In one or more embodiments, steps of the method exemplified in FIGS. 15A-15B that correspond to steps exemplified in FIGS. 11, 12A, and 12B are performed in a substantially the same fashion as the corresponding steps of FIGS. 11, 12A, and 12B described above, unless otherwise noted. As exemplified in FIG. 15A, at step S205 captured motion signals (e.g., motion signals captured via motion sensors, such as an accelerometer and/or a gyroscope) are partitioned into chunks, e.g. of 0.5 seconds. Depending on the length of the captured motion signal and the number of chunks, the resulting signal chunks may overlap. In at least one embodiment, the number of signal chunks can depend on the type of feature algorithms, for example, in certain implementations, 5 chunks can be used for CNNs, 10 chunks can be used for MFCC, and so on. This is to ensure an optimal trade-off between speed and accuracy, namely choosing less chunks implies higher time efficiency, and more chunks implies higher accuracy. In at least one embodiment, the number of chunks can be set to 5 for optimal trade-off for the shallow features (e.g. statistical features, HOG features, Markov Transition Matrix [MTM] features, etc.), and in such embodiments, after partitioning, feature extraction can be performed via Mel Frequency Cepstral Coefficients (MFCC) and histogram of oriented gradients (HOG), for example. Similarly, in at least one embodiment, the number of chunks can be set to 25 for an optimal analysis of deep features (e.g. those generated by CNN, convGRU or transformer models), and in such embodiments, after partitioning, feature extraction can be performed via Convolutional Neural Networks (CNN), Convolutional Gated Recurrent Units (convGRU), and/or Convolutional vision Transformers (CvT), for example, which rely on fast GPU processing. In other embodiments, the number of signal chunks or segments can vary. In one or more embodiment, the number of chunks can be set to a range of 5-40 chunks, 10-35 chunks, 15-30 chunks, 20-25 chunks, for example.

With reference now to FIG. 15B, upon dividing or partitioning the motion signal into chunks, at steps S210 the processor of the mobile device is configured by executing one or more software modules, including preferably the feature extraction module, to apply one or more feature extraction algorithms to each signal chunk. As such, the processor, applying the feature extraction algorithms, is configured to extract one or more respective sets of features from the partitioned chunks or segments. In one or more embodiments, each chunk is independently processed by each feature extraction algorithm in the module. As with the methods described in relation to FIGS. 11, 12A, 12B, in one or more embodiments, the feature extraction algorithms of step S210 in the module can comprise one or more of: statistical feature extraction algorithms, correlation feature extraction algorithms, Mel Frequency Cepstral Coefficients, Shifted Delta Cepstral, Histogram of Oriented Gradients, deep embeddings extracted with Convolutional Neural Networks (CNN), and Markov Transition Matrix. In at least one embodiment, the one or more feature extraction algorithms can include deep features extractors based on Convolutional Gated Recurrent Units (convGRU) and/or Convolutional Vision Transformers (CVT), as described in further detail below. The resulting sets of extracted features are then optionally combined via concatenation into one or more feature matrices, for efficient processing. In at least one embodiment, the resulting sets of features are combined via concatenation into two features matrices, keeping shallow and deep features apart from each other to accommodate for processing the shallow features on CPU and deep features on GPU, respectively. In at least one embodiment, the sets of extracted features can be combined into one or more feature matrices via processes other than concatenation.

At step S215, the resulting sets of features (e.g., in the form of respective feature matrices or combined feature matrices) are subject to feature selection, via Principal Component Analysis (PCA) for example, in a similar fashion as described above in relation to FIGS. 11, 12A, 12B (step S115) to identify the discriminative feature information resulting from extraction. At step S220, the system can then perform the classification of the discriminative feature information resulting from the feature selection step in a substantially the same manner as described above in relation to FIGS. 11, 12A, 12B (step S120). For example, classification can be performed via a classification algorithm that comprises a stacked generalization technique. The stacked generalization technique can utilize more than one of the following classifiers: (1) Naïve Bayes classifier, (2) Support Vector Machine (SVM) classifier, (3) Multi-layer Perception classifier, (4) Random Forest classifier, (5) and Kernel Ridge Regression (KRR).

Additional features of the present method related to step S210 (feature extraction algorithms including CVT and convGRU) and step S215 (a modified user-oriented PCA) are described in further detail below.

Features Based on Neural Transformers

As mentioned above, in one or more embodiments, the one or more feature extraction algorithms can include a Convolutional Vision Transformer (CVT) algorithm. For example, the Convolutional vision Transformer (CvT) is a neural architecture introduced in “Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. CvT: Introducing Convolutions to Vision Transformers. In Proceedings of ICCV, pages 22-31, 2021” which aims to jointly harness the ability of capturing local spatial information in images via convolutional layers, and the ability of capturing global information via self-attention layers. The main idea behind CvT is to divide the input image into patches and apply a convolutional backbone network to process these patches locally. This allows the model to capture spatial information and local patterns effectively, like CNNs. After the convolutional processing, the patches are reshaped into sequences and fed into a transformer architecture, employing global self-attention to capture long-range dependencies among tokens. A token is a vector of features obtained by projecting a local image patch through a linear projection neural layer. The CvT architecture typically consists of several blocks. In each block, the input image patches undergo a convolutional processing step, followed by reshaping into sequences and passing through a transformer encoder.

CvT has previously been trained and applied on natural images. However, in the presently disclosed methods and system, a CvT algorithm is adapted for motion signals. For instance, in one or more embodiments, the presently disclosed systems and methods reorganize the motion signals from the motion sensor(s) to resemble the spatial structure of an image using de Brujin sequences. In at least one implementation, the raw inputs from the motion sensor(s) measure the rates of change in velocity and rotation along the 3 axes. The neural network input is obtained by using the axes of the motion signals as different rows in a matrix. Moreover, different permutations of these axes are applied via de Brujin sequences, yielding a final matrix of 42 rows, for example. The resulting matrix, which resembles an image, is given as input to the CvT.

In accordance with one or more embodiments of the presently disclosed systems and methods, a number of CvT models can be trained in a supervised manner on a data set consisting of motions signals from multiple users, and the training task is to classify the signals according to the user IDs. In at least one embodiment, the data set can be privately owned. After training, the CvT models are used as global feature extraction algorithms. The final embedding given by a CvT model for an input signal is the representation of the [CLS] token, taken just before the classification layer. The [CLS] token, also known as the class token, is a special token among the set of tokens processed by the transformer, which retains features that are useful for classification tasks.

In a series of experiments, the training of CvT models was performed for 30 epochs using an Adam optimizer with a learning rate of 5e⁻⁴. In these experiments, each CvT instance had three stages of different depths: the depth of the first stage was 1, the depth of the second stage was 2, and the depth of the third stage was 10. The first stage used patches of size 7×7, while the rest of the stages used patches of size 3×3.

Features Based on Convolutional Gated Recurrent Units

In one or more embodiments of the presently disclosed systems and methods, the one or more feature extraction algorithms can include a convolutional gated recurrent units (convGRU) model. The convGRU network generally includes one GRU layer, followed by two or three convolutional-pooling blocks, two or three fully connected layers, and a Softmax classification layer, in this order. In certain embodiments, some layers can be repeated more than two times. GRUs are a variant of recurrent neural networks (RNNs) that are designed to use gating mechanisms. The capability of recurrent neural networks (RNNs) to capture temporal relationships between successive events makes them suitable candidates for feature extraction in motion signal applications. By incorporating feedback connections or memory cells between consecutive computations, RNNs can retain nonlinear patterns that best characterize the behavior of the users.

One drawback of RNNs is the vanishing gradient problem, which occurs when gradients become equal to zero due to the high number of recurrent operations, where each operation requires the multiplication of gradients via the chain rule. To avoid the vanishing gradient problem, which prevents RNNs from capturing long-term dependencies in sequences, this particular feature extraction algorithm (convGRU) integrates a specific kind of recurrent layer known as Gated Recurrent Units (GRUs). GRUs are a variant of RNNs which are designed to use gating mechanisms. They have two gates, namely the “update” gate and the “reset” gate. The role of the update gate is to decide what information to eliminate and what information to retain, while the role of the reset gate is to decide how much of the past information to “forget”.

In one or more embodiments of the present systems and methods, one or more convGRU models are trained and used as a feature extraction algorithm, following a substantially similar method for training as carried out for the CNN-based and transformer-based feature extractors, as previously described herein.

However, in at least one embodiment, a model selection procedure is introduced in which multiple models (e.g., convGRU models) are trained, but only the best performing ones are kept as feature extractors. An exemplary convGRU architecture is detailed in Table 1. In this example, after the training stage, the two dropout layers, the classification layer and the Softmax layer were removed from the kept models. The output of the last remaining layer in each convGRU is then used as a feature vector. Since there are 5 neural networks in total, the resulting number of features is 500.

convGRU Example

A series of experiments were performed on a dataset formed of motion signals collected from five mobile devices (smartphones). Five volunteers performed multiple explicit authentications, while varying the body pose during these authentications (e.g., stand up, sit down, using the right or the left hand), on each of the five smartphones. This manner of collecting the motion signal data generates an evaluation scenario where there is one genuine user and four imposters for each smartphone.

20 convGRU models were trained to solve the multi-way classification task of signals. Each convGRU network was trained with mini-batches of 80 signal chunks for 30 epochs, using a learning rate of 0.001 and the Softmax loss. Out of the total number of trained convGRU neural networks, the top 5 models were kept for the feature extraction. The described selection procedure brings performance gains that are higher than 2% in terms of accuracy, compared with a standard procedure based on training 5 models without selection.

TABLE 1

Proposed convGRU architecture. The final two layers are removed during

inference.

Input
Number of
Kernel

Output

Index
Type
Name
shape
filters
shape
Stride
Padding
shape
Details

0
GRU
GRU
(*, *, 6)
—

(*, 100,

100)

1
Dropout
Dropout
(*, 100,
—

(*, 100,
0.5

1
100)

100)

2
Reshape
Reshape
(*, 100,
—

(*, 100,

100)

100, 1)

3
Conv2D
Conv1
(*, 100,
16
3 × 3
1

(*, 100,

100, 1)

100, 16)

4
MaxPooling2D
MaxPool1
(*, 100,
—

Valid
(*, 50,

100, 16)

50, 16)

5
Conv2D
Conv2
(*, 50,
32
3 × 3
1

(*, 50,

50, 16)

50, 32)

6
MaxPooling2D
MaxPool2
(*, 50,
—

Valid
(*, 25,

50, 32)

25, 32)

7
Dropout
Dropout
(*, 25,
—

(*, 25,
0.5

2
25, 32)

25, 32)

8
Flatten
Flatten
(*, 25,
—

(*,

25, 32)

20000)

9
Dense
Embedding
(*,
—

(*, 100)

20000)

10
Dense
Classification
(*, 100)
—

(*, 5)

11
Softmax
Softmax
5

1

User-Oriented PCA

As mentioned above, in one or more embodiments, the feature selection algorithm comprises a Principal Component Analysis. In one or more embodiments, as described above, the PCA model is trained globally, for all users, to select discriminative features.

In at least one embodiment, however, the PCA model is trained on each individual user (a “user-oriented PCA”), thus selecting user-oriented features during the feature selection stage. As discussed above, in at least one embodiment, the number of signal chunks is different for shallow and deep features, because the features are processed in parallel, on CPU and GPU, to achieve faster inference. Thus, in embodiments in which the motion signals are partitioned into chunks and the PCA is a user-oriented PCA, due to the variable number of signal chunks, the user-oriented PCA can be trained using two separate PCA models, one for shallow (engineered) features and one for deep features.

In at least one embodiment, a user triggers a retraining of PCA models after a pre-established number of authentications, for example 10. Additionally, with each PCA retraining session, the number of principal components can be increased with the number of available samples. Once a sufficiently large number of data samples (signals) is reached, the number of principal components can stay fixed, since there is no further performance (accuracy) benefit. Alternatively, the number of PCA components can be fixed from the beginning. In Table 2, an example illustrates how the number of PCA components can be increased based on a pre-established schedule.

TABLE 2

Potential schedule for increasing the number of principal

components of PCA models trained on individual users.

Number of
Number of

Number of principal

motion signals
signal chunks
Dataset size
components

10
50
50 × 100
50

20
100
100 × 100
100

30
150
150 × 100
150

40
200
200 × 100
200

50
250
250 × 100
250

60
300
300 × 100
250

At this juncture, it should be noted that although much of the foregoing description has been directed to systems and methods for user recognition using motion sensor data, the systems and methods disclosed herein can be similarly deployed and/or implemented in scenarios, situations, and settings beyond the referenced scenarios.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any implementation or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular implementations. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be noted that use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements. Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having,” “containing,” “involving,” and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. It is to be understood that like numerals in the drawings represent like elements through the several figures, and that not all components and/or steps described and illustrated with reference to the figures are required for all embodiments or arrangements.

Thus, illustrative embodiments and arrangements of the present systems and methods provide a computer implemented method, computer system, and computer program product for user recognition using motion sensor data. The flowchart and block diagrams in the FIGS. illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments and arrangements. In this regard, each block in the flowchart or block diagrams can represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The subject matter described above is provided by way of illustration only and should not be construed as limiting. Various modifications and changes can be made to the subject matter described herein without following the example embodiments and applications illustrated and described, and without departing from the true spirit and scope of the present invention, which is set forth in the following claims.

	Number	Date	Country
	62644125	Mar 2018	US
	62652114	Apr 2018	US

	Number	Date	Country
Parent	16356399	Mar 2019	US
Child	18335748		US

	Number	Date	Country
Parent	18335748	Jun 2023	US
Child	18524878		US

SYSTEM AND METHOD FOR USER RECOGNITION USING MOTION SENSOR DATA

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims

CROSS-REFERENCE TO RELATED APPLICATIONS

Provisional Applications (2)

Continuations (1)

Continuation in Parts (1)