The present invention relates to a system, method and non-transitory computer-readable medium for predicting a destination of a user based on user behavior learning.
Being able to predict a destination of a user in a vehicle, e.g. for smart navigation and route information notifications, is an important aspect of vehicle and mobile device applications that provide services to users. The destination can be predicted accurately by learning user behavior.
The recommender engine of an industrial prediction system usually requires a variety of models to extract different aspects of both user demographics and context features. User demographics include information such as gender, income, affordability, and category preferences, all inferred from user behaviors in an attempt to capture both long-term and short-term user interests. Context features, by contrast, are normally represented by temporal and spatial information such as when and where a user takes an action. The engine then builds a separate model for each recommendation scenario using the extracted features, whether continuous or categorical. These are multi-step tasks that are difficult to optimize jointly.
Since user trip data is a sequence of visited locations ordered by timestamps, a Recurrent Neural Network (RNN) is commonly applied to model user behaviors. An RNN is a class of deep neural network whose connections form a directed graph along a temporal sequence. Such an architecture enables dynamic behavior modeling of sequential data input such as trip data, but the process is slow and difficult to parallelize.
In contextual and personalized services such as predicted trip destination, the context, e.g. temporal and spatial information, is expected to be modeled together with user behavior before building a prediction model. The present invention provides a low-complexity algorithmic framework for destination prediction based on user mobility behavior learning, aiming to enable a scalable service. The framework can also capture user behavior from, and be extended to, other domains.
An RNN is the most commonly used sequence model, encoding user behaviors in temporal sequential order. However, the RNN has a number of disadvantages. First, an RNN is hard to parallelize in the prediction phase, which is a problem for a scalable service requiring fast processing times under extensive requests from clients. Second, the RNN embedding of the user behaviors is a fixed-size, aggregated state, which is not well suited for modeling both long and short behavior sequences. An RNN can easily fail to preserve specific behavior information when used in downstream applications.
Regarding context modeling, the context information is normally processed into a key-value pair first, such as (‘weather’: ‘sunny’) or (‘day of week’: ‘Monday’), and then converted into categorical variables. Although such modeling can support targeted downstream tasks with a rule-based approach, it is very difficult to quantitatively analyze the context feature itself in a statistical modeling approach due to the discrete values of the data. Therefore, the learned features are hardly reusable in other applications.
Regarding representation learning methods, attention-based frameworks have been proposed, e.g., “ATRank” for an online purchase recommendation system. The results show the potential of adapting it to predicted destination. See, e.g., https://arxiv.org/pdf/1711.06632.pdf. However, there are problems associated with this framework. Because the order of user behaviors is not invariable, collecting the data sequentially makes the computation very time-consuming, especially for a large-scale, periodically updated data feed in a commercial service. Also, the temporal information only contains the timestamp relative to the present, so the corresponding contextual information is not fully leveraged.
In the present invention, we provide an advanced algorithmic framework for predicted destination that aims, given certain contextual information, to predict the most probable destination considering both downstream performance and algorithm scalability. This framework has low complexity while modeling the rich semantics of both the context and content information recorded along the user's trip history.
Other objects, advantages and novel features of the present invention will become apparent from the following detailed description of one or more preferred embodiments when considered in conjunction with the accompanying drawings.
An objective of the present invention is to provide a predicted destination to a user for any given time and any departure location whenever it is requested by the user. According to exemplary embodiments of the present invention, the algorithmic framework can be used to provide a time-to-leave feature that reminds a user to leave at the right time in view of real-time traffic changes, smart preconditioning that reminds the user to precondition a vehicle or automatically preconditions the vehicle before the trip (e.g., heating/cooling), and smart search and navigation in which one or more destinations are recommended to the user when the user starts to drive based on the user's context.
At a high level, the system processes the data into features organized by a feature processing layer, a context modeling layer, a user modeling layer, and similarity measurement layers. The system is trainable based on the loss (error) defined by a prediction task, e.g., a predicted destination.
As illustrated in
The feature processing layer is further described below. A user trip is defined as visiting a certain location in a given temporal context. All visited locations L and the temporal context C are modeled to construct the feature processing layer, which consists of the raw input. Here, we further decompose the temporal context C into two features: day of week D and hour of day H.
In order to encode spatial and temporal information, representation learning is applied to generate an “embedding” vector for each type of object. An embedding is a mapping of a discrete categorical variable to a vector of continuous numbers; the semantic meaning of objects such as locations, hour of day, and day of week can thereby be enriched. Normally, embeddings are trained in a data-driven framework to preserve the semantic meaning of the objects. Here we embed the features of location set L, day of week set D, and hour of day set H as follows:
E({L}) = [[L1,1, L1,2, ..., L1,S], ..., [LQ,1, LQ,2, ..., LQ,S]]
E({D}) = [[D1,1, D1,2, ..., D1,S], ..., [D7,1, D7,2, ..., D7,S]]
E({H}) = [[H1,1, H1,2, ..., H1,S], ..., [H24,1, H24,2, ..., H24,S]]
where S is the pre-defined feature size of the embedding vectors and Q is the number of locations.
Therefore, we can encode any trip (l, d, h) in which the user visited the l-th location on the d-th day of the week at the h-th hour of the day.
E(l, d, h) = (lookup_l(E({L})), lookup_d(E({D})), lookup_h(E({H})))
where lookup_i(E) is an operation that extracts the i-th row from the embedding matrix E; the extracted vector is of size (S, 1).
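As an illustration, the embedding matrices and the lookup operation can be sketched in a few lines of NumPy. The sizes S and Q and the random matrices here are hypothetical stand-ins, not trained embeddings:

```python
import numpy as np

# Hypothetical sizes: S-dimensional embeddings, Q candidate locations.
S, Q = 8, 100
rng = np.random.default_rng(0)

# Embedding matrices E({L}), E({D}), E({H}): one row per object.
E_L = rng.normal(size=(Q, S))   # locations
E_D = rng.normal(size=(7, S))   # days of week
E_H = rng.normal(size=(24, S))  # hours of day

def lookup(E, i):
    """Extract the i-th row of embedding matrix E as an (S, 1) column vector."""
    return E[i].reshape(-1, 1)

# Encode a trip (l, d, h): e.g. location 42, Monday (index 0), 9 a.m.
e_l, e_d, e_h = lookup(E_L, 42), lookup(E_D, 0), lookup(E_H, 9)
print(e_l.shape)  # (8, 1)
```

In a trained system these matrices would be learned jointly with the rest of the network rather than drawn at random.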
In the context modeling layer, our goal is, given the processed contextual information E(l, d, h) of a trip r, to construct its semantic embedding. We let E_i = lookup_i(E) denote the i-th-row lookup operation on embedding matrix E, and let l, d, h be the lookup indices of the location, day of week, and hour of day, respectively. Here we introduce the proposed embedding modeling as shown in
The detailed calculation of the behavior embedding for trip r is as follows:
r = (concatenate_axis=1(E(L_l), E(D_d), E(H_h)) × w + b)
where concatenate_axis=1( ) is an operation that concatenates the three (S, 1)-size vectors along the second axis to generate an (S, 3)-size matrix, and w and b are linear transformation parameters that need to be trained; the transformation maps the (S, 3)-size matrix back to an (S, 1)-size behavior embedding r.
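The behavior embedding of a single trip can be sketched as follows; the parameter shapes are assumptions consistent with the (S, 3) → (S, 1) mapping above, and the random values stand in for trained parameters:

```python
import numpy as np

S = 8
rng = np.random.default_rng(1)

# Assume e_l, e_d, e_h are the three (S, 1) lookup vectors for one trip.
e_l = rng.normal(size=(S, 1))
e_d = rng.normal(size=(S, 1))
e_h = rng.normal(size=(S, 1))

# Trainable linear-transformation parameters (random stand-ins here):
# w collapses the 3 context channels, b is a per-feature bias.
w = rng.normal(size=(3, 1))
b = rng.normal(size=(S, 1))

# concatenate along axis 1 -> (S, 3), then project back to (S, 1).
r = np.concatenate([e_l, e_d, e_h], axis=1) @ w + b
print(r.shape)  # (8, 1)
```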
The user modeling layer is further described below. Given the user's trip records r from context modeling, we have a user trip sequence R_seq that consists of a sequence of user trip data ordered by timestamps. Our target is to learn the temporal pattern and regularity from the context data embedded in the features, e.g. day of week and hour of day, rather than from the sequential order in time. Thus, we are able to precompute the user behavior offline and reuse it for prediction purposes, which reduces the memory and computation cost.
Assuming the user has T trips, we concatenate all r along axis t to generate an (S, T)-size matrix, i.e., R_seq = (r1 r2 ... rT)_t. R_seq requires sequential data collection due to its permutation variance: for example, (r1 r2 ...)_t ≠ (r2 r1 ...)_t if we swap any two positions of the behavior records.
Unlike conventional sequential modeling methods, we give all possible permutations to the network so that it learns a permutation-invariant pattern, i.e., f(r1 r2 ... rT) = f(π(r1 r2 ... rT)) for any permutation π, which reduces computation cost. In practice, such learning can be done through the proposed attention-based network and completely offline. Therefore, the user trip history R_hist can be generated as the set of all permutations of the elements of R_seq, i.e., R_hist = {π(R_seq)}, where π ranges over all possible permutations. In practice, this operation can be simplified through shuffling.
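The shuffling simplification can be sketched as below. The mean over trips is only a toy stand-in for the attention network, used to make the permutation-invariance property concrete:

```python
import numpy as np

S, T = 8, 5
rng = np.random.default_rng(2)

# R_seq: T trip embeddings r_1..r_T stacked as columns of an (S, T) matrix.
R_seq = rng.normal(size=(S, T))

# Rather than enumerating all T! permutations, shuffle the column order
# each time the history is fed to the network.
perm = rng.permutation(T)
R_hist = R_seq[:, perm]

# A permutation-invariant summary f (here a simple mean over trips)
# yields the same result for any column order.
f = lambda R: R.mean(axis=1)
assert np.allclose(f(R_seq), f(R_hist))
```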
Meanwhile, we model the target trip in the same way as a trip record r. The contextual information of the target trip goes through all the aforementioned layers to generate a target trip embedding r_tgt.
r_tgt = (concatenate_axis=1(E(L_l), E(D_d), E(H_h)) × w + b)
Given the user trip history R_hist that is of size (S, T), we can apply a self-attention-mechanism based network, such as the one developed by Google in 2017 (https://arxiv.org/abs/1706.03762), to model the user embedding U. Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. The calculation of user embedding U is simplified as follows:

U = Attention(Q, K, V) = softmax(QK^T/√S) V

where Q, K, V represent the query, key, and value, respectively, which are concepts used in the attention mechanism. Here Q = K = V = R_hist. After the calculation, the output is an (S, T)-size matrix that represents a personal embedding based on the user's trip history.
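A minimal NumPy sketch of scaled dot-product self-attention over the trip history follows. It is written for column-oriented embeddings (each of the T columns is one trip), with random data standing in for a real history, and omits the learned projections of the full mechanism:

```python
import numpy as np

S, T = 8, 5
rng = np.random.default_rng(3)

# R_hist holds T trip embeddings as columns of an (S, T) matrix.
R_hist = rng.normal(size=(S, T))
Q = K = V = R_hist  # self-attention: query, key, value all from the history

# Scaled dot-product attention: (T, T) score matrix, (S, T) output.
scores = K.T @ Q / np.sqrt(S)
weights = np.exp(scores - scores.max(axis=0))  # numerically stable softmax
weights = weights / weights.sum(axis=0)        # each column sums to 1
U = V @ weights
print(U.shape)  # (8, 5)
```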
The similarity measurement layer is further described below. Given the modeled user embedding U and the target trip embedding r_tgt, we can use two simple one-dense-layer neural networks to map the two embeddings into a common semantic space and then compute the similarity sim(U, r_tgt). The outputs of the two neural networks are the following:
Z_u = ReLU(U × w + b)
Z_r = ReLU(r_tgt × w + b)
sim(U, r_tgt) = dist(Z_u, Z_r)
where ReLU is an activation function defined as the positive part of its argument, ReLU(x) = max(0, x); dist( ) represents a distance measurement such as the Euclidean distance; and w and b are the parameters that need to be trained. Therefore, we predict the target trip r̃_tgt by choosing the highest similarity score among all candidates.
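The candidate-selection step can be sketched as follows. The parameter names (w_u, b_u, w_r, b_r) and their shapes are assumptions for illustration, random values stand in for trained parameters, and negative Euclidean distance is used so that a higher score means more similar:

```python
import numpy as np

S, T, n_candidates = 8, 5, 10
rng = np.random.default_rng(4)

relu = lambda x: np.maximum(0.0, x)

# Assume U (user embedding, (S, T)) and candidate target-trip embeddings
# ((S, 1) each) are given; parameters below are random stand-ins.
U = rng.normal(size=(S, T))
candidates = rng.normal(size=(n_candidates, S, 1))
w_u, b_u = rng.normal(size=(T, 1)), rng.normal(size=(S, 1))
w_r, b_r = rng.normal(size=(1, 1)), rng.normal(size=(S, 1))

# Map the user embedding into the common semantic space, (S, 1).
Z_u = relu(U @ w_u + b_u)

def similarity(r_tgt):
    Z_r = relu(r_tgt @ w_r + b_r)      # map candidate into the same space
    return -np.linalg.norm(Z_u - Z_r)  # negative Euclidean distance

# Predict by choosing the candidate with the highest similarity score.
best = max(range(n_candidates), key=lambda i: similarity(candidates[i]))
```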
We explored the deployment of the proposed model on a trip pattern prediction task that predicts which location a user will visit at a certain time given his/her trip history. The dataset includes user location tracking, including driving. Raw features include, for example, the following: <user ID, location_gps_grid_ID, timestamp>, covering 100 users and 778 locations obtained through a 200 m×200 m grid by map segmentation, over a 20-week period.
We assume we have user trip records for week w that include the following:

I_w = {(visit location i_0 at time t_0), ..., (visit location i_T at time t_T)}, t ∈ w,

where we aim to predict I_{w+1}. We use the first 19 weeks as the training data, where the data contains both the location i and the timestamp t of each visit, and use the last week as the test set.
We applied 1-best matching accuracy, which is widely used in recommendation systems, to measure the performance. Meanwhile, the number of parameters and the prediction time were reported to indicate scalability. The following table shows a performance comparison between the aforementioned prior art model “ATRank” and different kernels of the proposed layers with regard to prediction accuracy and prediction time.
The results show that the algorithmic framework and model according to the present invention achieve better prediction performance with outstanding operation/computation efficiency.
In another exemplary embodiment of the present invention, a non-transitory computer-readable medium is encoded with a computer program that performs the above-described method. Common forms of non-transitory computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other medium from which a computer can read.
The present invention provides a number of significant advantages over conventional systems and methods. As described above, the present invention provides a context-aware learning framework for predicted destination. We are able to leverage context information to model user trip patterns. Not only does the final prediction achieve improved performance, but the intermediate outputs, such as the object embeddings and the user embedding, can also serve as critical features for other downstream tasks, e.g., segmentation.
The algorithmic framework according to the present invention has low complexity. Instead of jointly modeling history trips and a target trip, we are able to separate them, pre-calculate and store the user embedding offline, and estimate only the target trip online. This dramatically decreases the complexity, as the online target trip calculation is much cheaper than the user embedding calculation, which has to go through all historical trips in the database.
The present invention also provides rich semantic modeling. By using embeddings, we expand the capabilities of previous Natural Language Processing (NLP) methods by creating contextual representations based on the surrounding context, which leads to richer semantic models. Further, the algorithm according to the present invention outperforms conventional models with both higher prediction performance and much lower computation cost.
The foregoing disclosure has been set forth merely to illustrate the invention and is not intended to be limiting. Since modifications of the disclosed embodiments incorporating the spirit and substance of the invention may occur to persons skilled in the art, the invention should be construed to include everything within the scope of the appended claims and equivalents thereof.