The present application relates to the technical field of anomaly recognition of tracks, and in particular to an anomaly recognition method and an anomaly recognition system for tracks of trucks, which are used for loan-oriented risk management.
With the development of global positioning, cloud computing and other technologies, a large amount of track data with spatiotemporal location information can be collected, stored and calculated, such that anomaly detection based on big track data has become a hot issue, and scholars at home and abroad have carried out research on it. Traditional track anomaly detection methods include outlier detection based on distances between objects and anomaly detection based on similarity calculation of historical tracks. These traditional techniques often ignore the time-dimension features of tracks, which makes it difficult to dynamically evaluate users' abnormal behavior in post-loan monitoring. With the development of machine learning, some abnormal track recognition methods based on classification or clustering algorithms have emerged, but these methods still cannot take the spatiotemporal correlation of tracks into account. Moreover, they depend on feature engineering to a great extent, which places high requirements on expert experience or experiments.
A global positioning system (GPS) pre-mounted on a truck can collect information such as latitude and longitude coordinates, time stamps, instantaneous speeds and directions of the truck at certain time intervals. A large number of interrelated track points constitute a vehicle track sequence. How to mine track features from such multi-dimensional spatiotemporal sequences and express same in the form of structured data is the key problem to recognize abnormal tracks.
In addition, the GPS track of the truck often has the characteristics of a wide moving range, a skewed distribution, a large data scale and a fast update speed. Different from private cars or taxis, trucks tend to move around the country, and existing models struggle to represent a nationwide high-density track map by means of a network model. Moreover, the moving tracks of the trucks usually have skewed distribution features such as periodicity and uneven distribution. Commercial vehicles also operate over long periods, which leads to a large data scale and a fast update speed and places high requirements on the space complexity and time complexity of an algorithm.
In view of this, the present application puts forward an anomaly recognition method and an anomaly recognition system for tracks of trucks, so as to adapt to the complexity requirements of a large data size and a fast update speed and improve universality of the method.
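According to one aspect of the present application, an anomaly recognition method for tracks of trucks is provided. The method includes:
step S1) obtaining a running track T according to GPS data of running of a truck to be recognized;
step S2) employing a track compression algorithm for the running track T to obtain a compressed track set C;
step S3) employing a density-based clustering algorithm and performing grouping according to set time periods to obtain a network graph G representing a motion track of each time period;
step S4) inputting the network graph G into a track embedding model established and trained in advance to obtain an explicit embedding vector corresponding to each network graph; and
step S5) determining stability according to a distance between the vectors, and classifying points with the stability lower than a set threshold as abnormal tracks.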
The track embedding model is realized by employing a Skipgram model on the basis of a graph2vec algorithm.
Optionally, the running track T in the step S1) satisfies the following formulas: T=[P1, P2, . . . , Pn, . . . , PN], and Pn=[xn, yn, tn, vn].
In the formulas, N represents the total number of points of the motion track, and Pn represents data of the nth point, which includes four dimensions xn, yn, tn, vn, namely longitude, latitude, time and an instantaneous speed respectively.
Optionally, the step S2) specifically includes:
Optionally, the step S3) specifically includes:
Optionally, a processing process of the track embedding model in the step S4) specifically includes:
Optionally, the extracting a rooted subgraph of each node from the network graph G specifically includes:
Optionally, the Skipgram model includes an input layer, a hidden layer and an output layer, where the output layer is a softmax regression classifier, an input of the Skipgram model is a subgraph of each node of a network graph G, and the output is probability distribution of a subgraph set, so as to obtain an embedding vector of the corresponding network graph G.
Optionally, the step S5) specifically includes:
Optionally, the method further includes a training step of the track embedding model, specifically:
According to another aspect of the present application, an anomaly recognition system for tracks of trucks is provided. The system includes: a running track obtainment module, a compression module, a clustering algorithm module, a vector output module and an anomaly recognition module.
The running track obtainment module is used for obtaining a running track T according to GPS data of running of a truck to be recognized.
The compression module is used for employing a track compression algorithm for the running track T to obtain a compressed track set C.
The clustering algorithm module is used for employing a density-based clustering algorithm and performing grouping according to set time periods to obtain a network graph G representing a motion track of each time period.
The vector output module is used for inputting the network graph G into a track embedding model established and trained in advance to obtain an explicit embedding vector corresponding to each network graph.
The anomaly recognition module is used for determining stability according to a distance between the vectors, and classifying points with the stability lower than a set threshold as abnormal tracks.
The track embedding model is realized by employing a Skipgram model on the basis of a graph2vec algorithm.
According to the technical solutions of the present application, an anomaly recognition model based on graph representation learning for tracks of trucks is provided. The model can transform a large number of spatiotemporal track sequences into track network graphs, and embed the track network as vectors by means of neural network training, quantify the stability of tracks by means of vector calculation, and recognize abnormal tracks by setting the stability threshold.
According to the technical solutions of the present application, the method has strong robustness to non-uniform and noisy samples, and meanwhile, the network can be simplified by means of track compression and track clustering, such that the operation efficiency of the algorithm is improved.
According to the present application, the complex track network structure is learned into the vector which can be expressed by using structured data, which provides possibility for a subsequent track analysis method.
According to the present application, the spatiotemporal correlation of the track is considered, and the track sequences having periodic features can be better processed.
Experiments are performed on a real commercial truck loan dataset in order to verify effectiveness of the model.
Additional features and advantages of the present application will be described in detail in the following detailed description of embodiments.
The accompanying drawings constituting a part of the present application serve to provide a further understanding of the present application, and the illustrative embodiments of the present application and the description thereof serve to interpret the present application. In the accompanying drawings:
The technical solutions provide an anomaly recognition method based on graph representation learning for tracks.
An objective is to recognize abnormal tracks of trucks. The key problem is to represent the track sequence containing spatiotemporal features as feature vectors, and therefore, a graph2vec algorithm is employed to perform representation learning on the tracks. The idea is to divide the tracks of a user according to a fixed period, represent the track of each period as a vector, calculate the stability according to a distance between the vectors, and classify points with the stability lower than a certain threshold as abnormal tracks.
In order to transform a spatiotemporal track network into a graph structure, points in a track sequence are clustered as nodes of a graph. The density-based spatial clustering of applications with noise (DBSCAN) method is well suited to clustering two-dimensional latitude and longitude coordinates, but the DBSCAN algorithm has relatively high space complexity and has difficulty processing massive track data. Therefore, it is necessary to compress the tracks before clustering.
The technical solutions of the present application will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
As shown in
In order to reduce time complexity of an algorithm, the present application employs a Douglas-Peucker track compression algorithm to reduce a density of tracks and retain key nodes. The algorithm of DBSCAN is employed to classify dense points in the track into a cluster, which makes the model have better anti-interference performance. After clustering, a track network graph of a user is formed according to a time sequence.
Step S1) specifically includes:
An input of the method is vehicle-mounted GPS data, and T is employed to represent a spatiotemporal track sequence of a certain user. The total number of points of the track is marked as N, that is, T=[P1, P2, . . . , Pn, . . . , PN], where n represents the order in which the point appears in the time sequence, and each point in the sequence has four dimensions, namely longitude, latitude, time and an instantaneous speed respectively, that is, Pn=[xn, yn, tn, vn]. For example, the track of a year is divided into 12 sections according to the month, which correspond to 12 graphs.
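The point layout and the monthly division described above can be sketched as follows. This is only an illustrative sketch: the names `TrackPoint` and `split_by_month` are hypothetical, and the time dimension tn is assumed to be a Unix timestamp in seconds interpreted in UTC, which the present application does not specify.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import NamedTuple


class TrackPoint(NamedTuple):
    """One point Pn = [xn, yn, tn, vn] of the track T."""
    x: float  # longitude
    y: float  # latitude
    t: float  # time, assumed here to be a Unix timestamp in seconds
    v: float  # instantaneous speed


def split_by_month(track):
    """Group a track into sub-tracks keyed by (year, month).

    The order of points inside each group follows the input order,
    so a time-ordered track yields time-ordered sub-tracks.
    """
    groups = defaultdict(list)
    for point in track:
        stamp = datetime.fromtimestamp(point.t, tz=timezone.utc)
        groups[(stamp.year, stamp.month)].append(point)
    return dict(groups)
```

Each value of the returned dictionary is the sub-track of one period, which is later turned into one network graph.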
Step S2) in which the track compression algorithm is employed specifically includes:
In practice, track compression may generally employ methods such as time interval point selection, distance spacing point selection or speed-based point selection, but these methods may lose some key data. In order to better retain basic features and reduce algorithm complexity as much as possible, the present application employs a classical Douglas-Peucker track compression algorithm. The algorithm can extract some prominent points from the original dense points, and the track connected by these points is roughly similar to an original track outline, so as to realize the function of replacing the original track.
In order to input the tracks of the trucks into the track embedding model for training, the spatiotemporal track sequence T needs to be transformed into a graph G, G=(V,E), where V represents a set of nodes, and E represents a set of edges. The method to determine the set V of nodes is to cluster track points on the whole track, and regard a cluster of points as a node. However, due to a high collection frequency and a large sample size of the GPS data of the trucks, in order to improve the operation efficiency of the algorithm, the points on the track can be thinned first by using the track compression method, and then, the nodes can be determined by the clustering method.
In the present application, for the track T=[P1, P2, . . . , Pn, . . . , PN], track compression includes the following steps:
(1) Defining variables and parameters: setting a distance threshold D, defining a compressed track set C, and adding the two endpoints P1 and PN of the track into the set C.
(2) Finding a point of division: traversing all points in the track T to find the point Pc farthest from the line segment P1PN and the corresponding maximum distance d, and if d>D, adding Pc into the set C.
(3) Performing a recursive loop: dividing the original track into two segments at the point Pc, taking Pc as an endpoint of each segment, and repeating step (2) on the two segments until the maximum distance d in all sub-tracks is not greater than the distance threshold D.
By means of the above steps, the compressed track set C of the track T can be obtained, and a compression rate depends on a parameter distance threshold D. The smaller D is, the more original data is retained. The larger D is, the smaller the compressed point set is, but a distortion rate will also be increased. It is necessary to adjust the parameters according to the actual situation and compress the data amount on the basis of retaining the features of the original data as much as possible.
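The compression steps (1) to (3) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the present application: points are treated as planar (x, y) pairs, so longitude and latitude would need to be projected first for real distances, the function names are hypothetical, and an explicit stack replaces the recursion so that very long tracks do not hit the interpreter's recursion limit.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b in the plane."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its ends.
    u = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    u = max(0.0, min(1.0, u))
    return math.hypot(px - (ax + u * dx), py - (ay + u * dy))


def douglas_peucker(points, threshold):
    """Return the compressed set C of a track given as (x, y) points."""
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True          # step (1): both endpoints go into C
    stack = [(0, len(points) - 1)]
    while stack:
        lo, hi = stack.pop()
        best_d, best_i = 0.0, None
        for i in range(lo + 1, hi):    # step (2): farthest point Pc
            d = point_segment_distance(points[i], points[lo], points[hi])
            if d > best_d:
                best_d, best_i = d, i
        if best_i is not None and best_d > threshold:
            keep[best_i] = True
            stack.append((lo, best_i)) # step (3): repeat on both segments
            stack.append((best_i, hi))
    return [p for p, kept in zip(points, keep) if kept]
```

For example, a perfectly straight track collapses to its two endpoints, while a sharp detour larger than the threshold D is retained as a key point.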
In step S3), a clustering algorithm is employed.
After the tracks are compressed, in order to form a graph suitable for being input into a deep learning model, the track points need to be further clustered. Since the track points have obvious shape features and low dimensionality, the DBSCAN clustering algorithm is selected in the present application. Such a clustering algorithm is a density-based clustering algorithm, which can recognize high-density regions with arbitrary shapes, has good anti-interference performance, and has a very significant effect on track data processing.
In the present application, for the track C=[P1, P2, . . . , Pm, . . . , PM] (M represents the total number of sampling points after track compression), the steps of track clustering are as follows:
(1) Setting parameters: k is set as the minimum number of points in a neighborhood, and r is set as a neighborhood radius.
(2) Creating a group: randomly selecting a point Pm; if the number of other points within the neighborhood radius r of Pm is greater than k−1, creating a new group A and classifying Pm into the group; otherwise, classifying Pm as a noise point and reselecting a point.
(3) Expanding the group: traversing all points in the neighborhood of Pm; for each such point, if the number of other points within its neighborhood radius r is greater than k−1, classifying same into the group A, and continuing the recursion in this way until no point that satisfies the requirement exists in the neighborhood.
(4) Performing cyclic grouping: randomly selecting points again, and repeating the above process until every point either belongs to a group or is recognized as a noise point.
Since this method can recognize a region with a relatively high density, all the sub-tracks belonging to the same region may be classified into a cluster. The cluster is denoted as a node, and a node set is V. A vehicle moves between different sub-tracks, which is recorded into an edge set E of the graph, where the edge is a directed edge. Moreover, the degree of the node may be calculated, thereby forming a network graph G representing the whole motion track.
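The clustering steps and the construction of the node set V and the directed edge set E can be sketched as follows. Several points are assumptions rather than details of the present application: the function names are hypothetical, points are planar (x, y) pairs in time order, seed points are visited in index order instead of randomly so that the result is reproducible, the neighbor search is a simple quadratic scan, and, following common DBSCAN practice, non-dense points reachable from a dense point are still assigned to the group, which is slightly more permissive than step (3) as worded.

```python
import math
from collections import Counter

NOISE = -1


def dbscan(points, r, k):
    """Label each point with a group id, or NOISE.

    A point is dense when at least k other points lie within radius r,
    i.e. the number of other points is greater than k - 1.
    """
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if j != i and math.dist(points[i], q) <= r]

    labels = [None] * len(points)
    group = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < k:
            labels[i] = NOISE          # may later be absorbed by a group
            continue
        labels[i] = group
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = group      # border point reached from a dense point
            if labels[j] is not None:
                continue
            labels[j] = group
            reachable = neighbors(j)
            if len(reachable) >= k:    # only dense points expand the group
                queue.extend(reachable)
        group += 1
    return labels


def build_track_graph(labels):
    """Build (V, E) from the group labels of a time-ordered track.

    V is the sorted list of group ids; E counts each directed move
    from one group to a different group. Noise points are skipped.
    """
    visited = [c for c in labels if c != NOISE]
    nodes = sorted(set(visited))
    edges = Counter((a, b) for a, b in zip(visited, visited[1:]) if a != b)
    return nodes, edges
```

The edge counts can serve as weights, and the degree of a node can be derived by counting the edges in which it appears.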
Since operation efficiency is low when the graph is directly used for computing, a graph embedding model needs to be employed in order to compare the similarity between track graphs. The graph is mapped into a vector of dimension k, where k is much smaller than the number of nodes in the original graph, such that subsequent analysis can process the graph in the form of a low-dimensional vector by using machine learning, deep learning and other methods.
The present application employs a graph2vec algorithm, which is an unsupervised representation learning method based on a graph kernel. By means of training of a neural network, the whole track graph is embedded, and explicit embedding vectors that can be used for similarity calculation are obtained. The algorithm borrows its idea from document embedding methods in natural language processing. By analogy with doc2vec, graph2vec regards the whole graph as a document, and each rooted subgraph extracted from the graph as a word. The way in which rooted subgraphs form the graph can be regarded as the way in which words form a sentence or paragraph. The basic process of the graph2vec algorithm is as follows: firstly, extracting a rooted subgraph of each node from the whole graph; then, performing vector embedding by using a Skipgram model; and finally, optimizing an output result by using a stochastic gradient descent (SGD) algorithm.
The specific steps are as follows:
The rooted subgraph is a subgraph with a certain node in the graph as a root node and a maximum depth as a specified parameter D. The rooted subgraph is a high-order substructure that can better retain the structural features of the original graph than a low-order or linear substructure. The steps for extracting the rooted subgraph are as follows:
(1) Determining a parameter: determining the maximum depth D of the rooted subgraph.
(2) Searching nodes: for each depth dx from 0 to D, finding the neighbor nodes of a node RN by employing a breadth-first algorithm, then, for each neighbor node, obtaining the subgraph rooted at that neighbor node with the depth of dx-1 and recording same in the set Mz(dx), and obtaining the subgraph M′ with the node RN as a root node and the depth of dx-1.
(3) Performing reordering and merging: relabeling the subgraphs in Mz(dx) by using a Weisfeiler-Lehman algorithm, and then, merging same with M′ into a subgraph with the depth of dx as an output.
Through the above steps, the rooted subgraphs of all the nodes in the graph are obtained, and unique labels are assigned to all the subgraphs.
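One common way to realize this extraction is iterative Weisfeiler-Lehman relabeling, in which the label of a node at depth dx is derived from its own label and the sorted labels of its neighbors at depth dx-1. The sketch below follows that idea under stated assumptions: the function name is hypothetical, the graph is given as an adjacency dictionary in which every node appears as a key, the initial labels (for example node degrees) are chosen by the caller, and a truncated SHA-1 digest is used as the compact unique label.

```python
import hashlib


def wl_rooted_subgraphs(adjacency, initial_labels, max_depth):
    """Return {node: [label at depth 0, ..., label at depth max_depth]}.

    Each label identifies the rooted subgraph of that depth around the
    node; equal labels mean structurally equal rooted subgraphs.
    """
    current = {node: str(initial_labels[node]) for node in adjacency}
    history = {node: [current[node]] for node in adjacency}
    for _ in range(max_depth):
        updated = {}
        for node, nbrs in adjacency.items():
            # Sorting the neighbor labels makes the result independent
            # of the order in which neighbors are listed.
            signature = current[node] + "|" + ",".join(
                sorted(current[m] for m in nbrs))
            updated[node] = hashlib.sha1(signature.encode()).hexdigest()[:12]
        current = updated
        for node in adjacency:
            history[node].append(current[node])
    return history
```

Flattening all label lists of a graph gives the collection of subgraph "words" that makes up the "document" of that graph.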
The Skipgram model is a feedforward neural network model. In the graph2vec algorithm, the function of the model is to predict the subgraphs that may appear before and after a given subgraph, that is, to perform maximum likelihood estimation. For example, T subgraphs of the whole graph are given: {ω1, ω2, . . . , ωt, . . . , ωT}, and a window length is determined as cw, that is, the subgraphs to be predicted are {ωt−cw, . . . , ωt+cw}. In order to maximize the prediction probability, the maximum likelihood estimation method is employed, and the calculation method is shown in Formula (1).
In the formula, Pr(ωt−cw, . . . , ωt+cw|ωt) represents the product of the probability of occurrence of each subgraph in the case of occurrence of ωt, and its calculation formula is Formula (2).
In the formula, Pr(ωt+j|ωt) represents the probability of occurrence of subgraph ωt+j in the case of occurrence of subgraph ωt. Since the probability of occurrence of each subgraph belongs to independent distribution in the set dictionary V of subgraphs, Pr(ωt+j|ωt) may be expressed by Formula (3).
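Formulas (1) to (3) do not appear in the text at this point. Based on the inline statement of the same objective in the description of the Skipgram network, they can be written as follows. This is a reconstruction, and the numbering is assumed to follow the order in which the formulas are cited.

```latex
\begin{align}
R(d) &= \sum_{t=1}^{T} \log \Pr(\omega_{t-cw}, \ldots, \omega_{t+cw} \mid \omega_t) \tag{1} \\
\Pr(\omega_{t-cw}, \ldots, \omega_{t+cw} \mid \omega_t)
  &= \prod_{\substack{-cw \le j \le cw \\ j \ne 0}} \Pr(\omega_{t+j} \mid \omega_t) \tag{2} \\
\Pr(\omega_{t+j} \mid \omega_t)
  &= \frac{\exp(\omega_{t+j} \cdot \omega_t)}{\sum_{\omega \in V} \exp(\omega \cdot \omega_t)} \tag{3}
\end{align}
```

Here V is the dictionary composed of all the subgraphs, and the dot denotes the inner product of the corresponding embedding vectors.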
The Skipgram model is a shallow neural network including an input layer, a hidden layer and an output layer. The network graph G to be embedded is selected, and the set of all subgraphs thereof is {ω1, ω2, . . . , ωt, . . . , ωT}. The window length is determined to be cw, and the subgraph ωt is selected in turn as an input of the neural network. The output layer is a softmax regression classifier, each node of which will output a value between 0 and 1, representing the probability distribution of the subgraph set {ωt−cw, . . . , ωt+cw}, and the sum of the probabilities represented by all values is 1. The objective function is maximization of $R(d)=\sum_{t=1}^{T}\log \Pr(\omega_{t-cw},\ldots,\omega_{t+cw}\mid\omega_t)$, where $\Pr(\omega_{t-cw},\ldots,\omega_{t+cw}\mid\omega_t)$ represents the product of the probability of occurrence of each subgraph in the case of occurrence of subgraph ωt, that is, $\Pr(\omega_{t-cw},\ldots,\omega_{t+cw}\mid\omega_t)=\prod_{-cw\le j\le cw,\ j\ne 0}\Pr(\omega_{t+j}\mid\omega_t)$, where $\Pr(\omega_{t+j}\mid\omega_t)$ represents the probability of occurrence of subgraph ωt+j in the case of occurrence of subgraph ωt, and the calculation formula is $\Pr(\omega_{t+j}\mid\omega_t)=\exp(\omega_{t+j}\cdot\omega_t)/\sum_{\omega\in V}\exp(\omega\cdot\omega_t)$. V is the dictionary composed of all the subgraphs, and the final output of the model is a vector representing the network graph G.
Due to the large size of the dictionary composed of all the subgraphs in graph2vec, it is too expensive to directly employ the Skipgram model. The graph2vec algorithm employs a negative sampling training method to reduce the number of elements of the dictionary V that are involved in each update of the Skipgram model. The specific method is as follows: if the training graph Gi is selected, the subgraph set of Gi is c. A sample set c′ is formed by randomly selecting rooted subgraphs from several groups of graphs adjacent to Gi, where c′⊂V, and c′∩c=∅, where ∅ represents an empty set. The number of subgraphs in c′ should be much less than the number of subgraphs in V, and this parameter should be adjusted according to actual needs. Only the sample set c′ needs to be updated for each training. If two graphs are composed of similar rooted subgraphs, embedding results of the two graphs are closer in a vector space.
Due to a large sample size, in the algorithm, the stochastic gradient descent method is employed to optimize the output vector. Part of the samples are randomly selected for training to ensure the operation efficiency of the algorithm, and the learning rate α needs to be adjusted according to the actual situation.
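The training procedure with negative sampling and stochastic gradient descent can be sketched as follows. This is a simplified illustration rather than the training code of the present application: the function name and all default parameter values are hypothetical, each graph is given as a list of subgraph labels, and the negative set c′ is drawn from the whole dictionary minus the subgraphs of the current graph instead of from adjacent graphs only.

```python
import math
import random


def train_graph_embeddings(graph_docs, dim=8, negatives=3, lr=0.05,
                           epochs=50, seed=0):
    """Map {graph_id: [subgraph labels]} to {graph_id: embedding vector}."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in graph_docs.values() for w in doc})
    # Graph vectors start small and random; subgraph vectors start at zero.
    g_vec = {g: [(rng.random() - 0.5) / dim for _ in range(dim)]
             for g in graph_docs}
    w_vec = {w: [0.0] * dim for w in vocab}
    # Candidate negatives: dictionary entries absent from the graph (c' and c disjoint).
    absent = {g: [w for w in vocab if w not in set(doc)]
              for g, doc in graph_docs.items()}

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    pairs = [(g, w) for g, doc in graph_docs.items() for w in doc]
    for _ in range(epochs):
        rng.shuffle(pairs)             # stochastic order of training samples
        for g, w in pairs:
            pool = absent[g]
            negs = rng.sample(pool, min(negatives, len(pool)))
            grad_g = [0.0] * dim
            for target, label in [(w, 1.0)] + [(u, 0.0) for u in negs]:
                score = sum(a * b for a, b in zip(g_vec[g], w_vec[target]))
                step = lr * (label - sigmoid(score))
                for i in range(dim):
                    grad_g[i] += step * w_vec[target][i]
                    w_vec[target][i] += step * g_vec[g][i]
            for i in range(dim):
                g_vec[g][i] += grad_g[i]
    return g_vec
```

Only the vectors of the positive subgraph and the sampled negatives are touched in each update, which is what keeps the cost per step independent of the dictionary size.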
For step S5), stability indexes are output.
Whether the tracks in different periods have similarity is determined by means of cosine similarity to analyze the stability of user behavior and recognize abnormal tracks.
All track graphs are jointly embedded into one vector space, and the similarity between two tracks can be compared by calculating the distance between the vectors in this space. There are two ways to measure the distance between the vectors, namely a Euclidean distance and a cosine distance. The cosine measure is more suitable for calculating the similarity between two vectors, that is, $\cos\theta = (v \cdot u)/(\lVert v\rVert \times \lVert u\rVert)$. The larger the obtained cos θ, the greater the correlation between two tracks. The whole track is divided into several segments according to the time period, and the average value of all the cosine similarities is calculated to quantify the stability of the track.
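The stability index and the threshold test can be sketched as follows. The function names are hypothetical, the average is assumed to run over all pairs of period vectors of one vehicle (the present application does not state which pairs are averaged), and the threshold is a parameter to be chosen by the user.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """cos(theta) between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def track_stability(period_vectors):
    """Mean pairwise cosine similarity of the per-period embedding vectors."""
    pairs = list(combinations(period_vectors, 2))
    if not pairs:
        raise ValueError("at least two period vectors are required")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def is_abnormal(period_vectors, threshold):
    """Flag a track whose stability falls below the set threshold."""
    return track_stability(period_vectors) < threshold
```

A vehicle whose monthly graphs embed to nearly parallel vectors scores close to 1, while a vehicle whose behavior changes between periods scores lower and is flagged once the score drops below the threshold.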
An anomaly recognition system for tracks of trucks is provided in Embodiment 2 of the present disclosure. The system is implemented on the basis of the method in Embodiment 1 and includes: a running track obtainment module, a compression module, a clustering algorithm module, a vector output module and an anomaly recognition module.
The running track obtainment module is used for obtaining a running track T according to GPS data of running of a truck to be recognized.
The compression module is used for employing a track compression algorithm for the running track T to obtain a compressed track set C.
The clustering algorithm module is used for employing a density-based clustering algorithm and performing grouping according to set time periods to obtain a network graph G representing a motion track of each time period.
The vector output module is used for inputting the network graph G into a track embedding model established and trained in advance to obtain an explicit embedding vector corresponding to each network graph.
The anomaly recognition module is used for determining stability according to a distance between the vectors, and classifying points with the stability lower than a set threshold as abnormal tracks.
The track embedding model is realized by employing a Skipgram model on the basis of a graph2vec algorithm.
1. The GPS track data of 206 trucks are included in the data set selected for this experiment. The track data of each truck is composed of 100 thousand track points.
2. For the effect of track compression, the original track is shown in
3. The tracks are grouped by month, and then, each group of tracks are clustered. The effect of one group is shown in
4. Cosine similarity is obtained by calculation to express the quantitative stability index of each vehicle, and a stability threshold is set by comparison with the visualized track diagrams. Effects are shown in
5. To test the experimental effect, a self-similarity test is employed. The track sequence of a certain user is divided into two sub-sequences according to the parity of the row number, and the similarity of the embedding vectors of the two sub-sequences is compared. If the similarity is high, the model is considered valid. In this experiment, 20 users are randomly selected, 13 of these users have a self-similarity over 0.95, and the rest are over 0.8, which is much higher than the stability threshold. This supports the effectiveness of the model in quantifying the stability of user behavior and recognizing abnormal users.
The preferred embodiments of the present application are described in detail above. However, the present application is not limited to specific details of the above embodiments. Within the scope of the technical concept of the present application, various simple modifications may be made to the technical solutions of the present application, and these simple modifications all fall within the protection scope of the present application.
Moreover, it should also be noted that various specific technical features described in the above particular embodiments may be combined in any suitable manner, without contradiction. In order to avoid unnecessary repetition, various possible combination modes are not separately described in the present application.
In addition, different embodiments of the present application may also be combined arbitrarily, so long as the combination does not deviate from the idea of the present application, and such combinations should also be regarded as disclosed in the present application.
Number | Date | Country | Kind |
---|---|---|---|
2023107632957 | Jun 2023 | CN | national |