The invention relates to a method for identifying and classifying expressway users, in particular to a user segmentation method based on toll data of expressway electronic toll collection (ETC).
Expressways are an integral part of urban traffic, so understanding the travel demands of expressway users is of great significance for expressway planning and management. The “Outline for Building a Transportation Powerful Country” puts forward higher requirements for expressway operation management and travel services. However, the traditional manual toll collection (MTC) system records few user data fields, so it cannot support continuous analysis of expressway users. In addition, manual investigation methods such as traffic surveys and questionnaires suffer from long cycles, low sampling rates and high costs, and their low data quality makes it difficult to achieve the expected effects.
With the development of information technology and infrastructure, the ETC system has been widely deployed, and the operation of expressways has generated a large amount of ETC toll data. ETC toll data uniquely identifies users, realizing one person, one car and one signature, which makes it possible to identify the commuting, operation, business and sporadic travel of expressway users. As of October 2020, the utilization rate of the ETC system was close to 70%, covering most expressway users. Mining the travel characteristics of these users provides an opportunity for more in-depth identification and classification of expressway users.
Self-organizing map (SOM) is a representative unsupervised machine learning algorithm. Unlike traditional k-means clustering and fuzzy clustering methods, the SOM algorithm does not require the number of clusters to be set in advance, which makes it easier to operate. It can not only automatically find the internal relationships among sample attributes, but also reduce the dimension and complexity of the data. A typical SOM model has a simple layered structure, generally consisting of only an input layer and a competition layer, so it has great advantages for processing large-scale complex data.
At present, no relevant literature has been reported.
The technical problem to be solved by the invention is to provide a user segmentation method based on toll data of expressway ETC, which can quickly and accurately identify and classify expressway users, in order to overcome the shortcomings of the prior art.
The technical scheme adopted by the invention is as follows.
The invention provides the user segmentation method based on the toll data of expressway ETC, which has the advantages as follows:
A user segmentation method based on toll data of expressway ETC of the present invention will be described in detail with reference to the following embodiments and drawings.
The user segmentation method based on the toll data of expressway ETC of the present invention is to identify travel purposes of commuting travel, operation travel, business travel and sporadic travel of expressway users, as shown in
Cleaning the data according to the abnormal state of time includes: reading the outbound time and the inbound time of a travel record (i.e., consumption record) of the expressway user, and calculating the driving time for this travel record; if the driving time is negative, that is, the outbound time is earlier than the inbound time, or the driving time exceeds 24 hours, determining that this consumption record is abnormal time data of the expressway user, and eliminating this abnormal time data.
Cleaning the data according to the abnormal state of space includes: reading the outbound time, the inbound time and the billing distance of a travel record of the expressway user, and calculating the driving speed for this travel record; if the driving speed is greater than 120 kilometers per hour (km/h) or the billing distance is greater than 1000 kilometers (km), determining that this consumption record is abnormal space data of the expressway user, and eliminating this abnormal space data.
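As a non-limiting illustration, the two cleaning rules above may be sketched in Python as follows. The record layout (a dictionary with `inbound`, `outbound` and `distance_km` keys), the function names, and the treatment of a zero driving time are assumptions made for this sketch only and are not part of the original disclosure.

```python
from datetime import datetime

MAX_HOURS = 24.0          # driving-time limit stated in the text
MAX_SPEED_KMH = 120.0     # speed limit stated in the text
MAX_DISTANCE_KM = 1000.0  # billing-distance limit stated in the text

def is_abnormal(inbound: datetime, outbound: datetime, distance_km: float) -> bool:
    """Return True if a travel record fails the time check or the space check."""
    hours = (outbound - inbound).total_seconds() / 3600.0
    # Time check: negative driving time, or driving time above 24 hours.
    if hours < 0 or hours > MAX_HOURS:
        return True
    # Space check, part 1: billing distance above 1000 km.
    if distance_km > MAX_DISTANCE_KM:
        return True
    # Assumption: zero driving time with a positive distance is treated as abnormal,
    # because the implied speed is unbounded.
    if hours == 0:
        return distance_km > 0
    # Space check, part 2: driving speed above 120 km/h.
    return distance_km / hours > MAX_SPEED_KMH

def clean(records):
    """Keep only the records that pass both checks."""
    return [r for r in records
            if not is_abnormal(r["inbound"], r["outbound"], r["distance_km"])]
```

In this sketch a record is dropped as soon as either rule fires, so the order of the two checks does not affect the result.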
Specifically, a method for extracting the time index of each expressway user includes the following steps: counting the number of days on which each expressway user travels on working days and on non-working days within the set period, and counting the number of days on which each expressway user travels in peak and off-peak periods, where the peak periods are the morning peak of 7:00-9:00 and the evening peak of 17:00-19:00, and the remaining time in a day is the off-peak period.
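By way of example only, the time index may be computed as in the following Python sketch. It assumes that one timestamp per trip is used to decide the period, that Monday to Friday are working days unless listed in a hypothetical `holidays` set, and that the end of each peak period is excluded. None of these details is specified in the text.

```python
from datetime import datetime, time

# Peak periods from the text: 7:00-9:00 and 17:00-19:00.
PEAKS = [(time(7, 0), time(9, 0)), (time(17, 0), time(19, 0))]

def in_peak(ts: datetime) -> bool:
    """True if the timestamp falls in a peak period (end bound excluded, an assumption)."""
    return any(start <= ts.time() < end for start, end in PEAKS)

def time_index(trip_times, holidays=frozenset()):
    """Count distinct travel days of one user by day type and by period."""
    working, non_working, peak, off_peak = set(), set(), set(), set()
    for ts in trip_times:
        day = ts.date()
        # Assumption: Monday-Friday are working days unless listed in `holidays`.
        is_working = day.weekday() < 5 and day not in holidays
        (working if is_working else non_working).add(day)
        # A day with both peak and off-peak trips is counted in both sets.
        (peak if in_peak(ts) else off_peak).add(day)
    return {
        "working_days": len(working),
        "non_working_days": len(non_working),
        "peak_days": len(peak),
        "off_peak_days": len(off_peak),
    }
```

Days are stored in sets so that several trips on the same day count as one travel day, which matches the day-based counting described above.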
A method for extracting the space index of each expressway user includes the following steps: extracting the starting-ending points of all toll stations in each expressway user's travel within the set period and assigning each of them a number a, then counting the travel frequency of each expressway user at each starting-ending point according to the numbers, and finally calculating the travel proportion of each expressway user at each starting-ending point. The calculating formulas applied thereto are as follows.
Where a represents the number of the starting-ending point of the toll stations, C represents the total travel frequency of the expressway user, A represents a set of all the starting-ending points that the expressway user has passed through, Ca represents the travel frequency of the expressway user at the starting-ending point a, and Qa represents the travel proportion of the expressway user at the starting-ending point a.
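The formulas themselves are not reproduced in this text. One reading that is consistent with the symbol definitions above, offered as an assumption rather than as the original notation, is:

```latex
C = \sum_{a \in A} C_a, \qquad Q_a = \frac{C_a}{C}, \quad a \in A
```

Under this reading the proportions $Q_a$ of one expressway user sum to 1 over the set $A$.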
A method for extracting the personal attribute index of the expressway user includes the following steps: calculating the total travel frequency and total travel billing distance of each expressway user within the set period by using an aggregation function. The calculating formulas applied thereto are as follows.
Where a represents the number of the starting-ending point of the toll stations, C represents the total travel frequency of the expressway user, A represents the set of all the starting-ending points that the expressway user has passed through, Ca represents the travel frequency of the expressway user at the starting-ending point a, S represents the total travel billing distance of the expressway user, and Sa represents the single billing distance of the starting-ending point a.
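For illustration, the aggregation described above may be sketched in Python as follows. The trip tuple layout and the function name are hypothetical. The total billing distance is summed per trip, which equals the sum of Ca multiplied by Sa over A when each starting-ending point has a single fixed billing distance.

```python
from collections import Counter

def space_and_personal_index(trips):
    """Aggregate one user's trips.

    trips: list of (inbound_station, outbound_station, billing_distance_km).
    Returns (C, S, Q) where C is the total travel frequency, S the total
    billing distance, and Q maps each starting-ending point a to Q_a.
    """
    counts = Counter((origin, dest) for origin, dest, _ in trips)  # C_a
    total_trips = sum(counts.values())                             # C
    total_distance = sum(dist for _, _, dist in trips)             # S
    proportions = {od: c / total_trips for od, c in counts.items()}  # Q_a
    return total_trips, total_distance, proportions
```

For a user with three trips between stations A and B and one trip between B and C, the most used starting-ending point would have a proportion of 0.75.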
The SOM clustering algorithm is used to complete the classification of expressway users, which specifically includes: using the SOM clustering algorithm shown in
where sample represents the number of expressway users.
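The sizing formula referred to here is not reproduced in this text. Purely as an assumption, the sketch below uses a widely cited rule of thumb for SOM grids, in which the competition layer holds about 5·sqrt(sample) neurons arranged as an N×N square. The function name is hypothetical.

```python
import math

def som_side(sample: int) -> int:
    """Side N of an N x N competition layer under the 5*sqrt(sample) rule of thumb.

    This heuristic is an assumption of the sketch, not a formula taken from the text.
    """
    neurons = 5 * math.sqrt(sample)   # suggested total number of neurons
    return round(math.sqrt(neurons))  # side length of the square grid
```

Under this assumption, the roughly 1.35 million users remaining after cleaning in the embodiment give N = 76, which agrees with the 76×76 competition layer stated there.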
Cluster analysis is completed with python-minisom, a tool implementing the SOM clustering algorithm; the average values of the time and space indexes of the expressway users in each cluster are then calculated from the cluster analysis results, and the following storage format is formed.
A method for identifying the commuting travel and operation travel includes the following steps: selecting at least one cluster ID in which expressway users travel more than 3 days per week on average on working days, and then calculating, for the expressway users in the cluster ID, the total numbers of days on which the expressway users travel in the peak periods (7:00-9:00, 17:00-19:00) and in the off-peak period respectively, specifically selecting the k-th month for calculating,
where Wk represents the total number of the days for the expressway users to travel in the peak periods in the k-th month; Mk represents the total number of the days for the expressway users to travel in the off-peak period in the k-th month.
If Wk>Mk, the expressway users in the cluster ID are defined as the users of the commuting travel (i.e., commuting travel users), otherwise, the expressway users in the cluster ID are defined as the users of the daily operation travel (i.e., operation travel users).
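The decision rule above may be illustrated by the following minimal Python sketch. The function name, the argument names and the returned labels are hypothetical. `peak_days` stands for Wk and `off_peak_days` for Mk.

```python
FREQUENT_DAYS_PER_WEEK = 3  # working-day threshold stated in the text

def classify_frequent(avg_working_days_per_week, peak_days, off_peak_days):
    """Label a frequently travelling cluster as commuting or operation.

    Returns None when the cluster does not exceed the 3-days-per-week
    threshold; such clusters fall under the business/sporadic rule instead.
    """
    if avg_working_days_per_week <= FREQUENT_DAYS_PER_WEEK:
        return None
    # W_k > M_k: travel concentrated in peak periods, hence commuting.
    return "commuting" if peak_days > off_peak_days else "operation"
```

A tie between Wk and Mk falls on the operation side in this sketch, following the "otherwise" branch of the rule.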
A method for identifying sporadic travel and business travel includes the following steps: selecting at least one cluster ID in which expressway users travel fewer than 3 days per week on average on working days, and then calculating the travel frequency of each starting-ending point in the k-th month for each expressway user:
where Pkj represents the travel frequency of the expressway user at the j-th starting-ending point in the k-th month; Pk represents the total travel frequency of the expressway user in the k-th month; q represents the total number of the starting-ending points.
For the expressway users in this cluster ID, the proportion of travels at each starting-ending point among travels at all starting-ending points is calculated. If the maximum proportion exceeds 40%, the expressway users in this cluster ID are defined as the users of the business travel; otherwise, the expressway users in this cluster ID are defined as the users of the sporadic travel.
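As a hedged illustration of this rule, the following Python sketch takes the monthly travel frequencies Pkj keyed by starting-ending point and compares the largest share with the 40% threshold. The function name, the dictionary input and the handling of an empty month are assumptions of the sketch.

```python
def classify_infrequent(od_counts, threshold=0.40):
    """Label an infrequently travelling cluster as business or sporadic.

    od_counts: mapping from starting-ending point j to its frequency P_kj
    in month k. The sum of the values plays the role of P_k.
    """
    total = sum(od_counts.values())  # P_k
    if total == 0:
        # Assumption: with no trips in the month there is no dominant route.
        return "sporadic"
    max_share = max(od_counts.values()) / total
    # "Exceeds 40%" is read as a strict inequality.
    return "business" if max_share > threshold else "sporadic"
```

A cluster whose trips are spread evenly over four starting-ending points has a maximum share of 25% and is therefore labelled sporadic in this sketch.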
The specific embodiment is described below.
As shown in
Step 101: Pre-processing the toll data of expressway ETC.
The toll data of expressway ETC is huge, exceeding 100 GB. To improve storage efficiency, the key fields are extracted from the original data according to the time and space characteristics, and the expressway toll records are sorted. Abnormal data records, such as those with missing fields or wrong plate numbers, are eliminated, and the following basic data storage format is formed, which contains 20 million records and more than 1.4 million users.
[plate number, inbound time, inbound location, outbound time, outbound location, billing distance, final toll]
Step 102: Cleaning user's travel records according to the judgment of abnormal time and space.
Because there are errors in system entry and identification of the toll data of expressway ETC, data cleaning is required before data processing. First, the travel records of each expressway user are sorted according to time, and then the following steps are carried out:
Step 1021: Cleaning the abnormal time data record.
The outbound time and inbound time of a travel record of the expressway user are read, and the driving time is calculated. If the driving time is negative (the outbound time is earlier than the inbound time), or the driving time exceeds 24 hours, this consumption record is determined to be abnormal time data of the expressway user.
Step 1022: Cleaning the abnormal space data record.
Specifically, the outbound time, inbound time and billing distance of a travel record of the expressway user are read, and the driving speed of this travel record is calculated. If the driving speed is greater than 120 km/h or the billing distance is greater than 1000 km, this consumption record is determined to be abnormal space data of the expressway user. After the data cleaning, about 1.35 million expressway users remain.
Step 1023: Extracting travel indexes of user's time, space and personal attribute.
The days of working-day travel and non-working-day travel in the cycle are counted, and, with 7:00-9:00 as the morning peak and 17:00-19:00 as the evening peak, the days of peak-period travel and off-peak-period travel are counted. The travel frequency of each starting-ending point in the travel of the expressway user is counted, and the proportion of each starting-ending point in all travels is calculated. The aggregate function is used to calculate the total travel frequency and total travel billing distance of each expressway user in the set period, so as to obtain the travel indexes of all expressway users. The travel indexes of a certain expressway user are as follows.
Step 103: Using the SOM clustering to complete expressway user clustering.
Python-minisom, a tool implementing the SOM clustering method, is used to cluster and analyze the above-mentioned time, space and personal attribute indexes of expressway users. The input parameters of the SOM clustering algorithm include, for each expressway user, the travel days on working days, the travel days on non-working days, the travel days in peak periods, the travel days in the off-peak period, and the proportion of the most commonly used starting-ending point in all travels. The size of the competition layer of the adaptive neural network is set to N×N=76×76.
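To make the roles of the input layer and the competition layer concrete, the following is a minimal, dependency-free SOM sketch in Python. It is not the python-minisom implementation used in the embodiment. The linear decay schedule, the Gaussian neighborhood, the default parameters and the initialization from random samples are all assumptions of this sketch, and the feature rows are assumed to be scaled to comparable ranges beforehand.

```python
import math
import random

def winner(weights, x):
    """Return grid coordinates (i, j) of the node whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, x))
            if d < best_d:
                best, best_d = (i, j), d
    return best

def train_som(data, side, iterations, sigma=1.0, learning_rate=0.5, seed=0):
    """Train a side x side competition layer on data (list of equal-length float lists)."""
    rng = random.Random(seed)
    # Each competition-layer node starts as a copy of a randomly chosen sample.
    weights = [[list(rng.choice(data)) for _ in range(side)] for _ in range(side)]
    for t in range(iterations):
        x = rng.choice(data)
        decay = 1.0 - t / iterations        # shrinks the rate and radius linearly
        radius = max(sigma * decay, 1e-3)   # neighborhood radius on the grid
        bi, bj = winner(weights, x)         # best matching unit for this sample
        for i in range(side):
            for j in range(side):
                grid_d2 = (i - bi) ** 2 + (j - bj) ** 2
                # Gaussian neighborhood: nodes near the winner move more.
                h = learning_rate * decay * math.exp(-grid_d2 / (2 * radius ** 2))
                w = weights[i][j]
                for k in range(len(x)):
                    w[k] += h * (x[k] - w[k])
    return weights
```

After training, calling `winner(weights, row)` for a user's five-value feature row gives the competition-layer node to which that user is mapped.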
After the SOM clustering, six clusters are finally obtained. The averages of the travel indexes of all users in each cluster are then calculated according to the cluster ID, and the following data format is formed for each cluster.
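The per-cluster averaging may be sketched in Python as follows. The function name and the input layout (one cluster ID and one list of index values per user) are assumptions of the sketch. How the six cluster IDs are derived from the competition layer is not specified in the text and is not modelled here.

```python
from collections import defaultdict

def cluster_means(cluster_ids, features):
    """Average each travel-index column over the users of each cluster ID."""
    sums = {}
    counts = defaultdict(int)
    for cid, row in zip(cluster_ids, features):
        if cid not in sums:
            sums[cid] = [0.0] * len(row)
        for k, value in enumerate(row):
            sums[cid][k] += value
        counts[cid] += 1
    return {cid: [s / counts[cid] for s in sums[cid]] for cid in sums}
```

The result maps each cluster ID to one averaged row of travel indexes, which is the kind of per-cluster summary the identification rules operate on.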
Step 104: Dividing commuting, operation, business and sporadic users according to the identification principle of expressway users.
The average weekly travel days of expressway users in cluster 1 and cluster 4 exceed three. The travel days of users in cluster 1 are concentrated in peak periods, while those in cluster 4 are more dispersed, so cluster 1 is defined as commuting travel users and cluster 4 as operation travel users. The remaining clusters (cluster 2, cluster 3, cluster 5 and cluster 6) have fewer travel days, with fewer than 3 travel days per week on working days. Among them, in cluster 3 the most commonly used starting-ending point accounts for more than 40% of travels and the travel routes are concentrated, so cluster 3 is defined as business travel users, while cluster 2, cluster 5 and cluster 6 are defined as sporadic travel users.
The above embodiments are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. All equivalent substitutions and modifications made without departing from the spirit and principle of the present invention should be included in the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
2022101143064 | Jan 2022 | CN | national |