This application claims priority to Chinese Patent Application No. 202210590822.4, filed on May 27, 2022, the contents of which are hereby incorporated by reference.
The application relates to the technical field of computers, and in particular to a multi-granularity perception integrated learning method, a device, computer equipment and a storage medium for the data analysis of users' online behaviors.
With the wide application of the Internet in many practical fields, such as information security, economic management, social governance and medical biology, more and more data are produced to record users' online behavior information. How to extract knowledge from users' online behavior data more effectively and accurately to meet actual needs still faces many challenges. However, there has been little applied research that combines granular computing and ensemble learning on users' online behavior data. Users' online behavior data belong to structured data, which are easy to query, modify and compute, and from which a higher level of data may usually be abstracted. This abstraction process is called granulation, and multi-granularity perception is a method that performs granulation conversion on the data repeatedly and to different degrees, thus generating abstract multi-granularity characteristics, so as to achieve the objective of multi-level and multi-perspective perception of the data. From the perspective of cognitive computing, multi-granularity perception is concept learning based on granular computing, which is beneficial to the formation of conceptual knowledge. At present, how to reasonably granulate users' online behavior data at multiple granularities and how to carry out efficient, accurate and interpretable integrated learning on multi-granularity structured data have rarely been studied systematically, so it is very valuable and necessary to carry out research on the multi-granularity perception integrated learning method for users' online behavior data.
Based on this, it is necessary to provide a multi-granularity perception integrated learning method, device, computer equipment and storage medium that may apply the granular computing theory to the analysis of users' online behavior.
The application relates to a multi-granularity perception integrated learning method, including the following steps:
In one embodiment, the method further includes:
In one embodiment, the method further includes:
In one embodiment, the method further includes:
In one embodiment, the method further includes:
In one embodiment, the method further includes:
repeatedly performing an iteration of the particle swarm algorithm according to the initial value until an end condition is met, and ending the iteration; and
In one embodiment, the method further includes that the base learner is a tree model.
A multi-granularity perception integrated learning device, including:
A computer device includes a memory and a processor, wherein the memory stores a computer program, and when the processor executes the computer program, the following steps are realized:
A computer-readable storage medium has a computer program stored thereon, and the computer program may realize the following steps when executed by a processor:
The multi-granularity perception integrated learning method, device, computer equipment and storage medium preprocess the data set of users' online behavior; through the multi-granularity perception data derivation algorithm, the attribute characteristics are processed with the particle as the unit, and the data are then divided into granular layers according to the granularity characteristics to obtain multi-level derivative data sets; based on the base learning algorithm, a plurality of preset base learners are trained according to the derivative attribute values of the training data set in the derivative data sets and the particle label values of the corresponding granular layers, and the trained base learners are obtained; the training data set is input into the trained base learners, the self-prediction error is calculated, and the mean square error with the particle as the unit and the mean square error with the granular layer as the unit are counted; the weight information is determined according to the errors of particles and granular layers, where the smaller the error value of a particle or granular layer, the larger its weight value; the testing data set is input into the trained base learners to obtain the prediction results of the testing data set, and the prediction results are then weighted and integrated according to the weight information to output the multi-granularity perception integrated learning prediction results of the users' online behavior data. Based on the users' online behavior data, the application proposes to transform the data from the particle visual field and the granular layer perspective to derive a plurality of data sets with different visual fields, and divides the weights into two levels, granular layer and particle, through the weighted integration strategy, thus improving the interpretability of the users' online behavior analysis and the accuracy of the prediction results.
In order to make the objective, technical scheme and advantages of this application clearer, the application will be further described in detail with the attached drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the application, and are not used to limit the application.
In one embodiment, as shown in
The data in the preprocessed data set includes attribute characteristics, granularity characteristics and particle label values.
The theory of granular computing mainly involves three concepts of granular computing: particle, granularity and granular layer, and the formal descriptions are given below:
The concept of granularity in this application not only aggregates data hierarchically from bottom to top from the point of view of data storage, but also simulates the ability of human beings to recognize things abstractly. The first thing to do is to convert the data into a standard data format suitable for the application through data preprocessing, so that the data set has abstract multi-granularity characteristics and generates particle label values. The multi-granularity characteristics and the particle labels may be obtained by designing a data structure framework before collecting data. For example, when collecting online behavior records, the attributes of the account, department and company to which the online behavior data belong are set, and these attributes may be used as multi-granularity characteristics. In addition, the multi-granularity characteristics and particle label values may also be generated from the user's online behavior data set by hierarchical clustering.
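As a non-limiting sketch of the hierarchical clustering option mentioned above, multi-granularity characteristics and particle label values may be derived by cutting one dendrogram at two different depths; the function name, parameter names and toy data below are illustrative assumptions, not part of the application:

```python
# Sketch: deriving multi-granularity characteristics by hierarchical
# clustering when the data set carries no natural account/department/
# company attributes. All names and parameters here are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def derive_granularity_labels(features, n_fine=6, n_coarse=2):
    """Cut one dendrogram at two depths to obtain a fine granular layer
    (M1-like) and a coarse granular layer (M2-like) of particle labels."""
    Z = linkage(features, method="ward")
    fine = fcluster(Z, t=n_fine, criterion="maxclust")      # finer particles
    coarse = fcluster(Z, t=n_coarse, criterion="maxclust")  # coarser particles
    return fine, coarse

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(8, 1, (20, 3))])
fine, coarse = derive_granularity_labels(X)
```

Each data row thereby receives one particle label per granular layer, which plays the role of the Mk columns described above.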
In this embodiment, the “user online behavior data set” is taken as experimental data, which comes from the competition of “Analysis of abnormal behaviors of users online based on UEBA” under the Datafountain platform. The data description is shown in the following table, in which “account” and “group” are taken as the granularity characteristics of this data set:
Table 2 shows the data set style obtained after preprocessing the data set of users' online behavior:
The serial number set {1, 2, 3, . . . , v, . . . , V} represents the serial numbers of the data set; xvq represents the q-th attribute characteristic value of the v-th data, xvk represents the k-th granularity characteristic value of the v-th data, T={T1, T2, . . . , Tq, . . . , TQ} is the set of attribute characteristics, and M={M1, M2, . . . , Mk, . . . , MK} is the set of granularity characteristics. For example, M1 stands for the granularity characteristics of the "Account" granular layer, M2 for the "Department" granular layer and M3 for the "Company" granular layer. The numbers under the T1 and T2 characteristics in the table indicate that they are numerical characteristics, and the symbols under the Tq and TQ characteristics indicate that they are symbolic characteristics. M1 indicates the finest granular layer abstracted from the data set, M2 indicates the granular layer with larger granularity than M1, and so on, until the maximum granular level required for solving the problem is reached, where 1≤k≤K and |Gi|≥1.
S104, inputting the preprocessed data set into a pre-designed multi-granularity perception data derivation algorithm, performing multi-granularity perception processing according to the characteristic categories of the attribute characteristics and the particle label values through the multi-granularity perception data derivation algorithm to obtain a multi-granularity perception data set, and dividing the multi-granularity perception data set into layers according to the granularity characteristics to obtain a multi-level derivative data set.
The derivative data sets are divided into training data sets and testing data sets; the data in the derivative data set include the derivative attribute values and the particle label values of the corresponding granular layers.
The Multi-granularity Perception Data Derivation Algorithm (MPDDA) essentially provides data diversity. It simulates the process of human cognition of the world and deeply cognizes the data from multi-granularity perspectives and different particle structure perspectives, so that the model of the application is interpretable on the data; the differentiated data processed and derived based on the granularity characteristics and particle structure of the data is beneficial to computer cognition and learning.
The data derived from the original data set include three categories: Q columns of attribute values, the particle labels Mi corresponding to the granular layers, and the result label values. The particle label Mi will be trained and learned by the model as an important characteristic together with the derivative attribute values, and it is the retention of the particle label values that makes the other derivative attribute values meaningful. In the data generated by practical problems, the values derived from some characteristics through multi-granularity perception are meaningless and unexplainable on their own; only when such characteristics appear in the training set together with the granularity characteristics may they be interpreted for training. The result label value of this embodiment represents the abnormal degree of online behavior, and the result label value is used as the optimization goal in supervised learning tasks.
S106, based on a base learning algorithm, training a plurality of preset base learners according to the derivative attribute values of the training data set and the particle label values of the corresponding granular layers, so that the trained base learners are obtained.
The number of base learners is the same as the layer number of derivative data sets.
The base learners in the application may be homogeneous or heterogeneous, and different base learners may be selected according to the actual situation in the application process. The input data of the base learners consist of K derivative data sets. In particular, when k=1, the attribute characteristics of this data set are the same as those of the preprocessed data set. Because the granularity characteristic M1 is merely the data number of the data set and may not promote better learning of the model, the granularity characteristic M1 is not added in the process of training on the first-layer derivative training set.
In the actual process of data processing, it is found that the eigenvalues of some characteristics generated by the multi-granularity perception data derivation algorithm deviate from the corresponding original characteristic connotations, which makes the characteristics difficult to understand; they form new connotations only when bound with the granularity characteristics from which they are derived. Based on this, it needs to be specified that the base learner may be a tree model, and the global normalization operation may be omitted when preprocessing the data for the tree model.
S108, inputting the training data set into the trained base learner, calculating a self-prediction error of the training data set predicted by the trained base learner, and counting a mean square error with the particle as the unit and a mean square error with the granular layer as the unit according to the self-prediction error.
The self-prediction error is calculated by the result label value of the training data set and the output result of the base learner.
The premise that the particle weights obtained from the training data set may be reused in the testing set is that the particle label set of each granular layer in the testing set is the complete set of particle labels of each granular layer in the data set of users' online behavior.
S110, obtaining the particle-level weight according to the mean square error with the particle as the unit, obtaining the granularity-level weight according to the mean square error with the granular layers as the unit, and determining the weight information according to the particle-level weight and the granularity-level weight.
The application provides a weighted integration strategy based on particle mean square error (MSE) optimization. Particle weighting mechanism is to optimize and adjust the prediction effect of each base learner by giving weights to particles in different granular layers, and the particle structure with good prediction effect will be given greater weight, otherwise it will be given less weight. The data objects in each particle share the weight, which may reduce the computational complexity and the possibility of over-fitting. Essentially, the weighted integration strategy of the application optimizes the model from the particle visual field and particle layer perspective.
S112, inputting the testing data set into the trained base learner to obtain the prediction results of the testing data set, and performing weighted integration on the prediction results according to the weight information to output the multi-granularity perception integrated learning prediction results of the user's online behavior data.
In the multi-granularity perception integrated learning method, a preprocessed data set including attribute characteristics, granularity characteristics and particle label values is obtained by preprocessing the data set of users' online behaviors; multi-granularity perception processing is performed according to the characteristic categories of the attribute characteristics and the particle label values through the multi-granularity perception data derivation algorithm, and the data is then divided into granular layers according to the granularity characteristics to obtain multi-level derivative data sets; based on the base learning algorithm, a plurality of preset base learners are trained according to the derivative attribute values of the training data set in the derivative data sets and the particle label values of the corresponding granular layers, and the trained base learners are obtained; the training data set is input into the trained base learners, the self-prediction error is calculated, and the mean square error with the particle as the unit and the mean square error with the granular layer as the unit are counted; the weight information is determined according to the errors of particles and granular layers, where the smaller the error value of a particle or granular layer, the larger its weight value; the testing data set is input into the trained base learners to obtain the prediction results of the testing data set, and the prediction results are then weighted and integrated according to the weight information to output the multi-granularity perception integrated learning prediction results of the users' online behavior data.
Based on the users' online behavior data, the application proposes to transform the data from the particle visual field and the granular layer perspective to derive a plurality of data sets with different visual fields, and divides the weights into two levels, granular layer and particle, through the weighted integration strategy, thus improving the interpretability of the users' online behavior analysis and the accuracy of the prediction results.
In one embodiment, the method further includes the following steps: obtaining the data set of the user's online behavior, and preprocessing the data set; generating the attribute characteristics, the granularity characteristics and the particle label values of data according to attributes in the data structure of the data set to obtain the preprocessed data set; the attributes in the data structure of the data set are an account, a department and a company to which the data belongs; or generating the attribute characteristics, the granularity characteristics and the particle label values of data according to the data set through a hierarchical clustering method to obtain the preprocessed data set.
In one embodiment, the method further includes: inputting the preprocessed data set into a pre-designed multi-granularity perception data derivation algorithm; taking the particle label value as one of the attribute characteristics, discriminating the attribute characteristics of the preprocessed data: if the attribute characteristics are numerical characteristics, the numerical characteristics are normalized within particles, and if the attribute characteristics are symbolic characteristics, the symbolic characteristics are recoded within particles, and a multi-granularity perception data set is obtained; dividing the multi-granularity perception data set into a multi-granularity training set and a multi-granularity testing set; dividing the multi-granularity training set and the multi-granularity testing set by granular layer according to the granularity characteristics to obtain the multi-level training data set and the multi-level testing data set respectively; the training data set and the testing data set constitute a derivative data set.
Specifically, the flow chart of the multi-granularity perception data derivation algorithm is shown in
Intra-granular normalization operation for numerical characteristics and intra-granular recoding for symbolic characteristics are the core algorithms of multi-granularity perception data derivation. The main functions are to realize multi-level perception of data sets through multi-granularity data derivation, and the essence is to normalize or recode the data sets in units of particles, which is equivalent to each particle forming its own system, so that computers may distinguish each data more accurately at each particle level. The subsequent data derivation process is equivalent to expanding the derivative data set corresponding to the granular layer based on the original data set, providing more data and perspectives for the next machine learning.
(1) Intra-granular normalization: the traditional normalization is only a dimensionless method of linear transformation of the data, which may accelerate the gradient descent speed of some machine learning algorithms, but intra-granular normalization is more than that. Intra-granular normalization frames the normalized data range within the particles in different granular layers, and the numerical characteristics in all particles under each granular layer are normalized separately, so as to achieve the data processing purpose of multi-granularity perception of numerical characteristics.
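The intra-granular normalization described above may be sketched as follows, assuming pandas-style data with hypothetical column names; min-max scaling is applied separately inside each particle of a granular layer rather than once over the whole data set:

```python
# Minimal sketch of intra-granular normalization: min-max scaling runs
# inside every particle (group) of a granular layer. Column names are
# illustrative, not taken from the application.
import pandas as pd

def intra_granular_normalize(df, value_col, granule_col):
    """Min-max normalize `value_col` within each particle of the
    granular layer given by `granule_col`."""
    def _minmax(s):
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else s * 0.0
    return df.groupby(granule_col)[value_col].transform(_minmax)

df = pd.DataFrame({
    "account": ["a", "a", "a", "b", "b", "b"],   # granularity characteristic
    "T1": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})
df["T1_m1"] = intra_granular_normalize(df, "T1", "account")
```

Note that the two accounts each map onto the full [0, 1] range, so each particle forms its own reference system, which is the stated purpose of the operation.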
(2) Intra-granular recoding: intra-granular recoding is aimed at the symbolic characteristics in the data set, and it is carried out inside each particle in the different granular layers of the universe. There are two common coding methods in data processing, One-hot Encoding and Label Encoding. One-hot encoding is suitable for non-tree models whose loss functions are sensitive to numerical changes, such as logistic regression and SVM. Label encoding is suitable for tree models whose loss functions are insensitive to numerical changes, such as RF, GBDT, etc. Therefore, it is necessary to judge the type of machine learning model before selecting the coding rule. The data processing objective of intra-granular recoding is to realize multi-granularity perception of symbolic characteristics.
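For the tree-model case, intra-granular recoding with label encoding may be sketched as follows; the same symbol can legitimately receive different codes in different particles, since each particle is encoded on its own. Column names and values are illustrative:

```python
# Sketch of intra-granular recoding for a symbolic characteristic:
# label codes are assigned inside each particle, suiting tree models
# whose loss is insensitive to numeric magnitude. Names are illustrative.
import pandas as pd

def intra_granular_label_encode(df, sym_col, granule_col):
    """Label-encode `sym_col` separately within each particle of the
    granular layer given by `granule_col`."""
    return (df.groupby(granule_col)[sym_col]
              .transform(lambda s: s.astype("category").cat.codes))

df = pd.DataFrame({
    "group": ["g1", "g1", "g1", "g2", "g2"],   # granularity characteristic
    "Tq": ["web", "mail", "web", "ftp", "web"],
})
df["Tq_m1"] = intra_granular_label_encode(df, "Tq", "group")
```

Here "web" is coded 1 inside particle g1 (after "mail") but 1 inside g2 (after "ftp") only by coincidence of alphabetical order; in general the codes are only meaningful together with the particle label, which is why the text insists on retaining the granularity characteristics.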
The detailed flow of the multi-granularity perception data derivation algorithm, i.e. pseudo code, is shown in Algorithm 1:
In one embodiment, the method further includes: taking the weight information as the initial value of the particle swarm algorithm; iterating repeatedly through particle swarm optimization according to the initial value until the end condition is met, and ending the iteration; obtaining the enhanced weight information; inputting a testing data set into a trained base learner to obtain the prediction results of the testing data set; and weighting and integrating the prediction results according to the enhanced weight information.
This embodiment provides an enhancement strategy based on particle swarm optimization. If the accuracy requirement is high but the training time requirement is not strict, the initial weighting strategy may be obtained by the method based on granular MSE optimization and used as the initial input value of particle swarm optimization to speed up the optimization process, and the enhanced weighted integration strategy is obtained after repeated iterations.
Specifically,
SEk,v=(ŷk,v−yv)2
S2 (particle error statistics): calculating the mean square error MSE in the unit of particles to measure the average prediction deviation of particles, where mk,v represents the particle label value of the v-th data in the k-th granular layer, ID(mk,v) represents the numbered set of the data in the k-th granular layer whose particle labels are the same as that of the v-th data, Gik denotes the i-th particle in the k-th granular layer, and |Gik| may be understood as the number of data in the particle, i.e. the granularity. The mean square error of the particle visual field is as follows:
S3 (granular layer error statistics): estimating the prediction deviation of the model from the perspective of granular layer, and also referring to the index of mean square error MSE. If the total data volume of each training set is V, the mean square error MSEk of granular layer perspective may be expressed as follows:
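The error statistics of S2 and S3 above may be sketched together in code: the squared self-prediction errors SE are averaged once per particle and once over the whole granular layer. The helper name and toy values are illustrative assumptions:

```python
# Sketch of S2/S3: mean square error counted per particle (over the
# data sharing a particle label) and per granular layer (over all V
# data), from the squared self-prediction errors SE_{k,v}.
import numpy as np

def particle_and_layer_mse(y_true, y_pred, particle_labels):
    se = (y_pred - y_true) ** 2          # SE_{k,v} for each data point
    layer_mse = se.mean()                # granular layer view (MSE_k)
    particle_mse = {}                    # particle view, one MSE per G_i^k
    for g in np.unique(particle_labels):
        particle_mse[g] = se[particle_labels == g].mean()
    return particle_mse, layer_mse

y = np.array([1.0, 2.0, 3.0, 4.0])
yhat = np.array([1.0, 2.0, 5.0, 4.0])    # one poor prediction in p2
labels = np.array(["p1", "p1", "p2", "p2"])
pmse, lmse = particle_and_layer_mse(y, yhat, labels)
```

The particle view isolates the poorly predicted particle (p2) while the layer view averages the deviation over all V data, matching the two statistics the method counts.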
S4 (MSE-based weight generation strategy): Obviously, the larger the values of MSEk,v and MSEk, the worse the prediction effect of the base learner in the range of particle v or granular layer k. Therefore, the particles and granular layers with large mean square error are given smaller weights, while the particles and granular layers with small values are given larger weights, so as to enhance the overall prediction effect of the model. It should be noted that the first layer is the original data set, and there is no abstract particle structure, so there is no need to calculate particle weights. The granular layer base learner with k≥2 is given the weight w2 as a cognitive whole, while the granular layer base learner with k=1 is given the weight w1 as a whole. Particle weight wk,v and granular layer weight wk are respectively expressed as follows:
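Since the concrete weight expressions of S4 appear in the drawings, the sketch below assumes a simple inverse-MSE normalization that satisfies the stated requirement (the smaller the mean square error, the larger the weight); it is an illustrative assumption, not the patented formula itself:

```python
# Assumed sketch of S4: weights decrease as MSE grows and sum to 1.
# The epsilon guards against division by zero for perfect particles.
import numpy as np

def inverse_mse_weights(mses, eps=1e-8):
    """Map a vector of particle or layer MSE values to normalized
    weights (assumed inverse-error form, not the patented formula)."""
    inv = 1.0 / (np.asarray(mses, dtype=float) + eps)
    return inv / inv.sum()

w = inverse_mse_weights([0.1, 0.2, 0.4])   # toy particle MSE values
```

Any monotone decreasing normalization would serve the same purpose; this one keeps the calculation fast and of low complexity, as the following paragraph notes.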
The weight generation strategy based on MSE has the advantages of fast calculation speed and low calculation complexity. Meanwhile, this embodiment gives another weight enhancement strategy based on particle swarm optimization (as shown in S5), which may improve the prediction effect again, but the calculation complexity increases, so it may be decided whether to adopt this enhancement strategy according to the actual problem, and if not, skip to S6 (weighted integration) directly.
S5 (weight enhancement strategy based on particle swarm optimization): obviously, the weight generation strategy based on MSE in the above S4 is mathematically provable and interpretable, but it may not be able to optimize the ensemble learning model to the most ideal state. Therefore, an optional weight enhancement step is given here, and the particle swarm algorithm is adopted to find the optimal weight distribution strategy of particles and granular layers. In the D-dimensional search space, assuming there are N particles and each particle represents a weight allocation strategy, Xid=(xi1, xi2, . . . , xiD) represents the position of the i-th particle, Vid=(vi1, vi2, . . . , viD) represents the velocity of the i-th particle, the individual optimal solution searched by the i-th particle is Pid,pbest=(pi1, pi2, . . . , piD), the group optimal solution is Pd,gbest=(p1,gbest, p2,gbest, . . . , pD,gbest), fp represents the individual historical optimal fitness value, and fg represents the group historical optimal fitness value.
The core calculation formulas in the whole particle swarm algorithm are the velocity update formula vids+1, the position update formula xids+1 and the fitness function f, which are respectively expressed as follows:
Where s represents the number of iterations, ω is the inertia weight, c1 is the individual learning factor, and c2 is the group learning factor; r1 and r2 are random numbers within [0,1].
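The update rules above may be sketched as a compact particle swarm loop, warm-started from an initial point as the text proposes; the fitness function here is a stand-in quadratic, and the function and parameter names are illustrative:

```python
# Sketch of S5: particle swarm optimization with the stated updates
# v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x), x <- x + v,
# warm-started near x0 (e.g. the MSE-based weights).
import numpy as np

def pso_minimize(f, x0, n_particles=20, iters=50,
                 omega=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    D = len(x0)
    X = x0 + 0.1 * rng.standard_normal((n_particles, D))  # warm start
    V = np.zeros((n_particles, D))
    P = X.copy()
    fp = np.array([f(x) for x in X])                      # personal bests
    g, fg = P[fp.argmin()].copy(), fp.min()               # group best
    for _ in range(iters):
        r1 = rng.random((n_particles, 1))
        r2 = rng.random((n_particles, 1))
        V = omega * V + c1 * r1 * (P - X) + c2 * r2 * (g - X)
        X = X + V
        fx = np.array([f(x) for x in X])
        better = fx < fp
        P[better], fp[better] = X[better], fx[better]
        if fp.min() < fg:
            g, fg = P[fp.argmin()].copy(), fp.min()
    return g, fg

# Stand-in fitness: distance of the weight vector from a target point.
best, val = pso_minimize(lambda w: ((w - 0.3) ** 2).sum(),
                         np.array([0.5, 0.5]))
```

In the method itself the fitness would instead score the weighted-integration prediction error on the training data, with the MSE-based weights of S4 as x0.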
S6 (weighted integration): after using the trained base learners to predict, the output results are combined with the particle weights to complete the final integration calculation, and the symbol ŷv is used to represent the multi-granularity perception integrated learning result:

ŷv=ŷ1,v·w1+(Σk=2K(ŷk,v·wk,v))·(1−w1).
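A minimal sketch of this integration step follows, assuming the combination described in S4: the layer-1 output carries weight w1 and the particle-weighted outputs of layers k≥2 collectively share the remaining 1−w1. Array shapes and values are illustrative:

```python
# Sketch of S6: weighted integration of K base learners' predictions.
import numpy as np

def weighted_integration(preds, particle_weights, w1):
    """preds: (K, V) predictions of K base learners for V test data;
    particle_weights: (K, V) particle weights w_{k,v} (row 0 unused,
    since granular layer 1 has no abstract particle structure)."""
    layers = (preds[1:] * particle_weights[1:]).sum(axis=0)
    return preds[0] * w1 + layers * (1.0 - w1)

preds = np.array([[1.0, 2.0],    # granular layer 1 (original data)
                  [3.0, 4.0],    # granular layer 2
                  [5.0, 6.0]])   # granular layer 3
pw = np.array([[0.0, 0.0],
               [0.5, 0.5],
               [0.5, 0.5]])
out = weighted_integration(preds, pw, w1=0.5)
```

Because each test datum v looks up the particle weight of the particle it falls into, the test particles must appear among the training particles, which is exactly the reuse premise stated earlier.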
It should be understood that although the steps in the flowchart of
In another embodiment, as shown in
In a specific embodiment, the “data set of users' online behavior” as shown in Table 1 above is used as experimental data.
Scoring rules are based on RMSE Score, and the higher the value, the better the prediction effect of the model:
The experiments run on an Intel i7 CPU with 32 GB of memory, and the programming language is Python 3.8. LightGBM, XGBoost and random forest are added into the Multi-Granularity Perceptual Ensemble Learning (GEL) framework for comparative experiments.
In the experiment, three base learners are used to train and predict six patterns:
The experimental results are shown in
First of all, it can be found that in all the experimental results, the prediction effect is XGBoost>LightGBM>Random Forest.
Secondly, in the drawings, single-layer data (granular layer 1) refers to the original data set, while single-layer data (granular layers 2 and 3) refers to the data sets generated by the multi-granularity perception derivation algorithm. By observing the prediction accuracy on these three data sets using the three kinds of base learners respectively, it can be found that the performance of the learners on single-layer data (granular layer 2) is better than that on single-layer data (granular layer 1), which shows the feasibility of using the multi-granularity perception derivation algorithm for data derivation. However, the performance on single-layer data (granular layer 3) is very poor, which shows that the data sets obtained by the multi-granularity perception derivation algorithm may not all achieve good results on the learners.
Finally, the prediction effects of different integration modes are compared: enhanced weighted GEL based on PSO > optimized weighted GEL based on MSE > data merging mode of each granular layer > original data K-Fold mode > average weighted mode of each granular layer.
On the whole, the effect of the particle-weighted integration strategy in GEL is better than the other integration methods, and the enhancement strategy based on PSO does give GEL a better prediction effect.
In one embodiment, as shown in
The preprocessing module 702 is also used for obtaining the data set of the user's online behavior, and preprocessing the data set; generating the attribute characteristics, the granularity characteristics and the particle label values of data according to attributes in the data structure of the data set to obtain the preprocessed data set; the attributes in the data structure of the data set are an account, a department and a company to which the data belongs; or generating the attribute characteristics, the granularity characteristics and the particle label values of data according to the data set through a hierarchical clustering method to obtain the preprocessed data set.
The data derivation module 704 is also used for inputting the preprocessed data set into a pre-designed multi-granularity perception data derivation algorithm; taking the particle label value as one of the attribute characteristics, discriminating the attribute characteristics of the preprocessed data: if the attribute characteristics are numerical characteristics, the numerical characteristics are normalized within particles, and if the attribute characteristics are symbolic characteristics, the symbolic characteristics are recoded within particles; and obtaining a multi-granularity perception data set.
The data derivation module 704 is also used to divide the multi-granularity perception data set into a multi-granularity training set and a multi-granularity testing set; according to the granularity characteristics, the multi-granularity training set and the multi-granularity testing set are divided according to the granular layer, and the multi-level training data set and the multi-level testing data set are obtained respectively; the training data set and the testing data set constitute a derivative data set.
The base learner training module 706 is also used for enhancing the weight information through the particle swarm algorithm to obtain the enhanced weight information; inputting a testing data set into a trained base learner to obtain prediction results of the testing data set; according to the enhanced weight information, the prediction results are weighted and integrated.
The base learner training module 706 is also used to take the weight information as the initial value of the particle swarm algorithm; iterate repeatedly through particle swarm optimization according to the initial value until the end condition is met, and end the iteration; and the enhanced weight information is obtained.
For the specific definition of the multi-granularity perception integrated learning device, please refer to the definition of the multi-granularity perception integrated learning method above, which is not repeated here. Each module in the multi-granularity perception integrated learning device may be realized in whole or in part by software, hardware and their combinations. The above modules may be embedded in or independent of the processor in the computer equipment in the form of hardware, and may also be stored in the memory in the computer equipment in the form of software, so that the processor may call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in
It can be understood by those skilled in the art that the structure shown in
In one embodiment, a computer device is provided, which includes a memory and a processor, wherein the memory stores a computer program, and when the processor executes the computer program, the steps in the above method embodiment are realized.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, realizes the steps in the above method embodiment.
Those skilled in the art can understand that all or part of the processes in the method for realizing the above-mentioned embodiments may be completed by instructing related hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium, and when executed, the computer program may include the processes of the above-mentioned embodiments. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. The non-volatile memory may include read-only memory (ROM), programmable ROM(PROM), electrically programmable ROM(EPROM), electrically erasable programmable ROM(EEPROM) or flash memory. The volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM(SRAM), dynamic RAM(DRAM), synchronous DRAM(SDRAM), double data rate SDRAM(DDRSDRAM), enhanced SDRAM(ESDRAM), synchronous link DRAM (SLDRAM), rambus direct RAM(RDRAM), direct rambus dynamic RAM(DRDRAM), and rambus dynamic RAM(RDRAM).
The technical characteristics of the above embodiments may be combined at will. In order to make the description concise, not all possible combinations of the technical characteristics in the above embodiments are described. However, as long as there is no contradiction between the combinations of these technical characteristics, they should be considered as the scope recorded in this specification.
The above-mentioned embodiments only express several implementations of the present application, and their descriptions are more specific and detailed, but they cannot be understood as limiting the scope of application patents. It should be pointed out that for those skilled in the art, without departing from the concept of this application, several modifications and improvements may be made, which are within the protection scope of this application. Therefore, the scope of protection of the patent in this application shall be subject to the claims.
Number: 202210590822.4 | Date: May 2022 | Country: CN | Kind: national