The present disclosure belongs to the technical field of text classification with noisy labels, and in particular relates to a taxpayer industry classification method based on label-noise learning.
Research on the industry classification of enterprise taxpayers is critical: it is the basic work of tax source classification management, a necessary prerequisite for raising the electronic level of tax file management and implementing information-based tax administration, and a foundation for industry modeling and for tax source classification monitoring, early warning and analysis. The existing classification of taxpayers' industries is mainly performed manually. Limited by the professional knowledge and experience of the reporting personnel, it often produces wrong classifications, which introduces considerable noise into the enterprise taxpayer industry labels. Wrong enterprise industry classification has a series of adverse effects on national statistics, taxation, business administration and other work. With the increasing volume and complexity of taxpayer data, correctly classifying taxpayers' industries from the existing industry classification data with noisy labels, by means of big data analysis and machine learning, has become an urgent problem. It is of great significance to identify and correct inconsistencies between a taxpayer's existing business scope and industry category, and to provide auxiliary recommendations for the industry classification of newly established enterprises.
At present, no existing research proposes a solution for taxpayer industry classification based on noisily labeled data. Patents related to taxpayer industry classification mainly include:
1: A taxpayer industry two-level classification method based on a MIMO recurrent neural network (201910024324.1)
2: Enterprise industry classification method (201711137533.4)
Literature 1 proposes a two-level taxpayer industry classification method based on a MIMO recurrent neural network. A MIMO GRU neural network using 2-dimensional text features and 13-dimensional non-text features is constructed as the basic model, the basic models are grouped and fused according to the mapping relationship between industry categories and industry details, and taxpayer industries are classified through the fused model.
Literature 2 proposes an enterprise industry classification method based on a semi-supervised graph-splitting clustering algorithm and gradient boosting decision trees. The semi-supervised graph-splitting clustering algorithm is used to extract keywords of the enterprise's main business, and gradient boosting decision trees are used to train cascade classifiers to realize enterprise industry classification.
The above technical solutions all train their classification models on the premise that the industry labels of the training data are accurate. In reality, however, limited by the professional knowledge and experience of the reporting personnel, the taxpayer industry category labels in the existing database contain a lot of noise. If such data is used directly for model training, the accuracy of industry classification drops sharply. Therefore, how to construct a noise-robust taxpayer industry classification model based only on the existing noisily labeled data has become an urgent problem to be solved.
The present disclosure aims to provide a taxpayer industry classification method based on label-noise learning. Firstly, a text information encoder extracts text information to be mined from taxpayer industry information for text embedding, and performs feature processing on the embedded information; a non-text information encoder extracts non-text information from the taxpayer industry information for encoding; a network construction processor constructs a BERT-CNN (Bidirectional Encoder Representations from Transformers combined with a Convolutional Neural Network) deep network structure suited to the taxpayer industry classification problem, and determines the number of layers of the network and the number of neurons and the input and output dimensionality of each layer according to the processed feature information and the number of target categories; a network pre-training processor pre-trains the constructed network through contrastive learning, nearest neighbor semantic clustering and self-labeling learning in turn; a robust training processor adds a noise modeling layer on top of the constructed deep network, models the label noise distribution through network self-confidence and the noisy label information, and performs model training based on the noisily labeled data; a classifier takes the deep network before the noise modeling layer as the classification model and classifies taxpayer industries based on this model.
In order to achieve the above purpose, the present disclosure adopts the following technical solution:
A taxpayer industry classification method based on label-noise learning, used for handling noisily labeled data, comprises the following steps:
Extracting, by a text information encoder, text information to be mined from taxpayer industry information for text embedding, and performing feature processing on the embedded information.
Extracting, by a non-text information encoder, non-text information from the taxpayer industry information for encoding.
Constructing, by a network construction processor, a BERT-CNN deep network structure suited to the taxpayer industry classification problem, and determining the number of layers of the network and the number of neurons and the input and output dimensionality of each layer according to the feature information processed in the previous steps and the number of target categories.
Pre-training, by a network pre-training processor, the network constructed in the previous step through contrastive learning, nearest neighbor semantic clustering and self-labeling learning in turn.
Adding, by a robust training processor, a noise modeling layer on top of the constructed deep network, modeling the label noise distribution through network self-confidence and the noisy label information, and performing model training based on the noisily labeled data.
Taking, by a classifier, the deep network before the noise modeling layer as a classification model, and classifying taxpayer industries based on the model.
Compared with the prior art, the present disclosure has the following beneficial effects:
The taxpayer industry classification method based on label-noise learning provided by the present disclosure makes full use of the existing taxpayer enterprise registration information, improves on the existing classification methods, and builds a noise-robust taxpayer industry classification model based only on the existing noisily labeled data without additional labeling. Compared with the prior art, the present disclosure has the following advantages:
(1) The present disclosure learns the classification model directly from the noisy data in the existing enterprise registration information, whereas the prior art usually needs additional accurately labeled data. The noisy labels in the enterprise registration information are used directly as sample labels for model training, which saves data labeling cost.
(2) The present disclosure mines features and the relationships between features by means of contrastive learning, nearest neighbor semantic clustering and self-labeling learning, and makes full use of the feature similarity between samples of the same category. Unlike prior art methods that learn directly from the original features, the present disclosure avoids the interference of shallow features, mines more information from deep features, and improves classification accuracy.
(3) The present disclosure provides a noise modeling method, in which a clustering noise modeling layer is constructed based on similar features mined in the previous step, and noisy label information is added into the clustering network through the clustering noise modeling layer, thus improving the clustering accuracy; subsequently, a classification noise modeling layer and a classification permutation matrix layer are constructed based on the clustering results, and the classification model is trained based on the constructed classification noise modeling layer and classification permutation matrix layer, which effectively reduces the adverse effects of noise on the classification network training, ensures the noise robustness of the taxpayer classification network, and improves the classification accuracy with noisy labeled data.
The present disclosure will be further described in detail with reference to the following drawings:
The information of taxpayers registered from 2017 to 2019 in the national tax database of a certain area is selected, covering 97 industry categories. With reference to the drawings, the present disclosure will be further described in detail in combination with experimental cases and specific embodiments. All technologies realized based on the content of this application belong to the scope of this application.
As shown in the accompanying drawings, the method includes the following steps:
Step 1. Taxpayer Text Information Processing
A lot of useful information in the taxpayer information registration form is stored in the database as string text. Five columns {taxpayer's name, main business, part-time business, mode of operation, business scope} are extracted from the registered taxpayer information table and the registered taxpayer information expansion table as text features. The implementation process of text feature processing by the text information encoder is shown in the accompanying drawings and includes the following steps.
S101. Text Information Standardization
The required taxpayer text information is screened from the taxpayer registration information table, and the special symbols, numbers and quantifiers in the text information are deleted;
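As an illustration only, the cleaning of step S101 could be sketched in Python roughly as follows; the regular expression and the quantifier character set are assumptions for demonstration, not the exact rules used by the disclosure:

    import re

    # illustrative set of Chinese quantifier characters to strip; the real list is an assumption here
    QUANTIFIERS = "个件家只条台"

    def standardize_text(text: str) -> str:
        # drop everything that is not a Chinese character (special symbols, digits, letters)
        text = re.sub(r"[^\u4e00-\u9fa5]", "", text)
        # drop the quantifier characters
        return re.sub(f"[{QUANTIFIERS}]", "", text)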
S102. BERT Text Encoding
Text feature generation mainly includes the following steps: adding clause marks before and after the text information, processing control characters other than blank characters, replacement characters and blank characters in the text, dividing sentences by words and removing spaces and non-Chinese characters, and encoding the text information by a BERT pre-training model;
S103. Text Feature Matrix Generation
The embedded vectors after word encoding are spliced into a text feature matrix.
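A minimal sketch of steps S102 and S103 using the Hugging Face transformers library is shown below; the model name bert-base-chinese and the sequence length of 20 are illustrative assumptions rather than the exact configuration of the disclosure:

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")

    def encode_text(text: str, max_len: int = 20) -> torch.Tensor:
        # adds clause marks ([CLS]/[SEP]), pads or truncates to max_len positions
        inputs = tokenizer(text, max_length=max_len, padding="max_length",
                           truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = bert(**inputs)
        # splice the per-character embedding vectors into a max_len x 768 text feature matrix
        return outputs.last_hidden_state.squeeze(0)

    matrix = encode_text("某某科技有限公司")   # shape: (20, 768)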
In this embodiment, the taxpayer's name in the example contains the letters "VR" and a special symbol. After step S101, the special symbol is deleted and the name is split into individual characters; the encoding length is selected to be 768 dimensions, and the characters are encoded by a BERT pre-training model (S102).
Step 2. Taxpayer Non-Text Information Processing
Besides text information, the taxpayer registration information database also includes some non-text information, which has more intuitive features. This non-text information is also of great value for taxpayer industry classification, clustering and anomaly detection.
As shown in the accompanying drawings, the non-text information encoder processes the non-text information through the following steps.
S201. Numerical Feature Standardization
The information of registered taxpayers and the expanded information table of registered taxpayers in the taxpayer industry information database are queried, nine columns {registered capital, total investment, number of employees, number of foreigners, number of partners, fixed number, proportion of natural person investment, proportion of foreign investment and proportion of state-owned investment} are selected as numerical features, and z-score processing is carried out on the above nine columns.
In this embodiment, firstly, the sample means μ1, μ2, . . . , μ9 and sample standard deviations σ1, σ2, . . . , σ9 of the above nine columns of features are calculated, and Xi denotes the value of the i-th numerical feature of a sample X. Each of the nine columns is then mapped by the z-score formula

Xi′ = (Xi − μi) / σi  (i = 1, 2, . . . , 9)

to realize the standardization of the numerical features (S201).
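For illustration, assuming the nine numerical columns are held in a NumPy array, the standardization of S201 could look like this:

    import numpy as np

    def zscore_standardize(features: np.ndarray) -> np.ndarray:
        # features: (n_samples, 9) matrix of the nine numerical columns
        mu = features.mean(axis=0)            # sample means mu_1 .. mu_9
        sigma = features.std(axis=0)          # sample standard deviations sigma_1 .. sigma_9
        return (features - mu) / (sigma + 1e-12)   # per-column z-score, guarded against zero variance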
S202. One-Hot Encoding of Categorical Features
The information of registered taxpayers and the expanded information table of registered taxpayers in the taxpayer industry information database are queried, seven columns {registration type, head office mark, whether it is a national and local tax condominium, license category code, industry detail code, whether it is engaged in industries restricted or prohibited by the state, and electronic invoice enterprise mark} are selected as categorical features, and one-hot encoding processing is carried out on the above seven columns.
In this embodiment, the feature of the head office mark is taken as an example. Firstly, the value range of the head office mark is determined; there are three head office mark values {head office, non-head office, branch office}, so a 3-bit register is set to encode them. Then {head office, non-head office, branch office} are mapped to the three register codes {001, 010, 100} respectively. Finally, according to this mapping rule, all values in the head office mark column are encoded (S202).
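A minimal sketch of the one-hot encoding in S202; the helper function is illustrative, and the bit ordering of the register codes is an assumption:

    def one_hot_encode(values, categories):
        # categories: ordered list of possible values,
        # e.g. ["head office", "non-head office", "branch office"]
        codes = []
        for v in values:
            code = [0] * len(categories)
            code[categories.index(v)] = 1   # one register bit per category value
            codes.append(code)
        return codes

    # usage: encodes to [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
    one_hot_encode(["head office", "branch office", "non-head office"],
                   ["head office", "non-head office", "branch office"])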
S203. Feature Mapping
After the non-text features are processed in steps S201 and S202, feature vectors are obtained, and these feature vectors are mapped and spliced by linear layers to obtain a complete non-text feature matrix.
Specifically, in this embodiment, firstly, the standardized numerical features are mapped into 768-dimensional feature vectors by constructing a 1×768-dimensional linear layer. Then, the register dimensions of the different categorical features are compared, and the maximum dimension is found to be 264; codes with fewer than 264 dimensions are zero-padded to 264 dimensions. Finally, a 264×768-dimensional linear layer is constructed to map the categorical feature codes to 768 dimensions, and the vectors output by the two linear layers are spliced to obtain the non-text feature vector matrix (S203).
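A rough PyTorch sketch of the feature mapping in S203, under the dimensions stated above (9 numerical features, 7 categorical features padded to 264-dimensional codes); the module name is an illustrative assumption:

    import torch
    import torch.nn as nn

    class NonTextMapper(nn.Module):
        def __init__(self):
            super().__init__()
            self.num_map = nn.Linear(1, 768)    # maps each standardized numerical value to 768 dims
            self.cat_map = nn.Linear(264, 768)  # maps each zero-padded categorical code to 768 dims

        def forward(self, num_feats, cat_codes):
            # num_feats: (batch, 9), cat_codes: (batch, 7, 264)
            num_vecs = self.num_map(num_feats.unsqueeze(-1))   # (batch, 9, 768)
            cat_vecs = self.cat_map(cat_codes)                  # (batch, 7, 768)
            return torch.cat([num_vecs, cat_vecs], dim=1)       # (batch, 16, 768) non-text feature matrix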
Step 3: Constructing a Taxpayer Industry Classification Network (BERT-CNN)
A BERT-CNN network has a four-layer structure: the input layer is divided into a text feature encoding part and a non-text feature mapping part; the second layer is a convolutional neural network layer used for feature mining and extraction; the third layer applies max-pooling to the output of the second layer; the output layer is a fully connected layer with softmax. The network is built by the network construction processor.
In this embodiment, firstly, a 768-dimensional BERT encoding part, a 1×768-dimensional numerical feature mapping linear layer and a 264×768-dimensional categorical feature mapping linear layer are used as the first layer. For the BERT encoding part, there are five text features {taxpayer name, main business, part-time business, mode of operation, business scope}, and the dimensions of their feature matrices are set to {20×768, 20×768, 20×768, 10×768, 100×768}. Taking the taxpayer's name as an example, the output is set as a 20×768-dimensional matrix; names with fewer than 20 characters after segmentation are zero-padded for alignment, and names with more than 20 characters are truncated. The numerical feature mapping linear layer outputs a 9×768-dimensional matrix and the categorical feature mapping linear layer outputs a 7×768-dimensional matrix; in this example, the three matrices are spliced into a 36×768-dimensional matrix as the output of this layer.
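A minimal PyTorch sketch of the four-layer BERT-CNN structure described above; the number of convolution filters (256) and the kernel height (3) are illustrative assumptions, and the 36×768 input corresponds to the example in this embodiment:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BertCNN(nn.Module):
        def __init__(self, num_classes=97, emb_dim=768, num_filters=256, kernel_height=3):
            super().__init__()
            # second layer: convolution over the rows of the (36 x 768) spliced feature matrix
            self.conv = nn.Conv2d(1, num_filters, kernel_size=(kernel_height, emb_dim))
            # output layer: fully connected layer with softmax over the 97 industry categories
            self.fc = nn.Linear(num_filters, num_classes)

        def forward(self, feature_matrix):
            # feature_matrix: (batch, 36, 768), output of the first (encoding/mapping) layer
            x = self.conv(feature_matrix.unsqueeze(1))          # (batch, 256, 34, 1)
            x = torch.relu(x).squeeze(-1)                        # (batch, 256, 34)
            x = F.max_pool1d(x, x.size(-1)).squeeze(-1)          # third layer: max-pooling -> (batch, 256)
            return torch.softmax(self.fc(x), dim=1)              # (batch, 97) class probabilities

    # usage on a dummy batch
    probs = BertCNN()(torch.randn(2, 36, 768))    # probs.shape == (2, 97)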
Step 4. BERT-CNN Network Pre-Training Based on Nearest Neighbor Semantic Clustering
The BERT-CNN network pre-training based on nearest neighbor semantic clustering is divided into three steps: contrastive learning, nearest neighbor semantic clustering and self-labeling learning. Following the idea that similar samples have similar feature representations, the network pre-training processor first masks the samples to construct contrast samples and implements contrastive learning by minimizing the distance between the network feature representations of the original samples and the contrast samples; secondly, several nearest neighbors of each sample are selected according to the network feature representation, and nearest neighbor semantic clustering is carried out by minimizing the distance between the network feature representations of nearest neighbors; finally, samples with high confidence are selected as prototype samples, and self-labeling learning is carried out based on the cluster labels of the prototype samples.
In this embodiment, the data set is divided into a training set, a validation set and a test set in the proportion 8:1:1. The training set is used for network training, the validation set is used to select the trained model, and the test set is used to evaluate the model. The specific training process is as follows.

Firstly, the feature matrix of the sample X encoded by the input layer is denoted as SX; each row vector of SX corresponds to a character in the text features or to one non-text feature, that is, each row vector corresponds to an original feature. A number h ∈ {1, 2, . . . , 10} is randomly selected, h rows of SX are randomly set to zero vectors to construct a contrast sample, and the masked matrix is denoted as ψ(SX). The parameters of the first three layers of the network are denoted as θ, and the outputs of the third layer for the two inputs are the vectors fθ(SX) and fθ(ψ(SX)). Back propagation is carried out with the distance between the two representations,

Lcon = Σ_X ‖fθ(SX) − fθ(ψ(SX))‖²,

as the training objective to realize contrastive learning, and the 20 nearest neighbors of each sample are then calculated according to the Euclidean distance between the output vectors of the third layer for subsequent training.

Secondly, let D denote the training set, X a sample in D, and N_X the nearest neighbor set of X; η denotes the network parameters, gη(X) is the probability vector output after the sample X is mapped through the network, and gη^c(X) is the estimated probability that the sample X belongs to the c-th class. Back propagation is carried out with

Lclu = −(1/|D|) Σ_{X∈D} Σ_{k∈N_X} log⟨gη(X), gη(k)⟩

as the optimization objective to realize nearest neighbor semantic clustering.

Finally, let D′ denote the set of high-confidence prototype samples, |D′| the number of elements in D′, Xi a sample in D′, y′i the cluster to which Xi belongs, and ŷ′i the indicator vector obtained by one-hot encoding y′i, i = 1, . . . , |D′|. Back propagation is carried out with the cross-entropy

Lself = −(1/|D′|) Σ_{i=1}^{|D′|} ŷ′i · log gη(Xi)

as the optimization objective to realize self-labeling learning, and a clustering network is obtained (S403).
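For illustration, the three pre-training objectives could be written in PyTorch roughly as below; the exact loss forms follow the reconstruction above and are assumptions consistent with the stated definitions, not a verbatim copy of the original implementation:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(f_orig, f_masked):
        # squared Euclidean distance between representations of each sample and its masked contrast sample
        return ((f_orig - f_masked) ** 2).sum(dim=1).mean()

    def neighbor_cluster_loss(p_anchor, p_neighbor):
        # p_anchor / p_neighbor: (batch, 97) class-probability vectors of samples and their nearest neighbors
        dot = (p_anchor * p_neighbor).sum(dim=1).clamp_min(1e-8)
        return -torch.log(dot).mean()

    def self_label_loss(logits, pseudo_labels):
        # cross-entropy against the cluster pseudo-labels of the high-confidence prototype samples
        return F.cross_entropy(logits, pseudo_labels)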
Step 5. BERT-CNN Network Training Based on Label Noise Distribution Modeling
The BERT-CNN network training based on label noise distribution modeling includes constructing a cluster noise modeling layer, pre-training the cluster noise modeling layer, training the clustering network based on the cluster noise modeling layer, generating a classification permutation matrix, generating a classification noise modeling matrix, transferring the clustering network to a classification network, constructing the classification noise modeling layer and training the classification network.
In this embodiment, a robust training processor constructs a 97×97 transfer matrix T, which is added as an additional cluster noise modeling layer on top of the current clustering network (S501). The noise modeling layer is pre-trained with the noisy label information, and the clustering network is then trained by back propagation, with the fit between the noisy labels and the network predictions passed through the noise modeling layer as the optimization objective (S503). Based on the clustering results, a classification permutation matrix and a classification noise modeling matrix are generated, the clustering network is transferred to a classification network, and a classification noise modeling layer is constructed; the loss computed through the classification noise modeling layer and the classification permutation matrix is used as the training objective of the classification network, and the final classification network is obtained by back propagation (S507).
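As a hedged illustration of the noise modeling idea (a learnable 97×97 transition matrix placed after the classification probabilities and trained against the noisy labels), one possible PyTorch sketch follows; the parameterization and initialization are assumptions, not the patented construction itself:

    import torch
    import torch.nn as nn

    class NoiseModelingLayer(nn.Module):
        """Learnable 97x97 transition matrix T mapping predicted clean-class
        probabilities to the distribution of the observed noisy labels."""
        def __init__(self, num_classes=97):
            super().__init__()
            # initialize near the identity so the layer starts by trusting the network's own predictions
            self.logits = nn.Parameter(torch.eye(num_classes) * 5.0)

        def forward(self, clean_probs):
            T = torch.softmax(self.logits, dim=1)   # each row sums to 1: P(noisy = j | clean = i)
            return clean_probs @ T                   # (batch, 97) distribution over noisy labels

    def noisy_label_loss(clean_probs, noisy_labels, noise_layer):
        # negative log-likelihood of the observed noisy labels under the noise-adapted distribution
        noisy_probs = noise_layer(clean_probs).clamp_min(1e-8)
        return -torch.log(noisy_probs.gather(1, noisy_labels.unsqueeze(1))).mean()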
Step 6. Taxpayer Industry Classification
As shown in the accompanying drawings, the classifier takes the deep network before the noise modeling layer as the classification model and classifies taxpayer industries based on this model.
Specifically, in this embodiment, a test set sample X is input into the network to obtain a 97-dimensional classification probability vector gη(X) (S601), and the industry category corresponding to the largest component of gη(X) is taken as the predicted industry of the sample.
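Continuing the BertCNN sketch above, inference for a batch of test samples could look like the following; the model instance and the input tensor are illustrative placeholders:

    import torch

    model = BertCNN()                             # classification network (noise modeling layer removed)
    model.eval()
    with torch.no_grad():
        probs = model(torch.randn(4, 36, 768))        # 97-dimensional probability vector g_eta(X) per sample
        predicted_industry = probs.argmax(dim=1)      # index of the most probable industry category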
The steps of the method or algorithm described with reference to the disclosure of the embodiments of the present disclosure can be implemented in hardware or by a processor executing software instructions. The software instructions can be composed of corresponding software modules, which can be stored in a Random Access Memory (RAM), a flash memory, a Read-Only Memory (ROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from and write information to the storage medium. Of course, the storage medium can also be an integral part of the processor. The processor and the storage medium may be located in an ASIC. In addition, the ASIC may be located in a node device (such as the above processing node). Of course, the processor and the storage medium can also exist in the node device as discrete components.
The present disclosure can be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium loaded with computer-readable program instructions for causing a processor to implement aspects of the present disclosure. The computer-readable storage medium can be a tangible device that can hold and store instructions used by an instruction execution device. The computer-readable storage medium can be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include portable computer disks, hard disks, random access memories (RAM), read-only memories (ROM), erasable programmable read-only memories (EPROM or flash memory), static random access memories (SRAM), portable compact disk read-only memories (CD-ROM), digital versatile disks (DVD), memory sticks and floppy disks. The computer-readable storage medium used here is not to be interpreted as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through optical fiber cables), or electrical signals transmitted through electric wires. The computer-readable program instructions described here can be downloaded from computer-readable storage media to various computing/processing devices, or downloaded to external computers or external storage devices through networks such as the Internet, local area networks, wide area networks and/or wireless networks. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing/processing device.
Foreign Application Priority Data: Application No. 202110201214.5, Mar 2021, CN (national).
The present application is a continuation of International Application No. PCT/CN2021/079378, filed on Mar. 5, 2021, which claims priority to Chinese Application No. 202110201214.5, filed on Feb. 23, 2021, the contents of both of which are incorporated herein by reference in their entireties.
Related U.S. Application Data: Parent application PCT/CN2021/079378, Mar 2021 (US); child application 17956879 (US).