The present application relates to the technical field of text analysis, for example, a text classification method, an electronic device, and a computer-readable storage medium.
In the field of text analysis, text classification has long been a research focus. A relatively large amount of research addresses the classification of common text (for example, text in categories such as finance and economics, entertainment, and sports), but relatively little research addresses the classification of illegal or politically sensitive articles. The field of text classification offers many traditional classification methods and learning algorithms, for example, the support vector machine (SVM), the k-nearest neighbor (KNN) algorithm, and random forests, as well as neural network classification methods that have become popular in recent years. In the related technology, a model is established using an algorithm based on text-feature words to classify text. However, the related technology can provide merely a probability value regarding the text and cannot determine the category of an article based on a certain word.
The present application provides a text classification method, an electronic device, and a computer-readable storage medium to overcome the deficiencies existing in the preceding related technology.
The present application provides a text classification method. The method includes the steps below.
The present application further provides a text classification method. The method includes the steps below.
In response to a determination result that the to-be-tested text contains the sensitive word, the text category of the to-be-tested text is determined based on the sensitive word contained in the to-be-tested text.
The present application further provides an electronic device including a processor and a memory.
The memory is configured to store a program.
When the program is executed by the processor, the processor implements any preceding text classification method.
The present application further provides a computer-readable storage medium storing computer-executable instructions for executing any preceding text classification method.
The technical solutions in the embodiments of the present application are described hereinafter clearly and completely in connection with the drawings in the embodiments of the present application. The described embodiments are part, not all, of the embodiments of the present application.
A text classification method is provided. The method includes the steps below.
When a sensitive word is detected through the AC automaton in step 2, first a trie is created by using a sensitive-word dictionary. In this embodiment, the trie is created with an example in which a dictionary includes multiple words [ ]. As shown in
The sensitive-word dictionary may be created by customization. Alternatively, a built-in dictionary may be used as the sensitive-word dictionary.
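The trie creation described above, together with the failure links that an AC automaton adds on top of the trie, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: the class name `ACAutomaton`, its method names, and the list-based node layout are hypothetical choices made for this sketch.

```python
from collections import deque


class ACAutomaton:
    """Minimal Aho-Corasick automaton (illustrative sketch, hypothetical API)."""

    def __init__(self, words):
        # Node 0 is the root. Each node has a child map, a failure link,
        # and the list of dictionary words recognized at that node.
        self.children = [{}]
        self.fail = [0]
        self.outputs = [[]]
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        # Walk down the trie, creating nodes for characters not yet present.
        node = 0
        for ch in word:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.outputs.append([])
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.outputs[node].append(word)

    def _build_failure_links(self):
        # Breadth-first traversal; children of the root fail to the root.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # Inherit the words recognized at the failure target.
                self.outputs[child] = (
                    self.outputs[child] + self.outputs[self.fail[child]]
                )
                queue.append(child)

    def search(self, text):
        """Return (start_index, word) for every dictionary word found in text."""
        matches = []
        node = 0
        for i, ch in enumerate(text):
            # Follow failure links until a node with a matching child is found.
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for word in self.outputs[node]:
                matches.append((i - len(word) + 1, word))
        return matches
```

For instance, with the English sample words `["he", "she", "his", "hers"]` (chosen only for illustration), `ACAutomaton(...).search("ushers")` returns `[(1, "she"), (2, "he"), (2, "hers")]`. The text is scanned once, and failure links let the automaton continue from the longest matching suffix instead of restarting at the root.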
When a Chinese character string, for example "", is input, "" serves as a match. The matching path is shown in
With node "", node "", and node "" being child nodes of the root node, the character string "" is input and traversed. The first four characters "", "", "", and "" do not match any node. "" in the character string matches node "". Since node "" and node "" are the next nodes of node "", "" in the character string matches node "". Since node "" is the next node of node "", "" in the character string matches node "", and the maximum length of this path is then reached. Since it is contained in the dictionary, "" serves as a match. The traversal then jumps to the position of the failure link of node "". However, since the character after "" in the character string "" is "", the failure link of node "" points to the root node. Finally, "" serves as a match.
Detection of illegal content through a recurrent neural network in step 3 mainly includes two parts. As shown in
A dictionary and the tagged training data can be used to train the model. The dictionary may include as many words as possible, covering both illegal words and normal words. The tags carried by the training data need to be accurate, so the training data may be tagged manually to guarantee accuracy. In model training, for each article in the training data, the words that belong to the dictionary are looked up, and the word frequency vector of these words is used as the input vector for training.
Dictionary: {illegal, politically, reactionary, prohibited, legal}
Training text: "Some website is an illegal website containing politically reactionary content. The access to the website is prohibited in China."
Text tag: [0, 1, 0, 0] ([1, 0, 0, 0] denotes normal text; [0, 1, 0, 0] denotes politically reactionary text; [0, 0, 1, 0] denotes pornographic text; and [0, 0, 0, 1] denotes the text of another type.)
Text vector: [1, 1, 1, 1, 0] (The first number 1 represents that "illegal" in the dictionary occurs once in the text; the second number 1 represents that "politically" in the dictionary occurs once in the text; and other numbers can also be explained in this manner.)
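The construction of the text vector from the dictionary and the training text can be sketched in Python as follows. The function name `word_frequency_vector` and the regular-expression tokenization are assumptions made for this English example; Chinese text would first need a word segmenter, which is not shown here.

```python
import re


def word_frequency_vector(text, dictionary):
    """Count how often each dictionary word occurs in the text as a whole word."""
    # Assumption: lowercase alphabetic tokens are adequate for this English example.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(word) for word in dictionary]


dictionary = ["illegal", "politically", "reactionary", "prohibited", "legal"]
training_text = (
    "Some website is an illegal website containing politically "
    "reactionary content. The access to the website is prohibited in China."
)

text_vector = word_frequency_vector(training_text, dictionary)
text_tag = [0, 1, 0, 0]  # politically reactionary text, per the tag scheme above

print(text_vector)  # [1, 1, 1, 1, 0]
```

Whole-word counting matters here: "legal" is counted zero times even though it is a substring of "illegal", which agrees with the last entry of the text vector.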
The tagged text vector is input into a recurrent neural network to train the recurrent neural network. Then the trained model is output.
After the model training is completed, illegal content is detected based on steps in
For example, {'probe_dist': {
Based on the scores in the preceding scoring result, the article is determined to be a politics-related article.
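One plausible way to turn such a scoring result into a category is to select the category with the highest score, as in the Python sketch below. The function name `pick_category`, the category names, and the score values are hypothetical and are not taken from the scoring result of the present application.

```python
def pick_category(scores):
    """Return the category whose score is highest."""
    return max(scores, key=scores.get)


# Hypothetical scoring result; names and values are illustrative only.
example_scores = {
    "normal": 0.05,
    "politics": 0.90,
    "pornography": 0.03,
    "other": 0.02,
}

print(pick_category(example_scores))  # politics
```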
1. Test text
["Taiwan independence": "politically sensitive",
"Democratic Progressive Party": "politically sensitive",
"Kuomintang": "politically sensitive"]
3. Test results
Sensitive words contained in the text can be identified accurately through the sensitive-word detection function. Based on the identified sensitive words, the articles are determined to be politically sensitive articles. Sensitive words in other categories can also be identified accurately, and the corresponding categories can be determined.
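The step from identified sensitive words to a text category can be sketched in Python as follows. This is a simplified illustration: a plain substring scan stands in for the AC automaton, the function name `classify_by_sensitive_words` is hypothetical, and choosing the most frequent category among the hits is an assumption rather than a rule stated in the present application.

```python
from collections import Counter

# Sensitive-word dictionary from the test above: word -> category.
SENSITIVE_WORDS = {
    "Taiwan independence": "politically sensitive",
    "Democratic Progressive Party": "politically sensitive",
    "Kuomintang": "politically sensitive",
}


def classify_by_sensitive_words(text, sensitive_words):
    """Return (category, hits); category is None when no sensitive word occurs."""
    # Naive substring scan; a real system would use the AC automaton instead.
    hits = [word for word in sensitive_words if word in text]
    if not hits:
        return None, hits
    # Assumption: the category supported by the most hits wins.
    counts = Counter(sensitive_words[word] for word in hits)
    return counts.most_common(1)[0][0], hits
```

For a hypothetical input such as `"The article mentions the Kuomintang."`, the function returns the category `"politically sensitive"` together with the hit list `["Kuomintang"]`; for text without any dictionary word it returns `(None, [])`.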
In the method of the present application, for detection of sensitive words, no model needs to be created, and only programming is required. For identification and classification of illegal content, a model may be created. The data used for creating the model are as below.
2.1. Test text
2.2. Test results
The definitions of the accuracy rate, the precision rate, the recall rate, and the F1 score are described below.
Reference is made to a confusion matrix before each indicator is introduced. For a binary classification problem, four situations occur when predicted results and actual results are combined in pairs.
Since the representation by numbers 1 and 0 does not facilitate reading, T (True) denotes correctness, F (False) denotes incorrectness, P (Positive) denotes 1, and N (Negative) denotes 0. A predicted result (P|N) is viewed first; and then a determination result is given based on the comparison of a predicted result and an actual result. Based on the preceding logic, the table below is obtained after redistribution.
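TP, FP, FN, and TN may be understood as below.
TP (true positive): the predicted result is positive and the prediction is correct, that is, the actual result is positive.
FP (false positive): the predicted result is positive but the prediction is incorrect, that is, the actual result is negative.
FN (false negative): the predicted result is negative but the prediction is incorrect, that is, the actual result is positive.
TN (true negative): the predicted result is negative and the prediction is correct, that is, the actual result is negative.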
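One way to make the four counts concrete is the short Python sketch below, which tallies them from paired actual and predicted labels. The function name `confusion_counts` and the sample labels are hypothetical and serve only as an illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, and TN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1  # predicted positive, actually positive
        elif p == 1 and a == 0:
            fp += 1  # predicted positive, actually negative
        elif p == 0 and a == 1:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# Hypothetical labels for illustration only.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]

print(confusion_counts(actual, predicted))  # (2, 1, 1, 2)
```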
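The accuracy rate is the percentage of correctly predicted results among all samples. The expression of the accuracy rate is as below.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
The precision rate, in terms of the predicted results, refers to the probability that a sample among all the samples predicted to be positive is actually positive. The expression of the precision rate is as below.
$$\text{Precision} = \frac{TP}{TP + FP}$$
The recall rate, in terms of original samples, refers to the probability that a sample among all the actually positive samples is predicted to be positive. The expression of the recall rate is as below.
$$\text{Recall} = \frac{TP}{TP + FN}$$
The F1 score is the harmonic mean of the precision rate and the recall rate. The expression of the F1 score is as below.
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$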
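The four indicators can be computed from the TP, FP, FN, and TN counts, as in the following Python sketch, which uses the standard definitions of these indicators. The function name `classification_metrics`, the sample counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Accuracy: correctly predicted samples over all samples.
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: actually positive samples among those predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: samples predicted positive among those actually positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
```

With these sample counts, the accuracy rate, the precision rate, and the recall rate are each 0.8, and the F1 score is 0.8 up to floating-point rounding.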
The electronic device may further include an input apparatus 130 and an output apparatus 140.
The processor 110, the memory 120, the input apparatus 130, and the output apparatus 140 in the electronic device may be connected through a bus or in other manners.
As a computer-readable storage medium, the memory 120 may be configured to store software programs, computer-executable programs, and modules. The processor 110 runs the software programs, instructions and modules stored in the memory 120 to perform function applications and data processing, that is, to implement any method in the preceding embodiments.
The memory 120 may include a program storage region and a data storage region. The program storage region may store an operating system and an application program required by at least one function. The data storage region may store the data created according to the use of the electronic device. Additionally, the memory may include a volatile memory, for example, a random access memory (RAM), and may also include a non-volatile memory, for example, at least one magnetic disk memory element, a flash memory element, or another non-volatile solid-state memory element.
The memory 120 may be a non-transitory computer storage medium or a transitory computer storage medium. The non-transitory computer storage medium includes, for example, at least one magnetic disk memory element, a flash memory element, or another non-volatile solid-state memory element. In some embodiments, the memory 120 optionally includes memories that are disposed remotely relative to the processor 110. These remote memories may be connected to the electronic device via a network. Examples of the preceding network include the Internet, an intranet, a local area network, a mobile communication network, and a combination thereof.
The input apparatus 130 may be configured to receive input numeric or character information and generate key signal input related to user settings and function control of the electronic device. The output apparatus 140 may include a display device, for example, a display screen.
This embodiment further provides a computer-readable storage medium storing computer-executable instructions for executing the preceding methods.
All or part of the processes in the methods of the preceding embodiments may be performed by related hardware instructed by computer programs. The programs may be stored in a non-transitory computer-readable storage medium. When executed, the programs may carry out the processes of the methods according to the preceding embodiments. The non-transitory computer-readable storage medium may be, for example, a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
Compared with the related technology, the present application has the advantages below.
Number | Date | Country | Kind |
---|---|---|---|
201910859082.8 | Sep 2019 | CN | national |
The present disclosure claims priority to Chinese Patent Application No. 201910859082.8 filed with the China National Intellectual Property Administration (CNIPA) on Sep. 11, 2019, the disclosure of which is incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2020/108652 | 8/12/2020 | WO |