The subject application is a U.S. National Stage application of International Application No. PCT/CN2020/109670, filed on 18 Aug. 2020, which claims the benefit of Chinese Patent Application No. 202010188511.6, filed on 17 Mar. 2020. The contents of each application are incorporated herein by reference.
The present disclosure relates to the technical field of text analysis, for example, a text keyword extraction method, an electronic device and a computer-readable storage medium.
Keyword extraction is a key technology in information retrieval, text classification and clustering, and automatic abstract generation, and is an important means of quickly acquiring the subject of a document. Keywords are traditionally defined as a set of words or phrases that can summarize the subject matter of a document. Keywords have important applications in many fields, such as automatic abstract generation for documents, web page information extraction, document classification and clustering, and search engines. However, in most cases, the keywords are not given directly in the text, so it is necessary to design a keyword extraction method.
In the field of text analysis, the technologies for extracting text keywords mainly include a TextRank algorithm, a term frequency-inverse document frequency (TFIDF) algorithm, and a Latent Dirichlet Allocation (LDA) model.
(1) TextRank Algorithm
The TextRank algorithm is derived from the PageRank algorithm. When the keywords are extracted using the TextRank algorithm, not every pair of words is connected; instead, after a window is set, the words within the same window are connected. A score of each word is then calculated using a score calculation formula, and the n words with the highest scores are obtained as the keywords. The score calculation formula is described below.

WS(Vi) = (1 − d) + d × Σ_{Vj∈In(Vi)} [ω_ji / Σ_{Vk∈Out(Vj)} ω_jk] × WS(Vj)

WS(Vi) denotes a weight (TextRank value) of Vi; Vi denotes the word i; d denotes a damping coefficient and is usually taken as 0.85; ω_ji denotes an edge between words j and i, that is, a weight between the two words; In(Vi) denotes the set of words that connect to Vi; Out(Vj) denotes the set of words that Vj connects to; i denotes the word i; j denotes the word j, which is a word preceding i; and k denotes another word other than the word i and the word j. The formula indicates that the weight of the word i in the TextRank algorithm depends on a sum, over each point j preceding i, of the weight of the edge (j, i), normalized by the total weight of the edges from j to other points, multiplied by the weight of the point j.
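The windowed co-occurrence graph and iterative scoring described above can be sketched as follows. This is a minimal illustration rather than the implementation of the present application; the window size, the damping coefficient d = 0.85, and the sample token list are illustrative assumptions.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, d=0.85, iters=50, top_n=3):
    # Build an undirected co-occurrence graph: only words within the same
    # window are connected; edge weight counts co-occurrences.
    weight = defaultdict(float)
    neighbors = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[i] != words[j]:
                weight[(words[i], words[j])] += 1.0
                weight[(words[j], words[i])] += 1.0
                neighbors[words[i]].add(words[j])
                neighbors[words[j]].add(words[i])
    # Iterate WS(Vi) = (1 - d) + d * sum over neighbors j of
    # (w_ji / sum_k w_jk) * WS(Vj), as in the scoring formula.
    ws = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        ws = {
            i: (1 - d) + d * sum(
                weight[(j, i)] / sum(weight[(j, k)] for k in neighbors[j]) * ws[j]
                for j in neighbors[i]
            )
            for i in neighbors
        }
    # The n words with the highest scores are taken as keywords.
    return sorted(ws, key=ws.get, reverse=True)[:top_n]

tokens = ["keyword", "extraction", "uses", "graph", "ranking",
          "graph", "ranking", "scores", "keyword", "candidates"]
print(textrank_keywords(tokens))
```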
(2) TFIDF Algorithm
The TFIDF algorithm is a statistics-based calculation method, and the calculation formulas are described below.

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log(|D| / (1 + |{j : t_i ∈ d_j}|))

tfidf_{i,j} = tf_{i,j} × idf_i

tf_{i,j} denotes a term frequency (TF), idf_i denotes an inverse document frequency (IDF), and n_{i,j} denotes the number of instances the word i appears in document j; |D| denotes the total number of documents in the document set; |{j : t_i ∈ d_j}| denotes the number of documents containing the word i; i denotes the word i; and j denotes the document j.
The formula counts the TF and the IDF, and a TFIDF value is often used for indicating the importance of a word in a document within a document set, where such a word may also be referred to as a feature word or a keyword of the document. The TF refers to the frequency of a word in an article, and the IDF refers to the logarithm of the total number of documents in the document set divided by one plus the number of documents containing the word. The larger the TFIDF value of a word is, that is, the more frequently the word appears in an article while appearing in few other articles in the document set, the more likely the word indicates a feature or key content of the article.
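The TF and IDF computation described above can be sketched briefly as follows; the toy corpus is an illustrative assumption, and the idf uses log(|D| / (1 + df)), matching the "plus one" smoothing in the description.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one dict per document, word -> tfidf."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            w: (c / total) * math.log(n_docs / (1 + df[w]))
            for w, c in counts.items()
        })
    return scores

corpus = [["text", "keyword", "extraction"],
          ["text", "classification"],
          ["keyword", "ranking", "keyword"]]
per_doc = tfidf(corpus)
# "text" appears in 2 of 3 documents, so its idf is log(3/3) = 0;
# "extraction" appears in only 1, so it scores higher in document 0.
```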
(3) LDA Algorithm
The LDA algorithm assumes that both the prior distribution of topics in a document and the prior distribution of words in a topic obey the Dirichlet distribution. Following the Bayesian approach, a set of Dirichlet-multinomial conjugate pairs is formed by the Dirichlet priors and the multinomial distributions obtained from the observed data; on this basis, the posterior distribution of the topics in each document and the posterior distribution of the words in each topic are inferred, which are the final results needed.
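The topic inference just described can be sketched with scikit-learn's LatentDirichletAllocation; the library choice and the tiny corpus are assumptions for illustration only, as the disclosure does not name an implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "keyword extraction ranks candidate words",
    "topic models infer word distributions",
    "keyword ranking scores candidate words",
    "topic inference uses dirichlet priors",
]
# Word counts are the observed data for the multinomial distributions.
counts = CountVectorizer().fit_transform(docs)
# Variational inference yields the posterior topic mix of each document.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
# Each row is a document's inferred topic distribution and sums to 1;
# lda.components_ holds the (unnormalized) word weights per topic.
```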
Although the keywords of a text may be extracted to a certain extent by the preceding three methods, none of the methods considers the text name or the text title. Important information is therefore lost, leaving less information available for extracting the keywords of the article.
The present application provides a keyword extraction method applicable to a Word text, an electronic device, and a computer-readable storage medium, so as to improve an accuracy rate and a recall rate.
The present application may be implemented by the technical schemes described below.
A text keyword extraction method includes the steps described below.
An electronic device includes a processor and a memory configured to store a program.
When the program is executed by the processor, the processor performs the text keyword extraction method.
A computer-readable storage medium stores computer-executable instructions for executing the text keyword extraction method.
The present application is described hereinafter in detail in conjunction with drawings and embodiments. Apparently, the described embodiments are part, not all, of embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative work are within the scope of the present application.
It is to be understood that the name and the title of an article play a particularly important role in the article. Except for articles with arbitrary names or articles without titles, some words in the name or title can generally represent the article. The name and title of the article are generally formulated so that the approximate content of the article can be known without opening the file. The title of the article is the skeleton of the content of the entire article, represents the main core content of each paragraph, and links the preceding and following parts of the article.
For example, the file name is “Stylish Victory! Nowitzki on site helps the Dallas Cowboys team kick off the season” (citing a piece of sports news in Baidu News as an example), and five keywords extracted by the preceding method are: Dallas, game, season, Cowboys, and team. The keywords of this piece of news should include “Nowitzki”, but “Nowitzki” only appears when 10 keywords are extracted. Therefore, the extracted keywords are not satisfactory and cannot satisfy real keyword extraction requirements.
As shown in
In some embodiments, the keyword extraction model may be a logistic regression (LR) model, a support vector machine (SVM) model, or a decision tree model. The LR model may yield better results.
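Training such a keyword classification model may be sketched as follows with an LR model. The feature set shown (tfidf score, whether the word appears in the title, whether it appears in the file name) and the toy training data are assumptions for illustration; the disclosure only states that an LR, SVM, or decision tree model may be used.

```python
from sklearn.linear_model import LogisticRegression

# One row per candidate word: [tfidf, in_title (0/1), in_filename (0/1)].
# These features and values are hypothetical training examples.
X = [
    [0.42, 1, 1],  # high tfidf, appears in title and file name
    [0.35, 1, 0],
    [0.30, 0, 1],
    [0.05, 0, 0],  # low tfidf, appears nowhere prominent
    [0.08, 0, 0],
    [0.02, 0, 0],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = labeled keyword, 0 = not a keyword

clf = LogisticRegression().fit(X, y)
# Only words the model classifies as keywords are kept.
preds = clf.predict([[0.40, 1, 0], [0.03, 0, 0]])
```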
As shown in
As shown in
As shown in
In this embodiment, a test performed on this method is described below.
For test data, a total of 116,532 papers are downloaded, including categories such as politics, teaching, and economics. Some data are screened according to whether the data includes keywords, titles, and other information. The papers actually used are described below.
(i) Comparison Between Various Text Keyword Extraction Methods
The keywords extracted by the TextRank algorithm are extracted based on the paper itself. Since an idf value in the TFIDF algorithm needs to be calculated from multiple pieces of test data, when the keywords are extracted, the idf value is calculated based on a test set of 535 papers, and the tfidf value involved in the method of the present application is also calculated based on this test set.
The test results show that for this article, in this method, only words that the model considers to be keywords are extracted, and the accuracy rate and the recall rate are higher than those of TextRank and TFIDF for extracting the keywords.
(ii) Accuracy Rate and Recall Rate of Different Keyword Extraction Methods
Test data is the preceding training set of 3556 papers and the preceding test set of 535 papers.
The steps may be described below.
After the test of the test set, the accuracy rate and the recall rate of extracted keywords are described in Table 2.
The following conclusions may be drawn from the analysis of the test results.
The definitions of the accuracy rate and the recall rate are described below.
First, consider a confusion matrix: for a binary classification problem, four cases occur when the predicted results and the actual results are combined in pairs.
Since representation by the numbers 1 and 0 does not facilitate reading, after conversion, T (True) denotes correctness, F (False) denotes incorrectness, P (Positive) denotes 1, and N (Negative) denotes 0. A predicted result (P|N) is given first, and then a determination result (T|F) is given based on the comparison between the actual result and the predicted result. Based on the preceding logic, the table below is obtained after redistribution.
TP, FP, FN, and TN may be understood as follows: TP (True Positive) denotes a sample predicted to be positive that is actually positive; FP (False Positive) denotes a sample predicted to be positive that is actually negative; FN (False Negative) denotes a sample predicted to be negative that is actually positive; and TN (True Negative) denotes a sample predicted to be negative that is actually negative.
The accuracy rate refers to, in terms of the predicted results, the probability that a sample among all the samples predicted to be positive is actually positive. The expression of the accuracy rate is described below.

Accuracy rate = TP / (TP + FP)
The recall rate refers to, in terms of the original samples, the probability that a sample among all the actually positive samples is predicted to be positive. The expression of the recall rate is described below.

Recall rate = TP / (TP + FN)
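The accuracy rate and recall rate defined above can be computed directly from the confusion-matrix counts; the counts in the example are illustrative assumptions.

```python
def precision_recall(tp, fp, fn):
    # Accuracy rate (precision): of all predicted-positive samples,
    # the fraction that are actually positive.
    precision = tp / (tp + fp)
    # Recall rate: of all actually positive samples,
    # the fraction that were predicted positive.
    recall = tp / (tp + fn)
    return precision, recall

# e.g. 8 true keywords found, 2 false alarms, 4 keywords missed:
p, r = precision_recall(tp=8, fp=2, fn=4)
# precision 0.8, recall about 0.667
```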
The electronic device may further include an input apparatus 130 and an output apparatus 140.
The processor 110, the memory 120, the input apparatus 130, and the output apparatus 140 that are in the electronic device may be connected by a bus or in other manners. A connection by a bus is used as an example in
As a computer-readable storage medium, the memory 120 may be configured to store software programs, computer-executable programs, and modules. The processor 110 runs the software programs, instructions and modules stored in the memory 120 to perform function applications and data processing, that is, to implement any method in the preceding embodiments.
The memory 120 may include a program storage region and a data storage region. The program storage region may store an operating system and an application program required by at least one function. The data storage region may store data created according to the use of the electronic device. Additionally, the memory 120 may include a volatile memory, for example, a random-access memory (RAM), and may also include a non-volatile memory, for example, at least one magnetic disk memory, a flash memory, or another non-volatile solid-state memory.
The memory 120 may be a non-transitory computer storage medium or a transitory computer storage medium. The non-transitory computer storage medium includes, for example, at least one magnetic disk memory, a flash memory, or another non-volatile solid-state memory. In some embodiments, the memory 120 optionally includes memories which are disposed remotely relative to the processor 110. These remote memories may be connected to the electronic device via a network. The examples of the preceding network may include the Internet, an intranet, a local area network, a mobile communication network, and a combination thereof.
The input apparatus 130 may be configured to receive the input digital or character information and generate key signal input related to user settings and function control of the electronic device. The output apparatus 140 may include a display device, for example, a display screen.
This embodiment further provides a computer-readable storage medium storing computer-executable instructions for executing the preceding methods.
All or part of the processes in the methods of the preceding embodiments may be implemented by computer programs instructing related hardware. These programs may be stored in a non-transitory computer-readable storage medium, and during the execution of these programs, the processes of the preceding method embodiments may be included. The non-transitory computer-readable storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or a RAM.
Compared with the related art, the present application has the advantages described below.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/CN2020/109670 | 8/18/2020 | WO |
| Publishing Document | Publishing Date | Country | Kind |
|---|---|---|---|
| WO2021/184674 | 9/23/2021 | WO | A |
| Number | Date | Country | |
|---|---|---|---|
| 20230136368 A1 | May 2023 | US |