The present disclosure relates to a method and a system for determining and reclassifying valuable words, and more particularly to a system and a method that employ machine learning to extract valuable words from text, and then classify the valuable words.
Currently, the online world holds a vast amount of information, articles, essays, and the like. However, it is difficult for network users, network data processing units, or network advertising providers to accurately obtain useful information from such a large volume, or to apply it. As a result, how to quickly and accurately obtain useful information from the internet has become an important topic in network development. How to have machines, rather than humans, actively gather text information and learn to determine and extract useful information is therefore a goal in many fields. TW No. TWI660317, “Popularity Prediction Method for Marketing Targets and Non-transient Computer Readable Media”, discloses a technique that first downloads articles of the corresponding marketing category from social media, obtains a plurality of keywords through word segmentation, uses time series to determine the correlation among the keywords, and establishes a neural network model. When a user finally uses a keyword, other keywords can be provided to the user according to their correlation with it.
However, the above-mentioned disclosure only considers the word exposure rate when analyzing keywords, and does not take other data such as click-through rate, word occurrence frequency, word usage rate, etc. into account. Moreover, it relies on word segmentation to obtain the keywords. Although word segmentation plays a role in today's keyword extraction from text, it may also exclude popular words, mixed Chinese-English expressions, Martian text, etc., which may be meaningful (or valuable) for data analysis even though they are not keywords. Finally, when users use keywords, the aforementioned disclosure only provides other keywords with relevance or similarity, and does not mention providing data in other categories, aspects, or fields.
In summary, the existing extraction and use of valuable words have the above-mentioned shortcomings. How to overcome these shortcomings is therefore a problem to be solved.
It is a primary object of the present disclosure to provide a system and a method for identifying valuable words from text and reclassifying them.
According to the present disclosure, a word processing server is provided for a data provider to pre-input text, such as articles from Internet sources, email marketing texts, product descriptions, etc., which serves as the basis for the valuable words in the text information. A first machine learning process is performed so that the system learns to determine valuable words in the text. The system then performs a second machine learning process on the pre-entered valuable words and the classification labels corresponding to those valuable words. In this way, the system can extract the valuable words from the text. After the extraction is completed, the extracted valuable words are classified. Finally, labels are assigned to the corresponding valuable words. When the valuable words are subsequently used, they can not only be determined from the text itself, but also be applied differently according to their label classification.
Referring to
The word processing server 11 performs machine learning after receiving the data transmitted by the data providing device 13, and builds a plurality of models based on the learned data. The word processing server 11 then analyzes the data under test collected through the third-party search system 12 and extracts valuable words from it. The valuable words are then classified, and classification label information is assigned to each valuable word according to its classification category.
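The flow just described (extract valuable words from the data under test, then classify and label each one) can be sketched as a two-stage function. This is only an illustrative sketch: the names `process_texts`, `Extractor`, `Classifier`, and the toy stand-in models are assumptions made here and do not appear in the disclosure.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical type aliases; none of these names come from the disclosure.
Extractor = Callable[[str], List[str]]   # stands in for the first model
Classifier = Callable[[str], str]        # stands in for the second model


def process_texts(texts: Iterable[str],
                  extract: Extractor,
                  classify: Classifier) -> Dict[str, str]:
    """Extract valuable words from each text, then label each word once."""
    labelled: Dict[str, str] = {}
    for text in texts:
        for word in extract(text):
            if word not in labelled:
                labelled[word] = classify(word)
    return labelled


# Toy stand-ins for the two trained models (assumptions for illustration only).
def toy_extract(text: str) -> List[str]:
    # Treat any whitespace-separated token starting with a capital as valuable.
    return [w for w in text.split() if w[0].isupper()]


def toy_classify(word: str) -> str:
    return "brand" if word.endswith("Co") else "other"


result = process_texts(["buy AcmeCo shoes in Taipei"], toy_extract, toy_classify)
print(result)
```

Passing the two models in as plain callables keeps the sketch independent of how either model is actually trained.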
The third-party search system 12 can be any one of a search engine database, an advertisement database, a text database, or any combination thereof. Any system that enables the word processing server 11 to obtain the required input samples under test can be employed.
The data providing device 13 can be a mobile phone, a tablet computer, a personal computer, etc. Any device that can provide the data required by the word processing server 11 for machine learning can be employed. The data providing device 13 mainly provides the text information, valuable word information, and classification information required by the word processing server 11 for machine learning and model building. This information is described below.
The word processing server 11 mainly includes a data processing module 111 which is respectively connected to a data storage module 112, a data collection module 113, a word determination module 114, and a word reclassification module 115. The data processing module 111 is employed to operate the word processing server 11 and to drive the above-mentioned modules in operation. The data processing module 111, for example a central processing unit (CPU), fulfills functions such as logical operations, temporary storage of operation results, and storage of the position of execution instructions.
The data storage module 112 stores electronic data and can be an SSD (solid state disk or solid state drive), an HDD (hard disk drive), or any other type of memory. The data storage module 112 mainly includes a word determination database 1121, a word reclassification database 1122, and a classification completion database 1123. The word determination database 1121 stores and records text information T1 and first valuable word information L1, both of which are provided by the data providing device 13. The text information T1 generally includes texts such as articles from internet sources, email marketing texts, product descriptions, public documents, short texts, or combinations thereof. The first valuable word information L1 corresponds to the valuable words in the text information T1. The valuable words include keywords, current buzzwords, mixed Chinese-English expressions, Martian text, and other words that are meaningful at the time, all of which meet the definition of a valuable word. The valuable words are marked by the data providing device 13. The marking is based on associated data such as the frequency of occurrence of the valuable words in the text, frequency of use, frequency of touch, frequency of clicks, frequency of common words, etc. The word reclassification database 1122 stores second valuable word information T2 and classification category information L2. The second valuable word information T2 is the same kind of information as the aforementioned first valuable word information L1. However, the second valuable word information T2 serves as the input data of the second machine learning mentioned below and therefore has no corresponding text information. The classification category information L2 is the information corresponding to the second valuable word information T2.
The classification category information L2 is marked by the data providing device 13 and can be the field, frequency of use, scope of use, usage habits, word length, etc. of the valuable word. The classification category information L2 can also be the attribute, function, effect, feature, brand, etc. of the classification label. The classification completion database 1123 mainly stores valuable word information under test and classification label information, which will be described in detail below.
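As a rough illustration of what the three databases hold, the following sketch models them as in-memory Python dataclasses. The class names, method names, and the check that a marked word must literally occur in its text are assumptions made here; the disclosure does not specify a schema or a storage engine.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class WordDeterminationDB:
    """Sketch of database 1121: (text T1, marked valuable words L1) pairs."""
    samples: List[Tuple[str, List[str]]] = field(default_factory=list)

    def add(self, text: str, valuable_words: List[str]) -> None:
        # Assumption: a marked word must literally occur in its text.
        missing = [w for w in valuable_words if w not in text]
        if missing:
            raise ValueError(f"not found in text: {missing}")
        self.samples.append((text, list(valuable_words)))


@dataclass
class WordReclassificationDB:
    """Sketch of database 1122: (valuable word T2, category L2) pairs."""
    samples: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, word: str, category: str) -> None:
        self.samples.append((word, category))


@dataclass
class ClassificationCompletionDB:
    """Sketch of database 1123: word under test -> assigned label."""
    labels: Dict[str, str] = field(default_factory=dict)

    def store(self, word: str, label: str) -> None:
        self.labels[word] = label


db_1121 = WordDeterminationDB()
db_1121.add("超好用的 iPhone 手機殼", ["iPhone"])
db_1122 = WordReclassificationDB()
db_1122.add("iPhone", "electronics")
db_1123 = ClassificationCompletionDB()
db_1123.store("iPad", "electronics")
```

Note that the records in 1122 carry no text, matching the statement that T2 has no corresponding text information.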
The data collection module 113 is mainly used to drive the third-party search system 12 to collect text information under test, and to transmit the text information under test to the subsequent word determination module 114. The data collection module 113 mainly uses browser search, data retrieval, web crawling, and other methods, or a combination thereof, to obtain the text information under test. The text information under test generally refers to texts such as articles from internet sources, email marketing texts, product descriptions, public documents, short texts, or a combination thereof, but is not limited thereto. The text information under test may be written in a single natural language or a single natural language family, or in multiple or mixed natural languages.
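One way to picture the combination of collection methods is a function that merges the output of several interchangeable collectors. The sketch below is an assumption-laden illustration: the `Collector` type, the function name, the dropping of blank and duplicate snippets, and the sample data are all invented here, and no real search engine or crawler is called.

```python
from typing import Callable, Iterable, List

# Hypothetical collector type: each source returns raw text snippets.
Collector = Callable[[], Iterable[str]]


def collect_texts_under_test(collectors: Iterable[Collector]) -> List[str]:
    """Merge snippets from all sources, dropping blanks and exact duplicates."""
    seen = set()
    texts: List[str] = []
    for collect in collectors:
        for raw in collect():
            text = raw.strip()
            if text and text not in seen:
                seen.add(text)
                texts.append(text)
    return texts


# Stand-ins for a search result feed and a crawler (invented sample data).
def fake_search() -> List[str]:
    return ["New iPhone 手機殼 on sale ", ""]


def fake_crawler() -> List[str]:
    return ["New iPhone 手機殼 on sale", "Email: 夏季 latte 買一送一"]


texts_under_test = collect_texts_under_test([fake_search, fake_crawler])
```

The sample snippets deliberately mix Chinese and English, since the text information under test is not restricted to a single natural language.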
The word determination module 114 mainly determines the valuable words in the text information under test transmitted by the data collection module 113, extracts them as valuable word information under test, and transmits that information to the subsequent word reclassification module 115. The word determination module 114 mainly employs machine learning, such as supervised learning, semi-supervised learning, reinforcement learning, etc., to build models, but is not limited thereto. The word determination module 114 uses the text information T1 as input data for model training and the first valuable word information L1 as the label data, thereby performing a first machine learning and constructing the model accordingly.
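The training arrangement described above (T1 as input, L1 as labels) can be illustrated with a deliberately simple supervised stand-in. The sketch below is not the disclosed model: it merely scores each character substring by the fraction of its training occurrences that were marked valuable, so it only recognizes words it has already seen marked. All function names, the `max_len` and `threshold` parameters, and the sample data are assumptions. It works on raw character substrings, so no word segmentation is involved.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def substrings(text: str, max_len: int) -> Iterable[str]:
    """Yield every substring of 1..max_len characters, in reading order."""
    for start in range(len(text)):
        for size in range(1, max_len + 1):
            if start + size <= len(text):
                yield text[start:start + size]


def train_first_model(samples: List[Tuple[str, List[str]]],
                      max_len: int = 8) -> Dict[str, float]:
    """Score each substring by how often its occurrences were marked valuable."""
    seen: Counter = Counter()
    marked: Counter = Counter()
    for text, valuable_words in samples:
        valuable = set(valuable_words)
        for gram in substrings(text, max_len):
            seen[gram] += 1
            if gram in valuable:
                marked[gram] += 1
    return {gram: marked[gram] / seen[gram] for gram in marked}


def extract_valuable_words(text: str, model: Dict[str, float],
                           threshold: float = 0.5,
                           max_len: int = 8) -> List[str]:
    """Return substrings of the text whose learned score meets the threshold."""
    found: List[str] = []
    for gram in substrings(text, max_len):
        if model.get(gram, 0.0) >= threshold and gram not in found:
            found.append(gram)
    return found


# Invented (T1, L1) training pairs, including mixed Chinese-English text.
training = [
    ("超好用的 iPhone 手機殼", ["iPhone"]),
    ("the new iPhone is here", ["iPhone"]),
]
first_model = train_first_model(training)
words_under_test = extract_valuable_words("我的 iPhone 壞了", first_model)
```

A practical system would replace the lookup table with a model that generalizes to unseen words; the sketch only shows how the T1 and L1 data pair up during training and how the result is applied to text under test.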
The word reclassification module 115 mainly classifies the valuable word information under test transmitted by the word determination module 114, and assigns classification label information to the valuable word information under test according to the classification result. Finally, the valuable word information under test and the classification label information are stored in the classification completion database 1123. The word reclassification module 115 mainly employs machine learning, such as supervised learning, semi-supervised learning, reinforcement learning, etc., to build models. The word reclassification module 115 uses the second valuable word information T2 as input data for model training and the classification category information L2 as the label data, thereby performing a second machine learning and constructing the model accordingly.
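As an illustration of the second machine learning (T2 as input, L2 as labels), the sketch below trains a small multinomial naive Bayes classifier with add-one smoothing over character unigrams and bigrams. The choice of naive Bayes, the feature set, the class name `NaiveBayesLabeller`, and the sample data are assumptions made here; the disclosure does not name a particular algorithm.

```python
import math
from collections import Counter, defaultdict
from typing import List, Optional, Tuple


def char_features(word: str) -> List[str]:
    """Character unigrams and bigrams, so no word segmentation is needed."""
    lowered = word.lower()
    return list(lowered) + [lowered[i:i + 2] for i in range(len(lowered) - 1)]


class NaiveBayesLabeller:
    """Multinomial naive Bayes with add-one smoothing over char features."""

    def __init__(self, samples: List[Tuple[str, str]]) -> None:
        self.label_counts: Counter = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()
        for word, label in samples:
            self.label_counts[label] += 1
            for feat in char_features(word):
                self.feature_counts[label][feat] += 1
                self.vocabulary.add(feat)

    def predict(self, word: str) -> Optional[str]:
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            for feat in char_features(word):
                score += math.log((self.feature_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented (T2, L2) training pairs.
second_model = NaiveBayesLabeller([
    ("iPhone", "electronics"),
    ("iPad", "electronics"),
    ("latte", "beverage"),
    ("mocha", "beverage"),
])
label_for_ipod = second_model.predict("iPod")
label_for_matcha = second_model.predict("matcha")
```

The predicted label would then be stored with the word in the classification completion database 1123.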
As illustrated in
(1) Step of Inputting Information Under Test S1:
As shown in
(2) Step of Comparing the First Model S2:
Following the above-mentioned step S1 and referring to
(3) Step of Determining the Valuable Words S3:
Following the above-mentioned step S2 and referring to
(4) Step of Comparing the Second Model S4:
Referring to
(5) Step of Reclassifying the Valuable Words S5:
Following the above-mentioned step S4 and referring to
As shown in
As shown in
According to the present disclosure, the system employs a two-stage machine learning method that enables it to extract the valuable words from the text, classify the valuable words, and assign labels to the valuable words according to the classification category. Accordingly, the present disclosure achieves the purpose of identifying valuable words from the text and reclassifying them.
Number | Date | Country | Kind |
---|---|---|---|
110105019 | Feb 2021 | TW | national |