This application claims priority to Korean Patent Application No. 10-2023-0156844, filed on Nov. 14, 2023, the disclosure of which is herein incorporated by reference in its entirety.
This invention was supported by a grant from the Institute of Information & Communications Technology Planning & Evaluation (IITP) funded by the Ministry of Science and ICT (No. 2022-0-00302, Development of a global demand forecasting and analysis/prediction system for market/industry trends).
The present disclosure relates to a web crawler system that crawls and collects Internet articles and provides a summary service for issue articles affecting the global value chain (GVC). The system crawls English issue articles from major global media on the Internet, including CNN USA, NYT (New York Times), WSJ (Wall Street Journal), Reuters UK, and the like, classifies a news list, and stores the news article list/articles in a server DB. It then classifies the sentences in an issue article (a sentence set: sentence 1, 2, 3, . . . ), selects sentences containing the title of the issue article, identifies the beginnings and ends of sentences using tokens in the sentence set including multiple sentences of the English issue article, extracts the sentences, and extracts the words in each sentence by utilizing a S/W engine that uses a key word combination algorithm based on segmentation techniques of a language model (LM) of natural language processing (NLP) to search for keywords, analyze global economic trends in issue articles, and predict risks. The re-extracted words are then combined according to English grammar, sentence structure, and word order, on the basis of the title of the issue article, to provide a summary of major overseas issue articles that affect the global value chain (GVC).
A language model (LM) is a model processed by a computer by assigning probabilities to sentences (word sequences) or words to model language. Language modeling is a task of predicting words or sentences within a given task based on an existing dataset.
For example, if a word is denoted by w and a word sequence by an uppercase W, the probability of a word sequence W in which n words appear is expressed as follows:

P(W) = P(w_1, w_2, w_3, . . . , w_n)

The probability of a next word appearing when n-1 words are listed, that is, the conditional probability of an n-th word, is expressed by the following equation:

P(w_n | w_1, . . . , w_(n-1))

For example, the probability of a fourth word is expressed as follows:

P(w_4 | w_1, w_2, w_3)

The probability of the entire word sequence W is known after all words have been predicted, and the probability of the word sequence is expressed as follows:

P(W) = P(w_1, w_2, . . . , w_n) = Π_(i=1 to n) P(w_i | w_1, . . . , w_(i-1))
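The word-sequence probabilities described above can be sketched with a toy bigram model; the corpus, function names, and counts here are illustrative assumptions, not part of the disclosure.

```python
# Toy illustration of chain-rule language-model probabilities under a
# bigram approximation; the corpus is a hypothetical example.
from collections import Counter

corpus = "the cat sat on the mat . the cat ran .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """P(word | prev) estimated from bigram counts."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sequence(words):
    """P(w1..wn) ~= P(w1) * prod P(wi | w(i-1)) under the bigram assumption."""
    p = unigrams[words[0]] / len(corpus)
    for prev, w in zip(words, words[1:]):
        p *= p_next(w, prev)
    return p
```

For instance, `p_next("cat", "the")` estimates the conditional probability of "cat" given "the" from the toy counts, and `p_sequence` multiplies the conditional probabilities together as in the product formula.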
Natural Language Processing (NLP) is a technology that allows a computer to understand the syntactic/semantic nature of human language through natural language analysis. Natural language analysis is divided into four categories: a morphological analysis, which divides a sentence, a series of strings expressed as a subject and a predicate, into morphemes, the smallest units of meaning in natural language; a syntactic analysis, which analyzes the strings according to a regular grammar; a semantic analysis, which analyzes the meaning (semantics) of a sentence; and a pragmatic analysis. In the field of NLP, the entire process of converting natural language used by humans into number vectors that computers can understand is called embedding.
A morpheme is the smallest unit of meaning (semantics) in the structure of the words constructing a sentence, and may be combined with markers such as a possessive case marker.
The syntactic analysis is classified into a top-down syntactic analysis and a bottom-up syntactic analysis. The syntactic analysis uses a syntactic analyzer handled by a parser included in an interpreter or compiler; the parser is a program that performs syntactic analysis by generating a structured syntax tree from a series of tokens, which are the results of lexical analysis. The top-down syntactic analysis is a method in which the syntactic analyzer reads a sentence from left to right, one character at a time, generating the parse tree downward from the root node to the leaf nodes. Conversely, the bottom-up syntactic analysis is a method of generating the parse tree upward from the leaf nodes to the root node.
Recently, Google has developed bidirectional encoder representations from transformers (BERT), a learning-type natural language processing language model (LM), which is being utilized in the field of natural language processing (NLP). OpenAI, backed by Microsoft Corporation, has developed ChatGPT, a conversational chatbot utilizing generative AI techniques, to provide language translation, database search, and question-answering services, and recently released GPT-4 Turbo.
Unlike an existing language model (LM) such as GPT, Google's BERT expresses text, which is input data, by integrating bidirectional context based on an encoder of a transformer. In addition, BERT is pre-trained on a large unlabeled corpus including sentences/words from dictionaries and news, and the pre-trained model is fine-tuned by using a database that includes labels of downstream tasks and is utilized in various downstream tasks. BERT simply recognizes the form of words or sentences. Recently, Google's AI service chatbot Bard was released.
OpenAI's ChatGPT, a natural language processing (NLP) language model (LM) service, provides language translation, text generation, and question-answering services using artificial intelligence. Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large-scale language model that is pre-trained to predict a next token. Recently, OpenAI, the US developer of ChatGPT, released its latest artificial intelligence (AI) model, 'GPT-4 Turbo'.
In Korea, the Voice Intelligence Research Group of the Electronics and Telecommunications Research Institute (ETRI) is researching word embedding technology based on Korean phoneme sequences, and Kookmin University is researching sentence generation techniques using a Korean phoneme-based LSTM language model.
In addition, the morphological analysis takes a statistical and dictionary-based approach, and a morphological analysis system includes probability information learned from a corpus tagged with parts of speech as well as a separately pre-built vocabulary dictionary.
Most existing morphological analysis methods may use a word-based approach, and attach a part of speech to each individual word or morpheme.
Morphological analysis-based language models (LMs) use morphological analysis APIs to analyze the form and meaning of words and the structure and meaning (semantic) of sentences to understand the meaning of natural language sentences.
For example, BERT provides pre-training and fine-tuning. As a pre-training method, BERT first pre-trains a large-scale language model on a large corpus through an upstream task that learns broadly via a transformer, and then performs transfer learning by training the model again on a downstream task. That is, it has a structure in which a BERT language model is applied to a large corpus, and the resulting 'good embedding' values are passed as an input for 'transfer learning' to an additional model.
BERT is a model trained by embedding a large corpus of about 3.3 billion words; it creates labels itself and performs self-supervised learning. BERT executes an embedding process for a given sentence (a sequence of words), wherein the entire process of converting natural language into number vectors that computers can understand is called embedding. It is a general-purpose language model that expresses the meaning of words as vectors and shows good performance in the field of NLP.
Embedding typically plays the following three roles:
Large corpus→BERT→(data to be classified)→machine learning model such as LSTM, CNN→classification
An upstream task means learning broadly. The upstream task pre-trains a large corpus. Learning methods include learning relationships between sentences to predict a next word or sentence that will come after a given one (Next Sentence Prediction, NSP), and masking a word and then guessing the masked word (Masked Language Model, MLM). A BERT model performs its upstream task (pre-training) using both methods, while GPT performs its upstream task using only next-word (next-token) prediction.
A language model learns broadly through pre-training by an upstream task, and then learns deeply through transfer learning by a downstream task. Multiple downstream tasks may be generated.
Fine-tuning is characterized by using the entire downstream task data and updating the entire model. A disadvantage is that the larger the language model (LM), the higher the computational cost required to update the entire model. Due to this disadvantage, prompt tuning and in-context learning are used.
Prompt tuning means that a model is partially updated using the entire downstream task data.
In-context learning is a method of performing a downstream task without updating the model, using only a small portion of the downstream task data. In-context learning is classified into zero-shot learning, one-shot learning, and few-shot learning.
Tokenization means dividing a sentence into smaller units. A tokenization procedure is required first in order to provide input to a natural language processing model such as BERT or GPT. Tokenization methods operate at the word level, the character level, or the subword level.
Word-level tokenization is performed at the level of words. For example, the sentence "I went to the cafe yesterday" is divided into "I", "went", "to", "the", "cafe", and "yesterday". It has a disadvantage in that the size of the lexical set may increase because as many words as possible must be taken into consideration, and as it increases, model training becomes difficult.
Character-level tokenization is performed on a character-by-character basis; the character units are individual letters such as a, b, c, and d. An advantage of character-level tokenization is that it can cover all the characters of the language, but a disadvantage is that each character token is rarely a meaningful unit. For example, the distinction between the same character appearing in different words disappears, and because the length of the token sequence becomes longer, learning performance deteriorates. For example, the word "yesterday" is converted to "y", "e", "s", "t", "e", "r", "d", "a", and "y", and the token sequence becomes longer.
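The contrast between word-level and character-level tokenization described above can be illustrated with a short sketch; the example sentence is illustrative.

```python
# Word-level vs. character-level tokenization of the same sentence,
# showing how the character-level token sequence becomes much longer.
sentence = "I went to the cafe yesterday"

word_tokens = sentence.split()               # word-level tokens
char_tokens = list(sentence.replace(" ", ""))  # character-level tokens

print(len(word_tokens))   # 6
print(len(char_tokens))   # 23
```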
As an intermediate form of word-level tokenization and character-level tokenization, subword-level tokenization is characterized by not having an excessively large vocabulary size, solving the problem of unregistered tokens, and ensuring that token sequences are not too long. A representative implementation example is Byte Pair Encoding (BPE).
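As a sketch of how BPE learns subword units, the following toy implementation repeatedly merges the most frequent adjacent symbol pair; the corpus and merge count are illustrative assumptions, not the algorithm used by any particular library.

```python
# Minimal Byte Pair Encoding (BPE) merge learner (toy sketch).
from collections import Counter

def learn_bpe(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged_vocab = Counter()
        for word, freq in vocab.items():
            w, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    w.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    w.append(word[i])
                    i += 1
            merged_vocab[tuple(w)] += freq
        vocab = merged_vocab
    return merges
```

On a toy corpus such as `["low", "low", "lower", "newest", "newest"]`, the first merges join frequent character pairs like "l"+"o" into subword symbols, which is how a subword vocabulary of intermediate size is built.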
As prior art 1 related thereto, Korean Patent Publication No. 10-2023-0047849 discloses “METHOD AND SYSTEM FOR SUMMARIZING DOCUMENT USING HYPERSCALE LANGUAGE MODEL.”
A document summary system including:
Recently, there has been active development and operation of programs in which stock market and economic analysis experts conduct various economic analyses, develop content predicting economic crises and opportunities, and transmit it to customers through YouTube, and conversational programs of this kind are widely used. In those programs, world-renowned analysts and economic experts appear and engage in conversations with the host, using current economic issues and difficult economic models to provide persuasive explanations to clients through direct demonstrations. Additionally, corporate clients who manage the global value chain (GVC) also engage in question-answering conversations by asking questions in the comment section and receiving expert opinions on those questions.
These economic programs have received even greater support due to the COVID-19 pandemic that has swept the world, and have also been a great help in securing global responsiveness for companies through economic policy responses and estimates of changes in the global supply chain according to daily disease outbreaks and changes in response situations. In particular, for export-oriented South Korea, the COVID-19 pandemic has taught that GVC issues are very important information even for small and medium-sized enterprises, and that securing the ability to respond based thereon is a source of corporate competitiveness.
However, despite the fact that economic experts are producing a large amount of economic content, companies and economic research institutes are unable to obtain clear, customized answers in the areas they need. In particular, although they need to quickly identify changes in their fields of interest and related fields, analyze them precisely, and take action accordingly, they have neither the capacity to handle the enormous flood of information pouring in real time through the Internet nor the ability to analyze big data using that information.
Recently, not only long-term predictions and actions regarding the economic impact of the successive announcements and ramifications of inflation and environmental policies by governments around the world, such as the Inflation Reduction Act (IRA) legislation in the United States, climate crisis action, and carbon neutrality actions, but also short-term and ultra-short-term predictions and responses have come to determine the rise and fall of companies and the national economy. Therefore, decision makers and executives who manage the company's global value chain (GVC) need a compass to read their direction under the flood of Internet information, and clients who manage companies want to secure customized economic information that is fast, accurate, and easy to understand by analyzing big data of the Internet pouring out in real time.
The background technologies that provide those services typically require technologies such as securing and storing Internet documents, natural language processing (NLP) to read articles, and text mining to extract information from articles. Furthermore, text mining includes text analysis technology that preprocesses and identifies sentences.
In addition, existing web crawler systems have simply provided only website crawling, but have not provided summary services for issue articles of overseas media that affect the global value chain (GVC) from an economic perspective.
(Patent Document 1) Korean Patent Publication No. 10-2023-0047849 (Publication Date: Apr. 10, 2023), “METHOD AND SYSTEM FOR SUMMARIZING DOCUMENT USING HYPERSCALE LANGUAGE MODEL”, Naver Corporation
In order to solve the above-described problems, an aspect of the present disclosure is to provide a web crawler system that crawls and collects Internet articles and provides a summary service of issue articles affecting the global value chain. The system crawls English issue articles from major global media including CNN USA, NYT (New York Times), WSJ (Wall Street Journal), Reuters UK, and the like on the Internet to classify a list of news articles by article title and store the article list/articles in a server DB. It then classifies sentences in an issue article (a sentence set, sentence 1, 2, 3, . . . ), selects with a priority sentences containing key words of an item of interest including the title of the issue article, selects sentences related to the key words of the item of interest (positive/negative) by natural language inference (NLI) for the remaining sentences, identifies the beginnings and ends of sentences using tokens in each of the selected sentences, and extracts key words in each sentence by utilizing a S/W engine that uses a key word combination algorithm based on segmentation techniques of a language model (LM) of natural language processing (NLP) to search for keywords, analyze global economic trends in issue articles, and predict risks. The words re-extracted in this way are combined according to English grammar, sentence structure, and word order, on the basis of the title of the issue article, to generate summary sentences and provide summary information on major overseas issue articles that affect the global value chain (GVC), centered on overseas media issue articles.
In order to achieve an aspect of the present disclosure, a web crawler system that crawls Internet articles and provides a summary service of issue articles affecting the global value chain includes a user terminal provided with a web agent that crawls and collects English issue articles of overseas media, searches keywords, and displays a summary of overseas media issue articles; and an overseas issue article summary web crawler system that crawls English issue articles of the overseas media by the web agent to store multiple news article lists/articles classified by article titles in a DB, and then searches for keywords, performs sentence classification (SC) on the issue articles, selects sentences containing key words of an item of interest in the issue articles, extracts words from each sentence, and provides a summary of overseas media issue articles that affect the global value chain (GVC) using a key word combination algorithm of segmentation techniques for each sentence according to English grammar, sentence format, and word order.
A web crawler system that crawls and collects Internet articles and provides a summary service of issue articles affecting the global value chain (GVC) in the present disclosure may crawl English issue articles from major global media including CNN USA, NYT (New York Times), WSJ (Wall Street Journal), Reuters UK, and the like on the Internet to classify news by an item of interest and store the article list/articles in a server DB. It may then extract key words using a key word combination algorithm based on segmentation techniques that divide English sentences into phrases/clauses/words in a language model (LM), and provide a summary of major issue articles in overseas media that affect the global value chain (GVC) on the basis of English articles of overseas media. It may also monitor Internet articles provided by overseas media outlets to analyze global economic trends in issue articles and provide a summary service of major overseas issue articles based on economic analysis in connection with key incidents in Internet articles, so as to allow government agencies, companies, and economic analysis organizations that need to predict risks to quickly discover and easily understand the key incidents in major issue articles that affect the global value chain (GVC) and corporate competitiveness.
The overseas issue article summary web crawler system may crawl English issue articles of overseas media on the Internet and store them in a server DB to detect and preemptively respond to risks in the global supply chain or global value chain (GVC) from an economic perspective. It may then monitor, in real time, issue articles containing various incidents occurring in the global value chain (GVC) and global supply chain required by clients, extract key word information related to an item of interest from issue articles of overseas media, and summarize the issue articles to provide a framework for an overseas media issue article summary service. In addition, this framework is configured to provide both a summary service of overseas media issue articles and an on-premise service using cloud computing. In the current situation where incidents and accidents are frequent in each country around the world and have a significant impact on the global value chain (GVC) and global supply chain, it will be a very useful solution for economic organizations that need to predict risks in the global value chain (GVC), provide an economic outlook to companies and customers who need to respond preemptively, and make decisions.
Hereinafter, the configuration and operation of preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
The present disclosure is not limited to the disclosed embodiments and may be implemented in various different forms by those skilled in the art. In describing the present disclosure, a detailed description of publicly known technologies or configurations related thereto will be omitted when it is determined that such a description may unnecessarily obscure the subject matter of the present disclosure. Additionally, in the accompanying drawings, the same reference numerals are given to the same elements even when they appear in different drawings.
However, it should be understood that the present disclosure is not limited to specific embodiments, but includes all modifications, equivalents, or substitutes included in the concept and technical scope of the present disclosure.
A web crawler system that crawls and collects Internet articles and provides a summary service of issue articles affecting the global value chain (GVC) crawls English issue articles from major global media including CNN USA, NYT (New York Times), WSJ (Wall Street Journal), Reuters UK, and the like on the Internet to classify news by article title and store the article list/articles in a server DB. The system then performs sentence classification (SC; a sentence set, sentence 1, 2, 3, . . . ) on the English issue articles, selects with a priority sentences containing key words of an item of interest from the sentence set, selects sentences related to the key words of the item of interest (positive/negative) by natural language inference (NLI) for the remaining sentences, identifies the beginnings and ends of sentences using tokens in each of the selected sentences, and extracts words in each sentence by utilizing a S/W engine that uses a key word combination algorithm based on segmentation techniques of a language model (LM) of natural language processing (NLP) to search for keywords, analyze global economic trends in issue articles, and predict risks. The re-extracted words are combined based on English grammar, sentence structure, and word order, on the basis of the title of the issue article, to generate summary sentences and provide a summary of issue articles from overseas media that affect the global supply chain and global value chain (GVC), centered on issue articles in overseas media from an economic perspective.
An overseas issue article summary web crawler system having a data processing process of providing a summary of major issue articles from major overseas media outlets includes a crawling module 100, an issue area definition and update module 200, an item-of-interest and region-of-interest input module 300, a GVC incident monitoring and trend analysis module 400, an issue article discovery module 500, and an information extraction and summary module 600.
The web crawling system, for example, when using a BERT algorithm or GPT-4 algorithm, performs sentence classification (SC; sentence 1, 2, 3, . . . ) on issue articles for each article list/articles classified by category of an item of interest, such as COVID-19, exchange rate, stock price, war, earthquake, and the like. It then selects, with priorities, sentences containing or relating to key words of the item of interest in the issue articles, selects sentences related to the key words of the item of interest (positive/negative) by natural language inference (NLI) for the remaining sentences, extracts sentences from the issue articles using tokens that indicate the beginnings and ends of the respective selected sentences, and extracts words from each sentence. The extracted words are then combined again, according to English grammar, sentence format, and word order, using a key word combination algorithm based on a segmentation technique that divides English sentences into phrases/clauses/words, to generate and provide a summary of the issue articles.
A web crawler system that crawls and collects Internet articles and provides a summary service of issue articles affecting the global value chain (GVC) according to the present disclosure may include:
The sentences in the issue articles are recognized at a “word-level.”
Sentence classification (SC) is performed on issue articles using BERT, and then words in sentences containing the selected key words of the item of interest are extracted, and then a summary of the issue articles is provided by using a key word combination algorithm of segmentation techniques.
For example, the key word combination algorithm of segmentation techniques uses a bidirectional encoder representation from a transformer (BERT) algorithm or generative pretrained transformer-4 (GPT-4) algorithm as a language model (LM) of natural language processing (NLP) to extract a summary of English issue articles in overseas media, and subsequent to sentence classification (SC) of the issue articles, selects sentences containing an item of interest (key words), removes the remaining sentences, and uses a segmentation technique of phrases/clauses/words of the selected English sentences and a key word combination algorithm.
For reference, Google BERT represents text, which is input data, by integrating bidirectional context based on an encoder of a transformer. In addition, a large corpus including sentences/words of news (issue articles) that do not include labels is pre-trained, and the pre-trained model is fine-tuned by using a database that includes labels of downstream tasks and is utilized in various downstream tasks. BERT simply recognizes the form of words or sentences.
A BERT-based natural language processing model processes a large corpus using a tokenizers library to which the WordPiece tokenization technique is applied.
Tokenizers prioritize words based on statistical values, build a vocabulary, and parse text into tokens.
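WordPiece-style tokenization at inference time can be approximated by a greedy longest-match sketch such as the following; the tiny vocabulary is a hypothetical illustration, not the actual vocabulary or implementation of any tokenizers library.

```python
# Greedy longest-match subword tokenization in the style of WordPiece
# inference (toy sketch with a hypothetical vocabulary).
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub       # longest matching subword found
                break
            end -= 1
        if piece is None:
            return [unk]          # no subword covers this span
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##afford", "##able", "afford", "able"}
print(wordpiece_tokenize("unaffordable", vocab))
# ['un', '##afford', '##able']
```

The greedy longest-match loop reflects how a statistically built vocabulary is used to parse text into subword tokens, as described above.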
In addition, a generative pre-trained transformer (GPT) is a natural language processing (NLP) language model (LM), which is available as GPT-3, GPT-3.5, or GPT-4 to summarize issue articles, and provides language translation, text generation, summarization, and question-answering services using artificial intelligence. GPT-4 is a multimodal large-scale language model that is pre-trained to predict a next token.
The GPT-4 API is a generative AI technology based on a GPT large-scale language model (LLM) that provides token-based text summaries of sentences in issue articles. GPT-4 may reinterpret any kind of text document and write a summary of issue articles on its own.
The user terminal may be a computer, laptop computer, or smartphone/tablet PC.
The web agent may include a crawling module implemented as a plug-in program in a web browser to crawl issue articles from the overseas media, classify a news list according to article titles, and store an article list/articles of the issue articles in a server DB of the overseas issue article summary web crawler;
Additionally, the web agent further includes a chat module that transmits and receives chat data between a journalist and a site operator via a chat server. In this case, the overseas issue article summary web crawler system further includes a chat server.
An AI summary result of overseas media issue articles is displayed as a news list classified by category of an item of interest and a region of interest on overseas media websites, and the summary result of overseas media issue articles is automatically generated at set intervals and displayed on the screen.
The overseas issue article summary web crawler system may include:
In the issue article discovery module 500, the step-by-step keyword includes an issue identification keyword, a field-of-interest identification keyword, and a field-of-interest change keyword, and articles including the issue identification keyword are extracted in a first step, and articles including the field-of-interest identification keyword are extracted from the articles selected in the first step in a second step. Articles including the field-of-interest change keyword are extracted from the articles extracted in the first and second steps in a third step.
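The three-step keyword filtering described above may be sketched as follows; the article texts, keyword lists, and function name are illustrative assumptions rather than the module's actual implementation.

```python
# Sketch of the three-step keyword filtering of the issue article
# discovery module; data and keywords are hypothetical examples.
def filter_articles(articles, issue_kw, field_kw, change_kw):
    # Step 1: articles containing an issue identification keyword.
    step1 = [a for a in articles if any(k in a for k in issue_kw)]
    # Step 2: of those, articles containing a field-of-interest keyword.
    step2 = [a for a in step1 if any(k in a for k in field_kw)]
    # Step 3: of those, articles containing a field-of-interest change keyword.
    step3 = [a for a in step2 if any(k in a for k in change_kw)]
    return step3

articles = [
    "Pandemic disrupts semiconductor supply, prices surge",
    "Pandemic update: case counts fall",
    "Semiconductor prices stable this quarter",
]
result = filter_articles(
    articles,
    issue_kw=["Pandemic"],
    field_kw=["semiconductor", "Semiconductor"],
    change_kw=["surge", "fall", "drop"],
)
# Only the first article passes all three steps.
```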
The issue article information extraction and summary module 600 of the overseas issue article summary web crawler system crawls the English issue articles of the overseas media by the web agent of the user terminal to store multiple news lists/news classified by article titles in a DB. It then classifies sentences (a sentence set), selects sentences containing key words of an item of interest from the sentence set including multiple sentences of the English issue articles, selects sentences containing the key words of the item of interest (positive/negative) by natural language inference (NLI) for the remaining sentences, identifies the beginnings and ends of sentences using tokens in the selected sentences to extract sentences, and extracts the words of each sentence. The extracted words are combined again, according to English grammar, sentence format, and word order, on the basis of the titles of the overseas media English issue articles, using a key word combination algorithm based on segmentation techniques that divide sentences into phrases/clauses/words in a language model (LM), to generate summary sentences and provide a summary of major overseas issue articles affecting the global value chain (GVC).
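The title-based sentence selection step may be illustrated with a minimal extractive-summary sketch; the sentence splitting, the overlap scoring, and the function name here are simplified assumptions rather than the disclosed S/W engine.

```python
# Minimal extractive-summary sketch: keep the sentences that share the
# most key words with the article title (illustrative simplification).
import re

def summarize(title, article, max_sentences=2):
    title_words = set(re.findall(r"[A-Za-z']+", title.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    # Score each sentence by word overlap with the title's key words.
    scored = [
        (len(title_words & set(re.findall(r"[A-Za-z']+", s.lower()))), i, s)
        for i, s in enumerate(sentences)
    ]
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    # Re-emit the selected sentences in their original article order.
    return " ".join(s for _, i, s in sorted(top, key=lambda t: t[1]))
```

A call such as `summarize("Chip supply shortage", article_text)` keeps the sentences most related to the title and drops unrelated ones, which mirrors the selection-then-recombination idea described above.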
Referring to
Crawling, which is a process of collecting content existing on a website (collecting through automated programming), is a technique of downloading HTML pages, storing them in a content storage, parsing HTML/CSS, and extracting only necessary data from the visited URL.
In an embodiment, the crawling module was implemented as a crawler written in Python, and used a technique of extracting only the necessary data from the received data by calling an open API of a service providing a REST API.
Additionally, a technique of programming with a browser automation tool such as Selenium to extract only the necessary data may be used.
A web crawler is an automated program that explores the World Wide Web (WWW), automatically downloading HTML data from website screens and storing it in a content storage.
Upon receiving seed URLs to start crawling, the web crawler first places them in a URL frontier. The URL frontier passes a URL to be explored to a fetcher, which has an HTML downloader; the fetcher downloads the HTML content of the webpage by means of the HTML downloader, stores the downloaded HTML content in the content storage, and passes it to a content parser. The content parser parses the downloaded HTML web page and finds other hyperlinks.

In this process, the web crawler finds URLs related to the seed URLs, finds other hyperlinks from those URLs, and continues to repeat this process. The HTML downloader downloads a web page from an Internet URL; since a URL must be converted to an IP address before downloading, the DNS resolver (domain name converter) performs the conversion, and the HTML downloader uses the DNS resolver to find the IP address corresponding to the URL and downloads the web page. A "Content Seen?" check determines whether the body content of the page has already been visited before the content is stored in the content storage, and a "URL Seen?" check determines whether the URL has already been visited. Since visiting a URL that has already been visited would cause an infinite loop, the URL filter removes already-visited and duplicate URLs and passes the remainder back to the URL frontier, and this process is repeated continuously.
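The frontier and "URL Seen?" loop described above may be sketched as follows, using in-memory pages in place of live HTTP downloads so the example stays self-contained; the URLs and page contents are illustrative.

```python
# Sketch of a crawl loop with a URL frontier, a "URL Seen?" check, and a
# content storage; PAGES stands in for the DNS resolver + HTML downloader.
from collections import deque
from html.parser import HTMLParser

PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a>',
    "http://example.com/a": '<a href="http://example.com/">home</a>'
                            '<a href="http://example.com/b">B</a>',
    "http://example.com/b": "<p>leaf page</p>",
}

class LinkParser(HTMLParser):
    """Content parser: collects hyperlinks from downloaded HTML."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href"]

def crawl(seed):
    frontier, seen, store = deque([seed]), set(), {}
    while frontier:
        url = frontier.popleft()
        if url in seen:              # "URL Seen?" check avoids infinite loops
            continue
        seen.add(url)
        html = PAGES.get(url, "")    # fetcher / HTML downloader stand-in
        store[url] = html            # content storage
        parser = LinkParser()
        parser.feed(html)
        frontier.extend(parser.links)  # new hyperlinks back to the frontier
    return store
```

Note how the cross-link from page "a" back to the seed URL is skipped by the seen-set, exactly the infinite-loop case the "URL Seen?" check guards against.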
The web crawler manages its crawling status with (1) URLs to download and (2) downloaded URLs. In particular, web pages should not be collected and stored indiscriminately: there are HTML pages that can be collected according to the rules written in robots.txt, and HTML pages that cannot be collected and are prohibited from unauthorized redistribution according to the copyright of overseas media articles. For reference, Google defines robots.txt as a file that instructs search engine crawlers which pages or files they can or cannot request from a site. The file defines the paths that the crawler may request and those it should not crawl. In addition, various items, such as the interval in seconds between requests, may be written in robots.txt.
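robots.txt rules can be checked with Python's standard-library parser; the rule file below is an illustrative example, not any real site's policy.

```python
# Checking robots.txt rules with the standard-library parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Ordinary article pages are allowed; the disallowed path is not.
print(rp.can_fetch("MyCrawler", "http://example.com/news/article1.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/data.html"))   # False
print(rp.crawl_delay("MyCrawler"))  # 10 (seconds between requests)
```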
The crawling module 100 includes four components: an overseas media collection target site URL, a crawling script, a preprocessing script, and collection cycle management. When a URL of an overseas media collection target site is registered, the system operates the crawling module and a crawling database. When a crawling script for an overseas media collection target site is registered with the system, it is executed on a preset collection cycle, and the result is stored in a database of the server of the web crawling system. A preprocessing script is then run to process the stored data, which is finally stored in the DB. Collection cycle management may utilize a service such as Cron provided by the operating system.
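One pass of the crawl → preprocess → store pipeline might look like the following sketch. In production the cycle would be driven by a Cron entry (for example, `0 */6 * * *` for every six hours); the function names, the in-memory SQLite database, and the toy scripts here are illustrative assumptions, not the module's actual components.

```python
import sqlite3

def run_collection_cycle(site_urls, crawl_script, preprocess_script, db):
    """Execute the registered crawling script for each target site URL,
    run the preprocessing script on the raw result, and store the final
    record in the server DB, as the crawling module 100 does each cycle."""
    db.execute("CREATE TABLE IF NOT EXISTS articles (url TEXT, title TEXT)")
    for url in site_urls:
        raw = crawl_script(url)          # registered crawling script
        record = preprocess_script(raw)  # registered preprocessing script
        db.execute("INSERT INTO articles VALUES (?, ?)", (url, record))
    db.commit()

db = sqlite3.connect(":memory:")  # stands in for the server's crawling DB
run_collection_cycle(
    ["https://example-media.com/world"],
    crawl_script=lambda u: "<h1> GVC Issue </h1>",
    preprocess_script=lambda raw: raw.replace("<h1>", "").replace("</h1>", "").strip(),
    db=db,
)
print(db.execute("SELECT title FROM articles").fetchone())  # ('GVC Issue',)
```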
The issue area definition and update module 200 performs, when various types of issues such as an infectious disease, a war, and a disaster occur, generation and upgrade of an issue area, and keyword addition and upgrade, wherein a keyword is used to define an issue area, and the keyword defining the issue area includes an issue identification keyword, a field-of-interest identification keyword, and a field-of-interest change keyword.
Table 2 illustrates examples of types of issues that can affect the global value chain (GVC); various types of issues may occur, such as an infectious disease, a war, and a disaster. The issue area is defined by using keywords. Table 3 defines the composition of issue definition keywords, comprising an issue identification keyword, a field-of-interest identification keyword, and a field-of-interest change keyword. Table 4 presents the evolution of a term corresponding to an issue identification keyword, using an infectious disease as an example. Table 5 shows an example of the process of generating a new issue area.
The issue area definition and update module 200 performs generation and upgrade of an issue area in Table 2, and keyword addition and upgrade.
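An issue area and its three keyword classes can be represented as a simple data structure, sketched below. The class name, keyword values, and `upgrade` helper are illustrative assumptions in the spirit of the keyword composition described above, not the module's actual dictionary.

```python
from dataclasses import dataclass

@dataclass
class IssueArea:
    """An issue area defined by the three keyword classes of the text."""
    name: str
    issue_identification_keywords: list
    field_of_interest_keywords: list
    field_of_interest_change_keywords: list

infectious_disease = IssueArea(
    name="infectious disease",
    issue_identification_keywords=["epidemic", "pandemic", "outbreak"],
    field_of_interest_keywords=["supply chain", "semiconductor", "logistics"],
    field_of_interest_change_keywords=["shortage", "disruption", "surge"],
)

def upgrade(area, new_terms):
    """Keyword addition/upgrade: as terms evolve (e.g. a new name for an
    infectious disease appears), they are appended without duplicates."""
    for term in new_terms:
        if term not in area.issue_identification_keywords:
            area.issue_identification_keywords.append(term)

upgrade(infectious_disease, ["endemic"])
print(infectious_disease.issue_identification_keywords)
```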
The item-of-interest and region-of-interest input module 300 selects an issue area related to the GVC, and receives an item of interest and a region of interest.
The GVC incident monitoring and trend analysis module 400 is implemented by counting the number of issue-related keywords and expressing the figures in a graph.
The issue article discovery module 500 performs the role of finding issue articles containing the step-by-step keywords defined in Table 3. The step-by-step keywords include an issue identification keyword, a field-of-interest identification keyword, and a field-of-interest change keyword. In a first step, articles including the issue identification keyword are extracted; in a second step, articles including the field-of-interest identification keyword are extracted from the articles selected in the first step; and in a third step, articles including the field-of-interest change keyword are extracted from the articles extracted in the first and second steps. It is up to the client to determine whether to finish extracting articles at the first step, or to proceed to the second or third step.
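The three-step filter can be sketched as follows; each step narrows the article set by one keyword class, and the `steps` parameter models the client's choice of how far to proceed. The function name and the sample articles are illustrative assumptions.

```python
def discover_issue_articles(articles, issue_kw, interest_kw, change_kw, steps=3):
    """Three-step narrowing of an article list, as performed by the issue
    article discovery module: issue identification keywords first, then
    field-of-interest keywords, then field-of-interest change keywords."""
    selected = [a for a in articles if any(k in a for k in issue_kw)]         # step 1
    if steps >= 2:
        selected = [a for a in selected if any(k in a for k in interest_kw)]  # step 2
    if steps >= 3:
        selected = [a for a in selected if any(k in a for k in change_kw)]    # step 3
    return selected

articles = [
    "Pandemic disrupts semiconductor supply chain, causing chip shortage.",
    "Pandemic restrictions ease in tourist regions.",
    "Car makers report record semiconductor orders.",
]
# All three steps: only the first article survives every keyword class.
print(discover_issue_articles(articles, ["Pandemic"], ["semiconductor"], ["shortage"]))
```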
The issue article information extraction and summary module 600 carries out a procedure of extracting information for the GVC, and summarizing it, from the issue articles discovered by the issue article discovery module 500. The information extracted for the GVC from the discovered issue articles is divided into key information (issue-centered information) that clients directly need and reference information (economic outlook information, market trend information, and supply chain trend information) that is useful for the clients to know (see Table 7).
It is up to the client to determine whether to check only the key information of the issue articles or to select reference information as well. Key information, the "issue-centered information," refers to information composed of sentences containing the issue, the item of interest, and the region of interest that clients are looking for. Reference information, which is information on economic trends, supply chain trends, and market trends, helps the clients make decisions when provided to them. Key information and reference information are both extracted from issue articles and are therefore all in sentence form.
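The split into key information and reference information can be sketched as a sentence-level keyword match. The keyword lists and the naive period-based sentence splitting below are assumptions for illustration; the actual module works on tokenized sentence sets as described elsewhere in this disclosure.

```python
def extract_information(article, key_keywords, reference_keywords):
    """Split an issue article's sentences into key information (sentences
    containing the issue / item-of-interest / region-of-interest keywords)
    and reference information (economic, market, supply-chain trend
    sentences), both returned in sentence form."""
    sentences = [s.strip() + "." for s in article.split(".") if s.strip()]
    key_info = [s for s in sentences if any(k in s for k in key_keywords)]
    reference_info = [s for s in sentences
                      if any(k in s for k in reference_keywords) and s not in key_info]
    return key_info, reference_info

article = ("The pandemic halted battery exports from Asia. "
           "Market analysts expect demand to recover next quarter. "
           "Local festivals resumed this week.")
key, ref = extract_information(
    article,
    key_keywords=["pandemic", "battery", "Asia"],
    reference_keywords=["Market", "demand", "supply chain"],
)
print(key)  # issue-centered sentence
print(ref)  # market-trend sentence
```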
Table 2 is a table showing “Examples of Issue Areas.”
Table 3 is a table describing “Composition of Keywords Defining Issue Areas.”
A keyword defining an issue area includes an issue identification keyword, a field-of-interest identification keyword, and a field-of-interest change keyword.
Table 4 is a table showing “Examples of Evolution of Term.”
Table 5 is a table showing “Examples of Issue Area Generation Process.”
Table 6 defines a structure of a sentence to divide the sentence into a complete sentence, an incomplete sentence, and a general sentence. The grammar and parts of speech for constructing a sentence are developed in the form of an index in Table 8. In a news list/article, multiple English sentences making up an issue article have a subject, a verb, an indirect object/direct object, an adjective, a noun, an adverb, and a complement. The type of phrase includes a noun phrase, an adjective phrase, and an adverbial phrase, and there are a noun clause, an adjective clause, and an adverbial clause. The sentence combination framework is used to summarize the extracted issue articles. The sentence combination framework is written in English and may be additionally applied to Korean issue articles.
Table 6 is a table describing the "Structure of Sentence."
Table 7 is a table describing “Type of Information Extracted Based on Sentence Structure.”
Table 8 is a table summarizing “Types of Parts of Speech in Language of Issue Articles in Terms of GVC Economics”. The type of an index that constitutes the grammar of the key word combination algorithm may include a market player, a market role, a market increase and decrease, a supply chain, a supply chain status change, a number and size, a ratio, a currency indicator, a date and time, and an item may be additionally set as needed.
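The index of the key word combination algorithm's grammar can be sketched as a mapping from index types to example terms, used to tag the words extracted from each sentence. The example words below are assumptions in the spirit of Table 8, not the system's actual dictionary.

```python
# Illustrative index types from Table 8; each maps to sample surface forms.
INDEX_TYPES = {
    "market_player":            ["Tesla", "Samsung", "OPEC"],
    "market_increase_decrease": ["surged", "fell", "dropped"],
    "supply_chain_status":      ["shortage", "disruption", "backlog"],
    "number_and_size":          ["20", "million"],
    "ratio":                    ["%"],
    "currency_indicator":       ["$", "USD"],
}

def tag_words(sentence):
    """Tag each word of an extracted sentence with its index type, if any;
    untagged words (articles, prepositions, etc.) are simply omitted."""
    tags = {}
    for word in sentence.replace(",", "").split():
        for index_type, examples in INDEX_TYPES.items():
            if any(e in word for e in examples):
                tags[word] = index_type
    return tags

print(tag_words("Tesla shares fell 20% on battery shortage"))
```

The tagged words are what the key word combination algorithm later reassembles, according to English grammar and word order, into summary sentences.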
Table 9 is a table describing a composition and example of “Korean Sentence Combination Template.”
Table 10 is a table describing a composition and example of the "Sentence Combination Framework" in English.
Tables 11 and 12 are tables showing "Example of Issue Item, Issue Article Discovery, Key Sentence Extraction, and Summary" (Table 11: Issue Item-Infectious Disease; Table 12: Issue Item-Electric Vehicle Batteries).
A web crawler system that crawls Internet articles and provides a summary service of issue articles affecting the global value chain (GVC) crawls English issue articles from major global media on the Internet, including CNN USA, NYT (New York Times), WSJ (Wall Street Journal), Reuters UK and the like, classifies news by article title, and stores the article list/articles in a server DB. It then performs sentence classification (SC, multiple sentence sets) on overseas media issue articles and, by utilizing a S/W engine to analyze global economic trends in overseas media issue articles and predict risks, selects sentences containing key words of an item of interest from a sentence set including multiple sentences of English issue articles, and selects sentences containing the key words of the item of interest (positive/negative) from the remaining sentences by natural language inference (NLI). It identifies the beginnings and ends of sentences using tokens in each of the selected sentences, extracts the sentences, and extracts key words in each sentence. Finally, using a segmentation technique of a language model (LM) of natural language processing (NLP) that segments English sentences into phrases/clauses/words, together with a key word combination algorithm of the segmentation techniques, it combines the re-extracted words according to English grammar, sentence structure, and word order, on the basis of the title of the issue article, to generate summary sentences and thereby provide a summary of issue articles of overseas media that affect the global supply chain and global value chain (GVC).
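The title-anchored selection step of this pipeline can be sketched as follows. Sentences sharing key words with the article title are scored and the top ones are joined, in original order, into a summary; the scoring scheme and function name are assumptions for illustration, and the actual system additionally applies the LM segmentation and key word combination algorithm described above.

```python
def summarize(title, article, max_sentences=2):
    """Select the sentences that share the most key words with the issue
    article's title and join them, in article order, into a summary."""
    title_words = {w.lower().strip(".,") for w in title.split() if len(w) > 3}
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    scored = [(sum(w.lower().strip(".,") in title_words for w in s.split()), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:max_sentences]   # best-scoring sentences
    return ". ".join(s for _, i, s in sorted(top, key=lambda t: t[1])) + "."

title = "Battery shortage slows electric vehicle production"
article = ("A global battery shortage is slowing electric vehicle production. "
           "Analysts point to raw material costs. "
           "Automakers expect the battery shortage to ease next year.")
print(summarize(title, article))
```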
The overseas issue article summary web crawler system may crawl English issue articles of overseas media outlets on the Internet and store them in a server DB so as to detect, and preemptively respond to, risks that an issue poses to the global supply chain or global value chain (GVC) from an economic perspective. It may then monitor, in real time, issue articles covering the various incidents occurring in the global value chain (GVC) and global supply chain that clients require, extract key word information from the issue articles of overseas media, and summarize the issue articles, thereby providing a framework for the service. Furthermore, this framework is configured to provide both a cloud computing issue article summary service and an on-premise service. In the current situation, where incidents and accidents are frequent in countries around the world and have a significant impact on the global value chain (GVC) and global supply chain, it will be a very useful solution for economic organizations that need to predict risks in the global value chain (GVC), provide an economic outlook on issues to companies and customers who need to respond preemptively, and make decisions.
Embodiments according to the present disclosure may be implemented in the form of program commands that can be executed through various computer elements, and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, individually or in combination. The computer-readable recording medium may include a hardware device configured to store and execute program instructions, such as magnetic media including hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and storage media such as ROMs, RAMs, flash memories, storages, servers, and the like. Examples of program instructions include machine codes, such as those produced by a compiler, as well as high-level language codes that can be executed by a computer using an interpreter. The hardware device may be configured to operate as one or more software modules in order to perform the operation of the present disclosure.
The method of the present disclosure may be implemented as a program and stored in a recording medium (CD-ROM, RAM, ROM, memory card, hard disk, magneto-optical disk, storage device, etc.) in a form that can be read using computer software.
As described above, although the present disclosure has been described with reference to specific embodiments, the present disclosure is not limited to the configuration and operation of the specific embodiments that illustrate the technical concept described above, and may be implemented with various modifications without departing from the technical concept and scope of the present disclosure; the scope of the present disclosure should be determined by the claims described below.
| Number | Date | Country | Kind |
|---|---|---|---|
| 10-2023-0156844 | Nov 2023 | KR | national |