This application claims priority from Republic of Korea Patent Application No. 10-2023-0084078, filed on Jun. 29, 2023, which is hereby incorporated by reference in its entirety.
The present disclosure relates to a method and apparatus for training a language model to improve performance of the language model in machine reading comprehension by training the language model in consideration of linguistic characteristics of the Korean language, which is verb-centered, and to a recording medium for recording the same.
Machine Reading Comprehension (MRC) refers to a technology in which an artificial intelligence (AI) algorithm independently analyzes a problem and finds an optimized answer to a question. General methods for training a language model for machine reading comprehension include a process of pre-training the language model on a large corpus through unsupervised learning and then training the language model on a dataset classified by domain through supervised learning.
However, labeled data used for fine-tuning a pre-trained language model require a significant amount of time and cost for labeling. If a task is domain-specific, it is more difficult to obtain the data.
In addition, unlike English, which is a noun-centered language, verb-centered languages such as Korean use special honorific vocabulary, sentence-ending particles, and auxiliary words that clearly reflect social relationships. Due to these characteristics, styles of Korean are distinguished based on a system of sentence-ending particles, rather than merely on levels of formality. Therefore, adaptive pre-training of a Korean language model requires research from a stylistic perspective.
The present disclosure has been conceived under the above technical background, aiming to improve performance of a language model by applying style-based adaptive pre-training, taking into account unique linguistic characteristics of verb-centered languages such as Korean, which differ from those of English.
A training method according to an embodiment of the present disclosure proceeds through three steps to perform style-based adaptive pre-training. As shown in
Here, the first training dataset is a large corpus created regardless of the domain and the styles.
In the second step, the pre-trained language model may be re-trained through unsupervised learning.
The second training dataset may have a domain identical to a domain of the third training dataset.
The second training dataset may be different from the first training dataset but have a domain identical to a domain of the third training dataset.
In another embodiment of the present disclosure, there is disclosed a computing device and a recording medium for implementing the above-described training method.
According to the present disclosure, because a reading comprehension process requires understanding of style as well as text, style-based adaptive pre-training is applied to machine reading comprehension, enabling the language model to effectively perform machine reading comprehension by adaptively responding to different styles. In addition, unlike fine-tuning data, which requires labeling, unlabeled data is additionally used in a pre-training step, which saves time and cost.
Through the present disclosure, beyond conventional adaptive pre-training, which focuses on matching the fine-tuning data and its domain, the data used for adaptive pre-training can also be matched to the writing style of the fine-tuning data, thereby promoting greater performance enhancement.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the drawings. In the following description and the accompanying drawings, however, well known techniques may not be described or illustrated in detail to avoid obscuring the subject matter of the present disclosure. In addition, throughout the specification, “including” a certain component does not mean excluding other components unless specifically stated to the contrary, but rather means that other components may be further included.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are used to distinguish one element from another. For example, a first element could be termed a second element, and a second element could be termed a first element, without departing from the scope of the present inventive concept.
Terms used in the present disclosure are only used to describe specific embodiments and are not intended to limit the present disclosure. As used herein, the singular forms “a”, “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprise”, “include”, “have”, etc. when used in this specification, specify the presence of stated features, integers, steps, operations, elements, components, and/or combinations of them but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or combinations thereof.
Unless otherwise defined, all terms including technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Labeled data used for fine-tuning a pre-trained language model requires a significant amount of time and cost for labeling. If a task is domain-specific, the data is even more difficult to obtain. Therefore, in the present disclosure, before fine-tuning a language model that has been pre-trained through unsupervised learning, the model is additionally re-trained to enhance fine-tuning performance. This adaptive pre-training involves further pre-training on an unlabeled dataset that is either the dataset used for fine-tuning (task-adaptive) or a dataset with a domain similar to that of the fine-tuning dataset (domain-adaptive), so that the language model adapts to the fine-tuning data.
In addition, unlike English, which is a noun-centered language, verb-centered languages such as Korean use special honorific vocabulary, sentence-ending particles, and auxiliary words that clearly reflect social relationships. Due to these characteristics, styles of verb-centered languages are distinguished based on a system of sentence-ending particles, rather than merely on levels of formality. Therefore, adaptive pre-training of a language model for a verb-centered language requires research from a stylistic perspective. Considering that a reading comprehension process requires understanding of texts, including style, the present disclosure relates to adaptive pre-training in machine reading comprehension (MRC) from the stylistic point of view of a verb-centered language, for example, Korean.
Hereinafter, an embodiment of the present disclosure will be described in detail. The description of the embodiment below is directed to Korean, but the present disclosure is not limited thereto and can be equally applied to verb-centered languages that have linguistic characteristics similar to those of Korean.
The training method according to an embodiment of the present disclosure proceeds through three steps to perform style-based adaptive pre-training. As shown in
Here, the first training dataset is a large corpus created regardless of the domain and the styles.
In the second step, the pre-trained language model may be re-trained through unsupervised learning.
The second training dataset may have a domain identical to a domain of the third training dataset.
The second training dataset may be different from the first training dataset but have a domain identical to a domain of the third training dataset.
The configuration of the present disclosure as described above may be more specifically understood through experiments and results thereof, as described below.
In the training method of the present disclosure, the first step uses a pre-trained language model. Since style-based adaptive pre-training is performed to reflect linguistic characteristics of Korean, a Korean language model is used. In the second step, the pre-trained language model is re-trained using data with the same style as that of data used for fine-tuning. In this case, unlabeled text data is used in the second step which corresponds to adaptive pre-training. In the third step, fine-tuning is performed on the re-trained language model using labeled datasets.
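The ordering of the three steps can be sketched as a small planning routine. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the `Corpus` and `Stage` records, the `build_plan` function, and the corpus names are all hypothetical, and no actual model training is performed. The sketch only shows that the second step selects unlabeled data whose style matches the labeled data of the third step.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Corpus:
    name: str
    style: Optional[str]  # None: general corpus collected regardless of style
    labeled: bool


@dataclass(frozen=True)
class Stage:
    step: int
    learning: str  # "unsupervised" or "supervised"
    corpus: Corpus


def build_plan(
    general: Corpus,
    unlabeled_by_style: Dict[str, Corpus],
    fine_tune: Corpus,
) -> List[Stage]:
    """Return the three stages in order (hypothetical helper)."""
    if not fine_tune.labeled:
        raise ValueError("step 3 (fine-tuning) needs a labeled dataset")
    adaptive = unlabeled_by_style.get(fine_tune.style)
    if adaptive is None:
        raise KeyError(f"no unlabeled corpus with style {fine_tune.style!r}")
    if adaptive.labeled:
        raise ValueError("step 2 (adaptive pre-training) uses unlabeled text")
    return [
        Stage(1, "unsupervised", general),    # pre-training
        Stage(2, "unsupervised", adaptive),   # style-matched re-training
        Stage(3, "supervised", fine_tune),    # fine-tuning
    ]


# Hypothetical corpora, named for illustration only.
general = Corpus("general_corpus", None, labeled=False)
unlabeled = {
    "administrative": Corpus("admin_sentences", "administrative", labeled=False),
    "news": Corpus("news_sentences", "news", labeled=False),
}
mrc = Corpus("admin_mrc_qa", "administrative", labeled=True)

plan = build_plan(general, unlabeled, mrc)
for stage in plan:
    print(stage.step, stage.learning, stage.corpus.name)
```

Keeping the style lookup separate from the training code makes the style-matching decision explicit and easy to vary, for example when comparing matched and mismatched styles.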
In one embodiment, three styles are used to conduct adaptive pre-training from a Korean stylistic point of view. The first style is the written style of administrative documents (hereinafter referred to as “administrative style”). Administrative documents must express content clearly and concisely, with hierarchical spacing according to document writing rules, and often use nominal endings and nouns. The second style is the written style of news articles (hereinafter referred to as “news style”). The news style uses declarative sentence endings. The third style is the colloquial style of video comments (hereinafter referred to as “online colloquial style”). The online colloquial style is composed of informal spoken language forms. Examples of the aforementioned styles are shown in Table 1.
Hereafter, the practical effects of the training method according to an embodiment of the present disclosure are examined through experiments, and the experimental results are explained.
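To make the role of sentence endings concrete, the following toy heuristic guesses a style from the last characters of a Korean sentence. It is only a sketch under loud assumptions: the ending lists, the `guess_style` function, and the example sentences are invented for illustration and are not taken from the disclosure or from Table 1. Such a rule misclassifies many real sentences (for instance, formal polite endings in "-습니다" also end in "다") and is not how the datasets in the experiments were labeled.

```python
import unicodedata

# Hypothetical ending lists, chosen for illustration only.
COLLOQUIAL_ENDINGS = ("요", "죠", "ㅋ", "ㅎ", "ㅠ")
NOMINAL_ENDINGS = ("함", "임", "음", "됨")
DECLARATIVE_ENDINGS = ("다",)


def guess_style(sentence: str) -> str:
    """Guess a style label from the sentence ending (toy heuristic)."""
    # Compose Hangul syllables so that suffix comparison works on NFD input too.
    text = unicodedata.normalize("NFC", sentence).rstrip(" .!?~")
    if text.endswith(COLLOQUIAL_ENDINGS):
        return "online_colloquial"
    if text.endswith(NOMINAL_ENDINGS):
        return "administrative"
    if text.endswith(DECLARATIVE_ENDINGS):
        return "news"
    return "unknown"


# Invented example sentences.
print(guess_style("예산을 집행함."))              # nominal ending
print(guess_style("정부는 오늘 대책을 발표했다."))  # declarative ending
print(guess_style("이 영상 진짜 재밌어요 ㅋㅋ"))    # informal spoken form
```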
The adaptive pre-training used in the experiment, the experimental design, the datasets, and the models will be explained with reference to
Adaptive Pre-Training from Stylistic Point of View (Second Step)
Existing machine reading comprehension involves fine-tuning the pre-trained Korean language model KoELECTRA, which is trained on news, Wikipedia, and Namuwiki data, with administrative style data using the machine reading comprehension model Retrospective Reader, as shown in (1) of
In the style-based adaptive pre-training (second step), additional pre-training (or re-training) is conducted with data of the same style as that of data used for fine-tuning, as shown in (3) of
To analyze the importance of style-based adaptive pre-training, the first experiment is conducted as shown in
For the administrative style, machine reading comprehension data from AI HUB's administrative document dataset is used. From this data, the science and public administration domains are used, and the machine reading comprehension question types used for fine-tuning include an answer boundary extraction type, a procedural (method) type, and an unanswerable type. For the news style, the IT and science domains of AI HUB's news article machine reading comprehension data are used. For the online colloquial style, science domains of AI HUB's online colloquial corpus data are used. A total of 32,976 question-answer pairs from administrative style data are used for fine-tuning in all experiments. The first experiment's adaptive pre-training utilizes 50,000 sentences each from news and administrative documents, and the second experiment's adaptive pre-training utilizes 5,000 sentences each from news and administrative documents, along with 9,000 sentences from the online colloquial style, taking sentence length into account.
KoELECTRA-small-v2 is used as the Korean pre-trained language model. To ensure that administrative documents are excluded from pre-training, version 2 of KoELECTRA, which is pre-trained on approximately 14 GB of data from news, Wikipedia, and Namuwiki, is utilized. The machine reading comprehension in this experiment utilizes a span extraction type, which finds the answer to a question within a paragraph, and an unanswerable question type, which determines whether a question can be answered or not. The Retrospective Reader (Retro-Reader) is used as the machine reading comprehension model. The Retro-Reader is a model that mimics human reading comprehension and shows its best performance when used with the ELECTRA model. For performance evaluation, the Exact Match (EM) metric, which assesses whether a predicted answer exactly matches the correct answer, and the F1 score, which measures answer correctness on a token-by-token basis, are utilized.
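The text does not state how sentence length is taken into account. One possible reading, offered here purely as an assumption, is that more sentences are drawn from a style whose sentences are shorter so that the total amount of text per style is comparable. The sketch below illustrates that reading with a hypothetical `sample_to_char_budget` function; the character budget, the seed, and the synthetic sentences are invented for illustration.

```python
import random
from typing import List, Sequence


def sample_to_char_budget(
    sentences: Sequence[str], char_budget: int, seed: int = 0
) -> List[str]:
    """Randomly pick sentences until a total character budget is reached."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    order = list(sentences)
    rng.shuffle(order)
    picked: List[str] = []
    used = 0
    for sentence in order:
        if used + len(sentence) > char_budget:
            break  # stop at the first sentence that would exceed the budget
        picked.append(sentence)
        used += len(sentence)
    return picked


# Synthetic stand-ins: longer written sentences and shorter colloquial ones.
written = ["w" * 20 for _ in range(100)]
colloquial = ["c" * 10 for _ in range(100)]

print(len(sample_to_char_budget(written, 200)))     # fewer, longer sentences
print(len(sample_to_char_budget(colloquial, 200)))  # more, shorter sentences
```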
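The two evaluation metrics can be illustrated with a short sketch in the style commonly used for extractive question answering. The tokenization here (splitting on whitespace) and the handling of empty answers for unanswerable questions are assumptions made for illustration; the source does not specify the tokenizer or normalization used in the experiments, and the function names are hypothetical.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse runs of whitespace; a deliberately minimal normalization."""
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Assumption: an empty string stands for "unanswerable", so the
        # score is 1.0 only when both sides are empty.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it occurs on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("a b", "a  b"))
print(token_f1("a b c", "a b"))
```

In the second call, two of the three predicted tokens appear in the reference, so precision is 2/3, recall is 1, and the F1 score is 0.8, while the exact match for the same pair would be 0.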
The experimental results are as follows.
The results of the first experiment are shown in Table 2. Among the three experiments (1), (2), and (3) of
The results of the second experiment, where adaptive pre-training (second step) is performed on each style, are shown in Table 3. The adaptive pre-training in (3) of Table 3, in which the style matches that of the fine-tuning data, shows the best performance. In addition, adaptive pre-training on the online colloquial style in (1) shows lower performance than adaptive pre-training on the news style in (2). Considering that both the news and administrative styles are written styles, the performance of adaptive pre-training may differ significantly depending on the similarity between Korean language styles.
In the present disclosure, style-based adaptive pre-training is conducted in machine reading comprehension, considering linguistic characteristics of Korean. Significant performance improvement is observed in an experiment where the language model is additionally pre-trained on data with the same style as that of the machine reading comprehension data. This indicates that style-based adaptive pre-training improves the performance of a language model.
A computing device 800 includes a memory 830 storing a language model 831 for machine reading comprehension, and a processor 810 for executing the language model to infer results from inputs.
The language model 831 is pre-trained using a first training dataset through unsupervised learning, re-trained using a second training dataset with distinguished styles through unsupervised learning, and then fine-tuned using a third training dataset classified by domain through supervised learning.
The first training dataset may be a large corpus created regardless of a domain and style.
The second training dataset may have the same domain as that of the third training dataset, or may be different from the first training dataset and have the same domain as that of the third training dataset.
Meanwhile, the present disclosure may be implemented as computer readable codes on a computer-readable recording medium. The computer-readable recording medium may be any data storage device that may store data which may be thereafter read by a computer system.
Examples of the computer-readable recording medium include read only memory (ROM), random access memory (RAM), compact disk-read only memory (CD-ROM), magnetic tapes, floppy disks, optical data storage devices, etc. The computer-readable recording medium may also be distributed over network-coupled computer systems so that the computer-readable code may be stored and executed in a distributed fashion. Also, functional programs, codes, and code segments for accomplishing the present disclosure may be easily construed by programmers skilled in the art to which the present disclosure pertains.
In the above, various embodiments of the present disclosure have been described. It will be understood by one of ordinary skill in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as defined by the following claims. Therefore, the embodiments should be considered in a descriptive sense only and not for purposes of limitation. The scope of the present disclosure is defined not by the detailed description of the disclosure but by the following claims, and all differences within the scope will be construed as being included in the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
10-2023-0084078 | Jun 2023 | KR | national |