This application claims priority from Korean Patent Application No. 10-2021-0194405, filed on Dec. 31, 2021, in the Korean Intellectual Property Office, and all the benefits accruing therefrom under 35 U.S.C. 119, the contents of which are herein incorporated by reference in their entirety.
The present disclosure relates to a method of generating a task model based on meta-learning, a method of generating text embeddings for few-shot data, and an apparatus implementing the same. More particularly, the present disclosure relates to a meta-learning-based method of generating a task model and text embeddings related thereto, a method of generating text embeddings for few-shot data, and an apparatus implementing the same.
In the recent natural language processing and text analysis fields, a commonly used approach to utilizing a large-scale pre-trained language model is to secure a sufficient amount of labeled data from a target domain and to tune the language model itself to the target domain, thereby performing transfer learning suited to that domain.
However, in an actual environment, securing a sufficient amount of labeled data is difficult because of the considerable cost and time required to obtain such data from a new domain.
Moreover, a small amount of labeled data, i.e., few-shot labeled data, is simply insufficient for adequately tuning a large-scale language model to a target domain. In particular, in the classification of few-shot text data, where the classification model must be trained with only a small amount of training data, transfer learning on a large-scale language model is very difficult to perform. Additionally, in the classification of few-shot text data, when a new domain or a new label class is encountered for the first time, the large-scale language model cannot be properly tuned to the problem to be solved. Consequently, actively utilizing pre-trained large-scale language models such as BERT and GPT-3 is difficult in the classification of few-shot text data.
Additionally, in performing the text classification based on the pre-trained language model, the text embedding vector provided by the language model is mainly transmitted as an input to the classification model.
More specifically, when text data is inputted into the language model, a text embedding vector is generated for that text data, and the generated text embedding vector is inputted into the classification model. The classification model then generates and outputs a classification result for the inputted text embedding vector.
In this process, since the classification model generates a classification result according to the text embedding vector received from the language model, the classification result depends highly on the quality of the text embeddings.
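By way of a non-limiting illustration only, the conventional pipeline described above may be sketched in Python as follows; the pre-trained model name, the linear classification model, and the number of classes are assumptions for illustration and do not form part of the disclosure.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative sketch: a pre-trained language model produces a text embedding
# vector, which a separate classification model consumes (assumed model name and head).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
language_model = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(language_model.config.hidden_size, 5)  # assumed 5 classes

inputs = tokenizer(["example input text"], return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    outputs = language_model(**inputs)
text_embedding = outputs.last_hidden_state[:, 0]   # [CLS] token vector as the text embedding
logits = classifier(text_embedding)                # classification result depends on this embedding
```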
Therefore, when classifying text data by using a language model, there is a need to generate text embeddings well adapted to data in a new domain without the large-scale language model itself having to be tuned to that domain. Additionally, there is a need for a technique capable of generating such text embeddings and a task model related thereto from only a small amount of text data rather than a sufficiently secured amount.
Aspects of the present disclosure address the classification of few-shot text data and provide a method of generating a task model based on meta-learning, a method of generating text embeddings for few-shot data, and an apparatus implementing the same, which are capable of generating text embeddings well adapted to a new domain without the large-scale language model itself having to be tuned to that domain.
Other aspects of the present disclosure address the classification of few-shot text data and provide a meta-learning-based task-model generating method, a method of generating text embeddings for few-shot data, and an apparatus implementing the same, which are capable of generating, based on meta-learning, a task model well adapted to a new domain by training the task model with data in various domains.
However, aspects of the present disclosure are not restricted to those set forth herein. The above and other aspects of the present disclosure will become more apparent to one of ordinary skill in the art to which the present disclosure pertains by referencing the detailed description of the present disclosure given below.
According to an aspect of the present disclosure, there is provided a method performed by a computing device, including calculating a task-adaptation loss of a task model, the calculating the task-adaptation loss being based on a result of training the task model by using a training data set, updating the task model based on the task-adaptation loss, calculating a meta-optimization loss of the updated task model by using a validation data set, and further updating the updated task model based on the meta-optimization loss.
According to another aspect of the present disclosure, there is provided a method performed by a computing device for generating text embeddings for few-shot data, the method including generating token embeddings and class (CLS) embeddings by inputting few-shot text data into a language model, generating feature information corresponding to a domain of the few-shot text data by inputting the CLS embeddings into a relation network and a gating network, and generating the text embeddings corresponding to the domain of the few-shot text data by synthesizing the token embeddings and the feature information.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable medium storing a computer program including computer-executable instructions for causing, when executed in a computing device, the computing device to perform the above-described methods.
According to yet another aspect of the present disclosure, there is provided a computing device including one or more processors, a communication interface configured to communicate with external devices, a memory configured to load a computer program that is executed by the one or more processors, and a storage configured to store the computer program. The computer program includes computer-executable instructions for causing, when executed in the computing device, the computing device to calculate a task-adaptation loss of a task model, the calculating the task-adaptation loss being based on a result of training the task model by using a training data set, update the task model based on the task-adaptation loss, calculate a meta-optimization loss of the updated task model by using a validation data set, and further update the updated task model based on the meta-optimization loss.
According to yet another aspect of the present disclosure, there is provided a computing device including one or more processors, a communication interface configured to communicate with external devices, a memory configured to load a computer program that is executed by the one or more processors, and a storage configured to store the computer program. The computer program includes computer-executable instructions for causing, when executed in the computing device, the computing device to perform method steps that include inputting few-shot text data into a language model to generate token embeddings and class (CLS) embeddings, generating feature information corresponding to a domain of the few-shot text data by inputting the CLS embeddings into a relation network and a gating network, and synthesizing the token embeddings and the feature information to generate the text embeddings corresponding to the domain of the few-shot text data.
The above and other aspects and features of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:
Hereinafter, preferred embodiments of the present disclosure will be described with reference to the attached drawings. The advantages and features of the present disclosure and methods of accomplishing the same may be understood more readily by reference to the following detailed description of preferred embodiments and the accompanying drawings. The present disclosure may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the disclosure to those skilled in the art, and the present disclosure will only be defined by the appended claims.
In adding reference numerals to the components of each drawing, it should be noted that the same reference numerals are assigned to the same components as much as possible even though they are shown in different drawings. In addition, in describing the present disclosure, when it is determined that the detailed description of the related well-known configuration or function may obscure the gist of the present disclosure, the detailed description thereof will be omitted.
Unless otherwise defined, all terms used in the present specification (including technical and scientific terms) may be used in a sense that can be commonly understood by those skilled in the art. In addition, the terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. In this specification, the singular also includes the plural unless the context clearly indicates otherwise.
In addition, in describing the components of this disclosure, terms such as first, second, A, B, (a), and (b) can be used. These terms are only for distinguishing one component from other components, and the nature or order of the components is not limited by the terms. If a component is described as being “connected,” “coupled,” or “contacted” to another component, that component may be directly connected to or contacted with that other component, but it should be understood that yet another component may also be “connected,” “coupled,” or “contacted” between the two components.
The terms “comprise”, “include”, “have”, etc. when used in this specification, specify the presence of stated features, integers, steps, operations, elements, components, and/or combinations of them but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or combinations thereof.
Hereinafter, some embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
First, when an input text is inputted to a few-shot text embedding generator 11, the computing device 100 uses a meta information dictionary 19 to generate and output, from the few-shot text embedding generator 11, text embeddings 12 adapted to the domain of the input text. The generated text embeddings 12 are then inputted to a task model 13, which thereby outputs a task model 14.
The computing device 100 performs a two-step update operation for meta-learning on the outputted task model 14. To this end, the computing device 100 may select either an inner update mode or an outer update mode as the update mode in step 141. The inner update is an operation performed by a meta learner to adapt the task model 14 to a meta-train data set, and the outer update is an operation performed to evaluate, by using a meta-validation data set, whether the task model 14 has adapted well, and to train the meta learner by reflecting the evaluation in the meta learner. Here, the meta learner may be a neural network model used when training the task model. The meta-train data set and the meta-validation data set may each include a plurality of domain-specific text data. However, the meta-train data set and the meta-validation data set may consist of data in the same domain.
When the computing device 100 selects the inner update mode in step 141 as the first step for meta-learning, the meta learner calculates a task-adaptation loss in step 15 by using the meta-train data set. At this time, the computing device 100 performs an update operation step 16 on the task model 14 by using the calculated task-adaptation loss 15.
The update operation step 16 is to update the parameters constituting the task model 14, and the calculated task-adaptation loss 15 may be used to determine how much to update the parameters. The updating of the parameters of the task model 14 may use gradient descent, which reduces the error by moving the parameters along the gradient of a loss function.
When the computing device 100 selects the outer update mode in step 141 as the second step for meta-learning, the meta learner calculates a meta-optimization loss in step 17 by using the meta-validation data set. At this time, the computing device 100 performs an additional update operation step 18 on the task model 14 by using the calculated meta-optimization loss 17.
The additional update operation step 18 is to further update the parameters constituting the task model 14 updated through the update operation step 16, and the calculated meta-optimization loss 17 may be used to determine how much to update the parameters. The further updating of the parameters of the task model 14 may use the gradient descent previously used in the update operation step 16.
When performing the additional update operation step 18, along with the operation of further updating the parameters of the task model 14, the computing device 100 may update the meta information dictionary 19 and the parameters of the few-shot text embedding generator 11.
On the other hand, the task-adaptation loss 15 and the meta-optimization loss 17 may be calculated based on the cross-entropy loss function as shown in Equation 1 below.

$$\mathcal{L} \;=\; -\frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{C} t_i \,\log s\big(f(\varepsilon_n)\big)_i \qquad \text{[Equation 1]}$$

wherein $\varepsilon_n$ is the text embedding corresponding to the domain of the n-th text data, $f(\cdot)$ is the task model, $s(\cdot)$ is the softmax function, $t_i$ is the probability of the i-th class appearing, $C$ is the number of classes, and $N$ is the number of data items.
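As a minimal illustrative sketch only, Equation 1 corresponds to the standard cross-entropy loss as provided, for example, by PyTorch; the linear task model, the dimensions, and the random data below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of Equation 1: cross-entropy over N data items and C classes.
N, C, d = 8, 5, 768                               # assumed sizes
task_model = torch.nn.Linear(d, C)                # stands in for f(.)
embeddings = torch.randn(N, d)                    # stands in for the text embeddings eps_n
targets = torch.randint(0, C, (N,))               # true class indices (the one-hot t_i)
loss = F.cross_entropy(task_model(embeddings), targets)  # applies softmax s(.) and averages over N
```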
Additionally, in the update operation step 16, the parameters of the task model 14 may be updated through a calculation process such as Equation 2 below, and in the additional update operation step 18, the parameters of the task model 14 may be updated through a calculation process such as Equation 3 below.
$$\theta' \;=\; \theta \;-\; \alpha \,\nabla_{\theta}\, \mathcal{L}\big(f_{\theta}(\varepsilon^{\mathrm{train}})\big) \qquad \text{[Equation 2]}$$

wherein $\varepsilon^{\mathrm{train}}$ is the text embeddings generated by using the meta-train data set, $\theta$ is the parameters constituting the task model, and $\alpha$ is a learning rate.
$$\phi \;\leftarrow\; \phi \;-\; \beta \,\nabla_{\phi}\, \mathcal{L}\big(f_{\theta'}(\varepsilon^{\mathrm{val}})\big) \qquad \text{[Equation 3]}$$

wherein $\varepsilon^{\mathrm{val}}$ is the text embeddings generated by using the meta-validation data set, $\phi$ is the parameters that constitute the task model, the meta information dictionary, and the few-shot text embedding generator, and $\beta$ is a learning rate.
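A minimal sketch of the two-step update of Equations 2 and 3, assuming a linear task model in PyTorch, may look as follows; the learning rates, dimensions, and names are illustrative assumptions, and for simplicity only the task-model parameters are updated in the outer step here, whereas Equation 3 also covers the meta information dictionary and the few-shot text embedding generator.

```python
import torch
import torch.nn.functional as F

def meta_step(weight, bias, train_emb, train_y, val_emb, val_y, alpha=0.01, beta=0.001):
    """One inner update (Equation 2) followed by one outer update (Equation 3)."""
    # Inner update: adapt the task-model parameters with the task-adaptation loss.
    task_loss = F.cross_entropy(train_emb @ weight.t() + bias, train_y)
    g_w, g_b = torch.autograd.grad(task_loss, (weight, bias), create_graph=True)
    w_adapted, b_adapted = weight - alpha * g_w, bias - alpha * g_b

    # Outer update: evaluate the adapted model on the meta-validation set and
    # back-propagate the meta-optimization loss to the original parameters.
    meta_loss = F.cross_entropy(val_emb @ w_adapted.t() + b_adapted, val_y)
    mg_w, mg_b = torch.autograd.grad(meta_loss, (weight, bias))
    with torch.no_grad():
        weight -= beta * mg_w
        bias -= beta * mg_b
    return task_loss.item(), meta_loss.item()

# Usage with assumed dimensions (5 classes, 768-dimensional text embeddings):
weight = torch.randn(5, 768, requires_grad=True)
bias = torch.zeros(5, requires_grad=True)
train_emb, train_y = torch.randn(25, 768), torch.randint(0, 5, (25,))
val_emb, val_y = torch.randn(25, 768), torch.randint(0, 5, (25,))
meta_step(weight, bias, train_emb, train_y, val_emb, val_y)
```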
According to the embodiment as described above, in the classification of text data, the present disclosure can generate, based on meta-learning, a task model well adapted to a new domain by training the task model with data in various domains.
The method for meta-learning-based generation of a task model according to at least one embodiment of the present disclosure may be executed by the computing device 100. The computing device 100 executing the method according to embodiments of the present disclosure may be a computing device having an environment for executing an application program. It should be noted that, for brevity, the description of some operations included in the method may omit stating that the computing device 100 is the subject performing those operations.
Referring to the drawing, in Step S21, a task-adaptation loss of the task model is calculated based on a result of training the task model by using a training data set. The task-adaptation loss may be calculated by using the cross-entropy loss function.
In Step S22, the task model is updated based on the task-adaptation loss calculated in Step S21. In this case, the calculated task-adaptation loss may be used to update parameters constituting the task model.
Then, in Step S23, a validation data set may be used to calculate a meta-optimization loss for the task model updated in Step S22. Using the meta-optimization loss, the computing device 100 may evaluate whether the updated task model has adapted well. The meta-optimization loss may also be calculated by using the cross-entropy loss function.
Finally, in Step S24, the updated task model is further updated based on the meta-optimization loss calculated in Step S23. At this time, the calculated meta-optimization loss may be used to further update parameters constituting the updated task model.
Step S24 may further include a step of updating a meta information dictionary including feature information for a plurality of domain-specific texts, and a step of updating parameters constituting a few-shot text embedding generator for generating text embeddings corresponding to inputted few-shot text data.
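As a non-limiting sketch of Step S24 under assumed module shapes and names, the meta-optimization loss may drive a single optimizer over all three parameter groups, for example as follows.

```python
import torch

# Illustrative sketch: the meta-optimization loss of Step S23 updates the task model,
# the meta information dictionary, and the few-shot text embedding generator together.
emb_dim, num_entries, num_classes = 768, 32, 5            # assumed sizes
task_model = torch.nn.Linear(emb_dim, num_classes)
embedding_generator = torch.nn.Linear(emb_dim, emb_dim)   # placeholder for the generator
meta_dictionary = torch.nn.Parameter(torch.randn(num_entries, emb_dim))

meta_optimizer = torch.optim.Adam(
    list(task_model.parameters()) + list(embedding_generator.parameters()) + [meta_dictionary],
    lr=1e-3,
)

def outer_update(meta_optimization_loss):
    # Step S24: further update all parameter groups based on the meta-optimization loss.
    meta_optimizer.zero_grad()
    meta_optimization_loss.backward()   # loss computed on the validation data set in Step S23
    meta_optimizer.step()
```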
Upon completion of the meta-learning based process of updating the parameters of the task model as described above, the computing device 100 may generate text embeddings adapted to the domain of newly inputted few-shot text data.
Specifically, when the few-shot text data is inputted to the few-shot text embedding generator 11, the computing device 100 outputs, by using the internal components of the few-shot text embedding generator 11, the text embeddings 12 reflecting the domain features of the text data.
The few-shot text embedding generator 11 includes the components of a language model 21, a relation network 24, a gating network 25, a filtering unit 27, and a synthesizing unit 29.
In the few-shot text embedding generator 11, when the few-shot text data is inputted to the language model 21, token embeddings 23 and CLS (class) embeddings 22 are generated. At this time, the relation network 24 generates a relation vector for representing the relationship between the CLS embeddings 22, and the gating network 25 generates, by using the relation vector, a gate vector for determining which information to extract from a meta information dictionary 19.
The filtering unit 27 utilizes the meta information dictionary 19 and the gate vector extracted from the gating network 25 to generate feature information 28 that corresponds to the domain of the few-shot text data. The synthesizing unit 29 synthesizes the generated feature information 28 and the token embeddings 23 to generate the text embeddings 12 reflecting domain features of the few-shot text data.
Through this process, the computing device 100 may process the newly inputted few-shot text data by using the meta information dictionary 19 to generate text embeddings 12 that are specific to the relevant domain.
At this time, the task model 13 needs a process of adapting to the generated text embeddings 12. To this end, the computing device 100 inputs the generated text embeddings 12 to the task model 13 to obtain the outputted task model 14 and calculates a task-adaptation loss 15 for the task model 14. Using the task-adaptation loss 15, the computing device 100 may evaluate whether the task model 14 has adapted well to the text embeddings 12. Thereafter, through the update operation step 16, the computing device 100 may utilize the task-adaptation loss 15 to calculate how much to update the parameters of the task model 14 and reflect the result in the task model 13.
According to the embodiment as described above, in performing the classification of few-shot text data, the present disclosure can generate, without the large-scale language model itself having to be tuned to the relevant domain, text embeddings and a task model reflecting the domain features of the few-shot text data, and can employ the generated text embeddings and task model in real time.
The method of generating text embeddings for few-shot text data according to at least one embodiment of the present disclosure may be executed by the computing device 100. The computing device 100 executing the method according to embodiments of the present disclosure may be a computing device having an environment for executing an application program. It should be noted that, for brevity, the description of some operations included in the method may omit stating that the computing device 100 is the subject performing those operations.
Referring to the drawing, in Step S41, few-shot text data is inputted into a language model to generate token embeddings and CLS embeddings.
Next, Step S42 inputs the CLS embeddings to the relation network and the gating network to generate the feature information that corresponds to the domain of the few-shot text data.
As an embodiment, Step S42 may include a step of filtering, by using the gating network, the feature information that corresponds to the domain of the few-shot text data, from the meta-information dictionary including feature information for a plurality of domain-specific texts.
Here, the meta-information dictionary may be generated through a meta-learning process for the task model. The meta-learning process may include steps including calculating a task-adaptation loss of the task model based on a result of training the task model by using a train data set, updating the task model based on the task-adaptation loss, calculating a meta-optimization loss of the updated task model by using a validation data set, and further updating the updated task model based on the meta-optimization loss.
Finally, Step S43 synthesizes the token embeddings generated in Step S41 and the feature information generated in Step S42 to generate text embeddings corresponding to the domain of the few-shot text data.
The method further includes, after performing Step S43, steps including inputting the generated text embeddings into a task model, calculating a task-adaptation loss of the task model based on a result outputted from the task model, and updating the task model based on the task-adaptation loss.
In the illustrated example, the few-shot text data 51 is firstly inputted to a language model such as BERT 520, from which token embeddings 521 and CLS embeddings 522 can be generated.
The encoder 523 serves to reduce the dimension of the class vectors corresponding to the CLS embeddings 522 generated from the BERT 520. The relation network 524 encodes the relationship between the class vectors whose dimensions are reduced through the encoder 523 and outputs the encoded relationship as a relation vector. In this case, the relation vector may be expressed by the equation shown at 61 in the drawing, wherein the left-hand side means a relation vector for class n in a given domain τ_i, and ƒ_θ denotes a class vector generated from the BERT 520 for the j-th text data.
The gating network 525 generates a gating vector 526 for extracting the most relevant k pieces of feature information, based on the relation vectors outputted through the relation network 524. At this time, the gating vector 526 may be expressed by the equation shown at 62 in the drawing, wherein the left-hand side means a gating vector computed from a given input relation vector, with all elements but k being zero.
In this case, k pieces of feature information 527 suitable for the relevant domains may be extracted through a multiplication operation between the gating vector 526 and a meta information dictionary 53 including N pieces of feature information. At this time, the extracted k pieces of feature information 527 may be expressed by the equation shown at 63 in the drawing, wherein the left-hand side means the extracted feature information.
The synthesizing unit 528 may generate the domain-specific text embeddings 54 of the few-shot text data 51 by multiplying the extracted k pieces of feature information 527 by the token embeddings 521.
The text embeddings 54 generated according to the above process are inputted to a classification model 55, thereby providing classification results 56.
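To make the data flow concrete, the following non-limiting sketch illustrates the filtering and synthesizing steps; the tensor shapes, the top-k gating, and the element-wise combination of the feature information with the token embeddings 521 are assumptions for illustration, since the exact operations are those of the equations at 61 to 63 in the drawing.

```python
import torch

def generate_domain_text_embeddings(token_embeddings, relation_vector,
                                    gating_net, meta_dictionary, k=3):
    """Illustrative sketch of the filtering (527) and synthesizing (528) steps.

    token_embeddings: (seq_len, d) from the language model
    relation_vector:  (d,) from the relation network
    meta_dictionary:  (N, d) feature information entries
    """
    scores = gating_net(relation_vector)              # (N,) relevance of each dictionary entry
    mask = torch.zeros_like(scores)
    mask[torch.topk(scores, k).indices] = 1.0         # keep only the k most relevant entries
    gate = scores * mask                              # gating vector: all elements but k are zero
    features = gate @ meta_dictionary                 # (d,) domain feature information
    return token_embeddings * features                # (seq_len, d) domain-adapted text embeddings

# Usage with assumed sizes: N=32 dictionary entries, d=768, 20 tokens.
gating_net = torch.nn.Linear(768, 32)
text_embeddings = generate_domain_text_embeddings(
    torch.randn(20, 768), torch.randn(768), gating_net, torch.randn(32, 768))
```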
For performance comparison between models, four types of text data were used for the experiments: 20 Newsgroups, Huffpost Headlines, Reuters-21578, and RCV-1. For comparison with the classification model (OUR) of the present disclosure, the MAML, prototype network (PROTO), latent embedding optimization (LEO), induction network (INDUCTION), and distributional signature (DS) models were used. Additionally, BERT's class vector (CLS vector) was used to evaluate the performance of the models.
For performance comparison between classification models of few-shot text data, the N-way K-shot method may be used, which is a method using K data samples for each of N classes. As an example, to compare the performance between the models, TABLE I used the 5-way 1-shot classification method, and TABLE II used the 5-way 5-shot classification method.
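For illustration only, N-way K-shot episodes of the kind used for TABLE I and TABLE II may be sampled as in the following sketch; the data format (a list of (text, label) pairs) and the query-set size are assumptions.

```python
import random
from collections import defaultdict

def sample_episode(labeled_texts, n_way=5, k_shot=1, n_query=15):
    """Sample one N-way K-shot episode from (text, label) pairs."""
    by_label = defaultdict(list)
    for text, label in labeled_texts:
        by_label[label].append(text)
    classes = random.sample(sorted(by_label), n_way)           # pick N classes
    support, query = [], []
    for cls in classes:
        texts = random.sample(by_label[cls], k_shot + n_query)
        support += [(t, cls) for t in texts[:k_shot]]          # K support samples per class
        query += [(t, cls) for t in texts[k_shot:]]            # held-out query samples
    return support, query
```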
TABLE I and TABLE II confirm that the classification model (OUR) of the present disclosure exhibits higher performance than the existing methods in the classification of few-shot text data.
Referring to the drawing, the computing device 100 may include one or more processors 101, a network interface 102, a memory 103 that loads a computer program 105 executed by the processors 101, a storage 104 that stores the computer program 105, and a bus 107.
The processor 101 controls the overall operation of each component of the computing device 100. The processor 101 may be configured to include at least one of a Central Processing Unit (CPU), a Micro Processor Unit (MPU), a Micro Controller Unit (MCU), a Graphics Processing Unit (GPU), or any type of processor well known in the art. Further, the processor 101 may perform calculations on at least one application or program for executing a method/operation according to various embodiments of the present disclosure. The computing device 100 may have one or more processors.
The memory 103 stores various data, instructions and/or information. The memory 103 may load one or more programs 105 from the storage 104 to execute methods/operations according to various embodiments of the present disclosure. An example of the memory 103 may be a RAM, but is not limited thereto.
The bus 107 provides communication between the components of the computing device 100. The bus 107 may be implemented as various types of buses, such as an address bus, a data bus, and a control bus.
The network interface 102 supports wired and wireless internet communication of the computing device 100. The network interface 102 may support various communication methods other than internet communication. To this end, the network interface 102 may be configured to comprise a communication module well known in the art of the present disclosure.
The storage 104 can non-temporarily store one or more computer programs 105. The storage 104 may be configured to comprise a non-volatile memory, such as a Read Only Memory (ROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), a flash memory, a hard disk, a removable disk, or any type of computer readable recording medium well known in the art.
The computer program 105 may include one or more instructions, on which the methods/operations according to various embodiments of the present disclosure are implemented. When the computer program 105 is loaded on the memory 103, the processor 101 may perform the methods/operations in accordance with various embodiments of the present disclosure by executing the one or more instructions.
In at least one embodiment, the computer program 105 may include computer-executable instructions for executing method steps including calculating a task-adaptation loss of the task model based on a result of training the task model by using a train data set, updating the task model based on the task-adaptation loss, calculating a meta-optimization loss of the updated task model by using a validation data set, and further updating the updated task model based on the meta-optimization loss.
In at least one embodiment, the computer program 105 may include computer-executable instructions for executing method steps including generating token embeddings and CLS embeddings by inputting few-shot text data into a language model, generating feature information corresponding to a domain of the few-shot text data by inputting the CLS embeddings into a relation network and a gating network, and generating text embeddings corresponding to the domain of the few-shot text data by synthesizing the token embeddings and the feature information.
The technical features of the present disclosure described so far may be embodied as computer readable codes on a computer readable medium. The computer readable medium may be, for example, a removable recording medium (CD, DVD, Blu-ray disc, USB storage device, removable hard disk) or a fixed recording medium (ROM, RAM, computer-equipped hard disk). The computer program recorded on the computer readable medium may be transmitted to another computing device via a network such as the Internet and installed in the other computing device, thereby being used in the other computing device.
Although operations are shown in a specific order in the drawings, it should not be understood that desired results can be obtained when the operations must be performed in the specific order or sequential order or when all of the operations must be performed. In certain situations, multitasking and parallel processing may be advantageous. According to the above-described embodiments, it should not be understood that the separation of various configurations is necessarily required, and it should be understood that the described program components and systems may generally be integrated together into a single software product or be packaged into multiple software products.
In concluding the detailed description, those skilled in the art will appreciate that many variations and modifications can be made to the preferred embodiments without substantially departing from the principles of the present disclosure. Therefore, the disclosed preferred embodiments of the disclosure are used in a generic and descriptive sense only and not for purposes of limitation.