Embodiments of the present disclosure are directed to the field of Natural Language Processing (NLP), more specifically Word Sense Disambiguation (WSD), which aims to automatically understand the exact meaning of a word in the context of the word's use in a sentence or expression.
Human languages are ambiguous because a word can have multiple meanings in different contexts. WSD aims to automatically identify the exact meaning of a word in the context of the word's use, usually a context sentence. The identification of the correct meaning of the word in its context is essential to many downstream tasks such as machine translation, information extraction, and other tasks in natural language processing.
One of the problems solved by the present disclosure is the struggle supervised models face when attempting to predict the correct meaning for rare word senses because of limited training data on those rare word senses. Since most models predict the meaning of a word based on training from a pre-defined word sense inventory, rare words that do not occur or occur very infrequently are usually overlooked when predicting the meaning of a word.
Many approaches include fine-tuning language models, pre-trained on massive text data, on task-specific datasets. However, those approaches often limit the applicability of the trained models and cause major problems. Firstly, the models' performance decreases significantly when predicting rare and zero-shot word senses because of insufficient samples in the training data. Another problem is that task-specific fine-tuning often renders the models inventory dependent, wherein they can only select the best definition from one predefined word sense inventory (e.g., WordNet) and not more generally.
The present disclosure addresses one or more technical problems. To address the problem of correctly predicting the meaning of rare word sense, i.e., the data sparsity problem, and generalize the model to be independent of one predefined inventory, the present disclosure proposes a gloss alignment algorithm that aligns glosses with the same meaning from different word sense inventories to collect rich lexical knowledge. Training or fine-tuning the model to identify semantic equivalence between a word in context and one of its glosses using these aligned inventories addresses the data sparsity and generalization problems, with improved predictions on both frequent and rare word senses.
Embodiments of the disclosure provide a method and an apparatus for predicting a word sense.
According to one aspect of the disclosure, a method for predicting a word sense includes generating one or more aligned inventories, wherein the one or more aligned inventories are generated using one or more word sense inventories; obtaining a word in a context sentence; determining one or more semantic equivalence scores indicating semantic similarity between the word in the context sentence and each of one or more associated glosses in the one or more aligned inventories using a semantic equivalence recognizer model; and predicting a correct sense of the word in the context sentence based on the determined one or more semantic equivalence scores.
According to an aspect of the disclosure, the generation of the one or more aligned inventories includes collecting glosses from a first word sense inventory; collecting glosses from a second word sense inventory; determining a best match between the first word sense inventory and the second word sense inventory, wherein the determining of the best match between the first word sense inventory and the second word sense inventory includes for each common word in the first word sense inventory and the second word sense inventory, determining a sentence textual similarity score between each gloss from the first word sense inventory and each of one or more associated glosses from the second word sense inventory; and determining a matching function to map the each gloss from the first word sense inventory to the each of the one or more associated glosses from the second word sense inventory, wherein the matching function is configured to maximize a sum of the sentence textual similarity score between the each gloss from the first word sense inventory and the each of the one or more associated glosses from the second word sense inventory.
According to an aspect of the disclosure, the generation of the one or more aligned inventories further includes generating positive gloss pairs by pairing a gloss from the first word sense inventory with the each of the one or more associated glosses from the second word sense inventory based on determining that the sentence textual similarity score between the gloss from the first word sense inventory and the each of the one or more associated glosses from the second word sense inventory is above a threshold; and generating negative gloss pairs by pairing a gloss from the first word sense inventory with the each of the one or more associated glosses from the second word sense inventory based on determining that the sentence textual similarity score between the gloss from the first word sense inventory and the each of the one or more associated glosses from the second word sense inventory is below the threshold.
According to an aspect of the disclosure, determining the sentence textual similarity score between the each gloss from the first word sense inventory and the each of the one or more associated glosses from the second word sense inventory includes determining one or more sentence embeddings based on a secondary pre-trained model; and determining a cosine similarity between the each gloss from the first word sense inventory and the each of the one or more associated glosses from the second word sense inventory based on the one or more sentence embeddings.
According to an aspect of the disclosure, the secondary pre-trained model includes a Sentence Bidirectional Encoder Representations from Transformers (SBERT) model.
According to an aspect of the disclosure, the determining of the one or more semantic equivalence scores indicating the semantic similarity between the word in the context sentence and the each of the one or more associated glosses in the one or more aligned inventories using the semantic equivalence recognizer model includes inputting the word in the context sentence into the semantic equivalence recognizer model; inputting the one or more aligned inventories into the semantic equivalence recognizer model; identifying one or more glosses from the one or more aligned inventories associated with the word in the context sentence; and applying a trained gloss classifier to the identified one or more glosses to generate a probability score for each of the identified one or more glosses.
According to an aspect of the disclosure, the trained gloss classifier is trained using an augmented training data, wherein the augmented training data is a combination of the one or more aligned inventories and built-in training data associated with a specific word sense inventory.
According to an aspect of the disclosure, the trained gloss classifier is trained using the one or more aligned inventories and the trained gloss classifier is fine-tuned using built-in training data associated with a specific word sense inventory in a new domain.
According to an aspect of the disclosure, the one or more word sense inventories is a lexical dataset for a language.
According to an aspect of the disclosure, the predicting of the correct sense of the word in the context sentence based on the determined one or more semantic equivalence scores includes selecting a result gloss associated with a highest semantic equivalence score.
Further features, nature, and various advantages of the disclosed subject matter will be more apparent from the following detailed description and the accompanying drawings in which:
The proposed features discussed below may be used separately or combined in any order. Further, the embodiments may be implemented by processing circuitry (e.g., one or more processors or one or more integrated circuits). In one example, the one or more processors execute a program that is stored in a non-transitory computer-readable medium.
At operation 110, gloss alignment of word sense inventories may align a plurality of word sense inventories to produce a best mapping or best matching alignment of glosses from across a plurality of word sense inventories. To leverage the lexical and contextual information from the plurality of word sense inventories, the gloss alignment or alignment of inventories may include a best matching function that includes a mapping of glosses of common words from one of the word sense inventories to the glosses of another one of the word sense inventories such that the mappings in the matching function may have maximum sentence textual similarity.
At operation 120, pairs of glosses may be generated, wherein the pairs of glosses may include each mapping of glosses of common words from one of the word sense inventories to the glosses of another one of the word sense inventories. In some embodiments, the mappings where the pair of glosses is aligned, i.e., both glosses in the pair have a high sentence textual similarity, may be labelled as positive pairs of glosses. In some embodiments, the mappings where the pair of glosses is not aligned, i.e., both glosses in the pair have a low sentence textual similarity, may be labelled as negative pairs of glosses. In some embodiments, only pairs that have sentence textual similarities above a threshold may be considered to improve the quality of supervision and training. In some embodiments, pairs of glosses may be generated using glosses within each word sense inventory individually. Thus, in some embodiments, for every word in a word sense inventory, a gloss sentence may be paired with an example sentence to get positive pairs of glosses. Similarly, in some embodiments, for every word in a word sense inventory, a gloss sentence may be paired with an example sentence for another unassociated word to generate negative pairs of glosses.
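The within-inventory pairing described above can be illustrated with a minimal sketch. The toy inventory, the `Inventory` type, and the function name `build_pairs` are hypothetical and introduced only for illustration; the disclosure does not prescribe a data layout.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: word -> list of (gloss, example sentence) entries.
Inventory = Dict[str, List[Tuple[str, str]]]


def build_pairs(inventory: Inventory) -> List[Tuple[str, str, int]]:
    """Return (gloss, example, label) triples; label 1 = positive, 0 = negative."""
    pairs = []
    for word, senses in inventory.items():
        for gloss, example in senses:
            # Positive pair: a gloss with its own example sentence.
            pairs.append((gloss, example, 1))
            # Negative pairs: the same gloss with examples of other words.
            for other_word, other_senses in inventory.items():
                if other_word == word:
                    continue
                for _, other_example in other_senses:
                    pairs.append((gloss, other_example, 0))
    return pairs


toy = {
    "bank": [("a financial institution", "She deposited cash at the bank.")],
    "bark": [("the sound a dog makes", "The dog's bark woke us.")],
}
pairs = build_pairs(toy)
```

In this sketch every example sentence of every other word yields a negative pair; a practical system would likely sample a subset of negatives to keep the classes balanced.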
At 140, the context sentence that contains the word whose sense is to be determined may be obtained. At 130 and 135, transformers may be used to train the model using training data. In some embodiments, transformers may be pre-trained and may be applied to the context sentence to generate a probability. At 160, the generated probability may be used to predict the correct sense of the word in the context sentence.
As an example, consider the evaluation of the word sense predicting model 100 using two WSD datasets: one that focuses on all-words WSD evaluation, and another, Few Shot Examples of Word Senses (FEWS), that emphasizes low-shot evaluation. Together, the two datasets indicate the performance of the word sense predicting model 100 on general word sense inventories and under data sparsity.
Both the all-words WSD and FEWS datasets are annotated with WordNet 3.0. We may generate positive and negative pairs of glosses from the built-in training data of the specific datasets for training. We may also generate aligned inventories using one or more dictionaries that provide glosses with rich lexical knowledge. Generating aligned inventories may include generating positive and negative pairs of glosses from the one or more dictionaries.
In some embodiments, the word sense predicting model 100's transformers (130, 135) may be trained using augmented training data that combines the pairs of glosses from the aligned inventories and the pairs of glosses from the built-in training data from the specific datasets. The augmented model (SemEq-Base) may train using only the augmented training data.
In some embodiments, the word sense predicting model 100's transformers (130, 135) may be trained first using training data that includes only the pairs of glosses from the aligned inventories. Using only the pairs of glosses from the aligned inventories may generate a general model (SemEq-Large-General) that may determine whether a word in the context sentence and a gloss are semantically equivalent or not independent of any specific word sense inventories. In some embodiments, this general model is further trained or fine-tuned on the built-in training data for the specific word sense inventories to create an expert model (SemEq-Large-Expert). The expert model may adapt better to new domains and achieve better performance.
As Table 1 indicates, the expert model (SemEq-Large-Expert) (line 16) consistently outperforms AdaptBERT (line 9), the previous best model without using WordNet synset graph information, on SE07, SE2, SE3 and SE13, attaining 1.2% higher F1 on ALL. The expert model (SemEq-Large-Expert) also disambiguates all types of words, including nouns, verbs, adjectives, and adverbs, better than AdaptBERT. This demonstrates the benefits of leveraging multiple word sense inventories using gloss alignment and transfer learning. The expert model (SemEq-Large-Expert) is 0.6% more accurate than EWISER (line 10), which uses the extra WordNet graph knowledge. Thus, by pre-training on lexical knowledge derived from aligned inventories, the word sense prediction model may generalize more easily and may better capture semantic equivalence between the word in the context sentence and a gloss sentence for identifying the correct sense of the word.
Table 2 indicates the results on FEWS dataset. BEMSemCor (line 4) is a similar transfer learning model but fine-tuned on SemCor before training on FEWS while BEM (line 3) only trains on FEWS. The second section shows that augmenting the FEWS train set with multiple word sense inventories using gloss alignment (line 6) greatly improves zero-shot learning performance by 1.6% on the dev set and 2.4% on the test set (compared with line 5). When the transfer learning strategy is adopted on the FEWS dataset, the final SemEq-Large-Expert (line 10) model's performance on test sets increases to 82.3% on few-shot senses and 72.2% on zero-shot senses, which significantly outperforms all baseline models.
The word sense inventories (204-209) may be dictionaries that provide multiple example sentences for each word sense due to its usage and may be used as a means of receiving context sentences for that word sense. As an example, using dictionaries like Collins or Webster's Dictionary may provide an immense database of lexical knowledge in English. Each of the word sense inventories (204-209) may have multiple examples or glosses for that word in a limited number of contexts. Thus, the glosses for a word's senses from the different word sense inventories (204-209) may be different expressions for the same meanings. Aligning parallel glosses from the plurality of word sense inventories for the same word sense can significantly increase the lexical knowledge acquired by the model, especially for rare and infrequently used word senses.
To leverage this rich lexical and contextual information, the gloss alignment or alignment of inventories may include a best matching function (220) that includes a mapping of glosses (214, 216) of common words from one of the word sense inventories to the glosses of another one of the word sense inventories such that the mappings in the matching function may have maximum sentence textual similarity.
In some embodiments, the best matching function (220) may be determined using an optimization setup. In some embodiments, the optimization setup may be a Maximum Weighted Bipartite matching that aims to find a best matching in a weighted bipartite graph that maximizes the sum of the weights of the edges. As an example, in
An example setup of the Maximum Weighted Bipartite Matching optimization to obtain a best matching function for alignment of inventories may be as follows. Suppose we retrieve two word sets S1 and S2 from word sense inventory 204 and word sense inventory 205 respectively, where each word set consists of a list of definition sentences or glosses (210, 211, 212). To determine a best matching function (220) f: S1 → S2, the reward function r: S1 × S2 → R may be maximized. In some embodiments, sentence level textual similarity or sentence textual similarity may be used as a reward function to measure the similarity between two glosses. In some embodiments, to measure or determine sentence level textual similarity between two glosses, a secondary pre-trained model may be used. The secondary pre-trained model may be any state-of-the-art model that may perform Semantic Textual Similarity (STS) tasks and paraphrase detection tasks. In some embodiments, a Sentence Bidirectional Encoder Representations from Transformers (SBERT) model may be used.
In some embodiments, determining a sentence textual similarity between glosses may include determining one or more sentence embeddings based on a secondary pre-trained model. In some embodiments, determining a sentence textual similarity between glosses may include determining a cosine similarity between the glosses based on the one or more sentence embeddings. As an example, in some embodiments, the secondary pre-trained model (e.g., SBERT) may be applied to the word sets S1 and S2 to obtain sentence embeddings, and the cosine similarity between the embeddings may be calculated as the reward function.
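The cosine-similarity reward can be sketched as follows. The embeddings here are hand-written placeholder vectors standing in for the output of a sentence encoder such as SBERT; the function names `cosine_similarity` and `reward_matrix` are hypothetical and not taken from the disclosure.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_u * norm_v)


def reward_matrix(
    emb1: Sequence[Sequence[float]], emb2: Sequence[Sequence[float]]
) -> List[List[float]]:
    """w[i][j] = similarity between gloss i of S1 and gloss j of S2."""
    return [[cosine_similarity(u, v) for v in emb2] for u in emb1]


# Placeholder embeddings (assumption): two glosses in S1, two in S2.
s1_embeddings = [[1.0, 0.0], [0.0, 1.0]]
s2_embeddings = [[1.0, 0.0], [1.0, 1.0]]
w = reward_matrix(s1_embeddings, s2_embeddings)
```

The resulting matrix `w` plays the role of the weights between glosses of S1 and glosses of S2 in the matching problem.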
In some embodiments, the Maximum Weighted Bipartite Matching optimization may be solved using Linear Programming. As an example, a Linear Programming based solution for the Maximum Weighted Bipartite Matching optimization may be as follows.
Suppose a weight wij denotes the sentence textual similarity between the ith gloss in S1 and the jth gloss in S2. Aligning word sense inventories 204 and 205 may include solving the following linear integer programming problem:
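For reference, a standard integer programming formulation of maximum weighted bipartite matching that is consistent with the definitions above may be written as shown here. The binary variable x_ij (equal to 1 when the ith gloss in S1 is matched to the jth gloss in S2) and the one-to-one constraints are assumptions of this sketch and are not stated in the surrounding text.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{|S_1|} \sum_{j=1}^{|S_2|} w_{ij}\, x_{ij} \\
\text{subject to}\quad & \sum_{j=1}^{|S_2|} x_{ij} \le 1 \quad \text{for all } i, \\
& \sum_{i=1}^{|S_1|} x_{ij} \le 1 \quad \text{for all } j, \\
& x_{ij} \in \{0, 1\} \quad \text{for all } i, j.
\end{aligned}
```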
In some embodiments, S1 and S2 may include any of the word sense inventories (204-209). In some embodiments, S1 and S2 may include any combination of two out of all the word sense inventories (204-209), and the aligning of the inventories may include aligning all the combinations of the word sense inventories (204-209). Thus, the alignment of inventories may provide a mapping of glosses across all of the word sense inventories (204-209).
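For small gloss sets, the matching that maximizes the total similarity can be found by exhaustive search, which is easy to verify by hand. The following is a minimal sketch that swaps exhaustive enumeration in for a Linear Programming solver; it assumes a non-empty weight matrix and a one-to-one matching in which every gloss of the smaller set is matched. The function name `best_matching` and the weights are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def best_matching(w: List[List[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Exhaustively find a one-to-one matching with maximum total weight.

    w[i][j] is the similarity between gloss i of S1 and gloss j of S2.
    Assumes w is non-empty and rectangular.
    """
    n_rows, n_cols = len(w), len(w[0])
    transposed = n_rows > n_cols
    if transposed:
        # Work with the smaller set on the row side.
        w = [list(col) for col in zip(*w)]
        n_rows, n_cols = n_cols, n_rows
    best_score, best_pairs = float("-inf"), []
    for cols in permutations(range(n_cols), n_rows):
        score = sum(w[i][j] for i, j in enumerate(cols))
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(cols))
    if transposed:
        # Report pairs as (index in S1, index in S2) again.
        best_pairs = sorted((j, i) for i, j in best_pairs)
    return best_score, best_pairs


# Hypothetical similarities: two glosses in S1, three glosses in S2.
weights = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.3],
]
score, matching = best_matching(weights)
```

Exhaustive search grows factorially with the number of glosses, so it is only practical because a single word typically has few senses; a polynomial-time assignment solver would be the usual choice at scale.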
The gloss aligned inventories or aligned inventories (310) may include the mapping of glosses (214, 216) and the best matching function (220). The gloss examples (320) may include the mapping of glosses (214, 216) from the aligned inventories. The context sentence may include the sentence comprising the word for which the word sense may be predicted using the semantic equivalence recognizer model (300).
According to embodiments, the semantic equivalence recognizer model (300) may receive the gloss examples (320) from the gloss aligned inventories (310) as input for training the semantic equivalence recognizer model (300) or the transformers (330, 335). In some embodiments, the gloss examples (320) from the gloss aligned inventories (310) may be a pair of positive glosses wherein the pair of glosses are aligned. In some embodiments, the gloss examples (320) from the gloss aligned inventories (310) may be a pair of negative glosses wherein the pair of glosses are not aligned.
According to some embodiments, the semantic equivalence recognizer model (300) may include one or more transformers (330, 335) for predicting semantic equivalence between a word in context and any associated gloss. Transformers (330, 335) may be deep learning models that include encoders and decoders and that can process input data, such as glosses in gloss examples (320), out of sequence, using context for any position in the input sequence. In some embodiments, the transformers (330, 335) may include only encoders. In some embodiments, the transformers (330, 335) may be trained using only the gloss examples (320).
In some embodiments, the transformers (330, 335) and by extension the semantic equivalence recognizer model (300) may be trained using augmented training data. In the augmented training data, mapping of glosses (214, 216) may be combined with the built-in training data of specific word sense inventories like the WSD Dataset (315). Thus, using augmented training data, the semantic equivalence recognizer model (300) may be trained using both the aligned inventories and built-in training data for specific word sense inventories like the WSD Dataset (315) at the same time.
In some embodiments, the transformers (330, 335) are first trained using mapping of glosses (214, 216) such that the semantic equivalence recognizer model (300) may become a general model capable of determining whether a word in a context sentence and a gloss are semantically equivalent or not. However, such a model is general and may not predict meaning well for domain specific words. Thus, the transformers (330, 335), and by extension the semantic equivalence recognizer model (300) may be further trained or the model fine-tuned by connecting the output of the first trained model to an additional layer related with a specific word sense inventory like the WSD Dataset (315). This produces a semantic equivalence recognizer model (300) that is an expert in the domain of the specific word sense inventory like the WSD Dataset (315). In some embodiments, the specific word sense inventory used to fine-tune the trained model may be in a different domain than the word sense inventories used in the aligned inventories.
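The difference between the two training strategies described above (training once on augmented data versus training on aligned inventories and then fine-tuning) can be sketched purely in terms of how the training examples are staged. The function name `training_stages`, the strategy labels, and the example triples are hypothetical; no model or training loop is shown.

```python
import random
from typing import List, Tuple

Example = Tuple[str, str, int]  # (text_a, text_b, label)


def training_stages(
    strategy: str,
    aligned: List[Example],
    builtin: List[Example],
    seed: int = 0,
) -> List[List[Example]]:
    """Return the ordered list of training stages for a strategy."""
    if strategy == "augmented":
        # One stage: aligned-inventory pairs mixed with built-in data.
        combined = aligned + builtin
        random.Random(seed).shuffle(combined)
        return [combined]
    if strategy == "transfer":
        # Two stages: general training first, then fine-tuning.
        return [list(aligned), list(builtin)]
    raise ValueError(f"unknown strategy: {strategy}")


aligned_pairs = [("gloss a", "gloss b", 1), ("gloss a", "gloss c", 0)]
builtin_pairs = [("context sentence", "gloss a", 1)]
augmented = training_stages("augmented", aligned_pairs, builtin_pairs)
transfer = training_stages("transfer", aligned_pairs, builtin_pairs)
```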
Once trained, the transformers (330, 335) may determine a transformer output (340, 345) that may include dense representations of the input, such as semantic representations of the input gloss examples (320) and context sentence (325). The semantic equivalence recognizer model (300), when applied to a context sentence (325), produces one or more output probabilities (360) for one or more glosses whose meanings are semantically equivalent to the word in the context sentence. The semantic equivalence recognizer model (300) may select the gloss with the highest probability as the predicted word sense for the word in the context sentence.
At 410, aligned inventories are generated using one or more word sense inventories. The gloss aligned inventories (310) may include a best mapping or best matching alignment of glosses from across the one or more word sense inventories. In some embodiments, the word sense inventories may be dictionaries that provide multiple example sentences illustrating the usage of each word sense and may be used as a means of receiving context sentences for that word sense. Each of the word sense inventories may have multiple examples or glosses for that word in a limited number of contexts. Thus, the glosses for a word's senses from the different word sense inventories may be different expressions for the same meanings. Aligning parallel glosses from the plurality of word sense inventories for the same word sense can significantly increase the lexical knowledge acquired by the model, especially for rare and infrequently used word senses.
At 420, a word in a context sentence may be obtained. The word in the context sentence may be the word whose meaning or word sense the model may predict. In some embodiments, the entire context sentence may be obtained.
At 430, one or more semantic equivalence scores may be determined, wherein the one or more semantic equivalence scores indicate semantic similarity between the word in the context sentence and each of one or more associated glosses in the one or more aligned inventories using a semantic equivalence recognizer model. As an example, the semantic equivalence recognizer model (300) may generate an output probability score indicating the semantic similarity between the word in the context sentence and the each of the one or more associated glosses in the one or more aligned inventories.
At 440, a prediction for the correct sense of the word in the context sentence may be based on the determined one or more semantic equivalence scores. In some embodiments, predicting the correct sense of the word in the context sentence based on the determined one or more semantic equivalence scores may include selecting a result gloss associated with a highest semantic equivalence score. As an example, the gloss with the highest probability from the output probability generated by the semantic equivalence recognizer model (300) may be selected as the predicted correct sense of the word in the context sentence.
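Selecting the result gloss with the highest semantic equivalence score amounts to an argmax over the candidate glosses. A minimal sketch follows, with hypothetical scores standing in for the output probabilities of the recognizer model; the function name `predict_sense` is an assumption.

```python
from typing import Dict


def predict_sense(scores: Dict[str, float]) -> str:
    """Return the gloss with the highest semantic equivalence score."""
    if not scores:
        raise ValueError("no candidate glosses for this word")
    return max(scores, key=scores.get)


# Hypothetical scores for the word "bank" in "She deposited cash at the bank."
scores = {
    "a financial institution": 0.91,
    "sloping land beside a river": 0.07,
}
predicted = predict_sense(scores)
```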
At 510, glosses from a first word sense inventory may be collected. As an example, at 510, glosses from word sense inventories (204-209) like dictionaries may be collected.
At 520, glosses from a second word sense inventory may be collected. As an example, at 520, glosses from word sense inventories (204-209) like dictionaries may be collected. In some embodiments, the first word sense inventory and the second word sense inventory may be different.
At 530, a best match between the first word sense inventory and the second word sense inventory may be determined. As an example, a best matching function (220) may be generated to indicate a mapping of glosses (214, 216) of common words from one of the word sense inventories to the glosses of another one of the word sense inventories. In some embodiments, the mappings in the matching function may be generated as a function maximizing the sentence textual similarity.
At 540, a sentence textual similarity score between each gloss from the first word sense inventory and each of one or more associated glosses from the second word sense inventory may be determined for each common word in the first word sense inventory and the second word sense inventory. In some embodiments, determining the sentence textual similarity score between the each gloss from the first word sense inventory and the each of the one or more associated glosses from the second word sense inventory may include determining one or more sentence embeddings based on a secondary pre-trained model. In some embodiments, determining the sentence textual similarity score between the each gloss from the first word sense inventory and the each of the one or more associated glosses from the second word sense inventory may include determining a cosine similarity between the each gloss from the first word sense inventory and the each of the one or more associated glosses from the second word sense inventory based on the one or more sentence embeddings.
At 550, a matching function may be determined. The matching function may map the each gloss from the first word sense inventory to the each of the one or more associated glosses from the second word sense inventory, wherein the matching function may be configured to maximize a sum of the sentence textual similarity score between the each gloss from the first word sense inventory and the each of the one or more associated glosses from the second word sense inventory. As an example, the best matching function (220) may be configured to maximize the sum of the sentence textual similarity scores between the each gloss from the first word sense inventory (204) and the each of the one or more associated glosses from the second word sense inventory (205). As another example, the best matching function (220) may be configured to generate a mapping such that the total sentence textual similarity score may be maximized.
At 560, positive gloss pairs may be generated. In some embodiments, positive gloss pairs may be generated by pairing a gloss from the first word sense inventory with the each of the one or more associated glosses from the second word sense inventory based on determining that the sentence textual similarity score between the gloss from the first word sense inventory and the each of the one or more associated glosses from the second word sense inventory is above a threshold.
At 570, negative gloss pairs may be generated. In some embodiments, negative gloss pairs are generated by pairing a gloss from the first word sense inventory with the each of the one or more associated glosses from the second word sense inventory based on determining that the sentence textual similarity score between the gloss from the first word sense inventory and the each of the one or more associated glosses from the second word sense inventory is below a threshold.
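The threshold-based labelling at 560 and 570 can be sketched as follows. The threshold value of 0.5, the glosses, the similarity scores, and the function name `label_pairs` are hypothetical; the disclosure does not specify how a score exactly equal to the threshold is handled, so this sketch leaves such pairs unlabelled.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def label_pairs(
    glosses1: Sequence[str],
    glosses2: Sequence[str],
    w: Sequence[Sequence[float]],
    threshold: float,
) -> Tuple[List[Pair], List[Pair]]:
    """Split cross-inventory gloss pairs into positives and negatives."""
    positives, negatives = [], []
    for i, g1 in enumerate(glosses1):
        for j, g2 in enumerate(glosses2):
            if w[i][j] > threshold:
                positives.append((g1, g2))
            elif w[i][j] < threshold:
                negatives.append((g1, g2))
            # Scores equal to the threshold are skipped (assumption).
    return positives, negatives


first = ["a financial institution", "sloping land beside a river"]
second = ["an organization that holds money", "the edge of a stream"]
similarity = [[0.82, 0.11], [0.09, 0.77]]  # Hypothetical scores.
positives, negatives = label_pairs(first, second, similarity, threshold=0.5)
```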
At 610, the context sentence may be input into the semantic equivalence recognizer model. At 620, the pairs of glosses from the aligned inventories may be input into the semantic equivalence recognizer model. As an example, all the positive and negative pairs of glosses from the gloss aligned inventories (310) and the context sentence containing the word the sense of which is to be predicted may be input into the semantic equivalence recognizer model (300).
At 630, one or more glosses from the one or more aligned inventories associated with the word in the context sentence may be identified. In some embodiments, the glosses associated with the word in the context sentence whose meaning or sense is to be predicted are identified.
At 640, the trained gloss classifier may be applied to the identified one or more glosses to generate a probability score for each of the identified one or more glosses at 650.
In some embodiments, at 645, the gloss classifier may be trained using an augmented training data, wherein the augmented training data may be a combination of the one or more aligned inventories and built-in training data associated with a specific word sense inventory. As an example, the semantic equivalence recognizer model (300) may be trained using augmented training data. In the augmented training data, mapping of glosses (214, 216) may be combined with the built-in training data of specific word sense inventories like the WSD Dataset (315). Thus, using augmented training data, the semantic equivalence recognizer model (300) may be trained using both the aligned inventories and built-in training data for specific word sense inventories like the WSD Dataset (315) at the same time.
At 710, the context sentence may be input into the semantic equivalence recognizer model. At 720, the pairs of glosses from the aligned inventories may be input into the semantic equivalence recognizer model. As an example, all the positive and negative pairs of glosses from the gloss aligned inventories (310) and the context sentence containing the word the sense of which is to be predicted may be input into the semantic equivalence recognizer model (300).
At 730, one or more glosses from the one or more aligned inventories associated with the word in the context sentence may be identified. In some embodiments, the glosses associated with the word in the context sentence whose meaning or sense is to be predicted are identified.
At 740, the trained gloss classifier may be applied to the identified one or more glosses to generate a probability score for each of the identified one or more glosses at 750.
In some embodiments, the trained gloss classifier may be trained using the one or more aligned inventories at 744. In some embodiments, at 746, the trained gloss classifier may be fine-tuned using built-in training data associated with a specific word sense inventory in a new domain. As an example, the semantic equivalence recognizer model (300) may be first trained using mapping of glosses (214, 216) such that the semantic equivalence recognizer model (300) may become a general model capable of determining whether a word in a context sentence and a gloss are semantically equivalent or not. In some embodiments, the semantic equivalence recognizer model (300) may be further trained or the semantic equivalence recognizer model (300) may be further fine-tuned by connecting the output of the first trained model to an additional layer related with a specific word sense inventory like the WSD Dataset (315). This may produce a semantic equivalence recognizer model (300) that is an expert in the domain of the specific word sense inventory like the WSD Dataset (315). In some embodiments, the specific word sense inventory used to fine-tune the trained model may be in a different domain than the word sense inventories used in the aligned inventories.
Although
The techniques described above can be implemented as computer software using computer-readable instructions and physically stored in one or more computer-readable media or by a specifically configured one or more hardware processors. For example,
The computer software can be coded using any suitable machine code or computer language, that may be subject to assembly, compilation, linking, or like mechanisms to create code comprising instructions that can be executed directly, or through interpretation, micro-code execution, and the like, by computer central processing units (CPUs), Graphics Processing Units (GPUs), and the like.
The instructions can be executed on various types of computers or components thereof, including, for example, personal computers, tablet computers, servers, smartphones, gaming devices, internet of things devices, and the like.
While this disclosure has described several exemplary embodiments, there are alterations, permutations, and various substitute equivalents, which fall within the scope of the disclosure. It will thus be appreciated that those skilled in the art will be able to devise numerous systems and methods which, although not explicitly shown or described herein, embody the principles of the disclosure and are thus within the spirit and scope thereof.