The present invention relates to large language models. More specifically, the present invention presents a method of decoding by a heuristic tree search process, coupled with algorithms for detecting and reducing hallucination.
Large language models (LLMs) are a type of artificial intelligence (AI) designed to understand and generate human-level natural language. They are built using machine learning techniques, particularly deep learning, and are trained on vast amounts of text data.
However, there is a potential for LLMs to generate misinformation, specifically text that is syntactically and semantically plausible but factually incorrect or nonsensical, often referred to as “hallucination”.
To mitigate hallucination problems, previous research has attempted to augment the input of the language model with relevant knowledge, employing a reader language model that takes both the context and the reference knowledge as input to generate a response. While this approach has demonstrated potential, it involves fine-tuning the reader language model, which poses significant challenges. First, the ever-increasing size of language models makes training them prohibitively expensive, and some API-based LLMs (e.g., OpenAI APIs) are not trainable by end users at all. Second, many state-of-the-art language models are designed to be multi-task zero-shot models through instruction tuning, aiming to perform well across various tasks, and fine-tuning them for a single task risks compromising this generality.
Another common approach to mitigating hallucination problems is to fine-tune the LLM with the knowledge in its input. However, fine-tuning a language model extensively on a specific task can lead to catastrophic forgetting, or catastrophic interference, a phenomenon in which a model abruptly forgets previously learned information upon learning new information, which in turn greatly compromises its generalizability across different tasks and its overall performance.
In addition, although designing a decoding method for LLMs is a natural way to mitigate hallucinations without fine-tuning, current work on plug-and-play guided decoding, such as weighted decoding (WD), remains unsuited for direct adaptation to knowledge-grounded scenarios because it cannot identify the knowledge required for generation, which leads to hallucination.
Therefore, there is a need to develop a novel model for hallucination-mitigated LLM decoding without the need of fine-tuning. The present invention addresses this need.
In one aspect, the present invention provides a computer-implemented model-agnostic natural language generation method.
The computer-implemented model-agnostic natural language generation method comprises providing an information-containing plain text prompt; generating tokens based on the information-containing plain text prompt on an automated basis; re-ranking the generated tokens by importance and selecting the token with the highest knowledge groundedness at each step through a heuristic search process; determining token-level hallucinations with a specifically trained knowledge classifier by identifying the inflection point of hallucination and replacing the hallucinated tokens with further generated tokens; and generating a plain text response from the tokens with the least hallucination and highest knowledge groundedness.
In one embodiment, the heuristic search process of the computer-implemented model-agnostic natural language generation method is a Monte Carlo tree search, comprising: selecting a child node by traversing down the tree from the root node until a leaf node is reached; expanding the leaf node until a terminal state is reached; evaluating the approximated token-level knowledge groundedness scores of the expansion outcomes; backpropagating recursively the approximated token-level knowledge groundedness scores from the leaf node to the root node; and selecting the child node with the highest approximated token-level knowledge groundedness score for further traversal down the tree to reach the subsequent leaf node.
In another embodiment, training the knowledge classifier of the computer-implemented model-agnostic natural language generation method comprises: providing a training example (y, x, k)˜D; randomly swapping the knowledge k with another knowledge k′ from the knowledge source to form a negative example; sampling a position 1<i<T, where T is the length of y; truncating the response y to the i-th token to obtain yi; and completing the sequence yi by sampling from PLM(y|x, yi, k), with the knowledge text k included inside the instruction.
In a further embodiment, in the training process of the knowledge classifier, the tokens are labelled as follows: tokens preceding the hallucination are labelled as positive, the first hallucinated token is labelled as negative, and all tokens after the first hallucinated token are likewise labelled as negative.
The appended drawings, where like reference numerals refer to identical or functionally similar elements, contain figures of certain embodiments to further illustrate and clarify the above and other aspects, advantages and features of the present invention. It will be appreciated that these drawings depict embodiments of the invention and are not intended to limit its scope. The invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
As used in the disclosure herein, “decoding” refers to the process of text generation from the LLM, when given a prompt or input. Decoding involves selecting words, or tokens, sequentially for the formation of coherent sentences in response to the prompt or input. Common decoding strategies include greedy search, in which the LLM selects the token with the highest probability at each step; beam search, in which multiple candidate sequences known as beams are maintained and expanded in parallel, keeping only the top k beams based on cumulative probability; top-k sampling, in which the top k tokens are sampled by the LLM with the highest probabilities in each step; top-p sampling, in which the LLM samples from the smallest set of tokens which exceed an arbitrary cumulative probability at each step; and temperature sampling, where the probabilities of tokens are adjusted by applying temperature parameter T, under which lower temperature leads to sharper distribution and higher temperature flattens the distribution.
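As an illustration of these strategies, the following is a minimal Python (PyTorch) sketch of top-k filtering, top-p filtering, and temperature sampling over a single next-token distribution. The toy logits tensor and function names are illustrative assumptions, not part of any specific LLM implementation.

```python
import torch

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    # Keep only the k highest-scoring tokens; mask the rest to -inf.
    kth_best = torch.topk(logits, k).values[-1]
    return logits.masked_fill(logits < kth_best, float("-inf"))

def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    # Keep the smallest set of tokens whose cumulative probability reaches p.
    sorted_logits, order = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    sorted_logits[cum - probs > p] = float("-inf")  # mass before token exceeds p
    out = torch.full_like(logits, float("-inf"))
    return out.scatter(0, order, sorted_logits)

def sample(logits: torch.Tensor, temperature: float = 1.0) -> int:
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(32000)               # toy next-token logits over a vocabulary
greedy_token = logits.argmax().item()     # greedy search picks the argmax
sampled_token = sample(top_p_filter(top_k_filter(logits, 50), 0.95), 0.7)
```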
As used in the disclosure herein, “token” refers to the smallest text units individually processed by an LLM. Tokens can be words, subwords (smaller units into which a word may be split under different tokenization methods), characters, or special tokens used for specific purposes in the LLM.
As used in the disclosure herein, “knowledge groundedness” refers to the extent to which the results of the decoding, i.e. the generated text, align with factual and contextually appropriate information. As such, knowledge groundedness also reflects the degree of hallucination of the outputs from an LLM.
As used in the disclosure herein, “faithfulness” refers to the degree to which the tokens or generated text from the LLM accurately reflect the information or data input to the LLM, such that the generated output contextually aligns with the information received by the LLM.
Hallucination of language models, or the generation of content that is either non-factual or not supported by evidence, has been studied and reported in various fields such as machine translation, abstractive summarization and image captioning. Recently developed LLMs such as Bing Chat or perplexity.ai are used as generative search engines even though their responses are not always verifiable. To automatically detect and quantify hallucination in model-generated text, several detection methods and benchmarks have been designed.
Knowledge-grounded generation is mainly driven by retrieving relevant knowledge and training the generator to produce responses conditioned on the retrieved knowledge, or by learning and storing entity or fact representations that are provided as input to the generator. While these methods all address the problem of knowledge-grounded generation, they all require full fine-tuning of the generator, which incurs a significant computational cost and may degrade the zero-shot ability of the base model, i.e. its ability to perform an unseen task without explicit training on relevant examples, due to catastrophic forgetting.
Guided decoding includes supervised controllable text generation, discriminator-guided decoding, and constrained decoding, in which the user controls the sentiment or style of the generated text or constrains the generation to satisfy lexical constraints. Different guided decoding frameworks take different approaches. For example, Plug-and-Play Language Model (PPLM) introduces a key concept of Bayesian decomposition P(y|x, c)∝P(y|x)P(c|y), where c is the control attribute. PPLM trains a small discriminator on a frozen LM and performs gradient ascent from the discriminator to maximize P(c|y). FUDGE performs weighted decoding (WD) by directly re-weighting the token probabilities P(yt|y<t, x) with an auxiliary classifier probability P(c|y<t). To perform re-ranking at every step, P(y|x)P(c|y) is decomposed to the token level, and a token-level attribute classifier P(c|y<t) is used. NADO samples from a similar token-level distribution that is also weighted by P(c|y<t), which is defined as an approximation of the sequence-level oracle P(c|y). GeDi and DExperts also take the weighted decoding approach but avoid enumerating the vocabulary for computing P(c|y<t,i) by training generative classifiers.
Constrained decoding methods focus on constraint satisfaction, such as lexical constraints or right-hand-side coherence. As constraint satisfaction can be measured after a sequence is fully generated, search-based methods that take the estimate of the future score (reward) in each decoding step have been proposed. Unlike weighted decoding, these methods commit to a token not only based on the current token's score but also on the estimate of future rewards.
In the present invention, a novel method named Knowledge-Constrained Tree Search Decoding (KCTS) is provided, which is the first discriminator-guided constrained decoding approach to knowledge-grounded generation. KCTS applies an auxiliary knowledge classifier on top of a frozen LM to detect hallucinations and uses its knowledge-groundedness score to guide the decoding process.
The KCTS method also employs Monte-Carlo Tree Search decoding (MCTS) instead of conventional LM decoding methods, re-ranking each token based on the estimated future faithfulness starting from that token, computed over multiple simulations. This leads to a better estimate of a token's value in comparison to the previous arts, which re-rank the tokens based only on the current faithfulness score. By incorporating the knowledge classifier and the classifier-guided decoding algorithm, KCTS constrains the generated text, ensuring its faithfulness to the reference knowledge.
In addition, a novel token-level hallucination detection method, RIPA (Reward Inflection Point Approximation), is trained to predict the starting point of the hallucinated tokens, enabling effective adaptation of the knowledge classifier defined on the sequence level to the token level. The RIPA knowledge classifier solves two problems from the previous arts: it avoids expensive retraining of LM weights by keeping the LM frozen and training adapter weights only, and it precisely determines whether the text being generated contains hallucination. This provides a better signal to the MCTS decoding algorithm, leading to more faithful text generation.
Further, the KCTS method of the present invention is model-agnostic, enabling post-hoc application to models which have been previously trained. Also, the KCTS model does not involve fine-tuning of the LLM, which incurs a heavy computational cost.
The present invention improves the factual generation ability of instruction-tuned LMs under a constrained decoding setting. The problem can be formulated as
y˜PLM(y|x, k, αk), (1)
where y is the generated text, x is the input text with the task description, k is the piece of knowledge that y must be constrained to, and αk is the attribute denoting the groundedness of y to k.
Let f(y, k)=P(αk=1|y, k) be a function that defines the faithfulness of the generation y to k. Following the Bayesian decomposition above, Bayes' rule is applied to equation (1) to obtain equation (2) below:
PLM(y|x, k, αk)∝PLM(y|x)f(y, k). (2)
From an optimization perspective, obtaining a generation that is best grounded in the knowledge while being faithful to the task instruction can be written as the equation below:

y*=argmaxy PLM(y|x)f(y, k). (3)
Then, given the auto-regressive nature of language models, this equation can be decomposed to the token level, as found in FUDGE:

PLM(yt|y<t, x, αk)∝PLM(yt|y<t, x)f(y≤t, k). (4)
Knowledge faithfulness f (or hallucination in the opposite perspective) is well-defined at the sequence level, which can be modeled as an entailment or fact verification problem. However, to guide the generation at each step, we need to define f(y<t, k) for partially generated y<t. Following NADO (Meng et al., 2022), we define f(y<t, k) as the approximation of future faithfulness, as denoted in the following equation:
f(y<t, k)≈f(P(y|y<t, x), k). (5)
While weighted decoding (WD) re-weights the token distribution with the knowledge-groundedness score at every step, it selects the most grounded token in a greedy manner, which may lead to a suboptimal solution. This is especially problematic given that the groundedness is only well-defined after the sequence is fully generated, and the guidance signal from the classifier at each step is an approximation. To this end, we propose to use the Monte-Carlo Tree Search Algorithm (MCTS), which can provide a better estimate of the future knowledge groundedness through multiple simulations, as has been proven effective in other scenarios such as sentiment polarity control.
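For illustration, the following is a minimal sketch of the weighted decoding step described above (equation (4)): the LM's top-k token probabilities are re-weighted by the classifier's approximate groundedness score, and the best token is committed greedily. The lm and classifier callables are assumed interfaces, not the actual implementation of the present invention.

```python
import torch

def weighted_decoding_step(lm, classifier, prefix, knowledge, k=50):
    # lm(prefix) -> next-token logits; classifier(tokens, knowledge) -> f in [0, 1]
    probs = torch.softmax(lm(prefix), dim=-1)
    top = torch.topk(probs, k)                      # enumerate only top-k candidates
    scores = torch.tensor([
        classifier(prefix + [t], knowledge)         # f(y_<=t, k) for each candidate
        for t in top.indices.tolist()
    ])
    reweighted = top.values * scores                # P_LM(y_t | y_<t, x) * f(y_<=t, k)
    return top.indices[reweighted.argmax()].item()  # greedy choice: may be suboptimal
```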
MCTS consists of four main steps: Selection, Expansion, Rollout, and Backpropagation.
Selection: Starting from the root (the currently generated sequence), we traverse the tree down until we reach a leaf node, selecting the children using the PUCT algorithm:

s*=argmaxsi [V(si)+cpuct·P(si|x, y<t)·√Ni/(1+ni)], (6)
where V(si) is the estimated value of node si, ni is the visit count of the node si (i.e., the number of simulations after the node), and Ni is the visit count of the parent of si. cpuct is a hyperparameter that controls the trade-off between exploration and exploitation, with higher cpuct encouraging exploration. P denotes the language model probabilities.
Expansion: If the selected leaf node is not EOS (terminal state), the node is expanded in depth with k children by decoding for one step using the LM and selecting top-k tokens as the children.
Rollout (Evaluation): From the selected leaf node s, generate until EOS using the language model, then evaluate the faithfulness of the generated sequence, f(y, k), using the knowledge classifier denoted as f. Let this be the value of s, V(s)=f(y, k). However, such a full rollout can be costly and result in high variance. Hence, the approximated token-level groundedness score is used instead: V(s)←f([y<t; s], k), where [y<t; s] refers to the currently generated sequence y<t appended with the token s.
Backpropagation: This score is then backpropagated recursively from the evaluated node back to the root. Following PPL-MCTS, mean aggregation of all simulations played after each node is used, which leads to

V(ρ(si))←(nρ(si)·V(ρ(si))+V(si))/(nρ(si)+1), (7)

for all si on the path from the leaf node s to the root, where ρ(si) is the parent of si. These values are used in the PUCT algorithm to select the nodes in Step 1 of the next simulation.
Steps 1 to 4 are repeated for a predefined number of simulations, as condensed in the sketch below. For the present invention, the number of simulations is set to 50 for strong and reliable performance in terms of knowledge-grounded generation.
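The following condenses the four steps into an illustrative Python sketch. Node, lm, and classifier are assumed interfaces (a node stores its tokens, prior, value, visit count, parent, and children); this is a sketch of the described procedure under those assumptions, not the actual implementation.

```python
import math

def puct_score(child, c_puct=3.0):
    # V(s_i) + c_puct * P(s_i) * sqrt(N_i) / (1 + n_i), as in the Selection step.
    exploration = child.prior * math.sqrt(child.parent.visits) / (1 + child.visits)
    return child.value + c_puct * exploration

def mcts_decode_step(root, lm, classifier, knowledge, n_simulations=50, width=10):
    for _ in range(n_simulations):
        node = root
        while node.children:                        # 1. Selection via PUCT
            node = max(node.children, key=puct_score)
        if not node.is_terminal():                  # 2. Expansion: top-k children
            node.expand(lm, k=width)
        # 3. Evaluation: approximated token-level groundedness, no full rollout
        value = classifier(node.tokens, knowledge)
        while node is not None:                     # 4. Backpropagation (mean)
            node.value = (node.visits * node.value + value) / (node.visits + 1)
            node.visits += 1
            node = node.parent
    return max(root.children, key=lambda c: c.value)  # commit to the best child
```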
f is first modelled as a fact verification problem, and a binary classifier f(y, k)=Pf(αk=1|y, k) is trained on the sequence level. To adapt f to token-level faithfulness f(y<t, k), previous methods trained a classifier with random input-sequence truncation or token-level labeling. The random truncation approach can be sample-inefficient, as only a part of the input is used during training, and it may add noise to training, since the input sequence may no longer contain hallucinated content after truncation while still receiving a negative label. Although the token-level labeling approach can be more sample-efficient, it may induce even more noise during training and hinder convergence.
To alleviate these shortcomings, a novel approach called Reward Inflection-Point Approximation (RIPA) is developed to approximate future f for unfinished token sequences by explicitly providing a token-level label for groundedness. A schematic diagram of the comparison of RIPA and previous approaches can be found in
RIPA does not associate benign tokens with hallucination labels, which may lead to more stable training. Additionally, it is trained to predict 0 for all tokens after hallucination is detected, which will further discount future exploration under that node in MCTS, discouraging the selection of that token. Together, RIPA and MCTS (i.e. KCTS) provide a better estimate of equations (4) and (5).
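A minimal sketch of the RIPA labeling scheme follows: every token before the inflection point is labeled 1 (grounded), and the first hallucinated token and all subsequent tokens are labeled 0. The inflection-point index is assumed to come from the synthetic data construction described next.

```python
from typing import Optional

def ripa_labels(num_tokens: int, first_hallucination: Optional[int]) -> list:
    # A fully grounded sequence has no inflection point: all tokens are positive.
    if first_hallucination is None:
        return [1] * num_tokens
    # Tokens before the inflection point stay positive; the first hallucinated
    # token and everything after it are negative.
    return [1] * first_hallucination + [0] * (num_tokens - first_hallucination)
```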
Training RIPA requires fine-grained token-level annotation of hallucination, which is difficult to acquire through human annotation. Alternatively, two simple approaches to generating synthetic data are listed below.
Knowledge Shuffle: Given a training example (y, x, k)˜D, randomly swap k with another knowledge k′ from the knowledge source to form a negative example. Then, although the relevance between y and x remains unchanged, the groundedness of y on k becomes negative, as y is no longer based on k. All tokens in y are labeled 0.
Partial Hallucination: Similar to the above, given a training example (y, x, k)˜D, first randomly swap k with another knowledge k′. Then, randomly sample a position 1<i<T, where T is the length of y, and truncate the response y to the i-th token to obtain yi. An LM is then asked to complete the sequence yi by sampling from PLM(y|x, yi, k) in a zero-shot manner, with the knowledge text k included inside the instruction. Notice that the goal here is to utilize the hallucination tendency of LMs; hence, the completion is sampled with a temperature greater than 1. In this approach, only the completion tokens (y>i) are labeled as 0.
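The two recipes can be sketched as follows. Dataset items are (y, x, k) triples with y tokenized, knowledge_pool stands for the knowledge source, and lm_complete stands in for zero-shot completion at temperature above 1; all names are illustrative assumptions.

```python
import random

def knowledge_shuffle(example, knowledge_pool):
    y, x, k = example
    k_neg = random.choice([kk for kk in knowledge_pool if kk != k])
    # y is no longer based on the swapped knowledge: every token is labeled 0.
    return (y, x, k_neg), [0] * len(y)

def partial_hallucination(example, knowledge_pool, lm_complete):
    y, x, k = example
    k_neg = random.choice([kk for kk in knowledge_pool if kk != k])
    i = random.randint(2, len(y) - 1)          # truncation position 1 < i < T
    # Complete the truncated response zero-shot with temperature > 1 to induce
    # hallucination (exact conditioning follows the recipe described above).
    completion = lm_complete(x, y[:i], k_neg, temperature=1.4)
    # Prefix tokens stay positive; only the completion tokens are labeled 0.
    return (y[:i] + completion, x, k), [1] * i + [0] * len(completion)
```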
A balanced mixture of the two is used to obtain the training set; however, for tasks in which x and k are indistinguishable (e.g. summarization), the problem becomes P(y|k), and therefore only the partial hallucination approach was employed. Detailed hyperparameters used in each task are presented further below.
To show the strength of the guided decoding method in knowledge-grounded generation, two well-studied tasks are selected: knowledge-grounded dialogue and abstractive summarization. In both tasks, the language model is given a piece of reference knowledge in the input and asked to generate a response using that knowledge.
Knowledge Grounded Dialogue: Knowledge-grounded dialogue (KGD) can be formulated as modeling PLM(y|x, k), where y is the generated response, x is the dialog history, and k is the relevant knowledge. Gold knowledge is used in the experiments, as the focus of this study is to show the potential of constrained decoding in knowledge grounding. The unseen-topic portion of the test set of the Wizard of Wikipedia (WoW) dataset is used as the benchmark for this task.
Summarization: Abstractive summarization can be naturally considered a knowledge-grounded generation task, as the generated summary should only contain the information from the reference document. Improving factual faithfulness in abstractive summarization is a challenging task. The CNN/DM dataset is used as the benchmark dataset for investigation.
Various evaluation metrics applied for knowledge grounding and natural language generation are used. The metrics are categorized into three categories: token-based, knowledge-based, and multi-faceted. For token-based automatic metrics, BLEU-4, Rouge-L, ChrF, and METEOR are used. For knowledge-based metrics, Knowledge-F1 (KF1) is first used, which measures the unigram overlap between the generated and knowledge tokens, followed by K-Copy, as defined in equation (8):

K-Copy(y, k)=1−LD(y, k)/max(|y|, |k|), (8)
where LD stands for Levenshtein Distance between the generated response and reference knowledge string. This metric captures the amount of verbatim copies of the knowledge in a generation. The purpose of this metric is to monitor if the model simply copies the knowledge as a response, which defeats the purpose of using a generative LM. Hence, an excessively high copy rate (e.g., ≥70%) may indicate a reduced utility of the response.
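A self-contained sketch of this metric, under the reconstructed form of equation (8), might look as follows; the dynamic-programming Levenshtein distance is included for completeness.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def k_copy(response: str, knowledge: str) -> float:
    # 1 - LD(y, k) / max(|y|, |k|): 1.0 indicates a verbatim copy of the knowledge.
    return 1.0 - levenshtein(response, knowledge) / max(len(response), len(knowledge))
```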
Finally, UniEval, a multifaceted, model-based evaluator trained using Boolean QA format, is also utilized. For the dialog task, Naturalness, Coherence (with dialogue context), and Groundedness (to the knowledge), are utilized; and for summarization, Coherence (within the summary), Consistency (with the article), Fluency, and Relevance (to the gold answer) are taken as fine-grained evaluation dimensions. For summarization, the MFMA (Lee et al., 2022) pre-trained metric is also employed, which showed SOTA-level correlation with human labels on the CNN/DM (data split of FRANK and QAGS benchmark).
Popular API-based LLMs and publicly available instruction-tuned language models of various sizes are used as the initial baselines. The LLMs provided through the OpenAI API are investigated, namely ChatGPT (gpt-3.5-turbo-0301) and GPT-3.5 (text-davinci-003). For instruction-tuned models, two different sizes (XL & XXL) of the Flan-T5 (FT5) model family and T0++ are investigated. Note that they are not directly compared with the present method due to the significant differences in model size and training cost.
Note that while FT5 and T0++ models have been fine-tuned on some dialogue data, knowledge-grounded dialogue is still an unseen task for these models. Hence, zero-shot results are first gathered from various instruction-tuned models and experimented with guided decoding. On the other hand, the CNN/DM summarization task is included in the T0 dataset mixture and the Natural Instructions dataset, which was part of the Flan fine-tuning. Therefore, performing Knowledge-Constrained Decoding (KCD) on the CNN/DM test set can be considered guiding an already fine-tuned model to improve the factuality dimension further.
Then, weighted decoding (WD) and constrained decoding baselines, namely FUDGE, NADO, and MCTS, are applied directly in the KCD setting, serving as strong baselines directly comparable to the present method.
To train the classifiers, lightweight adapters are applied through LoRA only to the decoder layers of the language model, and a single linear layer is added on top of the last hidden states. This adds only 0.21% of additional training weights, which can be disabled or enabled at test time, and does not hurt the rich multi-task ability of the base instruction-following model. More details about model training are available further below.
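A hedged sketch of this classifier setup, using the huggingface transformers and peft libraries mentioned further below, might look as follows. The LoRA rank, alpha, and target modules are illustrative assumptions (the text restricts the adapters to decoder layers only).

```python
import torch.nn as nn
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q", "v"],   # assumption; the text applies adapters to the decoder only
)
model = get_peft_model(base, lora_config)   # base LM weights remain frozen

# A single linear head maps each last hidden state to a token-level groundedness
# logit; enabling/disabling the adapter toggles the classifier at test time.
groundedness_head = nn.Linear(base.config.d_model, 1)
```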
The performance of zero-shot LLMs and various instruction-finetuned models is reported in the upper half of Table 1, and the performance of directly comparable decoding-based baselines and the methods of the present invention in the lower half. The performance of a Supervised-FineTuned (SFT) version of FT5-XL for the KGD task is also studied. Note that the performance in the upper half (LLM and SFT) is only used to provide an overview of how powerful each language model is when tested on WoW and is not directly compared to the present methods. The instructions used for each model are listed further below.
SFT stands for supervised fine-tuning, and FT5 is shorthand for Flan-T5. Under the UniEval metrics, each letter stands for the following: N—Naturalness, C—Coherence, G—Groundedness. For all metrics, a larger number is preferred, except for K-Copy. Note that the performance of LLM in the upper half is for reference only. For each column, boldface denotes the best score out of the KCD methods under the FT5-XL backbone, and underline indicates the second best.
From the results in the upper half of Table 1, ChatGPT shows a strong ability to generate responses with high overlap with the knowledge. On the other hand, FT5-XL, while being the smallest in size (3B) out of all the models studied, showed the best performance in generating responses that are the most grounded to the reference knowledge, as indicated by the KF1 and Groundedness columns of Table 1. Therefore, FT5-XL is selected as the base model in further studies for guided decoding.
The comparison of baseline decoding methods and KCTS of the present invention can also be found in the lower half of Table 1. Knowledge-Weighted Decoding (KWD), a RIPA-guided WD variant, is also evaluated for ablation purposes. The penultimate group contains the results of baseline guided decoding methods, each used with its own proposed approximation of token-level classifiers. FUDGE and MCTS both use the random truncation approximation, and NADO uses a token-level labeling approach. All decoding methods showed improvement over the nucleus sampling baseline regarding groundedness to the knowledge, indicated by a higher KF1 score and the Groundedness column of UniEval. The results also clearly show that RIPA provides a better token-level guidance signal for KCD, as both KWD and KCTS show improvement in all dimensions except naturalness. KCTS also resulted in the highest f activation, confirming the hypothesis that MCTS, which also estimates future rewards for token selection through simulations, can produce a higher reward.
To show that KCTS guides the future generation trajectory towards a more grounded response in the end, token generation is constrained for the initial T tokens, after which the original language model completes the sentence with nucleus sampling. The results in Table 2 indicate that the initially grounded tokens provide a good context to the LM that leads to a more grounded response generation, following the intuition of using MCTS with RIPA to fulfill the definition of future faithfulness f(y<t, k)≈f(P(y|y<t), k). Moreover, this provides a performance/speed trade-off parameter that users can adjust.
A human evaluation of the generated responses is also conducted to assess their quality. 100 examples are randomly sampled, and three different evaluators are asked to measure their fluency, relevance to the dialogue context, groundedness to the reference knowledge, and whether the response is an unnatural copy of the knowledge. The human evaluation results in Table 3 further confirm the strength of KCTS, which received better scores in all dimensions than the baselines. Furthermore, KCTS resulted in higher relevance and groundedness while copying less knowledge than FUDGE, which suggests that KCTS responses have higher perceived utility. The results in the Non-Copy group also show that KCTS outperforms the baselines even when knowledge-copied responses are excluded.
Fl., Relv., and Gr. stand for Fluency, Relevance, and Groundedness, respectively. The interrater agreement by Krippendorff's alpha was 0.57, 0.46, 0.77, and 0.31, respectively. Non-Copy denotes the average scores of examples for which annotators agreed the generation does not copy the knowledge.
The same models as in 4.1 are used. From the results in Table 4, it can be observed that ChatGPT again outperforms other models in most dimensions, except for the BLEU and Rouge metrics. On the other hand, the instruction-tuned models show a different trend than in the KGD setting; the T0++ model outperforms the FT5 models, presumably because Flan fine-tuning introduces a vast number of additional tasks beyond the T0 fine-tuning dataset mixture, leading to deteriorated performance on some tasks.
The guided decoding was conducted with FT5-XL model as the base model. Coh., Cons., and Relv. stand for coherence, consistency, and relevance, respectively. As the performance of LMs is for reference, best scores on the last two groups are highlighted with boldface and second-best with underline.
For efficiency, the smallest Flan-T5-XL model is used throughout guided decoding experiments with the summarization task. KCTS again showed superior performance over the baseline methods, with significant improvement in token overlap and MFMA. RIPA-guided decoding also outperformed all baseline methods in UniEval, with KWD showing slightly better performance than KCTS.
50 samples are randomly selected for human evaluation, with the same three human evaluators from the KGD task as above. The evaluators were asked to evaluate the summaries in three dimensions: fluency, groundedness, and completeness. Groundedness is analogous to the precision of knowledge, while completeness is similar to recall. It can be observed from Table 5 that the evaluators preferred KCD over the baselines in all dimensions, including groundedness and completeness.
This follows a 3-point Likert scale with agreement alpha of 0.35, 0.44 and 0.19.
The instructions used for different models and tasks are listed in Table 6. For ChatGPT with the KGD task, we used the chat completion API, where each dialogue turn is separated and formatted as a user/assistant message, and the instruction was given as the system message at the end.
As a preliminary study, knowledge-constrained decoding is applied on LLMs, specifically GPT-3.5 (text-davinci-003). One limitation of the OpenAI API is that it does not return the token probability distribution over the whole vocabulary; at most, the top-5 log probabilities are returned. This significantly limits the search space for all WD methods, which may reduce the ability to guide the generation toward the objective. Hence, we propose a new method called pre-WD, where we use a proxy model to propose the top-k tokens first, re-rank the tokens with RIPA, and then include them in the API request in the logit bias field, which is added to the logits of the LLM before sampling. This can be denoted as:

Zi=ZLLM,i+αZ̃i,
where Zi is the logit of a token i, ZLLM is the logit of the base LLM, and Z̃ is the logit of the smaller proxy model. α is another hyper-parameter that controls the strength of the logit bias. Since GPT2 shares its vocabulary with the GPT3 family, Z̃ was computed with GPT2-XL.
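An illustrative sketch of pre-WD follows: a GPT2-XL proxy proposes the top-k next tokens, an assumed ripa_score interface re-ranks them by groundedness, and the result is returned in the logit_bias format accepted by the OpenAI completion API (token id mapped to an additive bias). The exact bias computation here is an assumption for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
proxy = GPT2LMHeadModel.from_pretrained("gpt2-xl")

def pre_wd_logit_bias(prefix: str, knowledge: str, ripa_score, k=10, alpha=5.0):
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = proxy(ids).logits[0, -1]          # proxy next-token logits (Z~)
    candidates = torch.topk(logits, k).indices.tolist()
    # Re-rank the proxy's proposals by groundedness and bias them upward;
    # GPT2 shares its BPE vocabulary with the GPT3 family, so token ids transfer.
    return {
        str(t): alpha * ripa_score(prefix + tokenizer.decode([t]), knowledge)
        for t in candidates
    }
```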
100 examples are randomly sampled from the WoW test set for this experiment. The results in Table 7 show that applying post-guidance does not improve much, as the search space is limited: without sufficient width, the generation is bound to what the base model believes. On the other hand, although the overlap between tokens proposed by the proxy model and the actual token distribution is unknown, the empirical results suggest that this method can successfully add bias toward tokens that are grounded on reference knowledge. Finally, using both pre-guidance and regular post-reweighting together can result in the most faithful generation.
The response length was set to 64 tokens for summarization and 32 for KGD. For nucleus sampling, top-p was 0.95 with temperature=1, which also applies to the OpenAI models. For all decoding methods studied, top-k filtering is applied with k=50. In addition, in NADO, the constraining factor α was set to 0.25, and in MCTS, the constant cpuct=3. A repetition penalty of 1.2 was also applied for MCTS, following the original implementation.
The synthetic data generated has the following statistics: for WoW, 8,832 partial hallucination examples were generated using FT5-XL in a zero-shot manner, with temperature T=1.4 to encourage hallucination. 10,000 knowledge shuffle examples are also sampled, along with 20,000 original examples, leading to a balanced mixture of 20 k positive examples and 18.8 k negative examples. For CNN/DM, 12,811 partial hallucination negative examples were generated using the same procedure. The final dataset included 13,180 positive examples as well. A random 9:1 split is then applied to obtain the training and validation sets.
All experiments were conducted on NVIDIA GPUs with CUDA 11.7 and CuDNN 7.5 or higher enabled, using RTX A6000 48 GB, RTX 3090 Ti 24 GB, or RTX 3090 24 GB GPUs. All classifier training was performed with an effective batch size of 64 for KGD and 32 for summarization, for 2000 steps. For efficiency, the models are loaded in 8-bit quantization with bitsandbytes, both during training and inference.
All implementations were based on the huggingface transformers and peft libraries. The evaluate library is utilized for metric implementations. All the pretrained model weights were downloaded from the huggingface hub. Human evaluation is performed by three postgraduate NLP researchers with at least one year of experience in the field.
Some example outputs from each method are included for the case study: KGD in Table 8 and summarization in Table 9. Examples 1 and 2 of Table 8 show the strength of the present approach. In Example 1 of Table 8, while FT5-XL fails to address the knowledge entirely and FUDGE outputs “sweater”, which is not mentioned in the knowledge, KCTS uses the knowledge in a natural way to generate the response. ChatGPT also generates a well-grounded response. In Example 2 of Table 8, FT5-XL is not very informative and FUDGE makes an unsupported claim about the goldendoodle's popularity, while KCTS generates the most knowledgeable response. On the other hand, Example 3 of Table 8 shows a failure case for all models: while the provided knowledge is about “fish protein powders” being sanitary, ChatGPT mentions high protein content and the other models mention taste.
Table 9 includes two success cases for KCD. The ChatGPT response tends to be more detailed, which may contribute to having higher completeness in human evaluation, but it can be not sufficiently concise to serve as a good summary. FT5-XL may generate seemingly unrelated comments, and FUDGE can degenerate in fluency during token re-ranking. KCTS generates the most concise and faithful summary that captures the core of the article.
Throughout this specification, unless the context requires otherwise, the word “comprise” or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated integer or group of integers but not the exclusion of any other integer or group of integers. It is also noted that in this disclosure and particularly in the claims and/or paragraphs, terms such as “comprises”, “comprised”, “comprising” and the like can have the meaning attributed to it in U.S. Patent law; e.g., they allow for elements not explicitly recited, but exclude elements that are found in the prior art or that affect a basic or novel characteristic of the present invention.
Furthermore, throughout the specification and claims, unless the context requires otherwise, the word “include” or variations such as “includes” or “including”, will be understood to imply the inclusion of a stated integer or group of integers but not the exclusion of any other integer or group of integers.
References in the specification to “one embodiment”, “an embodiment”, “an example embodiment”, etc., indicate that the embodiment described can include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Other definitions for selected terms used herein may be found within the detailed description of the present invention and apply throughout. Unless otherwise defined, all other technical terms used herein have the same meaning as commonly understood to one of ordinary skill in the art to which the present invention belongs.
It will be appreciated by those skilled in the art, in view of these teachings, that alternative embodiments may be implemented without undue experimentation or deviation from the spirit or scope of the invention, as set forth in the appended claims. This invention is to be limited only by the following claims, which include all such embodiments and modifications when viewed in conjunction with the above specification and accompanying drawings.
The present application claims priority from U.S. provisional patent application Ser. No. 63/588,297 filed Oct. 6, 2023, the disclosure of which is incorporated by reference in its entirety.