The embodiments relate generally to machine learning systems and document summarization, and specifically to systems and methods for controlling hallucinations in abstractive summarization with enhanced accuracy.
Abstractive summarization models generate a summary of a document by combining and rephrasing words and phrases from the document. Prior abstractive summarization systems tend to hallucinate (generate false information by combining words or phrases incorrectly) at a high frequency. Such hallucinations may broadly be classified as extrinsic, when a model adds information that is not present in the source document, and intrinsic, when the model distorts information present in the source document into a factually incorrect representation.
Neural abstractive text summarization systems, trained by maximizing the likelihood of a reference summary given its source document, have been shown to generate plausible summaries. However, recent human analyses and automatic evaluations have shown that the generated summaries tend to contain factual errors (e.g., hallucinations). In addition, higher empirical performance on standard evaluation metrics such as ROUGE score, achieved by other methods, does not necessarily imply higher faithfulness to the source document. Therefore, there is a need for a more factual document summarization system.
In the figures, elements having the same designations have the same or similar functions.
As used herein, the term “network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system and/or any training or learning models implemented thereon or therewith.
As used herein, the term “module” may comprise a hardware- or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.
Prior abstractive summarization systems tend to hallucinate information at a high frequency, resulting in output summaries that fail to accurately reflect the contents of the source documents. Such hallucinations may broadly be classified as extrinsic, when a model adds information that is not present in the source document, and intrinsic, when the model distorts information present in the source document into a factually incorrect representation. Models trained on reference summaries that include extrinsic hallucinations tend to generate a higher proportion of extrinsic hallucinations than models trained on cleaner data sets.
Embodiments described herein provide a document summarization framework, referred to as the “Mixture of Factual Experts (MoFE)” framework, that controls different types of factual errors. MoFE applies an ensemble of factual expert models to control hallucination in summarization systems. Each factual expert model is trained to generate summaries with a particular factual quality, such as low extrinsic hallucinations or low intrinsic hallucinations. The overall factual quality of MoFE may be controlled through the relative weight of each factual expert. For example, MoFE may have three factual expert models: one optimized for minimal intrinsic factual errors, one optimized for minimal extrinsic factual errors, and one optimized for high informativeness. The three experts may be ensembled (either through logits ensembling or a weighted average of parameters) to create a combined output that shares characteristics of each expert according to its relative weight.
In one embodiment, factual consistency metrics may be used to filter training data in order to adjust the training inputs for each respective expert. For example, a metric may measure the amount of extrinsic errors in a summary. By measuring the extrinsic errors of the summaries in the training data, those with high amounts of extrinsic errors may be filtered out, and a factual expert trained on the remaining summaries will tend to produce summaries with few extrinsic errors.
In one embodiment, the MoFE model may be applied to achieve different quality goals for summarization by applying different weights to each respective factual expert. For example, a specific goal may be to keep factual content recall above a certain threshold while maintaining the least amount of intrinsic and extrinsic hallucinations. Such a goal may be met by adjusting the relative weights of the experts when they are ensembled. After the individual experts are trained, the model may still maintain the flexibility to adjust the weights used during the ensembling process so that the goal may be dynamically adjusted. For example, in some embodiments, when a summary produced by the baseline summarization model does not contain any factual errors (neither intrinsic nor extrinsic), the model may ignore the expert models and output the summary produced by the baseline model. Effectively, this is a dynamic adjustment of the weights of the models.
Examples of factual accuracy metrics which may be used include entity overlap for measuring extrinsic hallucinations and dependency arc entailment (DAE) for measuring intrinsic hallucinations. Entity overlap evaluates the number of entities in a summary that are absent from the source document and can be used as a direct measure of extrinsic hallucination. Intrinsic hallucination, on the other hand, is broader and includes errors such as incorrect predicates or their arguments, coreference errors, discourse link errors, etc. Since DAE accuracy measures fine-grained entailment relations at the dependency arc level, it is a reasonable proxy for measuring intrinsic hallucinations. In one embodiment, both metrics may be used to compute rewards for training experts targeting both types of hallucination.
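As an illustrative sketch only (not part of the claimed embodiments), the entity overlap measure described above could be computed along the following lines, assuming spaCy is used to identify named entities; the function name and scoring choices are hypothetical.

```python
# Illustrative sketch only: entity overlap precision as a proxy for extrinsic
# hallucination. Assumes spaCy and its small English model are installed
# (pip install spacy; python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_overlap_precision(summary: str, source: str) -> float:
    """Return the fraction of named entities in the summary that also appear
    in the source document (1.0 means no extrinsic entity hallucination by
    this proxy)."""
    summary_entities = {ent.text.lower() for ent in nlp(summary).ents}
    if not summary_entities:
        return 1.0  # nothing to hallucinate
    source_text = source.lower()
    grounded = sum(1 for ent in summary_entities if ent in source_text)
    return grounded / len(summary_entities)
```

An entity recall variant, computed against the reference summary instead of the source document, could similarly serve as a reward for an informativeness-oriented expert as described below.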
In one embodiment, MoFE may also include an entity recall-based expert in addition to the experts trained using the entity overlap and DAE metrics, because experts trained using those two metrics are prone to reducing factual recall.
A summarization model is pre-trained on the unfiltered training data 110 using maximum likelihood estimation (MLE) or another method to produce pre-trained summarization model 120. This pre-trained model may be used as the starting point for training each of the factual expert models, as indicated by the arrows from the pre-trained summarization model 120 to the factual experts.
Training data 110 is partitioned into multiple subsets which may or may not overlap. Each subset (e.g., 130, 140) may be generated by filtering the training data with a particular factual consistency metric. There are three well-known paradigms for evaluating the factual consistency of summaries generated by a model. The first is entity overlap precision, which measures token-level overlap between the information of interest (e.g., named entities) in the summary and the source document. This metric can be used as a proxy to measure simpler cases of hallucinations, such as extrinsic entity errors. The second type of evaluation determines whether the facts claimed in a summary are entailed by the source document. Two well-known entailment-based metrics are FactCC, which measures entailment at the summary level, and DAE, which measures fine-grained entailment by breaking the summary into smaller claims defined by dependency arcs. DAE correlates with human judgment of factuality and has the highest correlation with complex discourse errors, such as entity coreference. The third and most complex methods for evaluating factuality rely on question generation (QG) and question answering (QA). They first use a QG module to generate questions based on summaries and then use a QA module to find answers in the source document. Because these methods are computationally expensive to use for training experts, they are not used in the examples herein, although they could be used in training factual experts.
The documents and reference summaries in the training data 110 may be analyzed to identify some feature and/or given a score according to some metric such as the ones described above. (In some aspects, the identification of a feature may also be considered a score, i.e., a summary with the feature is scored a 1 and a summary without the feature is scored a 0.) Document/summary pairs that are identified as having some feature and/or that exceed some predetermined threshold may be included in a subset. In some aspects, the training system performs the scoring/identifying step; in other aspects, the training data 110 as provided to the system includes scores for the document and reference summary pairs. Subset 130 may use entailment-based filtering on training data 110 so that it only includes summaries with no entailment error according to some metric. For example, subset 130 may be produced by measuring dependency arc entailment (DAE) accuracy between the source document and reference summary, and retaining only the training samples in which all the dependency arcs in the summary are entailed by the source document, to control intrinsic hallucinations. Subset 140 may use entity overlap-based filtering on training data 110 so that it only includes summaries with no extrinsic entity error according to some metric. For example, subset 140 may be produced by using SpaCy to identify named entities and then filtering to only include summaries in which all the entity tokens are also mentioned in the source document.
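For illustration only, the metric-based filtering of subsets 130 and 140 might be sketched as follows, reusing the entity_overlap_precision helper above and treating dae_accuracy as a hypothetical stand-in for a dependency arc entailment scorer (an actual DAE model is not shown here).

```python
# Illustrative sketch of metric-based filtering of (document, summary) pairs.
# `dae_accuracy` is a hypothetical stand-in for a dependency arc entailment
# scorer returning the fraction of summary dependency arcs entailed by the
# source document.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source document, reference summary)

def filter_pairs(pairs: List[Pair],
                 score_fn: Callable[[str, str], float],
                 threshold: float = 1.0) -> List[Pair]:
    """Keep only the pairs whose summary scores at or above the threshold."""
    return [(doc, summ) for doc, summ in pairs if score_fn(summ, doc) >= threshold]

# Subset 130: keep summaries with no entailment error according to DAE.
# subset_130 = filter_pairs(train_pairs, dae_accuracy, threshold=1.0)

# Subset 140: keep summaries whose entity tokens all appear in the source.
# subset_140 = filter_pairs(train_pairs, entity_overlap_precision, threshold=1.0)
```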
Different partitioned subsets of training data 110 may be used for training/fine-tuning the factual expert models using reinforcement learning (RL). A model which maximizes the log-likelihood of reference summaries can efficiently learn to generate summaries with high n-gram overlap but may fail to learn to enforce factual consistency. Therefore, the training of factual experts may be done by directly optimizing for factual consistency using the self-critic algorithm. The parameters of an expert (θ) may be considered the policy model, and an action may be defined as predicting the next token in a summary sequence. Given a factual consistency metric M, the method may define the action reward R(y,ŷ) as the score of the generated summary (y) according to M. Here, ŷ is the source document for precision-based factual consistency metrics (e.g., DAE accuracy, entity precision), and the reference summary for fact recall-based metrics (e.g., entity recall). Further, in accordance with self-critic training, the method may use the test-time greedy decoding strategy (i.e., argmax) to obtain a summary and calculate the baseline reward Ra(y,ŷ). The method may subtract the baseline reward from the action-based reward R(y,ŷ) and use the resulting reward signal to train the experts. This minimizes the variance of the gradient estimate and, importantly, adjusts the reward scale to provide both positive and negative values. Overall, the method trains the expert policy to minimize the negative of the expected reward difference. For example, a loss may be computed by taking the difference between an action reward score and a baseline reward score. Parameters of the summarization model may be updated based on the computed loss. After Monte Carlo approximation, the loss is computed as:
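A reconstruction of this self-critic loss from the definitions above (the exact notation of Equation 1 may differ) is:

\[ \mathcal{L}_{\mathrm{RL}}(\theta) = -\big(R(y^{s},\hat{y}) - R(y^{g},\hat{y})\big)\sum_{t} \log p_{\theta}\big(y^{s}_{t} \mid y^{s}_{<t}, x\big), \]

where x is the source document, y^s is a summary sampled from the expert policy p_θ (the single-sample Monte Carlo approximation), y^g is the greedily decoded summary whose reward serves as the baseline, and ŷ is the source document or reference summary depending on the metric M.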
Following standard reinforcement learning-based sequence training formulations, the method initializes the policy model with a text summarization model φ trained on human-annotated datasets. Further, to prevent the policy from collapsing to a single mode or significantly deviating away from φ, the model adds an additional KL divergence loss (eq. 2) between the next token probabilities of the policy θ and the baseline φ. The model trains experts using the weighted sum of the two losses.
For example, the divergence loss is computed by comparing the divergence between a summary generated by a baseline model and a summary generated by a fine-tuned summarization model. Specifically, the loss may be based on a divergence between the next token probabilities of the baseline summarization model and the fine-tuned summarization model. Such a loss may be represented as follows.
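A reconstruction of this divergence loss from the description above (the exact notation of Equation 2 may differ), together with the weighted combination of the two losses mentioned previously, is:

\[ \mathcal{L}_{\mathrm{KL}}(\theta) = \sum_{t} \mathrm{KL}\Big(p_{\phi}\big(\cdot \mid y^{*}_{<t}, x\big)\,\Big\|\,p_{\theta}\big(\cdot \mid y^{*}_{<t}, x\big)\Big), \qquad \mathcal{L}(\theta) = \mathcal{L}_{\mathrm{RL}}(\theta) + \lambda\,\mathcal{L}_{\mathrm{KL}}(\theta), \]

where φ is the baseline summarization model, θ is the expert policy, y* is the token sequence chosen as described below (reference summary or sampled summary), and λ is an illustrative mixing coefficient for the weighted sum.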
Equations 1 and 2 describe the general framework for training experts according to embodiments herein. In equation 2, y* is chosen depending on the number of factual errors in the training samples. Human-written reference summaries are generally more natural and preferable than the summaries generated by a summarization model. So, on training samples that do not contain factual errors (filtered training samples), the reference summary may be used as y*. On the contrary, when the dataset contains frequent factual errors, minimizing KL divergence with respect to the reference summary encourages the model to continue to uniformly increase probability mass on factually inconsistent references. This may lower the gain from the reward-based loss. Therefore, when the factual quality of the training data is indeterminable, summaries sampled following probabilities from the expert (policy) model may be used as y*. Using the reference summary on factually consistent training data is suitable for training experts that aim to improve factual consistency. However, data filtering reduces the number of samples. Given this training data size vs. factual quality trade-off, different experts may be trained differently. For example, data filtering followed by RL training may be used to build experts that target content-precision metrics, while the data filtering and mode of RL training for recall-related experts may be determined empirically. In some embodiments, factual experts are trained using MLE with filtered training data rather than RL. For example, an expert targeting low intrinsic hallucinations may be trained using MLE loss on a training data subset filtered using the DAE metric.
As illustrated in this example, factual expert I 150 is trained for the goal of lower intrinsic hallucinations, and this goal is approached by using the subset 130 which contains no entailment errors. Factual expert II 160 is trained for the goal of lower extrinsic hallucinations, and this goal is approached by using the subset 140 which contains no extrinsic entity errors. Factual expert III 170 is trained for the goal of higher entity informativeness, and this goal is also approached by using the subset 140. Although trained with the same subset 140 as factual expert II 160, factual expert III 170 is trained to maximize recall of salient entities between the generated summary and the reference summary. In some aspects, factual expert III may be trained using a separate subset of training data 110 which is filtered using a specific informativeness metric. In some embodiments, more or fewer factual experts may be trained based on different goals/metrics.
The factual experts 150, 160, and 170, and in some aspects the pre-trained summarization model 120, may be combined through either weights or logits ensembling 180 to generate a composite output. The mixing weights/coefficients for all expert models 150, 160, and 170 and the pre-trained summarization model 120 are used to control the factual quality of summaries generated by the ensemble model. For example, a user may determine that the ensembled model should have a certain level of intrinsic hallucinations, extrinsic hallucinations, and informativeness. By adjusting the weights, either manually or automatically by a system, the ensembled model may be tuned to meet the specified goal. For example, in some contexts it may be more desirable to have the ensembled model produce summaries with high informativeness even at the cost of high intrinsic hallucinations, while in other contexts it may be more desirable to have the ensembled model produce summaries with low intrinsic and extrinsic hallucinations at the cost of informativeness.
For weights ensembling, the method may use the element-wise weighted average of all the parameters of the pre-trained summarization model 120 and the expert models 150, 160, and 170. The result of weights ensembling is a single composite model, which in effect reduces the memory and processing needed for using the model when decoding, since the multiple models have been collapsed into a single model. The weights used for each model, however, are determined at the time of ensembling and may be more difficult to change later, as the individual models may no longer be available. The weights used during weights ensembling may be determined based on a predefined factual quality goal. Weights may be applied to the respective summarization models being ensembled in order to control how each respective summarization model contributes to the combined summarization model.
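As a minimal sketch, assuming PyTorch-style models that expose their parameters through state_dict (the variable names and mixing weights in the usage comment are illustrative), weights ensembling could be implemented as follows.

```python
# Illustrative weights ensembling: element-wise weighted average of the
# parameters of the pre-trained summarization model and the factual experts.
# Assumes all models share the same architecture (identical state_dict keys).
from typing import Dict, List
import torch

def ensemble_weights(state_dicts: List[Dict[str, torch.Tensor]],
                     mix_weights: List[float]) -> Dict[str, torch.Tensor]:
    """Return a single state_dict whose parameters are the weighted average."""
    total = sum(mix_weights)
    norm = [w / total for w in mix_weights]
    merged = {}
    for key, ref in state_dicts[0].items():
        if ref.dtype.is_floating_point:
            merged[key] = sum(w * sd[key] for w, sd in zip(norm, state_dicts))
        else:
            merged[key] = ref.clone()  # copy integer buffers unchanged
    return merged

# Hypothetical usage: emphasize the low-extrinsic-hallucination expert.
# merged_sd = ensemble_weights(
#     [base.state_dict(), expert_i.state_dict(),
#      expert_ii.state_dict(), expert_iii.state_dict()],
#     mix_weights=[0.4, 0.2, 0.3, 0.1])
# base.load_state_dict(merged_sd)
```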
For logits ensembling, the method may use the weighted average of logits from all the experts 150, 160, and 170 and the pre-trained summarization model 120 during decoding. Each model is still used individually during decoding, allowing the weight of each model to be adjusted more dynamically.
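Similarly, a hedged sketch of logits ensembling during greedy decoding is shown below, assuming Hugging Face-style encoder-decoder summarization models that share one vocabulary; for brevity the encoder is re-run at every step, batch size 1 is assumed, and beam search is omitted.

```python
# Illustrative logits ensembling at decoding time. Each model stays separate,
# so the mixing weights can be changed per request.
import torch

@torch.no_grad()
def ensemble_greedy_decode(models, mix_weights, input_ids,
                           bos_token_id, eos_token_id, max_len=128):
    """Greedy decoding over a weighted average of next-token logits."""
    total = sum(mix_weights)
    norm = [w / total for w in mix_weights]
    decoder_ids = torch.tensor([[bos_token_id]], device=input_ids.device)
    for _ in range(max_len):
        # Weighted average of each model's logits for the next token.
        step_logits = sum(
            w * m(input_ids=input_ids,
                  decoder_input_ids=decoder_ids).logits[:, -1, :]
            for w, m in zip(norm, models))
        next_token = step_logits.argmax(dim=-1, keepdim=True)
        decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)
        if next_token.item() == eos_token_id:
            break
    return decoder_ids
```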
At step 205, the system receives a training dataset comprising a plurality of documents and a plurality of summaries corresponding to the plurality of documents, wherein each of the plurality of summaries is associated with a respective first score indicative of a first factual quality, and a respective second score indicative of a second factual quality. As discussed above, the score associated with the plurality of summaries may be a score based on a metric or may be the identification of a feature such as every summary token also being in the source document. In some aspects, the score is received with the training dataset, and in other aspects the system determines the score.
At step 210, the system filters the training dataset by removing summaries with the respective first scores below a first predetermined threshold resulting in a first training data subset. For example, the dataset may be filtered for the goal of lower intrinsic hallucinations by only including summaries which contain no entailment errors according to some metric.
At step 215, the system filters the training dataset by removing summaries with the respective second scores below a second predetermined threshold, resulting in a second training data subset. For example, the system may use entity overlap-based filtering on the training dataset so that it only includes summaries with no extrinsic entity errors (i.e., extrinsic hallucinations) according to some metric.
At step 220, the system trains a first summarization model with the first training data subset. The first summarization model may start with a generic pre-trained summarization model, for example trained on the entire unfiltered dataset. Based on how the training data subset was formed and the training method, the first summarization model may target a specific factual accuracy/informativeness goal.
At step 225, the system trains a second summarization model with the second training data subset. Similar to the first summarization model, the second summarization model may start with the same generic pre-trained summarization model, for example trained on the entire unfiltered dataset. Based on how the training data subset was formed and the training method, the second summarization model may target a specific factual accuracy/informativeness goal.
At step 230, the system constructs a combined summarization model by ensembling the first summarization model and the second summarization model. As discussed above, the ensembling may be through either weights or logits ensembling. By adjusting the weights of each of the summarization models in the ensembling, different goals may be achieved in the composite output of the ensembled model.
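For illustration only, the overall flow of steps 205-230 might be sketched as follows; train_mle and train_rl_expert are hypothetical placeholders for the MLE pre-training and RL fine-tuning described above, and ensemble_weights refers to the weights-ensembling sketch shown earlier.

```python
# High-level sketch of the flow of steps 205-230. The helpers train_mle and
# train_rl_expert are hypothetical placeholders; this is not a definitive
# implementation.
def build_mofe(train_pairs, first_score_fn, second_score_fn, mix_weights):
    # Steps 210/215: filter training data with the two factual quality metrics.
    subset_1 = [(d, s) for d, s in train_pairs if first_score_fn(s, d) >= 1.0]
    subset_2 = [(d, s) for d, s in train_pairs if second_score_fn(s, d) >= 1.0]

    # Pre-train a baseline summarizer on the unfiltered data (MLE).
    base = train_mle(train_pairs)                       # hypothetical helper

    # Steps 220/225: fine-tune experts starting from the baseline model.
    expert_1 = train_rl_expert(base, subset_1, reward_fn=first_score_fn)
    expert_2 = train_rl_expert(base, subset_2, reward_fn=second_score_fn)

    # Step 230: combine the models, here via weights ensembling.
    merged = ensemble_weights(
        [base.state_dict(), expert_1.state_dict(), expert_2.state_dict()],
        mix_weights)
    base.load_state_dict(merged)
    return base
```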
Memory 320 may be used to store software executed by computing device 300 and/or one or more data structures used during operation of computing device 300. Memory 320 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 310 and/or memory 320 may be arranged in any suitable physical arrangement. In some embodiments, processor 310 and/or memory 320 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 310 and/or memory 320 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 310 and/or memory 320 may be located in one or more data centers and/or cloud computing facilities.
In some examples, memory 320 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 310) may cause the one or more processors to perform the methods described in further detail herein. For example, as shown, memory 320 includes instructions for a Summarization module 330 that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein. In some examples, the Summarization module 330, may receive an input 340, e.g., such as a document on a particular topic, via a data interface 315. The Summarization module 330 may generate an output 350, such as a summary of the input 340.
In some embodiments, the Summarization module 330 may further include a data filtering module 331, a factual experts module 332, and a mixing experts module 333. The data filtering module 331 is configured to filter training data as described above. The filtering module 331, for example, may produce multiple subsets of training data by filtering according to different metrics, such as metrics targeting low extrinsic hallucinations or low intrinsic hallucinations.
The factual experts module 332 is configured to train a number of factual experts optimized based on factual accuracy metrics. By using different subsets of the training data as filtered by the filtering module 331, and using different training methods, different goals may be achieved by different factual experts. For example, one factual expert may produce summaries that have low intrinsic hallucinations, and another factual expert may produce summaries that have low extrinsic hallucinations.
The mixing experts module 333 is configured to combine the factual experts, and in some aspects a pre-trained summarization model, through weights or logits ensembling as described above. The ensembled model may then output a summary based on an input document. By adjusting the weights of the different models during ensembling, the output summary may be adjusted to achieve certain goals determined by a user. For example, by giving more weight to the factual expert trained for low extrinsic hallucinations, the output summary may thereby be optimized for low extrinsic hallucinations. More nuanced goals that combine multiple objectives may be accomplished by ensembling with different weights so that the combined output shares characteristics of each expert according to its relative weight.
Some examples of computing devices, such as computing device 300 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 310) may cause the one or more processors to perform the processes of methods described herein. Some common forms of machine readable media that may include the processes of methods described herein are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Logits ensembling and weights ensembling perform comparably on factual consistency metrics. However, because logits ensembling calculates logits for all experts and the pre-trained model at each decoding step, it increases the decoding time linearly with the number of experts. Weights ensembling, on the other hand, does not increase the inference time and provides a lightweight method for combining experts. Accordingly, for fair comparison with the base model, the table in
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
The present disclosure is a nonprovisional of and claims priority under 35 U.S.C. 119 to U.S. Provisional Application No. 63/252,507, filed on Oct. 5, 2021, which is hereby expressly incorporated by reference herein in its entirety.