UNCERTAINTY-AWARE SEQUENCE MODELING

Information

  • Patent Application
  • 20250173366
  • Publication Number
    20250173366
  • Date Filed
    November 26, 2024
  • Date Published
    May 29, 2025
  • CPC
    • G06F16/3349
    • G06F16/334
  • International Classifications
    • G06F16/3349
    • G06F16/334
Abstract
Described herein are systems and methods for improving accuracy of model output generation. A method can include obtaining a risk-aware model and a user input, applying the risk-aware model to the user input, receiving, based on the applying, model output and corresponding risk values, comparing the corresponding risk values to a threshold risk value, and regenerating the user input based on the comparing. The method can also include iteratively performing the applying, receiving, comparing, and regenerating using the regenerated user input until one or more processing conditions is met. The user input can be regenerated in response to determining that the corresponding risk values are greater than the threshold risk value. The model output can include one or more sequences in a tree of thoughts (TOT) of the risk-aware model.
Description
TECHNICAL FIELD

This disclosure generally describes devices, systems, and methods related to computer-automated techniques and algorithms for identifying and assessing risk in outputs of sequence-based machine learning models for optimization.


BACKGROUND

Sequence models have demonstrated strong abilities in natural language tasks, including question answering (QA), where a model can receive a question as input and output an answer in response. To answer questions accurately and robustly, the model should understand context and ground its outputs in knowledge obtained from training data, which may contain conflicting information. When such models fail in QA tasks, the failure can be associated with a limited understanding of output confidence, out-of-domain data, ambiguity in inputs, inconsistent training data, and/or hallucinations. However, existing approaches lack a reliable way to determine when a model's outputs can be trusted, which is essential for real-world applications.


One way to interpret this confidence is by relying on the softmax classifier probabilities. However, these probabilities usually do not reflect the actual confidence. Similarly, an out-of-domain (OOD) calibrator can be trained to detect OOD inputs but can require known or synthetic out-of-domain samples, and may not consider other sources of inaccuracies, such as over-represented features and/or ambiguous labels. Other approaches may include modeling and/or estimating a model's uncertainty, fine-tuning calibrators to consider entropy, perplexity, and other metrics, and calculating output consistency.


Accordingly, there is a need for accurate and efficient ways to estimate uncertainty directly from the model given an input, without the need for external components, and for a way to integrate this uncertainty with sequence models that output tokens iteratively.


SUMMARY

The disclosure generally describes technology for improving output quality of sequence models by utilizing estimated risk values associated with a given model output. More specifically, in some implementations, risk estimates can be integrated into a tree of thoughts (TOT) reasoning process (e.g., an algorithm or other computerized technique that can be used for finding the best reasoning steps for a given question). The TOT can prompt a sequence model, such as a large language model (LLM), to output its reasoning step by step, where output at each step logically flows from a previous reasoning step. Risk estimates can be used to identify where and when logic breaks at some particular reasoning step(s), regenerate that particular reasoning step(s), and continue generation of the reasoning process in this way. In some implementations, the action of regenerating that particular reasoning step(s) can occur multiple times; the current reasoning step(s) that is selected can be, for example, the generated step(s) having the lowest risk value. In some implementations, risk estimates for a given output can be compared against a predetermined threshold. If the risk exceeds this threshold (which can indicate, for example, high model uncertainty), the model can be configured to decline responding to the prompt, enabling selective question answering.
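By way of a non-limiting illustration, the regenerate-and-select behavior described above can be sketched as follows, where `generate_step` is a hypothetical stand-in for one model invocation that returns a candidate reasoning step together with its estimated risk value:

```python
import random

def generate_step(prompt, seed):
    """Hypothetical stand-in for one model invocation: returns a
    candidate reasoning step and its estimated risk value."""
    rng = random.Random(seed)
    return f"step {seed} for {prompt!r}", rng.uniform(0.0, 1.0)

def regenerate_lowest_risk(prompt, threshold=0.3, max_attempts=5):
    """Regenerate a reasoning step until its risk falls below the
    threshold, keeping the lowest-risk candidate seen so far."""
    best_step, best_risk = None, float("inf")
    for attempt in range(max_attempts):
        step, risk = generate_step(prompt, attempt)
        if risk < best_risk:
            best_step, best_risk = step, risk
        if best_risk < threshold:
            break
    return best_step, best_risk
```

In this sketch, the loop either stops early when a candidate step's risk is acceptable or, after a bounded number of attempts, falls back to the lowest-risk candidate seen.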


Using the disclosed technology, a model can be transformed into its risk-aware variant. The risk-aware variant can be, for example, an uncertainty-aware model. The uncertainty-aware model can receive a prompt as input and provide output that includes, for example, a risk metric, such as uncertainty. In at least some implementations, the disclosed technology can provide an uncertainty-based framework for selective question answering that accounts for, by way of example, epistemic and/or aleatoric uncertainty. Each output sequence, or thought, from the model can be a state in a tree. During each iteration, the model output can be evaluated against previous states, which can produce new states and corresponding output uncertainty values or other risk metric values. New states can be added to the tree, for example if such new states result in lower uncertainty estimates and/or other lower risk metric values. As a result of the disclosed techniques, the tree can be gradually built up in memory and successively become better at producing high quality output. In some implementations, a path of least uncertainty may also be used to generate optimized answers.


Although the disclosed technology is described from the perspective of extractive LLMs (e.g., masked-language models) and generative LLMs (e.g., autoregressive models), these are merely illustrative examples. The disclosed technology can apply to a variety of different use cases, models, and/or tasks. The disclosed technology is model-agnostic and data-agnostic, lightweight, and self-sufficient, meaning it does not rely on external models or systems. The external models or systems can include any models or systems that are not part of the disclosed technology but that might otherwise be needed for the operation or functioning of such technology. The external models can include but are not limited to pre-existing machine learning models, algorithms, or statistical models that a technology may rely on for predictions, analysis, or other tasks. The external systems can include but are not limited to external software systems, platforms, or infrastructure that a technology may rely on for resources, data processing, storage, or other services. For example, an external system can include a cloud service, a third-party API, and/or another application. While individual uncertainty quantification methods can increase performance on a selective QA task, a combination of such methods and uncertainty/risk metrics can yield improved accuracy. Accordingly, the disclosed technology can provide for converting a model into its risk-aware variant and composing metrics and methods automatically for the risk-aware variant to improve the model, and its accuracy in generating output.


One or more embodiments described herein includes a method for improving accuracy of model output generation. The method includes obtaining a risk-aware model and a user input and applying the risk-aware model to the user input. The method further includes receiving, based on the applying, model output and corresponding risk values, comparing the corresponding risk values to a threshold risk value, and regenerating the user input based on the comparing.


The method can optionally include one or more of the following features. For example, the method can include iteratively performing the applying, the receiving, the comparing, and the regenerating using the regenerated user input until one or more processing conditions is met. The method can additionally or alternatively include, in response to the one or more processing conditions being met, analyzing the outputted corresponding risk values against one or more risk criteria, selecting, based on the analyzing, a user input having an outputted corresponding risk value that satisfies the one or more risk criteria, and returning the selected user input to a user device. In at least some implementations, the method can include determining that the one or more processing conditions is met based on the corresponding risk values being less than the threshold risk value.
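As a hedged, non-limiting sketch of the iterative applying, receiving, comparing, and regenerating described above, the following uses hypothetical stand-ins (`apply_model`, `rewrite`) in place of a real risk-aware model and a real input-regeneration step:

```python
def apply_model(user_input):
    """Stand-in for the risk-aware model: returns (output, risk).
    Hypothetically, risk shrinks as the input is refined."""
    return f"answer to {user_input!r}", 1.0 / (1 + user_input.count("*"))

def rewrite(user_input):
    """Stand-in for regenerating the user input (e.g., re-prompting)."""
    return user_input + "*"

def answer_with_risk_control(user_input, threshold=0.3, max_iters=10):
    """Apply, receive, compare, regenerate until a processing
    condition is met; then select the lowest-risk input/output pair."""
    history = []
    for _ in range(max_iters):
        output, risk = apply_model(user_input)
        history.append((user_input, output, risk))
        if risk < threshold:  # processing condition met
            break
        user_input = rewrite(user_input)  # regenerate the user input
    # select the input whose output risk best satisfies the criterion
    return min(history, key=lambda item: item[2])
```

The bound on iterations and the selection of the minimum-risk entry correspond, respectively, to the processing condition and the risk-criteria analysis described above.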


In some implementations, the user input can be regenerated in response to determining that the corresponding risk values are greater than the threshold risk value. The model output can include, for example, one or more sequences in a tree of thoughts (TOT) of the risk-aware model. The corresponding risk values can include an aleatoric uncertainty value corresponding to each sequence, an epistemic uncertainty value corresponding to each sequence, and/or a bias value corresponding to each sequence.


The method can also include determining that the one or more processing conditions is met based on the model output being a final output of the risk-aware model. The method can additionally or alternatively include aggregating the corresponding risk values to generate an aggregated risk value, determining whether the aggregated risk value is less than the threshold risk value, and, in response to determining that the aggregated risk value is greater than the threshold risk value, performing the regenerating.
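One possible sketch of the aggregating-and-comparing step is shown below; the choice of aggregation function is an illustrative assumption, as the disclosure does not mandate a particular one:

```python
def aggregate_risk(risk_values, how="max"):
    """Aggregate per-sequence risk values into a single score.
    'max' is conservative (the worst step dominates); 'mean'
    balances contributions across steps."""
    if not risk_values:
        raise ValueError("no risk values to aggregate")
    if how == "max":
        return max(risk_values)
    if how == "mean":
        return sum(risk_values) / len(risk_values)
    raise ValueError(f"unknown aggregation: {how}")

def should_regenerate(risk_values, threshold, how="max"):
    """Regenerate when the aggregated risk exceeds the threshold."""
    return aggregate_risk(risk_values, how) > threshold
```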


One or more embodiments described herein include a method for improving accuracy of model output generation. The method includes receiving a user input for a model and prepending logical instructions to the user input. The method further includes applying a risk-aware variant of the model to the user input having the prepended logical instructions and receiving, based on the applying, model output and corresponding risk values. Additionally, the method includes comparing the corresponding risk values to a threshold risk value and regenerating the user input having the prepended logical instructions based on the comparing.


The method can optionally include one or more of the above-mentioned features, and/or one or more of the following features, and/or one or more other features described herein. For example, the method can also include iteratively performing the applying, the receiving, the comparing, and the regenerating using the regenerated user input having the prepended logical instructions until a processing condition is met. Further, in response to the processing condition being met, the method can include analyzing the outputted corresponding risk values against one or more risk criteria, and selecting, based on the analyzing, a user input having an outputted corresponding risk value that satisfies the one or more risk criteria. In at least some instances, the user input having the prepended logical instructions can be regenerated in response to determining that the corresponding risk values are greater than the threshold risk value. As another example, the method can include aggregating the corresponding risk values to generate an aggregated risk value, determining whether the aggregated risk value is less than the threshold risk value, and, in response to determining that the aggregated risk value is greater than the threshold risk value, performing the regenerating.


One or more embodiments described herein includes a system for improving accuracy of model output generation. The system includes a computer system having one or more processors and memory storing instructions that, when executed by the one or more processors, cause the computer system to perform a process. The process includes obtaining a risk-aware model and a user input and applying the risk-aware model to the user input. The process further includes receiving, based on the applying, model output and corresponding risk values, comparing the corresponding risk values to a threshold risk value, and regenerating the user input based on the comparing.


The system can optionally perform one or more of the above-mentioned features and/or one or more other features described herein.


One or more embodiments described herein includes a system for improving accuracy of model output generation. The system includes a computer system with one or more processors and memory storing instructions that, when executed by the one or more processors, cause the computer system to perform a process. The process includes receiving a user input for a model and prepending logical instructions to the user input. The process also includes applying a risk-aware variant of the model to the user input having the prepended logical instructions and receiving, based on the applying, model output and corresponding risk values. Still further, the process includes comparing the corresponding risk values to a threshold risk value and regenerating the user input having the prepended logical instructions based on the comparing.


The system can be configured to include one or more of the above-mentioned features and/or one or more other features described herein.


The disclosed technology provides greater accuracy in identifying uncertainty of a sequence than traditional approaches. Traditional approaches rely on softmax probability to determine the uncertainty of a sequence. However, softmax probability has been shown not to capture model uncertainty and, therefore, cannot provide an accurate scoring function for capturing model risk. The disclosed technology, on the other hand, uses a risk value function to enhance the softmax method. These sequence risk scores can be compared against a threshold value, and a new sequence can be regenerated when a risk score is above the threshold (i.e., giving the user a mechanism to know when not to trust a model output and to regenerate a new output). This sequence rejection and regeneration mechanism can be used, for instance, for simple selective question answering tasks. When a question is given as input, the user can decide not to trust the answer of the sequence model if the risk value of the output is higher than a threshold. The mechanism can also be used for TOT analysis by integrating it with graph traversal algorithms, such as Monte Carlo Tree Search (MCTS).
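The rejection mechanism for selective question answering can be sketched, as a non-limiting illustration, under the assumption of a model callable that returns an (answer, risk) pair:

```python
def selective_answer(question, model, threshold):
    """Selective question answering: return the model's answer only
    when the associated risk is at or below the threshold; otherwise
    abstain so the caller can regenerate or decline to respond."""
    answer, risk = model(question)
    if risk > threshold:
        return None, risk  # decline to answer
    return answer, risk
```

A caller receiving `None` could, for example, regenerate the input as described above rather than surfacing an untrusted answer.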





BRIEF DESCRIPTION OF THE DRAWINGS

This disclosure will be more fully understood from the following detailed description, taken in conjunction with the accompanying drawings, in which:



FIG. 1 is a conceptual diagram of a system for assessing risk in a tree of thoughts (TOT) of a model to regenerate a prompt a predetermined number of times;



FIG. 2 is a conceptual diagram of a system that can be used to transform a model into a risk-aware variant of the model to be used with the disclosed technology;



FIGS. 3A and 3B together illustrate a flowchart of a process for optimizing a TOT of a model to cause the model to generate outputs with improved accuracy;



FIG. 4 illustrates an example algorithm for uncertainty-aware selective question answering using the disclosed technology;



FIG. 5 is a table illustrating accuracy of a model per uncertainty method using the disclosed technology;



FIG. 6 illustrates graphs of selective answering accuracy by confidence level;



FIG. 7 is a table illustrating example selective question answering accuracy from performing the disclosed technology;



FIG. 8 is a table illustrating an example efficiency benchmark per uncertainty method; and



FIG. 9 is a schematic diagram that shows an example of a computing system that can be used to implement the techniques described herein.





DETAILED DESCRIPTION

Certain exemplary embodiments will now be described to provide an overall understanding of the principles of the structure, function, manufacture, and use of the devices and methods disclosed herein. One or more examples of these embodiments are illustrated in the accompanying drawings. Those skilled in the art will understand that the devices and methods specifically described herein and illustrated in the accompanying drawings are non-limiting exemplary embodiments and that the scope of the present disclosure is defined solely by the claims. The features illustrated or described in connection with one exemplary embodiment may be combined with the features of other embodiments. Such modifications and variations are intended to be included within the scope of the present disclosure. Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.


This disclosure generally relates to technology for integrating risk with sequence models to either discard or improve uncertain sequence outputs. Generating risk estimates for every reasoning step in a tree of thoughts (TOT) can allow for identifying where and when logic breaks at a particular step(s), regenerating that particular step(s) (potentially multiple times, and then selecting as the current reasoning step the regenerated step having a lowest risk value), and continuing generation of the reasoning process in this way. As a result of performing such techniques, the model can improve quality and accuracy of the generated outputs.


A person skilled in the art will appreciate the term “sequence” is not limited to a set of vector embeddings representing text tokens, but can also include embeddings representing, for example, images, audio signals, video frames, time series data, and/or any other form of data that can be processed by the model to condition its output and/or behavior. The disclosed technology can apply to any type of input, including but not limited to text prompt input, sound wave frequencies input, image input, etc. Although the disclosed technology is described from the perspective of sequence modeling where the elements of the sequence are tokens representing text data, this is merely an illustrative, non-limiting example.


Referring to the figures, FIG. 1 is a conceptual diagram of a system 100 for assessing risk in a TOT for a model to regenerate a model prompt having a lowest corresponding risk. The disclosed techniques can similarly be performed to assess risk in a chain-of-thought (COT). In a COT, each node, in at least some instances, can be permitted to have, at most, one (1) child node. Sometimes, a TOT can be viewed as multiple COTs in which some of the chains share common thought patterns (e.g., shared initial thought sequences). Accordingly, the disclosed technology can be applied to TOTs and/or COTs.


The system 100 can include a risk analysis computer system 102, which can communicate with a user device 104 and/or a data store 106 over network(s) 108. In some implementations, the user device 104 and/or the data store 106 may be part of the computer system 102. Sometimes, one or more of the user devices 104 and the data store 106 may be separate systems from the computer system 102 and/or remote from the computer system 102. The computer system 102 can be configured to execute software modules, engines, and/or instructions for performing the disclosed techniques.


Referring to the system 100 in FIG. 1, the risk analysis computer system 102 can receive a user model and prompt in block A (110).


The computer system 102 can transform the user model into a risk-aware variant of the model in block B (112). Refer to FIG. 2 for further description.


In a first invocation of the model on a given input, the computer system 102 can prepend the prompt with logical instructions, for example, to cause the model to generate multiple possible continuations of the input sequence (block C, 114). In the case of other data modalities (e.g., image generation) where the model does not directly accept text instructions, the disclosed techniques can introduce stochasticity into the model (e.g., dropout layers, Bayesian layers) and execute that model several times. In such instances, execution of the model can produce a different output each time. In some instances, execution of the model can produce the same output each time, or only some of the time. In yet other instances, two or more outputs can be similar and/or the same.
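As a toy, non-limiting sketch of this idea, the following simulates dropout-style stochasticity by randomly masking weights (in place of a real dropout or Bayesian layer), so that repeated executions can produce different outputs:

```python
import random

def stochastic_forward(x, weights, drop_p=0.5, rng=None):
    """Toy forward pass with dropout-style stochasticity: each weight
    is randomly zeroed with probability drop_p, so repeated
    executions can produce different outputs for the same input."""
    rng = rng or random
    kept = [w for w in weights if rng.random() >= drop_p]
    return x * sum(kept)

def sample_outputs(x, weights, runs=8, seed=0):
    """Execute the stochastic model several times and collect outputs."""
    rng = random.Random(seed)
    return [stochastic_forward(x, weights, rng=rng) for _ in range(runs)]
```

The spread of the collected outputs is what a downstream uncertainty estimate could be computed from.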


The computer system 102 can apply the risk-aware model to the prepended prompt to generate model output and corresponding risk values in block D (116). Block D (116) can be performed until processing conditions are met. A processing condition can be, for example, the model processing the prepended prompt to generate a full or complete thought.


The computer system 102 may regenerate the prompt a predetermined number of times based on analyzing the corresponding risk values against the processing condition(s) in block E (118). For example, the computer system 102 may regenerate the prompt if one or more risk values that are generated in block D (116) are greater than a predetermined threshold risk level/value. Optionally, block E (118) may not be performed, such as if the one or more risk values are less than the predetermined threshold risk level/value.


The computer system 102 can iterate through blocks D (116) and E (118) for each regenerated prompt based on analyzing the corresponding risk values and/or until the processing condition(s) is met in block F (120). For example, the computer system 102 can continue calculating the risk values for regenerated prompts and regenerating the prompts until the risk values of one of the regenerated prompts are less than the predetermined threshold risk level/value. The risk threshold can be determined heuristically, calculated algorithmically, and/or learned with a neural network (NN). The disclosed technology can implement one or more different techniques for determining the threshold. As an illustrative example, the threshold can be determined by executing the model described herein on a set of prompts and recording uncertainties for each output token, aggregating the recorded uncertainties with an aggregation function, and then computing, for example, a value at the 95th percentile. Then, if during later invocations of the model on other prompts the output has higher risk than this value, it indicates that the prompt should be modified, and block 322 can be performed; else block 310 can be performed. If the risk value(s) is below the predetermined threshold, then the computer system can return the model output. In other words, low values for one or more risk estimates associated with the model output can suggest that the model successfully generalizes to the given prompt and can be indicative of higher-quality model output(s).
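The percentile-based calibration example above can be sketched as follows; the nearest-rank percentile and the `max` aggregation are illustrative assumptions, not requirements of the disclosed technology:

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]

def calibrate_threshold(per_prompt_uncertainties, q=95, aggregate=max):
    """Aggregate each prompt's per-token uncertainties, then take the
    q-th percentile of the aggregates as the risk threshold."""
    aggregates = [aggregate(u) for u in per_prompt_uncertainties]
    return percentile(aggregates, q)
```

During later invocations, an output whose aggregated risk exceeds the calibrated value would indicate that the prompt should be modified.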


Once the computer system 102 has finished calculating the risk values and regenerating the prompts, the computer system 102 can analyze the outputted risk values for the prompts against one or more risk criteria (block G, 122). The one or more risk criteria can include the predetermined threshold risk level/value described above.


The computer system 102 can select at least one of the prompts that satisfies the one or more risk criteria (block H, 124). For example, the computer system 102 can select a regenerated prompt having one or more risk values (or an aggregated risk value) that is less than the predetermined threshold risk level/value. As another example, the computer system 102 can select a regenerated prompt amongst all the prompts having the lowest risk values or aggregated risk value.


The computer system 102 can then return the final model output and the selected prompt in block I (126). The prompt can be stored in the data store 106. The prompt can be transmitted to the user device 104. In some implementations, the prompt can be fed back into the computer system 102, as described further in reference to FIGS. 3A and 3B. Thus, the disclosed techniques allow for iteratively feeding the model with different prompts and selecting one or more of those prompts with the lowest output risk.



FIG. 2 is a conceptual diagram of a system 200 that can be used to transform a model into a risk-aware variant of the model to be used with the disclosed technology. One or more operations described herein for transforming the model into the risk-aware variant are described further in U.S. application Ser. No. 18/478,301, entitled “Systems and Methods for Automated Risk Assessment in Machine Learning,” filed on Sep. 29, 2023, which is incorporated by reference herein in its entirety. The uncertainty-aware model conversion provided for herein can produce a new model capable of estimating multiple different types of uncertainty. The model can estimate, for example, aleatoric uncertainty. Additionally, or alternatively, the model can estimate epistemic uncertainty and/or bias.


In the example system 200, user input, which includes a user model (e.g., an arbitrary model that has not previously been seen) and a list of metrics, can be received from the user device 104 (block A, 202) at the risk analysis computer system 102. The user input can include, for example, an original user model, and the original user model can include, for example, at least a model architecture.


The computer system 102 may then execute at least one wrapper-specific algorithm to generate a risk-aware variant of the user model (block B, 204). For example, the computer system 102 can apply one or more model modifications to the model. As another example, the computer system 102 can apply one or more model augmentations to the modified model. As yet another non-limiting example, the computer system 102 can apply a modified loss function to the augmented and/or modified model. These applications can occur as standalone wrapper operations, or multiple applications can occur in the same risk-aware variant analysis.


In some implementations, the computer system 102 can optionally apply one or more other transformations, such as search, add, delete, and/or replace transformations, to the model for a specific wrapper that was executed (block D, 208). The search transformations can include, for example, automatically searching through the model and identifying specific instances of locations/computational operations in the model that can be transformed. Multiple criteria can be used to identify what parts of the model need to be transformed. One such criterion can be based on understanding what specific transformations each wrapper needs to apply. Accordingly, the transformations described herein can be represented as modifications of the model's computational graph without modifying actual user code that generated the model. As an illustrative example, a transformation that may be represented algorithmically is an insertion of a mathematical operation in the graph representing the user model. Because the original model is received as input, the computer system 102 can identify what mathematical operations the model computes. The computer system may narrow down possible places in the model that are fit for inserting a layer. For example, for each wrapper, a set of mathematical operations in the graph that the computer system needs to search for can be identified, and if the computer system finds these operations, the computer system may insert that layer after that found operation.
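Treating the computational graph, purely for illustration, as a flat sequence of named operations, the search-and-insert transformation can be sketched as follows (the operation names are hypothetical):

```python
def insert_after_matches(graph, target_op, new_layer):
    """Search a model's computational graph (here, a flat list of op
    names) and insert `new_layer` immediately after each op matching
    `target_op`, without touching the code that produced the graph."""
    transformed = []
    for op in graph:
        transformed.append(op)
        if op == target_op:
            transformed.append(new_layer)
    return transformed
```

A real implementation would operate on graph nodes with edges rather than a flat list, but the search-then-insert pattern is the same.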


Any transformations of a user model inside a wrapper can be performed as described above in reference to searching components inside that model and inserting/deleting/replacing these components. The one or more model components can include at least one of: (i) one or more mathematical operations performed by the original model; (ii) subgraphs of the one or more mathematical operations, the subgraphs being one or more respective layers; and/or (iii) one or more groups of the subgraphs.


As part of applying transformations, the computer system 102 can add one or more layers to the model in one or more places that are specific to a particular wrapper. The computer system may automatically transform the model into its risk-aware variant, for example by applying a wrapper to it. The wrapper can determine: (i) which layer or layers (groups of mathematical/computational operations) to add, remove, and/or replace; and (ii) what location(s) in the model to make the addition, removal, and/or replacement. An epistemic wrapper, for example, can be configured to receive a user model and transform that model by, for example, automatically adding one or more probabilistic layers (e.g., a group of mathematical operations that represent a probabilistic layer). The wrapper may also be configured to run the model T times to accurately calculate the uncertainty. The value T can be determined in a number of ways. As merely illustrative examples, the value T can be: determined empirically for a particular class of models; selected by the user based on, for instance, time and/or compute constraints; optimized for a custom cost function that takes into account uncertainty, time, and/or compute costs; learned with an NN; and/or any combination thereof. As yet another illustrative example, a lightweight multilayer perceptron NN (MLP) can be trained to predict the number of iterations required to achieve the optimal modified prompt (as indicated, for instance, by the output risk) within some threshold margin, thereby avoiding redundant computations. By way of non-limiting examples, any of the augmentations may include adding to the model (or to the original model) groups of model components whose outputs may predict standard deviations of ground truth labels (e.g., in addition to predicting the ground truth labels themselves).
Such augmentations may include adding one or more model components after each model component of a particular type (and/or by any other criteria), and/or to a group or groups of layers of the original model (e.g., subgraphs of nodes in the computational graph representing the model).
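A minimal sketch of such an augmentation is shown below, assuming a toy linear model whose added head emits a positive standard-deviation estimate for the label via a softplus; the weights and architecture are purely illustrative:

```python
import math

class TwoHeadModel:
    """Toy linear model augmented with a second head that predicts
    the standard deviation of the ground-truth label, kept positive
    by a softplus activation."""
    def __init__(self, w_mean, w_std):
        self.w_mean, self.w_std = w_mean, w_std

    def __call__(self, x):
        mean = self.w_mean * x
        # softplus(z) = log(1 + exp(z)) is always > 0
        std = math.log1p(math.exp(self.w_std * x))
        return mean, std
```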


Subsequently, the computer system 102 can return the risk-aware variant of the user model (block E, 210). The risk-aware variant of the user model may further be trained and executed to generate and output one or more risk factors associated with the user model. For example, the computer system 102 can determine representation bias, epistemic uncertainty, aleatoric uncertainty, any combination thereof, and/or other types of risk factors/categories (which can be defined by the user's list of metrics in at least some implementations). The computer system 102 can store the risk factors and/or the risk aware variant more generally in the data store 106 and in association with the original user model (e.g., using a unique identifier/ID that corresponds to and identifies the original user model). The computer system 102 can transmit the risk factors and/or the risk aware variant to the user device 104 for presentation at the device to a relevant user. The computer system 102 can also generate and return one or more recommendations for adjusting the original user model. Such recommendations can be stored in the data store 106 and/or transmitted to the relevant user device 104. As described further below, the risk factors and/or the risk aware variant may further be used by the computer system 102 to perform the disclosed techniques, such as identifying sub-optimal nodes and/or combinations of operations in the model and/or optimizing one or more of those nodes and/or combinations of operations.



FIGS. 3A and 3B are flowcharts of a process 300 for optimizing a TOT of a model to find model outputs having the lowest risk and improved accuracy. The process 300 can be performed by the risk analysis computer system 102 described herein. The process 300 can also be performed by any other computing system, computing device, network of computing devices, and/or cloud-based system. For illustrative purposes, the process 300 is described from the perspective of a computer system.


Referring to the process 300 of FIGS. 3A and 3B, the computer system can receive user input including a user model in block 302.


In block 304, the computer system can transform the user model into a risk-aware model. Refer to FIG. 2 for further discussion.


The computer system can receive a user input for the user model in block 306. Sometimes, block 306 can be the same as block 302. Block 306 can also be performed before, during, or after block 302 and/or block 304. If the user model is a language model, the user input can include additional information, such as instructions that can be added to the prompt to cause the model to break up its logic into composable pieces once it receives a prompt as input. These instructions can be used to cause the model to reason consistently for purposes of TOT.


In block 310, the computer system can then apply the risk-aware model to the user input to generate model output and corresponding risk values. If the sequence model is language based, this output can include individual tokens that are appended to the original input autoregressively. During each autoregressive step, tokens can be selected from an output logit vector, for example by stochastically sampling the softmax probabilities of the logits with a predetermined temperature parameter.
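For illustrative purposes, the autoregressive token-selection step described above can be sketched as follows. The helper name `sample_token` is hypothetical and not part of the disclosed system; it simply draws one token index from the softmax distribution of the logits at a predetermined temperature:

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Sample one token index from a logit vector by drawing from the
    softmax distribution at the given temperature. Lower temperatures
    concentrate probability mass on the highest-logit tokens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # Draw an index according to the cumulative softmax probabilities.
    r, cum = rng.random() * total, 0.0
    for i, e in enumerate(exps):
        cum += e
        if r < cum:
            return i
    return len(exps) - 1
```

Each sampled index can then be appended to the input sequence before the next autoregressive step.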


In case of other data modalities (e.g., image generation) where the model does not directly input text instructions, the disclosed techniques can introduce stochasticity into the model (e.g., dropout layers, Bayesian layers) and execute that model several times, each time producing a different output.


For example, the model output can include output representing one or more tokens forming thoughts, or image frames that can be appended to generate a video clip (block 312).


Additionally or alternatively, the corresponding risk values can include aleatoric uncertainty values corresponding to each output or sequence (block 314). Additionally or alternatively, the corresponding risk values can include epistemic uncertainty values corresponding to each output or sequence (block 316). Additionally or alternatively, the corresponding risk values can include bias values corresponding to each output or sequence (block 318).


In block 320, the computer system can determine whether processing conditions have been met to stop applying the model to the user-provided input. In a case of language modeling (e.g., natural language processing (NLP)), the determination can be made by the computer system based on verifying whether a sequence of generated outputs form a complete thought.


If the processing condition(s) is not met, then the computer system can return to block 310 and continue generating model output and corresponding risk values until the processing condition(s) is met. For example, the computer system can continue to perform blocks 310-320 until the model generates an end-of-thought token that indicates the model has finished generating the current thought.


If the processing condition(s) is met in block 320, the computer system can compute an aggregated risk value for the combined model outputs that satisfy the processing condition(s) (block 322). In other words, the computer system can compute an aggregated risk value based, at least in part, on all of the outputs that were generated. For example, in the case of NLP, the computer system can determine an average of the risks for the generated tokens. The computer system may determine the aggregated risk value based on the aleatoric uncertainty value(s), the epistemic uncertainty value(s), the bias value(s), or any combination thereof.
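As one illustrative sketch, the averaging of per-token risks in block 322 can be expressed as a plain or weighted mean; the function name `aggregate_thought_risk` is hypothetical, and, as noted herein, any other aggregation algorithm can be substituted:

```python
def aggregate_thought_risk(token_risks, weights=None):
    """Aggregate per-token risk values into a single thought-level
    score. A plain or weighted mean is used here; the disclosure does
    not fix the aggregation algorithm, so this is only one choice."""
    if not token_risks:
        raise ValueError("no token risks to aggregate")
    if weights is None:
        return sum(token_risks) / len(token_risks)
    return (sum(r * w for r, w in zip(token_risks, weights))
            / sum(weights))
```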


The per-token uncertainty values can be aggregated into a single score, or set of scores, representing combined risk for the entire thought. The disclosure herein does not make assumptions about the computation for aggregating output risk into the scores associated with each node in the TOT. Any algorithm, including but not limited to an NN model, can be used to transform the output risk into the node score (e.g., reward associated with visiting the node). The aggregation can, in some implementations, be a complex calculation, which can include some combination of the risk scores and the original model outputs.


In block 324, the computer system can determine whether the last aggregated output is the final output of the sequence. If so, this output can be appended to the rest of the sequence in block 326 (in NLP, for example, this can correspond to concatenating all the vector embeddings representing different thoughts outputted by the model) before it is returned to the user in block 328.


If the last output of the sequence does not contain the final thought (in NLP, this could signal the final answer instead of thoughts, or an end-of-sequence token) in block 324, then the computer system can proceed to run a Monte Carlo Tree Search (MCTS) step, beginning with block 334, in which the computer system can determine which node in the tree (of all possible sequence continuations) to continue generating from.


Blocks 336, 338, 340, and 342 describe steps of an MCTS. In a tree structure defined by MCTS, each node can have a state that incorporates the concatenated steps of all parent nodes until a root node. Each node also can have an associated value defined by an expression score that incorporates variables, such as how many times that node has been explored, the estimated risk value for that step, or others. The disclosure can use risk associated with the output tokens to compute the score for each node, as described above.


During a selection phase of the MCTS (block 336), starting from the root node, a child can be selected iteratively by the computer system. This selection can be based, at least in part, on which child node has a highest score. This selection process can be done iteratively until a leaf node is reached.


During an expansion phase of the MCTS (block 338), the appended states of all parent nodes can be provided, by the computer system, as an input to the sequence model such that the model output may be based on the concatenated previous outputs. In the NLP example, this can include, for example, individual thoughts linked together in series (concatenated) such that the LLM is predicting the next thought or the final thought based on all previous thoughts. This expansion may, in at least some implementations, run only once per MCTS operation.


A simulation phase can then be performed by the computer system in block 340. Sometimes, the simulation phase may not require any additional compute. During the expansion operation (block 338), a new operation of the output can predict an aggregated risk value. This risk value can be used as the result of the simulation phase in block 340. In an illustrative example of a game of Go (a strategic game with a goal to control territory by surrounding empty points or your opponent's stones on a board), the simulation phase (block 340) may normally output a win probability for that particular operation. However, the “win probability” can be generalized by the computer system as an inverse of risk scores for that operation.


During a backpropagation phase (block 342), the calculated risk value can be backpropagated to all the parent nodes by the computer system. For each parent node, its risk value and the newly predicted risk value can be averaged together. This can result in a score change of parent nodes such that the next selection phase (block 336) can lead to an expansion of either the same or another leaf node. The score associated with each node can balance exploration (e.g., whether or not to select a different part of the tree that has not been explored enough) and exploitation (e.g., whether or not to select a part of the tree that is estimated to have a higher score).
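The selection and backpropagation phases described in blocks 336 and 342 can be sketched as follows. The `Node` class, the exploration constant, and the UCT-style score expression are illustrative assumptions only; as noted above, any expression that balances exploration and exploitation can define the node score:

```python
import math

class Node:
    """A node in the tree of possible sequence continuations."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # running mean reward (e.g., inverse of risk)

    def score(self, c=1.4):
        """UCT-style score balancing exploitation and exploration."""
        if self.visits == 0:
            return float("inf")  # explore unvisited children first
        return self.value + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def select_leaf(root):
    """Selection phase: iteratively pick the highest-scoring child
    until a leaf node is reached."""
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.score())
    return node

def backpropagate(node, reward):
    """Backpropagation phase: fold the newly predicted reward into
    the running mean value of the node and every parent node up to
    the root."""
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
        node = node.parent
```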


The process 300 can stop after performing the backpropagation phase in block 342. In some implementations, after performing the backpropagation phase, the computer system can return to performing block 310 as described above.



FIG. 4 is an example algorithm 400 for uncertainty-aware selective question answering using the disclosed technology. Given an input prompt x, a QA task can be to output a prediction ŷ, where ŷ∈Y(x), a set of all possible responses, that correctly answers the question of the QA task. In the selective QA task, the model also outputs σ, an uncertainty estimate for a given output ŷ, where σ∈R. Given a threshold γ∈R, the model outputs ŷ if σ<γ and refrains from responding if this condition is not met. Each γ value results in quantities for coverage (e.g., a ratio of questions the model chose to respond to) and accuracy (e.g., a number of predictions that correctly answer the question). The goal can be to maximize the number of questions that can be answered accurately, that is, to increase both coverage and accuracy. As described herein, a model can be converted into an uncertainty-aware variant that can output aleatoric, epistemic, and/or composed (e.g., bias) uncertainty estimates for every prediction. The model can learn data-dependent thresholds γ and output answers (ŷ, σ) where σ<γ. The model can forgo providing responses to questions for which no candidate meeting this requirement exists.
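For illustrative purposes, the abstention rule of the selective QA task can be sketched as follows; the helper name `selective_answer` and its candidate-list input are hypothetical names, not part of algorithm 400 itself:

```python
def selective_answer(candidates, gamma):
    """Given (y_hat, sigma) candidate pairs and a threshold gamma,
    return the lowest-uncertainty candidate with sigma < gamma, or
    None to indicate the model refrains from responding."""
    qualifying = [(y, s) for y, s in candidates if s < gamma]
    if not qualifying:
        return None  # no candidate meets the confidence requirement
    return min(qualifying, key=lambda pair: pair[1])
```

Raising γ increases coverage at the possible cost of accuracy; lowering it makes the model abstain more often.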


In an extractive QA task, each input x represents (c, q), which is composed of context text c and a question q. The space of possible answer candidates y∈Y(x) is sequential segments in c as defined by start and end indices within c=c0, c1, . . . , cn, with each index representing a token in the text. The Stanford Question Answering Dataset (SQuAD) can be considered as an illustrative example, which is a collection of questions with corresponding answers presented as segments from passages of text. In an illustrative example of the disclosed techniques, Bidirectional Encoder Representations from Transformers (BERT) can be used, more specifically BERT-base-uncased 108M, which is a pre-trained transformer-based language model with 12 layers and 108 million parameters, designed for various natural language processing tasks, that can process text without distinguishing between uppercase and lowercase letters (uncased). It can be used with the WordPiece Tokenizer as a base model to define probability distributions f(y|x) over Y(x). This model can be converted into an uncertainty-aware variant that outputs uncertainty for each index in the context text for every prediction.


In a generative QA task, input prompts x can be composed of sequences of tokens x0, x1, . . . xn representing questions. A model can be used to incrementally predict each subsequent token, starting from the last token in the input prompt, to compose a response ŷ=ŷ0, ŷ1, . . . , ŷn that correctly answers the question. The space of answer candidates y∈Y(x) includes possible sequential combinations of tokens in a given vocabulary. In an illustrative example of the disclosed techniques, the TruthfulQA question answering benchmark can be used, in which 817 questions divided into several categories are meant to represent questions commonly answered incorrectly by humans and therefore likely to be learned by models imitating human text. Llama 2, more specifically Llama 2-Chat 7B, can also be used, which was fine-tuned for dialogue use cases with a vocabulary of about 32,000 tokens, as the generative model. This model can be converted into an uncertainty-aware variant that outputs uncertainty estimates for the entire vocabulary for every predicted token.


The extractive model, BERT-base-uncased described above, can be pre-trained using one or more different sources. The model can further be trained for one or more epochs (e.g., 3 epochs) on a range of training samples, such as within a range of 100,000 to 500,000 (e.g., the 130,319 training samples provided in the SQuAD 2.0 dataset) or another training dataset. The uncertainty-aware variants can be created by converting the pre-trained model before training. All the models can be evaluated on the 11,873 questions (or more or fewer questions) in the SQuAD 2.0 test set, reporting accuracy for Exact Match and F1 metrics. In some implementations, an NVIDIA Modeling framework (NeMo framework) can be used to train and evaluate the models. The NeMo framework facilitates the creation, training, and deployment of deep learning models, particularly for applications in NLP, speech recognition, and other AI-driven tasks. The generative model, Llama 2-Chat 7B, can be pre-trained and fine-tuned for dialogue use cases. The uncertainty-aware variant can be created by converting this version of the model directly.


Aleatoric uncertainty captures incertitude resulting from data (e.g., irreducible noise, labeling errors, classes with low separation, etc.). This type of uncertainty quantifies what a model cannot understand given the data provided. Aleatoric uncertainty can be modeled using Mean and Variance Estimation (MVE). In regression, a layer can predict model output deviations and can be trained using a negative log-likelihood loss. It can be assumed that logits are drawn from a normal distribution and can stochastically sample from them using a re-parameterization trick. Stochastic samples can be averaged and backpropagated using cross entropy loss through logits and their inferred uncertainties.
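For the regression case, the MVE training objective described above can be sketched as a Gaussian negative log-likelihood over a predicted mean and log-variance; predicting log-variance is an illustrative assumption that keeps the variance positive without constraints:

```python
import math

def gaussian_nll(y_true, mean, log_var):
    """Negative log-likelihood of y_true under N(mean, exp(log_var)).
    Minimizing this loss trains an MVE head whose variance output
    absorbs aleatoric (data) noise."""
    var = math.exp(log_var)
    return 0.5 * (log_var + (y_true - mean) ** 2 / var
                  + math.log(2.0 * math.pi))
```

Inputs with irreducible noise drive the predicted variance upward, which is what makes the loss an aleatoric uncertainty signal.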


Epistemic uncertainty captures uncertainty arising from a predictive process. This type of uncertainty quantifies inherent limitations in a model or lack of knowledge, intuitively representing what the model does not know. A Bayesian neural network can be approximated by stochastically sampling, during inference, from a model with probabilistic layers. Similarly, models of arbitrary depth that follow sampling-based procedures to temporarily remove units from all layers can be equivalent to approximations to the probabilistic deep Gaussian process and can be used to estimate predictive uncertainty. Epistemic uncertainty can be calculated, in some implementations, for example, using Monte Carlo sampling (MC) (e.g., running T stochastic forward passes and computing the first and second moments from these samples, yielding predictions and uncertainty estimates, respectively). Ensembles of N models, each a randomly initialized stochastic sample, can be another approach used to estimate epistemic uncertainty.
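The MC sampling procedure described above can be sketched as follows; `mc_uncertainty` and the `stochastic_forward` callable (standing in for a model with stochastic layers kept active at inference) are hypothetical names for illustration:

```python
import random
import statistics

def mc_uncertainty(stochastic_forward, x, t=200, rng=None):
    """Run T stochastic forward passes and return the first moment
    (the prediction) and the standard deviation of the samples
    (the epistemic uncertainty estimate)."""
    rng = rng or random.Random(0)
    samples = [stochastic_forward(x, rng) for _ in range(t)]
    return statistics.fmean(samples), statistics.pstdev(samples)
```

An ensemble-based estimate can be obtained the same way by replacing the T passes through one stochastic model with one pass through each of N independently initialized models.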


Traditionally, models output predictions in the form ŷ=fW(x). The disclosed techniques apply a conversion, Φ, to build an uncertainty-aware model to measure uncertainty metrics θ:


g(·)←Φθ(fW),


ŷ, σ=g(x),


where σ is the estimated uncertainty. The disclosed conversion procedure adds and modifies relevant model components while preserving structure and function. This allows the new model to serve as a drop-in replacement that can additionally estimate uncertainty metrics. The modifications can be integrated into a custom, metric-specific forward pass and training step that integrates during training and inference.
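At a high level, the conversion Φ can be sketched as a wrapper that preserves the base prediction while attaching an uncertainty estimate; `make_uncertainty_aware` and the pluggable `uncertainty_fn` are illustrative assumptions, not the disclosed conversion procedure itself:

```python
def make_uncertainty_aware(f_w, uncertainty_fn):
    """Return g(.) = Phi_theta(f_W): a drop-in replacement that
    produces (y_hat, sigma) while preserving the base model's
    original prediction behavior."""
    def g(x):
        y_hat = f_w(x)                  # unchanged base prediction
        sigma = uncertainty_fn(f_w, x)  # attached uncertainty metric
        return y_hat, sigma
    return g
```

Because g(x) returns the same ŷ as fW(x), downstream callers can adopt the converted model without other changes.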



FIG. 5 is a table 500 illustrating accuracy of a model per uncertainty method using the disclosed technology. As initial example evaluations, MVE, MC, and Ensemble variants of a pre-trained extractive QA model are created. In this illustrative example, these variants are trained for 3 epochs on approximately 130,000 samples provided in the SQuAD 2.0 dataset. Exact Match and F1 accuracy results can be collected for approximately 11,000 test questions. As shown by the table 500, the uncertainty-aware variants have performance that is consistent with the base model with no significant reductions in accuracy.



FIG. 6 illustrates graphs 600 and 602 of selective answering accuracy by confidence level. FIG. 7 is a table 700 illustrating example selective question answering accuracy from performing the disclosed technology. Referring to both FIGS. 6 and 7, accuracy of a model across increasing levels of confidence percentile thresholds (e.g., top to bottom, least to most confident) are shown when using the disclosed technology. Bold entries in the table 700 of FIG. 7 indicate top performing uncertainty method per confidence level.


Increasing values of logit probability may not correspond to increased question answering ability, despite often being misconceived as a measure of confidence. The disclosed techniques, on the other hand, report a reliable measure of confidence, with increased confidence corresponding to increased accuracy. Having verified that QA performance remains consistent after model conversions, performance of different UQ variants can be compared in an uncertainty-guided selective QA setting. For an extractive QA task, models can be evaluated on the 5,928 answerable questions in the SQuAD test set described herein, as an illustrative example. For a generative QA task, multiple responses to the 817 questions in the TruthfulQA benchmark can be generated, as another illustrative example.


Coverage and accuracy obtained by the MVE, Ensemble, and/or MC UQ variants described above can be compared when answering questions using their uncertainty estimates as the selective prediction criteria. The UQ variants may also be compared to the original model, using baseline logit probabilities (e.g., from softmax) as a confidence value to guide selective predictions. The logit probabilities refer to the transformed output of a model's raw predictions, where logits are the unbounded values (often from the last layer of a neural network) that are converted into probabilities using a function like the softmax (for multi-class classification) or sigmoid (for binary classification). These results are shown by the graphs 600 and 602 of FIG. 6. These results are also shown in table 700 of FIG. 7, described further below.
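The baseline confidence value referred to above can be sketched as the maximum softmax probability of the raw logits; the helper name `softmax_confidence` is hypothetical:

```python
import math

def softmax_confidence(logits):
    """Convert unbounded logits to probabilities via softmax and
    return the maximum probability as the baseline confidence."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return max(exps) / sum(exps)
```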


Using UQ metrics as a measure of confidence can lead to increased accuracy while answering larger portions of the questions; using logit probability does not. Additionally, in both the extractive and generative cases, predictions are least accurate when logit probability confidence is highest. In fact, performance gradually deteriorates to 0% as logit confidence percentile increases.


Probabilities (e.g., from softmax) may not be used reliably to determine answer confidence. Moreover, as shown in FIG. 6, the highest performance achieved using logit probability (76.50% accuracy for 60% of questions in the extractive case, and 61.87% accuracy for 60% of questions in the generative case) is significantly lower than that achieved using UQ metrics. In the extractive case, MVE 612 and Ensemble 608 result in approximately 100% accuracy when answering questions in the top confidence percentile. MVE 612, Ensemble 608, and MC 610 obtain +90% accuracy with coverage rates of 5%, 15%, and 15%, respectively, and +80% accuracy with coverage rates of 50%, 65%, and 75%, respectively. Ensemble 608 is able to reach 100% accuracy with the highest overall performance across all confidence levels, as shown by the graph 600 of FIG. 6. All of the converted models are able to consistently, throughout the entire set of questions, identify predictions likely to be incorrect. In the generative case, using baseline logit probability 604 to measure confidence leads to a maximum increase in accuracy of 4.7% for questions in the 4th lowest percentile. In comparison, the disclosed uncertainty-aware models 616 can attain accuracy rates of 100%, +90%, +80%, and +70% when answering 1%, 8%, 9%, and 35% of the questions, respectively, and result in higher accuracy across all confidence percentiles. After generating 10 candidate answers for each question in the benchmark, answers with the highest uncertainty were consistently incorrect, in an illustrative example use case of the disclosed technology. The converted model may also be able to output correct answers if it repeatedly generates predictions until one in the 99% confidence percentile is found.


The disclosed techniques can similarly be used to generate a compositional UQ method with strong accuracy that does not require training multiple independent models or incur significant computational costs. MC 610 can be used as an epistemic uncertainty metric given that it can be more computationally efficient than Ensemble models. The measurement can be combined with MVE 612, which estimates aleatoric uncertainty, to create a model that calculates this composed metric with every output. The composed approach, as shown by the graphs 600 and 602 of FIG. 6 and the table 700 of FIG. 7, attained significantly higher accuracy when compared to other independent UQ metrics. Importantly, the composed approach does not incur the significant computational overhead required by Ensembling-based approaches. The automated uncertainty-awareness techniques described herein enable facile composition of different UQ methods to optimize for both performance and computational efficiency, and enable strong effectiveness on selective question answering tasks.



FIG. 8 is a table 800 illustrating an example efficiency benchmark per uncertainty method that can be used with the disclosed techniques. The table 800 provides results relating to computational performance. Models that are made risk-aware incur minimal overhead in inference time and a negligible increase in the number of parameters. The Ensemble variant, on the other hand, results in significant computational cost, approximately five times that of the original model. The disclosed automatic uncertainty-aware conversion process is completed in about 0.0994 seconds for an example 108-million-parameter model and in about 1.3856 seconds for an example 7-billion-parameter model. The latter model has significantly more learnable parameters.


Computing System(s) for Use with Present Disclosures


FIG. 9 is a schematic diagram that shows an example of a computing system 900 that can be used to implement the techniques described herein. The computing system 900 includes one or more computing devices (e.g., computing device 910), which can be in wired and/or wireless communication with various peripheral device(s) 980, data source(s) 990, and/or other computing devices (e.g., over network(s) 970). The computing device 910 can represent various forms of stationary computers 912 (e.g., workstations, kiosks, servers, mainframes, edge computing devices, quantum computers, etc.) and mobile computers 914 (e.g., laptops, tablets, mobile phones, personal digital assistants, wearable devices, etc.). In some implementations, the computing device 910 can be included in (and/or in communication with) various other sorts of devices, such as data collection devices (e.g., devices that are configured to collect data from a physical environment, such as microphones, cameras, scanners, sensors, etc.), robotic devices (e.g., devices that are configured to physically interact with objects in a physical environment, such as manufacturing devices, maintenance devices, object handling devices, etc.), vehicles (e.g., devices that are configured to move throughout a physical environment, such as automated guided vehicles, manually operated vehicles, etc.), or other such devices. Each of the devices (e.g., stationary computers, mobile computers, and/or other devices) can include components of the computing device 910, and an entire system can be made up of multiple devices communicating with each other. For example, the computing device 910 can be part of a computing system that includes a network of computing devices, such as a cloud-based computing system, a computing system in an internal network, or a computing system in another sort of shared network. 
Processors of the computing device 910 and other computing devices of a computing system can be optimized for different types of operations, secure computing tasks, etc. The components shown herein, and their functions, are meant to be examples and are not meant to limit implementations of the technology described and/or claimed in this document.


The computing device 910 includes processor(s) 920, memory device(s) 930, storage device(s) 940, and interface(s) 950. Each of the processor(s) 920, the memory device(s) 930, the storage device(s) 940, and the interface(s) 950 are interconnected using a system bus 960. The processor(s) 920 are capable of processing instructions for execution within the computing device 910, and can include one or more single-threaded and/or multi-threaded processors. The processor(s) 920 are capable of processing instructions stored in the memory device(s) 930 and/or on the storage device(s) 940. The memory device(s) 930 can store data within the computing device 910, and can include one or more computer-readable media, volatile memory units, and/or non-volatile memory units. The storage device(s) 940 can provide mass storage for the computing device 910, can include various computer-readable media (e.g., a floppy disk device, a hard disk device, a tape device, an optical disk device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations), and can provide data security/encryption capabilities.


The interface(s) 950 can include various communications interfaces (e.g., USB, Near-Field Communication (NFC), Bluetooth, WiFi, Ethernet, wireless Ethernet, etc.) that can be coupled to the network(s) 970, peripheral device(s) 980, and/or data source(s) 990 (e.g., through a communications port, a network adapter, etc.). Communication can be provided under various modes or protocols for wired and/or wireless communication. Such communication can occur, for example, through a transceiver using a radio-frequency. As another example, communication can occur using light (e.g., laser, infrared, etc.) to transmit data. As another example, short-range communication can occur, such as using Bluetooth, WiFi, or other such transceiver. In addition, a GPS (Global Positioning System) receiver module can provide location-related wireless data, which can be used as appropriate by device applications. The interface(s) 950 can include a control interface that receives commands from an input device (e.g., operated by a user) and converts the commands for submission to the processors 920. The interface(s) 950 can include a display interface that includes circuitry for driving a display to present visual information to a user. The interface(s) 950 can include an audio codec which can receive sound signals (e.g., spoken information from a user) and convert it to usable digital data. The audio codec can likewise generate audible sound, such as through an audio speaker. Such sound can include real-time voice communications, recorded sound (e.g., voice messages, music files, etc.), and/or sound generated by device applications.


The network(s) 970 can include one or more wired and/or wireless communications networks, including various public and/or private networks. Examples of communication networks include a LAN (local area network), a WAN (wide area network), and/or the Internet. The communication networks can include a group of nodes (e.g., computing devices) that are configured to exchange data (e.g., analog messages, digital messages, etc.), through telecommunications links. The telecommunications links can use various techniques (e.g., circuit switching, message switching, packet switching, etc.) to send the data and other signals from an originating node to a destination node. In some implementations, the computing device 910 can communicate with the peripheral device(s) 980, the data source(s) 990, and/or other computing devices over the network(s) 970. In some implementations, the computing device 910 can directly communicate with the peripheral device(s) 980, the data source(s), and/or other computing devices.


The peripheral device(s) 980 can provide input/output operations for the computing device 910. Input devices (e.g., keyboards, pointing devices, touchscreens, microphones, cameras, scanners, sensors, etc.) can provide input to the computing device 910 (e.g., user input and/or other input from a physical environment). Output devices (e.g., display units such as display screens or projection devices for displaying graphical user interfaces (GUIs)), audio speakers for generating sound, tactile feedback devices, printers, motors, hardware control devices, etc.) can provide output from the computing device 910 (e.g., user-directed output and/or other output that results in actions being performed in a physical environment). Other kinds of devices can be used to provide for interactions between users and devices. For example, input from a user can be received in any form, including visual, auditory, or tactile input, and feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback).


The data source(s) 990 can provide data for use by the computing device 910, and/or can maintain data that has been generated by the computing device 910 and/or other devices (e.g., data collected from sensor devices, data aggregated from various different data repositories, etc.). In some implementations, one or more data sources can be hosted by the computing device 910 (e.g., using the storage device(s) 940). In some implementations, one or more data sources can be hosted by a different computing device. Data can be provided by the data source(s) 990 in response to a request for data from the computing device 910 and/or can be provided without such a request. For example, a pull technology can be used in which the provision of data is driven by device requests, and/or a push technology can be used in which the provision of data occurs as the data becomes available (e.g., real-time data streaming and/or notifications). Various sorts of data sources can be used to implement the techniques described herein, alone or in combination.


In some implementations, a data source can include one or more data store(s) 990a. The database(s) can be provided by a single computing device or network (e.g., on a file system of a server device) or provided by multiple distributed computing devices or networks (e.g., hosted by a computer cluster, hosted in cloud storage, etc.). In some implementations, a database management system (DBMS) can be included to provide access to data contained in the database(s) (e.g., through the use of a query language and/or application programming interfaces (APIs)). The database(s), for example, can include relational databases, object databases, structured document databases, unstructured document databases, graph databases, and other appropriate types of databases.


In some implementations, a data source can include one or more blockchains 990b. A blockchain can be a distributed ledger that includes blocks of records that are securely linked by cryptographic hashes. Each block of records includes a cryptographic hash of the previous block, and transaction data for transactions that occurred during a time period. The blockchain can be hosted by a peer-to-peer computer network that includes a group of nodes (e.g., computing devices) that collectively implement a consensus algorithm protocol to validate new transaction blocks and to add the validated transaction blocks to the blockchain. By storing data across the peer-to-peer computer network, for example, the blockchain can maintain data quality (e.g., through data replication) and can improve data trust (e.g., by reducing or eliminating central data control).


In some implementations, a data source can include one or more machine learning systems 990c. The machine learning system(s) 990c, for example, can be used to analyze data from various sources (e.g., data provided by the computing device 910, data from the data store(s) 990a, data from the blockchain(s) 990b, and/or data from other data sources), to identify patterns in the data, and to draw inferences from the data patterns. In general, training data 992 can be provided to one or more machine learning algorithms 994, and the machine learning algorithm(s) can generate a machine learning model 996. Execution of the machine learning algorithm(s) can be performed by the computing device 910, or another appropriate device. Various machine learning approaches can be used to generate machine learning models, such as supervised learning (e.g., in which a model is generated from training data that includes both the inputs and the desired outputs), unsupervised learning (e.g., in which a model is generated from training data that includes only the inputs), reinforcement learning (e.g., in which the machine learning algorithm(s) interact with a dynamic environment and are provided with feedback during a training process), or another appropriate approach. A variety of different types of machine learning techniques can be employed, including but not limited to convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), and other types of multi-layer neural networks.


Various implementations of the systems and techniques described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. A computer program product can be tangibly embodied in an information carrier (e.g., in a machine-readable storage device), for execution by a programmable processor. Various computer operations (e.g., methods described in this document) can be performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output. The described features can be implemented in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program is a set of instructions that can be used, directly or indirectly, by a computer to perform a certain activity or bring about a certain result. A computer program can be written in any form of programming language, including compiled or interpreted languages, and can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program product can be a computer- or machine-readable medium, such as a storage device or memory device. As used herein, the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, etc.) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. 
The term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.


Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and can be a single processor or one of multiple processors of any kind of computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer can also include, or can be operatively coupled to communicate with, one or more mass storage devices for storing data files. Such devices can include magnetic disks (e.g., internal hard disks and/or removable disks), magneto-optical disks, and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data can include all forms of non-volatile memory, including by way of example semiconductor memory devices, flash memory devices, magnetic disks (e.g., internal hard disks and removable disks), magneto-optical disks, and optical disks. The processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).


The systems and techniques described herein can be implemented in a computing system that includes a back end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). The computer system can include clients and servers, which can be generally remote from each other and typically interact through a network, such as the described one. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.


While this specification contains many specific implementation details, these should not be construed as limitations on the scope of the disclosed technology or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular disclosed technologies. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment in part or in whole. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described herein as acting in certain combinations and/or initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination. Similarly, while operations may be described in a particular order, this should not be understood as requiring that such operations be performed in the particular order or in sequential order, or that all operations be performed, to achieve desirable results. Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims.
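By way of illustration only, the iterative procedure recited in the claims below (apply a risk-aware model, receive model output and corresponding risk values, compare the risk values to a threshold, and regenerate the input until a processing condition is met) can be sketched as follows. The `toy_model` and `toy_regenerate` functions, the averaging used to aggregate risk values, and the iteration cap are hypothetical stand-ins, not the claimed implementation:

```python
def generate_with_risk_control(model, regenerate, user_input,
                               risk_threshold=0.5, max_iterations=5):
    # Processing condition: either the aggregated risk falls below the
    # threshold or the iteration cap is reached.
    for _ in range(max_iterations):
        output, risk_values = model(user_input)      # apply model; receive output and risks
        aggregated_risk = sum(risk_values) / len(risk_values)
        if aggregated_risk < risk_threshold:         # compare to threshold risk value
            break
        user_input = regenerate(user_input)          # risk too high: regenerate the input
    return output, aggregated_risk

# Toy stand-ins: risk shrinks as the regenerated input accumulates detail.
def toy_model(text):
    risk = 1.0 / (1 + text.count("detail"))
    return f"answer to: {text}", [risk, risk]

def toy_regenerate(text):
    return text + " detail"

output, risk = generate_with_risk_control(toy_model, toy_regenerate, "question",
                                          risk_threshold=0.4)
```

In this sketch the loop stops on the third application, once the aggregated risk drops below the threshold; a production system would instead derive the risk values from per-sequence uncertainty estimates of the risk-aware model.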

Claims
  • 1. A method for improving accuracy of model output generation, the method comprising: obtaining a risk-aware model and a user input; applying the risk-aware model to the user input; receiving, based on the applying, model output and corresponding risk values; comparing the corresponding risk values to a threshold risk value; and regenerating the user input based on the comparing.
  • 2. The method of claim 1, the method further comprising: iteratively performing the applying, the receiving, the comparing, and the regenerating using the regenerated user input until one or more processing conditions are met.
  • 3. The method of claim 2, the method further comprising: in response to the one or more processing conditions being met, analyzing the outputted corresponding risk values against one or more risk criteria; selecting, based on the analyzing, a user input having an outputted corresponding risk value that satisfies the one or more risk criteria; and returning the selected user input to a user device.
  • 4. The method of claim 2, the method further comprising determining that the one or more processing conditions are met based on the corresponding risk values being less than the threshold risk value.
  • 5. The method of claim 1, wherein the user input is regenerated in response to determining that the corresponding risk values are greater than the threshold risk value.
  • 6. The method of claim 1, wherein the model output comprises one or more sequences in a train-of-thought (TOT) of the risk-aware model.
  • 7. The method of claim 6, wherein the corresponding risk values comprise an aleatoric uncertainty value corresponding to each sequence.
  • 8. The method of claim 6, wherein the corresponding risk values comprise an epistemic uncertainty value corresponding to each sequence.
  • 9. The method of claim 6, wherein the corresponding risk values comprise a bias value corresponding to each sequence.
  • 10. The method of claim 1, the method further comprising determining that the one or more processing conditions are met based on the model output being a final output of the risk-aware model.
  • 11. The method of claim 1, the method further comprising: aggregating the corresponding risk values to generate an aggregated risk value; determining whether the aggregated risk value is less than the threshold risk value; and in response to determining that the aggregated risk value is greater than the threshold risk value, performing the regenerating.
  • 12. A method for improving accuracy of model output generation, the method comprising: receiving a user input for a model; prepending logical instructions to the user input; applying a risk-aware variant of the model to the user input having the prepended logical instructions; receiving, based on the applying, model output and corresponding risk values; comparing the corresponding risk values to a threshold risk value; and regenerating the user input having the prepended logical instructions based on the comparing.
  • 13. The method of claim 12, the method further comprising: iteratively performing the applying, the receiving, the comparing, and the regenerating using the regenerated user input having the prepended logical instructions until a processing condition is met; in response to the processing condition being met, analyzing the outputted corresponding risk values against one or more risk criteria; and selecting, based on the analyzing, a user input having an outputted corresponding risk value that satisfies the one or more risk criteria.
  • 14. The method of claim 12, wherein the user input having the prepended logical instructions is regenerated in response to determining that the corresponding risk values are greater than the threshold risk value.
  • 15. The method of claim 12, the method further comprising: aggregating the corresponding risk values to generate an aggregated risk value; determining whether the aggregated risk value is less than the threshold risk value; and in response to determining that the aggregated risk value is greater than the threshold risk value, performing the regenerating.
  • 16. A system for improving accuracy of model output generation, the system comprising: a computer system comprising one or more processors and memory storing instructions that, when executed by the one or more processors, cause the computer system to perform a process comprising: obtaining a risk-aware model and a user input; applying the risk-aware model to the user input; receiving, based on the applying, model output and corresponding risk values; comparing the corresponding risk values to a threshold risk value; and regenerating the user input based on the comparing.
  • 17-19. (canceled)
  • 20. The system of claim 16, wherein the process further comprises iteratively performing the applying, the receiving, the comparing, and the regenerating using the regenerated user input until one or more processing conditions are met.
  • 21. The system of claim 20, wherein the process further comprises: in response to the one or more processing conditions being met, analyzing the outputted corresponding risk values against one or more risk criteria; selecting, based on the analyzing, a user input having an outputted corresponding risk value that satisfies the one or more risk criteria; and returning the selected user input to a user device.
  • 22. The system of claim 16, wherein the user input is regenerated in response to determining that the corresponding risk values are greater than the threshold risk value.
  • 23. The system of claim 16, wherein the model output comprises one or more sequences in a train-of-thought (TOT) of the risk-aware model.
RELATED APPLICATIONS

The present disclosure claims priority to U.S. Provisional Patent Application No. 63/602,637, entitled “Uncertainty-aware Language Modeling for Selective Question Answering,” which was filed on Nov. 26, 2023, and which is incorporated by reference herein in its entirety.

Provisional Applications (1)
Number Date Country
63602637 Nov 2023 US