A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. © 2023 Grammarly, Inc.
One technical field of the present disclosure is natural language processing (NLP) tasks using artificial intelligence (AI) and machine learning systems. Another technical field is prompt engineering for large language models (LLMs) and generative AI systems.
The approaches described in this section are approaches that could be pursued but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any approaches described in this section qualify as prior art merely by their inclusion.
Instruction-tuned large language models (LLMs) have achieved improved performance on various NLP tasks and are used extensively in many NLP applications, including writing assistants. However, little is known about their ability to follow fine-grained instructions consistently and rigorously, especially text editing instructions. Modern large-scale LLMs have been aligned to follow instructions and have shown the ability to perform several tasks, such as sentiment classification, question answering, text summarization, and machine translation, when these tasks are described as instructions in suitable prompts (Wang et al., 2018, 2019; Hendrycks et al., 2020; Surameery and Shakor, 2023).
Despite instruction-tuned LLMs being used very widely, engineers and product managers do not clearly understand their instruction-following capability. Prior work on rigorously evaluating the instruction-following ability of LLMs is relatively scarce (Webson and Pavlick, 2022; Min et al., 2022; Kung and Peng, 2023; Li et al., 2023). Kung and Peng (2023) suggest that some instruction-fine-tuned models might rely less on the instructions than previously thought and instead latch onto superficial artifacts like output format. Li et al. (2023) find that the influence of prior knowledge encoded in the training stage can limit the steerability of models. Still, fundamental questions remain: How good are current LLMs at following instructions? How does their effectiveness vary with the number of instructions, their complexity, and the dependence relations between instructions?
Based on the foregoing, the referenced technical fields have developed an acute need for better ways to follow fine-grained text editing instructions.
The appended claims may serve as a summary of the invention.
In the drawings:
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
The text of this disclosure, in combination with the drawing figures, is intended to state in prose the algorithms that are necessary to program the computer to implement the claimed inventions at the same level of detail that is used by people of skill in the arts to which this disclosure pertains to communicate with one another concerning functions to be programmed, inputs, transformations, outputs and other aspects of programming. That is, the level of detail set forth in this disclosure is the same level of detail that persons of skill in the art normally use to communicate with one another to express algorithms to be programmed or the structure and function of programs to implement the inventions claimed herein.
This disclosure may describe one or more different inventions, with alternative embodiments to illustrate examples. Other embodiments may be utilized, and structural, logical, software, electrical, and other changes may be made without departing from the scope of the particular inventions. Various modifications and alterations are possible and expected. Some features of one or more of the inventions may be described with reference to one or more particular embodiments or drawing figures, but such features are not limited to usage in the one or more particular embodiments or figures with reference to which they are described. Thus, the present disclosure is neither a literal description of all embodiments of one or more inventions nor a listing of features of one or more inventions that must be present in all embodiments.
Headings of sections and the title are provided for convenience but are not intended to limit the disclosure in any way or as a basis for interpreting the claims. Devices described as in communication with each other need not be in continuous communication with each other unless expressly specified otherwise. In addition, devices that communicate with each other may communicate directly or indirectly through one or more intermediaries, logical or physical.
A description of an embodiment with several components in communication with one other does not imply that all such components are required. Optional components may be described to illustrate a variety of possible embodiments and to illustrate one or more aspects of the inventions fully. Similarly, although process steps, method steps, algorithms, or the like may be described in sequential order, such processes, methods, and algorithms may generally be configured to work in different orders unless specifically stated to the contrary. Any sequence or order of steps described in this disclosure is not a required sequence or order. The steps of the described processes may be performed in any order practical. Further, some steps may be performed simultaneously. The illustration of a process in a drawing does not exclude variations and modifications, does not imply that the process or any of its steps are necessary to one or more of the invention(s), and does not imply that the illustrated process is preferred. The steps may be described once per embodiment but need not occur only once. Some steps may be omitted in some embodiments or occurrences, or some steps may be executed more than once in a given embodiment or occurrence. When a single device or article is described, more than one device or article may be used in place of a single device or article. Where more than one device or article is described, a single device or article may be used instead of more than one device or article.
The functionality or features of a device may be alternatively embodied by one or more other devices that are not explicitly described as having such functionality or features. Thus, other embodiments of one or more inventions need not include the device itself. Techniques and mechanisms described or referenced herein will sometimes be described in singular form for clarity. However, it should be noted that particular embodiments include multiple iterations of a technique or manifestations of a mechanism unless noted otherwise. Process descriptions or blocks in figures should be understood as representing modules, segments, or portions of code, including one or more executable instructions for implementing specific logical functions or steps in the process. Alternate implementations are included within the scope of embodiments of the present invention in which, for example, functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
Using a benchmark suite that enables a controlled evaluation, the inventors have discovered, in an inventive moment, that state-of-the-art LLMs show varied performance and can struggle on even elementary text editing tasks, revealing key insights into the limitations of current LLMs. The disclosure shows that further instruction tuning on text-editing instruction data can effectively improve performance on both seen and unseen text-editing tasks. The disclosure also presents embodiments of a computer system and computer-implemented method to automatically transform an input text to conform to a specified style guide.
Various embodiments encompass the subject matter of the following numbered clauses:
13. One or more non-transitory computer-readable storage media storing one or more sequences of instructions which, when executed using one or more processors, cause the one or more processors to execute: executing a document processing application; receiving a digitally stored electronic document, alone or in combination with one or more other relevant documents, and an engineered prompt; transmitting an application programming interface (API) call to an API of a pre-trained large language model (LLM), wherein the API call comprises the engineered prompt, wherein the engineered prompt comprises a plurality of objective instructions to the pre-trained LLM specifying transforming the electronic document according to a style guide to cause the pre-trained LLM to execute an inference stage over the electronic document and automatically generate output text based on the electronic document and the plurality of objective instructions that transforms the electronic document to conform to the style guide; and storing the output text using a storage device of a user computer, a hosted storage environment, or in memory associated with the document processing application.
14. The one or more non-transitory computer-readable storage media of clause 13, wherein the pre-trained LLM is any of a ChatGPT model, a GPT4 model, and a LLAMA2-13B model.
15. The one or more non-transitory computer-readable storage media of clause 13, wherein the document processing application is communicatively coupled via a network to the user computer, which hosts or executes a browser or word processor to create and store the electronic document.
16. The one or more non-transitory computer-readable storage media of clause 15, further comprising sequences of instructions which, when executed using the one or more processors, cause the one or more processors to execute: receiving a query for one or more documents that conform to a specified style guide; transmitting the query to a knowledge base that digitally stores text documents that the user computer has created and stored in states before or after transformation according to the style guide; in response to the query, creating and returning a result set of the one or more other relevant documents; transmitting the one or more other relevant documents via the API to the pre-trained LLM with a prompt to fine-tune the LLM, thereby causing the pre-trained LLM to create and return, or digitally store, a fine-tuned model; and using the fine-tuned model, receiving the electronic document as input and processing a request to automatically transform the electronic document according to the specified style guide.
17. The one or more non-transitory computer-readable storage media of clause 16, further comprising sequences of instructions which, when executed using the one or more processors, cause the one or more processors to execute: transmitting a query to the knowledge base to retrieve a result set of the one or more other relevant documents corresponding to prior transformations of documents according to a specified style guide; transmitting a processing request to the API, the processing request comprising: a copy of the electronic document to be transformed according to the specified style guide; the result set of the one or more other relevant documents; and the engineered prompt; and receiving, from the pre-trained LLM, automatically generated text that has been transformed according to the style guide.
18. The one or more non-transitory computer-readable storage media of clause 13, wherein the engineered prompt comprises one of a plurality of different prompts digitally stored in prompt storage and retrieved in response to a programmatic request from the document processing application.
19. The one or more non-transitory computer-readable storage media of clause 13, wherein the engineered prompt comprises a task description, a query, and a plurality of examples of transformations.
20. The one or more non-transitory computer-readable storage media of clause 19, wherein the engineered prompt comprises 10 to 1,000 examples of transformations.
21. The one or more non-transitory computer-readable storage media of clause 13, wherein the engineered prompt comprises a task description, a query, and at least: one or more first instructions specifying punctuation usage; one or more second instructions specifying abbreviation style; one or more third instructions specifying capitalization style; and one or more fourth instructions specifying number style.
22. The one or more non-transitory computer-readable storage media of clause 13, wherein the engineered prompt comprises 6 to 16 instructions.
23. The one or more non-transitory computer-readable storage media of clause 13, wherein the digitally stored electronic document comprises five or fewer sentences.
24. The one or more non-transitory computer-readable storage media of clause 13, wherein the pre-trained LLM comprises 70 billion or more parameters.
The following sections and the attached drawing figures are intended to illustrate the functional level at which skilled persons in the art to which this disclosure pertains communicate with one another to describe and implement a computer-implemented method and/or algorithms using programming, as described further herein. The flow diagrams are not intended to illustrate every instruction, method object, or sub-step that would be needed to program every aspect of a working program but are provided at the same functional level of illustration that is normally used at the high level of skill in this art to communicate the basis of developing working programs.
Embodiments provide new technical means for automatically applying communication styles to digitally stored electronic documents based on digital style guides and engineered prompts to an LLM. Characterizing the ability of LLMs to follow a set of instructions is useful in many practical text-editing applications where multiple operations need to be performed in one interaction. For example, given an input document, one might want to correct formatting and grammar, improve clarity, and rewrite to a specific tone. More generally, improving the ability of LLMs to follow a set of (complex) instructions would enable LLMs to rewrite text according to specific style guides, such as the Associated Press Style Guide, the Chicago Manual of Style, or others. Such style guides contain several instructions outlining how text must be written to conform to the specific style. They are adopted by publishing companies, businesses, and media houses to ensure consistency across their communications.
Embodiments can be based on data derived from a suite of text-editing tasks and associated benchmarks to probe the instruction-following ability of these models. One embodiment can use three dimensions to probe model performance and deliberately design text editing tasks that assess model performance in a controlled manner across representative points in the space spanned by these dimensions, as shown in
Embodiments provide new insights into the instruction-following ability of LLMs, introducing new tasks and evaluation protocols for evaluating the ability of LLMs to follow text editing instructions—an ability that is at the heart of writing assistants.
In an embodiment, a main set of text editing tasks falls into two major groups described below.
Elementary text editing tasks, in an embodiment, comprise synthetic text editing tasks that perform elementary string manipulation where each instruction is very simple, the expected output is completely unambiguous, and evaluation data can be very easily constructed computationally.
Copy. In this task, a list of instructions asks the LLM to output a specific sequence of tokens. The task helps probe model performance when: (a) each instruction is almost trivial in complexity and (b) every instruction is completely independent of others. An example sequence of instructions (in a single interaction) for this task is shown below:
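As one illustrative sketch (the exact instruction wording is an assumption, not the disclosure's prompt format), a Copy-task instance can be generated programmatically:

```python
def make_copy_instance(tokens):
    """Build a Copy-task instance: each instruction independently asks the
    model to output one token, so instructions have no mutual dependencies."""
    instructions = [f"Instruction {i + 1}: Output the token '{t}'."
                    for i, t in enumerate(tokens)]
    return {"instructions": instructions, "expected": " ".join(tokens)}

instance = make_copy_instance(["alpha", "bravo", "charlie"])
# instance["expected"] is "alpha bravo charlie"
```

Because the expected output is fully determined by the token list, evaluation data for this task can be constructed and scored entirely computationally.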
Edit. This task considers a list of instructions that transform (edit) a source string into a target string in terms of only INSERT/DELETE/REPLACE instructions on indices of the source. Such instructions can be efficiently computed using the classic Wagner-Fischer algorithm. An example list of instructions that transforms “coastal” to “postal” is shown below:
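A minimal sketch of such an edit-script computation using the standard Wagner-Fischer dynamic program follows. The INSERT/DELETE/REPLACE tuple format is an assumption for illustration; in this sketch, ops are emitted from the end of the string toward the beginning so that each index remains valid on the intermediate string as the script is applied.

```python
def edit_script(src, tgt):
    """Wagner-Fischer dynamic program: a minimal INSERT/DELETE/REPLACE
    script that transforms src into tgt."""
    m, n = len(src), len(tgt)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete src[i-1]
                          d[i][j - 1] + 1,          # insert tgt[j-1]
                          d[i - 1][j - 1] + cost)   # keep or replace
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1                     # characters already match
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(("REPLACE", i - 1, tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("DELETE", i - 1, None))
            i -= 1
        else:
            ops.append(("INSERT", i, tgt[j - 1]))
            j -= 1
    return ops

def apply_ops(s, ops):
    """Apply an edit script produced by edit_script, op by op."""
    for op, idx, ch in ops:
        if op == "DELETE":
            s = s[:idx] + s[idx + 1:]
        elif op == "REPLACE":
            s = s[:idx] + ch + s[idx + 1:]
        else:  # INSERT before position idx
            s = s[:idx] + ch + s[idx:]
    return s

# Transforms "coastal" to "postal" in two operations.
ops = edit_script("coastal", "postal")
```

Applying the resulting script reproduces the target, which is how training and test instances for this task can be generated and verified computationally.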
In contrast to the Copy task, even though each instruction is fairly simple (although not trivial), there is a strict sequential dependency among instructions, where each instruction theoretically operates on the intermediate output of the prior instruction. Two settings, Edit-1 and Edit-N, correspond to single and multiple instructions, respectively.
Finally, for both tasks, an embodiment can generate a test set (N: 1000) and a training set (N: 10000, token/tuple disjoint from the test set) with varying task parameters.
Instructions on text editing and manipulation are mentioned in popular language style guides like the AP (Associated Press) and MLA (Modern Language Association) style guides. Instructions mentioned in the AP style guide can be represented in a task denoted miniAP. In an embodiment, the task digitally encodes a set of instructions listed in the AP style guide related to the usage of (a) Punctuation, (b) Abbreviations, (c) Capitalization, (d) Plurals, (e) Number formatting, and (f) Date and Time formatting. An example is shown in TABLE 5 below. The instructions for these dimensions are relatively unambiguous; compliance can be computationally verified, and collecting relevant data is easier. Instructions related to other dimensions, like clarity, tone, and sensitivity, tend to be subjective and require human evaluation but could be encoded or used in other embodiments. The following example illustrates model instructions from the above set, identifies an input text, and asks an LLM to rewrite the text to comply with the provided instructions.
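For instance, compliance with one such rule can be verified with a regular expression. The specific rule below (spelling out standalone whole numbers one through nine, loosely in the spirit of AP number style) is an illustrative assumption, not the disclosure's encoded rule set:

```python
import re

# Illustrative rule: spell out standalone whole numbers one through nine.
DIGIT_WORDS = {"1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight",
               "9": "nine"}

def apply_number_rule(text):
    """Rewrite a sentence to comply with the illustrative number rule."""
    return re.sub(r"\b([1-9])\b", lambda m: DIGIT_WORDS[m.group(1)], text)

def is_compliant(text):
    """Compliance is computationally verifiable: a compliant sentence is a
    fixed point of the rewrite."""
    return apply_number_rule(text) == text

# Yields a (non-compliant, compliant) sentence pair for dataset construction.
pair = ("She bought 3 apples.", apply_number_rule("She bought 3 apples."))
```

The same fixed-point check also lets a benchmark confirm that a model left already-compliant input unchanged.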
Embodiments can consider three settings with increasing complexity as follows. The task miniAP-1-1 considers exactly one instruction and one input sentence. This corresponds to the easiest configuration. In the task miniAP-N-1, the model is provided multiple instructions, but the input is a single sentence. This is a slightly more challenging setting since the majority of the instructions may not apply to the provided input sentence. Models need to ignore inapplicable instructions but still correctly apply the typically single (or rarely few) applicable instructions. In the task miniAP-N-P, the model has multiple instructions and multiple input sentences (a paragraph of five sentences). This extends the multiple instruction setting above in that the input text is larger and has multiple (potentially) non-compliant sentences. Both the miniAP-N-1 and miniAP-N-P settings resemble a realistic use case where a user might provide several instructions from a style guide and ask that the LLM rewrite the input as necessary to comply.
To generate test and training instances when needed, the SELFINSTRUCT paradigm can be used (Wang et al., 2022). An LLM with a public API, such as GPT-4, can generate candidate sentences related to a linguistic dimension (e.g., number usage). Then, regular expressions can manipulate these candidate sentences to obtain versions compliant and non-compliant with the specific instruction, optionally with a manual review for correctness. These steps create a dataset containing triplets of the form (a style guide instruction, non-compliant sentence, compliant version) and correspond to the miniAP-1-1 setting. To construct datasets for the other two settings, this dataset is modified as follows. For the miniAP-N-1 setting, the instruction field is replaced to contain all instructions, and the compliant version is further manually edited if needed to ensure global compliance. Similarly, to construct a paragraph-level dataset (the miniAP-N-P setting), a batch of instances is randomly sampled and the corresponding tuples are concatenated; experiments use a paragraph size of five sentences. For each setting, a test set (N: 1000) and a training set (N: 10000) are generated, ensuring that the train and test sets are disjoint. Finally, to evaluate the ability of models not to make erroneous edits to compliant input, all datasets contain some instances where the input text already conforms to the provided instructions and no edits are needed.
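The paragraph-level construction step can be sketched as follows (the triplet layout and field names are assumptions for illustration):

```python
import random

def make_paragraph_instance(triplets, all_instructions, k=5, seed=0):
    """Concatenate k sampled (instruction, non-compliant, compliant) triplets
    into one miniAP-N-P instance: the model sees every instruction but a
    multi-sentence input in which several sentences may be non-compliant."""
    rng = random.Random(seed)
    batch = rng.sample(triplets, k)
    return {
        "instructions": all_instructions,   # the full rule set
        "input": " ".join(t[1] for t in batch),   # non-compliant paragraph
        "target": " ".join(t[2] for t in batch),  # compliant paragraph
    }
```

Sampling with a fixed seed keeps instance construction reproducible across dataset builds.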
Two additional text-editing tasks can evaluate the ability of text-editing instruction fine-tuned LLM models to generalize to new instructions/tasks.
miniAP-reversed Task. This set of instructions is formed from the miniAP-1-1 task by negating the original instructions and swapping the target and the input, thus asking the model to execute the reverse of the original instruction.
UpperLower (UL) Task. The task and settings are identical to the Edit task but with one major change. Instead of instructions that insert, delete, or replace characters at specific indices and have sequential dependencies, the task comprises instructions that only transform characters to upper-case or lower-case at specific indices. Furthermore, the instances are designed such that the final output does not theoretically depend on the instruction application order.
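A sketch of this task's semantics (the ('UPPER' | 'LOWER', index) instruction format is an assumption) shows why application order cannot matter: each operation touches one fixed index and never inserts or deletes, so no operation can shift the index another refers to.

```python
def apply_case_ops(s, ops):
    """Apply ('UPPER' | 'LOWER', index) instructions to a string. The ops
    preserve string length, so the result is order-independent."""
    chars = list(s)
    for op, i in ops:
        chars[i] = chars[i].upper() if op == "UPPER" else chars[i].lower()
    return "".join(chars)

ops = [("UPPER", 0), ("UPPER", 4)]
# Same result regardless of application order.
assert apply_case_ops("editing", ops) == apply_case_ops("editing", ops[::-1])
```

This contrast with the Edit task isolates sequential dependency as the variable under test.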
GENERIC-TE is a large-scale, diverse text-editing instruction dataset. Prior works like ALPACA (Taori et al., 2023) have demonstrated that instruction fine-tuning on a very diverse set of instructions improves model performance on various downstream NLP tasks. Experiments can investigate the effect of instruction fine-tuning on a large, diverse set of text-editing instructions. In an embodiment, a dataset of 50,000 instances of instruction-following data is created, largely adapting the process outlined in (Taori et al., 2023), except that in the SELFINSTRUCT step, the model is asked to focus only on text-editing instructions while still generating a diverse instruction set related to a large variety of linguistic dimensions like tone or sensitive language. This data is used only for supervised fine-tuning and not for evaluation since this dataset spans linguistic dimensions where evaluation can be subjective.
Prompting Strategy. The instructions and the input text are encoded in a prompt like the examples above, accommodating minor variations that may be needed, such as removing the system prompt. A best-effort approach to prompt engineering is acceptable, incorporating known best practices to design prompts. An implicit but practical assumption is that “prompt engineering effort” negatively correlates with “model steering/instruction-following ability.” Therefore, when a model requires extensive prompt engineering to follow instructions, intuitively, it implies the model is not as steerable and so ranks lower than a model that requires less effort.
Evaluation Metric. To measure the ability of models to follow text editing instructions precisely, an exact match with the compliant text is defined as the measure of success and accuracy. To maximize success potential, models can be flexible on the output format, and an embodiment can be programmed to perform appropriate output post-processing. The models are expected to be largely consistent in their output format. For example, some models output rewritten text between <output></output> tags, while others just output the rewritten text.
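Such post-processing can be sketched as follows (the <output></output> convention mirrors the example above; the helper names are hypothetical):

```python
import re

def extract_output(raw):
    """Normalize a model response: prefer text inside <output></output> tags
    when present; otherwise treat the whole stripped response as the rewrite."""
    m = re.search(r"<output>(.*?)</output>", raw, flags=re.DOTALL)
    return (m.group(1) if m else raw).strip()

def exact_match(raw_response, target):
    """Success metric: exact match against the compliant target text."""
    return extract_output(raw_response) == target.strip()
```

Normalizing before comparison keeps the metric from penalizing purely presentational differences between model families.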
Experiments have used (a) ChatGPT, (b) GPT4, and (c) LLAMA2. ChatGPT and GPT-4 are closed models, but they are considered to be state-of-the-art and thus establish strong practical baselines. However, because these models are closed, the effects of aspects like (a) model scale, (b) training procedures, and (c) training data cannot be fully isolated. In contrast, LLAMA2 is an open model, with the 13B model being representative. The chat variant of LLAMA2 can be used since the base model is not instruction fine-tuned. The 13B variant is an expedient choice and also more representative of practical use cases.
Overall Performance.
All models perform the Copy task almost perfectly. Recall that this task involves instructions where (a) each instruction is almost trivial in complexity and (b) instructions are completely independent. This suggests that very low instruction complexity and independence correlate with higher performance.
All models find the Edit task challenging, where even performance on the single instruction setting (edit-1) is low. Prior work has noted that LLMs (including GPT4) may still struggle at seemingly elementary tasks like counting, article swapping, and shift ciphers (McCoy et al., 2023). Acting on index-based string editing operations is likely yet another such task. Furthermore, performance drops even further in the multiple instruction setting, suggesting that sequential dependencies between instructions can be challenging for current LLMs. Such models need to operate implicitly on intermediate outputs, thus introducing more points of failure.
Even in the single instruction case, performance is only about 50% (GPT-4). This suggests that both ChatGPT and GPT4 find it challenging to follow basic operations on characters at specific indices consistently, even with explicit instructions that 0-based indexing is used; explicit and implicit CoT prompting also did not yield significant improvement. Second, sequential dependency between instructions can be additionally challenging because models need to maintain an internal scratch buffer, since the output at any step depends on the intermediate output of the previous instruction. Therefore, the final predicted output is likely to be close in “edit-distance” similarity to the expected output, and the accuracy and “edit-distance” similarity should drop as the number of instructions increases.
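One way to quantify this closeness (the metric choice here, a character-level similarity ratio from the standard library's difflib, is an assumption for illustration) is:

```python
import difflib

def edit_similarity(pred, target):
    """Character-level similarity in [0, 1]; 1.0 means an exact match, and
    near-miss outputs score high even when exact-match accuracy is 0."""
    return difflib.SequenceMatcher(None, pred, target).ratio()
```

Tracking such a soft score alongside exact-match accuracy separates models that nearly follow the instructions from models that ignore them entirely.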
On the miniAP tasks: (a) Performance on the single instruction, single sentence setting (miniAP-1-1) is generally higher than in the other settings (miniAP-N-*), revealing a clear trend that model performance decreases as the input size increases. This further supports prior findings regarding the general challenges LLMs face in handling long contexts (Liu et al., 2023). How performance drops as input length increases is revealed by analyzing performance at the first P sentences, as shown in
Finally, OpenAI models outperform LLAMA2 on average. Between ChatGPT and GPT4, GPT4 generally performs similarly to ChatGPT except for the Edit task, for which GPT4 significantly outperforms ChatGPT.
Effect of Model Scale.
Effect of Instruction Alignment with Prior Knowledge. Because LLMs are trained on massive amounts of text (and potentially instructions) from diverse sources, including the Internet, one might conjecture that LLMs might have internalized many of the miniAP style instructions since such instructions are dominantly reflected in general language usage. Thus, to what extent is model performance a reflection of instructions being well-aligned with prior knowledge potentially internalized by models during their training phase? An embodiment can be programmed to measure model performance on counterfactual instructions that likely deviate from the LLM prior. The miniAP-reversed task consists of precisely these counterfactual instructions. TABLE 1 shows the result of this evaluation. A significant drop occurs in the performance of ChatGPT (0.82 versus 0.61) and LLAMA2 (0.47 versus 0.25) in the counterfactual setting compared to the likely well-aligned miniAP setting. This suggests that there is a significant effect of model priors on model performance, and these priors are hard to override. GPT4, however, shows some drop but not a significant one (0.71 versus 0.68), indicating that GPT4 is likely a more steerable model than the others.
miniAP Task Error Analysis. An error analysis of models on the miniAP task considers two main kinds of errors: false negatives, where the model either misses or makes incorrect edits on non-compliant input, and false positives, where the model erroneously makes edits to fully compliant input.
What linguistic dimensions pose the most challenge? Table 2 shows the breakdown of the errors on the miniAP-1-1 task for representative models from both model families. All models generally find instructions related to abbreviations easier than others. Instructions related to plural possessives, number formatting, and colon usage tend to pose quite a challenge for all models. This is likely because these instructions are nuanced and complex, consisting of many conditions outlining exceptions.
Instruction Application via Code Generation. Having observed that LLMs find several text editing instructions, many of which are simple deterministic rules, challenging to follow, experiments can explore whether LLM task performance improves by leveraging the code-generation ability of LLMs to translate applicable human-readable instructions into Python programs that can be executed deterministically. A mixture-of-experts approach can be used.
First, given the set of instructions, the LLM is prompted to decide which subset of instructions can be implemented as Python functions and to generate relevant Python code that, when executed on an input, applies that subset. The application of the full set of instructions then has two steps: (a) apply the subset of instructions that are encoded as Python functions by executing those functions on the input string, and (b) apply the remaining instructions by defaulting to the prompting strategy.
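The two-step application can be sketched as follows (replace_ampersand stands in for a hypothetical LLM-generated rule function, and prompt_fallback stands in for the default prompting strategy; both names are assumptions):

```python
def replace_ampersand(text):
    """Stand-in for an LLM-generated function implementing one deterministic
    instruction (here: replace ' & ' with ' and ')."""
    return text.replace(" & ", " and ")

def apply_mixed(text, generated_fns, prompt_fallback):
    """Step (a): deterministically execute the code-translatable subset;
    step (b): hand the result to the default prompting strategy for the
    remaining instructions."""
    for fn in generated_fns:
        text = fn(text)
    return prompt_fallback(text)

result = apply_mixed("Research & Development plan",
                     [replace_ampersand],
                     lambda t: t)  # identity fallback for this sketch
# result is "Research and Development plan"
```

Routing the deterministic rules through executed code removes them as a source of model error, leaving only the genuinely ambiguous instructions to the prompt.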
Overall results are mixed. Note the significant improvement in performance on the Edit-1 (0.99 vs. 0.5) and Edit-N (0.6 vs. 0.2) tasks relative to our default prompting approach. GPT4 results in
Significant gaps exist in model performance in the zero-shot setting. Therefore, an experiment can investigate whether further fine-tuning on potentially task-specific instruction-following data helps boost performance. How well do such fine-tuned models generalize to unseen tasks like those outlined in Section 2.2? Further supervised fine-tuning, considering task mixtures that progressively increase the number of held-out tasks, can include:
For each main task included, an experiment can use 1,000 training examples per task and use all examples from GENERIC-TE when it is included.
In one experiment, the LLAMA2 (13B) model is fine-tuned using instruction-based fine-tuning and a “completions-only” loss, where the loss is computed only on the expected completion (output tokens), although all tokens are attended to. This encourages the model to focus on learning to generate the required output rather than learning to auto-complete parts of the input. One experiment trained all models for one epoch on one A100 instance with eight GPUs using a per-device batch size of eight.
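The completions-only masking can be sketched as follows (the -100 ignore-index convention follows common training frameworks; the token IDs are illustrative):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels(prompt_ids, completion_ids):
    """Completions-only loss: the model attends to all tokens, but the
    cross-entropy loss is computed only over completion positions, so the
    prompt positions are masked out of the labels."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

ids, labels = build_labels([5, 17, 23], [88, 99])
# labels == [-100, -100, -100, 88, 99]
```

Because attention still covers the full sequence, the model conditions on the instructions while the gradient signal comes only from the rewrite it must produce.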
Overall Performance. Tables 3 and 4 show the performance of instruction fine-tuned models.
On all tasks, in every setting, overall performance improves significantly compared to the baseline. As expected, the gains on the seen tasks are the highest. However, when trained on all seen tasks, significant gains on seen tasks may actually result in a performance drop on unseen tasks (see all-main and 1-Inst settings). This suggests that instruction fine-tuning on a small set of tasks adapts the model more aggressively to those specific tasks, thus limiting generalization to new tasks. There is some evidence that single-instruction fine-tuned models can generalize to apply multiple instructions and outperform the baseline (0.51 vs 0.36). Finally, both settings that include the GENERIC-TE dataset significantly boost overall performance, demonstrating that there is value in training on a large and very diverse set of instructions (see miniAP-held-out and all-held-out).
Several writers have addressed instruction fine-tuning (Mishra et al., 2022; Iyer et al., 2022; Chung et al., 2022; Taori et al., 2023). However, work that explicitly evaluates the instruction-following capability of LLMs is relatively scarce (Webson and Pavlick, 2022; Kung et al., 2023; Zeng et al., 2023). Webson and Pavlick (2022) investigate the extent to which prompt-based models understand their prompts. They note that good model predictions are made even when misleading prompts are provided, thus cautioning against attributing model performance to prompt understanding. Similarly, Kung et al. (2023) argue that many instruction fine-tuned models may rely on prompt artifacts like the output format and advise caution against ascribing their superior performance to instruction-following ability.
The Li et al. (2023) paper investigates models' reliance on prior knowledge encoded during training when following instructions (e.g., instructions on classifying text). They show that model performance deteriorates significantly when instructions deviate from the model's prior knowledge, which the present experiments reinforce. However, their work is mainly concerned with the effect of a model's prior knowledge on performance. The present disclosure, in contrast, comprehensively characterizes the performance of LLMs in following text editing instructions and evaluates the impact of many additional factors like model scale, training data, and instruction complexity.
Zeng et al. (2023) examine a related problem: evaluating the efficacy of “LLM evaluators,” in which an LLM and a prompting strategy are used to rank the outputs of other LLMs according to some specific criterion like sensitivity. They propose a benchmark called LMBAR for evaluating such evaluators and conclude by noting that many evaluators vary in their preferred outputs, showing significant room for improvement.
Differing from the above, the present disclosure evaluates LLMs' ability to follow fine-grained text editing instructions across various dimensions in a controlled manner. A benchmark suite of tasks consists of elementary text-editing instructions and text-editing instructions derived from popular style guides. The disclosure addresses the effects of various model parameters, which can guide modeling improvements and practical applications.
The present disclosure introduces a task of “style guide application” and introduces a benchmark for this task. The benchmark comprises an engineered prompt that is capable of causing an LLM to automatically transform a digital electronic document to conform the document to a specified style guide. One example is the AP Style Guide, but others can be used.
The engineered LLM prompt can state instructions of several types. For example, the prompt can state objective instructions like those of TABLE 5. Additionally, or alternatively, the prompt can state instructions that are subjective and/or related to pragmatic intent, like “Keep the audience in mind and make sure to encourage them to respond,” and/or “Make the text witty and engaging to the audience.”
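One way to assemble such an engineered prompt from objective and subjective instructions is sketched below. The function name, wording, and layout are illustrative assumptions, not the actual prompt used in the disclosure.

```python
def build_style_prompt(objective_rules, subjective_rules, document_text):
    """Assemble a hypothetical engineered prompt combining objective
    style-guide rules (e.g., of the kind in TABLE 5) with subjective or
    pragmatic-intent instructions, followed by the document to edit."""
    lines = ["Edit the text below so that it follows every instruction:"]
    # Number all instructions in one list, objective rules first
    for i, rule in enumerate(objective_rules + subjective_rules, start=1):
        lines.append(f"{i}. {rule}")
    lines.append("")
    lines.append("Text:")
    lines.append(document_text)
    lines.append("")
    lines.append("Edited text:")
    return "\n".join(lines)
```

The same builder serves sentence-, paragraph-, or document-level units simply by varying `document_text`.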
Embodiments can operate on document units such as sentences, groups of sentences, paragraphs, or complete documents.
In an embodiment, computer system 900 can be programmed to implement a prompt engineering approach, alone or in combination with retrieval-augmented generation (RAG), parameter-efficient fine-tuning (PEFT) with or without low-rank adaptation (LoRA), and/or full fine-tuning. In an embodiment, a pre-trained large language model (LLM) 904 is trained using a large data corpus 902 and is externally programmatically accessible via an application programming interface (API) 906. Examples of LLMs include GPT and LLAMA models, as previously identified and described. Typically, the LLM is pre-trained and hosted in a computing environment of a first party, and other elements of the computer system 900 are hosted in other computing environments of other parties, but all elements also can be co-hosted, on-prem or in private, shared, or public cloud computing facilities like AMAZON AWS, GOOGLE CLOUD, and MICROSOFT AZURE, as examples.
A document processing application 908 is communicatively coupled via a network 930 to the API 906 and programmed to execute document processing functions including the automatic transformation of digitally stored electronic documents according to style guides, as described herein. The document processing application 908 may be hosted and/or executed using one or more server computers, server clusters, virtual compute and/or virtual storage instances, or a combination thereof, hosted on-prem at an enterprise or in a public or private data center or cloud computing environment.
In an embodiment, one or more user computers 910 are coupled to the document processing application 908 and can select, activate, and use one or more features or functions of the document processing application. Each of the user computers 910 can comprise any of a laptop computer, desktop computer, workstation, tablet computing device, smartphone, smart watch, or any other computer that is practical to use for text composition, editing, and display. In an embodiment, the user computer 910 hosts or executes an application program such as a browser or word processor to create and store an electronic document 911. The electronic document 911 can be locally stored using a storage device of the user computer 910, or stored in a hosted storage environment such as DROPBOX, BOX, GOOGLE DRIVE, or MICROSOFT ONEDRIVE.
For a clear example, assume that the electronic document 911 does not conform to a specified style guide. As further described in other sections, the user computer 910 can invoke functions of the document processing application 908 to present the electronic document 911 to the pre-trained LLM 904, alone or in combination with other relevant documents 914 and an engineered prompt 916. In response, the pre-trained LLM 904 and/or fine-tuned model 920 execute an inference stage over the electronic document 911 and automatically generate output text 913 that transforms the electronic document to conform to the specified style guide. In various embodiments, the output text 913 can be locally stored using a storage device of the user computer 910, stored in a hosted storage environment such as DROPBOX, BOX, GOOGLE DRIVE, or MICROSOFT ONEDRIVE, or stored in memory associated with the document processing application 908.
A knowledge base 912 can digitally store electronic documents, such as text documents that the user computer 910 has created and stored, in states before or after transformation according to a digital style guide. In one embodiment, the document processing application 908 can be programmed to implement search query functions or document query functions by which the user computer 910 can enter or present a query for one or more documents that conform to a specified style guide. In response, the knowledge base 912 creates and returns a result set of relevant documents 914. In one mode of operation, the relevant documents 914 can be input via API 906 to the pre-trained LLM 904 with a request or prompt to fine-tune the LLM, thereby causing the pre-trained LLM to create and return, or digitally store, a fine-tuned model 920. Thereafter, the fine-tuned model 920 can receive an electronic document as input and process a request to automatically transform the electronic document according to the specified style guide. In this manner and based on the experiments that have been previously described, the fine-tuned model 920 will more accurately transform a particular electronic document to conform to the specified style guide.
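The query-then-fine-tune flow above can be sketched as follows. The `knowledge_base` and `llm_api` objects, their method names, and the before/after example format are hypothetical interfaces standing in for knowledge base 912 and API 906; real LLM providers expose analogous query and fine-tuning endpoints.

```python
def fine_tune_on_style_examples(knowledge_base, llm_api, style_guide):
    """Illustrative flow: query the knowledge base for documents relevant
    to the specified style guide, turn each before/after pair into a
    training example, and submit the set through the LLM API to obtain
    a fine-tuned model handle (cf. fine-tuned model 920)."""
    relevant_docs = knowledge_base.query(style_guide=style_guide)
    training_pairs = [
        {"input": doc["before"], "output": doc["after"]}
        for doc in relevant_docs
    ]
    return llm_api.fine_tune(examples=training_pairs)
```

Thereafter, the returned model handle can be used for transformation requests in place of the base pre-trained LLM.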
In an embodiment, the computer system 900 can also operate in a zero-shot mode, in which an engineered prompt 916 is presented to the pre-trained LLM 904 to cause the transformation of an electronic document to conform to a style guide that is partly or fully represented in the text of the engineered prompt. In an embodiment, the engineered prompt 916 shown in
In one embodiment, the document processing application 908 can implement a retrieval-augmented generation (RAG) mode of operation. In the RAG operation, the document processing application 908 is programmed to transmit a query to the knowledge base 912 to retrieve a result set of relevant documents 914 corresponding to prior transformations of documents according to a specified style guide. The document processing application 908 is further programmed to transmit a processing request to the API 906, the processing request comprising:
With this information, the pre-trained LLM 904 will process the input electronic document to be transformed and return automatically generated text that has been transformed according to the style guide. In all the foregoing examples, the specified style guide can comprise one particular, named style guide among a plurality of different available style guides. The document processing application 908 can be programmed to automatically select the RAG approach, fine-tuned training approach, or prompt-based approach based on a plurality of factors such as the availability or cost of using the pre-trained LLM 904 as compared to other available LLMs, a response time parameter that the user computer 910 has entered or selected using the document processing application 908, the length of the input document, and so forth.
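An automatic-selection heuristic of the kind just described might be sketched as below. The priority order, the latency threshold, and the context-budget value are illustrative assumptions, not values from the disclosure.

```python
def select_transformation_approach(fine_tuned_available, kb_has_examples,
                                   max_response_seconds, document_tokens,
                                   context_budget=4000):
    """Hypothetical heuristic for choosing among the fine-tuned, RAG,
    and prompt-based approaches based on the factors described above:
    model availability, a user-selected response-time parameter, and
    input document length."""
    if fine_tuned_available:
        return "fine-tuned"        # most accurate once training is done
    if max_response_seconds is not None and max_response_seconds < 2:
        return "prompt"            # tight latency budget: single zero-shot call
    if kb_has_examples and document_tokens < context_budget:
        # RAG adds retrieved examples to the prompt, so the document
        # must leave room in the context window
        return "rag"
    return "prompt"
```

In practice such a policy could also weigh the relative cost of the pre-trained LLM 904 against other available LLMs, as the disclosure notes.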
Each of the prompt storage 918, the knowledge base 912, and the corpus 902 can be digitally stored using one or more virtual storage instances, digital storage devices, relational databases, object databases, or flat file systems. They may be co-located and co-hosted or independent.
Network 930 broadly represents any combination of one or more local area networks, wide area networks, campus networks, and/or internetworks using any of wired or wireless, satellite or terrestrial communications links. Lines in
According to one embodiment, the techniques described herein are implemented by at least one computing device. The techniques may be implemented in whole or in part using a combination of at least one server computer and/or other computing devices coupled using a network, such as a packet data network. The computing devices may be hard-wired to perform the techniques or may include digital electronic devices such as at least one application-specific integrated circuit (ASIC) or field programmable gate array (FPGA) that is persistently programmed to perform the techniques or may include at least one general purpose hardware processor programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. To accomplish the described techniques, such computing devices may combine custom hard-wired logic, ASICs, or FPGAs with custom programming. The computing devices may be server computers, workstations, personal computers, portable computer systems, handheld devices, mobile computing devices, wearable devices, body-mounted or implantable devices, smartphones, smart appliances, internetworking devices, autonomous or semi-autonomous devices such as robots or unmanned ground or aerial vehicles, any other electronic device that incorporates hard-wired and/or program logic to implement the described techniques, one or more virtual computing machines or instances in a data center, and/or a network of server computers and/or personal computers.
Computer system 1000 includes an input/output (I/O) subsystem 1002, which may include a bus and/or other communication mechanism(s) for communicating information and/or instructions between the components of the computer system 1000 over electronic signal paths. The I/O subsystem 1002 may include an I/O controller, a memory controller, and at least one I/O port. The electronic signal paths are represented schematically in the drawings, such as lines, unidirectional arrows, or bidirectional arrows.
At least one hardware processor 1004 is coupled to I/O subsystem 1002 for processing information and instructions. Hardware processor 1004 may include, for example, a general-purpose microprocessor or microcontroller and/or a special-purpose microprocessor such as an embedded system or a graphics processing unit (GPU), or a digital signal processor or ARM processor. Processor 1004 may comprise an integrated arithmetic logic unit (ALU) or be coupled to a separate ALU.
Computer system 1000 includes one or more units of memory 1006, such as a main memory, coupled to I/O subsystem 1002 for electronically digitally storing data and instructions to be executed by processor 1004. Memory 1006 may include volatile memory such as various forms of random-access memory (RAM) or other dynamic storage device. Memory 1006 also may be used for storing temporary variables or other intermediate information during the execution of instructions to be executed by processor 1004. Such instructions, when stored in non-transitory computer-readable storage media accessible to processor 1004, can render computer system 1000 into a special-purpose machine customized to perform the operations specified in the instructions.
Computer system 1000 includes non-volatile memory such as read-only memory (ROM) 1008 or other static storage devices coupled to I/O subsystem 1002 for storing information and instructions for processor 1004. The ROM 1008 may include various forms of programmable ROM (PROM), such as erasable PROM (EPROM) or electrically erasable PROM (EEPROM). A unit of persistent storage 1010 may include various forms of non-volatile RAM (NVRAM), such as FLASH memory, solid-state storage, magnetic disk, or optical disks such as CD-ROM or DVD-ROM and may be coupled to I/O subsystem 1002 for storing information and instructions. Storage 1010 is an example of a non-transitory computer-readable medium that may be used to store instructions and data, which, when executed by the processor 1004, causes performing computer-implemented methods to execute the techniques herein.
The instructions in memory 1006, ROM 1008, or storage 1010 may comprise one or more instructions organized as modules, methods, objects, functions, routines, or calls. The instructions may be organized as one or more computer programs, operating system services, or application programs, including mobile apps. The instructions may comprise an operating system and/or system software; one or more libraries to support multimedia, programming, or other functions; data protocol instructions or stacks to implement TCP/IP, HTTP, or other communication protocols; file format processing instructions to parse or render files coded using HTML, XML, JPEG, MPEG or PNG; user interface instructions to render or interpret commands for a graphical user interface (GUI), command-line interface or text user interface; application software such as an office suite, internet access applications, design and manufacturing applications, graphics applications, audio applications, software engineering applications, educational applications, games or miscellaneous applications. The instructions may implement a web server, web application server, or web client. The instructions may be organized as a presentation, application, and data storage layer, such as a relational database system using a structured query language (SQL) or no SQL, an object store, a graph database, a flat file system, or other data storage.
Computer system 1000 may be coupled via I/O subsystem 1002 to at least one output device 1012. In one embodiment, output device 1012 is a digital computer display. Examples of a display that may be used in various embodiments include a touchscreen display, a light-emitting diode (LED) display, a liquid crystal display (LCD), or an e-paper display. Computer system 1000 may include other type(s) of output devices 1012, alternatively or in addition to a display device. Examples of other output devices 1012 include printers, ticket printers, plotters, projectors, sound cards or video cards, speakers, buzzers or piezoelectric devices or other audible devices, lamps or LED or LCD indicators, haptic devices, actuators or servos.
At least one input device 1014 is coupled to I/O subsystem 1002 for communicating signals, data, command selections, or gestures to processor 1004. Examples of input devices 1014 include touch screens, microphones, still and video digital cameras, alphanumeric and other keys, keypads, keyboards, graphics tablets, image scanners, joysticks, clocks, switches, buttons, dials, slides, and/or various types of sensors such as force sensors, motion sensors, heat sensors, accelerometers, gyroscopes, and inertial measurement unit (IMU) sensors and/or various types of transceivers such as wireless, such as cellular or Wi-Fi, radio frequency (RF) or infrared (IR) transceivers and Global Positioning System (GPS) transceivers.
Another type of input device is a control device 1016, which may perform cursor control or other automated control functions such as navigation in a graphical interface on a display screen, alternatively or in addition to input functions. The control device 1016 may be a touchpad, a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1004 and for controlling cursor movement on an output device 1012, such as a display. The input device may have at least two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane. Another type of input device is a wired, wireless, or optical control device such as a joystick, wand, console, steering wheel, pedal, gearshift mechanism, or other control device. An input device 1014 may include a combination of multiple input devices, such as a video camera and a depth sensor.
In another embodiment, computer system 1000 may comprise an Internet of Things (IoT) device in which one or more of the output device 1012, input device 1014, and control device 1016 are omitted. Or, in such an embodiment, the input device 1014 may comprise one or more cameras, motion detectors, thermometers, microphones, seismic detectors, other sensors or detectors, measurement devices or encoders, and the output device 1012 may comprise a special-purpose display such as a single-line LED or LCD display, one or more indicators, a display panel, a meter, a valve, a solenoid, an actuator or a servo.
When computer system 1000 is a mobile computing device, input device 1014 may comprise a global positioning system (GPS) receiver coupled to a GPS module that is capable of triangulating to a plurality of GPS satellites, determining and generating geo-location or position data such as latitude-longitude values for a geophysical location of the computer system 1000. Output device 1012 may include hardware, software, firmware, and interfaces for generating position reporting packets, notifications, pulse or heartbeat signals, or other recurring data transmissions that specify a position of the computer system 1000, alone or in combination with other application-specific data, directed toward host computer 1024 or server computer 1030.
Computer system 1000 may implement the techniques described herein using customized hard-wired logic, at least one ASIC or FPGA, firmware, and/or program instructions or logic which, when loaded and used or executed in combination with the computer system, causes or programs the computer system to operate as a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 1000 in response to processor 1004 executing at least one sequence of at least one instruction in main memory 1006. Such instructions may be read into main memory 1006 from another storage medium, such as storage 1010. Execution of the sequences of instructions contained in main memory 1006 causes processor 1004 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term “storage media,” as used herein, refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage 1010. Volatile media includes dynamic memory, such as memory 1006. Common forms of storage media include, for example, a hard disk, solid state drive, flash drive, magnetic data storage medium, any optical or physical data storage medium, memory chip, or the like.
Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire, and fiber optics, including wires that comprise a bus of I/O subsystem 1002. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.
Various forms of media may carry at least one sequence of at least one instruction to processor 1004 for execution. For example, the instructions may initially be carried on a remote computer's magnetic disk or solid-state drive. The remote computer can load the instructions into its dynamic memory and send them over a communication link such as a fiber optic, coaxial cable, or telephone line using a modem. A modem or router local to computer system 1000 can receive the data on the communication link and convert the data to a format that can be read by computer system 1000. For instance, a receiver such as a radio frequency antenna or an infrared detector can receive the data carried in a wireless or optical signal, and appropriate circuitry can provide the data to I/O subsystem 1002, such as placing the data on a bus. I/O subsystem 1002 carries the data to memory 1006, from which processor 1004 retrieves and executes the instructions. The instructions received by memory 1006 may optionally be stored on storage 1010 either before or after execution by processor 1004.
Computer system 1000 also includes a communication interface 1018 coupled to a bus or I/O subsystem 1002. Communication interface 1018 provides a two-way data communication coupling to a network link(s) 1020 directly or indirectly connected to at least one communication network, such as a network 1022 or a public or private cloud on the Internet. For example, communication interface 1018 may be an Ethernet networking interface, integrated-services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of communications line, for example, an Ethernet cable or a metal cable of any kind or a fiber-optic line or a telephone line. Network 1022 broadly represents a local area network (LAN), wide-area network (WAN), campus network, internetwork, or any combination thereof. Communication interface 1018 may comprise a LAN card to provide a data communication connection to a compatible LAN, a cellular radiotelephone interface that is wired to send or receive cellular data according to cellular radiotelephone wireless networking standards, or a satellite radio interface that is wired to send or receive digital data according to satellite wireless networking standards. In any such implementation, communication interface 1018 sends and receives electrical, electromagnetic, or optical signals over signal paths that carry digital data streams representing various types of information.
Network link 1020 typically provides electrical, electromagnetic, or optical data communication directly or through at least one network to other data devices, using, for example, satellite, cellular, Wi-Fi, or BLUETOOTH technology. For example, network link 1020 may connect through network 1022 to a host computer 1024.
Furthermore, network link 1020 may connect through network 1022 or to other computing devices via internetworking devices and/or computers operated by an Internet Service Provider (ISP) 1026. ISP 1026 provides data communication services through a worldwide packet data communication network called Internet 1028. A server computer 1030 may be coupled to Internet 1028. Server computer 1030 broadly represents any computer, data center, virtual machine, or virtual computing instance with or without a hypervisor or computer executing a containerized program system such as DOCKER or KUBERNETES. Server computer 1030 may represent an electronic digital service that is implemented using more than one computer or instance, and that is accessed and used by transmitting web services requests, uniform resource locator (URL) strings with parameters in HTTP payloads, API calls, app services calls, or other service calls. Computer system 1000 and server computer 1030 may form elements of a distributed computing system that includes other computers, a processing cluster, a server farm, or other organizations of computers that cooperate to perform tasks or execute applications or services. Server computer 1030 may comprise one or more instructions organized as modules, methods, objects, functions, routines, or calls. The instructions may be organized as one or more computer programs, operating system services, or application programs, including mobile apps. 
The instructions may comprise an operating system and/or system software; one or more libraries to support multimedia, programming, or other functions; data protocol instructions or stacks to implement TCP/IP, HTTP, or other communication protocols; file format processing instructions to parse or render files coded using HTML, XML, JPEG, MPEG or PNG; user interface instructions to render or interpret commands for a graphical user interface (GUI), command-line interface or text user interface; application software such as an office suite, internet access applications, design and manufacturing applications, graphics applications, audio applications, software engineering applications, educational applications, games or miscellaneous applications. Server computer 1030 may comprise a web application server that hosts a presentation layer, application layer, and data storage layer, such as a relational database system using a structured query language (SQL) or no SQL, an object store, a graph database, a flat file system or other data storage.
Computer system 1000 can send messages and receive data and instructions, including program code, through the network(s), network link 1020, and communication interface 1018. In the Internet example, server computer 1030 might transmit a requested code for an application program through Internet 1028, ISP 1026, local network 1022, and communication interface 1018. The received code may be executed by processor 1004 as it is received and/or stored in storage 1010 or other non-volatile storage for later execution.
The execution of instructions, as described in this section, may implement a process in the form of an instance of a computer program that is being executed and consists of program code and its current activity. Depending on the operating system (OS), a process may comprise multiple threads of execution that execute instructions concurrently. In this context, a computer program is a passive collection of instructions, while a process may be executing those instructions. Several processes may be associated with the same program; for example, opening several instances of the same program often means more than one process is being executed. Multitasking may be implemented to allow multiple processes to share processor 1004. While each processor 1004 or core of the processor executes a single task at a time, computer system 1000 may be programmed to implement multitasking to allow each processor to switch between tasks that are being executed without having to wait for each task to finish. In an embodiment, switches may be performed when tasks perform input/output operations, when a task indicates that it can be switched, or on hardware interrupts. Time-sharing may be implemented to allow fast response for interactive user applications by rapidly performing context switches to provide the appearance of concurrent execution of multiple processes. In an embodiment, for security and reliability, an operating system may prevent direct communication between independent processes, providing strictly mediated and controlled inter-process communication functionality.
In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.
This application claims the benefit under 35 U.S.C. 119 (e) of provisional application 63/609,423, filed Dec. 13, 2023, the entire contents of which are hereby incorporated by reference for all purposes as if fully set forth herein.
| Number | Date | Country |
|---|---|---|
| 63609423 | Dec 2023 | US |