Supervised machine learning (ML) is used widely across industries to derive insights from data and support automated decision systems. Supervised ML models are trained by applying an ML algorithm to a labeled training dataset. Each data example (or element, in the form of variables, characteristics, or “features”) in the training dataset is associated with a label (or annotation) that defines how the element should be classified by the trained model. A trained model can operate on a previously unseen data example to generate a predicted label as an output.
The performance of an ML model is heavily dependent on the quality and quantity of training data used to produce it. If the model is trained on a training dataset where a significant portion of the data examples are labeled incorrectly (for example, due to human misinterpretation during the annotation process), then the model will learn to “predict” or infer the wrong labels and be of lower accuracy and quality. Conversely, if an ML model is trained on a large enough quantity of high-quality data, it will generalize better when considering previously unseen data points. Modern deep learning (DL) models require even larger quantities of high-quality training data than traditional ML models, as they rely on learning vector representations of data points in higher dimensional latent spaces.
The conventional process to create labeled training data sets relies on manual annotation, where a human annotator with expertise in the task the trained model is expected to perform reviews each data example and records a training label. As a result, large, high quality training data sets can be time-consuming and expensive to create, particularly for industry applications that rely on proprietary data. This is especially true for data that requires domain expertise to label (such as identifying pathologies in medical images) or data with privacy constraints (such as data in regulated financial industries). In both cases, the set of viable human annotators is limited, and their time can become prohibitively expensive.
Additionally, ML models frequently need to be retrained on new data sets to reflect changes in business objectives or underlying data distributions. For example, a spam email classifier typically needs to be retrained frequently to identify new spam tactics and patterns of threats, which continue to evolve (and often in response to the behavior of deployed versions of spam detectors).
These factors (individually or in combination) may limit the desire or ability to regularly collect or assemble large, high quality training data sets. In turn, this may disincentivize the initial adoption of ML for new use cases, the extension of existing ML use cases, or generating sufficient updates to existing models in production to maintain a desirable level of performance.
An alternative approach to manual annotation is to label data programmatically. In this approach, knowledge that domain experts would use to generate manual labels (such as text patterns or cross-references with knowledge bases) may be encoded (captured) by programming it in the form of a function, termed a labeling function herein. The labeling function or functions are applied to unlabeled data examples, and the outputs are aggregated into a final set of training labels using an algorithm or ruleset. This process is referred to as “weak supervision”.
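As a non-limiting, hypothetical illustration of this approach (the label values, function names, and majority-vote aggregation rule below are assumptions chosen for clarity rather than a required implementation), a labeling function for a spam-detection task may encode a simple text pattern, and the outputs of several such functions may be aggregated into a training label:

    # Illustrative sketch only; label values and the aggregation rule are assumptions.
    SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

    def lf_contains_wire_transfer(text):
        # Encodes a domain heuristic: requests for wire transfers are often spam.
        return SPAM if "wire transfer" in text.lower() else ABSTAIN

    def lf_known_report_phrase(text):
        # Encodes a cross-reference with a (hypothetical) list of trusted phrases.
        return NOT_SPAM if "quarterly report attached" in text.lower() else ABSTAIN

    def aggregate(votes):
        # Minimal aggregation rule (majority over non-abstaining votes); a production
        # system would typically use a learned label model instead.
        votes = [v for v in votes if v != ABSTAIN]
        return max(set(votes), key=votes.count) if votes else ABSTAIN

    unlabeled = ["Please send a wire transfer today", "Quarterly report attached for review"]
    labels = [aggregate([lf_contains_wire_transfer(x), lf_known_report_phrase(x)])
              for x in unlabeled]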
While this approach can produce large quantities of training data at lower cost and more quickly than manual approaches, it still requires a development process to create a high-quality set of labeling functions. This can be especially time-intensive when working with large data sets consisting of unstructured data (such as plain text, PDF documents, or HTML web pages) as the characteristics of the data cannot be meaningfully summarized without further processing.
Embodiments of the disclosed systems, apparatuses, and methods introduce an approach to semi-automatically generate labels for data based on implementation of a clustering or language model prompting technique. An embodiment can be used to implement a form of programmatic labeling to accelerate the development of classifiers and other forms of models. The disclosed methodology is particularly helpful in generating labels or annotations for unstructured data.
Embodiments are directed to solving one or more disadvantages of conventional approaches to labeling or annotating data for use in training a machine learning model, either alone or in combination.
The terms “invention,” “the invention,” “this invention,” “the present invention,” “the present disclosure,” or “the disclosure” as used herein are intended to refer broadly to the subject matter disclosed in this document, the drawings or figures, and to the claims. Statements containing these terms do not limit the subject matter disclosed or the meaning or scope of the claims. Embodiments covered by this disclosure are defined by the claims and not by this summary. This summary is a high-level overview of various aspects of the disclosure and introduces some of the concepts that are further described in the Detailed Description section below. This summary is not intended to identify key, essential or required features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification, to any or all figures or drawings, and to each claim.
In the context of this disclosure, a classifier is a model or algorithm that is used to segment input data into a category, such as by indicating the likelihood of the presence or absence of some characteristic in the data (where as examples, the data may be text or an image). A classifier may be used to assign an identifying label to a set of input data, where the label may represent a class, category, or characteristic of the data. Classifiers may be used to determine an expected or “predicted” output based on a set of input data. Classifiers may be used in the processing of data sets and may be implemented in the form of trained machine learning (ML) models, deep learning (DL) models, or neural networks. Training requires a set of data items and an associated label or annotation for each data item.
Embodiments of the disclosed systems, apparatuses, and methods introduce an approach to semi-automatically (that is, programmatically) generate labels for data based on implementation of a clustering or language model prompting technique and can be used to implement a form of programmatic labeling to accelerate the development of classifiers and other forms of models. The disclosed methodology is particularly helpful in generating labels or annotations for unstructured data. In some embodiments, the disclosed approach may be used with data in the form of text, images, or other form of unstructured data.
The disclosed methodology is intended to accelerate the development process for programmatic labeling by automatically identifying and visually representing clusters of salient patterns in data sets, or predictions from language models queried with specific input. Humans with domain knowledge can then review these model outputs and use them to programmatically label data.
Embodiments of the disclosure assist in model development by making the labeling of training data faster, while also improving the quality of the resulting training data. Embodiments provide a form of programmatic labeling to transform data labeling from a tedious, static effort done as a precursor to the “real” AI development workflow to a software-like experience that is central (and crucial) to the end-to-end AI workflow.
In one embodiment, the disclosure is directed to a method for automatically generating labels for a set of data used to train a machine learning model. The method may include the following steps, stages, functions, or operations:
In one embodiment, the disclosure is directed to a system for automatically generating labels for a set of data used to train a machine learning model. The system may include a set of computer-executable instructions, a memory or data storage element (such as a non-transitory computer-readable medium) in (or on) which the instructions are stored, and an electronic processor or co-processors. When executed by the processor or co-processors, the instructions cause the processor or co-processors (or a device of which they are part) to perform a set of operations that implement an embodiment of the disclosed method or methods.
In one embodiment, the disclosure is directed to one or more non-transitory computer-readable media including a set of computer-executable instructions, wherein when the set of instructions are executed by an electronic processor or co-processors, the processor or co-processors (or a device of which they are part) performs a set of operations that implement an embodiment of the disclosed method or methods.
Other objects and advantages of the systems, apparatuses, and methods disclosed will be apparent to one of ordinary skill in the art upon review of the detailed description and the included figures. Throughout the drawings, identical reference characters and descriptions indicate similar, but not necessarily identical, elements. While the embodiments disclosed or described herein are susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are described in detail herein. However, embodiments of the disclosure are not limited to the exemplary or specific examples described. Rather, the present disclosure covers all modifications, equivalents, and alternatives falling within the scope of the appended claims.
Embodiments of the disclosure are described with reference to the drawings, in which:
One or more embodiments of the disclosed subject matter are described herein with specificity to meet statutory requirements, but this description does not limit the scope of the claims. The claimed subject matter may be embodied in other ways, may include different elements or steps, and may be used in conjunction with other existing or later developed technologies. The description should not be interpreted as implying any required order or arrangement among or between various steps or elements except when the order of individual steps or arrangement of elements is explicitly noted as being required.
Embodiments of the disclosed subject matter are described more fully herein with reference to the accompanying drawings, which show by way of illustration, example embodiments by which the disclosed systems, apparatuses, and methods may be practiced. However, the disclosure may be embodied in different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy the statutory requirements and convey the scope of the disclosure to those skilled in the art.
Among other forms, the subject matter of the disclosure may be embodied in whole or in part as a system, as one or more methods, or as one or more devices. Embodiments may take the form of a hardware implemented embodiment, a software implemented embodiment, or an embodiment combining software and hardware aspects. In some embodiments, one or more of the operations, functions, processes, or methods disclosed and/or described herein may be implemented by a suitable processing element or elements (such as a processor, microprocessor, CPU, GPU, TPU, QPU, state machine, or controller, as non-limiting examples) that are part of a client device, server, network element, remote platform (such as a SaaS platform), an “in the cloud” service, or other form of computing or data processing system, device, or platform.
The processing element or elements may be programmed with a set of computer-executable instructions (e.g., software instructions), where the instructions may be stored on (or in) one or more suitable non-transitory data storage elements. In some embodiments, the set of instructions may be conveyed to a user over a network (e.g., the Internet) through a transfer of instructions or an application that executes a set of instructions.
In some embodiments, the systems and methods disclosed and/or described herein may provide access to services through a SaaS or multi-tenant platform. The platform provides access to multiple entities, each with a separate account and associated data storage. Each account may correspond to a user, a set of users, an entity, a set or category of entities, a set or category of users, a set or category of data, a specific set of documents, an industry, or an organization, for example. Each account may access one or more services, a set of which are instantiated in their account, and which implement one or more of the methods or functions disclosed and/or described herein. An account may be associated with multiple users. Users within an account may be associated with one or more workspaces to restrict/control access.
In some embodiments, one or more of the operations, functions, processes, or methods disclosed and/or described herein may be implemented by a specialized form of hardware, such as a programmable gate array or application specific integrated circuit (ASIC). An embodiment of the disclosed and/or described methods may be implemented in the form of an application, a sub-routine that is part of a larger application, a “plug-in”, an extension to the functionality of a data processing system or platform, or other suitable form. The following detailed description is, therefore, not to be taken in a limiting sense.
Embodiments of the disclosed approach enable the efficient creation and clustering of embeddings generated from a dataset and use of the formed clusters to programmatically label data. This transforms a large unlabeled and unstructured dataset into labeled training data for use in developing a classifier or other form of machine learning model.
Embodiments of the disclosed approach provide several important benefits. These include the ability to explore and understand data more efficiently (even for cold-start problems), based on insight into semantic clustering of data points using embedding techniques. In addition, embodiments make this insight more actionable with programmatic labeling to intelligently auto-label data at scale (driven in some cases by a user's guidance). Further, training data labeling workflows may be accelerated and efficiently scaled using auto-generated cluster labeling functions which a user can accept and apply with the selection of a user interface element.
In some embodiments, language embedding methods may be used to generate “clusters” of data elements (where the data elements may be words or phrases, field labels, or similar information) that appear to be semantically related. The clusters resulting from a set of training data may vary depending on one or more of the embedding technique used, the metric used to determine similarity for purposes of clustering, or the metric threshold value suggesting that two data elements belong in the same cluster or do not belong in the same cluster (as examples).
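As a non-limiting sketch of this clustering step (the placeholder embedding function, the cosine-similarity metric, and the 0.8 threshold below are illustrative assumptions), data elements may be grouped by embedding them and merging elements whose similarity to an existing cluster exceeds the chosen threshold:

    import numpy as np

    def embed(texts):
        # Placeholder for any language embedding model; random unit vectors are used
        # here only so that the sketch is self-contained.
        rng = np.random.default_rng(0)
        vecs = rng.normal(size=(len(texts), 384))
        return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    def cluster_by_similarity(vectors, threshold=0.8):
        # Greedy single-pass clustering: assign each point to the first cluster whose
        # seed vector is within the cosine-similarity threshold; otherwise start a new cluster.
        seeds, assignments = [], []
        for v in vectors:
            sims = [float(v @ s) for s in seeds]
            if sims and max(sims) >= threshold:
                assignments.append(int(np.argmax(sims)))
            else:
                seeds.append(v)
                assignments.append(len(seeds) - 1)
        return assignments

    texts = ["reset my password", "cannot log in to my account", "requesting a refund"]
    clusters = cluster_by_similarity(embed(texts), threshold=0.8)

As noted above, changing the embedding technique, the similarity metric, or the threshold value will generally change the resulting clusters.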
Each cluster may be examined by a user and assigned a “label”, which is in turn assigned to each data point within the cluster for purposes of training a machine learning model. In some embodiments, a proposed label may be generated automatically and presented to the user for their acceptance or rejection. As a non-limiting example, the label assigned to a cluster may be one that occurs most frequently for datapoints in a cluster. As another example, if an embedding representation of the label itself exists in the same latent space (for example, by embedding a written description associated with the label), then the label assigned to a cluster may be one that is closest to the centroid of the cluster in the latent space. In general, when assigning a label to a cluster, the label is assigned to each of the individual data points to train a machine learning model over the individual data points.
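As a non-limiting illustration of the two label-assignment strategies described above (the data structures below are assumptions), a proposed cluster label may be selected by majority vote over the known labels of datapoints in the cluster, or by choosing the label whose own embedded description lies closest to the cluster centroid:

    import numpy as np
    from collections import Counter

    def majority_label(known_labels_in_cluster):
        # Strategy 1: propose the label that occurs most frequently among the
        # already-labeled datapoints in the cluster.
        return Counter(known_labels_in_cluster).most_common(1)[0][0]

    def centroid_nearest_label(cluster_vectors, label_embeddings):
        # Strategy 2: embed a written description of each candidate label into the same
        # latent space and propose the label closest to the cluster centroid.
        centroid = np.mean(cluster_vectors, axis=0)
        centroid = centroid / np.linalg.norm(centroid)
        scores = {name: float(centroid @ (v / np.linalg.norm(v)))
                  for name, v in label_embeddings.items()}
        return max(scores, key=scores.get)

    proposed = majority_label(["refund", "refund", "login_issue"])   # -> "refund"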
Note that although language-based embedding represents one technique for determining relationships between elements of a set of data, other techniques or methods may also (or instead) be used. The technique chosen may depend on the task for which a model is being trained and/or the form of the available datapoints (text or images, as non-limiting examples). In such embodiments, a closeness or similarity metric may be applied to assist in grouping or clustering the results of the technique. Further, based on the results and a characteristic of the suggested grouping (such as a common category, wording, attribute, or topic, as non-limiting examples), a “label” (or identifier) for a cluster may be generated and suggested to a user.
A team building a model often needs to work with a dataset that they do not know much about. Together with domain experts, the team members may work through individual documents one-by-one to understand the types of labels to apply to elements of a dataset. For many tasks, this is a prerequisite to establishing a label schema for a project.
Work related to this disclosure has indicated that a helpful strategy is to compute embeddings for the data in a dataset and then use those to identify semantically similar groups (or data that is similar in another sense, such as a characteristic of the data). This is especially helpful when a user is not sure where to start with a labeling process. Clustering data using embedding distance (as an example of a metric) can surface “natural” groupings to inform how a user might define (or refine) a label schema.
However, while clustering of generated embeddings is a way to orient oneself while exploring a dataset, it is typically not actionable beyond that stage. Clusters formed from the embeddings are typically correlated with specific classes (such as topics or categories) but are rarely separable or clean enough for labeling ground truth data in bulk, and with a sufficient degree of reliability to be useful. As an example, a user may still face the daunting task of manually labeling tens or even hundreds of thousands of individual data points to provide sufficient training data for a model. In some cases, a user may be able to outsource the labeling work, or use tooling to marginally accelerate the labeling, but even so, a user is constrained by the time it takes to review and label a large number of documents or other forms of text one at a time.
One reason for this problem is that many forms or sets of data are not easily linearly separable by class. If they were, a user could draw a line to separate two classes and be finished with the process. Instead, data from different classes often mix into each other and require classifiers to help separate them. This is because the data is often complicated (for example, images or text), making it difficult to define rules that reliably distinguish data in one class from another, and exceptions may occur. As a result, a classifier is typically needed, and in the manual labeling case, that classifier is a human generating a set of ground truth labels.
Labeled data may be generated or produced by one or more suitable processes, including but not limited to programmatic labeling, manual labeling, or a combination of the two techniques. The automated or programmatic labeling process may include techniques to enable a user to efficiently generate labels based on embeddings, clustering of embeddings, generation of suggested labels based on common attributes of the members of a cluster, or other suitable approach.
As a non-limiting example,
In one embodiment, an example of the workflow is as follows: data is uploaded to a platform; embeddings are computed over that data; Cluster View is used to explore the clustered data and evaluate possible labeling functions (LFs); a subset of these possible LFs are created; the LFs are used to train a model; that model is analyzed for errors; and the errors are corrected by using Cluster View to explore for additional data to label.
Since labeling functions (LFs) are snippets of code, they can be used to encode arbitrary signals, patterns, heuristics, external data resources, noisy labels from crowd-source workers, or weak classifiers, as non-limiting examples. And, because they are code, labeling functions bring the associated software benefits, such as modularity, reusability, and debuggability.
In one embodiment, a process may operate to remove noise from such labels using a data programming approach, comprising one or more of the following steps:
In some embodiments, labeling functions may be considered to implicitly describe a generative model. Given data points x, having unknown labels y that a user wants to predict, in a discriminative approach one would model P(y|x) directly, while in a generative approach one models this as P(x,y)=P(x|y)P(y). In the disclosed and/or described embodiments, one is modeling a process of training data labeling, P(L,y), where L are the labels generated by the labeling functions for objects x, and y are the corresponding (unknown) true labels. By learning a generative model, and directly estimating P(L|y), the process is essentially learning the relative accuracies of the labeling functions based on how they overlap and conflict.
Embodiments then use this estimated generative model over the labeling functions to train a noise-aware version of an end discriminative model. To do so, the generative model infers probabilities over the unknown labels of the training data, and then the process minimizes the expected loss of the discriminative model with respect to these probabilities.
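As a non-limiting sketch of this noise-aware training step (binary labels, a logistic regression discriminative model, and the toy data below are assumptions used only for illustration), the discriminative model may be fit by minimizing the expected loss under the probabilistic labels produced by the generative label model:

    import numpy as np

    def expected_log_loss(probs_pos, scores):
        # Noise-aware loss: the expectation of the log loss under the label-model
        # probabilities P(y=1 | L), rather than under hard ground-truth labels.
        p_hat = 1.0 / (1.0 + np.exp(-scores))
        eps = 1e-9
        return -float(np.mean(probs_pos * np.log(p_hat + eps)
                              + (1 - probs_pos) * np.log(1 - p_hat + eps)))

    def train_noise_aware_logreg(X, probs_pos, lr=0.1, steps=500):
        # Minimal discriminative model (logistic regression) trained against the
        # probabilistic labels inferred by the generative model.
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            p_hat = 1.0 / (1.0 + np.exp(-(X @ w)))
            w -= lr * (X.T @ (p_hat - probs_pos)) / len(X)   # gradient of the expected loss
        return w

    # Toy usage: 2-D features and label-model probabilities for y = 1.
    X = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.0, 1.0]])
    probs_pos = np.array([0.9, 0.8, 0.2, 0.1])
    w = train_noise_aware_logreg(X, probs_pos)
    loss = expected_log_loss(probs_pos, X @ w)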
Estimating the parameters of a generative model can be complicated, especially when there are statistical dependencies between the labeling functions used (either user-expressed or inferred). Work performed by the inventors suggests that given sufficient labeling functions, one can obtain similar asymptotic scaling as with supervised methods in many use cases of interest. The inventors also investigated how the process can learn correlations among the labeling functions without using labeled data and how that can improve performance of a model training process.
A weak supervision interaction model (parts of which are disclosed and/or described herein, and further disclosed and/or described in U.S. patent application Ser. No. 18/214,024) may be extended to other modalities or tasks, including richly formatted data and images, supervising tasks with natural language, and generating labeling functions automatically. Extending the core data programming model is expected to make it possible to specify labeling functions with higher-level interfaces such as natural language, as well as assist in combining labeling functions with other types of weak supervision, such as data augmentation.
Programmatic labeling is an approach to labeling that addresses a bottleneck limiting AI system development: creating high-quality training sets in a way that is scalable, adaptable, and governable. A primary difference between manual labeling and programmatic labeling is the type of input that a user provides. With manual labeling, user input is in the form of individual labels, created one by one. With the disclosed and/or described approach to programmatic labeling, users instead create labeling functions (LFs), which capture labeling rationales; these functions can be applied to large amounts of unlabeled data, and their outputs aggregated, to automatically label large training sets.
Labeling functions are essentially programs that encode the rationale behind a labeling decision, whether that be human insight, an existing organizational resource (such as existing noisy labels or legacy models), or as in cases disclosed and/or described herein and in U.S. patent application Ser. No. 18/214,024, a portion of the embedding space identified as being correlated with a particular class. This approach leads to multiple benefits over manual labeling, including:
Programmatic labeling can be applied to many types of supervised learning problems. As examples, it may be applied to text data (long and short), conversations, time series data, PDFs, images, and videos, as well as other forms of data. The disclosed and/or described “labeling function” is flexible enough that the same workflow and framework applies in most cases. As non-limiting examples, potential use cases may include:
As mentioned, labeling functions (LFs) may be derived from an array of sources, including heuristics (rules, principles, or patterns, as examples) or based on existing knowledge resources (models, crowd-sourced labels, or ontologies, as examples). As a non-limiting example,
Embodiments of the disclosure are directed to systems, apparatuses, and methods for efficiently and reliably generating meaningful labels automatically for a set of training data to be used with a machine learning model. The disclosed approach makes a set of embedding-based clusters derived from a dataset actionable using programmatic labeling powered by labeling functions. These labeling functions may be programs, logic, algorithms, or heuristics that encode the rationale behind a labeling decision. The labeling decision may be based in whole or in part on human insight, an existing organizational resource (such as existing noisy labels or legacy models), or (as disclosed) a portion of the embedding space identified as being correlated with a particular class.
Note that it is not a problem if the labeling functions are noisy, if they label imperfectly, or if they conflict with one another in some places. The disclosed label model will intelligently aggregate and reconcile the labels to auto-label training datasets that are larger and have higher quality labels than an individual source would be expected to produce on its own.
The disclosed approach (referred to as “Cluster View” herein) may be used to create a new labeling function or type of labeling function. The created function type may be used to capture insights from the embeddings and apply them at scale. This is a powerful method to “warm-start” the labeling process. A user can label large swaths of a dataset upfront even before training a first model.
To accelerate the labeling workflow even further, the disclosed technique can auto-assign labels to each cluster using a relatively small amount of ground truth data. From there, a user can accept or reject them, rather than creating them from scratch. This is possible because once the process has identified a group of clusters, it can use the ground truth labels within each cluster to generate an identifier for that cluster. As a result, far fewer ground truth labels are needed than might otherwise be expected to make such an inference.
Creating a Cluster View
When building an application (such as a trained model) in Snorkel Flow (the name given by the assignee to the product or service which includes the disclosed process of automatically generating labels for training data), a user can select a button (or other user interface element) to create a cluster view using embedding techniques applied to a dataset. If a user already has high-value embeddings, those can be introduced into the processing flow. From there, the process may use “smart” clustering algorithms to improve and accelerate the clustering process. For example, Snorkel Flow identifies meaningful groups of data and displays them using an interactive data map (such as illustrated by
In addition to the data map, a user is provided data-driven “cards” of information for each cluster (such as illustrated by
Even more so than with image data, understanding a set of text documents is a difficult problem; for example, in contrast to images, there is no “thumbnail” view that is easy to scan and evaluate. In one embodiment, the disclosed processing flow addresses this in two ways.
First, the disclosed approach uses text mining strategies (such as counting n-gram frequencies for n=1 to n=3) to identify salient n-grams that distinguish a cluster of data from one or more of the others. Second, a user can review relevant snippets of individual documents in the same UI pane. This keeps a user's data front-and-center throughout the AI development workflow.
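As a non-limiting sketch of this first strategy (the simple count-ratio score below is an assumption; TF-IDF or log-odds scores may be used instead), salient n-grams for a cluster may be found by counting 1- to 3-gram frequencies inside the cluster and comparing them against the remaining documents:

    from collections import Counter

    def ngrams(text, n_max=3):
        # Word n-grams for n = 1..3, as described above.
        tokens = text.lower().split()
        return [" ".join(tokens[i:i + n])
                for n in range(1, n_max + 1)
                for i in range(len(tokens) - n + 1)]

    def salient_ngrams(cluster_docs, other_docs, top_k=10):
        # Rank n-grams by how much more frequent they are inside the cluster than
        # elsewhere (add-one smoothing on the outside counts).
        inside = Counter(g for d in cluster_docs for g in ngrams(d))
        outside = Counter(g for d in other_docs for g in ngrams(d))
        score = {g: c / (outside.get(g, 0) + 1) for g, c in inside.items()}
        return sorted(score, key=score.get, reverse=True)[:top_k]

    salient = salient_ngrams(["reset my password please", "forgot my password link"],
                             ["refund for my order", "cancel my subscription"])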
Beyond the initially generated clusters, a user can explore the data more granularly using a search functionality to filter on data points that match certain queries. For example, a user can inspect the embeddings for all documents that contain a certain keyword or match a given regular expression. As the user inspects and evaluates the data to develop a better understanding, the clusters are automatically recomputed to show the user the new distribution of the filtered documents across the clusters.
The re-clustering process uses the existing clustering algorithms but operates over the filtered set of data. Because of how clustering is dependent on the similarity between documents, if one re-runs the same algorithm on a subset of data, the clusters assigned to data points may be different than the originally assigned clusters.
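As a non-limiting sketch of this filter-and-recluster behavior (the regular-expression query and the clustering callable are assumptions; any clustering routine, such as the one sketched earlier, may be supplied), the search filter restricts the working set before the same clustering algorithm is re-run:

    import re

    def filter_and_recluster(docs, embeddings, pattern, cluster_fn):
        # Keep only the documents matching the query, then re-run the same clustering
        # routine over the filtered subset; assignments may differ from the original run.
        keep = [i for i, d in enumerate(docs) if re.search(pattern, d)]
        filtered_docs = [docs[i] for i in keep]
        filtered_vecs = [embeddings[i] for i in keep]
        return filtered_docs, cluster_fn(filtered_vecs)

    # Example with a trivial placeholder clustering callable.
    docs = ["reset my password", "refund my order", "password reset link broken"]
    vecs = [[0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
    subset, new_clusters = filter_and_recluster(docs, vecs, r"password",
                                                lambda v: [0] * len(v))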
One or more of the preceding steps or stages of the processing flow for the dataset have made exploration of the data from the embeddings more powerful, transparent, and granular. A next stage is to make the results actionable.
From Insight to Action
While data exploration and understanding of a dataset using the disclosed processing flow is valuable on its own, pairing Cluster View with the programmatic labeling technology developed by the assignee provides even greater benefits. For each of the clusters, the programmatic labeling process flow can use a relatively small amount of ground truth data (as an example, hundreds instead of thousands of labeled documents) to auto-generate cluster labeling functions (LFs) that a user can review and choose to accept for use as sources of weak supervision to label training data en masse. For example, in one embodiment, data is grouped into clusters, and a classifier is trained for each cluster. Each classifier is thus a form or example of a cluster labeling function.
The proposed clusters are parameterized so that new data points that are added to the dataset in the future can be identified as belonging to that part of the embedding space. In one embodiment, this parameterization process is the SVM/classifier training process described herein, and the parameters are those that define the classifier. The resulting parameterization is the set of classifier parameters, and the “clusters” are defined by whether a classifier decides a new datapoint is in the cluster or not.
These parameterizations are “intelligently” selected and go beyond simple centroid or distance-based approaches, which may suffer from the curse of dimensionality and tend to underperform in the higher dimensional spaces typical of unstructured text. Instead of a rule-based system determining whether a new point belongs in a cluster, the disclosed process uses a classifier to help make that determination. This is typically more accurate, as classifiers can learn subtle patterns that help deal with data that is not obviously separable. This is helpful because text is often represented in the form of embeddings in a high dimensional space, so two points that are far from each other might still belong in the same cluster. This type of data might be mis-labeled using a simple rule-based approach.
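As a non-limiting sketch of this classifier-based parameterization (the use of scikit-learn, a linear SVM, and the margin threshold are assumptions for illustration), one classifier may be fit per cluster on a small amount of ground-truth data, and the resulting cluster labeling function assigns a new point to a cluster only when a classifier is sufficiently confident:

    import numpy as np
    from sklearn.svm import SVC

    ABSTAIN = -1

    def fit_cluster_lfs(embeddings, cluster_ids):
        # One linear classifier per cluster (one-vs-rest); the fitted weights are the
        # "parameterization" of the cluster referred to above.
        models = {}
        for c in set(cluster_ids):
            y = np.array([1 if cid == c else 0 for cid in cluster_ids])
            models[c] = SVC(kernel="linear").fit(embeddings, y)
        return models

    def cluster_lf(models, x, margin=0.0):
        # Labeling function for a new point: vote for the cluster whose classifier
        # returns the largest positive margin, or abstain if no margin is positive.
        scores = {c: float(m.decision_function(x.reshape(1, -1))[0])
                  for c, m in models.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > margin else ABSTAIN

    # Toy usage with two-dimensional "embeddings" and two clusters.
    X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]])
    models = fit_cluster_lfs(X, [0, 0, 1, 1])
    assignment = cluster_lf(models, np.array([0.05, 0.95]))   # expected: cluster 0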
To inform a user's decision about whether to save an auto-generated cluster labeling function, a user may use their “expert” judgment and insight into each cluster as well as the estimated precision and coverage of that proposed labeling function (which may be provided automatically). The same auto-generated labeling function option is available for filtered views of the proposed clusters as well, allowing a user to easily create targeted, granular labeling functions.
The auto-generated labeling functions provide a mechanism to bootstrap a labeling effort, and the insights from cluster exploration may provide inspiration for additional labeling functions that are useful for the dataset or for a different dataset.
In some embodiments, the disclosed and/or described processing flow takes a large, unstructured dataset of complex text (or other type of) documents and provides a visualization of embedding-based clustering. A user can inspect each cluster to understand the meaning behind it and explore explicit data points. A user can filter the proposed clusters using the search functionality to see how specific “slices” or segments of data distribute across clusters and uncover additional nuance to the dataset.
As a user explores and understands the proposed clusters, they can take informed actions by saving and applying auto-generated labeling functions that are used to programmatically label a dataset. This can be followed by continuing with the core functionality of the overall workflow to label data, generate a trained model, and adapt. This includes using feedback from one or more forms of model-based error analysis to identify error modes and iterate programmatically to improve the labeling and value of the data (as illustrated by
Clustering embeddings is a powerful way to visualize semantic similarities across a global view of a dataset, especially when that data is complex. For AI/ML teams who need to better understand unstructured text to build high-performing applications and process flows, these visualizations surface insights that might otherwise be difficult to discover. Yet, while clustering embeddings is a way to glean directional insights or identify ways to explore data, it is often unclear what rationale is behind a given cluster, or how to act on the meaning. As a result, embeddings have largely been considered “black box” artifacts; they are interesting to contemplate, but do not always concretely move projects forward.
In contrast, the disclosed process flow (Cluster View and the associated functions) is designed and implemented in a way to “unlock” the value of embeddings by providing one or more of the following features and benefits:
As a data-centric AI platform, the goals behind Cluster View are to strengthen data exploration and understanding and make data labeling programmatic rather than manual. The disclosed approach is also intended to make these workflows as efficient as possible to reduce overhead and increase the pace of delivery for enterprise customers.
Once a set of clusters have been created, a user can explore them at varying levels of detail to understand what the rationale behind a grouping is and whether it is intuitive based on the user's knowledge of the data and task. As mentioned, understanding groups of text documents is a difficult problem. To address this obstacle, embodiments may use text mining strategies to identify salient, discriminative text that distinguishes a cluster of documents from those in other clusters. A user can also “drill” deeper by reviewing relevant snippets of individual documents directly in the same (or an adjacent) UI pane.
Importantly, a user can rely on their own experience and “expert” judgment and insight into each cluster to decide whether to save each auto-generated labeling function (LF). In addition, in some embodiments, the platform provides an estimated precision and coverage for each suggested labeling function. The same LF creation process is available on filtered views of a cluster, allowing a user to create targeted, granular labeling functions.
Embodiments permit a user to inspect each of a set of proposed clusters to understand the “meaning” inherent in the clustering and explore explicit data points. A user can filter the clusters using a search functionality to see how specific slices of data are distributed across clusters. This can assist in discovering more subtle aspects of a dataset and the relationships between data and the clusters. As a user develops a greater understanding of the clusters and their contents, the user can take informed action by saving and applying auto-generated labeling functions that are used to programmatically label a dataset. Next, as mentioned, the core workflow processes of label, model, and adapt are executed (as suggested by
As mentioned,
In general, an embodiment may be implemented using a set of software instructions that are executed by a suitably programmed processing element (such as a GPU, CPU, TPU, QPU, microprocessor, processor, controller, state machine, or computing device, as non-limiting examples). In a complex application or system such instructions are typically arranged into “modules” with each such module typically performing a specific task, process, function, or operation. The entire set of modules may be controlled or coordinated in their operation by an operating system (OS) or other form of organizational platform.
Each application module or sub-module may correspond to a particular function, method, process, or operation that is implemented by the module or sub-module. Such function, method, process, or operation may include those used to implement one or more aspects of the disclosed and/or described systems, apparatuses, and methods.
The modules and/or sub-modules may include a suitable computer-executable code or set of instructions, such as computer-executable code corresponding to a programming language. For example, programming language source code may be compiled into computer-executable code. Alternatively, or in addition, the programming language may be an interpreted programming language such as a scripting language.
A module may contain instructions that are executed by a processor contained in more than one of a server, client device, network element, system, platform, or other component. In some embodiments, a plurality of electronic processors, with each being part of a separate device, server, platform, or system may be responsible for executing all or a portion of the software instructions contained in an illustrated module. Thus, although
As shown in
Modules 402 may contain one or more sets of instructions for performing a method, operation, process, or function described with reference to the Figures, and/or the disclosure provided in the specification. These modules may include those illustrated but may also include a greater number or fewer number than those illustrated. Further, the modules and the set of computer-executable instructions that are contained in the modules may be executed (in whole or in part) by the same processor or by more than a single processor. If executed by more than a single processor, the co-processors may be contained in different devices, for example a processor in a client device and a processor in a server.
Modules 402 are stored in a (non-transitory) memory 420, which typically includes an Operating System module 404 that contains instructions used (among other functions) to access and control the execution of the instructions contained in other modules. The modules 402 in memory 420 are accessed for purposes of transferring data and executing instructions by use of a “bus” or communications line 416, which also serves to permit processor(s) 430 to communicate with the modules for purposes of accessing and executing instructions. Bus or communications line 416 also permits processor(s) 430 to interact with other elements of system 400, such as input or output devices 422, communications elements 424 for exchanging data and information with devices external to system 400, and additional memory devices 426.
Each module or sub-module may correspond to a specific function, method, process, or operation that is implemented by execution of the instructions (in whole or in part) in the module or sub-module. Each module or sub-module may contain a set of computer-executable instructions that when executed by a programmed processor or co-processors cause the processor or co-processors (or a device, devices, server, or servers in which they are contained) to perform the specific function, method, process, or operation. As mentioned, an apparatus in which a processor or co-processor is contained may be one or both of a client device or a remote server or platform. Therefore, a module may contain instructions that are executed (in whole or in part) by the client device, the server or platform, or both. Such function, method, process, or operation may include those used to implement one or more aspects of the disclosed system and methods, such as for:
In some embodiments, the functionality and services provided by the system, apparatuses, and methods disclosed herein may be made available to multiple users by accessing an account maintained by a server or service platform. Such a server or service platform may be termed a form of Software-as-a-Service (SaaS).
The platform provides access to multiple entities, each with a separate account and associated data storage. Each account may correspond to a user, a set of users, an entity, a set or category of entities, a set or category of users, a set or category of data, a specific set of documents, an industry, or an organization, for example. Each account may access one or more services, a set of which are instantiated in their account, and which implement one or more of the methods or functions disclosed and/or described herein. An account may be associated with multiple users. Users within an account may be associated with one or more workspaces to restrict/control access.
In some embodiments, the system or services disclosed herein may be implemented as microservices, processes, workflows or functions performed in response to the submission of a set of input data. The microservices, processes, workflows or functions may be performed by a server, data processing element, platform, or system. In some embodiments, the data analysis and other services may be provided by a service platform located “in the cloud”. In such embodiments, the platform may be accessible through APIs and SDKs. The functions, processes and capabilities disclosed herein and described with reference to one or more of the Figures may be provided as microservices within the platform. The interfaces to the microservices may be defined by REST and GraphQL endpoints. An administrative console may allow users or an administrator to securely access the underlying request and response data, manage accounts and access, and in some cases, modify the processing workflow or configuration.
Note that although
System 510, which may be hosted by a third party, may include a set of data processing and other services 512 to assist in automatically generating labels for training data for use in training a model or system, and a web interface server 514, coupled as shown in
Services 512 may include one or more functions or operations for the processing of a set of data, generating representations of the datapoints, forming clusters from the generated representations, and generating labeling functions/labels for data to be used to train a model.
As examples, in some embodiments, the set of functions, operations or services made available through the platform or system 510 may include:
The platform or system shown in
Examples of suitable computing devices include personal computers, server computers 604, desktop computers 606, laptop computers 607, notebook computers, tablet computers or personal digital assistants (PDAs) 610, smart phones 612, cell phones, and consumer electronic devices incorporating one or more computing device components (such as one or more electronic processors, microprocessors, central processing units (CPU), TPUs, GPUs, QPUs, state machines, or controllers). Examples of suitable networks 614 include networks utilizing wired and/or wireless communication technologies and networks operating in accordance with any suitable networking and/or communication protocol (e.g., the Internet).
The distributed computing service/platform (which may be referred to as a multi-tenant data processing platform) 608 may include multiple processing layers or tiers, including a user interface tier 616, an application server tier 620, and a data storage tier 624. The user interface tier 616 may maintain multiple user interfaces 617, including graphical user interfaces and/or web-based interfaces. The user interfaces may include a default user interface for the service to provide access to applications and data for a user or “tenant” of the service (depicted as “Service UI” in the figure), as well as one or more user interfaces that have been specialized/customized in accordance with user specific requirements (e.g., represented by “Tenant A UI”, . . . , “Tenant Z UI” in the figure, and which may be accessed via one or more APIs).
The default user interface may include user interface components enabling a tenant to administer the tenant's access to and use of the functions and capabilities provided by the service platform. This may include accessing tenant data, launching an instantiation of a specific application, or causing the execution of specific data processing operations, as examples.
Each application server or processing element 622 shown in the figure may be implemented with a set of computers and/or components including computer servers and processors, and may perform various functions, methods, processes, or operations as determined by the execution of a software application or set of computer-executable instructions. The data storage tier 624 may include one or more datastores, which may include a Service Datastore 625 and one or more Tenant Datastores 626. Datastores may be implemented with a suitable data storage technology, including structured query language (SQL) based relational database management systems (RDBMS).
Service Platform 608 may be multi-tenant and may be operated by an entity to provide multiple tenants with a set of business-related or other data processing applications, data storage, and functionality. For example, the applications and functionality may include providing web-based access to the functionality used by a business to provide services to end-users, thereby allowing a user with a browser and an Internet or intranet connection to view, enter, process, or modify certain types of information. Such functions or applications are typically implemented by the execution of one or more modules of software code/instructions by one or more servers 622 that are part of the platform's Application Server Tier 620. As noted with regards to
Rather than build and maintain such a platform or system themselves, a business may utilize systems provided by a third party. A third party may implement a system/platform as disclosed herein in the context of a multi-tenant platform, where individual instantiations of a business' data processing workflow (such as the clustering and programmatic labeling services and processing disclosed herein) are provided to users, with each business representing a tenant of the platform. One advantage to such multi-tenant platforms is the ability for each tenant to customize their instantiation of the data processing workflow to that tenant's specific needs or operational methods. Each tenant may be a business or entity that uses the multi-tenant platform to provide services and functionality to multiple users.
As noted,
Examples of graphical user interface elements include buttons, menus, checkboxes, drop-down lists, scrollbars, sliders, spinners, text boxes, icons, labels, progress bars, status bars, toolbars, windows, hyperlinks, and dialog boxes. Application programming interfaces may be local or remote and may include interface elements such as parameterized procedure calls, programmatic objects, and messaging protocols.
The application layer 710 may include one or more application modules 711, each having one or more sub-modules 712. Each application module 711 or sub-module 712 may correspond to a function, method, process, or operation that is implemented by the execution of instructions contained in the module or sub-module (e.g., a function or process related to providing data processing and services to a user of the platform). Such function, method, process, or operation may include those used to implement one or more aspects of the disclosed system and methods, such as for one or more of the processes or functions described with reference to the Figures and/or disclosed or described in the specification:
The application modules and/or sub-modules may include any suitable computer-executable code or set of instructions (e.g., as would be executed by a suitably programmed processor, microprocessor, GPU, TPU, QPU, state machine, or CPU, as examples), such as computer-executable code corresponding to a programming language. For example, programming language source code may be compiled into computer-executable code. Alternatively, or in addition, the programming language may be an interpreted programming language such as a scripting language. Each application server (e.g., as represented by element 622 of
The data storage layer 720 may include one or more data objects 722 each having one or more data object components 721, such as attributes and/or behaviors. For example, the data objects may correspond to tables of a relational database, and the data object components may correspond to columns or fields of such tables. Alternatively, or in addition, the data objects may correspond to data records having fields and associated services. Alternatively, or in addition, the data objects may correspond to persistent instances of programmatic data objects, such as structures and classes. Each datastore in the data storage layer may include each data object. Alternatively, different datastores may include different sets of data objects. Such sets may be disjoint or overlapping.
Note that the example computing environments depicted in
Although embodiments of the disclosure have been described with reference to use of a set of one or more trained models or classifiers to generate labels as part of a programmatic labeling approach, the inventors have also developed an approach based on another form of representation model. This alternative approach is based on prompting a foundation (or large language) model to label individual data points, optionally with consideration of user-specified context.
Powerful new large language or foundation models such as GPT-3.5, GPT-4, FLAN-T5, and others have become very popular and are being used beyond technical practitioners, thanks to capabilities related to text generation, image synthesis, and other modes of creating content. However, enterprises face fundamental barriers to using these foundation models on real, scaled, high-value use cases. These barriers or challenges include (a) adaptation to complex, domain-specific tasks and (b) deployment within existing cost and governance controls.
In some embodiments, one or more of the disclosed and/or described processes, operations, or functions may be used as part of training or using one or more large language models (LLMs) or other form of foundation model. This may enhance the ability to use a foundation model for a specific task and permits the introduction of foundation models into enterprises and as solutions to important business problems.
Massively scaled, self-supervised foundation models may be used to accomplish generative tasks, such as producing long-form text, images, and even videos, with remarkable realism and flexibility, using (in some cases) multi-billions of parameters learned over terabytes of data (typically in the form of open web text). However, enterprise organizations (e.g., a large US bank, government agency, or healthcare system) are reluctant to deploy foundation models in production for critical applications. This is at least in part because enterprises face two key challenges in using modern foundation models: (1) adaptation to complex, domain-specific tasks; and (2) deployment within governance and cost constraints.
Unlike generative or simpler, generic tasks, real-world enterprise use cases require fine-tuning foundation models on relatively large, labeled datasets. However, creating high-quality training datasets with traditional, manual data labeling approaches is unrealistically slow and expensive. Foundation models are costly and slow to maintain and serve in production. Even more importantly, there are likely to be significant governance challenges given their complex, emergent behaviors, which researchers are just beginning to understand. This is especially true for critical use cases and within regulated industries (e.g., health or finance).
Unfortunately, existing foundation model solutions do not solve these critical adaptation and deployment problems, which often prevent use on real enterprise problems. Typically, existing solutions relegate the task of adaptation to the user (for example, by requiring advanced prompting techniques or extensive manual data labeling for fine-tuning) and do not support deployment options beyond direct inference APIs, which currently are not feasible for most high-value enterprise use cases.
Given that foundation model advances are increasingly achieved with data-centric approaches from the ground up, it is not surprising that data (and data development) is a key to solving the major challenges of adapting and deploying such models in enterprise settings. When adapting foundation models to be performant in production on complex predictive tasks, fine-tuning on custom-labeled training data is critical from initial self-supervised training to final target task fine-tuning.
However, as disclosed and/or described herein, a solution to this obstacle is to use foundation models to power the data-centric development of smaller, more targeted deployment models, an approach which results have shown can deliver even higher accuracy.
In one sense, foundation models are powerful “generalists” that typically require adaptation for “specialist” problems. A standard technique for doing this is referred to as fine-tuning, in which a foundation model is partially retrained on labeled training data, which needs to be updated and relabeled every time there is a shift in the input data or output objectives. In this regard, embodiments support fine-tuning models and offer a faster way to label and develop the training data needed using programmatic labeling and data-centric development techniques.
The disclosed and/or described Data-Centric Foundation Model Development capabilities enable enterprise AI/ML teams to overcome challenges that may be preventing them from using foundation models. In some embodiments, this is accomplished by:
In some enterprise settings, fine-tuning and deploying a large foundation model to production is not an option. In some embodiments, users can instead apply the power of foundation models to accelerate labeling and development of training data. They then use this high-quality dataset to train smaller models that can be deployed within existing cost and governance controls, and on existing MLOps infrastructure. In some embodiments, this Data-Centric Foundation Model Development workflow is enabled by the disclosed Foundation Model Warm Start and Prompt Builder features.
Foundation Model Warm Start is a feature for first pass, auto-labeling powered by foundation models, combined with state-of-the-art zero- and few-shot learning techniques. Warm Start provides a way of distilling the relevant information from foundation models to quickly jumpstart development on a new problem. Warm Start uses class names and descriptions (where available), and if a user selects few-shot learning, a small amount of ground truth data to auto-label the simpler or easier parts of a set of training data. This provides a jump-start to refine and adapt low-confidence slices of a dataset using other aspects of the disclosed data-centric AI workflow. As a non-limiting example, in one embodiment, this feature or process flow may be implemented by the following steps, stages, functions, or operations:
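Independent of those enumerated steps, a non-limiting sketch of a zero-shot variant of this auto-labeling is shown below (the similarity-based scoring, the confidence threshold, and the placeholder encoder are assumptions; an actual foundation model would provide the representations or predictions):

    import numpy as np

    def embed(texts):
        # Placeholder for a foundation-model text encoder; random unit vectors keep
        # the sketch self-contained.
        rng = np.random.default_rng(1)
        v = rng.normal(size=(len(texts), 256))
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    def warm_start_labels(docs, class_descriptions, min_confidence=0.6):
        # Zero-shot auto-labeling: score each document against each class name and
        # description, keep only high-confidence assignments, and leave the
        # low-confidence remainder unlabeled for later refinement.
        names = list(class_descriptions)
        doc_vecs = embed(docs)
        class_vecs = embed([class_descriptions[n] for n in names])
        labels = []
        for dv in doc_vecs:
            sims = class_vecs @ dv
            probs = np.exp(sims) / np.exp(sims).sum()
            labels.append(names[int(np.argmax(probs))]
                          if float(probs.max()) >= min_confidence else None)
        return labels

    labels = warm_start_labels(
        ["my card was charged twice", "cannot reset my password"],
        {"billing": "questions about charges, payments, and refunds",
         "account_access": "login, password, and account lockout issues"})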
Foundation models may make mistakes on complex, real-world problems. The same applies to Warm Start, which is why it is viewed as a starting point. Foundation Model Prompt Builder is a feature or capability that provides a solution to this problem. Prompt Builder provides an interface for users to develop and refine prompts, viewing the labeling output of the foundation model on a sample of their data points.
In addition, Prompt Builder may provide a mechanism for a user to define custom code to translate a model's raw output to one of the predefined labels. This might be used in situations where the foundation model does not reliably respond with one of the predefined labels, or more nuanced logic is otherwise necessary. As non-limiting examples, implementation of a mapping or conversion process may include one or more of:
With the Foundation Model Prompt Builder feature/functionality, a user can more efficiently query foundation models with specific questions or prompts to extract domain-relevant knowledge. For example, to use the canonical example of a spam classifier, one might auto-label some types of spam with Warm Start, then use the Prompt Builder to create a more targeted prompt labeling function (LF) asking “Is this email asking for my password?” In doing so, a user conveys domain knowledge about a particular type of spam (phishing).
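As an illustrative sketch only, such a prompt labeling function might be expressed as a small function that sends the prompt to whatever foundation model interface is available (represented here by a hypothetical `query_model` callable, which is an assumption rather than an interface of the disclosed system) and maps the model's response to a label or an abstention:

```python
# Hypothetical prompt-based labeling function (LF) for the phishing example.
# `query_model` stands in for an arbitrary foundation-model API call.
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1
PROMPT = 'Is this email asking for my password? Answer "yes" or "no".\n\nEmail:\n{email}'

def phishing_prompt_lf(email: str, query_model) -> int:
    """Label an email as SPAM if the model says it asks for a password."""
    response = query_model(PROMPT.format(email=email)).strip().lower()
    if response.startswith("yes"):
        return SPAM
    if response.startswith("no"):
        return NOT_SPAM
    return ABSTAIN  # any other response is treated as an abstention

# Stand-in "model" used only so the sketch runs; a real LF would call an LLM.
fake_model = lambda prompt: "yes" if "reset your password" in prompt.lower() else "no"
print(phishing_prompt_lf("Please reset your password using the link below.", fake_model))  # -> 1
print(phishing_prompt_lf("Lunch on Friday?", fake_model))                                  # -> 0
```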
As a non-limiting example, in one embodiment, the Foundation Model prompt builder feature or process flow may be implemented by the following steps, stages, features, functions, or operations:
In some embodiments, prompts may be considered a type of labeling function. They fit into the disclosed data-centric workflow, which supports combining, tuning, and integrating their outputs using theoretically grounded modeling and (in some cases) human-in-the-loop error analysis techniques. With the combination of the Prompt Builder feature and functionality and the data-centric AI development loop, users can identify specific Foundation Model errors, correct them, and refine the model's output by pruning and modifying existing prompts and developing new ones.
The labeling functions created by the disclosed Warm Start and Prompt Builder capabilities can be combined with other sources of supervision, which are also expressed as labeling functions. While foundation models represent an advancement for AI, as mentioned, they are generalists. However, with the Warm Start and Prompt Builder processes as implemented within the disclosed processing flow, instead of relying on a single foundation model, a user may combine multiple such prompts and multiple such models with the user's enterprise knowledge sources as inputs to a programmatic labeling process. Using multiple models may be desirable if certain models perform better at labeling certain classes or subsets of data points. Examples of such enterprise knowledge sources may include previously labeled data (even if imperfect), heuristics from subject matter experts, business logic, and other sources of enterprise data and information. These inputs are intelligently combined and reconciled in the labeling process flow using the weak supervision algorithms implemented by the disclosed and/or described process flow(s).
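As a greatly simplified, non-authoritative stand-in for that aggregation step (the disclosed weak supervision algorithms are more sophisticated), the following sketch combines the outputs of several labeling functions by majority vote, ignoring abstentions:

```python
# Greatly simplified stand-in for weak-supervision aggregation: majority vote
# over labeling-function outputs, with -1 denoting an abstention.
import numpy as np

ABSTAIN = -1

def majority_vote(lf_matrix: np.ndarray) -> np.ndarray:
    """lf_matrix: (n_examples, n_lfs) array of labels, with -1 for abstain."""
    aggregated = []
    for row in lf_matrix:
        votes = row[row != ABSTAIN]
        if votes.size == 0:
            aggregated.append(ABSTAIN)  # no LF weighed in on this example
        else:
            values, counts = np.unique(votes, return_counts=True)
            aggregated.append(int(values[np.argmax(counts)]))
    return np.array(aggregated)

# Rows are data points; columns are outputs of three hypothetical LFs.
L = np.array([[1, 1, -1],
              [0, -1, 0],
              [-1, -1, -1]])
print(majority_vote(L))   # -> [ 1  0 -1]
```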
With regard to the labeling functions (LFs), as non-limiting examples, pattern-based labeling functions may be used to correct domain-specific errors, manual annotations may be used to correct tricky slices of the data, and a range of other programmatic labeling techniques may be used to automatically generate clean, unified training data.
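As non-authoritative illustrations of such pattern-based labeling functions (the regular expressions and label conventions below are hypothetical examples for a spam task, not patterns required by the disclosure), consider:

```python
# Illustrative pattern-based labeling functions; the regexes are hypothetical.
import re

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_free_offer(text: str) -> int:
    # Promotional "free offer/prize/winner" language is labeled as spam.
    return SPAM if re.search(r"\bfree\b.*\b(offer|prize|winner)\b", text, re.I) else ABSTAIN

def lf_internal_ticket(text: str) -> int:
    # Emails referencing an internal ticketing convention are assumed legitimate.
    return NOT_SPAM if re.search(r"\bTICKET-\d{4,}\b", text) else ABSTAIN

print(lf_free_offer("Claim your FREE prize today"))   # -> 1
print(lf_internal_ticket("Re: TICKET-20931 deploy"))  # -> 0
```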
Using one or more of the disclosed techniques, high-quality labeled training data can be generated and used to train a model for deployment. Further, the resulting trained model is often a dramatically smaller, more accurate model that can be deployed within an existing governance and MLOps environment.
As an example, once a user has labeled training data (for example, using Foundation Model Warm Start, Prompt Builder, and/or other data labeling and iteration capabilities of the disclosed process flow), the resulting training data can be used to train a model for deployment. As one example, this can result in models that are more than 1000× smaller and more accurate, and that can be deployed in an existing governance and MLOps environment.
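As an illustrative sketch of this step (scikit-learn, a TF-IDF featurizer, and a linear classifier are assumptions chosen for brevity rather than the disclosed deployment model), a compact model can be trained directly on the programmatically labeled text:

```python
# Minimal sketch of training a small deployment model on programmatically
# labeled text; the library and model choices are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Claim your FREE prize today", "Agenda for Tuesday's planning meeting",
         "Please reset your password here", "Quarterly report attached"]
labels = [1, 0, 1, 0]   # labels produced by the programmatic labeling step

small_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
small_model.fit(texts, labels)
print(small_model.predict(["Free offer: verify your password"]))
```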
The disclosed set of Foundation Model training and implementation capabilities provides a relatively fast, efficient, and effective way for AI/ML teams to put foundation models to use. For some projects, this means fine-tuning a foundation model for production dramatically faster by programmatically labeling training data. For others, the optimal solution may use the disclosed process flow's distill, combine, and correct approach to extract the most relevant knowledge from foundation models and encode that value into right-sized models for a specific use case.
The disclosure includes the following clauses and embodiments:
The disclosed system and methods can be implemented in the form of control logic using computer software in a modular or integrated manner. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement the present invention using hardware and/or a combination of hardware and software.
In some embodiments, certain of the methods, models, processes, or functions disclosed herein may be embodied in the form of a trained neural network or other form of model derived from a machine learning algorithm. The neural network or model may be implemented by the execution of a set of computer-executable instructions and/or represented as a data structure. The instructions may be stored in (or on) a non-transitory computer-readable medium and executed by a programmed processor or processing element. A neural network or deep learning model may be characterized in the form of a data structure in which are stored data representing a set of layers, with each layer containing a set of nodes, and with connections (and associated weights) between nodes in different layers. The neural network or model operates on an input to provide a decision, prediction, inference, or value as an output.
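As a minimal, hypothetical illustration of such a data-structure characterization (the layer sizes, weights, and activation below are arbitrary and are not the disclosed representation), a small feed-forward network might be stored and evaluated as follows:

```python
# Illustrative data structure for a trained feed-forward network: a list of
# layers, each holding a weight matrix and a bias vector.
import numpy as np

model = {
    "layers": [
        {"weights": np.random.randn(4, 3), "bias": np.zeros(3)},   # input -> hidden
        {"weights": np.random.randn(3, 1), "bias": np.zeros(1)},   # hidden -> output
    ]
}

def forward(x):
    # Propagate an input through each stored layer with a non-linear activation.
    for layer in model["layers"]:
        x = np.tanh(x @ layer["weights"] + layer["bias"])
    return x

print(forward(np.array([0.1, 0.2, 0.3, 0.4])))
```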
The set of instructions may be conveyed to a user through a transfer of instructions or an application that executes a set of instructions over a network (e.g., the Internet). The set of instructions or an application may be utilized by an end-user through access to a SaaS platform, self-hosted software, on-premise software, or a service provided through a remote platform.
In general terms, a neural network may be viewed as a system of interconnected artificial “neurons” or nodes that exchange messages between each other. The connections have numeric weights that are “tuned” during a training process, so that a properly trained network will respond correctly when presented with an image, pattern, or set of data. In this characterization, the network consists of multiple layers of feature-detecting “neurons”, where each layer has neurons that respond to different combinations of inputs from the previous layers.
Training of a network is performed using a “labeled” dataset of inputs comprising an assortment of representative input patterns (or datasets) that are associated with their intended output responses. Training uses general-purpose methods to iteratively determine the weights for intermediate and final feature neurons. In terms of a computational model, each neuron calculates the dot product of its inputs and weights, adds a bias, and applies a non-linear trigger or activation function (for example, a sigmoid response function).
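As a minimal numeric illustration of this computational model (the input values, weights, and bias below are arbitrary), each neuron's output is the sigmoid of the dot product of its inputs and weights plus a bias:

```python
# Single-neuron computation: sigmoid(dot(inputs, weights) + bias).
import numpy as np

def neuron(inputs, weights, bias):
    return 1.0 / (1.0 + np.exp(-(np.dot(inputs, weights) + bias)))

x = np.array([0.5, -1.2, 3.0])   # example input features (arbitrary values)
w = np.array([0.8,  0.1, -0.4])  # learned weights
b = 0.2                          # learned bias
print(neuron(x, w, b))           # activation in (0, 1)
```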
Machine learning (ML) is used to analyze data and assist in making decisions in multiple industries. To benefit from using machine learning, a machine learning algorithm is applied to a set of training data and labels to generate a “model,” which represents what the application of the algorithm has “learned” from the training data. Each element (or example) of the set of training data, in the form of one or more parameters, variables, characteristics, or “features,” is associated with a label or annotation that defines how the element should be classified by the trained model. A machine learning model can predict or infer an outcome based on the training data and labels and be used as part of a decision process. When trained, the model will operate on a new element of input data to generate the correct label or classification as an output.
Any of the software components, processes or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as Python, Java, JavaScript, C, C++, or Perl using conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands in (or on) a non-transitory computer-readable medium, such as a random-access memory (RAM), a read-only memory (ROM), a magnetic medium such as a hard drive, or an optical medium such as a CD-ROM. In this context, a non-transitory computer-readable medium is almost any medium suitable for the storage of data or an instruction set, aside from a transitory waveform. Any such computer-readable medium may reside on or within a single computational apparatus and may be present on or within different computational apparatuses within a system or network.
According to one example implementation, the term processing element or processor, as used herein, may be a central processing unit (CPU), or conceptualized as a CPU (such as a virtual machine). In this example implementation, the CPU or a device in which the CPU is incorporated may be coupled, connected, and/or in communication with one or more peripheral devices, such as a display. In another example implementation, the processing element or processor may be incorporated into a mobile computing device, such as a smartphone or tablet computer.
The non-transitory computer-readable storage medium referred to herein may include a number of physical drive units, such as a redundant array of independent disks (RAID), a flash memory, a USB flash drive, an external hard disk drive, thumb drive, pen drive, key drive, a High-Density Digital Versatile Disc (HD-DVD) optical disc drive, an internal hard disk drive, a Blu-Ray optical disc drive, or a Holographic Digital Data Storage (HDDS) optical disc drive, synchronous dynamic random access memory (SDRAM), or similar devices or other forms of memories based on similar technologies. Such computer-readable storage media allow the processing element or processor to access computer-executable process steps, application programs and the like, stored on removable and non-removable memory media, to off-load data from a device or to upload data to a device. As mentioned, with regards to the embodiments described herein, a non-transitory computer-readable medium may include almost any structure, technology, or method apart from a transitory waveform or similar medium.
Certain implementations of the disclosed technology are described herein with reference to block diagrams of systems, and/or to flowcharts or flow diagrams of functions, operations, processes, or methods. It will be understood that one or more blocks of the block diagrams, or one or more stages or steps of the flowcharts or flow diagrams, and combinations of blocks in the block diagrams and stages or steps of the flowcharts or flow diagrams, respectively, can be implemented by computer-executable program instructions. Note that in some embodiments, one or more of the blocks, or stages or steps may not necessarily need to be performed in the order presented or may not necessarily need to be performed at all.
These computer-executable program instructions may be loaded onto a general-purpose computer, a special purpose computer, a processor, or other programmable data processing apparatus to produce a specific example of a machine, such that the instructions that are executed by the computer, processor, or other programmable data processing apparatus create means for implementing one or more of the functions, operations, processes, or methods described herein. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement one or more of the functions, operations, processes, or methods described herein.
While certain implementations of the disclosed technology have been described in connection with what is presently considered to be the most practical and various implementations, it is to be understood that the disclosed technology is not to be limited to the disclosed implementations. Instead, the disclosed implementations are intended to cover various modifications and equivalent arrangements included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
This written description uses examples to disclose certain implementations of the disclosed technology, and to enable any person skilled in the art to practice certain implementations of the disclosed technology, including making and using any devices or systems and performing any incorporated methods. The patentable scope of certain implementations of the disclosed technology is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural and/or functional elements that do not differ from the literal language of the claims, or if they include structural and/or functional elements with insubstantial differences from the literal language of the claims.
All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and/or were set forth in its entirety herein.
The use of the terms “a” and “an” and “the” and similar referents in the specification and in the following claims are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “having,” “including,” “containing” and similar referents in the specification and in the following claims are to be construed as open-ended terms (e.g., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value inclusively falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate embodiments of the invention and does not pose a limitation to the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to each embodiment of the present invention.
As used herein (i.e., the claims, figures, and specification), the term “or” is used inclusively to refer to items in the alternative and in combination.
Different arrangements of the components depicted in the drawings or described above, as well as components and steps not shown or described are possible. Similarly, some features and sub-combinations are useful and may be employed without reference to other features and sub-combinations. Embodiments of the invention have been described for illustrative and not restrictive purposes, and alternative embodiments will become apparent to readers of this patent. Accordingly, the present invention is not limited to the embodiments described above or depicted in the drawings, and various embodiments and modifications can be made without departing from the scope of the claims below.
This application claims the benefit of U.S. Provisional Application No. 63/425,938, entitled “Systems and Methods for Programmatic Labeling of Training Data for Machine Learning Models via Clustering and Language Model Prompting,” filed Nov. 16, 2022, the disclosure of which is incorporated, in its entirety (including the Appendix), by this reference.
| Number | Date | Country |
|---|---|---|
| 63425938 | Nov 2022 | US |