The fields of artificial intelligence and machine learning are increasingly impacting how organizations conduct business and research. While the actual machine learning models are critical to the performance of a machine learning or artificial intelligence project, the data used to train and maintain a model is just as important. Most models require a large amount of input data for training as well as for use of the model. For example, an image classification model may require thousands of input images to initially train the model, and then thousands more to improve the performance of the model. As another example, once a model is in use, it may require input data to allow the model to perform its intended function. The collection and preparation of datasets is often the most time-consuming and expensive aspect of a machine learning project.
It would be desirable to provide systems and methods to substantially reduce the time and complexity of collecting and manipulating input data sets.
According to some embodiments, systems, methods and computer program code are provided to implement artificial intelligence collectors. In some embodiments, systems, methods and computer program code are provided to process an input data object from a data source, including identifying at least a first collector associated with the data source, adding the input data object to a queue of the at least first collector, and applying a post-queue workflow to the input data object to determine whether to pass the input data object from the queue to an output data sink.
Pursuant to some embodiments, a pre-queue workflow is provided to determine whether to allow the input data object to be added to the queue. In some embodiments, the pre-queue workflow is a sampling workflow. In some embodiments, the pre-queue workflow is a thresholding workflow. Pursuant to some embodiments, the post-queue workflow operates asynchronously from the pre-queue workflow. In some embodiments, the post-queue workflow is a thresholding workflow.
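By way of a non-limiting illustration, the decoupling of a synchronous pre-queue workflow from an asynchronous post-queue workflow by an intervening queue might be sketched as follows (the data shapes and the function names `pre_queue_ok` and `post_queue_ok` are hypothetical and not part of any particular embodiment):

```python
import queue
import threading

def pre_queue_ok(item):
    # Pre-queue workflow: sampling/filtering decision made as data arrives.
    return item.get("concept") == "vehicle"

def post_queue_ok(item):
    # Post-queue workflow: thresholding decision made on queued items.
    return item.get("confidence", 0.0) > 0.9

def run_collector(inputs):
    q = queue.Queue()
    sink = []

    def worker():
        # Drains the queue asynchronously from the producing side.
        while True:
            item = q.get()
            if item is None:  # sentinel: no more input
                break
            if post_queue_ok(item):
                sink.append(item)
            q.task_done()

    t = threading.Thread(target=worker)
    t.start()
    for item in inputs:
        if pre_queue_ok(item):  # pre-queue workflow gates admission
            q.put(item)
    q.put(None)
    t.join()
    return sink
```

In this sketch, only items admitted by the pre-queue workflow ever reach the queue, and only queued items passing the post-queue threshold reach the output sink.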
A technical effect of some embodiments of the invention is an improved and computerized way of obtaining relevant input data for applications such as machine learning models. With these and other advantages and features that will become hereinafter apparent, a more complete understanding of the nature of the invention can be obtained by referring to the following detailed description and to the drawings appended hereto.
Typical data streaming applications (such as those that process input data streams to provide data to machine learning systems) do not have an intelligence layer to make decisions on which data should be allowed through the process or which should be filtered out of the stream. Further, model predictions are commonly performed in a system that is different from the systems used for improving or training those models. To provide models with the additional data that is typically needed for training and improvement, manual processes are performed, or processes involving additional programming and integration with other systems (such as extract, transform and load or “ETL” processes). Embodiments ensure that a stream of data is collected for future use in model building (and with other data sinks) that matches the distribution of data that those models require.
The selection of input data is typically a manual and batch process that is applied to multiple items of input data. Embodiments apply unique considerations to every data point in an input data stream to decide whether each particular input should be passed through a collector for delivery to an application or other data sink. Embodiments perform such collector operations without additional coding or integration.
An enterprise may want to collect or otherwise identify a large amount of data for the purpose of training machine learning or artificial intelligence (“AI”) models (generally referred to herein as “models”). By way of example, an enterprise that is developing a model to identify vehicles in images may need to tag or “annotate” a large number of images to train and improve one or more models to identify those vehicles. This can require a large number of input images. Collecting these images can be expensive and time consuming. Often, enterprises have access to data associated with other projects (e.g., such as other machine learning projects) or data sources. Embodiments allow “collectors” to be created that automatically collect data for use in an application (such as a machine learning application). The data may include images, video, audio, text or any other data type used as input to a model for prediction (or as output from a model as a prediction). A “collector” as used herein may include one or more filters, sampling algorithms, models or other components composing one or more workflow graphs of computation that are configured to determine if an item of data should be output (or “collected”). In this way, a number of different data sources can be used as inputs, and the components of the collector can be configured to only collect data of interest.
For convenience and ease of exposition, a simple illustrative example will be used to describe features of some embodiments. Those skilled in the art, upon reading the present disclosure, will appreciate that the example is not limiting and that embodiments may be used to create collectors for a large variety of different types of data as well as a large number of different types of machine learning applications.
In the illustrative example, an enterprise is developing and operating models to identify types of vehicles, including a model to identify trucks (referred to herein as the “truck identification model”). Training the model requires a large number of input images, including images that include trucks as well as images that do not include trucks. The enterprise has access to outputs from another model (a “general” model) that identifies items in images. The general model, through its normal operation, often identifies images with vehicles in them. Embodiments allow the enterprise to build and deploy a collector that accesses the outputs from the general model to collect outputs in which a vehicle has been identified. Further, the collector may be configured to only collect outputs from the general model where the prediction confidence of the model is above a threshold. In this manner, as the general model performs predictions, the truck identification model collects input data for its own use. Multiple ones of these collectors may be created, allowing the truck identification model to benefit from the outputs of multiple data sources. As will be described further below, collectors may be configured with a number of different data sources, and the description of the use of a prediction output as a data source is for illustrative purposes only. As used herein, the term “automated” or “automatic” may refer to, for example, actions that can be performed with little or no human intervention.
Features of some embodiments will now be described by first referring to
The system 100 may generally be referred to herein as being (or as a part of) a “machine learning system”. The system 100 can include one or more models that may be stored at model database 132 and interacted with via a component or controller such as model module 112. In some embodiments, one or more of the models may be so-called “classification” models that are configured to receive and process inputs 102 and generate output data 136. As used herein, the term “classification model” can include various machine learning models, including but not limited to a “detection model” or a “regression model.” Embodiments may be used with other models, and the use of a classification model as the illustrative example is intended to be illustrative but not limiting. As a result, the term “model” as used herein, is used to refer to any of a number of different types of models (from classification models to segmentation models or the like).
For convenience and ease of exposition, to illustrate features of some embodiments, the term “confidence score” is used to refer to an indication of a model's confidence of the accuracy of an output (such as a “concept” output from a model such as a classification model). The “confidence score” may be any indicator of a confidence or accuracy of an output from a model, and a “confidence score” is used herein as an example. In some embodiments, the confidence score is used as an input to one or more collectors to determine further processing actions as will be described further herein.
The present application includes a platform 120 that includes one or more collectors that are created to monitor one or more input data sources 102 (which may be, for example, prediction streams or other data sources) to collect input data therefrom for delivery to one or more output data sinks 136. For example, the input data may be obtained from a prediction stream associated with a machine learning model (such as a classification model or the like) and collector rules (including pre- and post-collector workflows as will be described below) may operate on that input data to ensure that only relevant or desired input data is passed from the collector to an output data sink or other destination.
According to some embodiments, the platform 120 may include one or more “automated” collectors that may automatically receive or monitor input data from one or more data sources, perform processing on that data, and output the data to one or more data sinks or output destinations. The processing may allow the selection of specific input data for output to the output locations, allowing complex operations to be performed to ensure appropriate data is output. The result is a system that ensures accurate data is presented at an output without manual intervention or further processing outside the system 100.
In some embodiments, a user device 104 may interact with the platform 120 via a user interface (e.g., via a web browser) where the user interface is generated by the platform 120 and more particularly by the user interface module 114. In some embodiments, the user device 104 may be configured with an application (not shown) which allows a user to interact with the platform 120. In some embodiments, a user device 104 may interact with the platform 120 via an application programming interface (“API”) and more particularly via the interface module 118. For example, the platform 120 (or other systems associated with the platform 120) may provide one or more APIs for the submission of inputs 102 for processing by the platform 120.
For the purpose of illustrating features of some embodiments, the use of a web browser interface will be described; however, those skilled in the art, upon reading the present disclosure, will appreciate that similar interactions may be achieved using an API. An illustrative (but not limiting) example of a web browser interface pursuant to some embodiments will be described further below in conjunction with
The system 100 can include various types of computing devices. For example, the user device(s) 104 can be mobile devices (such as smart phones), tablet computers, laptop computers, desktop computers, or any other type of computing device that allows a user to interact with the platform 120 as described herein. The platform 120 can include one or more computing devices including those explained below with reference to
The devices of system 100 (including, for example, the user devices 104, inputs 102, platform 120 and databases 132, 134 and 136) may communicate using any communication platforms and technologies suitable for transporting data and/or communication signals, including any known communication technologies, devices, media, and protocols supportive of data communications. For example, the devices of system 100 may exchange information via any wired or wireless communication network such as the Internet, an intranet, or an extranet. Note that any devices described herein may communicate via one or more such communication networks.
Although a single platform 120 is shown in
Reference is now made to
In the illustrative collector 200 shown in
As discussed above in the illustrative example, a collector 200 may be created to collect only those input images and data that are relevant to the vehicle model (e.g., to only collect or accept inputs that have vehicles in the image). The collector 200 may, for example, be configured with a pre-queue workflow 204 that samples data in the prediction stream from the general model (the data source 202) and only allows inputs that the general model has inferred as having the concept of a “vehicle” to be added to the collector queue 206. This pre-queue workflow 204 may be implemented using, for example, code configured to monitor the prediction stream from the data source 202 and to add inputs to the collector queue 206 when the input has been identified as having a concept of “vehicle” (or a variant thereof). In some embodiments, the pre-queue workflow 204 may also require that the concept have a certain confidence score associated with it (e.g., only inputs having a concept of vehicle with a confidence score of greater than 50% may be added to the queue). In general, a pre-queue workflow 204 may be configured to determine whether an item of potential input data should be allowed into the collector queue 206. Pursuant to some embodiments, the pre-queue workflow 204 monitors data in the data source 202 to identify inputs that may be allowed into the collector queue 206.
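A pre-queue workflow of this kind might be sketched as a simple predicate applied to each prediction in the stream (assuming, purely for illustration, that each prediction carries a list of (concept, confidence) pairs; the names used are hypothetical):

```python
def vehicle_pre_queue(prediction, min_confidence=0.5):
    """Admit an input only if the source model inferred a "vehicle"
    concept with a confidence above the threshold."""
    return any(
        concept == "vehicle" and confidence > min_confidence
        for concept, confidence in prediction["concepts"]
    )

def fill_queue(prediction_stream, collector_queue):
    # Monitor the prediction stream and add qualifying inputs to the queue.
    for prediction in prediction_stream:
        if vehicle_pre_queue(prediction):
            collector_queue.append(prediction)
```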
In this way, embodiments create a queue 206 that only includes relevant data from a data source 202. The data in the queue 206 may be further filtered or processed using a post-queue workflow 208. For example, in some embodiments, the pre-queue workflow 204 may serve to sample data from the data source 202, allowing a subset of data from the data source 202 to enter the queue 206. The post-queue workflow 208 may then be configured to only permit high quality data to be passed to an output or data sink 210. In the illustrative example, the queue 206 may be populated with images that have been classified as including a vehicle. The post-queue workflow 208 may be configured to only pass those images which the general model (the input data source 202) classified as including a vehicle with a high confidence score (e.g., only those images having the concept of “vehicle” with a confidence score of greater than 90% may be passed to the output). In this manner, collectors 200 may be configured to apply unique considerations to a large set of potential input data to determine whether each item should be “collected” or provided to the output. In general, the post-queue workflow 208 may perform asynchronous processing on the items that were allowed into the collector queue 206. The asynchronous processing is performed to determine if the data in the collector queue 206 should be written or otherwise outputted to the data sink 210.
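The post-queue stage of the illustrative example might be sketched as follows (again assuming a hypothetical (concept, confidence) representation of predictions; a deployed post-queue workflow may run asynchronously, which is omitted here for brevity):

```python
def post_queue_threshold(queued, sink, concept="vehicle", threshold=0.9):
    """Pass only high-confidence items from the collector queue to the
    data sink; items below the threshold are simply not collected."""
    for item in queued:
        score = dict(item["concepts"]).get(concept, 0.0)
        if score > threshold:
            sink.append(item)
    return sink
```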
In the collector 200 of
Referring to
In general, the pre-queue workflow 204 is used to pre-process inputs before an input is passed into the queue or before it is passed to an application or other data sink 210 which will use the input. Put another way, a pre-queue workflow 204 may be used to specify one or more sampling rules or filtering rules that can trigger ingestion of an input into a queue. A pre-queue workflow 204 may, for example, be configured to randomly sample inputs (e.g., by allowing one random input out of every five or ten inputs presented in a prediction stream). The pre-queue workflow 204 may also be configured to filter inputs based on metadata associated with an input (e.g., a specific attribute, such as a specific date/time or date range of metadata associated with an input may be required for an input to be passed into the queue 206). The pre-queue workflow 204 may also be configured to filter inputs based on confidence score (e.g., to permit inputs into the queue when the confidence score is above, below, or between specified thresholds). In the illustrative example, a pre-queue workflow 204 may be configured to require that inputs from the general model prediction stream have the concept “vehicle” or “truck” with a confidence score of at least 70% for an input to be passed into the queue.
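These sampling and filtering rules might be sketched as composable predicates (a non-limiting illustration; the rule constructors shown are hypothetical):

```python
import random

def make_random_sampler(n, rng=None):
    """Allow roughly one input out of every n (random sampling)."""
    rng = rng or random.Random()
    return lambda item: rng.randrange(n) == 0

def make_metadata_filter(key, allowed):
    """Require a metadata attribute to take one of the allowed values."""
    return lambda item: item.get("metadata", {}).get(key) in allowed

def make_confidence_filter(low=0.0, high=1.0):
    """Admit inputs whose confidence score falls inside [low, high]."""
    return lambda item: low <= item.get("confidence", 0.0) <= high

def combine(*rules):
    """A pre-queue workflow may chain several rules; all must pass."""
    return lambda item: all(rule(item) for rule in rules)
```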
Further, the pre-queue workflow 204 may be configured to perform some knowledge graph mapping from the input data source 202 to a desired output sink 210. For example, a pre-queue workflow 204 may be configured to change the label of a concept from an input source (such as a prediction stream from a general model) to a labeling scheme used in an output data sink 210 (such as a concept labeling convention of the vehicle identification model).
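Such a knowledge graph mapping might be sketched as a simple label translation step (the mapping table shown is hypothetical):

```python
# Hypothetical mapping from a general model's concept labels to the
# labeling convention used by a vehicle identification model.
LABEL_MAP = {"automobile": "vehicle", "lorry": "truck", "pickup": "truck"}

def remap_concepts(prediction, label_map=LABEL_MAP):
    """Translate source-model concept labels into the output sink's
    labeling scheme, dropping concepts that have no mapping."""
    remapped = [
        (label_map[concept], confidence)
        for concept, confidence in prediction["concepts"]
        if concept in label_map
    ]
    return {**prediction, "concepts": remapped}
```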
In this way, the use of pre-queue workflows allows the use of an intelligence layer to make decisions on which data should be allowed into a collector process, thereby reducing processing and storage time and costs associated with operating the collector 200.
Collectors pursuant to some embodiments may be used in a number of different applications. For example, collectors aid the process of building machine learning models by enabling the automated collection of training data. This is particularly true when embodiments are used in conjunction with other models that are in production usage and that produce a stream of prediction data. Further, when collectors are used in conjunction with pre-existing models to provide smart filtering of data for specific applications, desirable results may be achieved.
Further, as more and more training data is collected, the risk of data drift can be mitigated by adding or adjusting the filtering criteria of a collector 200. For example, in a collector 200 sampling “cat” and “dog” training data, a workflow might be implemented to sample a smaller percentage of “dog” data if this class of data is overrepresented. As the data distribution changes over time, models in the collector's selection workflows could be used to detect data drift and dynamically adjust the collector's selection criteria (e.g., by changing the configuration of a pre-queue workflow) to rebalance the training data, allowing future versions of the model to incorporate the underlying changes in the distribution of the data it predicts on.
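One simple drift-compensation scheme might be sketched as follows (a non-limiting illustration in which per-class sampling rates are periodically recomputed from observed class counts; the function names are hypothetical):

```python
import random

def make_rebalancing_sampler(target_rates, rng=None):
    """Admit inputs with per-class sampling rates, e.g. keep only a
    fraction of an overrepresented "dog" class while keeping all "cat"
    inputs (classes without an explicit rate are always kept)."""
    rng = rng or random.Random(0)
    def admit(item):
        rate = target_rates.get(item["label"], 1.0)
        return rng.random() < rate
    return admit

def update_rates(class_counts):
    """Recompute sampling rates so each class trends toward an equal
    share of the collected data (a simple drift compensation)."""
    smallest = min(class_counts.values())
    return {label: smallest / count for label, count in class_counts.items()}
```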
Embodiments may also be used in search indexing applications. For example, searching through a large data collection can be a computationally expensive task. A collector 200 may be created to analyze a dataset as it is being imported into the platform and to store indexing information about the dataset. This information may then be used to quickly search over a large dataset. Other applications of collectors pursuant to the present invention include use in outlier detection, continuous model improvement, data analytics, data synchronization, data ingestion and ETL processes, active learning or the like.
In some embodiments, a collector platform 120 may allow users to create “applications” or “workflows” which specify how inputs are to be processed. Workflows or applications may invoke other workflows (for example, workflows may be nested or chained). In some embodiments, a workflow may include one or more collectors 200.
Reference is now made to
Collector creation process 300 may begin at 302 where a user interacts with a platform such as the platform 120 of
Other types of input data sources may be used. For example, inputs may be obtained from data storage systems (such as hard drives, cloud storage such as Amazon S3, Google Cloud Storage, Dropbox, or the like), data warehouses or data lakes (such as Snowflake, etc.), social media data streams, ETL data streams, APIs or the like. As discussed above, for the purposes of illustration, the data source will be described as being a prediction stream.
Once the input data source has been selected, processing continues at 304 where an optional pre-queue workflow may be associated with the collector. In some embodiments, a collector may have at least one of a pre-queue workflow or a post-queue workflow associated with it. In general, the workflows apply one or more operations to the input data to help ensure that desired data is output from the collector. In general, a pre-queue workflow may be used to selectively permit data from the input data stream to be added to the collector queue. As an example, a pre-queue workflow may be associated with the collector which randomly samples inputs from the input data source. As a particular example, the pre-queue workflow may be a random sample model that allows a random 10% of inputs to be passed into the collector queue. In some embodiments, the pre-queue workflow associated at 304 may be a pre-existing workflow that performs a desired function (e.g., such as performing a random sampling of inputs, or performing a filter function, or some other decisioning), or is adapted or modified to perform a desired function.
In many scenarios, a user will only want to ingest a sample, or subset, of a given data source into a model or application. Pre-queue workflows allow users to pre-process inputs so that they are sampled and filtered before they are ever added to an application or model. Pre-queue workflows also allow users to specify sampling rules for triggering data ingestion. Common pre-queue workflows are designed to: (i) randomly sample inputs; (ii) filter inputs by metadata; (iii) filter inputs with a maximum probability below a given threshold; (iv) filter inputs with a minimum probability above a given threshold; (v) filter specific concept probabilities above a given threshold; or (vi) perform knowledge graph mapping from other models (e.g., such as the general model discussed above) and their concepts to a custom model.
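Items (iii) through (v) might be sketched as follows (assuming, for illustration only, predictions represented as (concept, probability) pairs):

```python
def max_prob_below(prediction, threshold):
    """(iii) Keep inputs whose most confident concept is still uncertain,
    e.g. to route hard examples toward annotation."""
    return max(p for _, p in prediction["concepts"]) < threshold

def min_prob_above(prediction, threshold):
    """(iv) Keep inputs where even the least confident concept clears
    the threshold."""
    return min(p for _, p in prediction["concepts"]) > threshold

def concept_prob_above(prediction, concept, threshold):
    """(v) Keep inputs where a specific concept's probability is high."""
    return dict(prediction["concepts"]).get(concept, 0.0) > threshold
```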
Processing continues at 306 where an optional post-queue workflow may be associated with the collector. In general, a post-queue workflow may asynchronously determine which data in the collector queue should be passed as an output from the collector. As an example, a post-queue workflow may be created and associated with the collector that uses one or more threshold models to only allow input data that matches pre-defined thresholds to pass to the output of the collector. In some embodiments, the post-queue workflow associated at 306 may be a pre-existing workflow that performs a desired function (e.g., such as applying a threshold, performing a filter function, or performing some other decisioning), or is adapted or modified to perform a desired function.
Processing continues at 308 where a collector queue is created. In general, the creation of the collector queue may be automatically performed with the creation of a collector (and is shown as a separate step in
Processing continues at 310 where the data to be output from the collector may be mapped (e.g., if the output location or data sink requires a different format or layout of data). Processing at 310 may also include specifying any permissions (e.g., such as user or API permissions) to allow the output data to be passed to a data sink.
Processing continues at 312 where the collector is created. Creation of the collector may, in some embodiments, put the collector into use such that it immediately receives inputs from the input data source and processes those inputs. In some embodiments, the collector may require activation or scheduling to be activated for use. While creation of a single collector was discussed, in practical application multiple collectors may be created and may run at the same time. Further, multiple collectors may produce output that is delivered to the same data sink (e.g., for use by another model or the like). In the illustrative example introduced above, the truck identifier model may receive input data from a number of different collectors, providing the truck identifier with input data for training or the like.
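The components assembled by the collector creation process 300 might be sketched end to end as follows (a non-limiting illustration; the class and its methods are hypothetical and synchronous for simplicity, whereas an actual post-queue workflow may run asynchronously):

```python
class Collector:
    """Sketch of a created collector: a data source feeds a queue
    through an optional pre-queue workflow; a post-queue workflow
    decides what reaches the data sink, optionally remapped."""

    def __init__(self, source, sink, pre_queue=None, post_queue=None,
                 output_map=None):
        self.source = source
        self.sink = sink
        self.pre_queue = pre_queue or (lambda item: True)
        self.post_queue = post_queue or (lambda item: True)
        self.output_map = output_map or (lambda item: item)
        self.queue = []

    def ingest(self):
        # Pre-queue workflow gates admission to the collector queue.
        for item in self.source:
            if self.pre_queue(item):
                self.queue.append(item)

    def flush(self):
        # Post-queue workflow decides what is written to the data sink.
        while self.queue:
            item = self.queue.pop(0)
            if self.post_queue(item):
                self.sink.append(self.output_map(item))
```

In this sketch, `ingest` plays the role of the pre-queue workflow admitting inputs from the data source, and `flush` plays the role of the post-queue workflow passing selected items (after any output mapping) to the data sink.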
In some embodiments, the collector creation process 300 may be performed by interacting with an API (and one or more of the steps of
A caller may be specified at 510. The caller may be a user or a group of users that are permitted to invoke the collector. In some embodiments, a user without express permissions will not be able to use or invoke the collector. An API key may also be selected at 512, identifying the key used to allow new inputs to be posted as outputs from this collector. In some embodiments this key may be an ID or API key associated with the post-queue workflow ID of the workflow to run after the collector has processed the queued input. In some embodiments, the API key is specified along with the scopes (or permissions) that the collector needs. As a specific example, the API key must have a scope permitting it to post inputs, since that scope grants the collector the authority to post inputs to a data sink or application.
Information identifying the application to which the data is to be posted or saved may be identified in field 516 and information identifying the source of the input data is identified at 518. In the dropdown box at 518, a user may, for example, select the model that input data is to be received from. The collector will then automatically receive those inputs and (based on the pre- and post-queue workflow(s)) post the data to an output destination specified in 520. A tabular view of available models and sources may be shown in area 522. In this way, a user can quickly and efficiently create a collector pursuant to the present invention.
In some embodiments, a collector platform 120 may be implemented in an environment in which multiple data sources are available. For example, reference is now made to
The embodiments described herein may be implemented using any number of different hardware configurations. For example,
The processor 610 also communicates with a storage device 630. The storage device 630 may comprise any appropriate information storage device, including combinations of magnetic storage devices (e.g., a hard disk drive), optical storage devices, mobile telephones, and/or semiconductor memory devices. The storage device 630 stores a program 612 and/or one or more software modules 614 (e.g., associated with the user interface module, model module, threshold module, and interface module of
The programs 612, 614 may be stored in a compressed, uncompiled and/or encrypted format. The programs 612, 614 may furthermore include other program elements, such as an operating system, a database management system, and/or device drivers used by the processor 610 to interface with peripheral devices.
As used herein, information may be “received” by or “transmitted” to, for example: (i) the collector platform 600 from another device; or (ii) a software application or module within the collector platform 600 from another software application, module, or any other source.
The following illustrates various additional embodiments of the invention. These do not constitute a definition of all possible embodiments, and those skilled in the art will understand that the present invention is applicable to many other embodiments. Further, although the following embodiments are briefly described for clarity, those skilled in the art will understand how to make any changes, if necessary, to the above-described apparatus and methods to accommodate these and other embodiments and applications.
Although specific hardware and data configurations have been described herein, note that any number of other configurations may be provided in accordance with embodiments of the present invention (e.g., some of the information associated with the databases described herein may be combined or stored in external systems).
The present invention has been described in terms of several embodiments solely for the purpose of illustration. Persons skilled in the art will recognize from this description that the invention is not limited to the embodiments described but may be practiced with modifications and alterations limited only by the spirit and scope of the appended claims.