Data processing workflow systems usually allow a user to manage the workflow, such that the user can set up the data processing workflow and store the workflow for retrieval and/or execution (e.g., via a database). To execute a workflow, the user must typically select input data (e.g., from a storage source) and trigger execution of the workflow using the input data. The workflow is then executed to generate output data.
The inventors have recognized and appreciated that improved workflow management systems and various applications using the improved workflow management systems can be devised. Such improved systems and methods in accordance with some embodiments employ computational and AI processes to utilize hardware requirements (in building a workflow) and allow users to control instrumentation and sample tracking in a variety of applications, e.g., in chemistry workflows. The systems and methods may store data passed between processes in association with one or more processes in the workflow. For example, a data processing workflow and data between the processes in the workflow may be stored in a graph database, as a pipeline execution record. The systems and methods enable the user to query the pipeline execution record at any suitable stage in the workflow by the structure of the workflow (e.g., a graph or a sub-graph as search query). The systems and methods provided herein are particularly advantageous over conventional systems in that data and workflows, which are complex and dynamically changing, may be scaled up.
Some embodiments are directed to a system for workflow management, the system comprising at least one processor configured to: obtain a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; execute one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generate a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
Some embodiments are directed to a method for workflow management, the method comprising, using at least one processor: obtaining a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; executing one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generating a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
Some embodiments are directed to a non-transitory computer-readable medium comprising instructions that, when executed, cause at least one processor to perform operations comprising: obtaining a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; executing one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generating a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
Additional embodiments of the disclosure, as well as features and advantages thereof, will become more apparent by reference to the description herein taken in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale. Moreover, in the figures, like-referenced numerals designate corresponding parts throughout the different views.
For the purposes of promoting an understanding of the principles of the present disclosure, reference will now be made to the embodiments illustrated in the drawings, and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended.
Workflow management systems can be used to perform various tasks. In an example application, a data processing workflow can be created and/or configured in a workflow management system to perform certain functions, such as predicting properties of a molecule. A workflow can be executed multiple times, each time using different input data, such as the structure of a different molecule that can be processed to predict the properties of the molecule.
In conventional workflow systems, the input/output data is entirely decoupled from the workflow. While input data is specified for use with a pipeline (e.g., a flow of data through execution of a workflow or a portion of a workflow) in conventional approaches, the input data is not associated with either the pipeline or a particular execution of the pipeline (e.g., including the pipeline configuration for that execution, since the pipeline can change over time). Accordingly, the information for each particular execution of the pipeline is typically lost. Similarly, while the final output data can be saved, it would need to be manually associated with a particular pipeline, which is typically not done. Even if done, such an association stores only the result of the processing pipeline; the output data is not associated with the input data and/or the particular pipeline configuration that was executed to generate the output data. Further, none of the data/metadata is stored throughout execution of each of the pipeline components in association with the pipeline components themselves. Therefore, conventional systems do not provide a record of a particular pipeline and its associated execution(s). Thus, it is not possible to use conventional approaches to look at a particular pipeline configuration or execution of that pipeline to see what data was generated step-by-step in the execution of the pipeline. Furthermore, it would not be possible to look back at pipelines over time to see how they have changed, or to look at portions of previous pipeline configurations, or the data provided to/generated from those portions, etc. Additionally, each user is separately responsible for their own workflow and data; no user can share another user's workflow and/or data, or repeat a portion of another user's workflow.
Accordingly, the inventors have developed techniques for combining data and execution-driven pipelines. Described herein are various techniques, including systems, computerized methods, and non-transitory instructions, that integrate data with the pipeline workflow, where data passed by the processes in the workflow are stored in association with the workflow components (e.g., in a database).
An example data processing workflow may include a plurality of processes that are linked in certain configurations. Each of the one or more processes may be associated with respective input data and output data, and the plurality of processes may be linked, such that output data of one process may be provided as input to one or more other processes in the workflow. When the data processing workflow is executed, the data flows through the plurality of processes in the workflow as configured via the associated links.
A specification of a data processing workflow may include data describing the configuration of the plurality of processes in the data processing workflow, such as how the plurality of processes are linked. For example, the plurality of processes in a data processing workflow may be linked in serial, in parallel, or in a combination of serial and parallel manners. In some embodiments, a specification of a workflow may be represented in a digital representation, such as a specification file describing the workflow. For example, a specification file can be an XML file, a JSON file, a graph file, a flat file, and/or any other suitable format.
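By way of non-limiting illustration, a specification file of the kind described above might be read and turned into an execution order as in the following minimal sketch. The JSON layout, the field names ("workflow", "processes", "name", "inputs"), and the process names are assumptions made for this example only and are not prescribed by the disclosure. The sketch uses the Python standard library `graphlib.TopologicalSorter` to order processes that are linked in a combination of serial and parallel manners.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical specification: each process lists the processes whose
# output it consumes. The layout is an assumption for illustration.
SPEC_JSON = """
{
  "workflow": "property_prediction",
  "processes": [
    {"name": "create_molecule", "inputs": []},
    {"name": "molecule_state", "inputs": ["create_molecule"]},
    {"name": "conformers", "inputs": ["create_molecule"]},
    {"name": "opt_dft", "inputs": ["molecule_state", "conformers"]}
  ]
}
"""


def load_links(spec_text):
    """Return a mapping of process name -> set of upstream process names."""
    spec = json.loads(spec_text)
    return {p["name"]: set(p["inputs"]) for p in spec["processes"]}


def execution_order(links):
    """Order processes so that each one runs after the processes it depends on."""
    return list(TopologicalSorter(links).static_order())


links = load_links(SPEC_JSON)
order = execution_order(links)
print(order)
```

In this sketch, "molecule_state" and "conformers" are linked in parallel (both depend only on "create_molecule"), while "opt_dft" is linked in series after both of them.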
The techniques described herein can therefore execute one or more processes of the plurality of processes of a data processing workflow, for example, based on the specification of the workflow. In some embodiments, the techniques may create a pipeline execution record associated with executing a pipeline (e.g., flow of data through execution of one or more processes in the data processing workflow). The pipeline execution record may include, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof. As such, the pipeline execution record contains data that records one or more instances of execution of a data processing workflow (or a portion thereof) and input/output data or other execution metadata associated with the execution(s).
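As a non-limiting sketch of how a pipeline execution record could be assembled during execution, the following example runs a short serial pipeline and stores, for each executed process, a process data record holding its input data, output data, and execution metadata. The class names, field names, and the two toy processes are hypothetical and chosen only for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ProcessDataRecord:
    """Data captured for one executed process (hypothetical layout)."""
    process: str
    input_data: object
    output_data: object
    metadata: dict


@dataclass
class PipelineExecutionRecord:
    """Collects one ProcessDataRecord per executed process."""
    workflow: str
    records: list = field(default_factory=list)


def run_pipeline(workflow, steps, initial_input):
    """Run (name, function) steps in series, recording data at every step."""
    record = PipelineExecutionRecord(workflow)
    data = initial_input
    for name, func in steps:
        started = time.perf_counter()
        output = func(data)
        record.records.append(ProcessDataRecord(
            process=name,
            input_data=data,
            output_data=output,
            metadata={"duration_s": time.perf_counter() - started},
        ))
        data = output  # the output of one process feeds the next
    return record


# Toy processes standing in for real workflow processes.
steps = [("double", lambda x: x * 2), ("increment", lambda x: x + 1)]
record = run_pipeline("toy_workflow", steps, 5)
for r in record.records:
    print(r.process, r.input_data, r.output_data)
```

Because every intermediate input and output is retained, the record can later be inspected step by step rather than only at the final output.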
In some embodiments, a pipeline execution record may be stored in a database in any suitable format. For example, the pipeline execution record may be represented in a graph and stored in a graph database. In some embodiments, the techniques can also enable a user (who created the pipeline execution record), or other users (who did not create the pipeline execution record) to query the pipeline execution record to see how the data flowed through the pipeline, how the pipeline was executed, or the configuration of the pipeline for a particular execution. In a non-limiting example, the techniques enable a user to search the pipeline execution record using a workflow query for one or more processes of a data processing workflow that match the workflow query. For example, where the pipeline execution record is stored in a graph database, a workflow query may be represented as a sub-graph representing one or more processes of a portion of the graph database.
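The following minimal sketch illustrates the idea of querying a graph-shaped record with a sub-graph. It assumes, for illustration only, that the record is held in memory as a set of (input data node, process, output data node) edges, and it uses simple edge-pattern matching with `None` as a wildcard. This is a simplification and not general sub-graph isomorphism, and it does not reflect the query language of any particular graph database.

```python
# Each edge is (input data node, process name, output data node).
# Node and process names are hypothetical.
RECORD_EDGES = {
    ("smiles", "create_molecule", "molecule"),
    ("molecule", "molecule_state", "state"),
    ("molecule", "conformers", "coordinates"),
}


def edge_matches(edge, pattern):
    """A pattern field of None matches any value in that position."""
    return all(p is None or p == e for e, p in zip(edge, pattern))


def find_matches(record_edges, query_edges):
    """Return record edges matched by the query.

    Every query edge must match at least one record edge; otherwise
    the query as a whole does not match and an empty list is returned.
    """
    matched = []
    for pattern in query_edges:
        hits = [e for e in sorted(record_edges) if edge_matches(e, pattern)]
        if not hits:
            return []
        matched.extend(e for e in hits if e not in matched)
    return matched


# Query: every process that consumes the "molecule" data node.
result = find_matches(RECORD_EDGES, [("molecule", None, None)])
print(result)
```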
In response to a workflow query, one or more processes of the data processing workflow in the pipeline execution record may be matched to the workflow query. As a result, the techniques may output data associated with the one or more processes of the data processing workflow that match the workflow query, without re-executing the one or more processes of the data processing workflow. Alternatively, and/or additionally, when the workflow query does not match any process in the pipeline execution record, the techniques may execute one or more processes of the data processing workflow that are represented in the workflow query to generate the new output data.
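The match-or-execute behavior described above can be sketched as follows. The example assumes, for illustration, that stored outputs are keyed by a hash of the process name and its JSON-serializable input data; the class name `RecordStore`, the keying scheme, and the toy process are hypothetical.

```python
import hashlib
import json


def input_key(process, input_data):
    """Derive a stable key from a process name and its JSON-serializable input."""
    payload = json.dumps({"process": process, "input": input_data}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RecordStore:
    """Toy stand-in for a pipeline execution record that can answer queries."""

    def __init__(self):
        self._outputs = {}
        self.executions = 0  # counts how often a process actually ran

    def query_or_execute(self, process, func, input_data):
        key = input_key(process, input_data)
        if key in self._outputs:
            # Match: return stored output without re-executing the process.
            return self._outputs[key]
        # No match: execute, then store the new output for later queries.
        self.executions += 1
        output = func(input_data)
        self._outputs[key] = output
        return output


store = RecordStore()
first = store.query_or_execute("square", lambda x: x * x, 4)
second = store.query_or_execute("square", lambda x: x * x, 4)
print(first, second, store.executions)
```

The second call returns the stored output, so the process runs only once for the same input.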
In various embodiments, the techniques may display the pipeline execution record, for example, in a graph representation. The techniques may provide a user interface (e.g., graphical user interface) that receives user selection(s) defining at least a portion of the graph as the workflow query. Similarly, the techniques may also provide a user interface that enables the user to define a data processing workflow. For example, the techniques may provide a graphical user interface, and receive, via the graphical user interface, user selection(s) defining the one or more processes of the plurality of processes. In a non-limiting example, the user selection(s) may include selection(s) of one or more processes from a library of user selectable processes. The resulting data processing workflow defined by the user may be stored in a specification such as a specification file described herein above.
Whereas various embodiments have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible. Accordingly, the embodiments described herein are examples, not the only possible embodiments and implementations. Furthermore, the advantages described above are not necessarily the only advantages, and it is not necessarily expected that all of the described advantages will be achieved with every embodiment.
Returning to
In some embodiments, the pipeline execution record may be represented in a graph representation, where the input/output data associated with each process in the data processing workflow may be represented by a respective node, and where each process in the data processing workflow may be represented by a link between the nodes. For example, with reference to
Returning to
Accordingly, system 100 may include a workflow search engine configured to receive, from one or more users, a workflow query, and use the workflow query to search the pipeline execution record for one or more processes of the data processing workflow that match the workflow query. The system may obtain from the pipeline execution record output data associated with the one or more processes of the data processing workflow that match the workflow query, and return the obtained output data to the user(s). This enables the user to quickly obtain the output data without the system re-executing any part of the data processing workflow.
In some embodiments, the user may identify one or more processes of a workflow the user would like to search. The user may use the identified one or more processes as a search query to search the associated pipeline execution record. In some examples, the search query may be in a graph representation, such as a sub-graph shown in
In some embodiments, the system may obtain data (e.g., input/output data, or other metadata) associated with execution of the matched process(es) in the workflow, where the data may be stored in the workflow database (see
In the case of re-running the workflow with new input data, the system may create a new pipeline execution record, or append to an existing one, such that the record includes the new output data associated with the workflow and the new input data. In an example implementation, the system may use version control to manage different sets of data associated with a workflow (or a portion of a workflow). It is appreciated that the system may maintain a single pipeline execution record for each workflow, where the pipeline execution record may include multiple data sets each associated with an execution of a workflow (or a portion of a workflow). In some embodiments, the system may store multiple pipeline execution records associated with a workflow, where each record is associated with an execution of the workflow (or a portion of the workflow).
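As a non-limiting sketch of the single-record option, the following example keeps one record per workflow and appends a numbered data set for each execution, so that earlier data sets are never overwritten. The class name, field names, and integer version numbering are assumptions for illustration; the disclosure does not prescribe a particular version control scheme.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedRecord:
    """One record per workflow, holding one data set per execution."""
    workflow: str
    datasets: list = field(default_factory=list)

    def append_execution(self, input_data, output_data):
        """Store a new data set and return its version number (1-based)."""
        version = len(self.datasets) + 1
        self.datasets.append(
            {"version": version, "input": input_data, "output": output_data}
        )
        return version

    def get(self, version):
        """Retrieve the data set recorded for a given execution."""
        return self.datasets[version - 1]


rec = VersionedRecord("toy_workflow")
v1 = rec.append_execution("input_a", "output_a")
v2 = rec.append_execution("input_b", "output_b")
print(v1, v2, rec.get(1)["output"])
```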
With further reference to
In some embodiments, the system enables flexibility of data types associated with a process in a data processing workflow.
The various embodiments described herein may be implemented to build and search data processing workflows in various applications, such as the workflows shown in
As illustrated, data associated with a process in a data processing workflow may include data or a datalake. In some embodiments, data may include the data itself or a pointer to a memory or external data source that stores the data. The datalake is an abstracted object for data storage that supports raw, native, or processed files (e.g., S3, Azure, Google Storage). The datalake itself may include metadata that allows the data to be searched (to a certain extent) without other components. A datalake may be available from a data storage device and/or platform (e.g., cloud storage) and can be downloadable locally (e.g., for faster execution).
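The distinction between storing the data itself and storing a pointer to an external data source can be sketched as follows. In this illustrative example a local temporary file stands in for a datalake object; the class name `DataPointer`, the `resolve` helper, and the file name are hypothetical, and no cloud storage API is used.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataPointer:
    """Reference to data held outside the record.

    A local file path stands in for a datalake location in this sketch.
    """
    location: str


def resolve(value):
    """Return inline data unchanged, or read the data a pointer refers to."""
    if isinstance(value, DataPointer):
        return Path(value.location).read_text(encoding="utf-8")
    return value


with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "raw_output.txt"
    raw_file.write_text("raw unparsed output", encoding="utf-8")

    # One field holds data inline; the other holds only a pointer.
    process_data = {"inline_value": 42, "raw_output": DataPointer(str(raw_file))}
    resolved = {key: resolve(value) for key, value in process_data.items()}

print(resolved)
```

Keeping only a pointer in the record lets large raw files remain in external storage while the record stays small.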
In various examples, a process in a data processing workflow may be any of a machine learning process (e.g., machine learning training, machine learning inference), a molecule object creation process (e.g., having SMILES as input data and molecule object as output data), a molecule state process that defines an electronic state of the molecule that is needed for quantum chemistry processes (e.g., having molecule object as input data and molecule state as output data), a molecule-to-conformer process that will calculate 3D coordinates for a number of the lowest conformers of the input molecule (e.g., having molecule as input data and coordinates as output data), or a geometry optimization using quantum chemistry density functional theory calculation (OPT-DFT) process (e.g., having coordinate and molecule state as input data, where the output data may include energy data, electronic data, coordinate data, or OPT-DFT datalake data for raw/unparsed outputs from the process).
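Because each example process above is characterized by its input and output data, links between processes can be checked before execution. The following minimal sketch, offered under stated assumptions, records the input and output data kinds named above as plain string labels and reports which inputs of a process are not produced by its upstream processes. The label strings and the `missing_inputs` helper are hypothetical.

```python
# Input and output data kinds per process, as (inputs, outputs).
# The labels are hypothetical shorthand for the data described above.
PROCESS_TYPES = {
    "create_molecule": ({"smiles"}, {"molecule"}),
    "molecule_state": ({"molecule"}, {"molecule_state"}),
    "molecule_to_conformer": ({"molecule"}, {"coordinates"}),
    "opt_dft": (
        {"coordinates", "molecule_state"},
        {"energy", "electronic", "coordinates"},
    ),
}


def missing_inputs(process, upstream):
    """Return the input kinds of `process` that no upstream process produces."""
    required, _ = PROCESS_TYPES[process]
    produced = set()
    for name in upstream:
        produced |= PROCESS_TYPES[name][1]
    return required - produced


# OPT-DFT needs both coordinates and a molecule state.
complete = missing_inputs("opt_dft", ["molecule_state", "molecule_to_conformer"])
incomplete = missing_inputs("opt_dft", ["molecule_to_conformer"])
print(complete, incomplete)
```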
In some embodiments, a process in a data processing workflow may also be a single point time dependent density functional theory (SP-TDDFT) quantum chemistry calculation to predict molecular electronic excited states (having coordinate and molecule state as input data, where the output data may have energy data, electronic data, excitation data, and/or SP-TDDFT datalake data). In some embodiments, a process in a data processing workflow may also include a geometry optimization process (having coordinates as input data, where the output data may also include coordinates), or an excited state calculation process (having coordinates as input data, where the output data include excited states). In some embodiments, a process in a data processing workflow may include a combination of multiple processes. For example, as shown in
With reference to
With further reference to
With reference to
In some embodiments, governance in the data and workflow manager may provide appropriate controls for data, code, and execution. In some embodiments, all communications between components and services may be encrypted, authenticated, and authorized. These security schemes protect the system against threats that may exist both inside and outside the network so that the processes in the data processing workflow may be executed securely.
With further reference to
In some embodiments, the main goal of the data and workflow manager is to support extensible abstraction for programming, execution, and query of heterogeneous workflows (mixed computational and experimental) and their associated data. The technical challenge of any collaborative platform that is scalable and intended for use by different organizations and people is to support ever-changing data formats (and schema), workflows, and tools (e.g., both AI/ML and instrument). Accordingly, the system as described herein in various embodiments is designed according to the following assumptions and features: (i) Agility: the data schema is flexible, data and metadata schema will change in the future, older schema are supported or migrated to new schema, end users can define their own schema, and part of the metadata is collected automatically; (ii) Search and query: search by metadata and data, free text search, search by knowledge and workflow graphs, and search by different users on the cloud; (iii) Security: data is immutable (append only), data access and processing are audited, data access requires authorization, and data is encrypted; (iv) Reproducibility: metadata has enough information that data can be regenerated (with high probability or with a similar probabilistic distribution for random processes); and (v) Scalability: the system can scale horizontally.
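The append-only property listed under Security can be illustrated with the following minimal sketch. The disclosure does not state how immutability is enforced; the hash chaining used here is an assumption of this example, chosen as one common way to make later modification of stored entries detectable. The class name `AppendOnlyLog` and its methods are hypothetical.

```python
import hashlib
import json


class AppendOnlyLog:
    """Append-only store in which each entry's hash covers the previous hash."""

    def __init__(self):
        self._entries = []

    def append(self, payload):
        """Add a JSON-serializable payload and return its chained hash."""
        prev = self._entries[-1]["hash"] if self._entries else ""
        body = json.dumps(payload, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        self._entries.append({"payload": body, "prev": prev, "hash": digest})
        return digest

    def verify(self):
        """Recompute the chain; return False if any entry was altered."""
        prev = ""
        for entry in self._entries:
            expected = hashlib.sha256(
                (prev + entry["payload"]).encode("utf-8")
            ).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True

    def entries(self):
        """Return decoded copies of the stored payloads."""
        return tuple(json.loads(e["payload"]) for e in self._entries)


log = AppendOnlyLog()
log.append({"process": "double", "output": 10})
log.append({"process": "increment", "output": 11})
print(log.verify())
```

Since `entries()` returns decoded copies, callers cannot change stored payloads through the returned values, and any direct modification of a stored entry causes `verify()` to fail.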
In some embodiments, machine learning, physical modeling, and experimental pipelines share many characteristics. The main difference for experimental pipelines is that they need to be synchronized with physical processes and objects in the lab. Therefore, using the same engine to execute and track both workflows provides the ability to rapidly introduce experimental and hypothesis-based workflows. This integration of workflows provides for richer searchability of the data and the construction of hypotheses and models from data generated by different workflows. Those workflows can be programmed by end users via simple YAML definition files, DSL (domain specific language), SDKs, and/or other tools. In some embodiments, the workflows may be tested, executed, and monitored via command line interface (CLI), Jupyter notebooks, web interfaces (such as what is shown in
Various embodiments as described above may store data in a graph database (e.g., workflow database in
Accordingly, the systems and methods described in various embodiments of
The various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of numerous suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a virtual machine or a suitable framework.
In this respect, various inventive concepts may be embodied as at least one non-transitory computer readable storage medium (e.g., a computer memory, one or more floppy discs, compact discs, optical discs, magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, etc.) encoded with one or more programs that, when executed on one or more computers or other processors, implement the various embodiments of the present invention. The non-transitory computer-readable medium or media may be transportable, such that the program or programs stored thereon may be loaded onto any computer resource to implement various aspects of the present invention as discussed above.
The terms “program,” “software,” and/or “application” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of embodiments as discussed above. Additionally, it should be appreciated that according to one aspect, one or more computer programs that when executed perform methods of the present invention need not reside on a single computer or processor, but may be distributed in a modular fashion among different computers or processors to implement various aspects of the present invention.
Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
Also, data structures may be stored in non-transitory computer-readable storage media in any suitable form. Data structures may have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a non-transitory computer-readable medium that convey relationship between the fields. However, any suitable mechanism may be used to establish relationships among information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationships among data elements.
Various inventive concepts may be embodied as one or more methods, of which examples have been provided. The acts performed as part of a method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
Some embodiments are directed to a system for workflow management, the system comprising at least one processor configured to: obtain a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; execute one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generate a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
In some embodiments, the process data record includes one or more pointers that reference data in one or more external data sources.
In some embodiments, the process data record includes a plurality of datasets each associated with a respective execution of the one or more processes of the plurality of processes of the workflow.
In some embodiments, obtaining the specification of the data processing workflow comprises: receiving, via a graphical user interface, user selection defining the one or more processes of the plurality of processes.
In some embodiments, the user selection includes selection of one or more processes from a library of user selectable processes.
In some embodiments, the specification of the data processing workflow comprises a script file.
In some embodiments, the pipeline execution record is stored in a graph database.
In some embodiments, the at least one processor is further configured to: receive a workflow query; use the workflow query to search the pipeline execution record for one or more processes of the data processing workflow that match the workflow query; and obtain output data associated with the one or more processes of the data processing workflow that match the workflow query.
In some embodiments, the at least one processor is further configured to: receive an input data query; determine if the input data query matches input data associated with the one or more processes of the data processing workflow that match the workflow query; in response to determining a match of the input data query, obtain the output data by retrieving output data associated with the one or more processes of the data processing workflow in the pipeline execution record; and in response to determining a non-match of the input data query: (1) execute the one or more processes of the data processing workflow that match the workflow query to generate the new output data; and (2) obtain the new output data as the output data associated with the one or more processes of the data processing workflow that match the workflow query.
In some embodiments, the pipeline execution record is stored in a graph database; and the workflow query comprises a sub-graph.
In some embodiments, the at least one processor is further configured to: display the pipeline execution record in a graph; and receive user selection defining at least a portion of the graph as the workflow query.
In some embodiments, the process data record comprises one or more pointers that reference data in one or more external data sources, and obtaining the output data associated with the one or more processes of the data processing workflow that match the workflow query comprises: retrieving the output data from at least one of the one or more external data sources using at least one of the one or more pointers.
Some embodiments are directed to a method for workflow management, the method comprising, using at least one processor: obtaining a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; executing one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generating a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
In some embodiments, the process data record includes: one or more pointers that reference data in one or more external data sources; or optionally, a plurality of datasets each associated with a respective execution of the one or more processes of the plurality of processes of the workflow.
In some embodiments, obtaining the specification of the data processing workflow comprises: receiving, via a graphical user interface, user selection defining the one or more processes of the plurality of processes.
In some embodiments, the user selection includes selection of one or more processes from a library of user selectable processes.
In some embodiments, the method further comprises: receiving a workflow query; using the workflow query to search the pipeline execution record for one or more processes of the data processing workflow that match the workflow query; and obtaining output data associated with the one or more processes of the data processing workflow that match the workflow query.
In some embodiments, the method further comprises: receiving an input data query; determining if the input data query matches input data associated with the one or more processes of the data processing workflow that match the workflow query; in response to determining a match of the input data query, obtaining the output data by retrieving output data associated with the one or more processes of the data processing workflow in the pipeline execution record; and in response to determining a non-match of the input data query: (1) executing the one or more processes of the data processing workflow that match the workflow query to generate the new output data; and (2) obtaining the new output data as the output data associated with the one or more processes of the data processing workflow that match the workflow query.
Some embodiments are directed to a non-transitory computer-readable medium comprising instructions that, when executed, cause at least one processor to perform operations comprising: obtaining a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; executing one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generating a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
In some embodiments, the operations further comprise: receiving a workflow query; using the workflow query to search the pipeline execution record for one or more processes of the data processing workflow that match the workflow query; and obtaining output data associated with the one or more processes of the data processing workflow that match the workflow query.
The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.” As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This allows elements to optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified.
The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
As used herein in the specification and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the claims, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e. “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.
Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed. Such terms are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term).
The phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” “having,” “containing”, “involving”, and variations thereof, is meant to encompass the items listed thereafter and additional items.
Having described several embodiments of the invention in detail, various modifications and improvements will readily occur to those skilled in the art. Such modifications and improvements are intended to be within the spirit and scope of the invention. Accordingly, the foregoing description is by way of example only, and is not intended as limiting.
Various aspects are described in this disclosure, which include, but are not limited to, the following aspects:
This application claims the benefit of U.S. Provisional Application No. 63/282,584, filed Nov. 23, 2021, entitled, “TECHNIQUES FOR COMBINED DATA AND EXECUTION DRIVEN PIPELINE,” the entire content of which is incorporated herein by reference.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2022/050675 | 11/22/2022 | WO | |

| Number | Date | Country |
|---|---|---|
| 63282584 | Nov 2021 | US |