The invention pertains to the field of cloud computing, and in particular relates to a general description language data system for directed acyclic graph type automatic task flow.
Nowadays, the trend of professional segmentation in the field of scientific computing is obvious. The development of detailed problem-solving algorithms and value-oriented engineering application have split into two development directions, and combining these subdivided professional methods to reach a goal has become an indispensable requirement. Ever finer method granularity also raises both the cost of learning a large number of methods and the labor cost of combining them, so automated workflow technology is widely used in various fields. In the field of scientific computing, the SBP (Seven Bridges Platform) developed by SBG (Seven Bridges Genomics) in the United States is regarded as the company's core technology. It integrates algorithms, data hosting, and computing resources, constructs flexible genome data analysis processes, and performs calculations through a visual interface. In the open source field, there is also a common task flow description language standard, CWL (Common Workflow Language), which mainly targets data-intensive scientific computing and uses text-format task flow descriptions to connect the execution of different tasks with command line tools. Mainstream cloud computing resource providers such as AWS, Google, and AliCloud, and computing standards such as HPC and Spark, provide support for CWL.
A task flow description language has two core functions: defining specific tasks and defining the flow between tasks. Defining specific tasks describes the input, output, and execution method of each task; defining the flow describes the order of task execution and the path of data flow. This core information is provided to a specific task flow engine for parsing, after which the task flow can be executed automatically.
A closed task flow platform lacks openness and ease of use. It can only orchestrate the algorithm tasks provided internally, which makes it difficult to keep up with rapidly developing computing needs, and it cannot access flexible computing resources. We are therefore mainly concerned with a general task flow description language.
For the existing general description language, the main defects are as follows:
1. Rough granularity of task description:
The single task description granularity of the existing general description language is very rough. The user needs to define the input parameter group and the acquisition method, but the language does not involve the type and detailed structure of the data, so that the user needs to have a deep understanding of the specific characteristics of the task when using it. Even if the wrong data type and structure are provided, data checking and verification cannot be provided.
2. The required level of computer knowledge is high, which makes orchestration inconvenient for non-computer professionals:
The existing general description language is tightly coupled with computer programming technology, exposing a large number of low-level details and computer-specific terms. Users need to have certain knowledge in the computer field to write, and it cannot meet the needs of algorithm writers (computer engineers) and algorithm users (scientists) at the same time.
3. The information is complex and lacks data reuse
When the existing general description language is applied to high-performance scientific computing, its generality means that a large number of parameters arising from the characteristics of the scientific computing field need to be defined and repeatedly input, and there is no data template or data overlay filling ability.
4. Lack of automatic parallel primitives
In the field of scientific computing, parallelism is an indispensable ability due to huge computing requirements. The existing general description language lacks parallel description primitives like map-reduce or scatter-gather, and cannot provide automatic parallelism for single-point tasks.
Based on this, it is necessary to provide a general description language data system for directed acyclic graph automatic task flow. The system comprises a Step definition layer, a Workflow definition layer, and a Template definition layer.
The Step definition layer describes a single task. For the input and output declarations of a docker image or other executor, the name, type, file, and parameters of each input and output item must be declared explicitly.
The Workflow definition layer is a workflow composed of one or more Steps. The dependency topology of these Steps needs to be defined and shared parameters can also be defined.
The Template definition layer is based on a Workflow definition layer. The Template definition layer pre-sets the parameters, and supplies the descriptions of the parameters, the checkers of the parameters or the data source definitions of the parameters.
The present invention adopts the above technical solutions and customizes types through the TypeDef definition layer. The TypeDef definition layer supports multiple keywords such as type, required, symbols, value, const, ref, embed_value, etc., and expands the details of the data through these keywords to improve the data description granularity. The Step definition layer describing a single task refers to different TypeDef definition layers in its inputs and outputs declarations to declare its own input and output, so as to achieve a high-precision and complete task description. This solves the problem that the single task description granularity of the existing general description language is very rough.
The specific steps by which the Template definition layer is applied to the Workflow definition layer are as follows: parse the values item in the Template definition layer. The values item contains data such as $step2/$in_arg2: {const: 1.0}. The first field consists of the Step name before the slash and the input item name after the slash; these two parameters locate the specific mapping in the Workflow definition layer. In the second field, the parameter before the colon indicates the property of the data (in this embodiment, the data is a constant), and the parameter after the colon is the specific data. When the Template definition layer is applied, the property and data given by the second field override the input item identified by the Step name and input item name in the first field.
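The override step described above can be illustrated with a minimal Python sketch. The dictionary layout of the workflow (a `steps` mapping whose entries hold an `in` mapping), the `double` type on the sample input, and the function name `apply_template` are assumptions made for illustration; the description only fixes the `$step/$input: {property: data}` form of each values entry.

```python
import copy

# Keywords that carry a value or a value source; a template entry replaces them.
VALUE_KEYS = ("value", "const", "ref")


def apply_template(workflow, values):
    """Return a copy of `workflow` with the template `values` applied.

    `workflow` is assumed (hypothetically) to look like
    {"steps": {step_name: {"in": {input_name: {...definition...}}}}}.
    """
    result = copy.deepcopy(workflow)
    for key, override in values.items():
        # "$step2/$in_arg2" -> Step name "step2", input item name "in_arg2"
        step_part, input_part = key.split("/", 1)
        step_name = step_part.lstrip("$")
        input_name = input_part.lstrip("$")
        target = result["steps"][step_name]["in"][input_name]
        # Drop any previous value source before applying the template entry.
        for old_key in VALUE_KEYS:
            target.pop(old_key, None)
        target.update(override)
    return result


workflow = {"steps": {"step2": {"in": {"in_arg2": {"type": "double", "value": 0.5}}}}}
patched = apply_template(workflow, {"$step2/$in_arg2": {"const": 1.0}})
print(patched["steps"]["step2"]["in"]["in_arg2"])
```

The sketch works on a copy so that the same Workflow definition can be reused with several Templates; that choice is also an assumption and is not stated in the description.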
The references among all levels are implemented through url. For example, (type:^typedef/common/version/1.0.0) means that the type of the variable introduces a TypeDef definition layer named common, version 1.0.0; the original text of the definition is obtained by requesting these parameters from the data center.
Preferably, the system also comprises a TypeDef definition layer. If users need to use special custom types, they need to write the TypeDef definition layer. The TypeDef definition layer mainly abstracts the definitions of general or complex composite types, which is convenient for reference and management.
The present invention further adopts the above technical solution, and its advantage is that the required, type, value, const, serializer, symbols and other keywords in the TypeDef definition layer not only provide type declarations, but also support detailed declarations such as default values, constants, and enumeration values. By matching the input data with these declarations, more refined data checking and verification can be achieved.
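As a rough illustration of how input data might be matched against such declarations, the following Python sketch checks a single value against the required, type, symbols, and const keywords. The function name `check_value`, the mapping from language type names to Python classes, and the error messages are assumptions; the description does not prescribe how a checker is implemented.

```python
# Python classes standing in for the scalar type names of the language (assumed mapping).
SCALAR_TYPES = {"int": int, "double": float, "boolean": bool, "string": str}


def check_value(name, definition, value):
    """Return a list of problems found; an empty list means the value passes."""
    if value is None:
        has_default = "value" in definition or "const" in definition
        # The required keyword defaults to true when it is not declared.
        if definition.get("required", True) and not has_default:
            return [f"{name}: a value is required"]
        return []

    problems = []
    expected = SCALAR_TYPES.get(definition.get("type"))
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {definition['type']}, got {type(value).__name__}"
            )
    if "symbols" in definition and value not in definition["symbols"]:
        problems.append(f"{name}: {value!r} is not one of {definition['symbols']}")
    if "const" in definition and value != definition["const"]:
        problems.append(f"{name}: constant {definition['const']!r} cannot be changed")
    return problems


print(check_value("cores", {"type": "int"}, "16"))
```

Composite types such as array, record, or obj are deliberately left unchecked here; a fuller checker would recurse into them.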
The present invention also provides a solution for realizing the reference of data among the above four definition layers, comprising:
1. Step definition layer will only reference TypeDef definition layer, which is achieved by filling in the reference url when defining the type of data, such as:
Indicates that the Step definition layer refers to the type definition named jobArgs in the TypeDef definition layer named common.
2. Workflow definition layer will only refer to the Step definition layer, which is achieved by filling in the url when declaring the run field of the Step definition layer used, such as:
Indicates that the Workflow definition layer refers to a Step definition layer named demo with version 1.0.0.
3. Template definition layer will only refer to Workflow definition layer, which is achieved by filling in the url in the workflow field declared in the metadata, such as:
Indicates that the Template definition layer is applied to the Workflow definition named some_workflow.
Correspondingly, the present invention also provides a parse method using the system, comprising the following steps:
recursive analysis: pull the input files and all files that the input files depend on from the data center to the local machine. The parser then recursively traverses each value of each input file. If a value is an external link beginning with ^, the parser downloads the corresponding file through the data center Client-side and repeats this step for the new file until all dependent links are ready.
syntax tree analysis: since each value of the description language has a priority and coverage relationship, in order to realize the data logic of the layer coverage, it is necessary to construct and apply the coverage layer by layer from the bottom layer;
parse the Template files, traverse the specific variables and values in the Template, and index to a certain input and output value of a Step in the Workflow object for override operation.
object loading: after the parsing is completed, an object tree will be obtained, and the workflow object is the root node. The workflow contains all the Step objects through the steps property, and the Step objects contain all the TypeDefs objects through the inputs/outputs properties.
Besides constructing a well-defined object tree and performing hierarchical assignment, the second important algorithm of the parser is to perform topological sorting on Workflow objects. The user defines the dependencies among different Step objects, and the topological sorting algorithm works out the most efficient running order of the Steps.
Preferably, the parse method includes:
First step, parses the type definition file whose class is TypeDef. All objects of the TypeDef definition layer are constructed from the contents of the file and stored in memory as a K:V mapping.
Second step, constructs the Step objects and parses all files whose class is Step. The Step objects are constructed from the content of the file. The inputs/outputs properties of the Step objects contain several TypeDef objects. If the Step applies a variable of a custom type, then take the loaded object from the TypeDef K:V mapping to replace the object in Step, and perform the value override operation.
Third step, constructs the Workflow object, parses the file whose class is Workflow, and constructs the Workflow object from the content of the file. The steps property of the Workflow object contains all the Steps involved in the workflow, which is stored in the mapping mode of StepName: StepObject. Workflow object fetches all its dependent Step objects from the Step definition layer and stores them in its own steps property, and overrides the values in the Workflow definition layer with the values in the Step object according to the content of the file.
Preferably, the method of the topological sorting algorithm comprises the following steps:
Step A: find a FollowMap through the ref link marked in the inputs of each Step, and the FollowMap is a mapping of the list of <depended on Step: dependent Step>;
Step B: after getting the FollowMap, invert the mapping of FollowMap to get LeaderMap, which is the mapping of <stepName: Step list on which this Step depends>;
Step C: introduce the concept of Distance, which is abbreviated as Dis in the flowchart, which means the dependent distance to be run, and the default is 1;
Step D: traverse all Step objects. If a Step object has not been checked, traverse the LeaderSteps of the Step object. If a Step object does not have a Leader, the Step object has no dependencies; it is deemed to have been checked and its Dis is set to 1. If a Step object does have a Leader, the Step object is dependent, and the Dis of its LeaderSteps is added to the Dis of the Step, and so on.
The topological dependencies among the Step objects are declared in the inputs of each Step, and the core recursive idea draws on the topological sorting algorithm from mathematics. FollowMap and LeaderMap are two adjacency representations of the same dependency graph. The starting points are determined by LeaderMap, and the Dis of each starting-point Step is set to 1. The Dis of an intermediate Step is the sum of the Dis values of the Step objects on the paths from the starting points to this Step. Sorting by Dis gives the most efficient running sequence. When a Step object has been executed, it is only necessary to recursively subtract from the Dis of the subsequent nodes according to the FollowMap to update the running sequence for the current state.
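Steps A to D can be sketched in Python as follows. This is one possible reading rather than the claimed implementation: the function names, the plain-dictionary input format, and the interpretation of "and so on" as Dis(step) = 1 + sum of the Dis of its Leaders are assumptions. Refs whose first component is not a Step name (for example a shared variable) are ignored.

```python
def build_follow_map(steps):
    """Step A: map each depended-on Step to the Steps that depend on it.

    `steps` is assumed to be {step_name: {input_name: ref_string}}, where a
    ref such as "$step1/$output_list1" points at an output of step1.
    """
    follow_map = {name: [] for name in steps}
    for name, inputs in steps.items():
        for ref in inputs.values():
            leader = ref.split("/")[0].lstrip("$")
            if leader in steps and name not in follow_map[leader]:
                follow_map[leader].append(name)
    return follow_map


def invert(follow_map):
    """Step B: map each Step to the list of Steps it depends on."""
    leader_map = {name: [] for name in follow_map}
    for leader, followers in follow_map.items():
        for follower in followers:
            leader_map[follower].append(leader)
    return leader_map


def compute_dis(leader_map):
    """Steps C and D: Dis is 1 for a Step without Leaders, else 1 + Leaders' Dis."""
    dis = {}
    visiting = set()

    def visit(name):
        if name in dis:
            return dis[name]
        if name in visiting:
            raise ValueError(f"dependency cycle through {name}")
        visiting.add(name)
        dis[name] = 1 + sum(visit(leader) for leader in leader_map[name])
        visiting.discard(name)
        return dis[name]

    for name in leader_map:
        visit(name)
    return dis


steps = {
    "step1": {"arg1": "$vars/$share_arg1"},
    "step2": {"in_arg1": "$step1/$output_list1"},
    "step3": {"a": "$step1/$output_list1", "b": "$step2/$gather_outputs"},
}
dis = compute_dis(invert(build_follow_map(steps)))
print(sorted(dis, key=dis.get))
```

Because every Step ends up with a strictly larger Dis than each of its Leaders, sorting by Dis in ascending order always yields a valid execution order. The names step3 and `$vars/...` in the usage example are hypothetical.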
The present invention further adopts the above technical solutions to solve the problem that a large number of parameters arising from the characteristics of the scientific computing field must be defined, and to provide the data template and data overlay filling ability.
The present invention brings the following beneficial effects:
1. Detailed description of input and output: It describes in detail the matching of input and output by checking whether the type keywords are exactly the same.
2. The ref keyword specifies the data source of an input item as the type and value of a certain output item (for example, $arg1: {ref: $step1/$output_list1} in the Workflow definition layer embodiment indicates that the input item arg1 is linked to the output item named output_list1 of step1). Custom types are also supported.
3. Additional description: the doc keyword contained in the TypeAndValueDefine substructure adds description information, which is only displayed as a comment and takes no part in parsing or calculation. In addition to such description texts that are unrelated to calculations, specific input detection and data conversion can be provided based on type information.
4. The decoupling of domain knowledge is achieved through the separation of the four-layer structure. In the use scenario, professionals in the computer field write the TypeDef definition layer; professionals in the scientific computing field write the Step definition layer and reference Step definition layers to write the Workflow definition layer; task operation professionals use the Workflow definition layer in combination and write the Template definition layer based on summaries of submission experience. Domain knowledge is thereby decoupled: a single task is completely decoupled from workflow orchestration, and there is no need to understand computer-related knowledge or specific algorithm task details. As long as the input and output types of the task definitions match, the connection can be arranged.
5. Automatic concurrency primitives: specific distribution parameters are declared through the scatter_gather keyword, the input data list is split into several data groups according to the distribution parameters, and multiple subtasks are created, with each data group sent to one subtask for parallel computing. This supports the declaration of scatter-gather automatic concurrent subtasks.
6. The parallel capability provided by automatic concurrent subtasks enables the workflow defined by the description language to be used in multiple scenarios such as computing acceleration, data analysis, and streaming computing. It is no longer limited to a simple description of providing input and output for calculation. And the domain decoupling ability is strengthened. Algorithm experts focus on solving abstract problems for development. The concept of parallel ability in the computer field does not need to be considered. When actual computing problems require batch parallel processing, the description language meets this ability and expands the use of tasks and capabilities.
7. Data template application, data coverage transmission: the data among the four definition layers all have a reference relationship, which can be covered according to priority, and different data templates can be used to achieve one-click configuration or default parameter configuration.
The present invention is only a set of language standards. The language provides definitions of all necessary information, but running specific tasks requires an interpreter, a data center, and task execution tools, and programming languages need to be used to implement the corresponding tools. The data center needs to be able to store each definition file and index to the corresponding file through the reference link. The interpreter needs to read all the definition content and assign the corresponding data to the definition structure according to the reference link. Task execution tools need the complete structured data obtained through the interpreter, and follow this information to schedule and submit tasks.
The TypeDef definition layer is not necessary. If users need to use special custom types, they need to write the TypeDef definition layer. The TypeDef definition layer mainly abstracts the definitions of general or complex composite types for easy reference and management. The Step definition layer is the description of a single task; for the input and output declarations of a docker image or other executor, it is necessary to specifically declare the name, type, file, and parameters of each input and output item. The Workflow definition layer is a workflow composed of one or more Steps; the dependency topology of these Steps needs to be defined, and shared parameters can also be defined. The Template definition layer is based on a Workflow definition layer. The Template definition layer pre-sets the parameters, and supplies the descriptions of the parameters, the checkers of parameters or the data source definitions of the parameters. A Template definition layer must declare a unique Workflow definition layer. References among layers are implemented through url; for example, (type:^typedef/common/version/1.0.0) indicates that the type of the variable introduces a TypeDef definition layer named common, version 1.0.0.
The xwlVersion describes the version of the description language, used to distinguish version iterations brought about by the continuous addition of functions; class describes the type of this file, there are four types (TypeDef definition layer, Step definition layer, Workflow definition layer, Template definition layer); version describes the definition of Version; author describes the author's information; doc describes the annotations for the file; name describes the name of the file, the author needs to keep the name unique when writing the same type of file;
A substructure named TypeAndValueDefine is defined in the description language, which contains the type, name, value, and several properties, which are used to define a variable in detail. The following are three representative examples of TypeAndValueDefine:
The specific steps are:
first step: define the name of the substructure at the outermost layer, the outermost is the name of the substructure, and follows the language principle to start with $;
second step: define the general keywords needed to describe the properties of the substructure: type keyword is type description, supports int, double, boolean, string, array, file, dir, dict, record, obj, and can be identified as a list by adding [] suffix; const and value are mutually exclusive keywords, indicating the value represented by the substructure definition, value is a variable value, const is an immutable value; ref is a keyword mutually exclusive with value/const to identify the source of the ref value as a reference to another TypeAndValueDefine substructure; the required keyword is whether the TypeAndValueDefine substructure must have a value, the default is true; the doc keyword is the description; the symbols keyword is an enumerated value range, which is used when the TypeAndValueDefine substructure needs to limit value range;
third step: define the special keywords that describe the substructure in a specific type.
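The general keyword rules above lend themselves to a simple well-formedness check. The Python sketch below is illustrative only: the function name `check_definition` and the error wording are assumptions, and a type beginning with ^ is treated as an external reference to a TypeDef without being resolved.

```python
# Built-in type names listed for the type keyword.
BASE_TYPES = {
    "int", "double", "boolean", "string", "array",
    "file", "dir", "dict", "record", "obj",
}


def check_definition(name, definition):
    """Return a list of rule violations for one TypeAndValueDefine substructure."""
    errors = []
    if not name.startswith("$"):
        errors.append(f"{name}: substructure names must start with $")

    type_name = definition.get("type", "")
    # A [] suffix marks a list of the base type.
    base = type_name[:-2] if type_name.endswith("[]") else type_name
    if base not in BASE_TYPES and not base.startswith("^"):
        errors.append(f"{name}: unknown type {type_name!r}")

    # value, const and ref are mutually exclusive.
    sources = [key for key in ("value", "const", "ref") if key in definition]
    if len(sources) > 1:
        errors.append(f"{name}: {', '.join(sources)} are mutually exclusive")
    return errors


print(check_definition("$cores", {"type": "int", "value": 16, "const": 16}))
```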
In the second example, the substructure is a definition of type folder. The autoSyncInterval keyword can be defined as the time interval for automatic synchronization; the autoSyncIgnore keyword is a list of file names that are ignored by default, and supports regular expression syntax.
In the third example, the substructure is a definition with a type of custom object. The serializer keyword can be defined as the codec definition required by the object definition; the saveType keyword is the storage method, which can be file/string; the fileExt is the suffix name of the stored file, used when the saveType is file; the encoder keyword is the encoder url, and the linked encoder needs to be an executable method that accepts an object and returns string data; the decoder keyword is the decoder url, and the linked decoder needs to be an executable method that accepts string data and returns an object. The codec follows the external link guidelines, using the ^ prefix, and py: identifies it as a python method.
The following is an embodiment of a TypeDef definition named common (the general information part will not be repeated):
The specific steps are:
first step: define the typeDefs keyword at the outermost layer. The typeDefs keyword contains some TypeAndValueDefine substructures. For example, the definition declares a record data named struct, the fields is a subkey declaration of the record type, and it contains two properties, cores and memory;
second step: in the TypeAndValueDefine substructure that uses the type definition, it is declared in the type through a fixed format link. The following is a TypeAndValueDefine substructure example that uses the typeDef:
The Step definition layer contains a specific description of a calculation task. The following is an embodiment of a Step definition layer (the general information part has been omitted)
First step: define four main keywords describing the Step definition layer property: entryPoint, jobArgs, inputs, outputs. The entryPoint is the execution entry of the Step definition layer, such as the loader.py file located in the /home/job/run directory and executed with python in the embodiment. The jobArgs is the execution parameter of Step definition layer, the referenced TypeDef definition layer is used in the embodiment, and the default value of 24000 MB for 16 cores is given.
Second step is to define the input and output items: inputs/outputs are the input and output parameters list of the Step definition layer, and there are several TypeAndValueDefine substructures inside.
The Workflow definition layer contains several Step declarations and the parameter dependencies among Steps. Here is an embodiment of a Workflow definition layer (the general information part has been omitted):
The specific steps are:
first step: define the shared variable pool vars that needs to be reused at the outermost layer: the vars keyword is a group used to define the pool for the shared variables in the file. If multiple steps in the workflow need to share a group of inputs, it can be referenced by the ref keywords; the step keywords are the Step objects used in the workflow and their dependent topology, and the internal declaration is step name and Step object as key-value pair;
second step: define the Steps used and their topological relationship.
Under the steps keyword, there are two step declarations named step1 and step2.
In the declaration of step1, the run is the specific definition url of the step, which is represented by an external link starting with ^ that follows the guidelines, and which here introduces the 1.0.0 version of the definition named demo; the jobArgs keyword maps to the jobArgs defined in the Step, and a default value is assigned to it here; the in keyword is the declared input parameter, and the value of a parameter named arg1 is declared here to refer to the value of share_arg1 in the shared variables. The naming in in needs to be consistent with the name of the input item in the inputs of the Step definition layer; the out keyword is a parameter that is enabled in the workflow, and the name needs to be consistent with the name of the output item in the outputs of the Step definition layer.
In the declaration of step2, an automatic concurrency step declared using the scatter-gather primitive is shown. The jobArgs can be omitted when the default value is not assigned; the scatter keyword declares that this is a concurrent step.
The scatter keyword distributes each element in the received input list, through the zip mapping, to as many subtasks as the input list has elements. Under the scatter definition: the maxJob/minJob keywords are the concurrent number range of the task; the zip is the concurrent batch parameter mapping of this task. There are several TypeAndValueDefine substructures under zip. Since the task definition is oriented to a single input, a parameter mapping needs to be defined to indicate how the multiple parameters received are mapped to the subtask input items that need to be concurrent. For example, this embodiment declares an array type named scatter_in_arg1, which accepts the result of the step1 task named output_list1; the jobIn keyword is the original input of the step, and there are several TypeAndValueDefine substructures inside. The name must be consistent with the input name in the inputs of the Step definition layer. For example, in_arg1 here declares that its value comes from scatter_in_arg1 in the zip map. It means that each element in the list received by scatter_in_arg1 will be distributed to the in_arg1 item of each sub job when it is running.
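The zip distribution described above can be sketched in Python as follows. The function name `scatter` and the two-dictionary calling convention are assumptions for illustration; maxJob/minJob, which bound how many subtasks run at once, are not modeled here.

```python
def scatter(zip_inputs, job_in):
    """Build one input dict per subtask.

    zip_inputs: {zip_name: list of values}, e.g. {"scatter_in_arg1": [...]}
    job_in:     {step_input_name: zip_name}, e.g. {"in_arg1": "scatter_in_arg1"}
    """
    lengths = {len(values) for values in zip_inputs.values()}
    if len(lengths) != 1:
        raise ValueError("expected at least one zip list, all of the same length")
    count = lengths.pop()
    # Element i of every zip list goes to subtask i.
    return [
        {input_name: zip_inputs[zip_name][i] for input_name, zip_name in job_in.items()}
        for i in range(count)
    ]


subtasks = scatter(
    {"scatter_in_arg1": ["a.dat", "b.dat", "c.dat"]},
    {"in_arg1": "scatter_in_arg1"},
)
print(subtasks)
```

The file names in the usage example are hypothetical placeholders for the elements of output_list1.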
The gather keyword aggregates the output results of multiple subtasks through unzip mapping into an output list. Under the definition of gather: the failTol is the failure tolerance rate of the subjob, which is a decimal in the range of 0-1. If the proportion of failed tasks is greater than this decimal, the step is considered to have failed and retrying is abandoned; the retryLimit is the maximum number of failed retries allowed. If some subtasks fail and the proportion of failures is less than the fault tolerance rate, the retry will not exceed retryLimit; the jobOut is the output item in the original Step definition layer that is enabled, and the name needs to be consistent with the output item in the Step definition layer; the unzip is the mapping of parameter aggregation. For example, in this embodiment, the unzip declares a definition named gather_outputs that aggregates the outputs items of all subtasks.
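The interaction of failTol and retryLimit can be sketched in Python as below. The function name `run_and_gather` is hypothetical, subtasks are run sequentially for clarity rather than concurrently, and one point is an explicit assumption because the description leaves it open: failures that remain within the tolerance after the last retry are tolerated and left as `None` in the gathered list.

```python
def run_and_gather(subtask_inputs, run, fail_tol, retry_limit):
    """Run `run` on every subtask input and gather the outputs into one list."""
    total = len(subtask_inputs)
    results = [None] * total
    pending = list(range(total))
    retries = 0
    while pending:
        failed = []
        for index in pending:
            try:
                results[index] = run(subtask_inputs[index])
            except Exception:
                failed.append(index)
        # Too many failures: the whole step fails and retrying is abandoned.
        if len(failed) / total > fail_tol:
            raise RuntimeError("step failed: failure ratio exceeds failTol")
        if retries >= retry_limit:
            break  # assumption: remaining failures are tolerated and stay None
        retries += 1
        pending = failed
    return results


attempts = {}


def flaky(inputs):
    """Hypothetical subtask that fails once for the value 2, then succeeds."""
    x = inputs["in_arg1"]
    attempts[x] = attempts.get(x, 0) + 1
    if x == 2 and attempts[x] == 1:
        raise RuntimeError("transient failure")
    return x * 10


inputs = [{"in_arg1": 1}, {"in_arg1": 2}, {"in_arg1": 3}]
print(run_and_gather(inputs, flaky, fail_tol=0.5, retry_limit=2))
```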
The output keyword at the outermost layer means the final output of the workflow. For example, in this embodiment, an output named out_wf_arg1 is defined, and its value is derived from the aggregate result of step2 gather_outputs.
The Template definition layer is used to specify a set of preset values as a data template to be applied to the workflow. The following is an embodiment of a Template definition layer:
The specific steps are:
First step: define the target Workflow definition layer keyword applied by the Template definition layer: workflow defines the url for the workflow to which the Template definition layer is applied;
Second step: define the pre-filled values for the above workflow: the values are used to fix some values that need to be filled, and only support data in the form of value/const. As in the above embodiment, the defined value named share_args1 in the shared variable vars is filled with the variable value 233, and the defined value named in_arg2 in the step2 input is filled with the immutable value 1.0.
The specific process of publishing Step definition layer is as follows.
Data center:
the data center is a simple C/S architecture service, which manages the index through the Server-side database and the file system manages specific data content; the Client-side performs simple parsing, uploading, and downloading.
Upload workflow:
the user submits a description language file to the Client-side. The Client-side reads the content of the file, obtains the specific type, name, and version parameters by analyzing the class, name, and version fields, then sends a request with the file content to the Server-side; the Server-side indexes the database through the corresponding parameters; if a file with the same type, name, and version already exists, a parameter check failure is returned; if it does not exist, a new file address is generated, and the detailed information is added to the database. Then the Server-side accesses the file system to store the file at the new file address, and then returns the result to the Client-side.
Download workflow:
The user accesses the Server-side with the type, name, and version parameters; the Server-side indexes the database through the corresponding parameters and returns a NotFound error if no matching file exists; if a matching file exists, the Server-side obtains the specific file address, accesses the file system through the file address to obtain the file content, and returns the result to the Client-side.
The advantage of the above scheme is that using the file system to store description language files instead of directly storing them in the database not only preserves the original granularity of the data, but also ensures the integrity of the description language files. Using the file system to store larger description language files also improves the performance of the database. When requesting files in batches, it can index addresses faster and use multi-threading to speed up file reading.
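The upload and download workflows can be illustrated with a small in-memory Python sketch. It is not the described C/S service: the class name `DataCenter`, the two dictionaries standing in for the Server-side database and the file system, the address format, and the exception types are all assumptions made for illustration.

```python
class DataCenter:
    """In-memory stand-in: `index` plays the database, `files` the file system."""

    def __init__(self):
        self.index = {}  # (class, name, version) -> file address
        self.files = {}  # file address -> file content

    def upload(self, definition):
        # The Client-side reads class, name and version from the file content.
        key = (definition["class"], definition["name"], definition["version"])
        if key in self.index:
            raise ValueError("parameter check failed: file already exists")
        address = "/".join(key)  # hypothetical address scheme
        self.index[key] = address
        self.files[address] = definition
        return address

    def download(self, doc_class, name, version):
        address = self.index.get((doc_class, name, version))
        if address is None:
            raise KeyError("NotFound")
        return self.files[address]


center = DataCenter()
center.upload({"class": "Step", "name": "demo", "version": "1.0.0"})
print(center.download("Step", "demo", "1.0.0"))
```

Keeping the index separate from the content mirrors the design choice in the text: lookups touch only the small index, while the content store can be read independently.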
Parser:
The parser is an independent and offline data analysis tool, mainly through recursive analysis, syntax tree analysis, object loading, application linking, and application of values layer by layer and other steps to parse the complete definition.
Before parsing the content, first pull the input files and all files that the input files depend on from the data center to the local machine. The parser recursively traverses each value of the first input file. If a value is an external link beginning with ^, the corresponding file of the link is downloaded through the data center Client-side, and this step is repeated for the new file until all dependent links are ready.
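The recursive pull can be sketched in Python as below. The names `find_links` and `resolve_links` and the `fetch` callback, which stands in for the data center Client-side, are assumptions. The sketch treats every string value beginning with ^ as a downloadable definition link, which is a simplification: codec links marked with py: would need separate handling.

```python
def find_links(value):
    """Yield every string beginning with ^ found in nested dicts and lists."""
    if isinstance(value, str):
        if value.startswith("^"):
            yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from find_links(item)
    elif isinstance(value, list):
        for item in value:
            yield from find_links(item)


def resolve_links(root, fetch):
    """Download every file that `root` depends on, directly or indirectly.

    `fetch(link)` is a hypothetical callback returning the parsed content of
    the linked file. Each link is fetched at most once.
    """
    loaded = {}
    pending = list(find_links(root))
    while pending:
        link = pending.pop()
        if link in loaded:
            continue
        loaded[link] = fetch(link)
        # Repeat the traversal for the newly downloaded file.
        pending.extend(find_links(loaded[link]))
    return loaded


store = {"^typedef/common/version/1.0.0": {"typeDefs": {}}}
root = {"inputs": {"$arg1": {"type": "^typedef/common/version/1.0.0"}}}
print(list(resolve_links(root, store.__getitem__)))
```

An explicit work list is used instead of direct recursion so that a link referenced from several files is downloaded only once.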
Since each layer of the description language has a priority and coverage relationship, in order to realize the data logic of the layer coverage, it is necessary to construct and apply the coverage layer by layer from the bottom layer. First step, parses the type definition file whose Class is TypeDef. All objects of TypeDef definition layer are constructed from the contents of the file and stored in memory as a K:V mapping.
Second step, constructs the Step objects and parses all files whose class is Step. The Step objects are constructed from the content of the files. The inputs/outputs properties of the Step objects contain several TypeDef objects. If the Step applies a variable of a custom type, the loaded object is taken from the TypeDef K:V mapping to replace the object in Step and the value override operation is performed.
Third step, constructs the Workflow object, parses the file whose class is Workflow, and constructs the Workflow object from the content of the file. The steps property of the Workflow object contains all the Steps involved in the workflow, which is stored in the mapping form of StepName: StepObject. Workflow object fetches all its dependent Step objects from the Step definition layer and stores them in its own steps property, and overrides the values in the Workflow definition layer with the values in the Step object according to the content of the file.
Finally, the Template definition layer is parsed, the specific variables and values in the Template definition layer are traversed, and a certain input and output value of a Step in the Workflow definition layer is indexed for override operation.
After the parsing is completed, an object tree will be obtained; the workflow object is the root node. The workflow contains all the Step objects through the steps property and the Step objects contain all the TypeDefs objects through the inputs/outputs properties.
Besides constructing a well-defined object tree and performing hierarchical assignment, the second important algorithm of the parser is to perform topological sorting on Workflow objects. The user defines the dependencies among different Steps, and the topological sorting algorithm works out the most efficient running order of the Steps.
Find a FollowMap through the ref link marked in the inputs of each Step. The FollowMap is a mapping of the list of <depended-on Step: dependent Step>.
After getting the FollowMap, invert the FollowMap mapping to get the LeaderMap, which is the mapping of <stepName: Step list on which this Step depends>.
Introduce the concept of Distance, which is abbreviated as Dis in the flowchart, which means the distance of dependence from being run. The default is 1 (can be run directly).
Traverse all Step objects. If a Step object has not been checked, traverse the LeaderSteps of the Step object. If a Step object does not have a Leader, the Step object has no dependencies; it is deemed to have been checked and its Dis is set to 1. If a Step object does have a Leader, the Step object is dependent, and the Dis of its LeaderSteps is added to the Dis of the Step, and so on.
The core recursive idea draws on the topological sorting algorithm from mathematics. FollowMap and LeaderMap are two adjacency representations of the same dependency graph. The starting points are determined by LeaderMap, and the Dis of each starting-point Step is set to 1. The Dis of an intermediate Step is the sum of the Dis values of the Steps on the paths from the starting points to this Step. Sorting by Dis gives the most efficient running sequence. When a Step has been executed, it is only necessary to recursively subtract from the Dis of the subsequent nodes according to the FollowMap to update the running sequence for the current state.
Taking the above ideal embodiments of this application as guidance, and through the above description, persons skilled in the relevant art can make various changes and modifications without departing from the scope of the technical idea of this application. The technical scope of this application is not limited to the content of the description, and must be determined according to the scope of the claims.
This application is a 371 of international application of PCT application serial no. PCT/CN2020/120660, filed on Oct. 13, 2020. The entirety of each of the above mentioned patent applications is hereby incorporated by reference herein and made a part of this specification.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2020/120660 | 10/13/2020 | WO |