A cluster is a group of interconnected computers, or nodes, which combine their individual processing powers to function as a single, high performance machine. A cluster may be used for a number of different purposes, such as load balancing, high availability (HA) server applications and parallel processing.
A high performance computing (HPC) system may include a large number (e.g., hundreds, if not thousands or tens of thousands) of compute nodes (e.g., servers) that are networked together in a cluster for purposes of combining their individual processing powers to solve computationally-intensive problems. As examples, the computationally-intensive problems may be related to modeling, genome study, DNA study, computational biology, computational chemistry, earth science-related studies, space study, and so forth.
In general, machine-readable instructions (e.g., program code) may set forth a logical workflow of transformations on data to solve a computationally-intensive problem. In this context, a “transformation” refers to a function that operates on input data to produce output data. The data may be partitioned into logical partitions, or datasets (e.g., resilient distributed datasets (RDDs)), which allows compute nodes to perform different parts of some transformations (e.g., map transformations, filter transformations, reduce transformations, and so forth) individually and in parallel. Some transformations (e.g., a shuffle operation in which data is shuffled among a set of compute nodes) may be performed by a group of compute nodes in a nonparallel fashion.
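As a non-limiting illustration of the distinction described above, the following Python sketch applies partition-parallel transformations (map, filter) followed by a nonparallel, shuffle-like step; the partition contents and the particular transformations are hypothetical:

```python
# Illustrative sketch: narrow transformations run per partition in
# parallel; a shuffle-style step needs data from all partitions.
from concurrent.futures import ThreadPoolExecutor

def transform_partition(partition):
    # Map and filter operate on each partition independently,
    # so partitions may be processed in parallel.
    return [x * 2 for x in partition if x % 2 == 0]

def run_workflow(partitions):
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(transform_partition, partitions))
    # A shuffle-like step (here, a global sort) cannot be done
    # partition-by-partition, so it runs in a nonparallel fashion.
    return sorted(x for part in mapped for x in part)

print(run_workflow([[1, 2, 3], [4, 5], [6, 7, 8]]))  # [4, 8, 12, 16]
```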
A logical workflow of transformations does not inform a cluster how to specifically perform the workflow. For this purpose, a logical workflow of transformations may be converted into a physical execution plan, which sets forth specific tasks for compute nodes of the cluster. A physical execution plan may include a sequence of stages. Each stage may be associated with a set of tasks (called a “taskset” herein) that are to be performed by a set of compute nodes of a cluster computing system. The stage boundaries may be determined in a manner that assigns similar transformation types to the same stage. For example, if a logical workflow has first transformations in which parallel processing may be used followed by a shuffling transformation, then the corresponding part of the physical execution plan may include a first stage that includes a taskset for performing the parallel processing transformations and a second, subsequent stage that includes a taskset for performing the shuffling transformation.
A cluster may include a scheduler that processes a physical execution plan for purposes of scheduling the tasks of the physical execution plan for execution by compute nodes of the cluster. The physical execution plan may be in the form of a graph, such as a directed acyclic graph (DAG). In general, a “graph” refers to a collection of vertices, or nodes, which are connected by edges. Each node may represent a particular stage of the physical execution plan, and correspondingly, each node may correspond to a taskset related to one or multiple transformations that operate on and produce datasets. A “DAG” is a specific type of graph. A DAG is “directed” in that each edge of the graph is directed, or represents a one-way direction between a pair of nodes (i.e., a predecessor node and an immediate successor node) of the graph. The DAG is “acyclic” in that the graph has no directed cycles, i.e., no portion of the graph cycles, or forms a closed loop.
A “taskset,” as used herein, generally refers to a set of one or multiple tasks that are associated with a particular stage of a physical execution plan and are associated with a node of the DAG. A “task” refers to a unit of execution for a particular compute node of the cluster. As used herein, “predecessor” and “successor” are used to refer to an order in which the DAG is traversed via a directed edge. “Immediate” is used when discussing a pair of nodes to mean that the nodes are adjacent to each other, i.e., the nodes are directly connected to each other by an edge. Therefore, a directed edge may extend from a predecessor node to an immediate successor node, or stated differently, the directed edge may extend to the successor node from an immediate predecessor node.
A DAG scheduler may process a DAG for purposes of scheduling tasks among nodes of a cluster. The DAG scheduler may traverse a DAG such that for each node of the DAG, the DAG scheduler schedules the tasks of the taskset corresponding to the node, before the DAG scheduler proceeds with scheduling the tasks of the taskset corresponding to the immediate successor node.
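As a non-limiting illustration, the traversal order described above may be sketched as a topological scheduling pass (Kahn's algorithm); the node labels and edges below are hypothetical:

```python
# Minimal DAG scheduler sketch: a node's taskset is scheduled only
# after the tasksets of all of its immediate predecessors.
from collections import deque

def schedule(edges, nodes):
    # edges: list of (predecessor, immediate successor) pairs
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for pred, succ in edges:
        successors[pred].append(succ)
        indegree[succ] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)           # schedule this node's taskset
        for succ in successors[node]:
            indegree[succ] -= 1
            if indegree[succ] == 0:  # all predecessors scheduled
                ready.append(succ)
    return order

print(schedule([("T1", "T2"), ("T1", "T3"), ("T2", "T4"), ("T3", "T4")],
               ["T1", "T2", "T3", "T4"]))  # ['T1', 'T2', 'T3', 'T4']
```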
Using a DAG to represent a physical execution plan may be rather inflexible for purposes of responding to a dynamic environment. An example of a dynamic environment is a self-driving, or autonomous, vehicle, which may operate without human involvement in response to the vehicle sensing its environment. The environment may unpredictively change, such as, for example, when the vehicle senses an unexpected object in the vehicle’s current path. When a DAG represents a physical execution plan for an autonomous vehicle, there may be multiple tasksets (corresponding to nodes of the DAG) that are directed to handling the path planning, monitoring and control of the vehicle.
For example, a particular car motion control taskset may be related to tasks directed to controlling the motion of the autonomous vehicle along a particular course. There may be more than one possible scenario that might preempt the car motion control taskset and should be handled by another taskset. For example, the car motion control taskset may be preempted by actions taken by a human operator setting a cruise control speed, and this preemption may involve transitioning to a cruise control taskset. As another example, the car motion control taskset may be preempted by a progressive brake control taskset for purposes of reducing the speed of the vehicle. As another example, the car motion control taskset may be preempted due to an object being detected, and which may involve transitioning to executing an object detection handling taskset.
A DAG that has only single directed edges from its nodes does not accommodate the situation in which conditions for more than one scenario are simultaneously satisfied. For example, for the above-described autonomous vehicle motion control, a user may be operating cruise control buttons or levers on the vehicle simultaneously with an unexpected object being detected in the vehicle’s path. Stated differently, there may be more than one scenario that warrants transitioning from a node of the DAG associated with the motion control taskset.
In accordance with example implementations that are described herein, a DAG may have multiple candidate paths (or “execution paths”) that originate from a given node. In this manner, in accordance with example implementations, a DAG may have multiple candidate paths from a given predecessor node, where each candidate path extends from the given predecessor node through an associated directed edge to a different immediate successor node. For example, a DAG for an autonomous vehicle may have multiple candidate paths (corresponding to multiple directed edges and multiple immediate successor nodes) that originate from a node corresponding to the motion control taskset. A first candidate path may include a first immediate successor node corresponding to a cruise control taskset. A second candidate path may include a second immediate successor node corresponding to a progressive brake control taskset. A third candidate path may include a third immediate successor node corresponding to the object detection handling taskset.
A potential challenge with multiple candidate paths in DAG scheduling is that the condition(s) for transitioning to more than one candidate path may be simultaneously satisfied. For example, the condition(s) for executing a progressive brake control taskset may be simultaneously satisfied with the condition(s) for executing an object detection handling taskset.
A “priority DAG” is described herein in accordance with example implementations. In this context, a “priority DAG” generally refers to a DAG that has alternate candidate paths (or “candidate execution paths”), and the candidate paths are associated with relative priorities (which may also be called “multipath priorities,” “relative multipath node priorities,” or “RMP node priorities” herein). Stated differently, in accordance with example implementations, for multiple candidate paths, or choices, from a given predecessor node, each candidate path may be assigned a different relative priority than the other path(s), such that if conditions are concurrently satisfied for more than one candidate path, the DAG scheduler selects the candidate path that has the highest relative priority. For the autonomous vehicle example discussed above, the object detection handling taskset may have a higher relative priority than the cruise control taskset and the progressive brake control taskset, so that the DAG scheduler always schedules the object detection handling taskset in the event that an unexpected object is detected. Continuing the example, the progressive brake control taskset may be associated with a higher priority than the cruise control taskset, so that if the conditions for transitioning to both tasksets are simultaneously satisfied, the DAG scheduler will schedule the progressive brake control taskset for execution.
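As a non-limiting illustration, the priority-based selection among candidate paths may be sketched as follows; the taskset labels and priority values are hypothetical, not part of any described implementation:

```python
# Illustrative sketch: when transition conditions hold for several
# candidate successor tasksets at once, select the one with the
# highest relative (RMP) priority.
def select_next(candidates, conditions):
    """candidates: {taskset: priority}; conditions: {taskset: bool}."""
    satisfied = [t for t in candidates if conditions.get(t, False)]
    if not satisfied:
        return None
    # Pick the satisfied candidate with the highest relative priority.
    return max(satisfied, key=lambda t: candidates[t])

priorities = {"cruise_control": 1, "progressive_brake": 2,
              "object_detection_handling": 3}
# Conditions for braking and object handling hold simultaneously:
conditions = {"progressive_brake": True, "object_detection_handling": True}
print(select_next(priorities, conditions))  # object_detection_handling
```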
The priorities may be associated with nodes of the priority DAG. For the examples above, the multiple candidate paths originate with a predecessor node of the priority DAG, and there is a choice, from the predecessor node, among multiple immediate successor nodes that correspond to the multiple execution paths. The choice of node may occur in the reverse direction as well. In this manner, in accordance with example implementations, a priority DAG may have multiple candidate paths that terminate at a particular successor node. Stated differently, there may be multiple choices of immediate predecessor nodes for a given successor node. In this manner, a first candidate path may traverse a first predecessor node and terminate at an immediate successor node, and a second candidate path may traverse a second predecessor node and terminate at the same immediate successor node. For this example, both candidate paths may have associated relative priorities (i.e., the predecessor nodes may have associated RMP node priorities), which control which candidate path the DAG scheduler selects if conditions for both paths are satisfied.
In accordance with example implementations, one or multiple priorities of a priority DAG may be dynamically assigned. For example, in accordance with some implementations, an executing taskset of the DAG may evaluate and possibly adjust an RMP node priority of the same DAG. As a more specific example, a priority DAG may relate to networking route optimization, i.e., selecting an optimum route for packets through data path elements of a software defined network (SDN). In this manner, the priority DAG may contain nodes (and corresponding tasksets) that correspond to data path devices of the SDN. One or multiple of these nodes may have a choice of candidate immediate successor nodes, which correspond to multiple candidate execution paths and multiple data plane routing paths. The RMP node priorities of the candidate immediate successor nodes may be dynamically adjusted based on current network performance metrics (e.g., metrics representing bandwidth, latency, numbers of dropped packets, and so forth). In accordance with some implementations, executing tasks of the DAG’s tasksets may dynamically adjust the RMP node priorities based on the current network performance metrics so that over time, preference is given to the currently better performing segments of the data plane.
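As a non-limiting illustration, dynamically re-ranking candidate routing paths from observed metrics may be sketched as follows; the scoring formula, metric fields, and path names are assumptions chosen for illustration:

```python
# Illustrative sketch: derive RMP-style priorities for candidate data
# plane paths from current network performance metrics, so that the
# better performing path receives the higher priority over time.
def adjust_priorities(paths_metrics):
    """paths_metrics: {path: {"latency_ms": float, "drops": int}}.
    Returns {path: priority}; a larger priority value is preferred."""
    def cost(m):
        # Assumed cost model: latency plus a penalty per dropped packet.
        return m["latency_ms"] + 10 * m["drops"]
    # Highest cost gets priority 1, lowest cost gets the top priority.
    ranked = sorted(paths_metrics, key=lambda p: cost(paths_metrics[p]),
                    reverse=True)
    return {path: rank for rank, path in enumerate(ranked, start=1)}

metrics = {"path_a": {"latency_ms": 5.0, "drops": 0},
           "path_b": {"latency_ms": 2.0, "drops": 4}}
print(adjust_priorities(metrics))  # {'path_b': 1, 'path_a': 2}
```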
Referring to
For the example implementation that is depicted in
As examples, in accordance with some implementations, the physical, non-transitory storage media devices may include one or more of the following: semiconductor storage devices, memristor-based devices, magnetic storage devices, phase change memory devices, a combination of devices of one or more of these storage technologies, storage devices for other storage technologies, and so forth. Moreover, in accordance with some implementations, the physical, non-transitory storage media devices may be volatile memory devices, non-volatile memory devices, or a combination of volatile and non-volatile memory devices. In accordance with some implementations, the non-transitory storage media devices may be part of storage arrays, as well as other types of storage subsystems.
The node 120, in accordance with example implementations, may be a computer platform (e.g., a blade server, a laptop, a router, a rack-based server, a gateway, a supercomputer and so forth), a subpart of a computer platform (e.g., a compute node corresponding to one or multiple processing cores of a blade server), or multiple computer platforms (e.g., a compute node corresponding to a cluster). The nodes 120 may all have the same architecture or may have different architectures, depending on the particular implementation.
Although
In accordance with example implementations, a client 190 (e.g., a computer platform that receives input from a user) may provide data representing a logical execution workflow to a DAG generator 180 of the computer system 100. The data may be, for example, machine-readable instructions (e.g., program code) that represents a logical execution flow of transformations to be performed on partitioned datasets (e.g., RDDs 154), beginning with one or multiple input datasets and ending with one or multiple datasets that represent the end result of the logical execution flow. As an example, the data representing the logical execution workflow may be in the form of a file that is uploaded and is accessible by the DAG generator 180. As another example, the data representing the logical execution workflow may be provided through a command line interface of the DAG generator 180. As another example, the data representing the logical execution flow may be provided to the DAG generator 180 through a graphical user interface (GUI) of the DAG generator 180. Regardless of how the logical execution workflow is provided to the DAG generator 180, in accordance with example implementations, the DAG generator 180 converts the logical execution workflow into one or multiple priority DAGs 150. These priority DAG(s) 150 represent a physical execution plan that corresponds to the logical execution workflow, and the priority execution plan contains tasks to be executed by compute nodes 120 of the computer system 100.
In general, the vertices, or nodes, of a priority DAG 150 represent respective stages of a physical execution plan. Each node of the priority DAG 150, in accordance with example implementations, corresponds to a stage that is associated with a taskset. A “taskset” includes one or multiple tasks to be executed either by a single compute node 120 or by multiple compute nodes 120 working in concert. Here “in concert” refers to the compute nodes 120 working in either a parallel fashion or in a nonparallel fashion. As an example, multiple compute nodes 120 may execute tasks of a taskset to perform respective transformations on partitioned datasets (e.g., RDDs 154) in parallel. As another example, multiple compute nodes 120 may execute tasks of a taskset to transform data in a nonparallel fashion, such as a transformation that involves shuffling data among the compute nodes 120.
Due to the edges of the priority DAG 150 being directed (i.e., representing a one-way direction from one node to another node), the priority DAG 150 defines an execution sequence, or order, for executing the tasksets that correspond to the nodes of the DAG 150. In the context used herein, a “priority DAG” refers to a DAG that includes one or multiple nodes that have RMP node priorities assigned to them. For a given predecessor node of the priority DAG, which is connected, via multiple edges, to multiple immediate successor nodes, RMP node priorities may be assigned to the immediate successor nodes (i.e., each immediate successor node has a different RMP node priority). For a given successor node of the priority DAG, which is connected, via multiple edges, to multiple immediate predecessor nodes, RMP node priorities may be assigned to the immediate predecessor nodes. Regardless of whether a given node is connected to multiple immediate successor nodes or is connected to multiple immediate predecessor nodes, the node is part of respective multiple candidate execution paths. Due to the existence of multiple candidate execution paths, the priority DAG 150 may alternatively be referred to as a “multipath” DAG, or “multipath priority DAG 150.”
In accordance with example implementations, the computer system 100 includes a DAG scheduler 130 that processes the priority DAGs 150 for purposes of scheduling tasks that are to be executed by compute nodes 120 of the computer system 100. The DAG scheduler 130 may be a particular node 120 or group of nodes 120 of the computer system 100, in accordance with example implementations. In general, the DAG scheduler 130 includes a DAG scheduling engine 140 to process the priority DAGs 150 and schedule the tasks for execution by compute nodes 120.
The DAG scheduling engine 140, in the processing of a given priority DAG 150, traverses the nodes of the priority DAG 150 according to the execution order, or sequence, that is set forth by the directed edges of the priority DAG 150. Each node of the priority DAG 150 may be associated with one or multiple tasksets, and the taskset(s) include tasks to be executed by the compute nodes 120 as part of a stage of a physical execution plan. For a given node of the priority DAG 150, the DAG scheduling engine 140 identifies one or multiple compute nodes 120 to execute the tasks associated with the given node. A given node of the priority DAG 150 may itself be another priority DAG 150.
The DAG scheduling engine 140 may base the identification of the compute node(s) 120 for executing a given taskset on any of a number of different criteria, such as availability of the compute nodes 120, performance metrics (e.g., bandwidth, latency, processor utilization, and so forth) associated with the compute nodes 120, the number of tasks in the taskset, the nature of the transformation(s) associated with the taskset, the availability of cached data from previous compute node 120 processing results, and so forth. In accordance with some implementations, the DAG scheduling engine 140 directly communicates the tasks to task schedulers of the identified compute nodes 120 for purposes of scheduling the tasks for execution by these compute nodes 120. In accordance with further implementations, the DAG scheduling engine 140 communicates the compute node identifications to a DAG task scheduler (not shown) that handles communicating tasks to the task schedulers of individual compute nodes 120.
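As a non-limiting illustration, the multi-criteria compute node identification described above may be sketched with a simple scoring function; the node fields, weights, and identifiers are hypothetical assumptions:

```python
# Illustrative sketch: rank available compute nodes for a taskset by a
# weighted mix of data locality and current processor utilization.
def pick_compute_nodes(nodes, task_count):
    """nodes: list of dicts with 'id', 'available', 'cpu_util',
    'has_cached_data'. Returns up to task_count node ids, best first."""
    def score(n):
        return (n["has_cached_data"] * 2.0   # prefer cached data (locality)
                - n["cpu_util"])             # prefer less utilized nodes
    eligible = [n for n in nodes if n["available"]]
    eligible.sort(key=score, reverse=True)
    return [n["id"] for n in eligible[:task_count]]

nodes = [
    {"id": "n1", "available": True, "cpu_util": 0.9, "has_cached_data": False},
    {"id": "n2", "available": True, "cpu_util": 0.2, "has_cached_data": True},
    {"id": "n3", "available": False, "cpu_util": 0.1, "has_cached_data": True},
]
print(pick_compute_nodes(nodes, 2))  # ['n2', 'n1']
```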
In accordance with example implementations, the DAG scheduling engine 140 processes a priority DAG 150, one node at a time, for purposes of scheduling the tasks for compute nodes 120 to execute. After scheduling the tasks for a given node, the DAG scheduling engine 140 selects the next node to process. In some cases, the “next node” may be the only choice. However, in other cases, selecting the “next node” may involve the DAG scheduling engine 140 evaluating multiple choices (i.e., evaluating candidate next nodes) based on respective RMP node priorities associated with these choices. To be a valid candidate node for consideration, one or multiple conditions for selecting the candidate node must be satisfied, with the remaining criterion for selecting the candidate node being the RMP node priority associated with the candidate node. In accordance with example implementations, either 1. the candidate node is one of multiple candidate immediate successor nodes, or 2. the candidate node is one of multiple candidate immediate predecessor nodes, as further described herein in connection with example priority DAGs 150-1 and 150-2 of
Still referring to
In accordance with example implementations, the priority DAG data 162 further includes data that represents one or multiple RMP node priority queue sets 160. Each RMP node priority queue set 160 corresponds to a particular priority DAG 150 and contains a group, or set, of queues that contain data that represent the RMP node priorities for the priority DAG 150. For example, in accordance with some implementations, each queue of the RMP node priority queue set 160 may be associated with a particular node of the DAG 150 that is associated with a set of candidate nodes; and the queue contains the RMP priorities of the candidate nodes. In accordance with some implementations, queues of an RMP node priority queue set 160, such as the RMP node priority queue sets that are depicted in Table 1 below and depicted in
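As a non-limiting illustration, one possible layout for an RMP node priority queue set is sketched below using priority heaps; the node identifiers and priority values are hypothetical:

```python
# Illustrative sketch of an RMP node priority queue set: one queue per
# node that has multiple candidate nodes, holding (priority, candidate)
# entries so the highest-priority candidate can be read first.
import heapq

class RMPQueueSet:
    def __init__(self):
        self._queues = {}  # node id -> heap of (-priority, candidate id)

    def set_priority(self, node, candidate, priority):
        # Negate the priority so the max-priority entry sits at the
        # top of Python's min-heap.
        heapq.heappush(self._queues.setdefault(node, []),
                       (-priority, candidate))

    def best_candidate(self, node):
        # Peek at the candidate with the highest RMP node priority.
        return self._queues[node][0][1] if self._queues.get(node) else None

qs = RMPQueueSet()
qs.set_priority("T6", "T12", 2)
qs.set_priority("T6", "T14", 5)
print(qs.best_candidate("T6"))  # T14
```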
The DAG scheduling engine 140 may, in accordance with example implementations, be implemented via machine-readable instructions (i.e., “software”) that are executed by a hardware processor. In accordance with further implementations, the DAG scheduling engine 140 may be implemented by dedicated hardware (e.g., logic, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a complex programmable logic device (CPLD), and so forth) that does not execute machine-readable instructions, or by a combination of dedicated hardware and a hardware processor that executes machine-readable instructions. For the implementation that is depicted in
In the following discussion, it is assumed that the DAG scheduler 130 performs its scheduling-related actions using the DAG scheduling engine 140.
The cross-hatched nodes 204 of
The following table depicts example content of the queues of an RMP priority queue set 160 (
For this example, each queue corresponds to a respective predecessor node 204 of the priority DAG 150-1 that has multiple candidate immediate successor nodes 204. For example, as depicted in row four of Table 1, the T6 node 204 has the T12 and T14 immediate successor nodes 204 and corresponds to an RMP node priority queue set 160 that contains the RMP node priorities for the T12 and T14 nodes 204.
In accordance with further example implementations, the given RMP priority queue set 160 may include one or multiple RMP priority queues that correspond to respective successor nodes 204, where each of these successor nodes 204 has multiple candidate immediate predecessor nodes. For this example, the RMP priority queue for a given successor node has data representing the RMP node priorities for the candidate immediate predecessor nodes.
In accordance with some implementations, each element 158 contains data representing one or multiple links 316 to the element(s) 158 corresponding to the immediate successor node(s) 204. For example, for the T1 element 158, the element 158 stores data representing links 316 to the T2 element 158, the T3 element 158 and the T11 element 158. Moreover, in accordance with example implementations, each element 158 stores data representing a hash map 312. In accordance with example implementations, the hash map 312 allows the DAG scheduler 130 to determine the relevant RMP node priorities (if any). In accordance with some implementations, the hash map 312 is a probabilistic filter (e.g., a Bloom filter). For example, in accordance with some implementations, the DAG scheduler 130 may apply multiple hash functions to an identifier for the node 204 to produce multiple corresponding hash values. The DAG scheduler 130 may then apply the hash values to the hash map 312 for purposes of identifying the RMP node priorities that are relevant to the associated node 204, including ruling out irrelevant RMP node priorities.
In the context used herein, a “hash” (also called a “hash value”) is produced by the application of a cryptographic hash function to a value (e.g., an input, such as an image). A “cryptographic hash function” may be a function that is provided through the execution of machine-readable instructions by a processor (e.g., one or multiple central processing units (CPUs), one or multiple CPU processing cores, and so forth). The cryptographic hash function may receive an input, and the cryptographic hash function may then generate a hexadecimal string corresponding to the input. For example, the input may include a string of data (for example, the data structure in memory denoted by a starting memory address and an ending memory address). In such an example, based on the string of data the cryptographic hash function outputs a hexadecimal string. Further, any minute change to the input may alter the output hexadecimal string. In another example, the cryptographic hash function may be a secure hash algorithm (SHA) function, any federal information processing standards (FIPS)-approved hash function, any national institute of standards and technology (NIST)-approved hash function, or any other cryptographic hash function. In some examples, instead of a hexadecimal format, another format may be used for the string.
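As a non-limiting illustration of the probabilistic filter mentioned above, the following simplified Bloom-filter sketch sets bits from multiple hashes of a node identifier; a lookup with any unset bit proves the node has no associated RMP priorities, while a lookup with all bits set means the node possibly has them. The filter size, hash count, and node identifier are illustrative assumptions:

```python
# Simplified Bloom-filter membership sketch (the document's hash map
# 312 may be structured differently).
import hashlib

class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # Derive several bit positions from independent hashes of key.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means possibly present
        # (false positives are possible, false negatives are not).
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("T6")
print(bf.might_contain("T6"))  # True
```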
It is noted that the example priority DAG 150-1 of
As depicted in
In accordance with example implementations, the priority DAGs 150 may be independent and prioritized, and the DAG scheduler 130 may select a particular DAG 150 for processing based on its associated priority value.
In accordance with example implementations, one or multiple priorities of a given DAG may be dynamically assigned. For example, in accordance with some implementations, an executing taskset of the DAG may evaluate and possibly adjust a priority of the DAG. As a more specific example, a priority DAG may relate to networking route optimization, i.e., selecting an optimum route for packets through the data plane of a software defined network (SDN). In this manner, the DAG may contain nodes (and corresponding tasksets) that correspond to data path devices of an SDN, and each of these nodes may have multiple candidate immediate successor nodes, corresponding to multiple candidate execution paths and multiple candidate network routing paths. The priorities of the candidate execution paths may be dynamically adjusted based on current network metrics (e.g., metrics representing bandwidth and latency). In accordance with some implementations, executing tasks of the DAG’s tasksets may dynamically adjust the priorities based on current network metrics.
Referring to
Referring to
Referring to
In accordance with example implementations, the DAG includes a given DAG of a plurality of DAGs, and the process further includes performing the scheduling responsive to the priority of the given DAG.
In accordance with example implementations, the scheduling includes accessing a set of queues corresponding to the given DAG. Each queue corresponds to a node of the plurality of nodes and stores data representing available scheduling paths for the node and priorities associated with the available scheduling paths.
In accordance with example implementations, the task subset corresponding to the given successor node is associated with another DAG.
In accordance with example implementations, the process includes associating a third priority with a second successor node for the second successor node to execute after a third predecessor node; and associating a fourth priority with a third successor node for the third successor node to execute after the third predecessor node. The scheduling of the plurality of tasks further includes, based on the third priority and the fourth priority, scheduling the task subset corresponding to the second successor node to execute after the task subset corresponding to the third predecessor node, or scheduling the task subset corresponding to the third successor node to execute after the task subset corresponding to the third predecessor node.
In accordance with example implementations, the scheduling includes scheduling tasks for execution by a plurality of compute nodes of a cluster.
In accordance with example implementations, the scheduling includes accessing a hash map corresponding to the given successor node. Based on the hash map, a queue corresponding to the given successor node is accessed. The queue includes data representing the first priority and the second priority.
In accordance with example implementations, the DAG is associated with multiple scheduling paths, and each scheduling path is associated with a result of the plurality of results.
In accordance with example implementations, the process includes associating a plurality of priorities with the plurality of nodes and during the scheduling, changing a given priority of the plurality of priorities.
While the present disclosure has been described with respect to a limited number of implementations, those skilled in the art, having the benefit of this disclosure, will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations.