This description relates to dynamic resource allocation in a computing system.
Applications that run on computing systems require a portion of the computing system's computational resources to do so. The computing system must therefore manage allocation of its resources to applications running thereon. Some examples of resources that are allocated to applications include access to a portion of the computing system's memory, access to file data, and access to a required amount of processing power.
In distributed computing systems, computational resources (including data storage and processing resources) are distributed among a number of servers included in one or more clusters that work together to run data processing applications. In some examples, distributed computing systems use a centralized resource manager, which both schedules execution of applications on the computing system and manages allocation of the computing system's distributed resources to the applications. Examples of resource managers include “Hadoop YARN” and “Kubernetes.”
In a general aspect, a method for performing a distributed computation on a computing system using computational resources dynamically allocated using a computational resource manager includes storing information specifying quantities of computational resources associated with respective ones of a number of program portions of the program, where the program portions perform successive transformations of data and each program portion uses computational resources granted by the computational resource manager enabling computation associated with that program portion to be performed in the computing system, requesting a first quantity of computational resources associated with a first program portion of the number of program portions from the computational resource manager, receiving a second quantity of computational resources from the computational resource manager, less than the requested first quantity of computational resources, performing computation associated with the first portion of the program using the second quantity of computational resources, while performing the computation associated with the first portion of the program using the second quantity of computational resources, receiving an additional quantity of computational resources from the computational resource manager, and performing an additional computation associated with the first portion of the program using the additional quantity of computational resources while performing the computation associated with the first portion using the second quantity of computational resources.
Aspects may include one or more of the following features.
The information specifying quantities of computational resources associated with respective ones of a number of program portions of the program may include characteristics of one or more program components associated with the respective ones of the number of program portions. The characteristics may include a degree of parallelism associated with each of the one or more program components and a quantity of computational resources required for performing computation associated with each of the one or more program components. The quantity of computational resources associated with a program portion of the number of program portions may be determined based at least in part on the degree of parallelism associated with each of the one or more program components and the quantity of computational resources required for performing computation associated with each of the one or more program components.
Performing the computation associated with the first portion of the program using the second quantity of computational resources may include partitioning the first program portion into a number of sub-portions according to the received second quantity of computational resources. Performing the computation associated with the first portion of the program may include performing a first sub-portion of the number of sub-portions while one or more other sub-portions of the number of sub-portions wait to perform computation.
Partitioning of a portion of the program can be performed in a way that preserves an order of execution of at least some of the program components. Partitioning of a portion of the program portion can be performed in a way that maximizes a usage of the received computational resources while preserving an order of execution of at least some of the program components. Maximization of usage of the received computational resources can include using the received computational resources to perform computation for some but not all required instances of a program component in a first part of the program portion and associating the remaining instances of the program component with another part of the program portion for later execution.
Performing computation associated with the first portion of the program using the second quantity of computational resources may include performing computation for a first sub-portion of the plurality of sub-portions using the second quantity of computational resources while one or more other sub-portions of the plurality of sub-portions wait to perform computation. The first sub-portion may be the sub-portion among the plurality of sub-portions that is configured to use most or all of the second quantity of computational resources for execution.
Performing the computation associated with the first portion of the program using the additional quantity of computational resources may include repartitioning the first program portion into an updated number of sub-portions according to the received second quantity of computational resources and the received additional quantity of computational resources.
Performing computation associated with the first portion of the program using the additional quantity of computational resources may include performing computation associated with a first sub-portion of the updated plurality of sub-portions using the additional quantity of computational resources while one or more other sub-portions of the updated plurality of sub-portions wait to perform computation.
The first sub-portion of the updated plurality of sub-portions may be the sub-portion among the updated plurality of sub-portions that is configured to use most or all of the additional quantity of computational resources for performing computation.
Partitioning the first program portion into a number of sub-portions may include partitioning the first program portion according to characteristics of one or more program components associated with the first program portion. The characteristics of the number of program portions may include a degree of parallelism associated with each of the one or more program components and a quantity of computational resources required for performing computation associated with each of the one or more program components. A first one or more instances of a first program component may be partitioned into the first sub-portion and a second one or more instances of the first program component may be partitioned into a second sub-portion.
The method may include relinquishing the second quantity of computational resources and the additional quantity of computational resources upon completion of the computation associated with the first program portion. The method may include retaining at least some of the second quantity of computational resources and the additional quantity of computational resources upon completion of the computation associated with the first program portion. The method may include performing a computation associated with a second portion of the program using at least some of the retained computational resources.
A third quantity of the received computational resources may become unavailable during the computation associated with the first program portion and the method may further include requesting the third quantity of computational resources from the computational resource manager, receiving the third quantity of computational resources from the computational resource manager, and continuing performing computation associated with the first program portion using the received third quantity of computational resources.
The computational resource manager may be opaque regarding a quantity of computational resources available for the computing system. The method may include storing output data from the first program portion and performing computation associated with a second program portion of the number of program portions including reading and processing the stored output data. The method may include performing computation associated with a second program portion of the number of program portions including receiving and processing a stream of output data from the first program portion.
In another general aspect, a system for performing a distributed computation using computational resources of a computing system dynamically allocated using a computational resource manager includes a storage device for storing information specifying quantities of computational resources associated with respective ones of a number of program portions of the program, where the program portions perform successive transformations of data and each program portion uses computational resources granted by the computational resource manager enabling that program portion to be performed in the computing system and at least one processor configured to request a first quantity of computational resources associated with a first program portion of the number of program portions from the computational resource manager, receive a second quantity of computational resources from the computational resource manager, less than the requested first quantity of computational resources, perform computation associated with the first portion of the program using the second quantity of computational resources, while performing the computation associated with the first portion of the program using the second quantity of computational resources, receive an additional quantity of computational resources from the computational resource manager, and perform the computation associated with the first portion of the program using the additional quantity of computational resources while performing the computation associated with the first portion using the second quantity of computational resources.
In another general aspect, a system for performing a distributed computation using computational resources of a computing system dynamically allocated using a computational resource manager includes means for storing information specifying quantities of computational resources associated with respective ones of a number of program portions of the program, where the program portions perform successive transformations of data and each program portion uses computational resources granted by the computational resource manager enabling performance of computation associated with that program portion in the computing system, means for processing configured to request a first quantity of computational resources associated with a first program portion of the number of program portions from the computational resource manager, receive a second quantity of computational resources from the computational resource manager, less than the requested first quantity of computational resources, perform computation associated with the first portion of the program using the second quantity of computational resources, while performing the computation associated with the first portion of the program using the second quantity of computational resources, receive an additional quantity of computational resources from the computational resource manager, and perform the computation associated with the first portion of the program using the additional quantity of computational resources while performing the computation associated with the first portion using the second quantity of computational resources.
In another general aspect, software stored in a non-transitory form on a computer-readable medium, for performing a distributed computation using computational resources of a computing system dynamically allocated using a computational resource manager, the software including instructions for causing the computing system to store information specifying quantities of computational resources associated with respective ones of a number of program portions of the program, where the program portions perform successive transformations of data and each program portion uses computational resources granted by the computational resource manager enabling performance of computation for that program portion in the computing system, request a first quantity of computational resources associated with a first program portion of the number of program portions from the computational resource manager, receive a second quantity of computational resources from the computational resource manager, less than the requested first quantity of computational resources, perform computation associated with the first portion of the program using the second quantity of computational resources, while performing the computation associated with the first portion of the program using the second quantity of computational resources, receive an additional quantity of computational resources from the computational resource manager, and perform the computation associated with the first portion of the program using the additional quantity of computational resources while performing the computation associated with the first portion using the second quantity of computational resources.
Performing computation associated with a program or a program portion can also be referred to as executing the program or program portion.
The program can be specified as a dataflow graph and the program portions can be specified as components of a dataflow graph.
The program can be specified as a procedural program specification and the program portions can be specified as subroutines.
Aspects can include one or more of the following advantages.
Among other advantages, aspects dynamically allocate computational resources to portions (e.g., components) of a computer program (e.g., a data processing graph) in resource-constrained computing environments, where the amount of resources available to the program portions varies over time. Portions of programs (sometimes referred to as “phases”) are enabled to partially perform computation with less than all of their required computational resources, and to incorporate additional computational resources, as they become available, in order to complete the computation. Program portions are advantageously less likely to be stalled while waiting for all of their required resources to be granted. Program portions are advantageously able to recover from resource (e.g., node) failures in the computing system by dynamically allocating new resources to replace failed resources.
Other features and advantages of the invention will become apparent from the following description, and from the claims.
A data storage system 116 is accessible to the execution environment 104 and to a development environment 118. The development environment 118 is a system for developing programs that can be configured in a variety of ways such that different interrelated program portions are associated with different target quantities of computational resources to be allocated for use at runtime. In some implementations, these programs are data processing programs that process data during runtime, such as data received from the data source 102. One example of a data processing program is a data processing graph that includes vertices (representing data processing components or datasets) connected by directed links (representing flows of work elements, i.e., data) between the vertices. Other forms of data processing programs are possible in accordance with the present invention. In addition to these data flow connections, some data processing graphs also have control flow connections for determining flow of control among components. In such data processing graphs, the program portions are the components and they are interrelated according to their data flow links. In other examples, the program portions are sub-modules or other entities within a program that are separately granted computing resources for being executed. The program portions are considered interrelated to the extent that the ability of the overall program to which they belong to be executed depends on the abilities of the individual program portions. Such interrelated or interdependent program portions may also be dependent on each other for execution. For example, one program portion may receive data from or provide data to another program portion. Also, while the program portions are separately granted computing resources, they may overlap or be interdependent in various other ways (e.g., competing for a limited supply of computing resources).
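As a purely illustrative aside, a data processing graph of the kind described above could be represented by a structure along the following lines; the field names and vertices are hypothetical, not the actual format of any particular development environment.

```python
# Hypothetical, minimal representation of a data processing graph: vertices are
# components or datasets, directed data links carry work elements between them,
# and optional control links order components without passing data.
graph = {
    "vertices": {
        "read_input":   {"kind": "dataset"},
        "transform":    {"kind": "component"},
        "write_output": {"kind": "dataset"},
    },
    "data_links":    [("read_input", "transform"), ("transform", "write_output")],
    "control_links": [],
}
```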
For example, such an environment for developing graph-based computations is described in more detail in U.S. Publication No. 2007/0011668, titled “Managing Parameters for Graph-Based Applications,” incorporated herein by reference. A system for executing such graph-based computations is described in U.S. Pat. No. 5,966,072, titled “EXECUTING COMPUTATIONS EXPRESSED AS GRAPHS,” incorporated herein by reference. Data processing graphs made in accordance with this system provide methods for getting information into and out of individual processes represented by graph components, for moving information between the processes, and for defining a running order for the processes. This system includes algorithms that choose interprocess communication methods from any available methods (for example, communication paths according to the links of the graph can use TCP/IP or UNIX domain sockets or use shared memory to pass data between the processes).
The execution module 112 processes data from the data source 102 according to one or more data processing graphs 114, using computational resources allocated by the resource manager 120, to generate output data which is stored back in the data source 102 or in the data storage system 116, or otherwise used. Storage devices providing the data source 102 may be local to the execution environment 104, for example, being stored on a storage medium connected to a computer hosting the execution environment 104 (e.g., hard drive 108), or may be remote to the execution environment 104, for example, being hosted on a remote system (e.g., mainframe 110) in communication with a computer hosting the execution environment 104, over a remote connection (e.g., provided by a cloud computing infrastructure). In some examples, the data source 102 includes different forms of database systems including data that may be organized as records having values for respective fields (also called “attributes” or “columns”), including possibly null values.
The resource manager 120 schedules execution of one or more computer programs, such as the data processing graphs 114, on the execution environment 104 and manages allocation of the execution environment's resources to the data processing graphs. As is described in greater detail below, for computer programs that include interrelated program portions, such as data processing graphs that include a number of interdependent components, the resource requesting module 122 interacts with the resource manager 120 to dynamically allocate computational resources based on availability of computational resources associated with the execution module 112, which may vary over time.
The execution environment 104 includes the resource requesting module 122, the resource manager 120, and the execution module 112. Among other features, the execution module 112 includes computational resources which may be distributed across multiple hosts (e.g., computing clusters of servers).
The resource manager 120 receives requests for computational resources and either grants or denies the requests based on an amount of available computational resources in the hosts of the execution module 112. One example of such a resource manager 120 is the “Hadoop YARN” resource manager, which is capable of receiving a request for computational resources for executing a computer program (or program portion) and, if sufficient computational resources are available, granting a ‘container’ with some number of units of the computational resources for use by the program, where a container can be implemented as any suitable data structure for containing a particular quantity of computational resources, or containing any information that identifies a particular quantity of computational resources, or any combination thereof. The computer program may then execute using the computational resources in the granted container. In some examples, the computer program can request multiple containers of resources at one time (e.g., a number of containers for running concurrent instances of a portion of the program) from the resource manager 120. If sufficient resources are available for the resource manager 120 to grant all of the requested multiple containers to the computer program, it will do so. Otherwise, based on the available resources, the resource manager 120 may grant only some of the requested containers (i.e., an integer number of containers less than the total number of containers requested), or the resource manager 120 may not grant any of the requested containers. In some implementations, all of the computational resources associated with a given container are derived from a single host. Alternatively, in other implementations, a given container's resources may be derived from multiple hosts.
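This interaction pattern can be illustrated with a short sketch; the class and method names below are hypothetical stand-ins for a YARN-like resource manager interface, not the actual YARN client API.

```python
# Hypothetical stand-in for a YARN-like resource manager: a request for several
# containers may be granted only partially when resources are scarce.
class ResourceManagerStub:
    def __init__(self, available_containers):
        self.available = available_containers

    def request_containers(self, units_per_container, count):
        granted = min(count, self.available)     # grant as many as are available
        self.available -= granted
        return [{"units": units_per_container} for _ in range(granted)]

rm = ResourceManagerStub(available_containers=2)
containers = rm.request_containers(units_per_container=5, count=3)
# Only 2 of the 3 requested containers are granted; the requester can wait for
# the remainder or, as described below, begin executing with the partial grant.
```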
As is described in greater detail below, the resource requesting module 122 interacts with the resource manager 120 in a way that allows for dynamic allocation (e.g., incremental allocation, deallocation, or reallocation) of resources for the data processing program 224 as resource availability in the execution module 112 changes over time.
The data processing program 224 is a specification of the computer program for processing data received from the data source 102.
In some examples, the execution of the data processing program (e.g., the data processing graph) is broken into multiple, sequential computation phases (sometimes referred to as “program portions”), where each program component of the computer program (e.g., a node or component of the graph or a subroutine of a procedural program) belongs to one of the computation phases. In general, all program components belonging to a computation phase must complete their processing before the program components belonging to a next, subsequent computation phase can begin their processing.
Furthermore, each program component of a computer program may be associated with a computational resource quantity that specifies a quantity of resources required for the program component to execute on the execution module 112 and a ‘layout’ constraint that specifies a degree of parallelization of the program component. A shorthand notation for the computational resource quantity and the layout for a component is “A×B,” where A is the computational resource quantity and B is the layout constraint.
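As a hypothetical numerical illustration of the “A×B” notation (the component names and figures below are made up rather than taken from any figure), the total request for a phase can be obtained by summing A×B over the components belonging to the phase:

```python
# Hypothetical "A x B" specifications: A = units per instance, B = layout
# (degree of parallelism, i.e., number of concurrent instances).
phase_components = {
    "B": {"units_per_instance": 5, "layout": 3},   # "5 x 3"
    "C": {"units_per_instance": 4, "layout": 2},   # "4 x 2"
}

def total_units(components):
    return sum(c["units_per_instance"] * c["layout"] for c in components.values())

print(total_units(phase_components))   # 5*3 + 4*2 = 23 units requested for the phase
```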
In scenarios where resources are plentiful, the resource requesting module 122 is able to allocate resources for the phases of the data processing program 224 without issue. Each phase uses its allocated resources to perform its computation and generate results, which may be used by subsequent phases in the data processing program. Upon completion of a phase, its allocated computational resources are relinquished. However, in some examples, the execution module 112 has limited computational resources, which can result in the resource requesting module 122 receiving less than all of the computational resources that it requests for a phase. In such examples, rather than waiting to execute the phase until the remainder of the resources for the phase become available, aspects described herein use a dynamic resource allocation process that executes part of the phase with the resources that are already allocated and executes additional parts of the phase as additional computational resources become available.
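For illustration only, the step of splitting a phase against a partial allocation might be expressed as in the following sketch; the function and tuple layout are hypothetical and simplified, not a definitive implementation. It places as many component instances as the granted units allow into a first sub-phase and defers the rest, preserving the original component order.

```python
# Hypothetical sketch of partitioning a phase against a partial grant.
# Each component is (name, required_instances, units_per_instance), with
# units_per_instance assumed to be >= 1.

def partition_portion(components, granted_units):
    run_now, deferred = [], []
    remaining = granted_units
    for i, (name, instances, units) in enumerate(components):
        runnable = min(instances, remaining // units)   # instances the grant can cover
        if runnable:
            run_now.append((name, runnable, units))
            remaining -= runnable * units
        if runnable < instances:
            # Not fully placed: defer the leftover instances and every later
            # component so the original execution order is preserved.
            deferred.append((name, instances - runnable, units))
            deferred.extend(components[i + 1:])
            break
    return run_now, deferred

# Example: 12 granted units place two of three "B" instances (5 units each);
# the third "B" instance and both "C" instances wait for a later sub-phase.
first_part, later_part = partition_portion([("B", 3, 5), ("C", 2, 4)], granted_units=12)
```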
For each sub-phase, a fifth step 368 of the process 300 executes the first sub-phase using the received computational resources, while the second sub-phase waits for resources to become available. In a sixth step 370 of the process 300, if additional resources become available during execution of the sub-phase, then a seventh step 372 of the process 300 expands the first sub-phase (and shrinks the second sub-phase) such that the first sub-phase is able to use the additional received computational resources (e.g., by adding instances of components from the second sub-phase to the first sub-phase). In an eighth step 374 of the process 300, the expanded first sub-phase executes.
Upon completion of execution of each sub-phase (or expanded sub-phase), the results of the execution are stored (e.g., in memory or on disk). The process 300 iterates through each sub-phase (e.g., the added second sub-phase), where each sub-phase reads any results stored by the previous sub-phase and executes using the allocated computational resources, as is described above, until all sub-phases in the phase have executed. The process then repeats for each phase.
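The overall per-phase loop can be summarized in a condensed, illustrative sketch; it reuses the hypothetical partition_portion helper sketched above, and the manager and executor stubs below are placeholders rather than a real interface.

```python
# Condensed, illustrative sketch of the per-phase loop described above.
class FakeManager:
    """Grants a partial allocation first and releases more units later."""
    def __init__(self, grants):
        self.grants = list(grants)              # e.g. [10, 5]: 10 units now, 5 later
    def next_grant(self):
        return self.grants.pop(0) if self.grants else 0

def execute(sub_phase):
    print("executing:", sub_phase)              # real code would run the component instances

def run_phase(components, manager):
    granted = manager.next_grant()              # may be less than the phase needs
    current, waiting = partition_portion(components, granted)
    while current:
        extra = manager.next_grant()            # additional units arriving mid-execution
        if extra and waiting:
            moved, waiting = partition_portion(waiting, extra)
            current += moved                    # expand the running sub-phase
            granted += extra
        execute(current)                        # results would be stored for the next sub-phase
        current, waiting = partition_portion(waiting, granted)   # form the next sub-phase

run_phase([("B", 3, 5), ("C", 2, 4)], FakeManager([10, 5]))
```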
That is, the first sub-phase 229a includes a first instance of the second program component, B 228 (i.e., B1) and a second instance of the second program component, B 228 (i.e., B2). The first and second instances of the second program component 228 require “5” computational resource units each to execute and can therefore be executed using the five computational resource units granted on the first host, H1 236 and the five computational resource units granted on the third host, H3 240. In the fifth step 368 of the process 300, the first and second instances (B1, B2) of the second program component 228 begin executing using the five computational resource units granted on the first host, H1 236 and the five computational resource units granted on the third host, H3 240.
In some examples, upon completion of execution of a phase or sub-phase, all the received computational resources are relinquished. In other examples, at least some of the granted computational resources are retained for execution of instances of program components in subsequent phases, preferably without requiring the module 122 to send a request for resources to the manager 120.
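A simple sketch of that bookkeeping, using hypothetical structures and the quantities that appear in the example below (retained resources on host H1 covering one 1-unit instance, and a following phase that needs six units), might look like this:

```python
# Hypothetical sketch: keep some already-granted resources when a phase ends
# and request from the resource manager only what the next phase still lacks.
def plan_next_phase(retained_containers, units_needed):
    reusable_units = sum(c["units"] for c in retained_containers)
    shortfall = max(0, units_needed - reusable_units)
    return reusable_units, shortfall

retained = [{"host": "H1", "units": 1}]          # enough for one 1-unit instance
reused, to_request = plan_next_phase(retained, units_needed=6)
# reused -> 1 unit covered by the retained resources on H1 (instance D1 below);
# to_request -> 5 units, matching the "1 x 5" request described below.
```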
In the first step 362 of the process 300, the resource requesting module 122 requests the computational resources required to execute the single sub-phase of the third phase 231. In this case, the single sub-phase of the third phase 231 requires “6” computational resource units because it includes “6” instances of the fourth program component, D 232, which requires “1” computational resource unit per instance to execute. However, the first instance, D1, of the fourth program component, D 232 is assigned to the retained computational resources on the first host, H1 236, so a “1×5” computational resource request is sent to the resource manager 120 for the remaining five instances of the fourth program component, D 232. The resource manager 120 responds to the resource requesting module 122 by granting the “1×5” computational resource units. The granted computational resource units are shown with bold outlines in the execution module 112, with five computational resource units granted on the second host, H2 238.
In some examples, a host that is executing instances of program components may experience a failure during execution.
In some examples, execution of phases of a data processing graph may partially overlap and output data from one phase may be streamed to a subsequent phase rather than being stored to disk.
In some examples, the resource manager does not provide any indication of a quantity of computational resources available on the computing system. Rather, the resource manager accepts requests for computational resources and fulfills the requests (fully or partially) based on the computational resources available at the time of the request.
In the example described above, execution is rolled back due to a node failure. In other examples, execution is rolled back when the resource manager 120 “revokes” or “preempts” computational resources from the resource requesting module 122, usually without notice. For example, if another application requests resources on node H2 238, the resource manager 120 may determine that the other application is more important and revoke computational resources that have already been granted to an application. The rollback procedure described above is used to allocate new computational resources to replace the revoked resources.
In some examples, such as the examples described above, after computational resources for a phase are revoked, the entire phase is rolled back and restarted. In other examples, only program components of the phase that were using the preempted or revoked resources have their execution rolled back. Doing so advantageously avoids redundant work of rerunning program components that did not have their computational resources revoked.
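One way to express that narrower rollback policy is sketched below; the instance records and host names are illustrative placeholders that echo the hosts used in the examples above.

```python
# Hypothetical sketch: roll back only the instances that were running on failed
# or revoked resources; instances on unaffected hosts keep their results.
def rollback_plan(running_instances, lost_hosts):
    affected   = [i for i in running_instances if i["host"] in lost_hosts]
    unaffected = [i for i in running_instances if i["host"] not in lost_hosts]
    units_to_rerequest = sum(i["units"] for i in affected)
    return affected, unaffected, units_to_rerequest

running = [
    {"instance": "D1", "host": "H1", "units": 1},
    {"instance": "D2", "host": "H2", "units": 1},
    {"instance": "D3", "host": "H2", "units": 1},
]
redo, keep, units = rollback_plan(running, lost_hosts={"H2"})
# redo -> D2 and D3 (their 2 units are requested again from the resource manager);
# keep -> D1, whose work is not rerun.
```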
In the examples described above, the resource manager receives requests for computational resources and grants the resources if they are available. In other examples, the computational resource manager offers computational resources to programs wishing to execute in the execution environment rather than receiving requests. In such an arrangement, rather than making requests to the resource manager, programs listen to a stream of “offers” for available computational resources, and programs choose which (if any) of the offers they would like to take. One example of a resource manager that works this way is Apache Mesos.
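For illustration, the offer-driven interaction might be sketched as follows; this is a simplified, hypothetical shape of the exchange, not the actual Mesos API.

```python
# Hypothetical sketch of an offer-based manager: the program inspects offered
# resources and accepts only the offers it needs, leaving the rest untaken.
def consume_offers(offers, units_still_needed):
    accepted = []
    for offer in offers:                         # e.g. {"host": "H2", "units": 3}
        if units_still_needed <= 0:
            break                                # satisfied: decline remaining offers
        accepted.append(offer)
        units_still_needed -= offer["units"]
    return accepted, max(0, units_still_needed)

offers = [{"host": "H1", "units": 4}, {"host": "H2", "units": 3}, {"host": "H3", "units": 8}]
taken, remaining = consume_offers(offers, units_still_needed=6)
# taken -> the H1 and H2 offers (7 units in total); remaining -> 0;
# the H3 offer is not taken.
```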
Performing computation associated with a program or a program portion can also be referred to as executing the program or program portion. The program can be specified as a dataflow graph and the program portions can be specified as components of a dataflow graph. The program can be specified as a procedural program specification and the program portions can be specified as subroutines.
Partitioning of a portion of the program can be performed in a way that preserves an order of execution of at least some of the program components. Partitioning of a portion of the program portion can be performed in a way that maximizes a usage of the received computational resources while preserving an order of execution of at least some of the program components.
The computational resource allocation approaches described above can be implemented, for example, using a programmable computing system executing suitable software instructions, or they can be implemented in suitable hardware such as a field-programmable gate array (FPGA) or in some hybrid form. For example, in a programmed approach the software may include procedures in one or more computer programs that execute on one or more programmed or programmable computing systems (which may be of various architectures such as distributed, client/server, or grid), each including at least one processor, at least one data storage system (including volatile and/or non-volatile memory and/or storage elements), and at least one user interface (for receiving input using at least one input device or port, and for providing output using at least one output device or port). The software may include one or more modules of a larger program, for example, that provides services related to the design, configuration, and execution of data processing graphs. The modules of the program (e.g., elements of a data processing graph) can be implemented as data structures or other organized data conforming to a data model stored in a data repository.
The software may be stored in non-transitory form, such as being embodied in a volatile or non-volatile storage medium, or any other non-transitory medium, using a physical property of the medium (e.g., surface pits and lands, magnetic domains, or electrical charge) for a period of time (e.g., the time between refresh periods of a dynamic memory device such as a dynamic RAM). In preparation for loading the instructions, the software may be provided on a tangible, non-transitory medium, such as a CD-ROM or other computer-readable medium (e.g., readable by a general or special purpose computing system or device), or may be delivered (e.g., encoded in a propagated signal) over a communication medium of a network to a tangible, non-transitory medium of a computing system where it is executed. Some or all of the processing may be performed on a special purpose computer, or using special-purpose hardware, such as coprocessors or field-programmable gate arrays (FPGAs) or dedicated, application-specific integrated circuits (ASICs). The processing may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computing elements. Each such computer program is preferably stored on or downloaded to a computer-readable storage medium (e.g., solid state memory or media, or magnetic or optical media) of a storage device accessible by a general or special purpose programmable computer, for configuring and operating the computer when the storage device medium is read by the computer to perform the processing described herein. The inventive system may also be considered to be implemented as a tangible, non-transitory medium, configured with a computer program, where the medium so configured causes a computer to operate in a specific and predefined manner to perform one or more of the processing steps described herein.
A number of embodiments of the invention have been described. Nevertheless, it is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the following claims. Accordingly, other embodiments are also within the scope of the following claims. For example, various modifications may be made without departing from the scope of the invention. Additionally, some of the steps described above may be order independent, and thus can be performed in an order different from that described.
This application claims the benefit of U.S. Provisional Application No. 63/196,757 filed Jun. 4, 2021, the entire contents of which are incorporated herein.