The present invention relates to distributed computing.
Currently, when computer applications are submitted to distributed computing networks/resources, standard modes of communication such as TCP/IP and MPI are used. TCP/IP does not provide for any scheduling or management of latency in the network, and MPI is only used to synchronise communications between parallel processes.
When parallel or otherwise distributed computer jobs are submitted to a network, there are no existing ways to manage the communications other than by making an a-priori assessment of the optimal partitioning (division) of the job, and assuming a level of competition for resources from other applications, users or processes. There is also no way to make use of a new resource that is added, or to adapt to changes in topology or network performance.
According to a first aspect of the present invention there is provided a computer-implemented method of allocating a task to a set of distributed computing resources, the method including: obtaining resource data describing a set of distributed computing resources; obtaining task data describing a computing task to be performed; and selecting at least one of the distributed computing resources for performing the task based on the obtained description of the task.
According to another aspect of the present invention there is provided apparatus for allocating a task to a set of distributed computing resources, the apparatus including: a device configured to obtain resource data describing a set of distributed computing resources; a device configured to obtain task data describing a computing task to be performed; and a device configured to select at least one of the distributed computing resources for performing the task based on the obtained description of the task.
According to a further aspect of the present invention there is provided a computer-implemented method of generating resource information describing a set of distributed computing resources in a network, the method including: selecting a first resource in the network; interrogating the resource to determine its characteristics; storing data describing the characteristics; and selecting at least one further resource that is in communication with the first resource and repeating the interrogating and storing steps for the at least one further resource. According to another aspect of the invention there is provided apparatus configured to perform this method.
According to yet another aspect of the present invention there is provided a computer-implemented method of generating task information describing a computing task to be performed using distributed computing resources, the method including analysing source or executable code describing the task to obtain statistics (or estimated statistics) of the computational requirements of the task. According to another aspect of the invention there is provided apparatus configured to perform this method.
According to further aspects of the present invention there are provided computer program products comprising a computer readable medium having thereon computer program code means which, when the program code is loaded, make the computer execute methods substantially as described herein.
Whilst the invention has been described above, it extends to any inventive combination of the features set out above or in the following description. Although illustrative embodiments of the invention are described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments. As such, many modifications and variations will be apparent to practitioners skilled in this art. Furthermore, it is contemplated that a particular feature described either individually or as part of an embodiment can be combined with other individually described features, or parts of other embodiments, even if the other features and embodiments make no mention of the particular feature. Thus, the invention extends to such specific combinations not already described.
The invention may be performed in various ways, and, by way of example only, embodiments thereof will now be described, reference being made to the accompanying drawings, in which:
In the shown example of
As will be known to the skilled person, the various nodes (e.g. computing/storage devices) in the network and the links between them can have many different individual characteristics. Conventionally, users often have to know, estimate or look up these characteristics before selecting which elements will be used to perform a distributed computing task. This is prone to human error and will not usually result in optimal distribution of a task to the most suitable resources. Embodiments of the present system provide the following features in an attempt to solve this problem:
One or more computers executing code for implementing processes 1.-4. above can be used. The computer(s) may be part of the network that will be used for executing the distributed computing task, or may be separate from it. The processes 1.-4. may be part of a single application, or may be divided into separate modules, e.g. a resource description-building program, a task description-building program, etc.
At step 304, one of the network nodes is selected as a “head node” that will be the starting point for a process that builds the description of the available resources. This head node data may be selected/input by the user or retrieved from a store, e.g. where the resource description-building program has been set up with default head node data for one or more network setups.
Steps 306 and 308 can be performed as part of a loop of steps. Starting with the selected head node, the resource description-building program interrogates the connection(s) and other node(s) in communication with that node and generates data describing their attributes. That description data is then stored, e.g. in the data structure 200 shown in
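The loop of steps 306 and 308 can be read as a breadth-first traversal that starts at the head node. The following is a minimal sketch under stated assumptions, not the implementation of the embodiment: the `NETWORK` dictionary, the `interrogate` probe, the attribute names and the shape of the stored description are all hypothetical stand-ins for a real network and for the data structure referred to above.

```python
from collections import deque

# Hypothetical stand-in for a real network: node -> (attributes, neighbours).
NETWORK = {
    "head": ({"cpu_ghz": 3.0, "ram_gb": 16}, ["a", "b"]),
    "a": ({"cpu_ghz": 2.4, "ram_gb": 8}, ["head", "c"]),
    "b": ({"cpu_ghz": 2.0, "ram_gb": 4}, ["head"]),
    "c": ({"cpu_ghz": 3.2, "ram_gb": 32}, ["a"]),
}


def interrogate(node):
    """Assumed probe: return (attributes, neighbours) of a node.

    A real probe would query the device and its links over the network.
    """
    attrs, neighbours = NETWORK[node]
    return dict(attrs), list(neighbours)


def build_resource_description(head_node):
    """Interrogate and store every node reachable from the head node."""
    description = {}
    queue = deque([head_node])
    while queue:
        node = queue.popleft()
        if node in description:
            continue  # already interrogated via another link
        attrs, neighbours = interrogate(node)
        description[node] = {"attributes": attrs, "links": neighbours}
        # Select further resources in communication with this one.
        queue.extend(n for n in neighbours if n not in description)
    return description


resources = build_resource_description("head")
```

Each node is interrogated once even when it is reachable through several links, so the loop terminates on networks that contain cycles.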
At step 504 the task to be performed is analysed so as to assess its computational requirements (in terms of those obtained at step 502). It will be appreciated that there are several ways of doing this. For example, the overall task may be broken down step-by-step, or into sections/groups of steps, and the number of integer operations required by a particular step/section may be recorded using a program that analyses the task source or executable code. Alternatively, a user may analyse the code to produce an estimate. The integer operations recorded for each step/section can then be summed to give a total for the entire task, and the process can be repeated for the other computational requirements. At step 506 an output representing the results of step 504 is produced. This can be in any suitable format, e.g. XML, preferably one that can be read by the network operating system and by a program for allocating network resources to perform the task.
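By way of illustration only, the summing of per-section requirements and the production of an XML output might look as follows. This is a sketch under assumptions: the element names (`task`, `section`, `requirement`, `totals`) and the requirement labels are hypothetical, since no schema is prescribed, and the per-section counts are taken as already measured or estimated.

```python
import xml.etree.ElementTree as ET


def summarise_task(sections):
    """Sum each computational requirement over all sections of the task."""
    totals = {}
    for section in sections:
        for requirement, count in section["requirements"].items():
            totals[requirement] = totals.get(requirement, 0) + count
    return totals


def task_description_xml(task_name, sections):
    """Serialise per-section counts and overall totals as an XML string."""
    root = ET.Element("task", name=task_name)
    for section in sections:
        sec = ET.SubElement(root, "section", name=section["name"])
        for requirement, count in sorted(section["requirements"].items()):
            ET.SubElement(sec, "requirement", type=requirement, count=str(count))
    totals_el = ET.SubElement(root, "totals")
    for requirement, count in sorted(summarise_task(sections).items()):
        ET.SubElement(totals_el, "requirement", type=requirement, count=str(count))
    return ET.tostring(root, encoding="unicode")


# Hypothetical counts for a task split into two sections.
example_sections = [
    {"name": "load", "requirements": {"integer_ops": 500, "memory_mb": 64}},
    {"name": "solve", "requirements": {"integer_ops": 1000, "float_ops": 2000}},
]
example_xml = task_description_xml("demo", example_sections)
```

Keeping the per-section figures alongside the totals lets an allocating program place individual sections on different resources rather than treating the task as a single unit.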
At step 606 the task is allocated to at least one of the network resources. It will be appreciated that there are several methods of doing this. For example, a resource-allocating program can use conventional algorithms, such as stochastic, deterministic or heuristic optimisation algorithms, to allocate parts of the task to various resources. The skilled person will be able to find/derive suitable techniques from the field of Operations Research. These can include linear and integer programming techniques for both discrete (where the variables can take on only a set of pre-defined values) and continuous (where the variables are any (vector of) real-valued numbers) optimisation methods. Nonlinear techniques may also be used.
A non-exhaustive list of examples of suitable Operations Research techniques includes: Branch and Bound (a technique for solving discrete optimisation problems by organising the search in a tree. At each node of the tree, bounds on the objective are computed, which are used to exclude parts of the tree from the search); Dynamic Programming (a method for solving dynamic (i.e. with time structure) optimisation problems using recursion); Integer Programming (optimisation where the variables may only take integer values, i.e. 0, 1, 2, 3, . . . ); Lagrangian Relaxation (a transformation of an optimisation problem in which constraints are moved to the objective, multiplied by auxiliary parameters, so-called Lagrangian multipliers. These multipliers become variables in the so-called dual problem); Linear Programming (optimisation where the objective function and constraints are linear); Simplex Algorithm (an algorithm for optimisation without constraints that uses only objective function values, i.e. no derivatives. The objective is calculated at the vertices of a simplex, and a new vertex is produced by mirroring the worst vertex in the plane spanned by the other vertices. The Nelder-Mead simplex method is very popular because it is easy to understand and implement, and does not require derivatives); Quadratic Programming (optimisation where the objective function is quadratic and the constraints are linear). A suitable optimisation scheme may be a combination of any of the above (and/or other) schemes and so-called heuristics, which require knowledge about the particular problem being solved. For distributing the processing task to the networked resources, it is likely that a combination of Dynamic Programming and Integer Programming will be best, including heuristics to account for the existing knowledge (normally based on records of past performance) of the interpretation of the integer values in directing network resources.
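As one concrete instance of the listed techniques, the sketch below uses Branch and Bound to assign each part of a task to exactly one resource so that the time at which the last resource finishes is as small as possible. It rests on assumptions not stated in the description: each part has a single scalar cost, each resource a single scalar speed, there is at least one resource, and communication cost is ignored. The function name and the cost model are hypothetical.

```python
def allocate_branch_and_bound(part_costs, resource_speeds):
    """Assign each task part to one resource, minimising the finishing time.

    Returns (finishing_time, assignment), where assignment[i] is the index
    of the resource chosen for part i. Assumes at least one resource.
    """
    # Branching on the largest parts first tends to tighten the bound early.
    order = sorted(range(len(part_costs)), key=lambda i: -part_costs[i])
    best = {"time": float("inf"), "assignment": None}
    loads = [0.0] * len(resource_speeds)
    assignment = [None] * len(part_costs)

    def search(depth):
        # Loads only grow as parts are added, so the current maximum load
        # is a lower bound on the finishing time of every completion.
        current = max(loads)
        if current >= best["time"]:
            return  # bound: this subtree cannot beat the best found so far
        if depth == len(order):
            best["time"] = current
            best["assignment"] = list(assignment)
            return
        part = order[depth]
        for r, speed in enumerate(resource_speeds):
            duration = part_costs[part] / speed
            loads[r] += duration
            assignment[part] = r
            search(depth + 1)
            loads[r] -= duration
        assignment[part] = None

    search(0)
    return best["time"], best["assignment"]


# Four parts, two resources; the first resource is twice as fast.
finish_time, chosen = allocate_branch_and_bound([8, 4, 4, 2], [2, 1])
```

The search is exact but exponential in the worst case, so for large tasks it would in practice be combined with the heuristics mentioned above.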
Factors such as resource availability and cost may also be taken into account by the algorithm. The method can include optimisation algorithms such as genetic algorithms; simulated annealing; operational analysis techniques; heuristics based on prior knowledge; machine learning techniques such as neural networks and Artificial Intelligence, all of which will be familiar to the skilled person.
Step 608 can be performed if the network resources change during execution of the task. For instance, if a processor is urgently required for performing another task, or becomes unavailable for some other reason, then the resource-allocating program analyses the remaining available resources (based on the descriptions obtained) and attempts to re-allocate part of the distributed task to another suitable resource. This re-allocation can be performed dynamically or statically. If a network-distributed program is already running, then it can be undesirable to stop (or pause) it while reallocating resources for performing a task, because resource availability (or cost) may change on an ad hoc basis. Dynamic re-allocation can allow the process to continue substantially uninterrupted whilst changing the forward resource allocation profile (i.e. the result of the allocation optimisation process based on the task description and the resource description). The optimisation techniques described above are capable of enabling both static and dynamic planning and so the choice of technique can be dictated by the capability of the network Operating System.
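A re-allocation of the kind described for step 608 might be sketched as follows. This is an illustrative greedy heuristic under stated assumptions, not the method of the embodiment: the function name, the dictionaries of remaining work and resource speeds, and the rule of moving each orphaned part to the resource that would finish it soonest are all hypothetical. Parts already running on surviving resources are left where they are, in keeping with the aim of leaving the running process substantially uninterrupted.

```python
def reallocate(allocation, remaining_work, resource_speeds, lost_resource):
    """Move unfinished parts off a lost resource; leave other parts in place.

    allocation:      part -> resource currently assigned
    remaining_work:  part -> work still to be done (0 means finished)
    resource_speeds: resource -> work units processed per unit time
    Returns a new allocation; the input allocation is not modified.
    """
    available = {r: s for r, s in resource_speeds.items() if r != lost_resource}
    if not available:
        raise RuntimeError("no resources left to take over the work")

    # Estimated time each surviving resource still needs for its own parts.
    loads = {r: 0.0 for r in available}
    for part, resource in allocation.items():
        if resource in available:
            loads[resource] += remaining_work[part] / available[resource]

    new_allocation = dict(allocation)
    orphaned = [
        p for p, r in allocation.items()
        if r == lost_resource and remaining_work[p] > 0
    ]
    # Place the largest orphaned parts first, each on the resource that
    # would complete it earliest given the work it already holds.
    for part in sorted(orphaned, key=lambda p: -remaining_work[p]):
        target = min(
            available,
            key=lambda r: loads[r] + remaining_work[part] / available[r],
        )
        loads[target] += remaining_work[part] / available[target]
        new_allocation[part] = target
    return new_allocation


current = {"p1": "A", "p2": "B", "p3": "B", "p4": "C"}
work_left = {"p1": 10, "p2": 6, "p3": 0, "p4": 4}
speeds = {"A": 1.0, "B": 2.0, "C": 2.0}
updated = reallocate(current, work_left, speeds, lost_resource="B")
```

In this example only the unfinished part on the lost resource is moved; a full re-optimisation over all remaining parts is an alternative where the network operating system can tolerate the disruption.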
A tangible technical benefit provided by the inventive methods described above is that it is no longer necessary for an end-user to guess the availability of resource prior to submitting a job, or to understand fully the resource requirements for unfamiliar code. The limitations of TCP/IP in optimising a communication path are addressed by this invention because of the richer description of the resource requirements that a process is able to provide to the operating system and specialist sub-components.
Number | Date | Country | Kind |
---|---|---|---|
07270018.0 | Apr 2007 | EP | regional |
0706582.4 | Apr 2007 | GB | national |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/GB08/50243 | 4/4/2008 | WO | 00 | 7/10/2008 |