1. Field of the Invention
The present invention relates to a method for deciding on a distribution path of a task in a device that comprises one or more busses and a plurality of processing elements. Further, the present invention relates to a device and a system that are configured to decide on a distribution path.
2. Description of Related Art
Nowadays, large data amounts become available through the rapidly developing communication and computing techniques. Whereas highly specialized processing elements have been developed that are configured to efficiently execute different kinds of processing tasks, many resources are wasted because the tasks are inefficiently transported from a control element to a suitable processing element.
Some known hardware/software solutions may provide improvements in one direction or another. However, they still do not improve most, let alone all, of the above-listed criteria. Therefore, there is still a need for an improved hardware or software solution for optimizing the processing of tasks on a number of processing elements.
It is, therefore, an object of the present invention to provide a method, a device and a server system that overcome some of the above-mentioned problems of the prior art.
Particularly, the advantages of the present invention are achieved by appended independent claims. Further aspects, embodiments, and features of the present invention are specified in the appended dependent claims and the description and also make a contribution to achieving said advantages.
According to an embodiment of the present invention, the method for deciding on a distribution path of a task comprises the steps:
The present invention is based on the idea that, based on a cycle length prediction, the particular path and processing element that lead to the fastest processing of the task are chosen. The method of the present invention thus avoids the waste of resources caused by using unnecessarily long paths for communicating with a processing element or by using a processing element that is not ideally suited for processing a given task.
The present invention can be implemented in particular with bus systems where for at least one processing element at least two paths for communicating with this processing element are available. In particular, the invention is advantageous if the transfer times for the at least two paths are different.
Some elements of the bus can act both as control elements and as processing elements. For example, a first control element can send a task to a second control element, which then acts as processing element.
According to an embodiment of the present invention, access to the one or more busses is managed using a time division multiple access (TDMA) scheme. In a simple TDMA scheme, the active element of the bus changes in fixed time increments. In this way, it is determined in advance which element will be allowed to access the bus at which time. In the context of the present invention, this has the advantage that precise predictions about the future availability of the one or more busses can be made.
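As a non-limiting illustration of such an advance determination (the names and values in this sketch are hypothetical and do not form part of the claimed subject matter), the clock cycle in which a given element next becomes active under a round-robin TDMA scheme with fixed-length slots can be computed as follows:

```python
# Illustrative sketch, not the claimed implementation: with fixed TDMA
# slots assigned round-robin, future bus ownership is fully determined
# in advance and can be computed exactly.

def next_active_slot(element_index: int, num_elements: int,
                     current_cycle: int, slot_length: int = 1) -> int:
    """Return the first clock cycle >= current_cycle in which
    element_index owns the bus, assuming round-robin slots of
    slot_length cycles starting at cycle 0."""
    period = num_elements * slot_length
    slot_start = element_index * slot_length
    offset = (slot_start - current_cycle) % period
    return current_cycle + offset
```

For example, with eight elements and one-cycle slots, element 3 queried at cycle 10 is next active at cycle 11.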
According to a further embodiment of the present invention, access to the one or more busses is managed using a token passing scheme. In particular, an access token can be passed from a first element of the bus to the next element, when the first element is finished accessing the bus. Token passing schemes can be more efficient than simple TDMA schemes because idle time slots are avoided. On the other hand, the prediction of future bus availability can be more complicated. To this end, the control element can keep a table of current and future tasks to be executed on the bus. This allows an accurate prediction of future bus availability and choosing processing elements and transfer paths such that the one or more busses are used most efficiently.
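A hedged sketch of such a table of current and future bus tasks (the data layout is an illustrative assumption): given scheduled (start, duration) bus accesses, the control element can predict the earliest time at which the bus becomes free under a token passing scheme.

```python
# Illustrative sketch: predict future bus availability from a table of
# scheduled bus accesses, each given as (start_cycle, duration_cycles).

def predict_bus_free_time(task_table, now: int) -> int:
    """Return the earliest predicted time at which the bus is free,
    extending past back-to-back scheduled accesses."""
    free = now
    for start, duration in sorted(task_table):
        if start <= free:
            # This access occupies the bus at `free`; push the
            # predicted free time past its end.
            free = max(free, start + duration)
        else:
            # Gap before the next scheduled access: bus is free.
            break
    return free
```

Such a prediction allows choosing processing elements and transfer paths so that the one or more busses are used most efficiently.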
According to a further embodiment of the present invention, the one or more busses are set up as token rings, i.e. the neighbors of an element are the physical neighbors of this element.
The present invention can also be used with other protocols for controlling access to the one or more busses. These can include static and dynamic access control schemes, e.g. scheduling methods and random access methods.
The present invention can be used with different kinds of topologies, in particular linear busses, ring busses, branch topologies, star networks, and tree topologies. In some embodiments, the method of the present invention can even be used in conjunction with fully connected meshes.
A task can comprise one or more instructions and data.
Identifying one or more processing elements that are capable of processing the task can be performed for example by using a lookup table which for each processing element provides the information, which processing capabilities it has. For example, for a given processing element that comprises a graphical processing unit (GPU) the table could comprise the information that this processing element can process certain tasks relating to certain graphical processing instructions.
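The capability lookup described above can be sketched as follows (a non-limiting illustration; the element names and task types are hypothetical):

```python
# Illustrative capability lookup table: for each processing element,
# the set of task types it can process, e.g. a GPU-based element that
# handles certain graphical processing instructions.

CAPABILITIES = {
    "pe_gpu": {"shade", "rasterize"},
    "pe_dsp": {"fft", "filter"},
    "pe_cpu": {"fft", "filter", "shade", "generic"},
}

def capable_elements(task_type: str):
    """Return the processing elements capable of handling task_type."""
    return sorted(pe for pe, caps in CAPABILITIES.items()
                  if task_type in caps)
```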
Identifying one or more paths for communicating with the one or more identified processing elements can be implemented by looking up in a table through which busses a given processing element is connected with the control element that is requesting processing of this task. Even if there is only one bus available to communicate with the given processing element, there might be two directions available through which the control element can communicate with this processing element. In this case, there might be e.g. two paths available for communicating with the processing element in clockwise or counter-clockwise direction on a ring bus. Furthermore, a bus might comprise branches, which also result in a plurality of paths that are available for a communication with a given processing element.
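The two-direction case on a ring bus can be illustrated as follows (a non-limiting sketch; positions are indices on the ring): even a single ring yields two candidate paths with different hop counts.

```python
# Illustrative sketch: on a ring of ring_size elements, a control
# element at position src can reach a processing element at position
# dst either clockwise or counter-clockwise.

def ring_paths(src: int, dst: int, ring_size: int):
    """Return (clockwise_hops, counterclockwise_hops) between two
    positions on the ring."""
    cw = (dst - src) % ring_size
    ccw = (src - dst) % ring_size
    return cw, ccw
```

For example, on an eight-element ring, reaching position 3 from position 0 takes 3 hops clockwise but 5 hops counter-clockwise.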
Predicting a cycle length for one or more of the identified processing elements and the identified paths may comprise using two lookup tables: a first lookup table which stores path lengths for different paths between control elements and processing elements and a second lookup table which stores information about the expected processing time for different tasks and different processing elements. For example, the second lookup table could comprise the information that a certain graphical processing instruction requires 10 clock cycles to process on a first processing element, but only 8 clock cycles to process on a second processing element.
In other embodiments of the invention, there is only one lookup table, which comprises information about the expected processing times for different kinds of tasks on different processing elements. For example, such a table can comprise expected processing times for a certain instruction on a certain processing element, with further information about how the processing time varies depending on the amount of input data for this instruction.
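A single such table with data-dependent processing times could be sketched as follows (a non-limiting illustration; the instruction names and cycle counts are hypothetical):

```python
# Illustrative lookup table of expected processing times: for each
# (instruction, element) pair, a base cost plus a per-data-item cost
# models how the processing time varies with the amount of input data.

PROCESSING_TIME = {
    # (instruction, element): (base_cycles, cycles_per_data_item)
    ("blur", "pe_1"): (10, 2),
    ("blur", "pe_2"): (8, 3),
}

def expected_cycles(instruction: str, element: str, n_items: int) -> int:
    """Predict the processing time for n_items of input data."""
    base, per_item = PROCESSING_TIME[(instruction, element)]
    return base + per_item * n_items
```

Note that with such data-dependent costs, the fastest element can differ depending on the input size.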
In other words, the cycle length can be predicted based on one or more of the following pieces of information: knowledge of how the bus is structured; the state or position the bus and/or the processing elements are in at the moment; information about which tasks with which amounts of data need to be processed; and information on whether a given task comprises more datasets than can be stored in one vector, such that the task should ideally be distributed across the available processing elements, i.e. SIMD (single instruction, multiple data) across individual processing elements and processing steps.
In some cases, the predictions may be based on exact calculations. In other cases, the predictions may be based on heuristics and only be a rough estimation of the true path time or processing time.
According to an embodiment of the present invention, the cycle length for an identified processing element and an identified path is predicted based on:
The predicted forward transfer time and the predicted return transfer time may comprise the time for the entire input data to arrive at the processing element.
According to an embodiment of the present invention, the predicted cycle length is the sum of the predicted forward transfer time, the predicted return transfer time and the predicted processing time.
This embodiment has the advantage that the predicted cycle length is particularly quick and efficient to compute. In some embodiments, the sum of the predicted forward transfer time, the predicted return transfer time and the predicted processing time may be a weighted sum. This can be particularly useful if only some of the predicted times can be exactly calculated. In this case, a higher weighting may be given to the time which is exactly calculated.
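The (optionally weighted) sum described above can be restated compactly (a non-limiting sketch; the weights are illustrative):

```python
# Illustrative sketch of the predicted cycle length as a weighted sum
# of the predicted forward transfer time, processing time, and return
# transfer time. Exactly calculated components may receive a higher
# weighting than rough heuristic estimates.

def predicted_cycle_length(forward: float, processing: float, ret: float,
                           weights=(1.0, 1.0, 1.0)) -> float:
    wf, wp, wr = weights
    return wf * forward + wp * processing + wr * ret
```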
According to an embodiment of the present invention, predicting the cycle length is based on at least one of
Considering the current availability and/or utilization of the busses and the processing elements allows for an even more precise prediction of path time and processing time.
According to an embodiment of the present invention, the method further comprises:
Updating the predicted cycle length of the task to obtain a predicted remaining cycle length of the task has the advantage that further information, which becomes available only after the processing of the task has started, can be considered. For example, in cases where information becomes available that a processing element that has already started processing certain tasks has been unexpectedly slowed down, it may be decided to cancel processing of the task on this processing element and defer the task to a different processing element.
This embodiment of the invention has the further advantage that the processing of the task on a given processing element can be canceled if the processing takes much longer than predicted, which may be an indication that the processing time on this processing element was falsely predicted.
In other embodiments of the invention, the processing of a task on a selected processing element can be canceled if the control element determines that this processing element is needed to process a task with higher priority. This can be particularly relevant in a case of predicted likely future tasks.
In a further preferred embodiment of the invention, the information that the processing of tasks on a given processing element has taken a longer time than predicted is stored in a table and considered when predicting processing elements for similar tasks. In particular, if the processing of a certain task has failed on a given processing element, this information can be stored in a table. In extreme cases, where the processing of a certain kind of task has repeatedly failed on a given processing element, it may be decided that similar tasks should not be processed on this processing element, even if the processing element indicates that it is available.
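Such a failure-history table could be sketched as follows (a non-limiting illustration; the threshold and names are hypothetical assumptions):

```python
# Illustrative sketch: record failures per (task type, element) pair
# and avoid an element for similar tasks once failures repeat, even
# if the element reports itself as available.

from collections import defaultdict

class FailureHistory:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = defaultdict(int)  # (task_type, element) -> count

    def record_failure(self, task_type: str, element: str) -> None:
        self.failures[(task_type, element)] += 1

    def should_avoid(self, task_type: str, element: str) -> bool:
        return self.failures[(task_type, element)] >= self.max_failures
```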
According to an embodiment of the present invention, the method further comprises:
This embodiment provides a simple way of deciding when execution of a certain task should be canceled because it is taking significantly longer than expected, which is likely due to a processing failure.
According to a further embodiment of the invention, there is provided a device, comprising
According to an embodiment of the present invention, at least one of the control elements is configured to predict the cycle length based on
According to an embodiment of the present invention, at least one of the control elements is configured to carry out the steps:
According to an embodiment of the present invention, the device further comprises a busy table comprising information about the current availability and/or utilization of the plurality of processing elements, wherein the control element is configured to regularly update the information in the busy table.
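A minimal sketch of such a busy table, assuming a simple per-element record of availability and utilization (the field names are illustrative assumptions):

```python
# Illustrative busy table: the control element regularly updates, for
# each processing element, whether it is busy and how utilized it is.

class BusyTable:
    def __init__(self):
        self.entries = {}  # element -> (busy: bool, utilization: float)

    def update(self, element: str, busy: bool, utilization: float) -> None:
        self.entries[element] = (busy, utilization)

    def available_elements(self):
        """Return the elements currently reported as not busy."""
        return sorted(e for e, (busy, _) in self.entries.items()
                      if not busy)
```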
According to an embodiment of the present invention, the one or more busses comprise one or more rings.
According to a further embodiment of the present invention, the one or more busses comprise a first set of busses for transporting instructions and a second set of busses for transporting data. This has the advantage that the first set of busses can be optimized for low-latency transmission of instructions and the second set of busses can be optimized for high-bandwidth transmission of potentially large amounts of data. In particular, the first and second set of busses can operate at different frequencies, e.g. the first set of busses can operate at a higher frequency whereas the second set of busses operates at a lower frequency, but provides a higher transmission capacity per cycle.
According to a further embodiment of the present invention, the one or more busses comprise two rings that are unidirectional and oriented in opposite directions.
In this way, the present invention can be executed in a particularly efficient manner because a lot of data transport time can be saved if the more suitable of the two differently oriented ring busses is chosen.
According to an embodiment of the present invention, the one or more busses comprise an Element Interconnect Bus.
According to a further embodiment of the present invention, at least one of the plurality of processing elements is connected to the one or more busses and additionally comprises a direct connection to the primary processing element.
According to an embodiment of the present invention, the device further comprises a prediction module that is configured to predict future tasks based on previously processed tasks.
Predicting future tasks has the advantage that data required for a future task can be preloaded before the task is actually executed. For example, if it is detected that previous tasks involved loading data1.jpg, data2.jpg, and data3.jpg, the prediction module could predict that a future task will likely involve loading a data4.jpg, if it exists, and thus preload data4.jpg before the corresponding task is started. In a preferred embodiment, such preloading of data is performed only if the system is under low load, for example, if the current load of the control element is lower than a predetermined threshold value.
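The sequence-based prediction of the example above can be sketched as follows (a non-limiting illustration; the filename pattern and load threshold are hypothetical assumptions):

```python
# Illustrative sketch: if the recent load history follows a numbered
# filename pattern (data1.jpg, data2.jpg, ...), predict the next file
# in the sequence, and preload it only when the system load is low.

import re

def predict_next_file(history):
    """Return the predicted next filename, or None if the history does
    not follow a consecutive numbered pattern."""
    matches = [re.fullmatch(r"(\D*)(\d+)(\.\w+)", f) for f in history]
    if len(matches) < 2 or not all(matches):
        return None
    prefix, suffix = matches[-1].group(1), matches[-1].group(3)
    numbers = [int(m.group(2)) for m in matches]
    if all(b - a == 1 for a, b in zip(numbers, numbers[1:])):
        return f"{prefix}{numbers[-1] + 1}{suffix}"
    return None

def maybe_preload(history, current_load: float, threshold: float = 0.5):
    """Preload the predicted file only if the control element's current
    load is below the predetermined threshold."""
    if current_load < threshold:
        return predict_next_file(history)
    return None
```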
According to a further embodiment of the present invention, the device is configured to cancel one or more predicted future tasks in favor of executing current tasks if one or more new tasks arrive after beginning execution of one or more predicted future tasks. For example, it may turn out that the prediction was not accurate and the new tasks should be executed instead of the predicted future tasks.
According to a further embodiment of the present invention, there is provided a server system, comprising a device according to one of the above-described embodiments.
In this way, a server system is preferably also configured such that it provides all of the positive effects listed in the present application. Additionally, integration with and/or use of existing data center infrastructures/components/modules/elements is enabled at the same time.
According to an embodiment of the present invention, there is provided an ASIC or FPGA which is configured to carry out the method as outlined above and explained in more detail below.
According to a further aspect of the present invention, the one or more busses, the one or more control elements, and at least some of the plurality of processing elements are located inside the same chip housing. This has the advantage that a particularly high bandwidth can be achieved for communicating with the components that are located within the same housing. Furthermore, this set-up yields cost savings in mass production.
According to a further embodiment of the present invention, there is provided a computer-readable medium comprising a program code, which, when executed by a computing device, causes the computing device to carry out the method as outlined above and explained in more detail below.
Further objects, features, and advantages of this invention will become readily apparent to persons skilled in the art after a review of the following description, with reference to the drawings and claims that are appended to and form a part of this specification.
The ring busses 112, 114 are set up as direct connections between the connected elements 120-134, operated in a time-shifted manner. For the system of
Successively, the connected elements 120-134 are allowed to write, i.e., the active status is passed from one element to the next and read or write operations can only be performed by the element that is active at a given point in time. In some embodiments, more than one task can be transported in one clock cycle. Also, more than one dataset can be attached to one task (SIMD). Depending on the number of bus rings, the number of connected elements 120-134 and the starting position and direction of the pointer, it can happen that more than one ring addresses the same element at one point in time. For this case, a FIFO buffer can be provided that absorbs the additional instructions and data. In
It should be noted that in other embodiments of the invention, the ring busses 112, 114; 212, 214 shown in
One mode of operation according to an embodiment of the invention shall be illustrated with the following example: Assume that primary processing element 520a acts as a control element and sends a task that can be processed on one of the secondary processing elements 536-550. According to a prior art processing method, based on a previous successful result stored in one of the lookup tables, the task would be sent to secondary processing element 540 using first ring 512, which requires 14 clock cycles. After processing in the secondary processing element 540, which requires 4 clock cycles, the output data would be returned to primary processing element 520a on the first ring 512, which takes another 3 clock cycles. It takes a further 13 clock cycles before the active slot is returned to the primary processing element 520a. This yields a total cycle time of 14+4+13+3=34 clock cycles. According to the present invention, it would ideally be determined that the predicted cycle time is only 3+4+0+3=10 clock cycles if the task is sent to the secondary processing element 540 via the second ring 514 and returned to the primary processing element 520a via the first ring 512 without any bus waiting time, because by set-up ring 514 may have an offset that exactly matches that of ring 512. In this example, the method according to the present invention reduces the cycle time to less than a third of the cycle time of the prior art approach.
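The arithmetic of this example can be restated compactly (a non-limiting illustration; the cycle counts are those given in the example above):

```python
# Illustrative recomputation of the example: total cycle time is the
# sum of forward transfer, processing, bus waiting, and return
# transfer times.

def total_cycles(forward: int, processing: int, wait: int, ret: int) -> int:
    return forward + processing + wait + ret

prior_art = total_cycles(14, 4, 13, 3)  # first ring both ways: 34
invention = total_cycles(3, 4, 0, 3)    # second ring out, first ring back: 10
```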
The n connected elements correspond to n different pointer positions.
Alternatively, the bus system 610 can also be set up using a token passing scheme where the token is passed from one station to the next, wherein the “next” station is defined based on the addresses of the bus interfaces of the elements connected to the bus.
In a further embodiment of the invention, the pointer can be pushed or pulled by a connected control element to receive or send data to or from any other connected element.
For example, the RAM component 722 has a total of three physical neighbors: control element 720b, processing element 730 of the second part 712b and processing element 740 of the third part 712c. Therefore, access to this bus system 710 should be managed with a token passing scheme where the neighbor relations are defined based on the addresses of the connected elements. It should be noted that linear parts 712b and 712c can be active at the same time. Temporary or second-level tokens are used to assign the active slot within one linear part. Knowledge about the current state and the predicted future availability of the linear parts 712a, 712b and 712c can be used by the cycle prediction method and by the decision as to which processing elements the tasks are assigned to.
In an advantageous embodiment, to allow for the use of more than one token per bus 712a, 712b, 712c, there is a primary branch part and a plurality of secondary branch parts. This is illustrated in
To avoid conflicts, there can only be one global token 750, which always has traversing priority. The global token 750 is indicated in
Access to the busses 812, 814 can be implemented with a simple time division multiple access scheme. Alternatively, for example, a token passing scheme or a combination of the two can be used.
With regard to the embodiments explained above, it has to be noted that said embodiments may be combined with each other. Furthermore, it is understood, that the bus systems shown in the drawings can comprise further elements and further busses that are not shown in the drawings. In particular, branches as shown in
The bus systems 110, 210, 310, 410, 510, 610, 710, 810 in particular form part of a device D. The device comprises therefore one or more busses 112, 114, 212, 214, 312, 314, 412, 512, 514, 612, 614, 712a, 712b, 712c, 812, 814, one or more control elements 120, 220, 320, 420, 520a, 520b, 620, 720a, 720b and a plurality of processing elements 122-134, 222, 322-334, 422-434, 522-550, 620-640, 720a-742, 822-842. In this device D, at least one of the control elements 120, 220, 320, 420, 520a, 520b, 620, 720a, 720b is configured to decide on a distribution path for a task based on:
Furthermore, there is a server system, comprising at least one device D being configured according to the aspects mentioned.
In an alternative embodiment, dedicated hardware implementations, such as application specific integrated circuits, programmable logic arrays, and other hardware devices, can be constructed to implement one or more of the methods described herein. Applications that may include the apparatus and systems of various embodiments can broadly include a variety of electronic and computer systems. One or more embodiments described herein may implement functions using two or more specific interconnected hardware modules or devices with related control and data signals that can be communicated between and through the modules, or as portions of an application-specific integrated circuit. Accordingly, the present system encompasses software, firmware, and hardware implementations.
In accordance with various embodiments of the present disclosure, the methods described herein may be implemented by software programs executable by a computer system. Further, in an exemplary, non-limited embodiment, implementations can include distributed processing, component/object distributed processing, and parallel processing. Alternatively, virtual computer system processing can be constructed to implement one or more of the methods or functionality as described herein.
Further, the methods described herein may be embodied in a computer-readable medium. The term “computer-readable medium” includes a single medium or multiple media, such as a centralized or distributed database, and/or associated caches and servers that store one or more sets of instructions. The term “computer-readable medium” shall also include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor or that cause a computer system to perform any one or more of the methods or operations disclosed herein.
As a person skilled in the art will readily appreciate, the above description is meant as an illustration of the principles of this invention. This description is not intended to limit the scope or application of this invention in that the invention is susceptible to modification, variation, and change, without departing from the spirit of this invention, as defined in the following claims.
Number | Date | Country | Kind |
---|---|---|---|
14185007.3 | Sep 2014 | EP | regional |
This application is a continuation of International Application No. PCT/EP2015/070382 having an international filing date of Sep. 7, 2015 and a priority date of Sep. 16, 2014, which is incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
Parent | PCT/EP2015/070382 | Sep 2015 | US |
Child | 15444965 | US |