This application is related to the following papers and commonly owned applications:
The present subject matter relates to an intelligent redundancy management framework (IRMF) to self-heal configurable data units in a reconfigurable data processor.
The technology disclosed relates to an intelligent redundancy management framework (IRMF) to self-heal configurable data units in a reconfigurable data processor.
The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Systems with reconfigurable processors which execute dataflow graphs include a compiler which translates and synthesizes a machine learning model of the dataflow graphs onto arrays of reconfigurable units. For performing various operations related to the dataflow graphs, software needs to program a pool of healthy components. Efficient management of healing of defective components is required for increasing overall performance of such systems.
The technology will be described with reference to the drawings, in which:
In the following detailed description, numerous specific details are set forth by way of examples in order to provide a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details. In other instances, well-known methods, procedures and components have been described at a relatively high level, without detail, in order to avoid unnecessarily obscuring aspects of the present concepts. A number of descriptive terms and phrases are used in describing the various embodiments of this disclosure. These descriptive terms and phrases are used to convey a generally agreed upon meaning to those skilled in the art unless a different definition is given in this specification. Some descriptive terms and phrases are presented in the following paragraphs for clarity. The technology disclosed relates to self-healing of a coarse-grained reconfigurable (CGR) processor using healthy components or, in other words, self-healing of the operations that the runtime wants to perform in a CGR processor.
More specifically, embodiments of the present disclosure describe an intelligent redundancy management framework (IRMF) which can self-heal any operations that have defective configurable units. In other words, the IRMF heals any configuration file as a mechanism to avoid any defective configurable units in the CGR processor. A CGR processor includes arrays of reconfigurable units arranged as “tiles.” Each tile may also be referred to as a “minimum compute/computing unit.” In order to execute a dataflow graph, a CGR processor has to perform a range of graph-related operations (e.g., running a graph, tuning the hyper-parameters of a graph, updating input/output endpoints of a graph, etc.).
The IRMF is a system to efficiently manage operations (such as configure, translate, observe, etc.) with components that have been marked as defective. In order to complete the operations, software needs to directly or indirectly program a pool of components of a minimum compute unit of a CGR processor. Some components may be healthy while some may be defective as a result of inaccuracies in the manufacturing process. In order for the operations to run correctly, any defective component must be “healed.” A CGR processor can typically include a plurality of tiles, each including an array of configurable units. The IRMF manages each tile's defective components.
In one example, the disclosed system can include a control plane (also known as a control layer) coupled to initialize the IRMF and a data plane (also known as a data layer) coupled to interact with the IRMF in order to perform the graph-related operations. To optimize performance, the IRMF can perform healing either dynamically or statically. The IRMF is made aware of defective components for each tile.
As will be explained later in the specification, a tile as shown in
In one example, the control layer can be initialized with the defective component information or can dynamically detect defective components at runtime. For example, a manufacturing process could mark specific tile components as defective, and then, during system initialization, the IRMF may initialize itself with information about those components being defective.
In one example, the IRMF can dynamically inspect or identify the defective components. In some cases, the components may not be fully defective but affected by a defect to some degree.
For any graph-related operation, a plurality of components are configured to perform various computational tasks. In other words, various configurations are generated for a plurality of components for various computational tasks. In one example, these components include PCUs, PMUs, AGCUs, and switches. In other examples, there can be other reconfigurable units.
In one example, there can be rows of such components, and for each row a configuration is generated which is related to a specific computational task. Initially, the compiler generates various default configurations related to the computational tasks. The compiler also receives component health data from the manufacturer. Based on the health data, if any component (for example, a PCU) in a particular row is defective (or affected by a defect), then the entire row of components is considered defective; in other words, the configuration is considered defective and as such is discarded. Furthermore, a new (healthy) configuration is generated using a new row with all healthy components; in other words, the defective configuration is healed using healthy components.
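As a non-limiting illustration, the row-level rule described above can be sketched in Python. The health-data layout (a dictionary keyed by a hypothetical (kind, row, column) tuple) and the function names `defective_rows` and `heal_row` are assumptions made only for this sketch and are not taken from the disclosure.

```python
def defective_rows(health):
    """Return the rows that contain at least one defective component.

    `health` maps a hypothetical (kind, row, col) key to True (healthy)
    or False (defective, or affected by a defect).
    """
    return {row for (_kind, row, _col), healthy in health.items() if not healthy}


def heal_row(default_row, bad_rows, num_rows, in_use=()):
    """Return default_row if it is healthy, else a healthy row not already in use."""
    if default_row not in bad_rows:
        return default_row
    for row in range(num_rows):
        if row not in bad_rows and row not in in_use:
            return row
    raise RuntimeError("no healthy spare row available")


health = {
    ("PCU", 0, 0): True, ("PMU", 0, 1): True,
    ("PCU", 1, 0): False, ("PMU", 1, 1): True,   # one bad PCU marks row 1 defective
    ("PCU", 2, 0): True, ("PMU", 2, 1): True,
}
bad = defective_rows(health)
healed = heal_row(1, bad, num_rows=3, in_use={0})
```

In this sketch a single defective PCU is enough to mark row 1 as defective, and the configuration that targeted row 1 is moved to row 2, the only healthy row that is not already in use.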
In static healing, an alternate/healthy configuration for each row is preemptively generated by the compiler. For example, if there are n rows of components in the CGR, then assuming that at least one component could be defective in each row, “n” alternate/healthy configurations are generated preemptively by the compiler. As such the configuration file generated by the compiler includes both default configurations and the alternate configurations. In some examples, the compiler may first generate a default configuration file and then update that with alternate configurations to generate a final (updated) configuration file.
The configuration file including both the default configurations and the alternate configurations is then provided to the IRMF. The IRMF compares default configurations with the component health data. If any component is defective (or affected by a defect) then it replaces the default configuration comprising that component with an alternate configuration. As such, the IRMF replaces defective configurations with alternate configurations which are already present in the configuration file.
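One possible way to picture the static selection step is the following Python sketch. The dictionary layout of the configuration file, the string placeholders standing in for configurations, and the name `select_static` are hypothetical and serve only to illustrate the comparison against the component health data.

```python
def select_static(config_file, bad_rows):
    """Pick, per row, the default or the precompiled alternate configuration."""
    selected = {}
    for row, default_cfg in config_file["default"].items():
        if row in bad_rows:
            # The alternate was generated ahead of time by the compiler,
            # so the framework only has to look it up.
            selected[row] = config_file["alternate"][row]
        else:
            selected[row] = default_cfg
    return selected


config_file = {
    "default":   {0: "matmul@row0", 1: "relu@row1"},
    "alternate": {0: "matmul@spare", 1: "relu@spare"},
}
selected = select_static(config_file, bad_rows={1})
```

Because every alternate is already present in the file, the selection is a lookup and no configuration is generated at runtime.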
In dynamic healing, alternate configurations are generated by the IRMF itself. In this case, the configuration file generated by the compiler includes only the default configurations.
The configuration file including only the default configurations is then provided to the IRMF. The IRMF compares default configurations with the component health data. If any component is defective (or affected by a defect), then the IRMF itself generates an alternate configuration using healthy components. As such, the IRMF replaces defective configurations with alternate configurations, i.e., heals the defective configuration. In some cases, during dynamic healing, the IRMF can receive one configuration at a time from the compiler and check whether it is defective by comparing it with the component health data. If the configuration is defective, then the IRMF heals the configuration by generating an alternate configuration for it. For example, upon receiving a configuration for a row of components, if a component (PCU/PMU/AGCU/switch) is affected by a defect, then the IRMF discards that row of components and generates a new configuration using another row comprising healthy components.
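The one-configuration-at-a-time variant of dynamic healing might look like the following Python sketch. The configuration records (dictionaries with hypothetical `task` and `row` fields) and the generator `heal_stream` are assumptions for illustration; the sketch also assumes that a healed configuration must not reuse a row already claimed by an earlier configuration.

```python
def heal_stream(configs, bad_rows, num_rows):
    """Yield configurations one at a time, remapping rows that cannot be used."""
    used = set()
    for cfg in configs:
        row = cfg["row"]
        if row in bad_rows or row in used:
            # Generate the alternate on demand: first healthy, unused row.
            row = next(
                (r for r in range(num_rows) if r not in bad_rows and r not in used),
                None,
            )
            if row is None:
                raise RuntimeError("no healthy row left")
        used.add(row)
        yield {**cfg, "row": row}


defaults = [
    {"task": "load", "row": 0},
    {"task": "matmul", "row": 1},
    {"task": "store", "row": 2},
]
healed = list(heal_stream(defaults, bad_rows={1}, num_rows=4))
healed_rows = [cfg["row"] for cfg in healed]
```

Here the configuration for row 1 is healed onto row 2, which in turn pushes the configuration that originally targeted row 2 onto row 3.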
In such a case, the IRMF can update the configuration using healing information whenever the data plane wants to perform a configuration operation. These configurations are generated statically (i.e., by the compiler) and used at runtime. For example, when running a program on a tile, the bitfile needs to be healed so the program only uses healthy tile components. Configuration operations are used during application-specific operations as well as in the control plane to enable certain runtime operations (for example, PMU initialization on system initialization). In one example, the IRMF heals any configuration file as a mechanism to avoid any defective configurable units in the CGR processor.
Additionally, in one example, the data plane can perform observability operations, which allow a user to inspect the state of components of a specific tile. During observability operations, the runtime may generate a set of target components to inspect, which may include defective components, and the IRMF can then heal the set of target components in the configuration file to ensure that only healthy components are used for any operation. For example, if a user wants to inspect performance counters on a tile, the observability operation will try to target all the components on the tile. The IRMF can then heal the set of target components to make sure that performance counters of only healthy components are read.
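Healing a set of observability targets can be sketched as a simple filter. In the Python sketch below, the component names, the counter values, and the functions `heal_targets` and `read_counters` are all hypothetical; a real runtime would read hardware counters rather than a dictionary.

```python
def heal_targets(targets, defective):
    """Drop defective components from the set of inspection targets."""
    return [t for t in targets if t not in defective]


def read_counters(targets, defective, read_counter):
    """Read a counter from every healthy target and skip the defective ones."""
    return {t: read_counter(t) for t in heal_targets(targets, defective)}


all_components = ["PCU0", "PCU1", "PMU0", "PMU1"]
defective = {"PCU1"}
fake_hw = {"PCU0": 120, "PCU1": -1, "PMU0": 45, "PMU1": 60}   # made-up values
counters = read_counters(all_components, defective, fake_hw.__getitem__)
```

The observability operation targets every component on the tile, but only the counters of the three healthy components are read.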
An operation can be healed statically (compiler-generated) or dynamically (runtime-generated). Certain operations naturally map to either static or dynamic healing, so the entire system uses a hybrid approach. For example, some observability operations naturally map to a dynamic healing approach. For example, the user can specify a specific target component to inspect, and the IRMF heals that target component. Configuration operations can also be mapped via static or dynamic healing.
As can be appreciated by those skilled in the art, there is a space-time trade-off between static and dynamic healing. Space means the size of the configuration file. As the size of the configuration file increases, it can affect disk performance. As explained earlier, in a static healing scheme, the configurations are generated statically, meaning all possible healing options are generated preemptively in the same configuration file. Therefore, in static healing, the size of the configuration file is O(n), i.e., it grows linearly with n, where “n” is the number of defective component variations and one healing option (also referred to as a configuration) is stored per variation. In a dynamic healing scheme, a required healing option is generated when a configuration is about to be used. Therefore, in dynamic healing, the space or the size of the configuration file is O(1). As such, dynamic healing may take up less space than static healing.
Time is the runtime overhead to perform a configuration/observability operation. In static healing, a healing option is preemptively generated and stored in the program configuration file for each possible defective component configuration (i.e., each unique set of components that can be marked defective). Therefore, static healing may require less time compared to the dynamic healing, which is done during runtime.
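The space-time trade-off can be made concrete with a small Python sketch. The healing rule used here (move to the next row, wrapping around) and the names `heal`, `static_lookup`, and `dynamic_lookup` are placeholders chosen for illustration, not the healing function of the disclosure.

```python
def heal(bad_row, num_rows):
    """Hypothetical healing function: use the next row, wrapping around."""
    return (bad_row + 1) % num_rows


NUM_ROWS = 8

# Static healing: one stored option per possible defective row,
# so the stored table grows with the number of variations.
static_table = {bad: heal(bad, NUM_ROWS) for bad in range(NUM_ROWS)}


def static_lookup(bad_row):
    return static_table[bad_row]        # table lookup at runtime


def dynamic_lookup(bad_row):
    return heal(bad_row, NUM_ROWS)      # computed at runtime, nothing stored
```

Both paths return the same healing option; the static path pays with a table of `NUM_ROWS` entries, while the dynamic path pays by running `heal` each time it is needed.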
In one example, performance optimization can be achieved in dynamic healing. This is due to the fact that each possible configuration generated in dynamic healing can have different performance characteristics in terms of space or time. As such, by generating different configurations, the dynamic healing can optimize the performance characteristics of configuration operations.
As can be understood, in static healing, the time complexity is O(1) and the space complexity is O(n), where n is the number of defective component configurations. When n is small, the space overhead is negligible and may not noticeably degrade performance. Additionally, other practical methods (e.g., compression) can be used to reduce the space scaling factor. However, as n increases, the additional space requirements may affect performance because of latency or bandwidth requirements, which are mentioned in the configuration file.
In dynamic healing, the space complexity is O(1), since all healing options are generated dynamically. The time complexity of dynamic healing is O(f(n)), where f represents the time complexity of the healing function invoked at runtime and n is the number of defective component configurations. The complexity of the function is dependent on the implementation and may vary depending on the scheme used. For example, a simple healing function might be O(n), whereas a healing function which performs more optimization might be O(n^2). In summary, dynamic and static healing make a tradeoff between space and time, and a hybrid approach allows for fine-grained tuning of performance.
The IRMF can be made aware of different performance metrics, which can be used to optimize operation performance. While performing dynamic healing, the IRMF is solving an optimization problem, which is defined either by the user or by the compiler. For example, suppose that the IRMF is made aware of both the latency and bandwidth of memory. The compiler could specify that a given operation will be bottlenecked by memory bandwidth. When applying the dynamic healing function, the IRMF can optimize the healing for increased memory bandwidth, which would optimize the performance of the application.
In another example, the IRMF could be made aware of the time complexity of different dynamic healing schemes. The compiler or user could specify that the runtime setup of the application is a performance bottleneck. The dynamic healing function could then use a scheme which has a time complexity of O(n) rather than O(n^2) in order to optimize application performance, at the cost of a healing option that is sub-optimal with respect to memory bandwidth. Extending this, the IRMF can be made aware of any performance metric, both on the host and on the RDU, in order to optimize the performance of an operation.
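A metric-aware choice between healing schemes could be sketched as follows in Python. The bottleneck labels, the per-row bandwidth scores, and the functions `heal_first_fit`, `heal_max_bandwidth`, and `choose_scheme` are hypothetical. The sketch does not reproduce the O(n) and O(n^2) schemes mentioned above; it only shows a cheaper scheme and a more selective scheme being chosen from a stated bottleneck.

```python
def heal_first_fit(candidates, bandwidth):
    """Cheap scheme: take the first healthy candidate row."""
    return candidates[0]


def heal_max_bandwidth(candidates, bandwidth):
    """More selective scheme: take the candidate with the best bandwidth score."""
    return max(candidates, key=lambda row: bandwidth[row])


def choose_scheme(bottleneck):
    """Map a bottleneck hint (from the compiler or user) to a healing scheme."""
    if bottleneck == "memory_bandwidth":
        return heal_max_bandwidth
    return heal_first_fit


healthy_rows = [2, 5, 7]
bandwidth = {2: 10.0, 5: 40.0, 7: 25.0}   # made-up relative scores
fast = choose_scheme("runtime_setup")(healthy_rows, bandwidth)
best = choose_scheme("memory_bandwidth")(healthy_rows, bandwidth)
```

With these made-up scores, a runtime-setup bottleneck leads to row 2 (the first fit), whereas a memory-bandwidth bottleneck leads to row 5.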
Traditional compilers translate human-readable computer source code into machine code that can be executed on a Von Neumann computer architecture. In this architecture, a processor serially executes instructions in one or more threads of software code. The architecture is static, and the compiler does not determine how execution of the instructions is pipelined, or which processor or memory takes care of which thread. Thread execution is asynchronous, and safe exchange of data between parallel threads is not supported.
High-level programs for machine learning (ML) and artificial intelligence (AI) may require massively parallel computations, where many parallel and interdependent threads (meta-pipelines) exchange data. Such programs are ill-suited for execution on Von Neumann computers. They require architectures that are optimized for parallel processing, such as coarse-grained reconfigurable architectures (CGRAs) or graphic processing units (GPUs). The ascent of ML, AI, and massively parallel architectures places new requirements on compilers, including how computation graphs, and in particular dataflow graphs, are pipelined, which operations are assigned to which compute units, how data is routed between various compute units and memory, and how synchronization is controlled particularly when a dataflow graph includes one or more nested loops, whose execution time varies dependent on the data being processed.
As used herein, the phrase “one of” should be interpreted to mean exactly one of the listed items. For example, the phrase “one of A, B, and C” should be interpreted to mean any of: only A, only B, or only C.
As used herein, the phrases “at least one of” and “one or more of” should be interpreted to mean one or more items. For example, the phrase “at least one of A, B, and C” or the phrase “at least one of A, B, or C” should be interpreted to mean any combination of A, B, and/or C. The phrase “at least one of A, B, and C” means at least one of A and at least one of B and at least one of C.
Unless otherwise specified, the use of ordinal adjectives first, second, third, etc., to describe an object merely refers to different instances or classes of the object and does not imply any ranking or sequence.
The following terms or acronyms used herein are defined at least in part as follows:
Compiler—a translator that processes statements written in a programming language to machine language instructions for a computer processor. A compiler may include multiple stages to operate in multiple steps. Each stage may create or update an intermediate representation (IR) of the translated statements. Compiler stages are illustrated with reference to
Computation graph—some algorithms can be represented as computation graphs. As used herein, computation graphs are a type of directed graphs comprising nodes that represent mathematical operations/expressions and edges that indicate dependencies between the operations/expressions. For example, with machine learning (ML) algorithms, input layer nodes assign variables, output layer nodes represent algorithm outcomes, and hidden layer nodes perform operations on the variables. Edges represent data (e.g., scalars, vectors, tensors) flowing between operations. In addition to dependencies, the computation graph reveals which operations and/or expressions can be executed concurrently.
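To illustrate how a computation graph reveals which operations can be executed concurrently, the following Python sketch groups nodes into levels using the standard-library `graphlib.TopologicalSorter` (available in Python 3.9 and later). The node names and the helper `concurrent_levels` are hypothetical.

```python
from graphlib import TopologicalSorter


def concurrent_levels(deps):
    """Group nodes into levels; nodes in the same level have no mutual dependency.

    `deps` maps each node to the set of nodes it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # everything whose inputs are done
        levels.append(ready)
        sorter.done(*ready)
    return levels


# "c" and "d" both depend only on "a"; "e" needs both of them.
deps = {"c": {"a"}, "d": {"a"}, "e": {"c", "d"}}
levels = concurrent_levels(deps)
```

In this example, nodes "c" and "d" land in the same level, which indicates that they can be executed concurrently once "a" has completed.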
CGR unit—a circuit that can be configured and reconfigured to locally store data (e.g., a memory unit or a PMU), or to execute a programmable function (e.g., a compute unit or a PCU). A CGR unit includes hardwired functionality that performs a limited number of functions used in computation graphs and dataflow graphs. Further examples of CGR units include a CU and an AG, which may be combined in an AGCU. Some implementations include CGR switches, whereas other implementations may include regular switches.
CU—coalescing unit.
Data Flow Graph—a computation graph that includes one or more loops that may be nested, and wherein nodes can send messages to nodes in earlier layers to control the dataflow between the layers.
Datapath—a collection of functional units that perform data processing operations. The functional units may include memory, multiplexers, ALUs, SIMDs, multipliers, registers, buses, etc.
FCMU—fused compute and memory unit—a circuit that includes both a memory unit and a compute unit.
Graph—a collection of nodes connected by edges. Nodes may represent various kinds of items or operations, dependent on the type of graph. Edges may represent relationships, directions, dependencies, etc.
IC—integrated circuit—a monolithically integrated circuit, i.e., a single semiconductor die which may be delivered as a bare die or as a packaged circuit. For the purposes of this document, the term integrated circuit also includes packaged circuits that include multiple semiconductor dies, stacked dies, or multiple-die substrates. Such constructions are now common in industry, produced by the same supply chains, and for the average user often indistinguishable from monolithic circuits.
A logical CGR array or logical CGR unit—a CGR array or a CGR unit that is physically realizable, but that may not have been assigned to a physical CGR array or to a physical CGR unit on an IC.
ML—machine learning.
PCU—pattern compute unit—a compute unit that can be configured to perform one or more operations.
PEF—processor-executable format—a file format suitable for configuring a configurable data processor.
Pipeline—a staggered flow of operations through a chain of pipeline stages. The operations may be executed in parallel and in a time-sliced fashion. Pipelining increases overall instruction throughput. CGR processors may include pipelines at different levels. For example, a compute unit may include a pipeline at the gate level to enable correct timing of gate-level operations in a synchronous logic implementation of the compute unit, and a meta-pipeline at the graph execution level to enable correct timing of node-level operations of the configured graph. Gate-level pipelines are usually hard wired and unchangeable, whereas meta-pipelines are configured at the CGR processor, CGR array level, and/or CGR unit level.
Pipeline Stages—a pipeline is divided into stages that are coupled with one another to form a pipe topology.
PMU—pattern memory unit—a memory unit that can locally store data.
PNR—place and route—the assignment of logical CGR units and associated processing/operations to physical CGR units in an array, and the configuration of communication paths between the physical CGR units.
RAIL—reconfigurable dataflow unit (RDU) abstract intermediate language.
CGR Array—an array of CGR units, coupled with each other through an array-level network (ALN), and coupled with external elements via a top-level network (TLN). A CGR array can physically implement the nodes and edges of a dataflow graph and is sometimes referred to as a reconfigurable dataflow unit (RDU).
SIMD—single-instruction multiple-data—an arithmetic logic unit (ALU) that simultaneously performs a single programmable operation on multiple data elements delivering multiple output results.
TLIR—template library intermediate representation.
TLN—top-level network.
IRMF—intelligent redundancy management framework.
Initial configuration or default configuration—Configuration data in configuration file at the starting state.
The architecture, configurability and dataflow capabilities of an array of CGR units enable increased compute power that supports both parallel and pipelined computation. A CGR processor, which includes one or more CGR arrays (arrays of CGR units), can be programmed to simultaneously execute multiple independent and interdependent dataflow graphs. To enable simultaneous execution, the dataflow graphs may need to be distilled from a high-level program and translated to a configuration file for the CGR processor. A high-level program is source code written in programming languages like Spatial, Python, C++, and C, and may use computation libraries for scientific computing, ML, AI, and the like. The high-level program and referenced libraries can implement computing structures and algorithms of machine learning models like AlexNet, VGG Net, GoogleNet, ResNet, ResNeXt, RCNN, YOLO, SqueezeNet, SegNet, GAN, BERT, ELMo, USE, Transformer, and Transformer-XL.
Translation of high-level programs to executable bitfiles/configuration files is performed by a compiler, see, for example,
In dataflow processors with reconfigurable architectures, a pipeline of computational stages can be formed in the array of reconfigurable units to execute dataflow graphs. Since the computational stages can have different latencies, efficiently managing the pipeline, especially when it comes to providing the final output of the pipeline, can be challenging.
Host 180 may be or include a computer such as further described with reference to
CGR processor 110 may accomplish computational tasks by executing a configuration file file0 165. For the purposes of this description, a configuration file corresponds to a dataflow graph, or a translation of a dataflow graph, and may further include initialization data. A compiler 160 compiles the high-level program to provide the configuration file file0 165. In some implementations described herein, a CGR array is configured by programming one or more configuration stores with all or parts of the configuration file file0 165. A single configuration store may be at the level of the CGR processor or the CGR array, or a CGR unit may include an individual configuration store. The configuration file may include configuration data for the CGR array and CGR units in the CGR array and link the computation graph to the CGR array. Execution of the configuration file file0 165 by CGR processor 110 causes the CGR array(s) to implement the user algorithms and functions in the dataflow graph.
CGR processor 110 can be implemented on a single integrated circuit die or on a multichip module (MCM). An IC can be packaged in a single chip module or a multichip module. An MCM is an electronic package that may comprise multiple IC dies and other devices, assembled into a single module as if it were a single device. The various dies of an MCM may be mounted on a substrate, and the bare dies are electrically coupled to the surface of the substrate or to each other using, for example, wire bonding, tape bonding or flip-chip bonding.
A CGR array comprises an array of CGR units (e.g., PMUs, PCUs, FCMUs) coupled via an array-level network (ALN), e.g., a bus system. The ALN is coupled with the TLN 330 through several AGCUs, and consequently with I/O interface 338 (or any number of interfaces) and memory interface 339. Other implementations may use different bus or communication architectures.
Circuits on the TLN in this example include one or more external I/O interfaces, including I/O interface 338 and memory interface 339. The interfaces to external devices include circuits for routing data among circuits coupled with the TLN and external devices, such as high-capacity memory, host processors, other CGR processors, FPGA devices, and so on, that are coupled with the interfaces.
Each depicted CGR array has four AGCUs (e.g., MAGCU1, AGCU12, AGCU13, and AGCU14 in CGR array 310). The AGCUs interface the TLN to the ALNs and route data from the TLN to the ALN or vice versa.
One of the AGCUs in each CGR array in this example is configured to be a master AGCU (MAGCU), which includes an array configuration load/unload controller for the CGR array. The MAGCU1 includes a configuration load/unload controller for CGR array 310, and MAGCU2 includes a configuration load/unload controller for CGR array 320. Some implementations may include more than one array configuration load/unload controller. In other implementations, an array configuration load/unload controller may be implemented by logic distributed among more than one AGCU. In yet other implementations, a configuration load/unload controller can be designed for loading and unloading configuration of more than one CGR array. In further implementations, more than one configuration controller can be designed for configuration of a single CGR array. Also, the configuration load/unload controller can be implemented in other portions of the system, including as a stand-alone circuit on the TLN and the ALN or ALNs.
The TLN is constructed using top-level switches (switch 311, switch 312, switch 313, switch 314, switch 315, and switch 316) coupled with each other as well as with other circuits on the TLN, including the AGCUs, and external I/O interface 338. The TLN includes links (e.g., L11, L12, L21, L22) coupling the top-level switches. Data may travel in packets between the top-level switches on the links, and from the switches to the circuits on the network coupled with the switches. For example, switch 311 and switch 312 are coupled by link L11, switch 314 and switch 315 are coupled by link L12, switch 311 and switch 314 are coupled by link L13, and switch 312 and switch 313 are coupled by link L21. The links can include one or more buses and supporting control lines, including for example a chunk-wide bus (vector bus). For example, the top-level network can include data, request and response channels operable in coordination for transfer of data in any manner known in the art.
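As a rough illustration of packets travelling between top-level switches, the following Python sketch builds a graph from the four links named in the example above and finds a shortest switch-to-switch path with a breadth-first search. Treating each link as bidirectional, and modeling only those four links, are assumptions of this sketch; the helper names are hypothetical.

```python
from collections import deque

# Only the links spelled out in the text; other links may exist in the figure.
LINKS = {
    "L11": (311, 312),
    "L12": (314, 315),
    "L13": (311, 314),
    "L21": (312, 313),
}


def neighbors(links):
    """Build an adjacency map, assuming every link is bidirectional."""
    adj = {}
    for a, b in links.values():
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def shortest_path(links, src, dst):
    """Breadth-first search for a shortest path, or None if unreachable."""
    adj = neighbors(links)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(adj.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = shortest_path(LINKS, 313, 315)
```

With only the four named links, a packet from switch 313 to switch 315 would pass through switches 312, 311, and 314.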
A configuration file may include configuration data representing an initial configuration, or starting state, of each of the CGR units that execute a high-level program with user algorithms and functions. Program load is the process of setting up the configuration stores in the CGR array based on the configuration data to allow the CGR units to execute the high-level program. Program load may also require loading memory units and/or PMUs.
The ALN includes one or more kinds of physical data buses, for example a chunk-level vector bus (e.g., 512 bits of data), a word-level scalar bus (e.g., 32 bits of data), and a control bus. For instance, interconnects 421 between two switches may include a vector bus interconnect with a bus width of 512 bits, and a scalar bus interconnect with a bus width of 32 bits. A control bus can comprise a configurable interconnect that carries multiple control bits on signal routes designated by configuration bits in the CGR array's configuration file. The control bus can comprise physical lines separate from the data buses in some implementations. In other implementations, the control bus can be implemented using the same physical lines with a separate protocol or in a time-sharing procedure.
Physical data buses may differ in the granularity of data being transferred. In one implementation, a vector bus can carry a chunk that includes 16 channels of 32-bit floating-point data or 32 channels of 16-bit floating-point data (i.e., 512 bits of data) as its payload. A scalar bus can have a 32-bit payload and carry scalar operands or control information. The control bus can carry control handshakes such as tokens and other signals. The vector and scalar buses can be packet-switched, including headers that indicate a destination of each packet and other information such as sequence numbers that can be used to reassemble a file when the packets are received out of order. Each packet header can contain a destination identifier that identifies the geographical coordinates of the destination switch unit (e.g., the row and column in the array), and an interface identifier that identifies the interface on the destination switch (e.g., North, South, East, West, etc.) used to reach the destination unit.
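The packet fields and the payload arithmetic described above can be sketched in Python. The `Packet` field names, the string encoding of the interface identifier, and the helpers `lanes` and `reassemble` are assumptions for illustration; the disclosure does not specify a header layout.

```python
from dataclasses import dataclass

VECTOR_BUS_BITS = 512   # chunk size of the vector bus in this implementation


@dataclass(frozen=True)
class Packet:
    dest_row: int      # row of the destination switch unit
    dest_col: int      # column of the destination switch unit
    interface: str     # interface on the destination switch, e.g. "N", "S", "E", "W"
    seq: int           # sequence number used for reassembly
    payload: bytes


def lanes(bits_per_element):
    """Number of elements of the given width that fit in one vector chunk."""
    return VECTOR_BUS_BITS // bits_per_element


def reassemble(packets):
    """Restore the original byte order from packets received out of order."""
    return b"".join(p.payload for p in sorted(packets, key=lambda p: p.seq))


received = [
    Packet(2, 3, "N", 2, b"o!"),
    Packet(2, 3, "N", 0, b"he"),
    Packet(2, 3, "N", 1, b"ll"),
]
data = reassemble(received)
```

The helper `lanes` reproduces the two payload layouts given in the text: sixteen 32-bit channels or thirty-two 16-bit channels per 512-bit chunk.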
A CGR unit 401 may have four ports (as drawn) to interface with switch units 403, or any other number of ports suitable for an ALN. Each port may be suitable for receiving and transmitting data, or a port may be suitable for only receiving or only transmitting data.
A switch unit, as shown in the example of
During execution of a graph or subgraph in a CGR array after configuration, data can be sent via one or more switch units and one or more links between the switch units to the CGR units using the vector bus and vector interface(s) of the one or more switch units on the ALN. A CGR array may comprise at least a part of CGR array 400, and any number of other CGR arrays coupled with CGR array 400.
A data processing operation implemented by CGR array configuration may comprise multiple graphs or subgraphs specifying data processing operations that are distributed among and executed by corresponding CGR units (e.g., FCMUs, PMUs, PCUs, AGCUs).
Each vector input is buffered in this example using a vector FIFO in a vector FIFO block 460 which can include one or more vector FIFOs. Likewise in this example, each scalar input is buffered using a scalar FIFO 455. Using input FIFOs decouples timing between data producers and consumers and simplifies inter-configurable-unit control logic by making it robust to input delay mismatches.
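The decoupling effect of an input FIFO can be illustrated with a small Python model. The class `InputFifo`, its fixed depth, and the convention that a full FIFO rejects a push are assumptions of this sketch and are not details of the disclosed hardware.

```python
from collections import deque


class InputFifo:
    """Toy model of an input FIFO between a data producer and a consumer."""

    def __init__(self, depth):
        self.depth = depth
        self._items = deque()

    def push(self, item):
        """Producer side: returns False when the FIFO is full."""
        if len(self._items) >= self.depth:
            return False
        self._items.append(item)
        return True

    def pop(self):
        """Consumer side: returns None when no input has arrived yet."""
        return self._items.popleft() if self._items else None


fifo = InputFifo(depth=2)
accepted = [fifo.push(v) for v in ("v0", "v1", "v2")]   # producer sends a burst
drained = [fifo.pop(), fifo.pop(), fifo.pop()]          # consumer reads later
```

The producer and the consumer never have to act in the same step: the burst is absorbed up to the FIFO depth, and the consumer later drains the buffered inputs in arrival order.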
A configurable unit includes multiple reconfigurable datapaths in block 480. A datapath in a configurable unit can be organized as a multi-stage (Stage 1 . . . Stage N), reconfigurable SIMD (Single Instruction, Multiple Data) pipeline. The chunks of data pushed into the configuration serial chain in a configurable unit include configuration data for each stage of each datapath in the configurable unit. The configuration serial chain in the configuration data store 425 is connected to the multiple datapaths in block 480 via line 435.
A configurable datapath organized as a multi-stage pipeline can include multiple functional units (e.g., 481, 482, 483; 484, 485, 486) at respective stages. A special functional unit SFU (e.g., 483, 486) in a configurable datapath can include a configurable module 487 that comprises sigmoid circuits and other specialized computational circuits, the combinations of which can be optimized for particular implementations. In one embodiment, a special functional unit can be at the last stage of a multi-stage pipeline and can be configured to receive an input line X from a functional unit (e.g., 482, 486) at a previous stage in a multi-stage pipeline. In some embodiments, a configurable unit like a PCU can include many sigmoid circuits, or many special functional units which are configured for use in a particular graph using configuration data.
Configurable units in the array of configurable units include configuration data stores 425 (e.g., serial chains) to store unit files comprising a plurality of chunks (or sub-files of other sizes) of configuration data particular to the corresponding configurable units. Configurable units in the array of configurable units each include unit configuration load logic 440 connected to the configuration data store 425 via line 461, to execute a unit configuration load process. The unit configuration load process includes receiving, via the bus system (e.g., the vector inputs), chunks of a unit file particular to the configurable unit and loading the received chunks into the configuration data store 425 of the configurable unit. The unit file loaded into the configuration data store 425 can include configuration data, including opcodes and routing configuration, for circuits implementing a matrix multiply as described with reference to
The configuration data stores in configurable units in the plurality of configurable units in this example comprise serial chains of latches, where the latches store bits that control configuration of the resources in the configurable unit. A serial chain in a configuration data store can include a shift register chain for configuration data and a second shift register chain for state information and counter values connected in series.
Input configuration data 410 can be provided to a vector FIFO as vector inputs, and then be transferred to the configuration data store 425. Output configuration data 430 can be unloaded from the configuration data store 425 using the vector outputs.
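For illustration only, the load and unload behavior of a serial chain can be modeled in a few lines of software. The class `SerialChain`, its methods, and the bit ordering are assumptions made for this sketch; the actual latch-level implementation is not specified here.

```python
from collections import deque


class SerialChain:
    """Toy model of a configuration serial chain (a shift register)."""

    def __init__(self, length_bits: int) -> None:
        # A bounded deque drops the bit at the far end when a new bit is
        # pushed in at the near end, which mimics a shift operation.
        self.bits = deque([0] * length_bits, maxlen=length_bits)

    def shift_in(self, bit: int) -> int:
        """Shift one bit in and return the bit that falls off the far end."""
        shifted_out = self.bits[-1]
        self.bits.appendleft(bit)
        return shifted_out

    def load_chunk(self, chunk) -> list:
        """Shift a chunk in bit by bit; the returned bits are the unloaded data."""
        return [self.shift_in(bit) for bit in chunk]


# Usage: loading a new chunk simultaneously unloads the previous contents.
chain = SerialChain(4)
first_unload = chain.load_chunk([1, 0, 1, 1])   # chain was empty, so zeros come out
second_unload = chain.load_chunk([0, 0, 0, 0])  # previous chunk comes out in order
```

The sketch shows why a single chain can serve both the load path and the unload path: shifting new configuration data in pushes the stored data out.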
The CGRA uses a daisy-chained completion bus to indicate when a load/unload command has been completed. The master AGCU transmits the program load and unload commands to configurable units in the array of configurable units over a daisy-chained command bus. As shown in the example of
The bus interfaces can include scalar inputs, vector inputs, scalar outputs and vector outputs, usable to provide write data (WD). The data path can be organized as a multi-stage reconfigurable pipeline, including stages of functional units (FUs) and associated pipeline registers (PRs) that register inputs and outputs of the functional units. PMUs can be used to provide distributed on-chip memory throughout the array of reconfigurable units.
A scratchpad is built with multiple SRAM banks (e.g., 531, 532, 533, 534). Banking and buffering logic 535 for the SRAM banks in the scratchpad can be configured to operate in several banking modes to support various access patterns. A computation unit as described herein can include a lookup table stored in the scratchpad memory 530, from a configuration file or from other sources. In a computation unit as described herein, the reconfigurable scalar data path 520 can translate a section of a raw input value I for addressing lookup tables implementing a function f(I), into the addressing format utilized by the scratchpad memory 530, adding appropriate offsets and so on, to read the entries of the lookup table stored in the scratchpad memory 530 using the sections of the input value I. Each PMU can include write address calculation logic and read address calculation logic that provide write address WA, write enable WE, read address RA and read enable RE to the banking and buffering logic 535. Based on the state of the local FIFOs 511 and 519 and external control inputs, the control block 515 can be configured to trigger the write address computation, read address computation, or both, by enabling the appropriate counters 516. More specifically, the counters 516, which can be a programmable counter chain 516 (with Control Inputs and Control Outputs), and the control block 515 can trigger PMU execution.
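The address translation for a lookup table can be illustrated with a minimal sketch. It assumes that the section of the input value is its most significant bits and that the banks are interleaved on the low-order address bits; both are assumptions made for the example, as are the names `Scratchpad`, `lut_address`, and `load_table`. Only the use of four SRAM banks and of an added offset comes from the description.

```python
NUM_BANKS = 4  # the description names four SRAM banks (531, 532, 533, 534)


class Scratchpad:
    """Toy banked memory; low-order interleaving is an assumed banking mode."""

    def __init__(self, rows_per_bank: int) -> None:
        self.banks = [[0.0] * rows_per_bank for _ in range(NUM_BANKS)]

    @staticmethod
    def locate(addr: int) -> tuple:
        """Map a flat address to a (bank, row) pair."""
        return addr % NUM_BANKS, addr // NUM_BANKS

    def write(self, addr: int, value: float) -> None:
        bank, row = self.locate(addr)
        self.banks[bank][row] = value

    def read(self, addr: int) -> float:
        bank, row = self.locate(addr)
        return self.banks[bank][row]


def lut_address(raw_input: int, input_bits: int, index_bits: int,
                base_offset: int) -> int:
    """Use the top `index_bits` of the input as the table index, then add
    the offset at which the table starts in the scratchpad."""
    section = (raw_input >> (input_bits - index_bits)) & ((1 << index_bits) - 1)
    return base_offset + section


def load_table(pad: Scratchpad, base_offset: int, f, entries: int) -> None:
    """Store f(0) .. f(entries - 1) starting at base_offset."""
    for index in range(entries):
        pad.write(base_offset + index, f(index))
```

For example, with a 16-entry table of squares stored at offset 8, the 8-bit input 0xA3 selects index 0xA, so the read returns the entry for 10.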
Instrumentation logic 518 is included in this example of a configurable unit. The instrumentation logic 518 can be part of the control block 515 or implemented as a separate block on the device. The instrumentation logic 518 is coupled to the control inputs and to the control outputs. Also, the instrumentation logic 518 is coupled to the control block 515 and the counter chain 516, for exchanging status signals and control signals in support of a control barrier network configured as discussed above.
This is one simplified example of a configuration of a configurable processor for implementing a computation unit as described herein. The configurable processor can be configured in other ways to implement a computation unit. Other types of configurable processors can implement the computation unit in other ways. Also, the computation unit can be implemented using dedicated logic in some examples, or a combination of dedicated logic and instruction-controlled processors.
Each stage in PCU 560 may also hold one or more registers (not drawn) for short-term storage of parameters. Short-term storage, for example during one to several clock cycles or unit delays, allows for synchronization of data in the PCU pipeline.
Compiler stack 600 may take its input from application platform 610, or any other source of high-level program statements suitable for parallel processing, which provides a user interface for general users. It may further receive hardware description 615, for example defining the physical units in a reconfigurable data processor or CGRA processor. Application platform 610 may include libraries such as PyTorch, TensorFlow, ONNX, Caffe, and Keras to provide user-selected and configured algorithms. The example user program 700 depicted in
More specifically,
The kernel module 650 dynamically discovers reconfigurable processor devices in the pool of reconfigurable data flow resources 655 during module initialization and presents them as a single virtual device /dev/rdu (which may be a virtual reconfigurable dataflow unit) to the applications 102 running in the user space. As a result, each reconfigurable processor device acts as a core and each subarray of configurable units (e.g., tile) acts as a hardware thread, which can be dynamically allocated to a process by the resource manager 651 of the kernel module 650.
The kernel module 650 includes a resource manager 651, a scheduler 652, a device abstraction module 653, and a device driver 654. The resource manager 651 manages the host memory and the device memory (e.g., on-chip and off-chip memory of the reconfigurable processors) and provides efficient allocation/free functions for the applications 102 and binary data (e.g., bit files, data, arguments, segments, symbols, etc.) in the execution file 156. The scheduler 652 manages queuing and mapping of the configuration files for the applications 102 depending on the availability of the hardware resources.
The device abstraction module 653 scans all the reconfigurable processors in the pool of reconfigurable data flow resources 655 and presents them as a single virtual reconfigurable processor device to the user.
Application platform 610 outputs a high-level program to compiler 620 (which is an example of the compiler 160 shown in
Dataflow graph compiler 621 converts the high-level program with user algorithms and functions from application platform 610 to one or more dataflow graphs. The high-level program may be suitable for parallel processing, and therefore parts of the nodes of the dataflow graphs may be intrinsically parallel unless an edge in the graph indicates a dependency. Dataflow graph compiler 621 may provide code optimization steps like false data dependency elimination, dead-code elimination, and constant folding. The dataflow graphs encode the data and control dependencies of the high-level program. Dataflow graph compiler 621 may support programming a reconfigurable data processor at higher or lower-level programming languages, for example from an application platform 610 to C++ and assembly language. In some implementations, dataflow graph compiler 621 allows programmers to provide code that runs directly on the reconfigurable data processor. In other implementations, dataflow graph compiler 621 provides one or more libraries that include predefined functions like linear algebra operations, element-wise tensor operations, non-linearities, and reductions required for creating, executing, and profiling the dataflow graphs on the reconfigurable processors. Dataflow graph compiler 621 may provide an application programming interface (API) to enhance functionality available via the application platform 610.
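Two of the optimization steps named above, constant folding and dead-code elimination, can be illustrated on a very small dataflow graph. This is a generic sketch and is not the implementation of dataflow graph compiler 621; the graph encoding (a dictionary in topological order mapping a node name to an operation and its arguments) and the function names are hypothetical.

```python
def fold_constants(graph: dict) -> dict:
    """Replace add/mul nodes whose inputs are all constants with a constant.

    `graph` maps a node name to (op, args) and is assumed to be listed in
    topological order. "const" nodes carry their value as args, "input"
    nodes carry None, and "add"/"mul" nodes carry a tuple of input names.
    """
    folded = {}
    for name, (op, args) in graph.items():
        if op in ("add", "mul") and all(folded[a][0] == "const" for a in args):
            left, right = (folded[a][1] for a in args)
            folded[name] = ("const", left + right if op == "add" else left * right)
        else:
            folded[name] = (op, args)
    return folded


def eliminate_dead(graph: dict, outputs: list) -> dict:
    """Keep only the nodes that the requested outputs depend on."""
    live, pending = set(), list(outputs)
    while pending:
        name = pending.pop()
        if name in live:
            continue
        live.add(name)
        op, args = graph[name]
        if op in ("add", "mul"):
            pending.extend(args)  # follow the edges to the producer nodes
    return {name: node for name, node in graph.items() if name in live}


# Usage: (2 + 3) * x, plus a node whose result is never used.
example = {
    "a": ("const", 2),
    "b": ("const", 3),
    "x": ("input", None),
    "c": ("add", ("a", "b")),
    "d": ("mul", ("c", "x")),
    "unused": ("add", ("x", "x")),
}
optimized = eliminate_dead(fold_constants(example), ["d"])
```

After both passes only the input, the folded constant, and the multiplication remain, because edges in the graph are the only dependencies that have to be preserved.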
Algebraic graph compiler 622 may include a model analyzer and compiler (MAC) layer that makes high-level mapping decisions for (sub-graphs of the) dataflow graph based on hardware constraints. It may support various application frontends such as Samba, JAX, and TensorFlow/HLO. Algebraic graph compiler 622 may also transform the graphs by automatically generating gradient computing graphs, stitch sub-graphs together, support performance and latency estimation, convert dataflow graph operations to AIR operations, perform tiling, sharding (database partitioning), and other operations, and model the parallelism that can be achieved on the dataflow graphs.
Algebraic graph compiler 622 may further include an arithmetic or algebraic intermediate representation (AIR) level that translates high-level graph and mapping decisions provided by the MAC layer into explicit AIR graphs. Key responsibilities of the AIR level include legalizing the graph and mapping decisions of the MAC, expanding data-parallel, tiling, metapipe, and region instructions provided by the MAC, inserting stage buffers and skip buffers, eliminating redundant operations, buffers, and sections, and optimizing for resource use, latency, and throughput. The AIR layer constructs pipelines based on MAC mapping decisions by placing operations into a metapipe and inserting stage buffers between them. It may also insert AllReduce instructions for collecting results from parallelized operations. It may further optimize the graphs through redundant-operation and dead-code elimination, pipeline collapsing, and operation fusion.
Template graph compiler 623 may translate AIR statements and/or graphs into TLIR statements 900 (see
Template library 624 provides templates for commonly used operations, for example GEMM. Templates are implemented using assembly language. Templates are further compiled by an assembler that provides an architecture-independent low-level programming interface as well as optimization and code generation for the target hardware. Responsibilities of the assembler may include address expression compilation, intra-unit resource allocation and management, making a template graph physically realizable with target-specific rules, low-level architecture-specific transformations and optimizations, and architecture-specific code generation.
PNR 625 translates and maps logical (i.e., unplaced physically realizable) CGR units (e.g., the nodes of the logical computation graph 1100 shown in
Further implementations of compiler 620 provide for an iterative process, for example by feeding information from PNR 625 back to an earlier module, so that the earlier module can execute a new compilation step in which it uses physically realized results rather than estimates of or placeholders for physically realizable circuits. For example, PNR 625 may feed information regarding the physically realized circuits back to algebraic graph compiler 622.
Memory allocations represent the creation of logical memory spaces in on-chip and/or off-chip memories for data required to implement the dataflow graph, and these memory allocations are specified in the configuration file. Memory allocations define the type and the number of hardware circuits (functional units, storage, or connectivity components). Main memory (e.g., DRAM) may be off-chip memory, and scratchpad memory (e.g., SRAM) is on-chip memory inside an RDU. Other memory types for which the memory allocations can be made for various access patterns and layouts include cache, read-only look-up tables (LUTs), serial memories (e.g., FIFOs), and register files.
Compiler 620 binds memory allocations to unplaced memory units and binds operations specified by operation nodes in the dataflow graph to unplaced compute units, and these bindings may be specified in the configuration data. In some implementations, compiler 620 partitions parts of a dataflow graph into multiple subgraphs such as memory subgraphs or compute subgraphs and specifies these subgraphs in the PEF file file1 167. A memory subgraph may comprise address calculations leading up to a memory access. A compute subgraph may comprise all other operations in the parent graph. In one implementation, a parent graph is broken up into multiple memory subgraphs and exactly one compute subgraph. A single parent graph can produce one or more memory subgraphs, depending on how many memory accesses exist in the original loop body. In cases where the same memory addressing logic is shared across multiple memory accesses, address calculation may be duplicated to create multiple memory subgraphs from the same parent graph.
Compiler 620 generates the configuration files with configuration data (e.g., a bit stream) for the placed positions and the routed data and control networks. In one implementation, this includes assigning coordinates and communication resources of the physical CGR units by placing and routing unplaced units onto the array of CGR units while maximizing bandwidth and minimizing latency.
After software-stack compilation of dataflow graphs, each compute node in the graph is assigned a dedicated pipeline stage with a stage buffer before and after that graph node. A stage-buffer implementation can range from one to several PMUs and consumes variable on-chip SRAM resources. Compiler 620 may then estimate a latency for each stage in the pipeline and further determine the longest latency for each pipeline. As different nodes differ in compute complexity, some stages have lower latency than others. In general, a data graph sample that has completed computation at the current stage will wait in a stage buffer before the next stage until the latter computation is complete for another sample. This will be explained in greater detail with regard to
Furthermore, the configN is for setN/rowN which includes a PCUn 1344 (healthy), PMUn 1346 (healthy), and an AGCUn 1348 (healthy) among many other components which are not shown here.
As explained earlier, the default configuration file file0 165 is a result of the configuration file received by the compiler 620. One difference between in
In one example, in dynamic healing, the IRMF 1200 generates alternate healthy configurations for defective configurations or components during runtime 630. Furthermore, the runtime generates an updated configuration file file4 1465 or PEF file file5 1467. In this example, the generate operations are shown as generate0 1401 and generate1 1403. As can be seen, the default configuration 1441 is provided to the runtime 630 as is. The runtime 630 then starts configuring the components, and during this time the IRMF 1200 interacts with the runtime 630 to generate a similar healthy component/configuration for any defective component/configuration. For example, for the first defective config50, the IRMF 1200 generates a healthy config60 including a healthy PCUk 1454, a healthy PMUk 1456, and a healthy AGCUk 1458, via a first generate operation generate0 1401. Similarly, for the second defective config51, the IRMF 1200 generates a healthy config61 including a healthy PCUm 1464, a healthy PMUm 1466, and a healthy AGCUm 1468, via a second generate operation generate1 1403. In one example, the IRMF 1200 replaces the defective configurations config50 and config51 with the healthy configurations config60 and config61, respectively.
Since the configN has all healthy components (a healthy PCUn 1354, a healthy PMUn 1356, and a healthy AGCUn 1358) in the default configuration 1441, the runtime 630 retains the configN as it is. Although shown sequentially, the runtime 630 can program any or all of the configurations concurrently. Similarly, the IRMF 1200 can perform any or all replacement operations sequentially or concurrently. Furthermore, the healed configurations (config60 and config61), along with the default healthy configuration (configN), are provided in the updated configuration file file4 1465 or PEF file file5 1467, which is then provided to the CGR processor 110.
One difference to configure a set of components in the plurality of components using the configuration, between
In some examples, in static healing there can be multiple configuration files stacked together. The IRMF 1200 then chooses the correct one based on the component state (healthy or defective). The default configuration file is used if the health of the components allows it, and is not used otherwise. If the components in a tile are defective, then an alternate configuration file is selected by the IRMF 1200. In dynamic healing, there is only one configuration file, which is the default configuration file, and it is updated such that non-defective components are targeted for programming.
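The static selection described above can be sketched as a search over the stacked candidates. The sketch below is illustrative only: the `Config` type, the function names, and the component labels are hypothetical, and the policy of taking the first healthy candidate (with the default listed first) is an assumption rather than a stated requirement of the IRMF 1200.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class Config:
    name: str
    components: Tuple[str, ...]  # e.g. a PCU, a PMU, and an AGCU


def is_healthy(config: Config, defective: Set[str]) -> bool:
    """A configuration is healthy only if none of its components is defective."""
    return not any(component in defective for component in config.components)


def select_static(stacked: Sequence[Config], defective: Set[str]) -> Optional[Config]:
    """Return the first healthy candidate; the default is expected to come first.

    Returns None when no stacked candidate avoids the defective components.
    """
    for candidate in stacked:
        if is_healthy(candidate, defective):
            return candidate
    return None


# Usage: the default touches a defective PCU, so the alternate is chosen.
stack = [
    Config("default", ("PCU0", "PMU0", "AGCU0")),
    Config("alternate", ("PCU1", "PMU1", "AGCU0")),
]
chosen = select_static(stack, defective={"PCU0"})
```

Because all candidates exist before the runtime starts, the selection itself is a lookup, at the cost of carrying the alternates alongside the default.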
It should be noted that in any or all of the above Figures (
In one example, the goal of static or dynamic healing of components is to optimize the performance of the applications running on the CGR processor 110. In static healing, as the components are healed before being provided to the runtime, latency of the graph-related operations is optimized. On the other hand, in dynamic healing, since the components are healed during runtime, the space is optimized. Therefore, in order to decide whether to use static healing or dynamic healing, the IRMF 1200 uses performance metrics of the CGR processor 110. If, based on the current performance, the CGR processor 110 needs better operation latency, then the IRMF 1200 may use static healing to heal any defective components. If, based on the current performance, the CGR processor 110 needs more memory bandwidth, then the IRMF 1200 may use dynamic healing. The optimization objective 1204 in
As shown at step 1602, health data for a component in a tile of the CGR processor can be received from the manufacturer via an application platform. An example of this is shown in
At step 1604, a configuration file including default configurations and their equivalent healthy configurations for a plurality of components or a plurality of sets of components can be received from a compiler. This is shown in
At step 1606, defective default configurations can be replaced with alternate healthy configurations that are preemptively present in the default configuration file. An example of this is shown in
At step 1608, the updated configuration file can be provided to the runtime. An example of this is shown in
As shown at step 1702, health data for a component in a tile of the CGR processor can be received from the manufacturer via an application platform. An example of this is shown in
At step 1704, the IRMF can receive a configuration file from a compiler which includes default configurations. This is shown in
At steps 1706 and 1708, the IRMF can inspect, during runtime, the health of a single default configuration by checking whether any component in that configuration is affected by a defect. If so, the method can proceed to step 1710; if not, the method can proceed to step 1712. An example of this is shown in
At step 1710, upon identifying that a component is defective, the IRMF can generate an alternate healthy configuration using a set of healthy components. An example of this is shown in
At step 1712, upon identifying that the configuration is healthy, the default configuration or component can be retained. An example of this is shown in
At step 1714, the healthy configurations can be provided to an updated configuration file. An example of this is shown in
At step 1716, the method can check whether all the components/configurations have been inspected. If so, the method can proceed to step 1718. If not, the method can return to step 1706.
At step 1718, the updated configuration file can be provided to the CGR processor. An example of this is shown in
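The flow of steps 1706 through 1718 can be summarized in a short sketch. It is a simplified software model under stated assumptions: configurations are represented as tuples of component names, the pool of spare component sets and the function names are hypothetical, and a replacement keeps the name of the configuration it heals. How the IRMF actually generates an alternate configuration is not specified here.

```python
from typing import Dict, List, Set, Tuple

Components = Tuple[str, ...]


def is_defective(components: Components, defective: Set[str]) -> bool:
    """True if any component of the configuration is affected by a defect."""
    return any(component in defective for component in components)


def heal_dynamically(default_configs: Dict[str, Components],
                     defective: Set[str],
                     spare_sets: List[Components]) -> Dict[str, Components]:
    """Walk the default configurations and replace the defective ones."""
    # Only spare sets made entirely of healthy components are usable.
    spares = [s for s in spare_sets if not is_defective(s, defective)]
    updated: Dict[str, Components] = {}
    for name, components in default_configs.items():   # steps 1706 and 1716
        if is_defective(components, defective):         # step 1708
            if not spares:
                raise RuntimeError(f"no healthy spare set left for {name}")
            components = spares.pop(0)                   # step 1710
        updated[name] = components                       # steps 1712 and 1714
    return updated                                       # step 1718
```

A single pass over the default configurations is sufficient in this model, because each configuration is inspected exactly once before the updated file is produced.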
A first example of accelerated deep learning is using a deep learning accelerator implemented in a CGRA to train a neural network. A second example of accelerated deep learning is using the deep learning accelerator to operate a trained neural network to perform inferences. A third example of accelerated deep learning is using the deep learning accelerator to train a neural network and subsequently perform inference with any one or more of the trained neural network, information from the trained neural network, and a variant of the same.
Examples of various implementations are described in the following paragraphs:
Example 1. A data processing system comprising a coarse-grained reconfigurable (CGR) processor including an array of reconfigurable units configured to execute a dataflow graph, a compiler coupled to provide a configuration file including a configuration for a plurality of components in the plurality of reconfigurable units, an initialization block (control plane) coupled to receive health data of a plurality of components in the reconfigurable units and initialize an intelligent redundancy management framework (IRMF) with the health data of the plurality of components, a configuration block (data plane) coupled to configure a set of components in the plurality of components using the configuration, a runtime coupled to execute a plurality of operations using the set of components, wherein the IRMF is coupled to check health data of the set of components and identify the configuration as a defective configuration if a component in the set of components is defective, identify the configuration as a healthy configuration if all the components in the set of components are healthy, and wherein the IRMF is further coupled to perform a healing operation for the defective configuration by replacing the defective configuration with an alternate configuration using a different set of all healthy components.
Example 2. The system of example 1, wherein the healing operation can be a static healing operation or a dynamic healing operation.
Example 3. The system of example 2, wherein in the static healing operation, the configuration file includes a default configuration with healthy or defective components, and an alternate configuration similar to the default configuration, the alternate configuration including healthy components.
Example 4. The system of example 2, wherein in the static healing operation, the IRMF is coupled to receive the configuration file from the compiler and update the configuration file by performing the healing operation before the runtime executing an operation.
Example 5. The system of example 2, wherein in a dynamic healing operation, the IRMF performs the healing operation during runtime.
Example 6. The system of example 1, wherein the component is a pattern compute unit (PCU), pattern memory unit, address generation and coalescing unit, or a switch.
Example 7. The system of example 1, wherein the configuration block is coupled to perform an observability operation on a target component in response to a user input.
Example 8. The system of example 7, wherein the target component is preselected by the user.
Example 9. A method for a data processing system comprising a coarse-grained reconfigurable (CGR) processor including an array of reconfigurable units configured to execute a dataflow graph, receiving by a compiler, a configuration file including a configuration for a plurality of components in the plurality of reconfigurable units, receiving health data of a plurality of components, initializing an intelligent redundancy management framework (IRMF) with the health data of the plurality of components, configuring a set of components in the plurality of components using the configuration, checking health data of the set of components to identify each component as a defective component or a healthy component, retaining the configuration if all the components in the configuration are healthy, replacing the configuration with an alternate configuration by using a different set of all healthy components if a component in the set of components is defective, thereby updating the configuration file, and configuring during runtime, the set of components using the alternate configuration.
Example 10. The method of example 9, further comprising performing a static healing operation or a dynamic healing operation.
Example 11. The method of example 10, further comprising performing the static healing operation by receiving the configuration file including a default configuration with healthy or defective components, and an alternate configuration similar to the default configuration, the alternate configuration including healthy components.
Example 12. The method of example 10, further comprising performing the dynamic healing operation by receiving the configuration file from the runtime and updating the configuration file during the runtime.
Example 13. The method of example 10, wherein the component is a pattern compute unit (PCU), pattern memory unit, address generation and coalescing unit, or a switch.
Example 14. The method of example 9, further comprising an observability operation on a target component in response to a user input.
Example 15. The method of example 14, further comprising selecting via the user input the target component, checking the health data of the target component, and performing a healing operation for the target component upon identifying the target component to be defective during execution.
Example 16. A non-transitory machine-readable medium comprising computer instructions that, in response to being executed by a processor, cause the processor to: produce a configuration file to configure an array of reconfigurable units to execute a dataflow graph, the configuration file comprising a configuration for a plurality of components in the array of reconfigurable units, initialize an intelligent redundancy management framework (IRMF) to receive health data of the plurality of components, configure a set of components in the plurality of components using the configuration, check health data of the set of components to identify each component as a defective component or a healthy component, retain the configuration if all the components in the set of components are healthy, replace the defective configuration with an alternate configuration using a different set of all healthy components, and perform a healing operation for the defective component by replacing the defective configuration with the alternate configuration, thereby updating the configuration file, and configuring during runtime, the set of components using the alternate configuration.
Example 17. The non-transitory machine-readable medium of example 16, wherein the healing operation is a static healing operation or a dynamic healing operation.
Example 18. The non-transitory machine-readable medium of example 17, wherein the IRMF is coupled to perform the static healing operation by receiving the configuration file including a default configuration with healthy or defective components, and an alternate configuration similar to the default configuration, the alternate configuration including healthy components.
Example 19. The non-transitory machine-readable medium of example 18, wherein the IRMF is further coupled to perform the dynamic healing operation, by generating a new configuration using a different set of healthy components during the runtime.
Example 20. The non-transitory machine-readable medium of example 16, wherein the component is a pattern compute unit (PCU), pattern memory unit, address generation and coalescing unit, or a switch.
Examples of neural networks include fully connected neural networks (FCNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), convolutional neural networks (CNNs), graph convolutional networks (GCNs), long short-term memory (LSTM) networks, autoencoders, deep belief networks, and generative adversarial networks (GANs).
An example of training a neural network is determining one or more weights associated with the neural network, such as by back-propagation in a deep learning accelerator. An example of making an inference is using a trained neural network to compute results by processing input data using the weights associated with the trained neural network. As used herein, the term ‘weight’ is an example of a ‘parameter’ as used in various forms of neural network processing. For example, some neural network learning is directed to determining parameters (e.g., through back-propagation) that are usable for performing neural network inferences.
A neural network processes data according to a dataflow graph comprising layers of neurons. Example layers of neurons include input layers, hidden layers, and output layers. Stimuli (e.g., input data) are received by an input layer of neurons and the computed results of the dataflow graph (e.g., output data) are provided by an output layer of neurons. Example hidden layers include rectified linear unit (ReLU) layers, fully connected layers, recurrent layers, graphical network layers, long short-term memory layers, convolutional layers, kernel layers, dropout layers, and pooling layers. A neural network may be conditionally and/or selectively trained. After being trained, a neural network may be conditionally and/or selectively used for inference.
Examples of ICs, or parts of ICs, which may be used as deep learning accelerators, are processors such as central processing units (CPUs), CGR processor ICs, graphics processing units (GPUs), FPGAs, ASICs, application-specific instruction-set processors (ASIPs), and digital signal processors (DSPs). The disclosed technology implements efficient distributed computing by allowing an array of accelerators (e.g., reconfigurable processors) attached to separate hosts to directly communicate with each other via buffers.
In one embodiment, each of the AGCUs may be allocated a specific bandwidth to access the TLN. This is similar to VAGs participating in and winning arbitration to get access to the TLN. For example, the RDU 110 may include one or more AGCU arbiters to arbitrate among the AGCUs 202 to 232 to gain access to the TLN agents 244 to 266. The arbiter may be implemented in hardware or software or both.
In one example, a software-implemented arbiter may keep a table of AGCUs and their need to access the external memory devices or host. Those AGCUs which have a higher bandwidth demand to access the external memory devices or host may be assigned a higher priority than those which have a lower demand. The higher priority AGCUs may be selected to access the TLN. In other words, the higher priority AGCUs may get more bandwidth on the TLN than the lower priority ones.
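A software-implemented arbiter of this kind can be sketched as a table lookup followed by a sort. The sketch is illustrative only: the function names, the demand units, the tie-breaking rule, and the notion of a fixed number of grants per arbitration round are assumptions and are not taken from the description.

```python
from typing import Dict, List


def prioritize(demand_table: Dict[str, int]) -> List[str]:
    """Order AGCUs from highest to lowest bandwidth demand.

    Ties are broken by name so that the result is deterministic.
    """
    return sorted(demand_table, key=lambda agcu: (-demand_table[agcu], agcu))


def grant(demand_table: Dict[str, int], grants_per_round: int) -> List[str]:
    """Select the AGCUs that may access the TLN in this arbitration round."""
    return prioritize(demand_table)[:grants_per_round]


# Usage: demand is expressed in arbitrary units chosen for the example.
table = {"AGCU0": 10, "AGCU1": 40, "AGCU2": 25, "AGCU3": 40}
winners = grant(table, grants_per_round=2)
```

Re-evaluating the table every round lets the priorities follow changes in demand; a practical arbiter would likely also need a fairness mechanism so that low-demand AGCUs are not starved.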
The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the implementations described herein.
Although the description has been described with respect to particular implementations thereof, these particular implementations are merely illustrative, and not restrictive. The description may reference specific structural implementations and methods and does not intend to limit the technology to the specifically disclosed implementations and methods. The technology may be practiced using other features, elements, methods and implementations. Implementations are described to illustrate the present technology, not to limit its scope, which is defined by the claims. Those of ordinary skill in the art recognize a variety of equivalent variations in the description above.
All features disclosed in the specification, including the claims, abstract, and drawings, and all the steps in any method or process disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. Each feature disclosed in the specification, including the claims, abstract, and drawings, can be replaced by alternative features serving the same, equivalent, or similar purpose, unless expressly stated otherwise.
Although the description has been described with respect to particular implementations thereof, these particular implementations are merely illustrative, and not restrictive. For instance, many of the operations can be implemented in a CGRA system, a System-on-Chip (SoC), application-specific integrated circuit (ASIC), programmable processor, in a programmable logic device such as a field-programmable gate array (FPGA) or a graphics processing unit (GPU), obviating a need for at least part of the dedicated hardware. Implementations may be as a single chip, or as a multi-chip module (MCM) packaging multiple semiconductor dies in a single package. All such variations and modifications are to be considered within the ambit of the present disclosed technology, the nature of which is to be determined from the foregoing description.
One or more implementations of the technology or elements thereof can be implemented in the form of a computer product, including a non-transitory computer-readable storage medium with computer usable program code for performing any indicated method steps and/or any configuration file for one or more RDUs to execute a high-level program. Furthermore, one or more implementations of the technology or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps, and/or an RDU that is operative to execute a high-level program based on a configuration file. Yet further, in another aspect, one or more implementations of the technology or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein and/or executing a high-level program described herein. Such means can include (i) hardware module(s); (ii) software module(s) executing on one or more hardware processors; (iii) bit files for configuration of a CGR array; or (iv) a combination of aforementioned items.
Thus, while particular implementations have been described herein, latitudes of modification, various changes, and substitutions are intended in the foregoing disclosures, and it will be appreciated that in some instances some features of particular implementations will be employed without a corresponding use of other features without departing from the scope and spirit as set forth. Therefore, many modifications may be made to adapt a particular situation or material to the essential scope and spirit of the technology disclosed.
This application claims the benefit of U.S. Provisional Patent Application No. 63/616,952, entitled “Intelligent Redundancy Management Framework For A Reconfigurable Data Processor,” filed on Jan. 2, 2024. The provisional application is hereby incorporated by reference for all purposes.
| Number | Date | Country |
|---|---|---|
| 63616952 | Jan 2024 | US |