The present invention relates to field-reconfigurable neural networks and machine intelligence systems, and more specifically to an expandable, field-reconfigurable hardware system that can be configured to match the structure and processing flow of neural networks and logic reasoning.
There are two phases in using a neural network for machine learning: training and inference. In the training phase, a computing engine needs to process a large number of training examples quickly, and thus needs both fast processing and fast I/O. In the inference phase, a computing engine needs to receive input data and produce the inference results, in many applications in real time. In both phases, the computing engine needs to be configured to implement the neural network architecture best suited to a learning task. For example, human face recognition, speech recognition, handwriting recognition, playing a game and controlling a drone may each require a different neural network architecture, or structure and processing flow, e.g., a different number of layers, number of nodes at each layer, interconnection among layers, types of processing performed at each layer, etc. Prior art computing engines for neural networks using GPUs, FPGAs or ASICs lack the high processing power, expandability, flexibility of interconnection of sub-engines and real-time configurability offered by this invention.
“A Cloud-Scale Acceleration Architecture” by A. M. Caulfield et al. of Microsoft Corporation, published at the 49th Annual IEEE/ACM International Symposium on Microarchitecture in 2016, describes an architecture that places a layer of FPGAs between the servers' Network Interface Cards (NICs) and the Ethernet network switches. It connects a single FPGA to a server CPU and connects many FPGAs through up to three layers of Ethernet switches. Its main advantages are in offering general purpose cloud computing services, allowing the FPGAs to transform network flows at line rate and to accelerate local applications running on each server. It allows a large number of FPGAs to communicate; however, the connections between FPGAs need to go through one or more levels of Ethernet switches and require a network operating system to coordinate the CPUs connected to each of the FPGAs. For a single CPU, the co-processing power is limited by the size and processing speed of the single FPGA attached to the CPU. The overall performance and FPGA-to-FPGA communication latency will depend heavily on the efficiency and the uncertainty of the multiple levels of Ethernet switches, which carry competing data traffic in a data center, and on the network operating system's efficiency in managing, requesting, releasing and acquiring FPGA resources at a large number of other CPUs or servers. There is also prior art that attaches multiple FPGAs or GPUs to a server through a CPU or peripheral bus, e.g., the PCIe bus. There are no direct connections between the FPGAs or GPUs of one server and those of another server. A large neural network requiring a large number of FPGAs will therefore involve multiple servers and incur their upper-layer software overhead and latency.
This invention offers significant advantages in terms of reconfiguring and mapping configurable hardware to optimally match the structure and processing flow of a wide range of neural networks, in addition to overcoming the shortcomings in the prior art identified above.
Reference may now be made to the drawings wherein like numerals refer to like parts throughout. Exemplary embodiments of the invention are provided to illustrate aspects of the invention and should not be construed as limiting the scope of the invention. When the exemplary embodiments are described with reference to block diagrams or flowcharts, each block represents both a method step and an apparatus element for performing the method step. Depending upon the implementation, the corresponding apparatus element may be configured in hardware, software, firmware or combinations thereof. In this invention, the terms neural network and learning network, used interchangeably, mean an information processing structure that can be characterized by a graph of layers or clusters of processing nodes and interconnections among the processing nodes, which includes but is not limited to feedforward neural networks, deep learning networks, convolutional neural networks, recurrent neural networks, self-organizing neural networks, long short-term memory networks, gated recurrent unit networks, reinforcement learning networks, unsupervised learning networks, etc., or a combination thereof. For example, a learning network may consist of one or more recurrent networks and one or more feedforward networks interconnected together. The terms data, information and signal may be used interchangeably, each of which may mean a bit stream, signal pattern, waveform, binary data, etc., which may be interpreted as weights, biases or timing parameters of a learning network, a command for a processing module, or input to or output from a node, layer, cluster, processing stage, etc. of a learning network.
This invention includes embodiments of a method for implementing learning networks and the system or apparatus of a field-reconfigurable learning network 100, as shown in
Hereafter, the term Field-Reconfigurable Processing and Interconnection Module 2, or FR-PIM 2, will be used to indicate either the collection of field-reconfigurable connection circuits alone or the collection of field-reconfigurable connection circuits together with some field-reconfigurable logic or computation circuits connected to the collection of field-reconfigurable connection circuits. For example, when the FR-PIM 2 is implemented using FPGA-type circuits, it will include both field-reconfigurable connection circuits and field-reconfigurable logic or computation circuits. The connections established by the FR-PIM 2 include the inter-part connections among the partitioned parts of the N layers, clusters or stages of the selected learning network distributed over the two or more processing modules, for direct communication among the multiple parts through the so-configured reconfigurable connection circuits. A first set of one or more high speed connections 3 between the processing modules 1 and the FR-PIM 2 is used to send and receive signals between a source and a destination. A second set of one or more high speed connections 5 connects the FR-PIM 2, or the two or more processing modules 1 via the FR-PIM 2, with one or more host servers, to which the field-reconfigurable learning network 100 provides the function of a reconfigurable machine learning co-processor. A processing module can be either a source or a destination of a signal for a connection in the first set or second set of high speed connections with the FR-PIM 2.
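For illustration only, the following Python sketch models the topology just described: processing modules 1 attached to the FR-PIM 2 through the first set of connections 3, and host servers attached through the second set of connections 5. All class and endpoint names (ProcessingModule, FRPIM, Link, "host-server-0") are hypothetical software stand-ins for the hardware elements, not part of the invention.

```python
# Illustrative software model of the field-reconfigurable learning network 100.
# All names are hypothetical; the invention is hardware, and this sketch only
# mirrors its structure for clarity.
from dataclasses import dataclass, field

@dataclass
class Link:
    src: str          # source endpoint, e.g., "PM0" or "FR-PIM"
    dst: str          # destination endpoint

@dataclass
class ProcessingModule:
    name: str                                     # e.g., "PM0"
    layers: list = field(default_factory=list)    # parts of layers/clusters/stages hosted here

@dataclass
class FRPIM:
    first_set: list = field(default_factory=list)   # connections 3: PMs <-> FR-PIM
    second_set: list = field(default_factory=list)  # connections 5: FR-PIM <-> host servers

    def attach_module(self, pm: ProcessingModule):
        # each PM gets a bidirectional high speed connection in the first set (3)
        self.first_set.append(Link(pm.name, "FR-PIM"))
        self.first_set.append(Link("FR-PIM", pm.name))

    def attach_host(self, host: str):
        # host servers connect through the second set of connections (5)
        self.second_set.append(Link("FR-PIM", host))
        self.second_set.append(Link(host, "FR-PIM"))

pms = [ProcessingModule(f"PM{i}") for i in range(4)]
hub = FRPIM()
for pm in pms:
    hub.attach_module(pm)
hub.attach_host("host-server-0")
print(len(hub.first_set), len(hub.second_set))  # 8 2
```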
Each processing module 1 contains a collection of field-reconfigurable circuits which can be field-reconfigured to perform a wide range of logic processing and computation and to connect a collection of inputs to a collection of outputs, with or without logic or computations inserted in between. One method to achieve this is to include one or more field-reconfigurable very large scale integrated circuits, e.g., FPGA chips or field-reconfigurable parallel processing hardware, in a processing module. The collection of field-reconfigurable circuits makes up the hardware of the field-reconfigurable learning network 100, which can be configured by software, prior to or at the time of use, to implement the selected learning network. The FR-PIM 2 may also be implemented using an FPGA and can also include a memory module, e.g., block RAM or DRAM, that holds the parameters, settings, or data for some or all of the processing modules.
One embodiment configures the reconfigurable circuits in the FR-PIM 2 to interconnect each part of the N layers, clusters or stages that are partitioned into the two or more processing modules 1, such that the circuits of a first subset of the one or more processing modules, configured to perform the computations of a kth layer, cluster or stage, receive input information provided by the circuits of a second subset of the one or more processing modules, configured to perform the computations of an mth layer, cluster or stage, and send output information to the circuits of a third subset of the one or more processing modules, configured to perform the computations of an nth layer, cluster or stage, which uses the received information as its input information, wherein 1≤k,m,n≤N. The circuits of the subset of the one or more processing modules configured for k=1 receive input data from an input data source, internal state or a memory, and the circuits of the subset of the one or more processing modules configured for k=N produce an output of the selected learning network, or send output information to the circuits of a subset of the one or more processing modules configured to perform the computations of a jth layer, cluster or stage, wherein 1≤j<N.
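The wiring rule of this embodiment can be summarized in a short sketch. The function below, with invented names (wire, placement), records the two directed hops for a layer k: from the subset computing layer m into the subset computing layer k, and from that subset into the subset computing layer n, with the bounds 1≤k,m,n≤N checked explicitly.

```python
# Illustrative inter-layer wiring rule; all names are hypothetical.
def wire(N, placement, k, m, n):
    """placement: {layer index: set of module names computing that layer}.
    Returns the two directed hops established through the FR-PIM 2."""
    assert 1 <= k <= N and 1 <= m <= N and 1 <= n <= N
    return [
        (placement[m], placement[k]),   # second subset (layer m) -> first subset (layer k)
        (placement[k], placement[n]),   # first subset (layer k) -> third subset (layer n)
    ]

placement = {1: {"PM0"}, 2: {"PM1"}, 3: {"PM1", "PM2"}, 4: {"PM3"}}
print(wire(N=4, placement=placement, k=3, m=2, n=4))
```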
The two or more processing modules 1 and their field-reconfigurable very large scale integrated circuits, e.g., FPGA chips, are reconfigured by software, at the time of or prior to use, to fit the architecture or processing flow of a selected neural network. The neural network may be a deep learning network, a recurrent network or one of the other networks listed at the beginning of this section. The neural network is organized into N layers, clusters or stages, which are partitioned into a number of blocks with one or more blocks implemented in each of the two or more processing modules. Each processing module 1 implements a part of the selected learning network, and the processing modules 1 collectively implement the complete learning network. For some selected learning networks, the embodiment may partition the layers, clusters or stages such that multiple layers are implemented using the same processing module or the same subset of processing modules. On the other hand, neurons in the same layer or cluster may be duplicated in multiple processing modules, with each processing module performing the computation of the same layer or cluster of neurons but at different processing stages or states, e.g., in a pipeline configuration. These processing modules need to be connected, via the FR-PIM 2, to complete the function of the single layer or cluster. A recurrent network may be partitioned into multiple processing modules with each processing module implementing one or more layers or clusters of the recurrent network. The inter-layer or inter-cluster connections among the multiple processing modules will be provided by the FR-PIM 2.
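As a purely illustrative example of such a partition, the following sketch assigns N layers to processing modules in consecutive blocks; the function name and the block policy are assumptions chosen only to make the idea concrete.

```python
# Hypothetical partitioning helper: assign N layers to processing modules.
def partition_layers(n_layers, n_modules):
    """Block partition: consecutive layers grouped per module.
    Returns {module index: [layer indices]}. Purely illustrative."""
    assignment = {m: [] for m in range(n_modules)}
    for layer in range(1, n_layers + 1):
        assignment[(layer - 1) * n_modules // n_layers].append(layer)
    return assignment

# Example: 8 layers over 3 modules. Some modules host multiple layers,
# matching the case where multiple layers share a subset of modules.
print(partition_layers(8, 3))  # {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8]}
```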
One example is to use a subset of processing modules for each of the N layers of a deep learning network as shown in
The FR-PIM 2 is reconfigured by software, at the time of or prior to use, to provide the interconnections among the parts of the N layers, clusters or stages of the selected learning network that are partitioned into the two or more processing modules, or subsets of the processing modules. A FR-PIM 2 implemented using an FPGA chip with a sufficient number of high speed I/O ports uses its reconfigurable circuits to establish interconnections among the N layers, clusters or stages, or parts thereof, to connect one or more ingress high speed connections to one or more egress high speed connections through the established interconnections, enabling the source of an ingress high speed connection to send data directly to the destination of an egress high speed connection. This can be achieved as a direct circuit connection without the need of using the destination's address or ID. The reconfigurable circuits in the FR-PIM are configured to interconnect each part of the N layers, clusters or stages partitioned into the two or more processing modules such that the circuits of a first subset of the one or more processing modules, e.g., in layer 103, configured to perform the computations of a kth layer, cluster or stage, e.g., layer 3 in 103, receive input information provided by the circuits of a second subset of the one or more processing modules, e.g., in layer 102, configured to perform the computations of an mth layer, cluster or stage, e.g., layer 2 in 102, and send output information to the circuits of a third subset of the one or more processing modules, e.g., in layer 104, configured to perform the computations of an nth layer, cluster or stage, e.g., layer 4 in 104, which uses the received information as its input information, wherein 1≤k,m,n≤N. The circuits of the subset of the one or more processing modules configured for k=1, e.g., in layer 101, receive input data via connections 8 from an input data source, internal state or a memory, and the circuits of the subset of the one or more processing modules configured for k=N produce an output of the selected learning network, or send output information to the circuits of a subset of the one or more processing modules configured to perform the computations of another intermediate or hidden layer, cluster or stage.
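Because the FR-PIM's connections are fixed by configuration, the routing can be pictured as a static crossbar map, as in the hypothetical sketch below; the port names are invented, and no destination address or ID travels with the data.

```python
# Hypothetical FR-PIM routing configuration: a static map from ingress port
# to egress port(s). Because the mapping is fixed by configuration, a source
# sends data directly, with no destination address or ID inside the data.
crossbar = {
    # ingress (module computing layer m) -> egress (modules computing layer m+1)
    "PM_layer2_out": ["PM_layer3_in"],
    "PM_layer3_out": ["PM_layer4_in"],
}

def forward(ingress_port, payload):
    """Deliver payload along every configured egress; circuit-style fan-out."""
    return [(egress, payload) for egress in crossbar.get(ingress_port, [])]

print(forward("PM_layer2_out", b"\x01\x02"))  # [('PM_layer3_in', b'\x01\x02')]
```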
A selected learning network may have recurrent connections wherein the kth layer receives input from and sends output to the same layer, i.e., n=m. In a learning network with sequentially ordered layers, clusters or stages from 1 to N, the field-reconfigurable learning network may be configured to have m<k<n or k≥n in one or more configurations.
In the implementation of some learning networks, the FR-PIM is configured to insert a reconfigurable computation circuit along the connection path from one or more ingress high speed connections to one or more egress high speed connections, wherein said reconfigurable computation circuit processes the data as it passes through the connection path. The reconfigurable computation circuits in the FR-PIM can also be configured to function as an additional processing module of the field-reconfigurable learning network, used to implement part or all of one or more layers, clusters or stages of a selected learning network.
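A minimal sketch of such an in-path computation follows; the ReLU transform is an arbitrary illustration of a circuit inserted between an ingress and an egress connection, not a function specified by this embodiment.

```python
# Sketch of an in-path computation circuit: a transform inserted between an
# ingress and an egress connection, applied as data passes through.
def make_inline_path(transform):
    def path(payload):
        return transform(payload)   # data is processed in flight
    return path

# The ReLU choice is illustrative only.
relu_path = make_inline_path(lambda xs: [max(0.0, x) for x in xs])
print(relu_path([-1.5, 0.2, 3.0]))  # [0.0, 0.2, 3.0]
```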
In some learning networks, effective or efficient learning may require processing nodes that operate on concurrent or time-sensitive outputs, states, parameters, processing and/or configurations of multiple layers, clusters or stages. To implement such learning networks, one embodiment configures multiple processing modules to send data to the FR-PIM in parallel, and configures the reconfigurable computation circuits in the FR-PIM to receive the data and perform computation that requires time-sensitive inputs from one or more layers, clusters or stages that are distributed across the multiple processing modules. Another embodiment configures the reconfigurable computation circuits in the FR-PIM to perform computation on received data and/or data in memory and to transmit the resulting data in parallel to one or more layers, clusters or stages that are distributed across multiple processing modules. The multiple processing modules are configured accordingly to receive the data from the FR-PIM and perform processing in parallel. In yet another embodiment, the reconfigurable circuits in the FR-PIM are configured to receive signals from two or more processing modules in parallel, process the received signals to derive centralized control and/or coordination signals 9, and transmit the centralized control and/or coordination signals 9 to two or more processing modules. The two or more processing modules are configured to receive the centralized control and/or coordination signals 9 from the FR-PIM and modify their states, parameters, processing and/or configurations.
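The following sketch illustrates, under invented names, the last of these embodiments: the FR-PIM gathers one state value from each processing module, derives centralized coordination signals 9, and broadcasts them back. The max-based barrier rule is only an example of a derivable control signal.

```python
# Sketch of centralized control/coordination signals 9, with invented names.
def coordinate(module_states):
    # e.g., hold all modules until the slowest pipeline stage has caught up
    target_stage = max(module_states.values())
    return {pm: {"advance_to": target_stage} for pm in module_states}

signals_9 = coordinate({"PM0": 4, "PM1": 3, "PM2": 4})
print(signals_9)  # every module is told to advance to stage 4
```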
The FR-PIM may be equipped with a memory module which stores data shared by multiple processing modules. The reconfigurable circuits in the FR-PIM are configured to retrieve data from the memory and transmit the data to two or more processing modules which require the data for their function. The two or more processing modules are configured accordingly to receive the data from the FR-PIM and to use the data in their processing or to modify their states, parameters, processing and/or configurations.
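A hypothetical sketch of this shared memory module follows; the key and module names are illustrative, and the weight values are invented.

```python
# Sketch of the FR-PIM memory module holding data shared by several modules,
# e.g., a parameter block broadcast to every module that needs it.
shared_memory = {"conv1.weights": [0.01, -0.02, 0.03]}  # invented contents

def broadcast(key, subscribers):
    data = shared_memory[key]
    return {pm: data for pm in subscribers}

print(broadcast("conv1.weights", ["PM0", "PM2"]))
```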
One embodiment implements multiple selected learning networks in a field-reconfigurable learning network, as shown in
Another embodiment is cooperative learning networks and multi-level learning, in which the two or more processing modules are configured to implement two or more learning networks, e.g., a first set of one or more processing modules 131 implementing a first learning network and a second set of one or more processing modules 132 implementing a second learning network as shown in
There is a need to get a large amount of data in and out of the field-reconfigurable learning network, for training a learning network with a large number of examples in the training phase, and for running real-time data to get real-time results in the inference phase. In one embodiment, a processing module further comprises one or more high speed interconnects or I/O ports 4. These I/O ports 4 can be used to connect the field-reconfigurable learning network to an external system or to a computer network, e.g., a cloud data center network or the Internet, for entering input data into or providing output data from the learning network. In another embodiment, the I/O ports 4 can also be used to connect with the I/O ports 4 of one or more other field-reconfigurable learning networks 100 to produce a larger field-reconfigurable learning network 200, as shown in
Another embodiment of scaling into a larger field-reconfigurable learning network connects the FR-PIMs 2 of multiple field-reconfigurable learning networks 100 using a third set of one or more high speed connections 12, and configures the multiple FR-PIMs and the processing modules of the so-interconnected multiple field-reconfigurable learning networks to work as a single larger field-reconfigurable learning network 200, as shown in
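As a rough software picture of this scaling embodiment, the sketch below builds a full mesh of third-set connections 12 between the FR-PIMs of several systems 100; the full-mesh policy and the system names are assumptions, as the embodiment does not prescribe a particular topology.

```python
# Sketch of scaling: connect the FR-PIMs of several systems 100 with a third
# set of connections 12 so they behave as one larger system 200.
def interconnect(systems):
    third_set = []
    for a in systems:
        for b in systems:
            if a < b:                      # one link per FR-PIM pair (full mesh)
                third_set.append((f"{a}.FR-PIM", f"{b}.FR-PIM"))
    return third_set

print(interconnect(["sys100_A", "sys100_B", "sys100_C"]))
```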
Neural network learning is only one aspect of machine intelligence. Logic reasoning coupled with neural networks provides a more powerful general machine intelligence computing engine. GPUs and CPUs are less efficient at implementing logic than FPGA-type circuits, whose logic circuits can be reconfigured to efficiently compute sequential and combinatorial logic as the signals pass through the circuits. It is difficult for a special purpose neural network ASIC, or an ASIC with fixed logic, to implement logic reasoning other than what is pre-designed into the fixed logic circuits. FPGA-type circuits with reconfigurable logic are designed for, and well suited to, implementing a wide range of logic through configuration by software, and can implement logic reasoning more efficiently, and complete it faster, than GPUs, CPUs and ASICs. One embodiment is a field-reconfigurable machine intelligence method or system comprising two or more processing modules 1 which include FPGA-type circuits with reconfigurable logic, computation and connection circuits; a collection of field-reconfigurable connection circuits, e.g., those in the FR-PIM 2; and a first set of one or more high speed connections 3 between the processing modules and the collection of field-reconfigurable connection circuits, e.g., in the FR-PIM 2. The reconfigurable logic, computation and connection circuits in the two or more processing modules 1 are configured to implement one or more selected learning networks, which are partitioned into multiple parts with each part implemented in a subset of the processing modules 1. The collection of field-reconfigurable connection circuits, e.g., in the FR-PIM 2, is reconfigured to interconnect the partitioned parts of the one or more selected learning networks. While some of the reconfigurable logic, computation and connection circuits in the processing modules 1 and/or the FR-PIM 2 are configured to implement the one or more selected learning networks, others are configured to perform logic reasoning and to combine results from the one or more selected neural networks and the logic reasoning to produce the result of the machine intelligence system.
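For illustration, the sketch below combines a stand-in learning network output with a simple combinational rule, in the spirit of this embodiment; the detection scores, threshold values and alarm rule are all invented.

```python
# Sketch of combining a learning network's output with reconfigured logic
# reasoning to produce the machine intelligence system's result.
def neural_scores(frame):
    # stand-in for the learning network's output layer; values are invented
    return {"person_detected": 0.93, "moving": 0.10}

def logic_reasoning(scores):
    # combinational rule evaluated as signals pass through logic circuits:
    # alarm only if a person is detected with high confidence AND is not moving
    return scores["person_detected"] > 0.9 and scores["moving"] < 0.2

def machine_intelligence_output(frame):
    scores = neural_scores(frame)
    return {"scores": scores, "alarm": logic_reasoning(scores)}

print(machine_intelligence_output(frame=None))
```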
The collection of the field-reconfigurable circuits of the system, e.g., in the FR-PIM 2, is configured to establish connections between the signals of the implemented one or more selected learning networks and the signals of the implemented logic reasoning circuits. These connections can be from the output layer, cluster or stage of a selected learning network, or an intermediate layer, cluster or stage of a selected learning network, to the input of a field-reconfigured logic reasoning circuit, or from the output of a field-reconfigured logic reasoning circuit to the output or an intermediate layer, cluster or stage of a selected learning network. They can also connect the output layer, cluster or stage of a selected learning network, or an intermediate layer, cluster or stage of a selected learning network, and the output of one or more field-reconfigured logic reasoning circuits to the input of one or more selected learning networks and/or the input of one or more field-reconfigured logic reasoning circuits. These connections can all be established using the collection of field-reconfigurable connection circuits, e.g., in the FR-PIM 2, and using the signals 9. The outcome is that the field-reconfigurable machine intelligence system combines the signals from the implemented one or more selected learning networks and the output from the one or more implemented logic reasoning circuits to produce one or more outputs of the system.
In
The field-reconfigurable machine intelligence system can be connected to one or more host servers, and/or to a computer network, e.g., a local area network or the Internet, to provide a web service or cloud service. Multiple field-reconfigurable machine intelligence systems can be connected together to produce a larger field-reconfigurable machine intelligence system, e.g., by connecting the processing modules of the multiple systems or by connecting a central connection hub in each of the field-reconfigurable machine intelligence systems. Some or all of the multiple field-reconfigurable machine intelligence systems can each be connected to a computer network to provide access to, or the service of, the larger field-reconfigurable machine intelligence system over the computer network, e.g., as a web service or cloud service.
Multiple field-reconfigurable machine intelligence systems 100 can be connected together to produce a larger field-reconfigurable machine intelligence system 200 as shown in
Although the foregoing descriptions of the preferred embodiments of the present inventions have shown, described, or illustrated the fundamental novel features or principles of the inventions, it is understood that various omissions, substitutions, and changes in the form and details of the methods, elements or apparatuses as illustrated, as well as the uses thereof, may be made by those skilled in the art without departing from the spirit of the present inventions. Hence, the scope of the present inventions should not be limited to the foregoing descriptions. Rather, the principles of the inventions may be applied to a wide range of methods, systems, and apparatuses, to achieve the advantages described herein and to achieve other advantages or to satisfy other objectives as well.