The present disclosure relates to computing hardware. More particularly, the present disclosure relates to techniques for training and using neural networks to perform inference.
A neural network is a machine learning model used for a variety of different applications (e.g., image classification, computer vision, natural language processing, speech recognition, writing recognition, etc.). A neural network may be trained for a set of purposes by running datasets through it, comparing results from the neural network to known results, and updating the network parameters based on the differences.
Various embodiments of the present disclosure are illustrated by way of example and not limitation in the figures of the accompanying drawings.
In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present disclosure. Such examples and details are not to be construed as unduly limiting the elements of the claims or the claimed subject matter as a whole. It will be evident to one skilled in the art, based on the language of the different claims, that the claimed subject matter may include some or all of the features in these examples, alone or in combination, and may further include modifications and equivalents of the features and techniques described herein.
Described here are techniques for determining schedules for processing neural networks on hardware. In some embodiments, a system includes a processor, several hardware units, and memory. Each hardware unit is configured to perform a particular set of operations. For example, a first hardware unit may be configured to read data from the memory, a second hardware unit may be configured to write data to the memory, a third hardware unit may be configured to perform matrix multiplication operations, a fourth hardware unit may be configured to perform activation functions, etc. The processor receives and executes a program that includes instructions for processing a neural network in the form of a data flow graph. To implement the instructions in the program, the processor can determine a schedule of operations to be performed by a set of the hardware units and distribute the schedule to that set of hardware units. The schedule of operations may include a specific set of instructions that are to be performed by the set of hardware units in a particular order. A peer-to-peer (P2P) communication mechanism may be implemented in the set of instructions to allow the set of hardware units to communicate with each other in an orderly manner.
The techniques described in the present application provide a number of benefits and advantages over conventional methods of processing neural networks on hardware. For instance, employing a P2P communication mechanism that allows hardware units to communicate with each other while executing schedules of operations that implement neural network operations reduces latency in the system. This is because conventional methods of processing neural networks on hardware typically use the processor as a centralized arbiter, requiring hardware units to communicate with the processor in order to coordinate the schedule of operations. The techniques described in the present application eliminate the need for such a centralized arbiter, thereby reducing communication between the processor and the hardware units. Reducing latency can allow for higher hardware utilization.
Data flow enabler (DFE) 105 is responsible for executing instructions for processing data through neural networks (e.g., training neural networks, using neural networks to perform inference, etc.). For example, DFE 105 may receive machine learning (ML) instructions 130 for processing data through a neural network. In some embodiments, ML instructions 130 are implemented by a set of programs generated by an application (e.g., a programming integrated development environment (IDE) application). The application may generate the set of programs based on a set of machine learning libraries (e.g., a set of TensorFlow libraries, a set of PyTorch libraries, a set of Open Neural Network Exchange (ONNX) libraries, etc.). ML instructions 130 can be expressed in terms of a data flow graph in some embodiments.
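For illustration purposes only, the following Python sketch shows one hypothetical way that a program such as ML instructions 130 could express the computation f(X·W) as a data flow graph; the dictionary layout and node names are assumptions made for this example and are not part of the disclosure.

```python
# Hypothetical data flow graph for the computation f(X @ W).
# The node names and structure are illustrative only; in practice such a
# graph would typically be produced from a framework such as TensorFlow,
# PyTorch, or an ONNX export.
dataflow_graph = {
    "nodes": {
        "read_X":  {"op": "read",   "inputs": []},
        "read_W":  {"op": "read",   "inputs": []},
        "matmul":  {"op": "matmul", "inputs": ["read_X", "read_W"]},
        "act_f":   {"op": "f",      "inputs": ["matmul"]},
        "write_Y": {"op": "write",  "inputs": ["act_f"]},
    }
}
```

Each entry above corresponds to a node of the data flow graph, and each name in an "inputs" list corresponds to an edge carrying a matrix from a producing node to a consuming node.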
To process ML instructions 130, DFE 105 may determine a hardware definition that specifies hardware units 120A-N and the functions that each of the hardware units 120A-N is configured to perform. Based on the hardware definition, DFE 105 can determine a schedule of operations to be performed by one or more hardware units 120A-N to implement ML instructions 130. In some embodiments, DFE 105 determines the schedule by generating a set of instructions for each of the hardware units 120A-N used to implement ML instructions 130. Then, DFE 105 distributes the set of instructions to the instruction queues 110A-N of the respective hardware units 120A-N.
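Continuing the illustration above, the following sketch shows one hypothetical way the scheduling step could be modeled: each graph node is assigned to the hardware unit that the hardware definition indicates can perform its operation, and an instruction is appended to that unit's queue. The hardware_definition mapping, the unit names, and the assumption that nodes are listed in a valid execution order are illustrative assumptions only.

```python
from collections import defaultdict

# Hypothetical hardware definition: operation type -> hardware unit that is
# configured to perform it (unit names are illustrative only).
hardware_definition = {
    "read":   "unit_120C",
    "matmul": "unit_120B",
    "f":      "unit_120N",
    "write":  "unit_120A",
}

def schedule(graph, hw_def):
    """Assign each graph node to the unit configured for its operation and
    return per-unit instruction lists (stand-ins for instruction queues)."""
    queues = defaultdict(list)
    for name, node in graph["nodes"].items():  # assumes a valid execution order
        unit = hw_def[node["op"]]
        queues[unit].append({"node": name, "op": node["op"], "inputs": node["inputs"]})
    return queues
```

Using the graph from the previous sketch, schedule(dataflow_graph, hardware_definition) would place the read instructions on the queue for unit_120C, the matrix multiplication on the queue for unit_120B, the activation on the queue for unit_120N, and the write on the queue for unit_120A.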
DFE 105 can receive responses from hardware units 120A-N via response queues 115A-N. A response may indicate that a particular hardware unit 120 has completed one or more successive instructions received from DFE 105. This allows DFE 105 to determine the availability of space in instruction queues 110A-N. In some cases, a response can indicate any error conditions encountered by hardware units 120A-N. In addition, DFE 105 may use the responses that DFE 105 receives from hardware units 120A-N to prepare future instructions to hardware units 120A-N.
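The response handling described above might be modeled, purely for illustration, by simple per-unit bookkeeping such as the sketch below; the queue depth, the unit names, and the response fields are assumptions made for this example.

```python
# Hypothetical bookkeeping for responses arriving on response queues.
# Each response reports how many successive instructions a unit has completed
# (and any error condition), which lets the DFE track free space in that
# unit's instruction queue before issuing further instructions.
QUEUE_DEPTH = 8  # assumed per-unit instruction queue capacity

outstanding = {"unit_120A": 0, "unit_120B": 0, "unit_120C": 0, "unit_120N": 0}

def on_issue(unit):
    """Record that one instruction has been sent to the unit's queue."""
    outstanding[unit] += 1

def on_response(unit, response):
    """Free queue space for completed instructions and surface any error."""
    if response.get("error"):
        raise RuntimeError(f"{unit} reported an error: {response['error']}")
    outstanding[unit] -= response.get("completed", 1)

def can_issue(unit):
    """Return True if the unit's instruction queue has room for more work."""
    return outstanding[unit] < QUEUE_DEPTH
```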
In some embodiments, DFE 105 can be implemented as a hardware processor with software operating on the hardware processor. The software may include the logic for the operations that are described in the present application as being performed by DFE 105.
Each of the hardware units 120A-N is configured to perform a particular set of functions. Examples of such functions include reading data from memory 125, writing data to memory 125, performing matrix multiplication operations, performing activation operations, performing various types of element-wise operations, etc.
When DFE 105 receives ML instructions 130 (the set of programs in this example), DFE 105 determines a schedule of a set of operations that are to be performed by a set of hardware units 120A-N in order to implement ML instructions 130. In this example, hardware unit 120A is configured to write data to memory 125, hardware unit 120B is configured to perform matrix multiplication operations, hardware unit 120C is configured to read data from memory 125, and hardware unit 120N is configured to perform function f( ). Here, DFE 105 determines the schedule of the set of operations by generating a set of instructions for hardware units 120A, 120B, 120C, and 120N to implement ML instructions 130. Specifically, DFE 105 generates a first instruction to read input data X and W from memory 125, a second instruction to perform matrix multiplication on input data X and W, a third instruction to perform function f( ) on the output of the matrix multiplication operation, and a fourth instruction to write the output of function f( ) to memory 125. DFE 105 distributes these instructions to hardware units 120A, 120B, 120C, and 120N by sending the first instruction to hardware unit 120C via instruction queue 110C, sending the second instruction to hardware unit 120B via instruction queue 110B, sending the third instruction to hardware unit 120N via instruction queue 110N, and sending the fourth instruction to hardware unit 120A via instruction queue 110A.
In some embodiments, an instruction that DFE 105 generates includes three parameters: a first token, an operation to perform upon receiving the first token, and an instruction to generate a second token after performing the operation and send the second token to a particular hardware unit. In some cases where the first token is null or empty, the operation can be performed without needing to receive a token (i.e., the operation is performed upon processing of the instruction). In other cases, the instruction to generate a second token may be null or empty. For such an instruction, the second token is not generated after the operation is performed. As illustrated in
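As a non-authoritative illustration of this three-parameter format, the sketch below encodes the f(X·W) example above as token-chained instructions, one per hardware unit; the tuple layout, token names, and unit names are hypothetical and chosen only to show how the tokens order the operations.

```python
# Hypothetical encoding of the three-parameter instruction described above,
# applied to computing f(X @ W): (token to wait for, operation,
# (token to generate, destination unit)).  None stands for a null/empty field.
instructions = {
    # Read X and W from memory, then notify the matrix-multiplication unit.
    "unit_120C": [(None, "read X, W from memory 125", ("t1", "unit_120B"))],
    # Wait for t1, multiply X and W, then notify the activation unit.
    "unit_120B": [("t1", "matmul(X, W)", ("t2", "unit_120N"))],
    # Wait for t2, apply f(), then notify the memory-write unit.
    "unit_120N": [("t2", "f(matmul output)", ("t3", "unit_120A"))],
    # Wait for t3, write the result to memory; no further token is generated.
    "unit_120A": [("t3", "write result to memory 125", None)],
}
```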
In some embodiments, DFE 105, instruction queues 110A-N, response queues 115A-N, hardware units 120A-N and memory 125 are implemented on a single chip. In some such embodiments, hardware system 100 can include additional chips similar to this chip. That is, these additional chips can include a DFE, instruction queues, response queues, hardware units, and memory similar to that shown in
Based on a hardware definition specifying the set of hardware units and functions that each hardware unit in the set of hardware units is configured to perform, process 500 determines, at 520, a schedule of a set of operations to be performed by a subset of the set of hardware units to implement the set of instructions. Referring to
Finally, process 500 distributes, at 530, the schedule of the set of operations to the subset of the set of hardware units. Referring to
The techniques described above may be implemented in a wide range of computer systems configured to process neural networks.
Bus subsystem 604 can provide a mechanism for letting the various components and subsystems of computer system 600 communicate with each other as intended. Although bus subsystem 604 is shown schematically as a single bus, alternative embodiments of the bus subsystem can utilize multiple busses.
Network interface subsystem 616 can serve as an interface for communicating data between computer system 600 and other computer systems or networks. Embodiments of network interface subsystem 616 can include, e.g., Ethernet, a Wi-Fi and/or cellular adapter, a modem (telephone, satellite, cable, ISDN, etc.), digital subscriber line (DSL) units, and/or the like.
Storage subsystem 606 includes a memory subsystem 608 and a file/disk storage subsystem 610. Subsystems 608 and 610 as well as other memories described herein are examples of non-transitory computer-readable storage media that can store executable program code and/or data that provide the functionality of embodiments of the present disclosure.
Memory subsystem 608 includes a number of memories including a main random access memory (RAM) 618 for storage of instructions and data during program execution and a read-only memory (ROM) 620 in which fixed instructions are stored. File storage subsystem 610 can provide persistent (e.g., non-volatile) storage for program and data files, and can include a magnetic or solid-state hard disk drive, an optical drive along with associated removable media (e.g., CD-ROM, DVD, Blu-Ray, etc.), a removable flash memory-based drive or card, and/or other types of storage media known in the art.
It should be appreciated that computer system 600 is illustrative and many other configurations having more or fewer components than system 600 are possible.
In various embodiments, the present disclosure includes systems, methods, and apparatuses for determining schedules for processing neural networks on hardware. The techniques described herein may be embodied in a non-transitory machine-readable medium storing a program executable by a computer system, the program comprising sets of instructions for performing the techniques described herein. In some embodiments, a system includes a set of processing units and a non-transitory machine-readable medium storing instructions that, when executed by at least one processing unit in the set of processing units, cause the at least one processing unit to perform the techniques described above. In some embodiments, the non-transitory machine-readable medium may be memory, for example, which may be coupled to one or more controllers or one or more artificial intelligence processors.
The following techniques may be embodied alone or in different combinations and may further be embodied with other techniques described herein.
For example, in one embodiment, the present disclosure includes a system comprising a processor and a set of hardware units, wherein the processor is configured to receive a set of instructions that define processing of data through a neural network; based on a hardware definition specifying the set of hardware units and functions that each hardware unit in the set of hardware units is configured to perform, determine a schedule of a set of operations to be performed by a subset of the set of hardware units to implement the set of instructions; and distribute the schedule of the set of operations to the subset of the set of hardware units.
In one embodiment, the set of instructions is a first set of instructions. Determining the schedule of the set of operations comprises generating a second set of instructions for the subset of the set of hardware units, wherein distributing the schedule of the set of operations to the subset of the set of hardware units comprises distributing the second set of instructions to the subset of the set of hardware units.
In one embodiment, a first instruction in the second set of instructions is distributed to a first hardware unit in the subset of the set of hardware units. The first instruction specifies an operation to perform and a second instruction to generate a token after performing the operation and send the token to a second hardware unit in the subset of the set of hardware units.
In one embodiment, a first instruction in the second set of instructions is distributed to a first hardware unit in the subset of the set of hardware units. The first instruction specifies a first token, an operation to perform upon receiving the first token, and a second instruction to generate a second token after performing the operation and send the second token to a second hardware unit in the subset of the set of hardware units.
In one embodiment, an instruction in the second set of instructions is distributed to a hardware unit in the subset of the set of hardware units. The instruction specifies an operation to perform upon receiving a token.
In one embodiment, a first instruction in the second set of instructions is distributed to a first hardware unit in the subset of the set of hardware units. The first instruction specifies a first operation to perform and a second instruction to generate a first token after performing the first operation and send the first token to a second hardware unit in the subset of the set of hardware units. A third instruction in the second set of instructions is distributed to the second hardware unit. The third instruction specifies a second operation to perform upon receiving the first token and a fourth instruction to generate a second token after performing the second operation and send the second token to a third hardware unit in the subset of the set of hardware units. A fifth instruction in the second set of instructions is distributed to the third hardware unit. The fifth instruction specifies a third operation to perform upon receiving the second token.
In one embodiment, the present disclosure further comprises memory. One of the first, second, and third hardware units is configured to read data from the memory. One of the first, second, and third operations distributed to the one of the first, second, and third hardware units is to retrieve the data from the memory.
In one embodiment, the present disclosure further comprises memory. One of the first, second, and third hardware units is configured to write data to the memory. One of the first, second, and third operations distributed to the one of the first, second, and third hardware units is to write the data to the memory.
In one embodiment, one of the first, second, and third hardware units is configured to perform matrix multiplication operations. One of the first, second, and third operations distributed to the one of the first, second, and third hardware units is to perform a matrix multiplication operation on a first matrix and a second matrix.
In one embodiment, one of the first, second, and third hardware units is configured to perform activation functions. One of the first, second, and third operations distributed to the one of the first, second, and third hardware units is to perform an activation function.
In one embodiment, the processor is a first processor and the set of hardware units is a first set of hardware units. The present disclosure further comprises a first chip and a second chip. The first chip includes the first processor and the first set of hardware units. The second chip includes a second processor and a second set of hardware units. The schedule of the set of operations is to be further performed by a subset of the second set of hardware units. The set of instructions is a first set of instructions. Determining the schedule of the set of operations further comprises determining a third set of instructions and sending the third set of instructions to the subset of the second set of hardware units.
In one embodiment, the present disclosure further comprises a set of queues. Each queue in the set of queues is configured to store instructions for a hardware unit in the set of hardware units. Distributing the second set of instructions to the subset of the set of hardware units comprises sending the second set of instructions to a subset of the set of queues for the subset of the set of hardware units.
In one embodiment, the set of instructions is implemented in a program generated by an application.
In one embodiment, the program is generated based on a set of machine learning libraries.
In one embodiment, the set of instructions is expressed in terms of a data flow graph.
In one embodiment, the data flow graph comprises a set of nodes and a set of edges connecting the set of nodes. Each node in the set of nodes represents a mathematical operation. Each edge in the set of edges represents a matrix on which a particular instance of a mathematical operation is performed.
In one embodiment, the processing of the data through the neural network comprises training the neural network based on the data.
The above description illustrates various embodiments of the present disclosure along with examples of how aspects of the particular embodiments may be implemented. The above examples should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of the particular embodiments as defined by the following claims. Based on the above disclosure and the following claims, other arrangements, embodiments, implementations and equivalents may be employed without departing from the scope of the present disclosure as defined by the claims.