This application claims foreign priority from United Kingdom patent application No. GB2320004.1, filed on 22 Dec. 2023, the contents of which are incorporated by reference herein in their entirety.
This application is directed to methods and neural network accelerators for executing a dynamic neural network.
An artificial neural network, which will be referred to herein as a neural network, comprises one or more interconnected layers that can be used for machine learning applications. In particular, a neural network can be used in signal processing applications, including, but not limited to, image processing and computer vision applications.
The data input to and output from a layer of a neural network can be described as a tensor. As is known to those of skill in the art, a tensor is a generalization of vectors and matrices and can be considered as an n-dimensional array. A vector is a one-dimensional tensor, and a matrix is a two-dimensional tensor. The tensors in a neural network are often, but are not necessarily, four-dimensional. Reference is made to
The processing that is performed on the input tensor to a layer depends on the type of layer. For example, each layer of a neural network may be one of a plurality of different types. Example neural network layer types include, but are not limited to, a convolution layer, an activation layer, a normalisation layer, a pooling layer, a fully connected layer, and a batch normalisation layer. It will be evident to a person of skill in the art that these are example neural network layer types and that this is not an exhaustive list and there may be other neural network layer types.
A convolution layer convolves the input tensor with weights associated with the layer. Specifically, each convolution layer is associated with a plurality of weights k1 . . . kg, which may also be referred to as filter weights or coefficients. The weights are grouped to form one or more filters or kernels, and each filter may be associated with an offset bias. Each filter may have a dimension KW×KH×Cin (i.e., each filter may comprise a set of KW×KH×Cin weights k), where Cin is the number of channels in the input tensor. Each filter may be applied to the input tensor according to a convolution operation across steps sW and sH in the W and H directions. The step sizes sW and sH may be referred to as the strides of the convolution. The number and dimensions of filters and/or the number of weights per filter may vary between convolution layers. A convolutional neural network (CNN), which is a specific type of neural network that is effective for image recognition and classification, generally comprises a plurality of convolution layers.
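By way of illustration only, the following sketch (in C, with illustrative variable names; padding is omitted) shows how a single filter may be applied with strides sW and sH, and how the output spatial dimensions follow from the input dimensions, the filter dimensions and the strides.

```c
#include <stddef.h>

/* Naive convolution of one filter over an input tensor stored as
 * input[h][w][c] in row-major order. Padding is omitted; with no padding the
 * output size is out_h = (in_h - kh) / sh + 1 and out_w = (in_w - kw) / sw + 1.
 * All names are illustrative only. */
void conv2d_single_filter(const float *input, size_t in_h, size_t in_w, size_t c_in,
                          const float *weights, size_t kh, size_t kw,
                          float bias, size_t sh, size_t sw, float *output)
{
    size_t out_h = (in_h - kh) / sh + 1;
    size_t out_w = (in_w - kw) / sw + 1;

    for (size_t oh = 0; oh < out_h; oh++) {
        for (size_t ow = 0; ow < out_w; ow++) {
            float acc = bias; /* each output element starts from the offset bias */
            for (size_t i = 0; i < kh; i++)
                for (size_t j = 0; j < kw; j++)
                    for (size_t c = 0; c < c_in; c++)
                        acc += input[((oh * sh + i) * in_w + (ow * sw + j)) * c_in + c]
                             * weights[(i * kw + j) * c_in + c];
            output[oh * out_w + ow] = acc;
        }
    }
}
```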
An activation layer, which often, but not necessarily, follows a convolution layer, applies one or more activation functions to the input tensor. An activation function receives an input tensor and performs a certain non-linear mathematical operation on each value or element in the input tensor. In other words, the activation function operates on each value or element in the input tensor separately. In some examples, an activation layer may act as a rectified linear unit (ReLU) by implementing a ReLU function or a leaky rectified linear unit (LReLU) by implementing a LReLU function.
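By way of illustration only, the following sketch (in C) shows the element-wise nature of such activation functions; the leak factor alpha used for the LReLU function is illustrative.

```c
#include <stddef.h>

/* Element-wise activation functions: each output element depends only on the
 * corresponding input element. The leak factor alpha (e.g. 0.01f) is
 * illustrative. */
static float relu(float x)               { return x > 0.0f ? x : 0.0f; }
static float lrelu(float x, float alpha) { return x > 0.0f ? x : alpha * x; }

void activation_layer(const float *in, float *out, size_t n, int use_lrelu, float alpha)
{
    for (size_t i = 0; i < n; i++)
        out[i] = use_lrelu ? lrelu(in[i], alpha) : relu(in[i]);
}
```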
A normalisation layer is configured to perform a normalising function, such as a Local Response Normalisation (LRN) function on the input tensor.
A pooling layer performs a pooling function, such as a max, min or average function, to summarise subsets of the input tensor. The purpose of a pooling layer is thus to reduce the spatial size of the representation to reduce the number of parameters and computation in the network, and hence to also control overfitting.
A fully connected layer, which often, but not necessarily, follows a plurality of convolution and pooling layers, takes a two-dimensional tensor (e.g. a tensor with a batch size dimension and a channel dimension) of input data values and outputs a two-dimensional tensor (e.g. a tensor with a batch size dimension and a channel dimension). Where the neural network is used for classification, the output may have A channels, where A is the number of classes, and each value in the tensor may represent the probability of a certain class. The output tensor is generated through a matrix multiplication of the input with a set of weights, optionally followed by a bias offset. A fully connected layer thus receives a set of weights and may receive a bias.
A batch normalisation (often referred to as “batch norm”) layer, which often, but not necessarily, follows a convolution layer, applies a per channel affine transformation to an input tensor. Batch normalisation layers may be added to a neural network to make training of the neural network faster and more stable by re-centring and re-scaling the inputs to a subsequent layer.
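By way of illustration only, the following sketch (in C) shows a per channel affine transformation of the kind applied by a batch normalisation layer at inference time; it is assumed that the learned parameters and recorded channel statistics have been folded into a per channel scale and shift.

```c
#include <stddef.h>

/* Per channel affine transformation of a tensor stored as in[h][w][c]:
 * out = scale[c] * in + shift[c]. The scale and shift are assumed to have
 * been derived from the batch normalisation parameters ahead of time. */
void batch_norm_inference(const float *in, float *out,
                          size_t h, size_t w, size_t c,
                          const float *scale, const float *shift)
{
    for (size_t i = 0; i < h * w; i++)
        for (size_t ch = 0; ch < c; ch++)
            out[i * c + ch] = scale[ch] * in[i * c + ch] + shift[ch];
}
```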
Many neural networks are static—i.e., they have a fixed sequence of layers or operations which are applied to each set of input data. For example, a static neural network may comprise a convolution layer, an activation layer, a second convolution layer, and a second activation layer, and each set of input data to the neural network (e.g., each inference) is processed by those four layers in that order. However, there are some neural networks which are dynamic, meaning the layers or operations which are applied to a set of input data, and/or the number of times a set of layers is applied to a set of input data, is dynamic or can change. For example, a dynamic neural network may (i) apply layers 1, 2, 3 in some cases and layers 4, 5, 6 in other cases; or (ii) apply layers 1, 2, 3 once in some cases and three times in other cases. A dynamic neural network is thus a network that comprises one or more loops and/or branches. An example of a dynamic neural network is a recurrent neural network with reinforcement learning wherein the layers or operations that are applied to a set of input data are based on a dynamic input.
A recurrent neural network with reinforcement learning may be used, for example, to control a robot or for natural language processing. If a recurrent neural network with reinforcement learning is used to control a robot, the neural network receives an input (which represents the environment) and the state (the result from the last interaction) and produces an output to guide the robot's actions. In contrast, if a recurrent neural network with reinforcement learning is used for natural language processing, the neural network may process each word based on the state and other words that have recently been processed in order to understand and/or translate sentences. In both examples the operations that are performed on a set of input data are quite similar each time, but it is the number of times that the operations are performed that is variable. For example, the number of times the operations are performed on an input sentence may depend on the number of words in the sentence.
As dynamic neural networks become more prevalent, it is desirable to be able to execute such networks in a hardware efficient manner (e.g., in a manner that requires less silicon area, less system bandwidth or less processing power).
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Described herein are neural network accelerators with one or more neural network accelerator cores. Each neural network accelerator core comprises: one or more hardware accelerators configured to accelerate one or more neural network operations; an embedded processor; a command decoder; and a hardware feedback path between the embedded processor and the command decoder. The command decoder of at least one neural network accelerator core of the one or more neural network accelerator cores is configured to control the one or more hardware accelerators and the embedded processor of that core in accordance with commands of a command stream, and when the command stream comprises a set of one or more branch commands that indicate a conditional branch is to be performed, cause the embedded processor to determine a next command stream, and in response to receiving information from the embedded processor identifying the next command stream via the hardware feedback path, control the one or more hardware accelerators and the embedded processor in accordance with commands of the next command stream.
A first aspect provides a neural network accelerator comprising one or more neural network accelerator cores, each neural network accelerator core comprising: one or more hardware accelerators, each hardware accelerator configured to accelerate one or more neural network operations; an embedded processor; a command decoder; and a hardware feedback path between the embedded processor and the command decoder; wherein the command decoder of at least one neural network accelerator core of the one or more neural network accelerator cores is configured to control the one or more hardware accelerators and the embedded processor of that core in accordance with commands of a command stream, and when the command stream comprises a set of one or more branch commands that indicate a conditional branch is to be performed, cause the embedded processor to determine a next command stream, and in response to receiving information from the embedded processor identifying the next command stream via the hardware feedback path, control the one or more hardware accelerators and the embedded processor in accordance with commands of the next command stream.
A second aspect provides a method of processing a dynamic neural network at a neural network accelerator comprising one or more hardware accelerators, an embedded processor, and a command decoder, the method comprising, at the command decoder: controlling the one or more hardware accelerators and the embedded processor in accordance with commands of a command stream; in response to determining that the command stream comprises a set of one or more branch commands that indicate that a conditional branch is to be performed, causing the embedded processor to determine a next command stream; and in response to receiving information from the embedded processor identifying the next command stream, controlling the one or more hardware accelerators and the embedded processor in accordance with commands of the next command stream.
A third aspect provides a neural network accelerator comprising one or more hardware accelerators, an embedded processor and a command decoder, wherein the neural network accelerator is configured to perform the method of the second aspect.
The neural network accelerators described herein may be embodied in hardware on an integrated circuit. There may be provided a method of manufacturing, at an integrated circuit manufacturing system, a neural network accelerator as described herein. There may be provided an integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, configures the system to manufacture a neural network accelerator as described herein. There may be provided a non-transitory computer readable storage medium having stored thereon a computer readable description of a neural network accelerator as described herein that, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying the neural network accelerator.
There may be provided an integrated circuit manufacturing system comprising: a non-transitory computer readable storage medium having stored thereon a computer readable description of a neural network accelerator described herein; a layout processing system configured to process the computer readable description so as to generate a circuit layout description of an integrated circuit embodying the neural network accelerator; and an integrated circuit generation system configured to manufacture the neural network accelerator according to the circuit layout description.
There may be provided computer program code for performing any of the methods described herein. There may be provided non-transitory computer readable storage medium having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform any of the methods described herein.
The above features may be combined as appropriate, as would be apparent to a skilled person, and may be combined with any of the aspects of the examples described herein.
Examples will now be described in detail with reference to the accompanying drawings in which:
The accompanying drawings illustrate various examples. The skilled person will appreciate that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the drawings represent one example of the boundaries. It may be that in some examples, one element may be designed as multiple elements or that multiple elements may be designed as one element. Common reference numerals are used throughout the figures, where appropriate, to indicate similar features.
The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The present invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art.
Embodiments will now be described by way of example only.
Described herein are methods and neural network accelerators for executing a dynamic neural network.
Neural networks are often expensive to implement in terms of computation, bandwidth and power. Accordingly, neural network accelerators (NNAs) have been developed that allow neural networks to be implemented in an efficient manner (e.g., in a manner that requires less silicon area or less processing power).
An NNA is a hardware accelerator that is designed to accelerate the processing of a neural network. As is known to those of skill in the art, a hardware accelerator is hardware designed to perform a specific set of one or more functions more efficiently than a general processing unit, such as a central processing unit (CPU). Accordingly, in contrast to a general CPU which can be configured to perform any number of functions, an accelerator can only perform a limited set of one or more functions. NNAs comprise one or more hardware accelerators designed to accelerate one or more neural network operations. Therefore a graphics processing unit (GPU) with one or more hardware accelerators designed to accelerate one or more neural network operations can be understood to be an NNA. A neural network operation is defined herein as an operation that is used to implement all or a part of a neural network layer. A neural network layer may be implemented by one or more neural network operations. Example neural network operations include, but are not limited to, convolution operations, non-linear operations, pooling operations and normalisation operations.
An NNA may therefore comprise, for example, a convolution accelerator with one or more convolution engines which is configured to accelerate convolution operations, an activation accelerator which is configured to accelerate non-linear operations, a pooling accelerator with one or more pooling engines which is configured to accelerate pooling operations, and/or a normalisation accelerator configured to accelerate normalisation operations. It will be evident to a person of skill in the art that this is just an example set of accelerators that an NNA may have, and NNAs may have additional accelerators, fewer accelerators or a different combination of accelerators. NNAs may also have other components such as, but not limited to, interconnection hardware that connects the accelerators, a command decoder (which may also be referred to as a controller) which controls the operation of the other components (e.g. hardware accelerators) in response to a set of commands etc.
NNAs may be configured to execute a neural network over one or more hardware passes of the NNA. A hardware pass of the NNA is defined herein as performing some processing using one or more components (e.g. accelerators) of the NNA to generate processed data. The processed data of a hardware pass may be output from the NNA to memory, or stored in the NNA for use in a subsequent hardware pass. The memory which is used to store the processed data of a hardware pass may be memory that is external to the NNA, but is internal to the chip on which the NNA is situated (i.e., on-chip memory), or memory that is external to the NNA and is external to the chip on which the NNA is situated (i.e., off-chip memory).
NNAs may have hardware constraints (e.g., the size of buffers, number of convolution engines, number of pooling engines) that limit the processing that can be performed in a hardware pass, or the order in which, or number of times that, a hardware pass can use components (e.g. hardware accelerators) of the NNA. Where all of the processing to implement a neural network cannot be completed in a single hardware pass of the NNA, the processing may have to be split into multiple hardware passes of the NNA.
In some examples, the hardware passes to perform or implement a pass of a neural network may be identified by first mapping each layer of the neural network to a sequence of one or more low level layers, wherein a low level layer is a set of one or more operations that can be performed by a single component (e.g. accelerator) of the neural network accelerator. In other words, each low level layer corresponds to a component (e.g. hardware accelerator) of the neural network accelerator.
Once the layers of the neural network have been mapped to low level layers, the low level layers may be divided into one or more layer groups, wherein each layer group comprises a sequence of one or more low level layers that can be implemented on the NNA. The sequences of low level layers that can be implemented by the NNA depend on the components (e.g. hardware accelerators etc.) of the NNA and how they can be connected to process data. For example, if an NNA comprises a convolution accelerator and a pooling accelerator that can be connected to form a pipeline, the NNA can perform convolution operations and pooling operations in the same hardware pass. This means that a layer group may comprise a low level convolution layer followed by a low level pooling layer. Since each low level layer corresponds to a component of the NNA, each layer group corresponds to a sequence of components (e.g. hardware accelerators) of the NNA.
Once the low level layers have been split into one or more layer groups, it is determined, for each layer group, whether that layer group can be implemented in a single hardware pass of the NNA. Specifically, depending on the NNA hardware constraints, it may not be possible to perform all of the processing associated with a layer group in the same hardware pass. For example, the input tensor to the first layer of the layer group may be too large to be processed in a single hardware pass. Accordingly, if it is determined that a layer group cannot be implemented in a single hardware pass of the NNA, that layer group is divided into a plurality of hardware passes. An example method for identifying hardware passes to implement a pass of a neural network is described in the Applicant's UK patent application no. 2209584.8, which is herein incorporated by reference in its entirety.
Once a neural network has been mapped to a set of hardware passes, commands are generated for each hardware pass that cause the neural network accelerator to implement that hardware pass. The commands for a hardware pass may comprise, for example, information that identifies the components (e.g. hardware accelerators) that are active in the hardware pass, how the active components are to be configured, the format of one or more data sets used in the hardware pass etc. The commands for all the hardware passes to implement the neural network form a command stream for the neural network which, when processed by the command decoder, causes the NNA to execute the entire neural network. For a static neural network, the command stream may be loaded into memory accessible by the NNA (e.g. by a host (e.g. CPU) controlling the NNA) and the command decoder (e.g. controller) of the NNA may be provided with the address of the start of the command stream. The command decoder may then process the commands from start to finish for a first set of input data to the neural network (e.g. for a first inference or first pass), and then process those same commands for the next set of input data to the neural network (for the next inference or the next pass) etc. This allows the NNA to iteratively execute the neural network (e.g. iteratively perform inferences or passes) without intervention from the host (e.g. CPU).
As described above, for a dynamic neural network there are points in the neural network where a decision is made as to which part of the neural network is to be executed next. The hardware accelerators of most NNAs are not capable of performing the processing necessary to determine which part of the neural network is to be executed next. Accordingly, known methods for implementing a dynamic neural network on such NNAs (which is not an admission that such methods are known outside of the Applicant company or are well-known) comprise grouping the hardware passes of the dynamic neural network into segments which can be executed by an NNA without intervention from the host (e.g. CPU), creating a separate command stream for each segment, and having the host (e.g. CPU) cause the NNA to execute the appropriate segment command stream at the appropriate time. This means that after the NNA executes or processes a segment command stream the NNA has to wait for the host (e.g. CPU) to (i) determine what segment command stream is to be executed next and (ii) notify the NNA of the next segment command stream. This also means that the host (e.g. CPU) has to constantly monitor the status of the NNA and make decisions on what segment is to be executed next. Accordingly, this is not a very efficient means of implementing a dynamic neural network on an NNA.
However, new NNAs have recently been developed that have, in addition to one or more hardware accelerators to accelerate one or more neural network operations, an embedded processor which can perform more complicated and more varied operations than the hardware accelerators. Specifically, the embedded processor may be able to execute one of a plurality of pre-configured programs. In some cases, the embedded processor may take the form of a micro-controller. The term micro-controller is used herein to mean a small and low-cost micro-computer which is designed to perform one or more tasks or operations within an embedded system. A micro-controller comprises at least a processor, memory and one or more input/output (I/O) ports which can be used to connect the micro-controller to other components of the system.
The inventor has determined that a dynamic neural network can be efficiently executed on an NNA with an embedded processor by using the embedded processor to dynamically determine which segment command stream is to be executed next. Specifically, this is accomplished by adding a hardware feedback path between the embedded processor and the command decoder, and configuring the command decoder to process a new set of commands, which will be referred to as branch commands, which indicate to the command decoder that it is time to branch or jump to another command stream. The branch may be an unconditional branch (e.g. the next command stream to be processed may be predetermined) or may be a conditional branch (e.g., the next command stream to be processed may not be predetermined). Then, when the command decoder processes a set of one or more branch commands the command decoder determines whether the branch is an unconditional branch or a conditional branch. If the branch is a conditional branch, the command decoder causes the embedded processor to determine the next command stream and notify the command decoder of the next command stream via the hardware feedback path. Once the next command stream has been identified, either because it was predetermined or has been identified by the embedded processor, the command decoder starts processing or executing the commands of the next command stream. Where each command stream comprises commands for one or more hardware passes of the neural network accelerator, the one or more branch commands may be in the form of a new type of hardware pass, which may be referred to as a branch hardware pass.
To execute a dynamic neural network on such a neural network accelerator, the operations of the dynamic neural network are grouped into segments (e.g. sets of hardware passes) which can be executed on the neural network accelerator, and a command stream is generated for each segment (e.g. set of hardware passes) which causes the neural network accelerator to perform the operations of that segment. A set of one or more branch commands is added to the end of each segment command stream (e.g. in the form of a branch hardware pass), except the last segment command stream, that causes the command decoder to branch or jump to the next segment command stream by either branching or jumping to a predetermined command stream identified in the one or more branch commands, or causing the embedded processor to determine which command stream is to be processed next according to one or more criteria and/or one or more conditions. The command stream for each segment may be loaded into memory accessible by the neural network accelerator, and the starting address of the command stream for the first segment is provided to the command decoder. The command decoder may then start processing the commands at that address in order. This allows the complete dynamic neural network to be executed on the neural network accelerator without host (e.g. CPU) intervention, and thus allows a dynamic neural network to be executed in a much more hardware efficient manner.
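By way of illustration only, the following sketch (in C) outlines the control flow just described from the point of view of the command decoder; the command encodings, pass types and helper functions are hypothetical and are not intended to represent the actual command format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative control flow only. The decoder walks a command stream,
 * dispatches operations hardware passes, and on a branch hardware pass either
 * follows a predetermined (unconditional) branch or causes the embedded
 * processor to determine the next command stream and waits for the result on
 * the hardware feedback path. */
typedef enum { PASS_OPERATIONS, PASS_EP, PASS_BRANCH_UNCOND, PASS_BRANCH_COND } pass_type_t;

typedef struct {
    pass_type_t type;
    uint64_t    next_stream_addr;   /* valid for unconditional branch passes          */
    uint32_t    ep_program_idx;     /* valid for EP and conditional branch passes      */
    bool        last_pass;          /* true for the final pass of the network          */
} pass_cmd_t;

extern pass_cmd_t fetch_next_pass(uint64_t *stream_addr);       /* hypothetical helpers */
extern void       run_operations_pass(const pass_cmd_t *cmd);
extern void       run_ep_pass(uint32_t program_idx);
extern uint64_t   ask_ep_for_next_stream(uint32_t program_idx); /* uses the feedback path */

void command_decoder_run(uint64_t stream_addr)
{
    for (;;) {
        pass_cmd_t cmd = fetch_next_pass(&stream_addr);
        switch (cmd.type) {
        case PASS_OPERATIONS:
            run_operations_pass(&cmd);                            /* operations hardware pass */
            break;
        case PASS_EP:
            run_ep_pass(cmd.ep_program_idx);                      /* embedded processor pass  */
            break;
        case PASS_BRANCH_UNCOND:
            stream_addr = cmd.next_stream_addr;                   /* predetermined next stream */
            break;
        case PASS_BRANCH_COND:
            stream_addr = ask_ep_for_next_stream(cmd.ep_program_idx); /* EP decides */
            break;
        }
        if (cmd.last_pass)
            return;
    }
}
```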
For example, reference is now made to
In the example of
A command stream 302, 304, 306 has been generated for each segment, each command stream comprising a set of commands for each operations hardware pass in that segment which, when processed by the command decoder, cause the neural network accelerator to perform a set of one or more desired operations. Specifically, command stream A 302 has been generated for segment A, command stream B 304 has been generated for segment B, and command stream C 306 has been generated for segment C.
A set of one or more branch commands to implement a branch hardware pass has been added to each of command streams A and B 302, 304, which causes the command decoder to, once it has processed the commands for the operations hardware passes in that command stream, jump to the next command stream. Specifically, one or more branch commands to implement a branch (BR) hardware pass P4 have been added to command stream A 302, and one or more branch commands to implement a branch (BR) hardware pass P6 have been added to command stream B 304. A branch hardware pass may implement an unconditional branch (where the next command stream is predetermined) or a conditional branch (where the next command stream is not predetermined and is based on one or more conditions or criteria).
In this example, the set of one or more branch commands to implement the branch (BR) hardware pass P4 in command stream A 302 cause the command decoder to implement an unconditional branch to command stream B 304. When the command decoder (e.g. controller) of the NNA processes such a set of branch commands, the command decoder branches or jumps to (i.e., starts processing) the commands in command stream B 304. An unconditional branch is used in this example because it is known, or predetermined, that after executing command stream A 302, command stream B 304 is to be executed. Thus, no criteria need to be assessed or analysed to determine where to branch to.
The set of one or more branch commands to implement the branch (BR) hardware pass P6 in command stream B 304 cause the command decoder (e.g. controller) of the NNA to implement a conditional branch in which the embedded processor is used to determine which command stream is to be processed next. In this example, the command decoder causes the embedded processor to determine how many times command stream B 304 has been processed (or executed), and if it is less than N, notify the command decoder that command stream B 304 is to be processed again, and if it is equal to N, notify the command decoder that command stream C 306 is to be processed.
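By way of illustration only, the decision made by the embedded processor in this example may be expressed as the following sketch (in C), in which the embedded processor maintains an iteration counter and returns an index identifying the next command stream (the mechanism by which that choice is communicated to the command decoder is described below); the index values and the counter are hypothetical.

```c
#include <stdint.h>

/* Illustrative decision function for the conditional branch at the end of
 * command stream B: if fewer than N iterations have completed, select the
 * entry that points back to command stream B, otherwise select the entry for
 * command stream C. The index assignments are hypothetical. */
#define IDX_STREAM_B 0u
#define IDX_STREAM_C 1u

uint32_t select_next_stream(uint32_t *iteration_count, uint32_t n)
{
    (*iteration_count)++;                                    /* one more pass of stream B done */
    return (*iteration_count < n) ? IDX_STREAM_B : IDX_STREAM_C;
}
```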
Reference is now made to
In the example of
The input unit 420 is hardware configured to receive and store input data to the hardware pipeline 418. The input data may be received from external memory 426 (i.e., memory external to the NNA 400) via a memory interface 428. In some examples, the input unit 420 may comprise one or more buffers to store the received input data. Although the example hardware pipeline 418 of
Each hardware accelerator 406, 408, 410, 412 is configured to accelerate one or more neural network operations. Specifically, each hardware accelerator 406, 408, 410, 412 is configured to receive an input tensor and perform, via hardware logic, one or more operations on the input tensor to generate an output tensor. The hardware pipeline 418 of
The convolution accelerator 406 is hardware configured to accelerate convolution operations. An example implementation of a convolution accelerator 406 is described with respect to
The element-wise operations accelerator 408 is hardware configured to receive input data (e.g. an input tensor) and perform an element-wise operation on the input data (e.g. input tensor), optionally with another data set (e.g. another tensor which may be referred to as the secondary input tensor) which may be obtained or retrieved from external memory 426 (e.g. memory external to the NNA) via the memory interface 428. An element-wise operation is an operation that is performed in the same way on each element of the input data/tensor (e.g. each input data value or each tensel). Element-wise operations which may be performed on the input data include, but are not limited to, add, multiply, maximum, and minimum.
The other data set/tensor may be the same size (e.g. have the same dimensions) as the input data/tensor such that corresponding elements of the two tensors are combined using an element-wise operation. Alternatively, the other data set/tensor and the input data/tensor may have a different size or dimensions. If, for example, the mismatching dimension of one of the tensors is of size 1, an element-wise operation may be performed between the input data/tensor and the other data set/tensor using a broadcast technique wherein the smaller tensor is broadcast (or expanded) to the size of the other tensor. For example, a tensor of size [N, H, W, C]= [1, 10, 1, 10] can be combined element-wise with a tensor of size [N, H, W, C]= [1, 10, 10, 10] by expanding the W dimension of the first tensor.
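By way of illustration only, the following sketch (in C) shows an element-wise addition using the broadcast technique described above, in which the secondary tensor of size [1, 10, 1, 10] is expanded along the W dimension to match the [1, 10, 10, 10] input tensor; both tensors are assumed to be stored in NHWC order.

```c
#include <stddef.h>

/* Element-wise addition with broadcasting along the W dimension: the
 * secondary tensor has W = 1, so the same secondary element is reused for
 * every W position of the primary tensor. Tensors are stored in NHWC order
 * with N = 1. */
void broadcast_add_w(const float *primary,   /* shape [1][h][w][c]  */
                     const float *secondary, /* shape [1][h][1][c]  */
                     float *out, size_t h, size_t w, size_t c)
{
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            for (size_t ch = 0; ch < c; ch++)
                out[(y * w + x) * c + ch] =
                    primary[(y * w + x) * c + ch] + secondary[y * c + ch];
}
```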
The pooling accelerator 410 is hardware configured to accelerate pooling operations such as, but not limited to, max, min and average. The activation accelerator 412 is hardware configured to accelerate non-linear operations such as, but not limited to, ReLU and LReLU.
The output unit 422 is hardware configured to receive the output tensor generated by processing the input data via one or more hardware accelerators 406, 408, 410, 412. In some cases, the output unit 422 may have a buffer or other storage for temporarily storing all or a portion of the output tensor prior to outputting the output tensor from the hardware pipeline 418. In some cases, the output unit 422 may be configured to save the output tensor in external memory 426 (i.e., memory that is external to the neural network accelerator) via the memory interface 428.
The interconnection hardware 424 statically or dynamically connects the input unit 420, one or more hardware accelerators 406, 408, 410, 412, and the output unit 422 to allow input data to flow through (e.g. be processed by) one or more hardware accelerators and then be output from the hardware pipeline 418. In some cases, the interconnection hardware 424 may comprise fixed hardware connections between the input unit 420, the hardware accelerators 406, 408, 410, 412 and the output unit 422 that allow data to flow through the input unit 420, the hardware accelerators 406, 408, 410, 412 to the output unit 422 in a limited number of ways. However, in other cases, the interconnection hardware 424 may comprise hardware that can dynamically connect the input unit 420, the hardware accelerators 406, 408, 410, 412 and the output unit 422 in a plurality of different ways in response to one or more control signals.
For example, the interconnection hardware 424 may comprise a crossbar and the input unit 420, the hardware accelerators 406, 408, 410, 412 and the output unit 422 may be connected to the crossbar in such a manner that the crossbar can dynamically connect the input unit 420, the hardware accelerators 406, 408, 410, 412 and the output unit 422 in a plurality of different ways in response to one or more control signals. For example, in one hardware pass of the hardware pipeline 418 the crossbar may connect the output of the input unit 420 to the input of the convolution accelerator 406, connect the output of the convolution accelerator 406 to the input of the element-wise operations accelerator 408, and then connect the output of the element-wise operations accelerator 408 to the input of the output unit 422 so that the input data for the hardware pass is processed by the convolution accelerator 406 then the element-wise operations accelerator 408. In another hardware pass, the crossbar may connect the output of the input unit 420 to the input of the convolution accelerator 406, and the output of the convolution accelerator 406 to the input of the output unit 422 so that the input data for the hardware pass is processed only by the convolution accelerator 406. Accordingly, in these cases the connections between the input unit 420, the hardware accelerators 406, 408, 410, 412 and the output unit 422 (and thus the manner in which data may flow through the input unit 420, the hardware accelerators 406, 408, 410, 412 and the output unit 422) are not fixed or static.
The embedded processor 414 can execute programs or functions. In some cases, the embedded processor 414 may be a micro-controller. The programs or functions which the embedded processor 414 can execute may be stored in internal memory 430. In some cases, the embedded processor can execute programs or functions that cause the embedded processor 414 to perform operations on data input to the embedded processor 414. The embedded processor 414 may be able to receive data from external memory 426 via the memory interface 428 or from the hardware pipeline 418 via internal paths. For example, in some cases, the output unit 422 of the hardware pipeline 418 may be able to write data to internal memory 430 of the NNA, via the memory interface 428, which the embedded processor has access to. There is a hardware feedback path between the embedded processor 414 and the command decoder 416 which enables the embedded processor 414 to control the operation of the command decoder 416 and specifically, to control which command stream the command decoder 416 is to process next. The hardware feedback path is a set of one or more hardware components that enable the embedded processor to control which command stream the command decoder 416 is to process next. As described in more detail below, in some examples, the hardware feedback path may comprise a set of registers and/or set of interfaces situated between the embedded processor 414 and the command decoder 416 which allow the embedded processor 414 to notify the command decoder 416 of which command stream the command decoder 416 is to process next. However, in other examples, the hardware feedback path may take a different form. In the example shown in
The command decoder 416 controls the operation of the other components of the NNA core 402 in accordance with a set of commands that form a command stream.
The example neural network accelerator 400 of
A command stream thus comprises commands to implement one or more hardware passes. The commands for each operations hardware pass may indicate which input data is to be processed in that hardware pass, which components are to process the input data, how those components are to process the input data, and where the output data is to be stored. Then, for each operations hardware pass, the command decoder 416 sends commands or control information to the appropriate components of the NNA core so that the identified input data will be processed using the desired components in the desired manner.
For example, the commands for an NNA hardware pass may identify which of the hardware accelerators 406, 408, 410 and 412 of the hardware pipeline 418 are to be active (i.e., used) in that hardware pass and what operations they are to perform. The command decoder 416 may then be configured to, in response to processing commands for an NNA hardware pass, (i) send command or control information to the interconnection hardware 424 indicating which accelerators are to be active or used in the hardware pass which causes the interconnection hardware 424 to connect the input unit 420, the active accelerators 406, 408, 410 and/or 412, and the output unit 422 in the desired manner, and (ii) send information to each active hardware accelerator indicating that it is to be active in the hardware pass and how it should be configured in that hardware pass which causes the hardware accelerator to perform a desired operation on the input data to that accelerator. The commands for an NNA hardware pass may also indicate other information, such as, but not limited to, the formats of the input and output data of the active accelerators.
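By way of illustration only, the commands for an NNA hardware pass might be thought of as carrying fields such as those in the following sketch (in C); the structure and field names are hypothetical and do not represent the actual command format.

```c
#include <stdint.h>

/* Hypothetical, simplified view of the information carried by the commands
 * for one operations hardware pass; the real encoding is not defined here. */
typedef struct {
    uint32_t active_accelerators;  /* bitmask: which accelerators are used in the pass */
    uint64_t input_addr;           /* where the input data is read from                */
    uint64_t output_addr;          /* where the output data is written to              */
    uint32_t input_format;         /* e.g. data type / layout of the input             */
    uint32_t output_format;        /* e.g. data type / layout of the output            */
    uint32_t accel_config[4];      /* per-accelerator configuration words              */
} operations_pass_cmd_t;
```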
The commands for an EP hardware pass may identify what program or function the embedded processor is to execute and what input data it is to process. The command decoder 416 may then be configured to, in response to processing commands for an EP hardware pass, cause the embedded processor 414 to execute the identified program using the identified input data.
The commands for a branch hardware pass may comprise one or more branch commands which notify the command decoder 416 that it is to branch to (i.e., start processing) another command stream. The command decoder 416 may support both unconditional branching and conditional branching. In such cases, the commands for a branch hardware pass may indicate whether the branch is a conditional branch or an unconditional branch. In unconditional branching the next command stream to be processed is known or predetermined. Accordingly, when the branch is an unconditional branch, the branch commands may identify the next command stream to be processed. In such cases, when the command decoder 416 processes a set of one or more branch commands which notify the command decoder that an unconditional branch is to be taken to a particular command stream, the command decoder 416 may be configured to start processing the particular command stream.
In contrast, in conditional branching the next command stream to be executed is not fixed or predetermined but is based on one or more conditions or criteria. The command decoder 416 is configured to, when it processes a set of one or more branch commands that indicate that a conditional branch is to be taken, cause the embedded processor 414 to determine which command stream is to be processed next. As described above, the embedded processor is configured to execute programs or functions. Accordingly, when the branch is a conditional branch, the branch commands may identify a program or function that the embedded processor is to execute which causes the embedded processor 414 to determine, using the desired criteria and/or conditions, which command stream is to be processed and to notify the command decoder 416 of the determined command stream via the hardware feedback path. Once the command decoder 416 has been notified by the embedded processor 414 of the command stream to process next, the command decoder 416 may be configured to start processing that command stream.
Reference is now made to
As described above, the command decoder 416 is configured to, when it processes a set of branch commands, of a command stream, that indicate that the command decoder 416 is to branch to, or jump to, another command stream and the other command stream is not known (e.g. it is a conditional branch), cause the embedded processor 414 to identify the command stream that is to be processed or executed next using desired criteria and/or conditions and notify the command decoder 416 of the identified command stream via the hardware feedback path. As described above, the embedded processor 414 is configured to execute programs or functions. Accordingly, in some cases, the command decoder 416 may be configured to cause the embedded processor 414 to identify which command stream is to be executed next by causing the embedded processor 414 to execute a specific program or function that is designed to assess the desired criteria and/or conditions, select a command stream based on the assessment, and notify the command decoder 416 of the selected command stream. The command decoder 416 may be able to cause the embedded processor 414 to execute a specific program or function by notifying the embedded processor 414 that it is to execute a program or function, and providing the embedded processor 414 with information that identifies the program or function. In some cases, the embedded processor 414 may be able to execute one of a plurality of pre-compiled programs or functions each of which can be uniquely identified by an index. In these cases, the information identifying a program or function may comprise an index associated with the program or function. In some cases, in addition to identifying the program or function that the embedded processor 414 is to execute the command decoder 416 may be able to provide the embedded processor 414 with parameters or arguments which are to be used in executing the identified program or function.
In some cases, the NNA core 402 may comprise a plurality of embedded processor (EP) control registers 502 that control the operation of the embedded processor 414. In such cases, the embedded processor 414 may be configured to monitor the EP control registers 502 and operate in accordance with the information in the EP control registers 502. For example, the embedded processor 414 may be configured to read, and operate in accordance with, certain registers in response to an interrupt. In some cases, as shown in
Table 1 lists an example set of EP control registers 502 which the command decoder 416 may be able to write to, to control the operation of the embedded processor 414. It will be evident to a person of skill in the art that this is an example set of EP control registers and that in other examples there may be additional registers, fewer registers and/or different registers.
For the example set of EP control registers 502 in Table 1, the command decoder 416 may cause the embedded processor 414 to execute a specific program or function by writing an index of the program to the EP_OPERATION_CTRL.OPERATION_IDX register, optionally writing any arguments/parameters or the address of any arguments/parameters to the EP_ARG0 register and/or the EP_ARG1 register, and notifying the embedded processor (e.g. via an interrupt) that a new program or function is to be executed by the embedded processor. The notification causes the embedded processor to read the registers in Table 1 to identify the function or program to be executed next and execute the identified function or program.
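By way of illustration only, the following sketch (in C) shows this write sequence, assuming memory-mapped registers at hypothetical addresses and a hypothetical interrupt mechanism; only the register names are taken from the description above.

```c
#include <stdint.h>

/* Illustrative sequence by which the command decoder starts a program on the
 * embedded processor. Register addresses are hypothetical. */
#define EP_OPERATION_CTRL  (*(volatile uint32_t *)0x1000u) /* hypothetical offset */
#define EP_ARG0            (*(volatile uint32_t *)0x1004u) /* hypothetical offset */
#define EP_ARG1            (*(volatile uint32_t *)0x1008u) /* hypothetical offset */

extern void raise_ep_interrupt(void); /* hypothetical notification mechanism */

void start_ep_program(uint32_t operation_idx, uint32_t arg0, uint32_t arg1)
{
    EP_ARG0 = arg0;                    /* arguments, or the address of the arguments   */
    EP_ARG1 = arg1;
    EP_OPERATION_CTRL = operation_idx; /* OPERATION_IDX field identifying the program  */
    raise_ep_interrupt();              /* embedded processor reads the registers and runs */
}
```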
In some cases, the instructions or code forming the programs or functions that the embedded processor 414 can execute are stored in memory accessible to the embedded processor 414. In some cases, the memory storing the instructions or code forming the programs or functions that the embedded processor 414 can execute may be internal to the embedded processor 414. In other cases, the memory storing the instructions or code forming the programs or functions that the embedded processor 414 can execute may be internal to the NNA core, but external to the embedded processor. For example, as shown in
As described above, the embedded processor 414 is configured to, once it has identified the next command stream to be processed by the command decoder, notify the command decoder 416 of the identified command stream. In some cases, when the neural network accelerator is configured to execute a dynamic neural network, the command streams that form that dynamic neural network are loaded into memory (e.g. external memory 426) accessible to the neural network accelerator (and the command decoder 416 thereof). Then the command decoder 416 is provided with information identifying the address (e.g. a pointer, NN_CMD_BASE_ADDRESS) of the first command stream. This address may be referred to as the base address. The command decoder 416 then reads the commands of that command stream from memory and controls the other components of the neural network accelerator core in accordance with the commands. In such cases, the embedded processor 414 may be configured to notify the command decoder of the identified command stream by providing the command decoder 416 with information identifying the location in memory of the next command stream to be executed.
In some cases, the information identifying the location in memory of a command stream may comprise information identifying an offset with respect to the base address (e.g. NN_CMD_BASE_ADDRESS). In some cases, the information identifying an offset with respect to the base address for a command stream may comprise the value of the offset. When the information identifying the offset is the value of the offset, the command decoder 416 may be configured to generate the address of the next command stream to be processed by adding the value of the offset to the base address (e.g. NN_CMD_BASE_ADDRESS). However, in other cases, a command stream may comprise commands that define a BRANCH_TABLE with one or more entries wherein each entry comprises a unique index and an offset for a command stream. In such cases, the information identifying an offset may comprise an index to the BRANCH_TABLE. When the information identifying the offset is an index to the BRANCH_TABLE, the command decoder 416 may be configured to retrieve the offset corresponding to the identified index from the BRANCH_TABLE, and generate the address of the next command stream to be processed by adding the retrieved offset to the base address (e.g. NN_CMD_BASE_ADDRESS). The number of entries in the BRANCH_TABLE may be configurable. The BRANCH_TABLE may have a maximum number of entries, which may be 16 in one example. In yet other cases, the embedded processor 414 may be able to identify the offset in either manner (e.g. by providing the value of the offset, or by providing an index to the BRANCH_TABLE).
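By way of illustration only, the following sketch (in C) shows how the command decoder might resolve the address of the next command stream from either an offset value or a BRANCH_TABLE index; the data structure layout is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative resolution of the next command stream address: the embedded
 * processor supplies either an offset directly, or an index into the
 * BRANCH_TABLE defined by the current command stream; in both cases the
 * offset is added to the base address (NN_CMD_BASE_ADDRESS). */
typedef struct {
    uint64_t offset[16];   /* per-entry offsets; 16 is the example maximum number of entries */
    size_t   num_entries;
} branch_table_t;

uint64_t resolve_next_stream(uint64_t base_addr, const branch_table_t *table,
                             int use_branch_idx, uint32_t branch_idx, uint64_t offset)
{
    if (use_branch_idx)
        offset = table->offset[branch_idx]; /* look the offset up in the BRANCH_TABLE */
    return base_addr + offset;              /* base address plus offset gives the stream address */
}
```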
Having the embedded processor select and provide an index to the BRANCH_TABLE is preferable since it means that the embedded processor does not have to know anything about the actual address and simply has to select from one of a fixed number of options. In these cases, the embedded processor just needs to make the decision but does not need to determine the actual offset. For example, in a looping use case, it may be desirable to keep executing the command stream identified by Index=0 until the loop exit condition is met, at which point execution jumps to the command stream identified by Index=1. Furthermore, as described in more detail below, having the next command stream be one of a plurality of predetermined command streams allows the command decoder to pre-fetch one or more of those predetermined command streams on the basis that one of the command streams is likely to be processed next.
Table 2 shows example command fields in a command stream which may be used to define a BRANCH_TABLE, where UINTx is an x-bit unsigned integer, and INTx is an x-bit signed integer. Specifically, in this example, if there are Y entries in the BRANCH_TABLE there are 2Y HDR_STREAM_BRANCH_CTRL fields and two HDR_STREAM_BRANCH_CTRL fields are used to define each entry of the BRANCH_TABLE. Specifically, if the entries are numbered from 0 to Y−1, then the (2*N)th and the (2*N+1)th HDR_STREAM_BRANCH_CTRL fields are used to define the Nth entry. For example, HDR_STREAM_BRANCH_CTRL0 and HDR_STREAM_BRANCH_CTRL1 define the 0th entry. In this example, the first of each pair of fields is used to define the LSBs of the offset and the second field of the pair is used to define the MSBs of the offset. However, this is an example only. The DISABLE_BRANCH_PREFETCH and PREFETCH_SIZE_MIN1 portions of these fields will be described below.
In some cases, the NNA core 402 may comprise a plurality of command decoder (CMD) control registers 506 that control the operation of the command decoder 416. In such cases, the command decoder 416 may be configured to monitor the CMD control registers 506 and operate in accordance with the information in the CMD control registers 506. In some cases, as shown in
Table 3 lists an example set of CMD control registers 506 which the embedded processor 414 may be able to write to, to control the operation of the command decoder 416. The purpose of some of the registers listed in Table 3 will be described in more detail below. It will be evident to a person of skill in the art that this is an example only and that in other examples fewer registers, additional registers or a completely different set of registers may be used by the embedded processor 414 to control the operation of the command decoder 416.
For the example set of CMD control registers 506 in Table 3, the embedded processor 414 may, once it has identified the next command stream to be processed, notify the command decoder 416 of the next command stream to be processed by (1) setting the EP_OP_COMPLETION.STATE register to 0x0 to indicate that the embedded processor 414 has completed the processing to determine the next command stream, setting the EP_OP_COMPLETION.USE_BRANCH_IDX register to 1 to indicate that the embedded processor 414 will provide an index to the BRANCH_TABLE (as opposed to an offset value), and setting the EP_OP_COMPLETION.BRANCH_IDX register to the index associated with the next command stream to be processed; or (2) setting the EP_OP_COMPLETION.STATE register to 0x0 to indicate it has completed the processing to determine the next command stream, setting the EP_OP_COMPLETION.USE_BRANCH_IDX register to 0 to indicate that the embedded processor 414 will provide the offset value (as opposed to an index), and setting the EP_CMD_BASE_OFFSET_LSB and EP_CMD_BASE_OFFSET_MSB registers to the LSBs and MSBs of the offset respectively.
In some cases, the EP_OP_COMPLETION, EP_CMD_BASE_OFFSET_LSB, and EP_CMD_BASE_OFFSET_MSB registers are connected to the command decoder via a single interface, which may be referred to as the ep_cmd_done interface. In such cases, the command decoder may be notified via this interface (the ep_cmd_done interface) of any changes to these registers. In some cases, a trigger signal may be sent to the command decoder on this interface when the EP_OP_COMPLETION register is written to/updated which causes the command decoder to act in accordance with these registers—e.g. determine the next command stream from the provided information and execute the next command stream. In such cases, if the embedded processor is providing an offset to the command decoder then the embedded processor may be configured to write to the offset registers (EP_CMD_BASE_OFFSET_LSB and EP_CMD_BASE_OFFSET_MSB) prior to writing or updating the EP_OP_COMPLETION register.
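By way of illustration only, the following sketch (in C) shows the two notification options described above from the point of view of the embedded processor; the register addresses and the packing of the fields within the EP_OP_COMPLETION register are hypothetical, and the offset registers are written before the EP_OP_COMPLETION register so that the trigger on the ep_cmd_done interface sees consistent values.

```c
#include <stdint.h>

/* Illustrative write sequences only; register addresses and field packing are
 * hypothetical. Updating EP_OP_COMPLETION is assumed to trigger the
 * ep_cmd_done interface, so it is written last. */
#define EP_OP_COMPLETION        (*(volatile uint32_t *)0x2000u) /* hypothetical */
#define EP_CMD_BASE_OFFSET_LSB  (*(volatile uint32_t *)0x2004u) /* hypothetical */
#define EP_CMD_BASE_OFFSET_MSB  (*(volatile uint32_t *)0x2008u) /* hypothetical */

/* Hypothetical packing: STATE in bits [3:0], USE_BRANCH_IDX in bit 4,
 * BRANCH_IDX in bits [8:5]. */
static uint32_t pack_completion(uint32_t state, uint32_t use_idx, uint32_t idx)
{
    return (state & 0xFu) | ((use_idx & 0x1u) << 4) | ((idx & 0xFu) << 5);
}

void notify_next_stream_by_index(uint32_t branch_idx)
{
    EP_OP_COMPLETION = pack_completion(0x0u, 1u, branch_idx); /* option (1): BRANCH_TABLE index */
}

void notify_next_stream_by_offset(uint64_t offset)
{
    EP_CMD_BASE_OFFSET_LSB = (uint32_t)(offset & 0xFFFFFFFFu); /* offset written first        */
    EP_CMD_BASE_OFFSET_MSB = (uint32_t)(offset >> 32);
    EP_OP_COMPLETION = pack_completion(0x0u, 0u, 0u);          /* option (2): offset value     */
}
```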
As described in Table 3, the embedded processor can update the address of an Alt table using the EP_ALT_ADDR_UPDATE_LSB and EP_ALT_ADDR_UPDATE_MSB registers. The Alt table is a table that maps virtual addresses used by an NNA core to physical addresses in the external memory. The Alt table allows the same command stream to be used with different data. For example, if the same command stream is to be used to run multiple (e.g. 16) different inputs, the embedded processor can adjust the Alt table for each input set of data to point to the appropriate input data. For example, the embedded processor can control a loop to execute the command stream with 16 different inputs by adding an offset to the Alt table after each iteration of the loop.
In some cases, the EP_ALT_ADDR_UPDATE_LSB, and EP_ALT_ADDR_UPDATE_MSB registers are connected to the command decoder via a single interface, which may be referred to as the ep_cmd_alt_address interface. In such cases, the command decoder may be notified via this interface (the ep_cmd_alt_address interface) of any changes to these registers. In some cases, a trigger signal may be sent to the command decoder on this interface when the EP_ALT_ADDR_UPDATE_MSB register is written to, or updated, which causes the command decoder to act in accordance with these registers—e.g. update the address of the Alt table. In such cases, if the embedded processor is to update/adjust both the MSBs and LSBs of the Alt table address, then the embedded processor may be configured to write or update the LSB register (the EP_ALT_ADDR_UPDATE_LSB register) prior to writing or updating the MSB register (the EP_ALT_ADDR_UPDATE_MSB register). In some cases, the command decoder may be configured to, regardless of when the command decoder receives updated Alt table information, only use the updated Alt table information for the next command stream. In other words, the command decoder may not switch between Alt tables within a command stream. This can prevent a race condition.
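By way of illustration only, the following sketch (in C) shows the embedded processor updating the Alt table address between loop iterations, writing the LSB register before the MSB register as described above; the register addresses and the per input stride are hypothetical.

```c
#include <stdint.h>

/* Illustrative update of the Alt table address before the next iteration of a
 * loop that runs the same command stream over several input sets. Per the
 * description above, the LSB register is written before the MSB register
 * because the MSB write is assumed to trigger the ep_cmd_alt_address
 * interface, and the new address only takes effect for the next command
 * stream. Register addresses are hypothetical. */
#define EP_ALT_ADDR_UPDATE_LSB (*(volatile uint32_t *)0x2010u) /* hypothetical */
#define EP_ALT_ADDR_UPDATE_MSB (*(volatile uint32_t *)0x2014u) /* hypothetical */

void point_alt_table_at_next_input(uint64_t alt_table_base, uint64_t per_input_stride,
                                   uint32_t iteration)
{
    uint64_t addr = alt_table_base + (uint64_t)iteration * per_input_stride;
    EP_ALT_ADDR_UPDATE_LSB = (uint32_t)(addr & 0xFFFFFFFFu);
    EP_ALT_ADDR_UPDATE_MSB = (uint32_t)(addr >> 32); /* triggers the update */
}
```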
In some cases, the commands forming the command streams of a dynamic neural network are loaded into memory accessible to the command decoder 416. For example, as shown in
As described above, the command decoder 416 is configured to, when it processes a set of branch commands that indicate that the command decoder 416 is to branch to, or jump to, a next command stream and the next command stream is not known (e.g. it is a conditional branch), cause the embedded processor 414 to identify which command stream is to be processed next using desired criteria and/or conditions and notify the command decoder 416 of the identified command stream via a hardware feedback path. In these cases, the command decoder 416 may wait until the embedded processor 414 has identified the next command stream and notified the command decoder 416 of the identified command stream before the command decoder 416 retrieves the commands of that command stream from external memory 426. However, this can cause a delay in starting the processing of the next command stream. Accordingly, the command decoder 416 may be configured to mask this delay by proactively pre-fetching at least a portion of the commands for one or more likely next command streams from the memory (e.g. external memory 426).
Where the original command stream comprises commands defining a BRANCH_TABLE as described above, the command decoder 416 may be configured to pre-fetch at least a portion of the commands of the command streams associated with the first predetermined number, X, of entries of the BRANCH_TABLE. In other words, the command decoder 416 may be configured to pre-fetch at least a portion of the command streams at the addresses in memory identified by the first X offsets in the BRANCH_TABLE. The BRANCH_TABLE may then be configured such that information for the most likely command streams to be processed is in the top X entries. In some cases, the amount of a command stream that is pre-fetched may be configurable and in some cases, the amount of a command stream that is pre-fetched may be configurable on a per command stream basis. For example, in addition to each entry of the BRANCH_TABLE comprising an index and an offset, each entry may also comprise information indicating a pre-fetch size (e.g. this may be defined in the command stream via the IDX<N>_PREFETCH_SIZE_MIN1 field of Table 2). The pre-fetch sizes may be selected based on the likelihood that the command stream is the command stream that will be processed next. For example, the pre-fetch sizes may be set such that the higher the likelihood that a command stream is the next command stream, the higher the pre-fetch size, and the lower the likelihood that a command stream is the next command stream, the lower the pre-fetch size.
In some cases, pre-fetching may be selectively disabled. In some cases, pre-fetching may be selectively disabled on a per command stream basis. For example, the command stream may specify (e.g. via the IDX<N>_DISABLE_BRANCH_PREFETCH field of Table 2) on a per BRANCH_TABLE entry basis whether pre-fetching of the associated command stream is enabled or disabled. Where only the top X entries are pre-fetched, whether the pre-fetching is explicitly disabled or enabled may only affect the top X entries. For example, if the command decoder 416 is configured to pre-fetch the command streams associated with the first two entries of the BRANCH_TABLE, even if pre-fetch is enabled for the third entry, the command stream associated with the third entry is not pre-fetched.
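The entry format and top-X pre-fetch behaviour described above can be illustrated with a short sketch. The following Python fragment is a minimal model only; the class name BranchTableEntry, the function prefetch_candidates and the field names are illustrative assumptions and are not intended to match the actual BRANCH_TABLE encoding (e.g. the fields of Table 2).

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BranchTableEntry:
    index: int             # unique index of the entry (e.g. IDX<N>)
    offset: int            # offset of the command stream from the base address
    prefetch_size: int     # amount of the command stream to pre-fetch
    prefetch_disabled: bool = False

def prefetch_candidates(table: List[BranchTableEntry], base_address: int,
                        x: int = 2) -> List[Tuple[int, int]]:
    """Return (address, size) pairs for the first X entries of the table
    for which pre-fetching has not been disabled."""
    candidates = []
    for entry in table[:x]:                  # only the top X entries are considered
        if entry.prefetch_disabled:
            continue
        candidates.append((base_address + entry.offset, entry.prefetch_size))
    return candidates

# Example: the most likely next command stream (index 0) is given a larger pre-fetch size.
table = [
    BranchTableEntry(index=0, offset=0x0000, prefetch_size=1024),
    BranchTableEntry(index=1, offset=0x4000, prefetch_size=256),
    BranchTableEntry(index=2, offset=0x8000, prefetch_size=256),  # never pre-fetched when X = 2
]
print(prefetch_candidates(table, base_address=0x1000_0000, x=2))
```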
Reference is now made to
The method 600 begins at block 602 where the command decoder 416 determines whether the set of branch commands relates to an unconditional branch (i.e., the next command stream to execute is predetermined) or a conditional branch (i.e., the next command stream to execute is not predetermined and is based on one or more criteria or conditions). In some cases, a set of branch commands may comprise a field, such as a BRANCH_STREAM field, that indicates whether the set of branch commands relates to an unconditional or conditional branch. In such cases, the command decoder 416 may be configured to analyse the BRANCH_STREAM field to determine whether the set of branch commands relates to an unconditional branch or a conditional branch. If it is determined that the set of branch commands relates to an unconditional branch then the method 600 proceeds to block 604. If, however, it is determined that the set of branch commands relates to a conditional branch then the method 600 proceeds to block 616.
At block 604, the command decoder 416 determines the next command stream to execute from the set of branch commands. In other words, for an unconditional branch information identifying the next command stream is embedded in the set of branch commands and the command decoder 416 analyses the set of branch commands to determine the next command stream to execute. In some cases, when a set of branch commands relates to an unconditional branch the set of branch commands may comprise information identifying a BRANCH_TABLE, as described above, and the first entry (e.g. the entry related to index 0) in the BRANCH_TABLE identifies the next command stream to execute. In such cases, the command decoder 416 may be configured to analyse the set of branch commands to identify the BRANCH_TABLE and determine that the next command stream to execute is the command stream identified (or pointed to) by the first entry. For example, where each entry in the BRANCH_TABLE includes the offset (with respect to the base address) of a command stream, the command decoder may determine that the address of the next command stream is the base address+the offset in the first entry of the BRANCH_TABLE. In other cases, the set of branch commands may include a special field, or the like, that explicitly identifies (e.g. via an offset) the next command stream to be executed.
In some cases, once the next command stream to be executed has been identified from the set of branch commands, the command decoder records information identifying the next command stream. For example, the command decoder may have a set of registers that are used to store the information about the command stream that is to be executed. In such cases, the command decoder may be configured to store information in the appropriate registers that identify the next command stream to be executed. For example, the command decoder may comprise a register that stores the offset of the next command stream to be executed and the command decoder may be configured to update this register with the offset of the next command stream to execute.
Once the next command stream to be executed has been identified from the set of branch commands (and optionally the command decoder has recorded information identifying the next command stream) the method 600 may proceed to block 606, 608 or 614. Specifically, if the command decoder supports pre-fetching (e.g. because the command streams are stored external to the command decoder) the method 600 may proceed to block 606; if the command decoder allows the embedded processor to be used to perform some processing before the next command stream is executed, the method 600 may proceed to block 608; and if the command decoder does not support either pre-fetching or allowing the embedded processor to perform some processing before the next command stream is executed, the method 600 may proceed directly to block 614.
At block 606, the command decoder may pre-fetch (or load) all or a portion of the next command stream determined in block 604 into internal storage (e.g. buffer 508) of the command decoder 416 from memory, or storage, external to the command decoder 416 (e.g. external memory 426). The amount of a command stream that is pre-fetched may be configurable and may be configurable on a per command stream basis. For example, as described above, each entry of a BRANCH_TABLE may comprise, in addition to an offset to a particular command stream, information identifying the amount of the command stream that is to be pre-fetched. Once the pre-fetching of the next command stream has been started, the method 600 may proceed to block 608 or block 614 depending on whether the command decoder allows the embedded processor to be used to perform some processing before the next command stream is executed.
At block 608, the command decoder determines from the set of branch commands whether the embedded processor (EP) is to perform some processing before the next command stream is executed. For example, the embedded processor may be used to: generate parameters etc. that are used in executing the next command stream; perform some tidying-up; notify another device in the system; and/or perform some post processing. However, it will be evident to a person of skill in the art that these are just examples of the processing that the embedded processor may be used to perform and that the embedded processor may be used to perform any suitable processing. In some cases, the set of branch commands may comprise a field, or the like, such as a KICK_EP_OPERATION field, which indicates whether the embedded processor is to be used to perform some processing before the next command stream is executed. In such cases, the command decoder may be configured to analyse such a field of the set of branch commands to determine whether the embedded processor is to be used to perform some processing. If it is determined that the embedded processor is to be used to perform some processing before the next command stream is executed then the method 600 proceeds to block 610. If, however, it is determined that the embedded processor is not to be used to perform some processing then the method 600 proceeds to block 614.
At block 610, the command decoder causes the embedded processor to perform some processing. As described above, the command decoder may be configured to cause the embedded processor to perform some processing by notifying the embedded processor that it is to execute a predetermined program or function. Each program or function that the embedded processor can execute may be identified by an index, and the command decoder may identify the program or function that is to be executed via its index. The command decoder may also be able to provide one or more arguments or a pointer to arguments. In some cases, the command decoder may notify the embedded processor that it is to execute a predetermined program by writing to one or more EP control registers and raising an interrupt. For example, the command decoder may write the index of the program or function to execute and any arguments (or pointer to arguments) to that program or function to one or more EP control registers and raise an interrupt to notify the embedded processor that it should execute the program or function identified by those registers. Once the command decoder has caused the embedded processor to perform some processing, the method 600 proceeds to block 612.
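By way of illustration only, the following Python sketch models the write-registers-then-interrupt handshake described for block 610. The register names EP_FUNC_INDEX and EP_ARGS_PTR and the stub classes are hypothetical and do not correspond to any particular EP control register map.

```python
class EmbeddedProcessorStub:
    """Minimal stand-in for the embedded processor: reads its control
    registers and runs the identified function when interrupted."""
    def __init__(self, functions):
        self.registers = {}
        self.functions = functions           # index -> callable

    def interrupt(self):
        index = self.registers["EP_FUNC_INDEX"]
        args_ptr = self.registers.get("EP_ARGS_PTR")
        return self.functions[index](args_ptr)

def kick_ep(ep: EmbeddedProcessorStub, function_index: int, args_ptr=None):
    """Command-decoder side: write the EP control registers, then raise an interrupt."""
    ep.registers["EP_FUNC_INDEX"] = function_index
    if args_ptr is not None:
        ep.registers["EP_ARGS_PTR"] = args_ptr
    return ep.interrupt()

# Example: function 3 might perform some tidying-up before the next command stream.
ep = EmbeddedProcessorStub({3: lambda args: f"tidy-up done (args at {args})"})
print(kick_ep(ep, function_index=3, args_ptr=0x2000))
```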
At block 612, the command decoder waits until the embedded processor has completed the requested processing. In some cases, the embedded processor may be configured to, once it has executed an identified program or function, notify the command decoder (e.g. by writing to one or more command decoder control registers). Once the command decoder has determined that the embedded processor has completed the processing, the method 600 proceeds to block 614.
At block 614, the command decoder starts executing or processing the next command stream. Executing or processing the next command stream comprises controlling the operation of the hardware accelerators and the embedded processor in accordance with the commands of the next command stream.
At block 616, which is implemented after it has been determined at block 602 that the set of branch commands relates to a conditional branch, the command decoder causes the embedded processor to identify the command stream that is to be processed or executed next using specific criteria and/or conditions. As described above, in some cases the command decoder may be configured to cause the embedded processor to identify which command stream is to be executed or processed next by causing the embedded processor to execute a predetermined program or function that causes the embedded processor to assess the specific criteria and/or conditions and provide the command decoder with information identifying the command stream to be executed next. The command decoder may be configured to cause the embedded processor to execute a specific program or function by notifying the embedded processor 414 that it is to execute a program or function and providing the embedded processor 414 with information that identifies the program or function. In some cases, the programs or functions that the embedded processor can execute are each identified by a unique index and the information provided to the embedded processor is the index of the program or function to be executed. The command decoder may also be able to provide the embedded processor with parameters or arguments which may be used in executing the identified program or function. In some cases, the operation of the embedded processor may be controlled by one or more embedded processor control registers and the command decoder may be configured to cause the embedded processor to execute a program or function by writing information (e.g. index) identifying the program or function to be executed by the embedded processor to one or more EP control registers, optionally, writing any parameters or a pointer to parameters to one or more EP control registers, and issuing an interrupt to the embedded processor which causes the embedded processor to read the relevant EP control registers and execute the function or program identified thereby.
Once the command decoder has caused the embedded processor to identify the next command stream to execute, the method 600 proceeds either to block 618 or block 622. Specifically, where the command decoder supports proactively pre-fetching one or more command streams before the next command stream to be executed has been determined, the method may proceed to block 618. If, however, proactive pre-fetching is not supported then the method may proceed directly to block 622.
At block 618, the command decoder determines, from the set of branch commands, if proactive pre-fetching of command streams, before the next command stream to be executed has been identified by the embedded processor, has been disabled. As described above, in some cases the command decoder may be configured to, when it is processing a conditional branch, proactively pre-fetch a predetermined number, X, of likely command streams to mask the delay in fetching the next command stream after the embedded processor has identified the next command stream. As described above, and in more detail below with respect to block 620, in some cases a set of branch commands may identify a BRANCH_TABLE that comprises one or more entries wherein each entry identifies, or points to, a command stream, and the command decoder is configured to pre-fetch the command streams identified in the first X entries in the BRANCH_TABLE. However, in some cases, the proactive pre-fetching may be selectively disabled. For example, where the branch commands define a BRANCH_TABLE, each entry in the BRANCH_TABLE may indicate whether the corresponding command stream is to be proactively pre-fetched. In such cases, the command decoder may be configured to determine that proactive pre-fetching has been disabled if the BRANCH_TABLE indicates that pre-fetching has been disabled for each of the first X entries of the BRANCH_TABLE.
If it has been determined that proactive pre-fetching of command streams, before the next command stream has been identified by the embedded processor, has been disabled the method 600 proceeds directly to block 622. If, however, it has been determined that pre-fetching of command streams, before the next command stream has been identified by the embedded processor, has not been disabled then the method 600 proceeds to block 620.
At block 620, all or a portion of one or more command streams are pre-fetched (or loaded) into local storage (e.g. buffer 508) of the command decoder from memory, or storage, external to the command decoder (e.g. external memory 426). As described above, in some cases, a set of branch commands may define a BRANCH_TABLE that comprises one or more entries wherein each entry identifies a command stream, and the command decoder is configured to pre-fetch each command stream identified in the first X entries in the BRANCH_TABLE for which pre-fetching has not been disabled. Pre-fetching a command stream may comprise identifying the address in memory of the command stream and reading all or a portion of the command stream from memory and storing the read portion of the command stream in local storage (e.g. buffer). Where each BRANCH_TABLE entry comprises an offset from a base address from which the corresponding command stream is stored in memory, the command decoder may be configured to determine the address of a command stream associated with a BRANCH_TABLE entry to be the base address+the offset stored in that entry.
The amount or portion of a command stream that is pre-fetched may be configurable and may be configurable on a per command stream basis. For example, as described above, each entry of a BRANCH_TABLE may comprise, in addition to information identifying a command stream (e.g. offset), information identifying the amount or portion of the command stream that is to be pre-fetched. In such cases, the command decoder may determine the amount or portion of a command stream to be pre-fetched from the BRANCH_TABLE.
Once all or a portion of one or more command streams have been pre-fetched, the method 600 proceeds to block 622.
At block 622, the command decoder waits until it receives information from the embedded processor indicating which command stream is to be executed or processed next. Once the command decoder has received information from the embedded processor indicating which command stream is to be executed or processed next, the method 600 proceeds to block 624.
At block 624, the command decoder determines from the information received from the embedded processor which command stream is to be executed next. As described above, in some cases the embedded processor may be configured to identify the command stream to be executed or processed next by providing one of: (i) an index into a BRANCH_TABLE defined in the set of branch commands; or (ii) an offset from which the address in memory of the command stream can be determined.
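The ordering of blocks 616 to 626 (and the return to block 614) may be easier to follow in sketch form. The Python fragment below simply traces, for a hypothetical BRANCH_TABLE, the order in which the command decoder might act; the tuple representation of table entries, the fixed pre-fetch depth X and all names are illustrative assumptions.

```python
def handle_conditional_branch(trace, branch_table, ep_choice_index, base_address, x=2):
    """Trace blocks 616-626 of method 600 for a conditional branch.
    branch_table: list of (offset, prefetch_size, prefetch_disabled) tuples.
    ep_choice_index: the BRANCH_TABLE index the embedded processor eventually returns."""
    trace.append("block 616: kick embedded processor to identify the next command stream")
    prefetched = set()
    for offset, size, disabled in branch_table[:x]:            # blocks 618/620
        if not disabled:
            address = base_address + offset
            prefetched.add(address)
            trace.append(f"block 620: pre-fetch {size} bytes from {hex(address)}")
    trace.append("block 622: wait for the embedded processor's response")
    next_address = base_address + branch_table[ep_choice_index][0]   # block 624
    if next_address not in prefetched:                         # block 626
        trace.append(f"block 626: fetch command stream at {hex(next_address)}")
    trace.append(f"block 614: execute command stream at {hex(next_address)}")
    return trace

print("\n".join(handle_conditional_branch(
    [], [(0x0000, 1024, False), (0x4000, 256, False), (0x8000, 256, False)],
    ep_choice_index=2, base_address=0x1000_0000)))
```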
Reference is now made to
At block 704, the command decoder identifies the index provided by the embedded processor. In some cases, the embedded processor may use a specific field and/or register to provide the index of the next command stream to be executed. For example, as described above with respect to Table 3, the embedded processor may store the index of the BRANCH_TABLE entry that is associated with the next command stream to be executed in the EP_OP_COMPLETION.BRANCH_IDX field or register. In such cases, the command decoder may read this field or register to identify the index provided by the embedded processor. Once the index provided by the embedded processor has been identified the method proceeds to block 706.
At block 706, the command decoder obtains the offset for the command stream to be executed or processed next from the BRANCH_TABLE using the index identified in block 704. For example, if the provided index is 3 then the command decoder may read the entry of the BRANCH_TABLE corresponding to index 3 and extract the offset from the read entry. The method then proceeds to block 710.
At block 708, which is implemented if an index has not been provided by the embedded processor meaning that an offset has been explicitly provided, the command decoder identifies the offset provided by the embedded processor. In some cases, the embedded processor may use a specific field and/or register to provide the offset for the next command stream to be executed. For example, as described above with respect to Table 3, the embedded processor may store the offset for the next command stream to be executed in a combination of the EP_CMD_BASE_OFFSET_MSB and EP_CMD_BASE_OFFSET_LSB fields/registers. In such cases, the command decoder may read these fields or registers to identify the relevant offset. Once the relevant offset has been identified the method proceeds to block 710.
At block 710, the command decoder computes the address of the next command stream from the offset obtained or identified in block 706 or block 708. In some cases, the command decoder may compute the address of the next command stream to be the sum of the offset and the base address (e.g. NN_BASE_CMD_BASE_ADDRESS). Once the command decoder computes the address of the next command stream, the method may end.
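A minimal sketch of blocks 704 to 710 is given below. It assumes, purely for illustration, that the MSB register holds the upper 32 bits of the offset; the dictionary keys and the function name are hypothetical and do not correspond to the actual fields of Table 3.

```python
def resolve_next_stream_address(response: dict, branch_table_offsets: list,
                                base_address: int) -> int:
    """Resolve the address of the next command stream from the embedded
    processor's response. The response either carries a BRANCH_TABLE index
    (blocks 704/706) or an explicit offset (block 708)."""
    if "branch_idx" in response:                                  # cf. EP_OP_COMPLETION.BRANCH_IDX
        offset = branch_table_offsets[response["branch_idx"]]     # block 706
    else:                                                         # cf. EP_CMD_BASE_OFFSET_MSB/LSB
        offset = (response["offset_msb"] << 32) | response["offset_lsb"]  # block 708 (assumed split)
    return base_address + offset                                  # block 710

offsets = [0x0000, 0x4000, 0x8000, 0xC000]
print(hex(resolve_next_stream_address({"branch_idx": 3}, offsets, 0x1000_0000)))
print(hex(resolve_next_stream_address({"offset_msb": 0x0, "offset_lsb": 0x4000}, offsets, 0x1000_0000)))
```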
Returning to
At block 626, if the identified command stream has not already been pre-fetched (e.g. in block 620) then all or a portion of the identified command stream is pre-fetched or loaded into internal or local storage (e.g. buffer 508) of the command decoder from memory external to the command decoder (e.g. external memory 426). Once all or a portion of the identified command stream has been loaded into internal or local storage of the command decoder, the method proceeds to block 614 where the command decoder starts executing the next command stream.
Although the NNA in
Where an NNA, such as the NNA 800 of
In scenario (i), the command decoder of each NNA core 402 may be configured to, when it reaches a set of branch commands that relate to a conditional branch, operate as described above. For example, each NNA core 402 may be configured to implement blocks 616 to 626 and 614 of the method 600 of
In scenario (ii) or (iii), the primary or master NNA core may operate differently from a secondary or slave NNA core when it reaches a set of branch commands that relate to a conditional branch. Specifically, the primary or master NNA core may be configured to, when it reaches a set of branch commands that relate to a conditional branch, cause its embedded processor to determine the next command stream to be executed, and once the command decoder has received information from the embedded processor identifying the next command stream the command decoder may be configured to notify all the NNA core(s) (including itself) of the next command stream. In contrast, a secondary or slave NNA core may be configured to, when it reaches a set of branch commands that relate to a conditional branch, simply wait until it receives information from the master NNA core identifying the next command stream, rather than causing its own embedded processor to determine the next command stream to be executed or processed.
The primary or master NNA core 402 may be configured to notify the NNA core(s) 402 of the next command stream to be executed in any suitable manner. In some cases the primary or master NNA core may be configured to notify the NNA cores of the next command stream to be executed by broadcasting information identifying the next command stream to be executed to the NNA cores. For safety and security reasons the NNA cores may not be able to directly communicate with each other. Accordingly, in one example, which is shown in
In some cases, as shown in
The information identifying the next command stream that is provided to the LSYNC module 802 (e.g. by writing that information to an LSYNC memory address) by the command decoder 416 of the master or primary NNA core 402 may be any information that identifies the next command stream. In some cases, the primary or master NNA core 402 may be configured to provide the LSYNC module 802 with the same information identifying the next command stream that the NNA core 402 received from its embedded processor. For example, if, as described above, the embedded processor 414 provides the command decoder 416 with an index into the BRANCH_TABLE, the command decoder 416 may be configured to provide the index to the LSYNC module 802 (e.g. by writing the index to an LSYNC memory address); and if the embedded processor 414 provides the command decoder with an offset (with respect to a base address), the command decoder may be configured to provide the offset to the LSYNC module 802 (e.g. by writing the received offset to an LSYNC memory address). As described above, where the embedded processor 414 provides an offset, the embedded processor may also provide a pre-fetch amount or size. In such cases, the command decoder 416 of the primary or master NNA core may also be configured to provide the received pre-fetch amount or size to the LSYNC module 802 (e.g. by writing the received pre-fetch amount or size to an LSYNC memory address).
Similarly, the LSYNC module 802 may be configured to provide (e.g. by writing to one or more command decoder control registers) the same information identifying the next command stream to be executed or processed that the LSYNC module 802 receives from the primary or master NNA core to the NNA core(s). For example, if the LSYNC module 802 receives an index from the primary or master NNA core, then the LSYNC module 802 may provide the received index to the NNA core(s) (e.g. by writing the index to one or more command decoder control registers of the NNA core(s)); and if the LSYNC module 802 receives an offset (and optionally a pre-fetch amount or size) then the LSYNC module 802 may be configured to provide the received offset (and pre-fetch amount or size) to the NNA core(s) (e.g. by writing the offset and/or the pre-fetch amount or size to one or more command decoder control registers of the NNA core(s)).
Once the command decoder of an NNA core 402 has received information identifying the next command stream to be executed or processed via the LSYNC module 802, the command decoder starts executing or processing the identified command stream as described above, which may comprise first pre-fetching a predetermined amount of the identified command stream.
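The broadcast path described above (primary core to LSYNC module to all cores) can be modelled in a few lines. In the Python sketch below the LSyncStub and NNACoreStub classes, the NEXT_STREAM_INFO register name and the dictionary payload are illustrative assumptions only.

```python
class NNACoreStub:
    """Stand-in for an NNA core's command decoder control registers."""
    def __init__(self, name):
        self.name = name
        self.control_registers = {}

class LSyncStub:
    """Stand-in for the LSYNC module: information written by the primary core is
    copied into a control register of every NNA core (including the primary)."""
    def __init__(self, cores):
        self.cores = cores

    def write(self, info: dict):
        for core in self.cores:
            core.control_registers["NEXT_STREAM_INFO"] = dict(info)

cores = [NNACoreStub("core 0 (primary)"), NNACoreStub("core 1"), NNACoreStub("core 2")]
lsync = LSyncStub(cores)

# Primary core: forward the information received from its embedded processor,
# e.g. a BRANCH_TABLE index (an offset and pre-fetch size could be sent instead).
lsync.write({"branch_idx": 1})

for core in cores:
    print(core.name, core.control_registers)
```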
Although
Reference is now made to
The method begins at block 902 where the command decoder 416 determines whether the NNA core to which it belongs is a secondary NNA core. If the NNA core is not a secondary core then the NNA core is either an independent core or a primary NNA core. The command decoder 416 may determine that it belongs to, or forms part of, a secondary NNA core if the branch commands do not indicate that the embedded processor (EP) is to be used to determine the next command stream. For example, in some cases, the set of branch commands may comprise a field, such as, but not limited to a KICK_EP_OPERATION field, which indicates whether the embedded processor is to be used to determine the next command stream. In these cases, the command decoder may determine, from this field, whether the NNA core to which it belongs is a secondary NNA core. If it is determined that the NNA core to which the command decoder belongs, or forms part of, is not a secondary NNA core then the method proceeds to block 904. If, however, it is determined that the NNA core to which the command decoder belongs, or forms part of, is a secondary NNA core then the method proceeds to block 922.
At block 904, which is implemented if the NNA core to which the command decoder belongs is not a secondary NNA core (i.e., the NNA core to which the command decoder belongs is a primary NNA core or an independent NNA core), the command decoder causes the corresponding embedded processor to determine the command stream to be executed or processed using specific criteria and/or conditions. Block 904 corresponds to block 616 of the method 600 of
Once the NNA core has caused its embedded processor to identify the next command stream to be executed or processed, the method 900 proceeds to block 906 or block 910. Specifically, where the NNA core supports proactively pre-fetching all or a portion of one or more command streams before the next command stream to be executed has been identified, the method 900 proceeds to block 906. If, however, the NNA core does not support proactive pre-fetching then the method 900 may proceed directly to block 910.
At blocks 906 and 908, which correspond to blocks 618 and 620 of the method 600 of
At block 910, the command decoder waits for information from its embedded processor identifying which command stream is to be executed or processed next (which may be referred to as a response). Once the command decoder has received information from the embedded processor identifying the next command stream to be executed, the method 900 proceeds to block 912.
At block 912, the command decoder determines if the NNA core to which it belongs, or forms part of, is a primary NNA core or an independent NNA core. A primary NNA core is one which uses its embedded processor to determine which command stream to execute next and notifies a set of NNA cores of the determination. In contrast, an independent NNA core is one which uses its embedded processor to determine which command stream to execute next but does not notify any other NNA cores of the determination. In some cases, the embedded processor may be configured to, in addition to providing the command decoder with information identifying the next command stream, provide information to the command decoder indicating whether it is to notify other NNA cores of the next command stream (which also indicates whether the NNA core is a primary NNA core or an independent core). For example, in some cases, the embedded processor may be configured to write information to a specific field or register indicating whether or not the NNA core is a primary NNA core. For example, as shown in Table 3, the embedded processor may be configured to use the EP_OP_COMPLETION.BROADCAST field or register to indicate whether the NNA core is to notify other NNA cores of the next command stream (which indicates whether the NNA core is a primary NNA core or an independent NNA core). Specifically, in one example, the embedded processor may be configured to set the EP_OP_COMPLETION.BROADCAST field to “1” when the command decoder is to broadcast the information identifying the next command stream (e.g. index or offset) to other NNA cores and set the EP_OP_COMPLETION.BROADCAST field to “0” when the command decoder is not to broadcast the information identifying the next command stream (e.g. index or offset). In such cases, the command decoder may be configured to determine that the NNA core to which it belongs is a primary NNA core if the EP_OP_COMPLETION.BROADCAST field is set to “1” and is an independent NNA core if this field is set to “0”. If it is determined that the NNA core is a primary NNA core then the method proceeds to block 914. If, however, it is determined that the NNA core is not a primary NNA core, then the NNA core is an independent NNA core, and the method 900 proceeds to block 916.
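The role decisions made at blocks 902 and 912 can be summarised in a short sketch. The Python fragment below follows the KICK_EP_OPERATION and BROADCAST fields discussed above, but representing the branch commands and the embedded processor's response as dictionaries is an illustrative assumption only.

```python
from typing import Optional

def core_role(branch_commands: dict, ep_response: Optional[dict] = None) -> str:
    """Classify the NNA core's behaviour for a conditional branch (blocks 902 and 912)."""
    # Block 902: if the branch commands do not ask this core's embedded processor to
    # determine the next command stream, the core behaves as a secondary core and waits.
    if not branch_commands.get("KICK_EP_OPERATION", False):
        return "secondary"
    # Block 912: a core whose embedded processor asks for the result to be broadcast
    # is the primary core; otherwise it is an independent core.
    if ep_response is not None and ep_response.get("BROADCAST") == 1:
        return "primary"
    return "independent"

print(core_role({"KICK_EP_OPERATION": False}))                                    # secondary
print(core_role({"KICK_EP_OPERATION": True}, {"BROADCAST": 1, "BRANCH_IDX": 2}))  # primary
print(core_role({"KICK_EP_OPERATION": True}, {"BROADCAST": 0, "BRANCH_IDX": 2}))  # independent
```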
At block 914, the command decoder (of the primary NNA core) notifies the NNA cores (e.g. the secondary NNA cores and the primary NNA core) of the next command stream to execute or process based on the information received from the embedded processor. The command decoder may notify the NNA cores of the next command stream to execute or process in any suitable manner. For example, as described above, the NNA may comprise an LSYNC module which acts as an intermediary between the NNA cores. In such cases, the command decoder (of the primary NNA core) may notify the LSYNC module of the next command stream to execute or process and the LSYNC module may notify the NNA cores of the next command stream to execute. As described above, in some cases, the command decoder may notify the LSYNC module of the next command stream to execute by writing information identifying the next command stream to an LSYNC memory address and the LSYNC module may be configured to, in response to receiving such information, write or store the information in one or more command decoder control registers of each of the NNA cores. Once the command decoder (of the primary NNA core) has caused all the NNA cores to be notified of the next command stream, the method 900 proceeds to block 926.
At block 916, the command decoder (of the independent NNA core) determines from the information received from the embedded processor (EP) the command stream to be executed next. Block 916 corresponds to block 624 of the method 600 of
At block 918, if the identified command stream has not already been pre-fetched (e.g. in block 908) then all or a portion of the identified command stream is pre-fetched or loaded into internal or local storage (e.g. buffer 508) of the command decoder from memory external to the command decoder (e.g. external memory 426). Block 918 generally corresponds to block 626 of the method 600 of
At block 920, the next command stream is executed. Executing or processing the next command stream comprises controlling the operation of the hardware accelerators and the embedded processor in accordance with the commands of the next command stream.
At blocks 922 and 924, which correspond to blocks 618 and 620 of the method 600 of
At block 926, which is implemented if the NNA core to which the command decoder belongs is a primary NNA core or a secondary NNA core, the command decoder waits to receive a notification comprising information identifying the next command stream. As described above, in some cases, the primary NNA core may be configured to notify the NNA cores of the next command stream by providing information identifying the next command stream to the LSYNC module (e.g. by writing the information (e.g. an index or an offset) to an LSYNC memory address) which causes the LSYNC module to provide that information to all the NNA cores (e.g. by writing the information (e.g. an index or an offset) to one or more registers of each NNA core). However, it will be evident to a person of skill in the art that this is an example only and that the primary or master NNA core may cause the NNA cores to be notified of the next command stream in any suitable manner. Once the command decoder of a primary or secondary NNA core has received a notification of the next command stream to be executed, the method 900 proceeds to block 928.
At block 928, the command decoder (of a primary or secondary NNA core) determines from the notification which command stream is to be executed next. As described above, in some cases a primary NNA core may be configured to cause the NNA cores to be notified of the next command stream by causing one of: (i) an index into a BRANCH_TABLE defined in the branch commands; or (ii) an offset from which the address in memory of the command stream can be determined, to be provided to the NNA cores.
Where the notification can comprise an index or an offset, block 928 may be implemented by a method similar to the method described above with respect to
In some cases, a set of branch commands in a command stream may be independent or stand-alone. Specifically, in some cases a set of branch commands in a command stream may not have any interaction with, or be related to, branch commands in another command stream. In particular, in some cases, a set of branch commands may simply (i) cause the command decoder to jump to a predetermined (if an unconditional branch) command stream; or (ii) cause the command decoder to cause the embedded processor to start and complete a function or program to determine the next command stream to be executed or processed, and branch or jump to the next command stream.
For example, as shown in
In other cases, a set of branch commands in one command stream may interact or may be related to a set of branch commands in another command stream. For example, a set of branch commands in one command stream may be configured to cause the embedded processor to start a function or process that can be updated via branch commands in another command stream. This may be a more efficient way of implementing a long loop (i.e., a loop in which a specific command stream is executed multiple times before a branch or jump is made to a different command stream).
For example, as shown in
For example, if the control loop process is counting the number of iterations of command stream 1 then the set of branch commands in the second command stream 1104 may cause the control loop process to increment the number of iterations; the control loop process may then compare the number of iterations to a threshold number of iterations to determine whether the command decoder should execute the second command stream 1104 again or another command stream. It will be evident to a person of skill in the art that using the number of iterations to determine whether to continue looping or to exit the loop is an example only and that any suitable condition(s) and/or criteria may be used to control the loop. For example, depending on the application, the control loop may use an environment matrix or the water level to determine whether to continue looping or exit the loop.
In this example the main control loop remains open or running until the last iteration of the second command stream 1104 has been completed. It can be seen in
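For illustration only, the following Python sketch shows the kind of loop-control function the embedded processor might execute on each iteration of the looped command stream. The iteration threshold and the convention that BRANCH_TABLE index 0 means "loop again" and index 1 means "exit the loop" are assumptions made for this sketch.

```python
def loop_control_step(state: dict, max_iterations: int) -> int:
    """One invocation of the loop-control process per iteration of the looped
    command stream: increment the iteration count and return the BRANCH_TABLE
    index of the next command stream (0 = loop again, 1 = exit the loop)."""
    state["iterations"] = state.get("iterations", 0) + 1
    return 0 if state["iterations"] < max_iterations else 1

state = {}
choices = [loop_control_step(state, max_iterations=3) for _ in range(4)]
print(choices)   # [0, 0, 1, 1] -> loop twice more, then branch out of the loop
```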
The ability of the command decoder to jump or branch between command streams may also allow the NNA to implement macro functions. A macro function is defined herein as a command stream that can be called from any part of a neural network and is inserted in a neural network between two other command streams. Accordingly, like any other command stream, a macro function can comprise commands for one or more hardware passes which may include one or more operations hardware passes such as, but not limited to, NNA hardware passes and EP hardware passes, and a branch hardware pass. A macro function is different from a standard, or regular, command stream in that the next command stream to be executed after the macro function is not based on the commands in the macro function, but is based on where in the sequence of commands streams the neural network is when the macro function is called. Accordingly, it cannot be determined from the commands forming a macro function which command stream is to be executed after the macro function. Therefore, when a set of branch commands causes a branch to a macro function, the set of branch commands may also comprise information indicating which command stream is to be executed after the macro function. This causes the command decoder to provide this information to the embedded processor. The embedded processor then stores this information in memory accessible to the embedded processor (e.g. in an allocated memory location of the internal memory 430). Then, when the embedded processor executes the branch commands in the macro function the embedded processor uses the information received from the command decoder to determine which command stream to execute next.
For example, as shown in
In some cases, macro function calls may be nested. For example, a command stream may call a first macro function which calls a second macro function which calls a third macro function. In such cases, each set of branch commands that calls a macro function may include commands which cause the command decoder to notify the embedded processor of the next command stream to execute after the macro function is complete (which may be referred to as the return command stream), and thus the embedded processor may keep a stack of return command streams similar to a software function call stack and pop the command stream information off the stack one at a time. For example, in the example above, after the third macro function is executed the embedded processor may pop the information off the stack indicating that it should return to the second macro function; after the second macro function is executed it may pop the information off the stack indicating that it should return to the first macro function; and after the first macro function is executed it may pop the information off the stack indicating that it should now execute the return command stream provided by the initial calling command stream.
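A minimal sketch of such a return-command-stream stack is shown below; the class and method names are hypothetical and the command streams are represented as strings purely for illustration.

```python
class ReturnStack:
    """Sketch of the return-command-stream stack the embedded processor may keep
    for nested macro function calls."""
    def __init__(self):
        self._stack = []

    def call_macro(self, return_stream: str):
        # The calling branch commands tell the command decoder which command stream
        # follows the macro; the embedded processor pushes that information here.
        self._stack.append(return_stream)

    def macro_complete(self) -> str:
        # When the macro's own branch commands execute, the most recently pushed
        # return command stream is popped and executed next.
        return self._stack.pop()

stack = ReturnStack()
stack.call_macro("command stream B")     # command stream A calls macro 1, return to B
stack.call_macro("macro 1 remainder")    # macro 1 calls macro 2, return to macro 1
print(stack.macro_complete())            # -> 'macro 1 remainder'
print(stack.macro_complete())            # -> 'command stream B'
```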
Reference is now made to
The method 1300 begins at block 1302 where the operations of the dynamic neural network are divided into a plurality of segments wherein the operations in each segment are performed as a block.
As described above, some NNAs are configured to perform a pass (e.g. forward pass/inference or backward pass) of a neural network (including a dynamic neural network) over one or more hardware passes of the NNA. A hardware pass of an NNA is defined as performing some processing using one or more components (e.g. one or more accelerators and/or the embedded processor) of the NNA. Some NNAs have hardware constraints (e.g., the size of buffers, number of convolution engines, number of pooling engines) that limit the processing that can be performed in a hardware pass, or the order in which, or number of times that, a hardware pass can use components (e.g. hardware accelerators, embedded processor) of the NNA. Where all of the processing to implement a neural network cannot be completed in a single hardware pass of the NNA, the processing may have to be split into multiple hardware passes of the NNA.
To execute a neural network on such an NNA, the hardware passes to perform a pass (e.g. forward pass/inference or backward pass) of a dynamic neural network may be identified by first mapping each layer of the dynamic neural network to a sequence of one or more low level layers, wherein a low level layer is a set of one or more neural network operations that can be performed by a single component of the NNA (e.g. by a single hardware accelerator or by the embedded processor). Once the layers of the dynamic neural network have been mapped to low level layers, the low level layers may be divided into one or more layer groups, wherein each layer group comprises a sequence of one or more low level layers that can be implemented on the NNA. The sequences of low level layers that can be implemented by the NNA depend on the components (e.g. hardware accelerators, embedded processor etc.) of the NNA and how they can be connected to process data.
Once the low level layers have been split into one or more layer groups, it is determined, for each layer group, whether that layer group can be implemented in a single hardware pass of the NNA. Specifically, depending on the NNA hardware constraints, it may not be possible to perform all of the processing associated with a layer group in the same hardware pass of the NNA. For example, the input tensor to the first layer of the layer group may be too large to be processed in a single hardware pass of the NNA. Accordingly, if it is determined that a layer group cannot be implemented in a single hardware pass of the NNA that layer group is divided into a plurality of hardware passes. An example method for identifying hardware passes to implement a pass of a neural network is described in the Applicant's UK patent application no. 2209584.8. Since the hardware passes of the NNA identified at this stage are used to implement operations they are referred to herein as operations hardware passes.
In such cases, dividing the operations of the dynamic neural network into a plurality of segments may comprise dividing the hardware passes into a plurality of segments, wherein each segment comprises a set of hardware passes that are executed as a block.
Once the operations of the dynamic neural network have been divided into a plurality of segments the method 1300 proceeds to block 1304.
At block 1304, an initial command stream is generated for each segment which, when executed by the command decoder, causes the neural network accelerator core to perform the operations in that segment. Where each segment comprises one or more operations hardware passes, the command stream for a segment comprises a command stream for each operations hardware pass in the segment. The command stream for an operations hardware pass may comprise a set of commands that cause the neural network accelerator to perform the operations in the operations hardware pass. As described above, the set of commands for an operations hardware pass may specify the components (e.g. hardware accelerators and/or embedded processor) that are active (i.e. used) in the hardware pass, how the active components are to be configured, the format of one or more data sets used in the hardware pass etc. Once a command stream for each segment has been generated, the method proceeds to block 1306.
At block 1306, a set of one or more branch commands is added to the initial command stream of each segment, except an end segment of the plurality of segments, to generate a final command stream for that segment. The set of one or more branch commands causes the neural network accelerator to perform an unconditional or conditional branch to a next command stream. A set of one or more branch commands that causes the neural network accelerator to perform a conditional branch causes the command decoder to cause the embedded processor to determine the next command stream and provide information to the command decoder identifying the next command stream. The sets of one or more branch commands added to the initial command streams effectively link the plurality of segments in a dynamic manner.
Each set of one or more branch commands may comprise (i) information identifying whether the branch is a conditional branch or an unconditional branch, (ii) if a conditional branch, information identifying the program or function that the embedded processor is to execute to identify the next command stream, and (iii) if an unconditional branch, information identifying the next command stream. For example, each set of branch commands may comprise a field, such as a BRANCH_STREAM field that indicates whether the set of branch commands relates to an unconditional or conditional branch. The BRANCH_STREAM field may, for example, be set to NONE, UNCONDITIONAL or CONDITIONAL.
In some cases, one or more sets of branch commands may comprise information defining a BRANCH_TABLE as described above. For example, the one or more sets of branch commands may comprise information identifying one or more entries of a BRANCH_TABLE where each entry is identified by a unique index and comprises an offset (e.g. from a base address) to a command stream. Each entry may further comprise information identifying a pre-fetch size or amount and/or information indicating whether pre-fetch is disabled for that entry. In some cases, the set of one or more branch commands may comprise all or some of the fields shown in Table 2 that define a BRANCH_TABLE. Where a set of branch commands relates to an unconditional branch and the set of branch commands comprises information defining a branch table, the information identifying the next command stream may be the offset in a particular entry (e.g. the first entry) of the BRANCH_TABLE.
Where the segments of block 1302 comprise one or more operations hardware passes, the set of one or more branch commands may be in the form of a set of commands to implement a branch hardware pass.
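Blocks 1304 and 1306 can be sketched from the compiler's point of view as follows. The textual command representation, the dictionary describing each set of branch commands and the function name build_final_command_streams are illustrative assumptions; only the BRANCH_STREAM values (UNCONDITIONAL/CONDITIONAL) follow the field described above.

```python
from typing import List, Optional

def build_final_command_streams(segments: List[List[str]],
                                branches: List[Optional[dict]]) -> List[List[str]]:
    """Blocks 1304/1306 in sketch form: generate an initial command stream per
    segment, then append a set of branch commands to every segment except the
    end segment. Commands are modelled as strings for illustration only."""
    streams = []
    for index, segment in enumerate(segments):
        stream = [f"cmd:{op}" for op in segment]           # block 1304: initial command stream
        branch = branches[index]
        if branch is not None:                             # block 1306: all but the end segment
            if branch["BRANCH_STREAM"] == "UNCONDITIONAL":
                stream.append(f"branch:unconditional offset={branch['offset']:#x}")
            else:
                stream.append(f"branch:conditional ep_function={branch['ep_function']}")
        streams.append(stream)
    return streams

segments = [["conv1", "relu1"], ["conv2"], ["pool", "fc"]]
branches = [{"BRANCH_STREAM": "CONDITIONAL", "ep_function": 3},
            {"BRANCH_STREAM": "UNCONDITIONAL", "offset": 0x8000},
            None]                                           # end segment: no branch commands
for s in build_final_command_streams(segments, branches):
    print(s)
```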
Once the branch commands have been added to the initial command streams of the relevant segments to generate the final command streams, the method 1300 proceeds to block 1308.
At block 1308, the command decoder is caused to control the operation of the one or more hardware accelerators and the embedded processor in accordance with a command stream for a first segment of the plurality of segments. This will cause the command decoder to jump to the appropriate next segment based on the status and circumstances. The method 1300 may then end.
Reference is now made to
Each convolution engine 1402 comprises hardware logic configured to receive a set of weights {k1, k2 . . . , k8} that represent all or a portion of a filter, and a set of input data values {x1, x2, . . . , x8} that represent all or a portion of a window of the input data, and perform a multiply-accumulate calculation on the received weights and input data values. In some examples, as shown in
Since it may take more than one hardware pass of the convolution engines 1402 to generate a complete filter result (e.g. because a convolution engine may only receive and process a portion of the weights of a filter and/or a portion of the input data values of a window in a cycle), the convolution accelerator 406 may also comprise a plurality of accumulators 1404. A pass of the convolution engines comprises receiving a set of weights and a set of input data values and performing a multiply-accumulate operation thereon. Each accumulator 1404 receives the output of one convolution engine 1402 and adds the output to previous convolution engine outputs that relate to the same filter. Since a convolution engine 1402 may not generate or produce outputs that relate to the same filter in consecutive cycles the partial results of one or more filters may be stored in an accumulation buffer 1406 and then the appropriate partial results may be provided to the accumulators 1404 each cycle by the accumulation buffer 1406.
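For illustration, the multiply-accumulate performed by a single convolution engine in one pass, and the accumulation of partial results for the same filter across passes, can be sketched as follows; the eight-element pass size matches the example weights and input data values above, while the function names and the example filter split are assumptions.

```python
def convolution_engine_pass(weights, inputs):
    """One pass of a single convolution engine: a multiply-accumulate over a set
    of (here, up to 8) weights and input data values."""
    assert len(weights) == len(inputs)
    return sum(w * x for w, x in zip(weights, inputs))

def accumulate_filter_result(partials):
    """The accumulator adds each pass's output to the previous outputs for the
    same filter (partial results held in the accumulation buffer between passes)."""
    total = 0
    for partial in partials:
        total += partial
    return total

# A 2x2x4 filter (16 weights) split over two passes of 8 weights each.
pass1 = convolution_engine_pass([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
pass2 = convolution_engine_pass([8, 7, 6, 5, 4, 3, 2, 1], [1] * 8)
print(accumulate_filter_result([pass1, pass2]))   # -> 72
```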
In some cases, the convolution accelerator 406 may comprise or have access to an input buffer 1408 for storing the elements of the input tensor and a coefficient buffer 1410 for storing the weights of the convolution. In some cases the input buffer 1408 may be implemented as a plurality of banks of memory. In these cases, there may be a multiplexor (not shown) for each convolution engine 1402 that is coupled to each bank of the input buffer 1408 to allow the data stored in any of the banks to be selectively directed to any of the convolution engines 1402.
The neural network accelerators, neural network accelerator cores, convolution accelerators and convolution engines of
The neural network accelerators described herein may be embodied in hardware on an integrated circuit. The neural network accelerators described herein may be configured to perform any of the methods described herein. Generally, any of the functions, methods, techniques or components described above can be implemented in software, firmware, hardware (e.g., fixed logic circuitry), or any combination thereof. The terms “module,” “functionality,” “component”, “element”, “unit”, “block” and “logic” may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block or logic represents program code that performs the specified tasks when executed on a processor. The algorithms and methods described herein could be performed by one or more processors executing code that causes the processor(s) to perform the algorithms/methods. Examples of a computer-readable storage medium include a random-access memory (RAM), read-only memory (ROM), an optical disc, flash memory, hard disk memory, and other memory devices that may use magnetic, optical, and other techniques to store instructions or other data and that can be accessed by a machine.
The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for processors, including code expressed in a machine language, an interpreted language or a scripting language. Executable code includes binary code, machine code, bytecode, code defining an integrated circuit (such as a hardware description language or netlist), and code expressed in a programming language code such as C, Java or OpenCL. Executable code may be, for example, any kind of software, firmware, script, module or library which, when suitably executed, processed, interpreted, compiled, executed at a virtual machine or other software environment, cause a processor of the computer system at which the executable code is supported to perform the tasks specified by the code.
A processor, computer, or computer system may be any kind of device, machine or dedicated circuit, or collection or portion thereof, with processing capability such that it can execute instructions. A processor may be or comprise any kind of general purpose or dedicated processor, such as a CPU, GPU, NNA, System-on-chip, state machine, media processor, an application-specific integrated circuit (ASIC), a programmable logic array, a field-programmable gate array (FPGA), or the like. A computer or computer system may comprise one or more processors.
It is also intended to encompass software which defines a configuration of hardware as described herein, such as HDL (hardware description language) software, as is used for designing integrated circuits, or for configuring programmable chips, to carry out desired functions. That is, there may be provided a computer readable storage medium having encoded thereon computer readable program code in the form of an integrated circuit definition dataset that when processed (i.e. run) in an integrated circuit manufacturing system configures the system to manufacture a neural network accelerator configured to perform any of the methods described herein, or to manufacture a neural network accelerator comprising any apparatus described herein. An integrated circuit definition dataset may be, for example, an integrated circuit description.
Therefore, there may be provided a method of manufacturing, at an integrated circuit manufacturing system, a neural network accelerator as described herein. Furthermore, there may be provided an integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, causes the method of manufacturing a neural network accelerator to be performed.
An integrated circuit definition dataset may be in the form of computer code, for example as a netlist, code for configuring a programmable chip, as a hardware description language defining hardware suitable for manufacture in an integrated circuit at any level, including as register transfer level (RTL) code, as high-level circuit representations such as Verilog or VHDL, and as low-level circuit representations such as OASIS® and GDSII. Higher level representations which logically define hardware suitable for manufacture in an integrated circuit (such as RTL) may be processed at a computer system configured for generating a manufacturing definition of an integrated circuit in the context of a software environment comprising definitions of circuit elements and rules for combining those elements in order to generate the manufacturing definition of an integrated circuit so defined by the representation. As is typically the case with software executing at a computer system so as to define a machine, one or more intermediate user steps (e.g. providing commands, variables etc.) may be required in order for a computer system configured for generating a manufacturing definition of an integrated circuit to execute code defining an integrated circuit so as to generate the manufacturing definition of that integrated circuit.
An example of processing an integrated circuit definition dataset at an integrated circuit manufacturing system so as to configure the system to manufacture a neural network accelerator will now be described with respect to
The layout processing system 1704 is configured to receive and process the IC definition dataset to determine a circuit layout. Methods of determining a circuit layout from an IC definition dataset are known in the art, and for example may involve synthesising RTL code to determine a gate level representation of a circuit to be generated, e.g. in terms of logical components (e.g. NAND, NOR, AND, OR, MUX and FLIP-FLOP components). A circuit layout can be determined from the gate level representation of the circuit by determining positional information for the logical components. This may be done automatically or with user involvement in order to optimise the circuit layout. When the layout processing system 1704 has determined the circuit layout it may output a circuit layout definition to the IC generation system 1706. A circuit layout definition may be, for example, a circuit layout description.
The IC generation system 1706 generates an IC according to the circuit layout definition, as is known in the art. For example, the IC generation system 1706 may implement a semiconductor device fabrication process to generate the IC, which may involve a multiple-step sequence of photolithographic and chemical processing steps during which electronic circuits are gradually created on a wafer made of semiconducting material. The circuit layout definition may be in the form of a mask which can be used in a lithographic process for generating an IC according to the circuit definition. Alternatively, the circuit layout definition provided to the IC generation system 1706 may be in the form of computer-readable code which the IC generation system 1706 can use to form a suitable mask for use in generating an IC.
The different processes performed by the IC manufacturing system 1702 may be implemented all in one location, e.g. by one party. Alternatively, the IC manufacturing system 1702 may be a distributed system such that some of the processes may be performed at different locations, and may be performed by different parties. For example, some of the stages of: (i) synthesising RTL code representing the IC definition dataset to form a gate level representation of a circuit to be generated, (ii) generating a circuit layout based on the gate level representation, (iii) forming a mask in accordance with the circuit layout, and (iv) fabricating an integrated circuit using the mask, may be performed in different locations and/or by different parties.
In other examples, processing of the integrated circuit definition dataset at an integrated circuit manufacturing system may configure the system to manufacture a neural network accelerator without the IC definition dataset being processed so as to determine a circuit layout. For instance, an integrated circuit definition dataset may define the configuration of a reconfigurable processor, such as an FPGA, and the processing of that dataset may configure an IC manufacturing system to generate a reconfigurable processor having that defined configuration (e.g. by loading configuration data to the FPGA).
In some embodiments, an integrated circuit manufacturing definition dataset, when processed in an integrated circuit manufacturing system, may cause an integrated circuit manufacturing system to generate a device as described herein. For example, the configuration of an integrated circuit manufacturing system in the manner described above with respect to
In some examples, an integrated circuit definition dataset could include software which runs on hardware defined at the dataset or in combination with hardware defined at the dataset. In the example shown in
The implementation of concepts set forth in this application in devices, apparatus, modules, and/or systems (as well as in methods implemented herein) may give rise to performance improvements when compared with known implementations. The performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During manufacture of such devices, apparatus, modules, and systems (e.g. in integrated circuits) performance improvements can be traded-off against the physical implementation, thereby improving the method of manufacture. For example, a performance improvement may be traded against layout area, thereby matching the performance of a known implementation but using less silicon. This may be done, for example, by reusing functional blocks in a serialised fashion or sharing functional blocks between elements of the devices, apparatus, modules and/or systems. Conversely, concepts set forth in this application that give rise to improvements in the physical implementation of the devices, apparatus, modules, and systems (such as reduced silicon area) may be traded for improved performance. This may be done, for example, by manufacturing multiple instances of a module within a predefined area budget.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
Further examples and aspects of the invention are provided by way of the following clauses. Non-limiting illustrative sketches of the command-stream branching and partitioning described in the clauses are provided after the clause listing.
Clause 1. A method 1300 of executing a dynamic neural network comprising a plurality of operations on a neural network accelerator comprising a neural network accelerator core, the neural network accelerator core comprising one or more hardware accelerators to accelerate one or more neural network operations, an embedded processor and a command decoder configured to control the operation of the one or more hardware accelerators and the embedded processor, the method comprising: dividing 1302 the plurality of operations of the dynamic neural network into a plurality of segments; generating 1304 an initial command stream for each segment that, when executed by the command decoder, causes the neural network accelerator core to perform the operations in that segment; adding 1306 a set of one or more branch commands to the command stream of each segment, except an end segment of the plurality of segments, to generate a final command stream for that segment, wherein each set of one or more branch commands, when executed by the command decoder, causes the neural network accelerator core to perform an unconditional or conditional branch to a next command stream, and wherein causing the neural network accelerator core to perform a conditional branch to a next command stream comprises causing the embedded processor to determine the next command stream and provide information to the command decoder identifying the next command stream; and causing 1308 the command decoder to control the operation of the one or more hardware accelerators and the embedded processor in accordance with the final command stream for a first segment of the plurality of segments.
Clause 2. The method 1300 of clause 1, wherein each set of one or more branch commands comprises information indicating whether the neural network accelerator core is to perform an unconditional branch or a conditional branch.
Clause 3. The method 1300 of clause 2, wherein, when a set of one or more branch commands comprises information indicating that the neural network accelerator core is to perform an unconditional branch, that set of one or more branch commands also comprises information identifying the next command stream.
Clause 4. The method 1300 of clause 3, wherein that set of one or more branch commands comprises information defining a branch table that comprises one or more entries, each entry comprising information identifying a final command stream or an initial command stream for an end segment; and wherein the information identifying the next command stream comprises the information in a particular entry of the branch table that identifies a final command stream or an initial command stream associated with an end segment.
Clause 5. The method 1300 of clause 4, wherein the particular entry of the branch table is a first entry of the branch table.
Clause 6. The method 1300 of any of clauses 2 to 5, wherein causing the embedded processor to determine the next command stream and provide information to the command decoder identifying the next command stream comprises causing the embedded processor to execute one of a plurality of functions; and, when a set of one or more branch commands comprises information indicating that the neural network accelerator core is to perform a conditional branch, that set of one or more branch commands also comprises information identifying a function of the plurality of functions that the embedded processor is to execute.
Clause 7. The method 1300 of clause 6, wherein each of the plurality of functions is associated with a unique index and the information identifying a function of the plurality of functions is an index associated with a particular function of the plurality of functions.
Clause 8. The method 1300 of any of clauses 2 to 7, wherein: the set of branch commands for at least one of the segments comprises information indicating that the neural network accelerator is to perform a conditional branch; the set of branch commands for the at least one of the segments also comprises information defining a branch table comprising one or more entries, each entry comprising information identifying a final command stream or an initial command stream for an end segment; and causing the embedded processor to determine the next command stream comprises causing the embedded processor to identify one entry of the branch table.
Clause 9. The method 1300 of clause 8, wherein each entry is associated with a unique index and the information provided to the command decoder identifying the next command stream is the index associated with an entry of the branch table.
Clause 10. The method 1300 of any of clauses 1 to 9, wherein dividing the plurality of operations of the dynamic neural network into a plurality of segments comprises (i) dividing the plurality of operations of the dynamic neural network into a plurality of hardware passes of the neural network accelerator and (ii) dividing the hardware passes of the neural network accelerator into a plurality of segments.
Clause 11. The method 1300 of clause 10, wherein dividing the operations of the dynamic neural network into a plurality of hardware passes comprises: (i) mapping each layer of the dynamic neural network to a sequence of one or more low level layers, wherein a low level layer is a set of one or more neural network operations that can be performed by a single hardware accelerator or by the embedded processor, (ii) dividing the low level layers into layer groups, wherein each layer group comprises a sequence of one or more low level layers that can be implemented on the neural network accelerator, and (iii) for any layer group that cannot be implemented in a single hardware pass, sub-dividing the layer group into a plurality of hardware passes based on one or more hardware constraints of the neural network accelerator.
Clause 12. The method 1300 of clause 10 or clause 11, wherein the initial command stream for a segment comprises a command stream for each hardware pass of that segment, and the command stream for each hardware pass comprises information identifying which of the one or more hardware accelerators and the embedded processor are active in the hardware pass.
Clause 13. The method 1300 of any of clauses 10 to 12, wherein the plurality of hardware passes comprises at least one neural network hardware pass in which at least one of the hardware accelerators is active in the hardware pass and at least one embedded processor hardware pass in which the embedded processor is active in the hardware pass.
Clause 14. The method 1300 of any of clauses 10 to 13, wherein adding a set of one or more branch commands to an initial command stream to generate a final command stream comprises adding a set of commands to the initial command stream to implement a branch hardware pass.
Clause 15. The method 1300 of any of clauses 2 to 14, wherein the neural network accelerator comprises at least one additional neural network accelerator core, and when a set of one or more branch commands comprises information indicating that the neural network accelerator is to perform a conditional branch, that set of one or more branch commands also comprises information indicating whether the neural network accelerator core is to provide the information received from the embedded processor to the at least one additional neural network accelerator core.
Clause 16. The method 1300 of any of clauses 1 to 15, wherein the sets of one or more branch commands added to the initial command streams for the segments cause the final command streams to be dynamically linked based on one or more criteria.
Clause 17. Computer readable code configured to cause the method of any of clauses 1 to 16 to be performed when the code is run.
Clause 18. A computer readable storage medium having encoded thereon the computer readable code of clause 17.
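By way of a non-limiting illustration of clauses 1 to 9 and 16, the following Python sketch models segments, command streams, branch commands and a command-decoder dispatch loop. The data structures and names used here (BranchCommand, CommandStream, add_branch_commands, run, the string command format and the stubbed embedded-processor functions) are assumptions made solely for this example and are not the neural network accelerator's actual command format or interface.

```python
# Non-limiting sketch (not the accelerator's real command format): segments,
# command streams, branch commands and a command-decoder dispatch loop.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class BranchCommand:
    conditional: bool                   # False: unconditional branch, True: conditional branch
    branch_table: List[str]             # each entry identifies a candidate next command stream
    function_id: Optional[int] = None   # embedded-processor function to run (conditional only)


@dataclass
class CommandStream:
    commands: List[str]                      # commands for the hardware accelerators / embedded processor
    branch: Optional[BranchCommand] = None   # None for an end segment


def add_branch_commands(initial: Dict[str, List[str]],
                        links: Dict[str, BranchCommand]) -> Dict[str, CommandStream]:
    """Turn per-segment initial command streams into final command streams by
    appending a set of branch commands to every segment except end segments."""
    return {seg: CommandStream(cmds, links.get(seg)) for seg, cmds in initial.items()}


def run(streams: Dict[str, CommandStream],
        first: str,
        embedded_functions: Dict[int, Callable[[], int]]) -> List[str]:
    """Hypothetical command-decoder loop. An unconditional branch follows the
    first branch-table entry; a conditional branch asks the embedded processor
    to execute the identified function, which returns a branch-table index."""
    executed, current = [], first
    while current is not None:
        stream = streams[current]
        executed.extend(stream.commands)        # stand-in for issuing commands to the hardware
        branch = stream.branch
        if branch is None:
            current = None                      # end segment reached
        elif not branch.conditional:
            current = branch.branch_table[0]    # target fixed in the command stream itself
        else:
            index = embedded_functions[branch.function_id]()
            current = branch.branch_table[index]
    return executed


# Example: segment "A" branches conditionally to "B" or "C" depending on the
# (here, trivially stubbed) embedded-processor decision function 0.
streams = add_branch_commands(
    {"A": ["pass_0", "pass_1"], "B": ["pass_2"], "C": ["pass_3"]},
    {"A": BranchCommand(conditional=True, branch_table=["B", "C"], function_id=0),
     "B": BranchCommand(conditional=False, branch_table=["C"])},
)
print(run(streams, "A", {0: lambda: 1}))   # -> ['pass_0', 'pass_1', 'pass_3']
```

In this model the branch table and, for a conditional branch, the function index are the only pieces of information the command decoder needs from the branch commands; the choice of branch-table entry is either fixed (unconditional) or supplied at run time by the embedded processor (conditional), which is what allows the final command streams to be dynamically linked.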
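Similarly, the following sketch illustrates, under simplifying assumptions, the partitioning described in clauses 10 to 12: layers are mapped to low level layers, which are split into hardware passes under a made-up per-pass limit, and the hardware passes are then chunked into segments. For brevity, each layer's low level layers are treated as a single layer group; the mapping table, limits and layer names are all illustrative assumptions rather than the accelerator's actual constraints.

```python
# Non-limiting sketch of the partitioning of clauses 10 to 12: layers -> low
# level layers -> hardware passes -> segments, under made-up constraints.
from typing import Dict, List, Tuple

# Hypothetical mapping from layer types to low level layers, i.e. sets of
# operations each of which can run on a single hardware accelerator or on the
# embedded processor.
LOW_LEVEL_MAP: Dict[str, List[str]] = {
    "conv": ["conv_accel"],
    "conv_relu": ["conv_accel", "activation_accel"],
    "topk": ["embedded_processor"],   # dynamic op handled by the embedded processor
}


def to_hardware_passes(layers: List[Tuple[str, int]],
                       max_ops_per_pass: int = 2) -> List[List[str]]:
    """Map layers to low level layers, then split any group whose low level
    layers exceed the per-pass limit into several hardware passes."""
    passes: List[List[str]] = []
    for layer_type, _layer_id in layers:
        low_level = LOW_LEVEL_MAP[layer_type]
        for start in range(0, len(low_level), max_ops_per_pass):
            passes.append(low_level[start:start + max_ops_per_pass])
    return passes


def to_segments(passes: List[List[str]],
                passes_per_segment: int = 2) -> List[List[List[str]]]:
    """Chunk consecutive hardware passes into segments; each segment later gets
    its own command stream (and, except for end segments, branch commands)."""
    return [passes[i:i + passes_per_segment]
            for i in range(0, len(passes), passes_per_segment)]


network = [("conv_relu", 0), ("topk", 1), ("conv", 2)]
segments = to_segments(to_hardware_passes(network))
print(segments)
# -> [[['conv_accel', 'activation_accel'], ['embedded_processor']], [['conv_accel']]]
```

Each resulting segment would then receive its own initial command stream, with branch commands appended to every segment other than the end segments, as in the sketch above.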
| Number | Date | Country | Kind |
|---|---|---|---|
| 2320004.1 | Dec 2023 | GB | national |