The present disclosure relates to transposing information using hardware elements, and more particularly, to transposing information using latches.
Neural networks are relied upon for disparate uses and are increasingly forming the underpinnings of technology. For example, a neural network may be leveraged to perform object classification on an image obtained via a user device (e.g., a smart phone). In this example, the neural network may represent a convolutional neural network which applies convolutional layers, pooling layers, and one or more fully-connected layers to classify objects depicted in the image. As another example, a neural network may be leveraged for translation of text between languages. For this example, the neural network may represent a recurrent neural network.
Typically, information may need to be transposed as part of training neural networks or at inference time. For example, a first matrix may be used to store information such as input data, weight data, and so on. In this example, the first matrix may be multiplied by a second matrix or by a vector (e.g., a column vector). As may be appreciated, multiplying the first matrix by the second matrix, or the first matrix by the column vector, requires that the number of columns in the first matrix be equal to the number of rows in the second matrix or column vector. Thus, for certain multiplications the first matrix may need to be transposed to switch the row and column indices of the matrix.
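As a brief worked illustration (with assumed dimensions, not taken from any particular embodiment): a matrix with 4 rows and 3 columns cannot directly multiply a 4-element column vector, but its transpose can:

$$A \in \mathbb{R}^{4 \times 3},\ v \in \mathbb{R}^{4 \times 1}:\quad Av \text{ is undefined since } 3 \neq 4, \text{ whereas } A^{\mathsf{T}} \in \mathbb{R}^{3 \times 4} \text{ yields } A^{\mathsf{T}} v \in \mathbb{R}^{3 \times 1}.$$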
At present, hardware techniques to transpose matrices, for example in a streaming context, require a large array of hardware elements which consumes substantial die area.
Embodiments of the present disclosure and their advantages are best understood by referring to the detailed description that follows. It should be appreciated that like reference numerals are used to identify like elements illustrated in one or more of the figures, wherein showings therein are for purposes of illustrating embodiments of the present disclosure and not for purposes of limiting the same.
This application describes techniques to enable transposing data in a streaming context while efficiently utilizing die area. As will be described, information may be streamed during operation of a processor, with the information being rapidly transposed when needed. The processor may be, for example, a matrix processor which is used to perform neural network processing. It may be appreciated that a matrix, such as a portion of an input image, a feature map, or a filter or kernel, may need to be transposed to allow for multiplying the matrix by another matrix or vector. Thus, transposing information may be a required step during processing of layers, such as convolutional layers, in a neural network. As will be described, the techniques described herein leverage an array or combination of active and shadow latches (referred to herein as active/shadow storage) to efficiently transpose information while limiting the die area devoted to transposing.
To transpose data, such as a matrix, using hardware elements, an example technique may use an array or combination of flip-flops (e.g., a two-dimensional set of flip-flops). Each flip-flop may be used to store a bit included in the matrix. Thus, if the matrix is a 4×3 matrix which includes 12 values, there may be 12 flip-flops used to store the values. As may be appreciated, the 12 values may each require more than one bit, and there may therefore be more than 12 flip-flops used to store the values. To transpose the 4×3 matrix, the outputs of the flip-flops may be ordered such that the values are flipped about the diagonal of the matrix. In this way, the 4×3 matrix may be output (e.g., read out) as a 3×4 matrix.
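As a non-limiting behavioral sketch (in software, purely for illustration; the class and method names are assumptions and not part of this disclosure), the single-buffered flip-flop array may be modeled as follows:

```python
# Behavioral model of a single-buffered flip-flop array used for transposing.
# Illustrative sketch only; a real implementation would be register-transfer logic.

class FlipFlopArrayTranspose:
    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.cells = [[0] * cols for _ in range(rows)]  # one "flip-flop" per bit

    def write_row(self, r, bits):
        """Write one row of the input matrix (one row per clock cycle)."""
        assert len(bits) == self.cols
        self.cells[r] = list(bits)

    def read_column(self, c):
        """Read one column; reading all columns yields the transpose."""
        return [self.cells[r][c] for r in range(self.rows)]

# A 4x3 matrix is written over 4 cycles and read out as a 3x4 matrix over 3 cycles.
# Note: with a single buffer, no new row may be written until all columns are read.
array = FlipFlopArrayTranspose(4, 3)
matrix = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
for r, row in enumerate(matrix):
    array.write_row(r, row)
transposed = [array.read_column(c) for c in range(3)]
assert transposed == [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]
```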
While using a single flip-flop for each bit included in a matrix may allow for transposing data, it will not allow for doing so in a streaming context. For example, full pipelining cannot be used because the matrix cannot be read and written within a particular number of clock cycles (e.g., a single clock cycle, a clock cycle per row of the matrix, and so on). As known by those skilled in the art, new inputs cannot be written until all outputs have been read, which precludes use of this technique in a streaming context and limits throughput and overall processing speed.
Another example technique to transpose data may include using two flip-flops for each bit. This may allow for transposing data in a streaming context, for example allowing new information to be written while old information is being read. As an example, the two flip-flops for each bit may include a first flip-flop and a second flip-flop. The second flip-flop may receive a copy of the information from the first flip-flop after the information is written to the first flip-flop. The information may then be read from the second flip-flops while new information is being written to the first flip-flops. However, this example technique may substantially increase the die area assigned to transposing data. The increased complexity and cost, along with the potential reduction in space for other hardware elements in a processor, may cause use of two flip-flops for each bit to be disfavored.
The techniques described herein advantageously allow for two latches to be used for each bit of a matrix which is to be transposed. A latch, for example an SR latch, may require substantially less die area than a flip-flop, which, as an example, may be formed from two latches. Additionally, the regular structure may lend itself to scripted physical placement, such that manufacture of the active/shadow storage may be simplified.
As will be described, the two latches may include an active latch and a shadow latch. For example, values included in a matrix may be written to the active latches over one or more clock cycles. In this example, the values may then be stored in the shadow latches. New values included in a new matrix may then be written to the active latches while the values are read from the shadow latches. In this way, the techniques described herein allow for streaming of information to be transposed during operation of a processor.
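A minimal software model of this active/shadow scheme may be sketched as follows (all names are illustrative assumptions; the model captures behavior, not circuit structure):

```python
# Behavioral sketch of double-buffered (active/shadow) transpose storage.
# Names and interfaces are illustrative, not taken from the disclosure.

class ActiveShadowStorage:
    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.active = [[0] * cols for _ in range(rows)]
        self.shadow = [[0] * cols for _ in range(rows)]

    def write_row(self, r, bits):
        """Write one row into the active latches (one row per clock cycle)."""
        self.active[r] = list(bits)

    def snapshot(self):
        """Copy the active latches into the shadow latches (shadow clock gate enabled)."""
        self.shadow = [row[:] for row in self.active]

    def read_transposed_row(self, c):
        """Read column c of the shadow latches, i.e. row c of the transpose."""
        return [self.shadow[r][c] for r in range(self.rows)]

storage = ActiveShadowStorage(3, 3)
first = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
second = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
for r in range(3):
    storage.write_row(r, first[r])
storage.snapshot()  # first matrix now held in the shadow latches
out = []
for c in range(3):
    storage.write_row(c, second[c])             # new matrix streams into the active latches...
    out.append(storage.read_transposed_row(c))  # ...while the old one is read transposed
assert out == [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```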
To control operation of the active and shadow latches, different clock gates may be used. For example, the active latches may be controlled via a first clock gate signal while the shadow latches may be controlled via a second clock gate signal. In this example, the active and shadow latches may therefore be enabled at specific times to ensure proper operation of the active/shadow storage.
In some embodiments, the active latch may be transparent on the high phase of a clock when input data is ready. The shadow latch may be transparent on the low phase of the clock when data is ready to be moved from the active latch to the shadow latch. Thus, in some embodiments the same clock may be used for both the active and shadow latches, with each latch gated separately as described below.
In some embodiments, the innovation may be used for algorithms other than a pure transpose. For example, a circuit may read the diagonal elements of an array out one at a time. Without the techniques described herein, this could not be done without stalling the line writes until the diagonal reads were done.
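As a hedged illustration of such a diagonal read, modeled in software with illustrative data:

```python
# Hypothetical diagonal read-out from a double-buffered array (illustrative sketch).
# With a shadow copy, line writes continue while the diagonal is read one per cycle.
active = [[0] * 3 for _ in range(3)]
shadow = [row[:] for row in [[1, 2, 3], [4, 5, 6], [7, 8, 9]]]  # captured matrix
diagonal = []
for i in range(3):
    active[i] = [0, 0, 0]          # line writes proceed without stalling...
    diagonal.append(shadow[i][i])  # ...while one diagonal element is read per cycle
assert diagonal == [1, 5, 9]
```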
In the illustrated example, input data 102A is being provided to the matrix processor system 100. The input data 102A, as described above, may represent an input image, one or more feature maps, or a portion thereof. For example, the portion may represent a particular window of the input data which is to be multiplied by weight data (e.g., one or more kernels or filters). The input data 102A may also represent information being input into a layer of a neural network, such as a convolutional, transformer, or fully-connected layer, which is to be multiplied with weight information (e.g., a weight matrix).
As known by those skilled in the art, the input data 102A may need to be transposed to multiply the input data 102A with other input data 102B. For example, the input data 102A may represent a 20×10 matrix which includes 200 values. In this example, the input data 102B may represent a 20×4 matrix which includes 80 values. Thus, to multiply these matrices the input data 102A is transposed into a 10×20 matrix.
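A quick numerical check of these example dimensions (using NumPy purely for illustration; the shapes are the example's, not a required configuration):

```python
import numpy as np

# Illustrative shapes from the example above: 102A is 20x10, 102B is 20x4.
a = np.ones((20, 10))   # input data 102A: 200 values
b = np.ones((20, 4))    # input data 102B: 80 values

# a @ b fails (10 columns vs. 20 rows), but the transposed product is defined.
result = a.T @ b        # (10x20) @ (20x4) -> 10x4
assert result.shape == (10, 4)
```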
The matrix processor system 100 includes an active/shadow storage 110 to transpose the input data 102A into transposed data 112. As described above, and as will be described in more detail, the active/shadow storage 110 includes a multitude of active and shadow latches. Each active latch may be connected to an individual shadow latch in the storage 110. In some embodiments, the active/shadow storage 110 may include an array of active and shadow latches. For example, there may be an array of groups of active and shadow latches organized into rows and columns. In this example, there may be 32×32, 20×20, 36×48, and so on groups of active and shadow latches.
The input data 102A may be written to the active latches over one or more clock cycles. For example, at each clock cycle a particular amount of data (e.g., obtained from outside memory) may be written to the active/shadow storage 110. In some embodiments, the amount of data may be 8 bits, 10 bits, 16 bits, 32 bits, and so on. In some embodiments, and as will be described below, an entire row of the input data 102A may be written at each clock cycle.
In this way, the matrix processor 120 may determine a processing result 122 associated with the multiplication of input data 102A and input data 102B. The processing result 122 may represent, for example, an output associated with multiplying input data and weights in a layer of a neural network. The processing result 122 may also represent, for example, an output associated with a portion of processing performed in a layer of a neural network. In some embodiments, the processing result 122 may represent one or more windows of input information (e.g., input image, feature maps, and so on) convolved with one or more filters or kernels.
An example matrix processor 120 is described in more detail with respect to U.S. Pat. Nos. 11,157,287, 11,409,692, and 11,157,441, which are hereby incorporated by reference in their entirety and form part of this disclosure as if set forth herein.
The active latches, such as active latch 160, may receive respective bits of input data. For example, the input data may include a matrix of values which is to be transposed. In this example, each value of the matrix (e.g., a particular row and column element of the matrix) may be represented by one or more bits. Thus, each active latch may receive one of the bits for storage.
In some embodiments, active latch 160 may represent a group of active latches. For example, each active latch in the group of active latches may be used to store one bit associated with a larger value (e.g., a byte, a word, an integer, a float, and so on). Similarly, shadow latch 162 may represent a group of shadow latches which receives information from the group of active latches. In this way, the active/shadow storage 110 may be used to transpose a matrix of values which are bytes, words, integers, and so on.
As may be appreciated, the values included in a matrix of values may be provided to the active/shadow storage 110 over one or more clock cycles. For example, the values may be obtained from an element (e.g., a memory, which is not illustrated) and written to the active latches over the one or more clock cycles.
Once the bits included in the matrix of values are stored in the active latches, the shadow latches may store copies of the bits. For example, shadow latch 162 may receive the bit stored in active latch 160. As will be described below, the shadow latches may be enabled via a clock gate such that they replicate the values held in their corresponding active latches.
Subsequently, the information stored in the shadow latches may be read over one or more clock cycles. For example, each row of shadow latches included in the active/shadow storage 110 may be addressed and the bits stored therein read from the shadow latches. As another example, each column of shadow latches included in the active/shadow storage 110 may be addressed and the bits stored therein read from the shadow latches. As illustrated, the matrix of values may be read such that the matrix is transposed. For example, the values may be flipped over the diagonal of the matrix.
The number of clock cycles to read the information may optionally be based on a number of columns included in the matrix of values. For example, if the matrix of values is an M×N matrix where each value is a bit then there may be M clock cycles to write N bits of information into the active latches. In this example, there may be N clock cycles to read M bits of information from the shadow latches.
As may be appreciated, while the information stored in the shadow latches is being read, new information (e.g., a new matrix of values) may be stored in the active latches over one or more clock cycles. During operation of the active/shadow storage 110, for example during steady-state operation, the active latches may be written to while the shadow latches are read from. Advantageously, the active latches and shadow latches may store different data such that during steady-state operation matrices may be streamed to the storage 110 and then transposed in a streaming context.
In some embodiments, the active/shadow storage 110 may write information to the active latches at each clock cycle while also reading information from the shadow latches at that clock cycle. For example, square matrices may cause the same number of writes to be performed as reads. With respect to the example above of an M×N matrix, if M is the same as N then there may be M clock cycles to write M bits of information into the active latches and M clock cycles to read M bits of information from the shadow latches. However, in some embodiments non-square matrices may cause a different number of writes to be performed than reads. With respect to an M×N matrix, where M is 10 and N is 16, there may be 10 cycles of 16-bit writes into the active latches and 16 cycles of 10-bit reads from the shadow latches. Thus, pipelining (e.g., streaming) of matrices may be adjusted depending on the dimensions of a matrix. For example, the storage 110 may limit an extent to which new information is written to the active latches (e.g., the storage 110 may write on 10 out of every 16 clock cycles with respect to the example above).
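The cycle bookkeeping may be sketched as follows (a simple model under the assumptions above; the function name is illustrative):

```python
# Illustrative cycle-count bookkeeping for transposing an M x N matrix.
def transpose_schedule(m, n):
    write_cycles = m      # one N-bit row written per cycle
    read_cycles = n       # one M-bit column read per cycle
    # In steady state the slower side paces the pipeline, so writes are
    # throttled to m out of every max(m, n) cycles.
    period = max(m, n)
    return write_cycles, read_cycles, period

w, r, period = transpose_schedule(10, 16)
assert (w, r, period) == (10, 16, 16)  # writes occur on 10 of every 16 cycles
```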
To allow for use of latches, and thus to reduce die area as compared to another technique for transposing matrices (e.g., flip-flops), a clock 150 may be separately gated for the active latches and shadow latches. For example, clock gate 152A may be toggled on or off to enable or disable the active latches. As another example, clock gate 152B may be toggled on or off to enable or disable the shadow latches. In the illustrated embodiment, the same clock is used, with the active latches and shadow latches using opposite phases of the clock (e.g., the high phase and the low phase).
In this way, the active latches and shadow latches may be separately controlled. In some embodiments, the clock gates 152A and 152B may be toggled according to instructions being executed by the matrix processor system 100. The active latches and shadow latches may, in some embodiments, be set to be active on opposite edges of the clock 150. For example, the active latches may be enabled based on rising edges of the clock 150 while the shadow latches may be enabled based on falling edges of the clock 150 (or vice-versa).
An example implementation of inputting the input data 170 into the storage 110 is illustrated. In the example, the first row of values is input into the upper row of active latches. The second row of values is input into the middle row of active latches. The third row of values is input into the bottom row of active latches.
In the illustrated example, the top row of the transposed data 172 is read from the left-most shadow latches. The middle row of the transposed data 172 is read from the middle shadow latches. The bottom row of the transposed data 172 is read from the right-most shadow latches. As may be appreciated, and as described herein, while data is being read from the shadow latches new data (e.g., new input data to be transposed) may be stored in the active latches. In this way, data may be transposed in a streaming fashion.
To store value A 202 in active latch 210, clock gate 152A may be enabled. As illustrated, clock gate 152A is set to be enabled such that the active latch 210 can store value A 202. To write a matrix of values to the active latches of the storage 110, one or more clock cycles may be required.
In the illustrated example, clock gate 152B has been set to enable (e.g., logical ‘1’), such that on falling edge 212 of the clock 150 the shadow latch 220 is enabled. In addition to shadow latch 220, in some embodiments all of the shadow latches may be enabled to replicate the values in their corresponding active latches. Clock gate 152A is optionally set to be disabled, such that new information (e.g., a new matrix of values) is not written into the active latches.
In the illustrated example, clock gate 152A is set to enable while clock gate 152B is set to disable. Thus, new information (e.g., value B 204) may be written to active latch 210. For example, at rising edge 214A of the clock 150 the value B 204 may be stored in the active latch 210. With clock gate 152B set to disable, at the subsequent falling edge of the clock 150 the shadow latch 220 may retain value A 202. Advantageously, value A 202 may be read during the clock cycle such that new information (e.g., value B 204) is written while old information (e.g., value A 202) is read.
With respect to steady-state operation, as may be appreciated, the values of the new information may be written to the active latches over one or more clock cycles. Similarly, during these clock cycles the old information may be read from the shadow latches.
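One possible software model of a single active/shadow bit cell and its clock gating, following the sequence described above (names and phase conventions are assumptions for illustration):

```python
# Behavioral model of one active/shadow bit cell with per-latch clock gating.
# Illustrative only; gate and phase conventions follow the description above.

class ActiveShadowBit:
    def __init__(self):
        self.active = 0
        self.shadow = 0

    def high_phase(self, data, gate_a):
        """Active latch is transparent on the high phase when its clock gate is enabled."""
        if gate_a:
            self.active = data

    def low_phase(self, gate_b):
        """Shadow latch is transparent on the low phase when its clock gate is enabled."""
        if gate_b:
            self.shadow = self.active

cell = ActiveShadowBit()
# Cycle 1: value A is written into the active latch, then copied to the shadow latch.
cell.high_phase(data=1, gate_a=True)
cell.low_phase(gate_b=True)
# Cycle 2: value B is written while the shadow latch retains (and can output) value A.
cell.high_phase(data=0, gate_a=True)
cell.low_phase(gate_b=False)   # gate B disabled, so value A is held for reading
assert cell.shadow == 1 and cell.active == 0
```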
At block 302, the system causes input data to be stored in active latches of the active/shadow storage. During operation of the system, input data may be obtained (e.g., based on instructions, such as software instructions) which is to be transposed. This information may be provided to the active/shadow storage, which includes a multitude of groups of active latches and shadow latches. Each group of active latches and shadow latches may be associated with a bit of the input data. As described above, a first clock gate signal may be enabled such that the active latches store the input data.
At block 304, the system copies active latch data into the shadow latches. A second clock gate signal may be enabled, and optionally the first clock gate signal set to be disabled, which causes the shadow latches to be enabled. Once enabled, the shadow latches may store (e.g., replicate or copy) the information stored in the active latches.
At block 306, the system reads data from the shadow latches while new data is written to at least some of the active latches. As described above, the data may be read from the shadow latches such that the matrix is transposed (e.g., the values are flipped over the diagonal), while new data to be transposed is written to the active latches. An illustrative steady-state driver for these three blocks is sketched below.
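A minimal steady-state driver tying blocks 302, 304, and 306 together may be sketched as follows (a software model under assumed names, not the hardware itself):

```python
# Illustrative steady-state driver: block 302 writes a matrix into the active
# latches, block 304 snapshots it into the shadow latches, and block 306 reads
# the transpose while the next matrix is written.
def stream_transpose(matrices, rows, cols):
    active = [[0] * cols for _ in range(rows)]
    shadow = None
    for matrix in matrices + [None]:           # one extra pass to drain the shadow
        out = []
        for step in range(max(rows, cols)):
            if matrix is not None and step < rows:
                active[step] = list(matrix[step])                    # block 302
            if shadow is not None and step < cols:
                out.append([shadow[r][step] for r in range(rows)])   # block 306
        if shadow is not None:
            yield out
        shadow = [row[:] for row in active] if matrix is not None else None  # block 304

results = list(stream_transpose([[[1, 2], [3, 4]], [[5, 6], [7, 8]]], 2, 2))
assert results == [[[1, 3], [2, 4]], [[5, 7], [6, 8]]]
```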
The vehicle 400 further includes a propulsion system 406 usable to set a gear (e.g., a propulsion direction) for the vehicle. With respect to an electric vehicle, the propulsion system 406 may adjust operation of the electric motor 402 to change propulsion direction.
Additionally, the vehicle includes the matrix processor system 100 which is configured to transpose information as described herein. The matrix processor system 100 may process data, such as images received from image sensors positioned about the vehicle 400 (e.g., cameras 104A-104N). The matrix processor system 100 may additionally output information to, and receive information (e.g., user input) from, a display 408 included in the vehicle 400.
All of the processes described herein may be embodied in, and fully automated via, software code modules executed by a computing system that includes one or more computers or processors. The code modules may be stored in any type of non-transitory computer-readable medium or other computer storage device. Some or all of the methods may be embodied in specialized computer hardware.
Many other variations than those described herein will be apparent from this disclosure. For example, depending on the embodiment, certain acts, events, or functions of any of the algorithms described herein can be performed in a different sequence or can be added, merged, or left out altogether (for example, not all described acts or events are necessary for the practice of the algorithms). Moreover, in certain embodiments, acts or events can be performed concurrently, for example, through multi-threaded processing, interrupt processing, or multiple processors or processor cores or on other parallel architectures, rather than sequentially. In addition, different tasks or processes can be performed by different machines and/or computing systems that can function together.
The various illustrative logical blocks, modules, and engines described in connection with the embodiments disclosed herein can be implemented or performed by a machine, such as a processing unit or processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor can be a microprocessor, but in the alternative, the processor can be a controller, microcontroller, or state machine, combinations of the same, or the like. A processor can include electrical circuitry configured to process computer-executable instructions. In another embodiment, a processor includes an FPGA or other programmable device that performs logic operations without processing computer-executable instructions. A processor can also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Although described herein primarily with respect to digital technology, a processor may also include primarily analog components. For example, some or all of the signal processing algorithms described herein may be implemented in analog circuitry or mixed analog and digital circuitry. A computing environment can include any type of computer system, including, but not limited to, a computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a device controller, or a computational engine within an appliance, to name a few.
Conditional language such as, among others, “can,” “could,” “might” or “may,” unless specifically stated otherwise, are understood within the context as used in general to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.
Disjunctive language such as the phrase “at least one of X, Y, or Z,” unless specifically stated otherwise, is understood with the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (for example, X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present.
Any process descriptions, elements or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or elements in the process. Alternate implementations are included within the scope of the embodiments described herein in which elements or functions may be deleted, executed out of order from that shown, or discussed, including substantially concurrently or in reverse order, depending on the functionality involved as would be understood by those skilled in the art.
Unless otherwise explicitly stated, articles such as “a” or “an” should generally be interpreted to include one or more described items. Accordingly, phrases such as “a device configured to” are intended to include one or more recited devices. Such one or more recited devices can also be collectively configured to carry out the stated recitations. For example, “a processor configured to carry out recitations A, B and C” can include a first processor configured to carry out recitation A working in conjunction with a second processor configured to carry out recitations B and C.
It should be emphasized that many variations and modifications may be made to the above-described embodiments, the elements of which are to be understood as being among other acceptable examples. All such modifications and variations are intended to be included herein within the scope of this disclosure.
This application claims priority to U.S. Prov. App. No. 63/362,876 titled “TRANSPOSING INFORMATION USING SHADOW LATCHES AND ACTIVE LATCHES FOR EFFICIENT DIE AREA IN PROCESSING SYSTEM” and filed on Apr. 12, 2022, the disclosure of which is hereby incorporated herein by reference in its entirety.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2023/018060 | 4/10/2023 | WO | |

| Number | Date | Country |
|---|---|---|
| 63362876 | Apr 2022 | US |