Performing inference on a machine learning model typically requires retrieving data from memory and applying one or more computational array operations on the data. Applications of machine learning, such as those targeting self-driving and driver-assisted automobiles, often utilize computational array operations to calculate matrix and vector results. These operations require loading data, such as captured sensor data, and performing image processing to identify key features, such as lane markers and other objects in a scene. Traditionally, these operations may be implemented using a generic microprocessor system that loads the computation data from memory before performing a computational array instruction. While the data is loading, the microprocessor system often sits idle. The software platform running these applications will initiate the computational array instruction once the data has completed loading. The length of stalls and the time required to synchronize the computational operation with the retrieved data can be particularly long when accessing variable latency memory. Stalls and synchronization efforts by the software platform reduce the efficiency of the microprocessor system and result in higher power consumption and lower throughput. Therefore, there exists a need for a microprocessor system with increased throughput that performs array computational operations using variable latency memory access.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
One technique for loading a large number of elements and synchronizing the loading of the elements with a control operation is to stall the microprocessor system pending the completion of each memory read. A software platform is configured to initiate the load of the data from memory by issuing a processor instruction and the processor stalls until the load is complete. While the memory read is pending, the software platform waits for the load to complete. Upon completion of the memory read, a next processor instruction corresponding to a computational operation is processed and the data arguments are prepared using the result of the memory read. This computational operation instruction specifying a computational operation and operands is issued for processing by the computational array. An alternative technique requires stalling the processor and waiting for an interrupt to resume the execution of the processor. Both these techniques incur significant performance penalties waiting for the memory read request to be granted access to memory and for the memory read to be performed once access has been granted. Moreover, the techniques increase power consumption by stalling the microprocessor system while each memory read completes. Since the memory reads incur an access time with a variable latency, the length of each stall is difficult to predict. A microprocessor system relying on these techniques is limited in both its throughput and power efficiency.
To address these limitations, a microprocessor system for performing high throughput array computational operations is disclosed. In some embodiments, a microprocessor system includes a hardware arbiter to manage memory requests and is in communication with a control unit and a control queue to synchronize computational operations associated with the memory requests. The hardware arbiter queues memory read requests to retrieve data from memory with variable access latency. Each request is queued until the request is granted access to memory and the request can be serviced. A control queue queues a control operation that corresponds to the memory request and describes a computational operation. The dequeueing of the control operation is synchronized with the availability of the data retrieved via the memory read request. The synchronization allows the data retrieved from memory and the control operation to be synchronized and provided to a computational array together to perform a computational operation.
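The synchronization described above can be illustrated with a small software model. This is a minimal sketch, not the disclosed hardware: the class names `ControlQueue` and `Arbiter`, the helper `run`, and the use of a Python list as memory are all assumptions made for illustration, and variable access latency is not modeled. It shows only that a control operation stays queued until the arbiter grants the matching read, so the operation and its data are released together.

```python
from collections import deque


class ControlQueue:
    """FIFO of control operations, released only on a ready signal."""

    def __init__(self):
        self._ops = deque()

    def enqueue(self, op):
        self._ops.append(op)

    def release(self):
        # Dequeue the oldest control operation; called when its operands are ready.
        return self._ops.popleft()


class Arbiter:
    """Queues memory read requests and signals the control queue on each grant."""

    def __init__(self, memory, control_queue):
        self._memory = memory
        self._requests = deque()
        self._control_queue = control_queue

    def request_read(self, start, length):
        self._requests.append((start, length))

    def has_pending(self):
        return bool(self._requests)

    def grant_next(self):
        # Service the oldest request, then use the "ready" signal to release
        # the matching control operation so both arrive together.
        start, length = self._requests.popleft()
        data = self._memory[start:start + length]
        op = self._control_queue.release()
        return op, data


def run(memory, work):
    """work: iterable of (operation, start, length). Returns (operation, data) pairs."""
    control_queue = ControlQueue()
    arbiter = Arbiter(memory, control_queue)
    for op, start, length in work:
        arbiter.request_read(start, length)
        control_queue.enqueue(op)
    issued = []
    while arbiter.has_pending():
        issued.append(arbiter.grant_next())
    return issued
```

In this model the software side only enqueues work; it never waits on an individual read, which is the property the disclosed arrangement is intended to provide.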
In various embodiments, a microprocessor system comprises at least a computational array and a hardware arbiter for performing arbitration of memory access requests and synchronizing the granted requests with a control unit. For example, a microprocessor system includes a hardware arbiter for controlling memory access requests to data that is operated on by a computational array such as a matrix processor. The computational array includes a plurality of computation units, wherein each of the plurality of computation units operates on a corresponding value addressed from memory. For example, a value addressed from memory may correspond to a portion of sensor data that is first loaded from memory before it can be fed to a corresponding computation unit of the computational array. In some embodiments, the hardware arbiter is configured to control the issuing of at least one memory request for one or more of the corresponding values addressed from the memory for the computation units. For example, the hardware arbiter receives memory read requests and queues them until each corresponding request is granted access by the hardware arbiter to read from memory. In some embodiments, the hardware arbiter is configured to schedule a control signal to be issued based on the issuing of the memory requests. For example, once the hardware arbiter grants a memory request, the hardware arbiter sends a ready control signal corresponding to the memory read request. In some embodiments, the ready signal is sent once the read has completed. In various embodiments, the ready signal is received and results in the release of a queued control operation so that the operation can be made available at a computational array together with the data read from memory. In various embodiments, the data is first formatted by a hardware data formatter before being presented to a computational array.
In some embodiments, a microprocessor system includes a computational array (e.g., matrix processor) in communication with a hardware data formatter for aligning the data to minimize data reads and the latency incurred by reading input data for processing. For example, a matrix processor allows a plurality of elements of a matrix and/or vector to be loaded and processed in parallel together. Thus, using data formatted by one or more hardware data formatters, a computational operation such as a convolution operation may be performed by the computational array.
One technique includes loading a large number of consecutive elements (e.g., consecutive in memory) of a matrix/vector together and performing operations on the consecutive elements in parallel using the matrix processor. By loading consecutive elements together, a single memory load and/or cache check for the entire group of elements can be performed, allowing the entire group of elements to be loaded using minimal processing resources. However, requiring the input elements of each processing iteration of the matrix processor to be consecutive elements could potentially require the matrix processor to load a large number of matrix/vector elements that are not to be utilized. For example, performing a convolution operation using a stride greater than one requires access to matrix elements that are not consecutive. If parallel input elements to the matrix processor are required to be consecutive, each processing iteration of the matrix processor is unable to fully utilize every individual input element for workloads only requiring non-consecutive elements. An alternative technique is to not require every individual input element of the matrix processor be consecutive (e.g., every individual input element can be independently specified without regard to whether it is consecutive in memory to a previous input element). This technique incurs significant performance costs since each referenced element incurs the cost of determining its memory address and performing a cache check for the individual element with the potential of an even more expensive load from memory in the case of a cache miss.
In an embodiment of a disclosed microprocessor system, the group of input elements of a matrix processor is divided into a plurality of subsets, wherein elements within each subset are required to be consecutive but the different subsets are not required to be consecutive. This allows the benefit of reduced resources required to load consecutive elements within each subset while providing the flexibility of loading non-consecutive elements across the different subsets. For example, a hardware data formatter loads multiple subsets of elements where the elements of each subset are located consecutively in memory. By loading the elements of each subset together, a memory address calculation and cache check is performed only with respect to the start and end elements of each subset. In the event of a cache miss, an entire subset of elements is loaded together from memory. Rather than incurring a memory lookup penalty on a per element basis as with the previously discussed technique, a cache check is minimized to two checks for each subset (the start and end elements) and a single memory read for the entire subset in the event of a cache miss. Computational operations on non-consecutive elements, such as performing a convolution using a stride greater than one, are more efficient since the memory locations of the subsets need not be consecutively located in memory. Using the disclosed system and techniques, computational operations may be performed on non-consecutive elements with increased throughput and a high clock frequency.
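The cost difference between per-subset and per-element lookups can be made concrete with a short model. The following is a hedged sketch only: the function name `gather_subsets`, the dictionary used as a cache, and the `(start, length)` description of a subset are assumptions for illustration and are not taken from the disclosure. It counts two cache checks per subset (start and end elements) and one memory read per missing subset.

```python
def gather_subsets(memory, cache, subsets):
    """Gather values for a list of (start, length) subsets.

    Returns (values, cache_checks, memory_reads). The cache is modeled as a
    dict mapping address -> value (an assumption of this sketch).
    """
    values = []
    cache_checks = 0
    memory_reads = 0
    for start, length in subsets:
        end = start + length - 1
        # Only the first and last element of the subset are checked.
        cache_checks += 2
        if start not in cache or end not in cache:
            # A miss on either element loads the whole subset in one read.
            memory_reads += 1
            for address in range(start, start + length):
                cache[address] = memory[address]
        values.extend(cache[address] for address in range(start, start + length))
    return values, cache_checks, memory_reads
```

For two four-element subsets, this model performs four cache checks and at most two memory reads, whereas specifying all eight elements independently would require eight address calculations and eight cache checks.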
In various embodiments, a computational array performs matrix operations involving input vectors and includes a plurality of computation units to receive M operands and N operands from the input vectors. Using a sequence of input vectors, a computational array may perform matrix operations such as a matrix multiplication. In some embodiments, the computation units are sub-circuits that include an arithmetic logic unit, an accumulator, a shadow register, and a shifter for performing operations such as generating dot-products and various processing for convolution. Unlike conventional graphical processing unit (GPU) or central processing unit (CPU) processing cores, where each core is configured to receive its own unique processing instruction, the computation units of the computational array each perform the same computation in parallel in response to an individual instruction received by the computational array.
In various embodiments, the data input to the computational array is prepared using a hardware data formatter. For example, a hardware data formatter is utilized to load and align data elements using subsets of elements where the elements of each subset are located consecutively in memory and the subsets need not be located consecutively in memory. In various embodiments, the various subsets may each have a memory location independent from other subsets. For example, the different subsets may be located non-consecutively in memory from one another. By restricting the data elements within a subset to consecutive data, multiple consecutive data elements are processed together, which minimizes the calculations and delay incurred when preparing the data for a computational array. For example, a subset of data elements may be cached as a consecutive sequence of data elements by performing a cache check on the start and end element and, in the event of a cache miss on either element, a single data read to load the entire subset from memory into a memory cache. Once all the data elements are available, the data may be provided together to the computational array as a group of values to be processed in parallel.
In some embodiments, a microprocessor system comprises a computational array and a hardware data formatter. For example, a microprocessor system includes a matrix processor capable of performing matrix and vector operations. In various embodiments, the computational array includes a plurality of computation units. For example, the computation units may be sub-circuits of a matrix processor that include the functionality for performing one or more multiply, add, accumulate, and shift operations. As another example, computation units may be sub-circuits that include the functionality for performing a dot-product operation. In various embodiments, the computational array includes a sufficient number of computation units for performing multiple operations on the data inputs in parallel. For example, a computational array configured to receive M operands and N operands may include at least M×N computation units. In various embodiments, each of the plurality of computation units operates on a corresponding value formatted by a hardware data formatter and the values operated on by the plurality of computation units are synchronously provided together to the computational array as a group of values to be processed in parallel. For example, values corresponding to elements of a matrix are processed by one or more hardware data formatters and provided to the computational array together as a group of values to be processed in parallel.
In various embodiments, a hardware data formatter is configured to gather the group of values to be processed in parallel by the computational array. For example, a hardware data formatter retrieves the values from memory, such as static random access memory (SRAM), via a cache. In some embodiments, in the event of a cache miss, the hardware data formatter loads the values into the cache from memory and subsequently retrieves the values from the cache. In various embodiments, the values provided to the computational array correspond to computational operands. For example, a hardware data formatter may process M operands as an input vector to a computational array. In various embodiments, a second hardware data formatter may process N operands as a second input vector to the computational array. In some embodiments, each hardware data formatter processes a group of values synchronously provided together to the computational array, where each group of values includes a first subset of values located consecutively in memory and a second subset of values located consecutively in memory, yet the first subset of values is not located consecutively in the memory from the second subset of values. For example, a hardware data formatter loads a first subset of values stored consecutively in memory and a second subset of values also stored consecutively in memory but with a gap in memory between the two subsets of values. Each subset of values is loaded as consecutive values into the hardware data formatter. To prepare an entire vector of inputs for a computational array, the hardware data formatter performs loads based on the number of subsets instead of based on the total number of elements needed for an input operand to a computational array.
In the example shown, data formatter 104 and weight formatter 106 are hardware data formatters for preparing data for matrix processor 107. In various embodiments, the data values received at data formatter 104 and/or the weight values received at weight formatter 106 are provided by memory 102 and/or cache 103. In various embodiments, the values are requested by the data formatter 104 and/or weight formatter 106. In some embodiments, the values are requested by control unit 101 and provided to data formatter 104 and/or weight formatter 106. In some embodiments, data formatter 104 and weight formatter 106 include a logic circuit for preparing data for matrix processor 107 and/or a memory cache or buffer for storing and processing input data. For example, data formatter 104 may prepare N operands from a two-dimensional array retrieved from memory 102 (potentially via cache 103) that correspond to image data. Weight formatter 106 may prepare M operands retrieved from memory 102 (potentially via cache 103) that correspond to a vector of weight values. Data formatter 104 and weight formatter 106 prepare the N and M operands to be processed by matrix processor 107. In some embodiments, microprocessor system 100, including at least hardware data formatters data formatter 104 and weight formatter 106, matrix processor 107, vector engine 111, and post-processing unit 115, performs the processes described below with respect to
In some embodiments, matrix processor 107 is a computational array that includes a plurality of computation units. For example, a matrix processor receiving M operands and N operands from weight formatter 106 and data formatter 104, respectively, includes M×N computation units. In the figure shown, the small squares inside matrix processor 107 depict that matrix processor 107 includes a logical two-dimensional array of computation units. Computation unit 109 is one of a plurality of computation units of matrix processor 107. In some embodiments, each computation unit is configured to receive one operand from data formatter 104 and one operand from weight formatter 106. In some embodiments, the computation units are configured according to a logical two-dimensional array but the matrix processor is not necessarily fabricated with computation units laid out as a physical two-dimensional array. For example, the i-th operand of data formatter 104 and the j-th operand of weight formatter 106 are configured to be processed by the i-th×j-th computation unit of matrix processor 107.
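The logical two-dimensional arrangement can be sketched in software as follows. This is an illustrative model under stated assumptions: the function names `make_array` and `step` are hypothetical, each computation unit is assumed to multiply its two operands and add the product to its own accumulator, and the nested loops stand in for work the hardware performs in parallel.

```python
def make_array(n_data, m_weight):
    """Create an N x M grid of accumulators, one per logical computation unit."""
    return [[0 for _ in range(m_weight)] for _ in range(n_data)]


def step(accumulators, data_vector, weight_vector):
    """Apply one instruction to every unit: unit (i, j) receives the i-th data
    operand and the j-th weight operand and accumulates their product."""
    for i, data_operand in enumerate(data_vector):
        for j, weight_operand in enumerate(weight_vector):
            accumulators[i][j] += data_operand * weight_operand
    return accumulators
```

Feeding a sequence of input vector pairs through `step` accumulates a sum of outer products, which is one way a matrix multiplication can be built up from vector inputs.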
In various embodiments, the data width of components data formatter 104, weight formatter 106, matrix processor 107, vector engine 111, and post-processing unit 115 are wide data widths and include the ability to transfer more than one operand in parallel. In some embodiments, data formatter 104 and weight formatter 106 are each 96-bytes wide. In some embodiments, data formatter 104 is 192-bytes wide and weight formatter 106 is 96-bytes wide. In various embodiments, the width of data formatter 104 and weight formatter 106 is dynamically configurable. For example, data formatter 104 may be dynamically configured to 96 or 192 bytes and weight formatter 106 may be dynamically configured to 96 or 48 bytes. In some embodiments, the dynamic configuration is controlled by control unit 101. In various embodiments, a data width of 96 bytes allows 96 operands to be processed in parallel. For example, in an embodiment with data formatter 104 configured to be 96-bytes wide, data formatter 104 can transfer 96 operands to matrix processor 107 in parallel.
In various embodiments, memory 102 and/or cache 103 provide input data to hardware data formatters data formatter 104 and weight formatter 106 based on memory addresses calculated by the hardware data formatters. In some embodiments, data formatter 104 and/or weight formatter 106 retrieves, via memory 102 and/or cache 103, a stream of data corresponding to one or more subsets of values stored consecutively in memory. Data formatter 104 and/or weight formatter 106 may retrieve one or more subsets of values stored consecutively in memory and prepare the data as input values for matrix processor 107. In various embodiments, the one or more subsets of values are not themselves stored consecutively in memory with other subsets of values. In some embodiments, memory 102 is a memory module that contains a single read port. In some embodiments, memory 102 is static random access memory (SRAM). In some embodiments, the memory contains a limited number of read ports and the number of read ports is fewer than the data width of components data formatter 104, weight formatter 106, matrix processor 107, vector engine 111, and/or post-processing unit 115. In various embodiments, reads to memory 102 are managed by arbiter 123. Arbiter 123 queues the read requests and determines when each read request may be granted access to memory 102. In various embodiments, the requests are queued in a first-in-first-out manner by arbiter 123. In some embodiments, the requests are queued by arbiter 123 by associating a priority with each request. In various embodiments, once a read request is granted access to memory and/or the read is performed, arbiter 123 signals control queue 121 that the read is or will be ready in a fixed number of clock cycles. In some embodiments, arbiter 123 signals control queue 121 that the read has been initiated. In some embodiments, arbiter 123 signals control queue 121 that the read has completed.
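The two queuing policies mentioned for the arbiter can be contrasted with a brief sketch. The class names `FifoArbiter` and `PriorityArbiter` are hypothetical, and the convention that a lower number means a higher priority, with ties broken by arrival order, is an assumption of this example rather than something the description specifies.

```python
import heapq
from collections import deque
from itertools import count


class FifoArbiter:
    """Grants read requests strictly in arrival order."""

    def __init__(self):
        self._queue = deque()

    def submit(self, request):
        self._queue.append(request)

    def grant(self):
        return self._queue.popleft()


class PriorityArbiter:
    """Grants the request with the lowest priority number first (assumed
    convention); equal priorities are granted in arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = count()

    def submit(self, request, priority):
        # The arrival counter keeps ordering stable among equal priorities.
        heapq.heappush(self._heap, (priority, next(self._arrival), request))

    def grant(self):
        return heapq.heappop(self._heap)[2]
```

If control operations are queued first-in-first-out, a priority policy that reorders grants would need some way to match each ready signal to its control operation; the sketch does not address that.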
In various embodiments, the read allowed by arbiter 123 results in data being read and transferred to data formatter 104 and/or weight formatter 106. In some embodiments, a hardware data formatter, such as data formatter 104 or weight formatter 106, will perform a cache check on cache 103 to determine whether each subset of values is in cache 103 prior to issuing a read request to memory 102. In various embodiments, the read request is issued to arbiter 123. In the event the subset of values is cached, a hardware data formatter (e.g., data formatter 104 or weight formatter 106) will retrieve the data from cache 103. In various embodiments, in the event of a cache miss, the hardware data formatter (e.g., data formatter 104 or weight formatter 106) will retrieve the entire subset of values from memory 102 and populate cache 103 with the retrieved values.
In various embodiments, control queue 121 queues control operations to matrix processor 107 in order to synchronize the arrival of a control operation at matrix processor 107 with the arrival of the corresponding operands from data formatter 104 and/or weight formatter 106. For example, control queue 121 includes a first-in-first-out queue for queuing computational operations, such as matrix operations and/or convolution operations, for a computational array such as a matrix processor. Control queue 121 receives a signal from arbiter 123, such as a ready signal, when the corresponding operands for a queued control operation are ready. In some embodiments, the ready state is based on the operands for matrix processor 107 being available in a fixed number of clock cycles. In some embodiments, the ready signal corresponds to the memory access granted for reading the operands from memory 102. In some embodiments, the ready signal corresponds to the memory read completing for the operands corresponding to a queued control operation. Although not depicted in
In various embodiments, matrix processor 107 is configured to receive N bytes from data formatter 104 and M bytes from weight formatter 106 and includes at least M×N computation units. For example, matrix processor 107 may be configured to receive 96 bytes from data formatter 104 and 96 bytes from weight formatter 106 and includes at least 96×96 computation units. As another example, matrix processor 107 may be configured to receive 192 bytes from data formatter 104 and 48 bytes from weight formatter 106 and includes at least 192×48 computation units. In various embodiments, the dimensions of matrix processor 107 may be dynamically configured. For example, the default dimensions of matrix processor 107 may be configured to receive 96 bytes from data formatter 104 and 96 bytes from weight formatter 106 but the input dimensions may be dynamically configured to 192 bytes and 48 bytes, respectively. In various embodiments, the output size of each computation unit is equal to or larger than the input size. For example, in some embodiments, the input to each computation unit is two 1-byte operands, one corresponding to an operand from data formatter 104 and one from weight formatter 106, and the output of processing the two operands is a 4-byte result. As another example, matrix processor 107 may be configured to receive 96 bytes from data formatter 104 and 96 bytes from weight formatter 106 and output 96 4-byte results. In some embodiments, the output of matrix processor 107 is a vector. For example, a matrix processor configured to receive two 96-wide input vectors, where each element (or operand) of the input vector is one byte in size, can output a 96-wide vector result where each element of the vector result is 4 bytes in size.
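The sizes quoted above follow from simple arithmetic, worked through below. The helper names `computation_units` and `result_vector_bytes` are hypothetical, and the defaults of 1-byte operands and 4-byte results reflect only the example embodiment described in this paragraph.

```python
def computation_units(data_bytes, weight_bytes, operand_bytes=1):
    """Minimum number of computation units for the given input widths."""
    n_operands = data_bytes // operand_bytes
    m_operands = weight_bytes // operand_bytes
    return m_operands * n_operands


def result_vector_bytes(elements, result_bytes=4):
    """Total size in bytes of an output vector of the given element count."""
    return elements * result_bytes


print(computation_units(96, 96))   # 96 x 96 configuration
print(computation_units(192, 48))  # 192 x 48 configuration
print(result_vector_bytes(96))     # 96-wide vector of 4-byte results
```

Both example configurations require the same number of computation units (9,216), which is consistent with the same array being dynamically reconfigured between the two shapes. A 96-wide vector of 4-byte results occupies 384 bytes.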
In various embodiments, each computation unit of matrix processor 107 is a sub-circuit that includes an arithmetic logic unit, an accumulator, and a shadow register. In the example shown, the computation units of matrix processor 107 can perform an arithmetic operation on the M operands and N operands from weight formatter 106 and data formatter 104, respectively. In various embodiments, each computation unit is configured to perform one or more multiply, add, accumulate, and/or shift operations. In some embodiments, each computation unit is configured to perform a dot-product operation. For example, in some embodiments, a computation unit may perform multiple dot-product component operations to calculate a dot-product result. For example, the array of computation units of matrix processor 107 may be utilized to perform convolution steps required for performing inference using a machine learning model. A two-dimensional data set, such as an image, may be formatted and fed into matrix processor 107 using data formatter 104, one vector at a time. In parallel, a filter of weights may be applied to the two-dimensional data set by formatting the weights and feeding them as a vector into matrix processor 107 using weight formatter 106. Corresponding computation units of matrix processor 107 perform a matrix processor instruction on the corresponding operands of the weight and data inputs in parallel.
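A single computation unit of the kind described can be modeled as below. This sketch makes assumptions that the description does not confirm: the class name `ComputationUnit` and its methods are hypothetical, and the shadow register is assumed to hold a finished result for read-out while the accumulator is cleared for the next dot-product. Each call to `multiply_accumulate` stands for one dot-product component operation.

```python
class ComputationUnit:
    """Toy model of one unit: a multiply-accumulate path plus a shadow register."""

    def __init__(self):
        self.accumulator = 0
        self.shadow = 0

    def multiply_accumulate(self, data_operand, weight_operand):
        # One dot-product component: multiply the operand pair and accumulate.
        self.accumulator += data_operand * weight_operand

    def latch(self):
        # Assumed role of the shadow register: capture the finished result
        # and clear the accumulator so a new dot-product can begin.
        self.shadow = self.accumulator
        self.accumulator = 0
        return self.shadow


unit = ComputationUnit()
for data_operand, weight_operand in zip([1, 2, 3], [4, 5, 6]):
    unit.multiply_accumulate(data_operand, weight_operand)
result = unit.latch()  # 1*4 + 2*5 + 3*6
```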
In some embodiments, vector engine 111 is a vector computational unit that is communicatively coupled to matrix processor 107. Vector engine 111 includes a plurality of processing elements including processing element 113. In the figure shown, the small squares inside vector engine 111 depict that vector engine 111 includes a plurality of processing elements arranged as a vector. In some embodiments, the processing elements are arranged in a vector in the same direction as data formatter 104. In some embodiments, the processing elements are arranged in a vector in the same direction as weight formatter 106. In various embodiments, the data size of the processing elements of vector engine 111 is the same size or larger than the data size of the computation units of matrix processor 107. For example, in some embodiments, computation unit 109 receives two operands each 1 byte in size and outputs a result 4 bytes in size. Processing element 113 receives the 4-byte result from computation unit 109 as an input 4 bytes in size. In various embodiments, the output of vector engine 111 is the same size as the input to vector engine 111. In some embodiments, the output of vector engine 111 is smaller in size compared to the input to vector engine 111. For example, vector engine 111 may receive up to 96 elements each 4 bytes in size and output 96 elements each 1 byte in size. As described above, in some embodiments, the communication channel from data formatter 104 and weight formatter 106 to matrix processor 107 is 96-elements wide with each element 1 byte in size and matches the output size of vector engine 111 (96-elements wide with each element 1 byte in size).
In some embodiments, the processing elements of vector engine 111, including processing element 113, each include an arithmetic logic unit (ALU) (not shown). For example, in some embodiments, the ALU of each processing element is capable of performing arithmetic operations. In some embodiments, each ALU of the processing elements is capable of performing in parallel a rectified linear unit (ReLU) function and/or scaling functions. In some embodiments, each ALU is capable of performing a non-linear function including non-linear activation functions. In various embodiments, each processing element of vector engine 111 includes one or more flip-flops for receiving input operands. In some embodiments, each processing element has access to a slice of a vector engine accumulator and/or vector registers of vector engine 111. For example, a vector engine capable of receiving 96-elements includes a 96-element wide accumulator and one or more 96-element vector registers. Each processing element has access to a one-element slice of the accumulator and/or vector registers. In some embodiments, each element is 4-bytes in size. In various embodiments, the accumulator and/or vector registers are sized to fit at least the size of an input data vector. In some embodiments, vector engine 111 includes additional vector registers sized to fit the output of vector engine 111.
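The per-element work of the vector engine can be sketched as a function applied independently to each element. The name `relu_and_scale` is hypothetical, and the specific scaling used here (an arithmetic right shift followed by clamping to the unsigned 1-byte range 0 to 255) is an assumption chosen to show how 4-byte inputs might be reduced to 1-byte outputs; the description does not specify the scaling function.

```python
def relu_and_scale(values, shift=8):
    """Apply ReLU, then scale each element down to fit in one byte.

    Each loop iteration models one processing element; in hardware these
    would run in parallel.
    """
    outputs = []
    for value in values:
        value = max(value, 0)            # rectified linear unit
        value >>= shift                  # assumed scaling: right shift
        outputs.append(min(value, 255))  # clamp to the 1-byte range
    return outputs
```

With the default shift of 8, the inputs -5, 256, 1024, and 1048576 map to 0, 1, 4, and 255 respectively.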
In some embodiments, the processing elements of vector engine 111 are configured to receive data from matrix processor 107 and each of the processing elements can process the received portion of data in parallel. As one example of a processing element, processing element 113 of vector engine 111 receives data from computation unit 109 of matrix processor 107. In various embodiments, vector engine 111 receives a single vector processor instruction and in turn each of the processing elements performs the processor instruction in parallel with the other processing elements. In some embodiments, the processor instruction includes one or more component instructions, such as a load, a store, and/or an arithmetic logic unit operation. In various embodiments, a no-op operation may be used to replace a component instruction.
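The idea of a single instruction carrying several component instructions can be illustrated as follows. This is a hedged sketch: the function name `execute_vector_instruction`, the dictionary encoding of an instruction, and the use of `None` to represent a no-op component are all assumptions for illustration and are not an instruction format taken from the description.

```python
def execute_vector_instruction(instruction, inputs, accumulator):
    """Run one instruction on every processing element.

    instruction: dict with optional keys 'load' (bool), 'alu' (callable),
    and 'store' (bool). A missing or None entry is treated as a no-op.
    """
    load = instruction.get("load")
    alu = instruction.get("alu")
    store = instruction.get("store")
    outputs = [None] * len(inputs)
    # Every lane executes the same components; hardware would do so in parallel.
    for lane in range(len(inputs)):
        if load:
            accumulator[lane] = inputs[lane]
        if alu is not None:
            accumulator[lane] = alu(accumulator[lane])
        if store:
            outputs[lane] = accumulator[lane]
    return outputs
```

For example, an instruction with all three components loads each element, applies the arithmetic operation, and stores the result in a single pass, while replacing the `alu` component with a no-op turns the same instruction into a plain copy.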
In the example shown, the dotted arrows between data formatter 104 and matrix processor 107, weight formatter 106 and matrix processor 107, matrix processor 107 and vector engine 111, and vector engine 111 and post-processing unit 115 depict couplings between the respective pairs of components that are capable of sending multiple data elements such as a vector of data elements. As an example, the communication channel between matrix processor 107 and vector engine 111 may be 96×32 bits wide and support transferring 96 elements in parallel where each element is 32 bits in size. As another example, the communication channel between vector engine 111 and post-processing unit 115 may be 96×1 byte wide and support transferring 96 elements in parallel where each element is 1 byte in size. In various embodiments, input to data formatter 104 and weight formatter 106 are retrieved from memory 102 and/or cache 103. In some embodiments, vector engine 111 is additionally coupled to a memory module (not shown in
In some embodiments, one or more computation units of matrix processor 107 may be grouped together into a lane such that matrix processor 107 has multiple lanes. In various embodiments, the lanes of matrix processor 107 may be aligned with either data formatter 104 or weight formatter 106. For example, a lane aligned with weight formatter 106 includes a set of computation units that are configured to receive as input every operand of weight formatter 106. Similarly, a lane aligned with data formatter 104 includes a set of computation units that are configured to receive as input every operand of data formatter 104. In the example shown in
In some embodiments, control unit 101 synchronizes the processing performed by data formatter 104, weight formatter 106, arbiter 123, matrix processor 107, vector engine 111, and post-processing unit 115. For example, control unit 101 may send processor specific control signals and/or instructions to each of data formatter 104, weight formatter 106, matrix processor 107, vector engine 111, and post-processing unit 115. In some embodiments, a control signal is utilized instead of a processor instruction. Control unit 101 may send matrix processor instructions to matrix processor 107. A matrix processor instruction may be a computational array instruction that instructs a computational array to perform an arithmetic operation, such as a dot-product or dot-product component, using specified operands retrieved from memory 102 and/or cache 103 that are formatted by data formatter 104 and/or weight formatter 106, respectively. Control unit 101 may send vector processor instructions to vector engine 111. For example, a vector processor instruction may include a single processor instruction with a plurality of component instructions to be executed together by the vector computational unit. Control unit 101 may send post-processing instructions to post-processing unit 115. In various embodiments, control unit 101 synchronizes data that is fed to matrix processor 107 from data formatter 104 and weight formatter 106, to vector engine 111 from matrix processor 107, and to post-processing unit 115 from vector engine 111. In some embodiments, control unit 101 synchronizes the data between different components of microprocessor system 100 including between data formatter 104, weight formatter 106, matrix processor 107, vector engine 111, and/or post-processing unit 115 by utilizing processor specific memory, queue, and/or dequeue operations and/or control signals.
In some embodiments, data and instruction synchronization is performed by control unit 101 that includes one or more sequencers to synchronize processing between data formatter 104, weight formatter 106, matrix processor 107, vector engine 111, and/or post-processing unit 115. In some embodiments, data and instruction synchronization is performed by using arbiter 123 to initiate the dequeueing of a control operation queued at control queue 121 to synchronize the arrival of operands at matrix processor 107 via data formatter 104 and weight formatter 106 with the arrival of the corresponding control operation.
In some embodiments, data formatter 104, weight formatter 106, matrix processor 107, and vector engine 111 are utilized for processing convolution layers. For example, matrix processor 107 may be used to perform calculations associated with one or more convolution layers of a convolutional neural network. Data formatter 104 and weight formatter 106 may be utilized to prepare matrix and/or vector data in a format for processing by matrix processor 107. Memory 102 may store image data such as one or more image channels captured by sensors (not shown), where sensors include, as an example, cameras mounted to a vehicle. Memory 102 may store weights determined by training a machine learning model for autonomous driving. In some embodiments, vector engine 111 is utilized for performing non-linear functions such as an activation function on the output of matrix processor 107. For example, matrix processor 107 may be used to calculate a dot-product and vector engine 111 may be used to perform an activation function such as a rectified linear unit (ReLU) or sigmoid function. In some embodiments, post-processing unit 115 is utilized for performing pooling operations. In some embodiments, post-processing unit 115 is utilized for formatting and storing the processed data to memory and may be utilized for synchronizing memory writing latency.
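As an illustrative, non-limiting sketch, the division of work described above (dot-products, then an activation function, then pooling) may be modeled in software as follows. The function names, the one-dimensional layout, and the pooling window of two are assumptions made for illustration only and are not part of the described hardware.

```python
def dot(weights, data):
    """Dot product of one filter with one region (the matrix-processor role)."""
    return sum(w * d for w, d in zip(weights, data))


def relu(values):
    """Element-wise rectified linear unit (the vector-engine role)."""
    return [max(0, v) for v in values]


def max_pool_1d(values, window=2):
    """Non-overlapping max pooling (the post-processing role).

    The window size is a hypothetical choice for this sketch.
    """
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]


def conv_layer_sketch(weights, regions):
    """Chain the three stages over a list of equally sized input regions."""
    feature = [dot(weights, region) for region in regions]
    return max_pool_1d(relu(feature))
```

In this sketch each stage consumes the complete output of the previous stage. In the described system the stages are separate hardware units whose data movement is synchronized by the control unit.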
At 201, input channels are received as input data to the microprocessor system. For example, vision data is captured using sensors and may include one or more channels corresponding to different color channels for the colors red, green, and blue. In various embodiments, multiple channels may be utilized as the different channels may contain different forms of information. As another example, non-sensor data may be utilized as input data. In various embodiments, the input channels may be loaded from memory via a cache using subsets of consecutively stored data in memory. In some embodiments, the input channels may be retrieved and/or formatted for processing using a hardware data formatter such as data formatter 104 of
At 203, one or more filters are received for processing the input channels. For example, a filter in the form of a matrix contains learned weights and is used to identify activations in the channels. In some embodiments, the filter is a square matrix kernel smaller than the input channel. In various embodiments, filters may be utilized to identify particular shapes, edges, lines, and other features and/or activations in the input data. In some embodiments, the filters and associated weights that make up the filter are created by training a machine learning model using a training corpus of data similar to the input data. In various embodiments, the received filters may be streamed from memory. In some embodiments, the filters may be retrieved and/or formatted for processing using a hardware data formatter such as weight formatter 106 of
At 205, one or more feature layers are determined using the received input channels and filters. In various embodiments, the feature layers are determined by performing one or more convolution operations using a computational array such as matrix processor 107 of
At 207, an activation function is performed on one or more feature layers. For example, an element-wise activation function, such as a rectified linear unit (ReLU) function, is performed using a vector processor such as vector engine 111 of
At 209, pooling is performed on the activation layers created at 207. For example, a pooling layer is generated by a post-processing unit such as post-processing unit 115 of
In various embodiments, the process of
At 301, data input is received. For example, data input corresponding to sensor data is received by a hardware data formatter for formatting. In some embodiments, data input is retrieved from memory 102 of
At 303, data input is formatted using a hardware data formatter. For example, a hardware data formatter such as data formatter 104 of
At 305, weight input is received. For example, weight input corresponding to machine learning weights of a filter is received by a hardware data formatter for formatting. In some embodiments, weight input is retrieved from memory 102 of
At 307, weight input is formatted using a hardware data formatter. For example, a hardware data formatter such as weight formatter 106 of
At 309, matrix processing is performed. For example, the operands formatted at 303 and 307 are received by each of the computation units of a computational array for processing. In some embodiments, the matrix processing is performed using a matrix processor such as matrix processor 107 of
At 311, vector processing is performed. For example, an element-wise activation function may be performed on the result of the matrix processing performed at 309. In some embodiments, an activation function is a non-linear activation function such as a rectified linear unit (ReLU), sigmoid, or other appropriate function. In some embodiments, the vector processor is utilized to implement scaling, normalization, or other appropriate techniques. For example, a bias parameter may be introduced to the result of a dot-product using the vector processor. In some embodiments, the result of 311 is a series of activation maps or activation layers. In some embodiments, vector processing is performed using a vector engine such as vector engine 111 of
At 313, post-processing is performed. For example, a pooling layer may be implemented using a post-processing processor such as post-processing unit 115 of
In various embodiments, the process of
In some embodiments, Clock signal 410 is a clock signal received by computation unit 400. In various embodiments, each computation unit of the computational array receives the same clock signal and the clock signal is utilized to synchronize the processing of each computation unit with the other computation units.
In the example shown, multiplier 430 receives and performs a multiplication operation on the input values data 404 and weight 402. The output of multiplier 430 is fed to adder 432. Adder 432 receives and performs an addition on the output of multiplier 430 and the output of logic 434. The output of adder 432 is fed to accumulator 424. In some embodiments, input values data 404 and weight 402 are lines that cross computation units and feed the corresponding data and/or weight to neighboring computation units. For example, in some embodiments, data 404 is fed to all computation units in the same column and weight 402 is fed to all computation units in the same row. In various embodiments, data 404 and weight 402 correspond to input elements fed to computation unit 400 from a data hardware data formatter and a weight hardware data formatter, respectively. In some embodiments, the data hardware data formatter and the weight hardware data formatter are data formatter 104 and weight formatter 106 of
In some embodiments, ClearAcc signal 408 clears the contents of accumulator 424. As an example, accumulation can be reset by clearing accumulator 424, which is then used to accumulate the results of multiplier 430. In some embodiments, ClearAcc signal 408 is used to clear accumulator 424 for performing a new dot-product operation. For example, element-wise multiplications are performed by multiplier 430 and the partial dot-product results are added using adder 432 and accumulator 424.
In various embodiments, accumulator 424 is an accumulator capable of accumulating the result of adder 432 and indirectly the result of multiplier 430. For example, in some embodiments, accumulator 424 is configured to accumulate the result of multiplier 430 with the contents of accumulator 424 based on the status of ClearAcc signal 408. As another example, based on the status of ClearAcc signal 408, the current result stored in accumulator 424 may be ignored by adder 432. In the example shown, accumulator 424 is a 32-bit wide accumulator. In various embodiments, accumulator 424 may be sized differently, e.g., 8-bits, 16-bits, 64-bits, etc., as appropriate. In various embodiments, each accumulator of the plurality of computation units of a computational array is the same size. In various embodiments, accumulator 424 may accumulate and save data, accumulate and clear data, or just clear data. In some embodiments, accumulator 424 may be implemented as an accumulation register. In some embodiments, accumulator 424 may include a set of arithmetic logic units (ALUs) that include registers.
In some embodiments, ResultEnable signal 412 is activated in response to a determination that data 404 is valid. For example, ResultEnable signal 412 may be enabled to enable processing by a computation unit such as processing by multiplier 430 and adder 432 into accumulator 424.
In some embodiments, ResultCapture signal 414 is utilized to determine the functionality of multiplexer 426. Multiplexer 426 receives as input ResultIn 406, output of accumulator 424, and ResultCapture signal 414. In various embodiments, ResultCapture signal 414 is used to enable either ResultIn 406 or the output of accumulator 424 to pass through as the output of multiplexer 426. In some embodiments, multiplexer 426 is implemented as an output register. In some embodiments, ResultIn 406 is connected to a computation unit in the same column as computation unit 400. For example, the output of a neighboring computation unit is fed in as an input value ResultIn 406 to computation unit 400. In some embodiments, the value received from a neighboring computation unit is that computation unit's corresponding ResultOut value.
In some embodiments, shadow register 428 receives as input the output of multiplexer 426. In some embodiments, shadow register 428 is configured to receive the output of accumulator 424 via multiplexer 426 depending on the value of ResultCapture signal 414. In the example shown, the output of shadow register 428 is output value ResultOut 450. In various embodiments, once a result is inserted into shadow register 428, accumulator 424 may be used to commence new calculations. For example, once the final dot-product result is stored in shadow register 428, accumulator 424 may be cleared and used to accumulate and store the partial result and eventually the final result of a new dot-product operation on new weight and data input values. In the example shown, shadow register 428 receives a signal ShiftEn signal 416. In various embodiments, ShiftEn signal 416 is used to enable or disable the storing of values in the shadow register 428. In some embodiments, ShiftEn signal 416 is used to shift the value stored in shadow register 428 to output value ResultOut 450. For example, when ShiftEn signal 416 is enabled, the value stored in shadow register 428 is shifted out of shadow register 428 as output value ResultOut 450. In some embodiments, ResultOut 450 is connected to a neighboring computation unit's input value ResultIn. In some embodiments, the last cell of a column of computation units is connected to the output of the computational array. In various embodiments, the output of the computational array feeds into a vector engine such as vector engine 111 of
In the example shown, shadow register 428 is 32-bits wide. In various embodiments, shadow register 428 may be sized differently, e.g., 8-bits, 16-bits, 64-bits, etc., as appropriate. In various embodiments, each shadow register of the plurality of computation units of a computational array is the same size. In various embodiments, shadow register 428 is the same size as accumulator 424. In various embodiments, the size of multiplexer 426 is based on the size of accumulator 424 and/or shadow register 428 (e.g., the same size or larger).
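As an illustrative sketch, the chaining of ResultOut to a neighboring computation unit's ResultIn described above may be modeled as a shift through a column of shadow registers. The function name, the list representation, and the assumption that the first cell's ResultIn is tied to zero are hypothetical and are not taken from the described hardware.

```python
def shift_column(shadows, cycles):
    """Shift shadow-register values down a column toward the array output.

    shadows[0] models the cell farthest from the output and shadows[-1]
    models the last cell of the column. Returns the values emitted at the
    array output, in order, and the remaining column contents.
    """
    column = list(shadows)
    emitted = []
    for _ in range(cycles):
        # The last cell of the column drives the output of the array.
        emitted.append(column[-1])
        # Every other cell takes its neighbor's ResultOut as ResultIn.
        # Assumption: the first cell's ResultIn is tied to zero.
        column = [0] + column[:-1]
    return emitted, column
```

Because the results sit in shadow registers while they are shifted out, the accumulators in this model would be free to start new dot-products during the same cycles.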
In some embodiments, logic 434, 436, and 438 receive signals, such as control signals, to enable and/or configure the functionality of computation unit 400. In various embodiments, logic 434, 436, and 438 are implemented using AND gates and/or functionality corresponding to an AND gate. For example, as described above, logic 434 receives ClearAcc signal 408 and an input value corresponding to the value stored in accumulator 424. Based on ClearAcc signal 408, the output of logic 434 is determined and fed to adder 432. As another example, logic 436 receives ResultEnable signal 412 and Clock signal 410. Based on ResultEnable signal 412, the output of logic 436 is determined and fed to accumulator 424. As another example, logic 438 receives ShiftEn signal 416 and Clock signal 410. Based on ShiftEn signal 416, the output of logic 438 is determined and fed to shadow register 428.
In various embodiments, computation units may perform a multiplication, an addition operation, and a shift operation at the same time, i.e., within a single cycle, thereby doubling the total number of operations that occur each cycle. In some embodiments, results are moved from multiplexer 426 to shadow register 428 in a single clock cycle, i.e., without the need of intermediate execute and save operations. In various embodiments, the clock cycle is based on the signal received at Clock signal 410.
In various embodiments, input values weight 402 and data 404 are 8-bit values. In some embodiments, weight 402 is a signed value and data 404 is unsigned. In various embodiments, weight 402 and data 404 may be signed or unsigned, as appropriate. In some embodiments, ResultIn 406 and ResultOut 450 are 32-bit values. In various embodiments ResultIn 406 and ResultOut 450 are implemented using a larger number of bits than input operands weight 402 and data 404. By utilizing a large number of bits, the results of multiplying multiple pairs of weight 402 and data 404, for example, to calculate a dot-product result, may be accumulated without overflowing the scalar result.
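As a worked example of the headroom described above, the following sketch computes a conservative bound on how many worst-case products can be accumulated before overflow. It assumes a signed 8-bit weight, an unsigned 8-bit data value, and a signed two's-complement accumulator. The signedness of the accumulator and the function name are assumptions made for illustration.

```python
def max_safe_products(acc_bits=32, weight_bits=8, data_bits=8):
    """Conservative count of worst-case products that cannot overflow.

    Assumes a signed weight, an unsigned data value, and a signed
    two's-complement accumulator (assumptions of this sketch).
    """
    largest_weight = 1 << (weight_bits - 1)        # magnitude of -128
    largest_data = (1 << data_bits) - 1            # 255
    largest_product = largest_weight * largest_data
    accumulator_limit = (1 << (acc_bits - 1)) - 1  # largest positive value
    return accumulator_limit // largest_product
```

Under these assumptions, `max_safe_products()` evaluates to 65793, whereas a 16-bit accumulator (`acc_bits=16`) would tolerate only one worst-case product.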
In some embodiments, computation unit 400 generates an intermediate and/or final computation result in accumulator 424. The final computation result is then stored in shadow register 428 via multiplexer 426. In some embodiments, multiplexer 426 functions as an output register and stores the output of accumulator 424. In various embodiments, the final computation result is the result of a convolution operation. For example, the final result at ResultOut 450 is the result of convolution between a filter received by computation unit 400 as input values using weight 402 and a two-dimensional region of sensor data received by computation unit 400 as input values using data 404.
As an example, a convolution operation may be performed using computation unit 400 on a 2×2 data input matrix [d0 d1; d2 d3] corresponding to a region of sensor data and a filter corresponding to a 2×2 matrix of weights [w0 w1; w2 w3]. The 2×2 data input matrix has a first row [d0 d1] and a second row [d2 d3]. The filter matrix has a first row [w0 w1] and a second row [w2 w3]. In various embodiments, computation unit 400 receives the data matrix via data 404 as a one-dimensional input vector [d0 d1 d2 d3], one element per clock cycle, and the weight matrix via weight 402 as a one-dimensional input vector [w0 w1 w2 w3], one element per clock cycle. Using computation unit 400, the dot product of the two input vectors is computed to produce a scalar result at ResultOut 450. For example, multiplier 430 is used to multiply each corresponding element of the input weight and data vectors and the results are stored and added to previous results in accumulator 424. For example, the result of element d0 multiplied by element w0 (e.g., d0*w0) is first stored in cleared accumulator 424. Next, element d1 is multiplied by element w1 and added using adder 432 to the previous result stored in accumulator 424 (e.g., d0*w0) to compute the equivalent of d0*w0+d1*w1. Processing continues to the third pair of elements d2 and w2 to compute the equivalent of d0*w0+d1*w1+d2*w2 at accumulator 424. The last pair of elements is multiplied and the final result of the dot product is now stored in accumulator 424 (e.g., d0*w0+d1*w1+d2*w2+d3*w3). The dot-product result is then copied to shadow register 428. Once stored in shadow register 428, a new dot-product operation may be initiated, for example, using a different region of sensor data. Based on ShiftEn signal 416, the dot-product result stored in shadow register 428 is shifted out of shadow register 428 to ResultOut 450. In various embodiments, the weight and data matrices may be different dimensions than the example above. For example, larger dimensions may be used.
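As an illustrative sketch, the multiply-accumulate sequence of the example above may be modeled behaviorally as follows. The class and method names are hypothetical, the model is not cycle-accurate, and asserting the clear on the first pair of elements is an assumption about how ClearAcc would be driven.

```python
class ComputationUnit:
    """Behavioral model of one multiply-accumulate cell (a sketch only)."""

    def __init__(self):
        self.accumulator = 0
        self.shadow = 0

    def mac(self, weight, data, clear_acc=False):
        # When the clear is asserted, the stored accumulator value is
        # ignored by the adder, so a new dot-product starts from zero.
        previous = 0 if clear_acc else self.accumulator
        self.accumulator = weight * data + previous

    def capture(self):
        # Copy the finished result into the shadow register so that the
        # accumulator can be reused for the next dot-product.
        self.shadow = self.accumulator


def dot_product(unit, weights, data):
    """Feed one weight and one data element per step, then capture."""
    for index, (w, d) in enumerate(zip(weights, data)):
        unit.mac(w, d, clear_acc=(index == 0))
    unit.capture()
    return unit.shadow
```

For weights [1, 2, 3, 4] and data [5, 6, 7, 8], the accumulator in this model holds 5, 17, 38, and finally 70, which is then copied to the shadow register. A subsequent call to `mac` with the clear asserted changes the accumulator while the shadow register keeps the earlier result.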
In some embodiments, a bias parameter is introduced and added to the dot-product result using accumulator 424. In some embodiments, the bias parameter is received as input at either weight 402 or data 404 along with a multiplication identity element as the other input value. The bias parameter is multiplied against the identity element to preserve the bias parameter and the multiplication result (e.g., the bias parameter) is added to the dot-product result using adder 432. The addition result, a dot-product result offset by a bias value, is stored in accumulator 424 and later shifted out at ResultOut 450 using shadow register 428. In some embodiments, a bias is introduced using a vector engine such as vector engine 111 of
In various embodiments, microprocessor system 500 is microprocessor system 100 of
In some embodiments, matrix processor 507 is a computational array that includes a plurality of computation units. For example, a matrix processor receiving M operands and N operands from weight formatter 506 and data formatter 504, respectively, includes M×N computation units. In the figure shown, the small squares inside matrix processor 507 depict that matrix processor 507 includes a logical two-dimensional array of computation units. Computation unit 509 is one of a plurality of computation units of matrix processor 507. In some embodiments, each computation unit is configured to receive one operand from data formatter 504 and one operand from weight formatter 506. Matrix processor 507 and computation unit 509 are described in further detail with respect to matrix processor 107 and computation unit 109, respectively, of
In the example shown, the dotted arrows between data formatter 504 and matrix processor 507 and between weight formatter 506 and matrix processor 507 depict a coupling between the respective pairs of components that are capable of sending multiple data elements such as a vector of data elements. In various embodiments, the data width of components data formatter 504, weight formatter 506, and matrix processor 507 are wide data widths and include the ability to transfer more than one operand in parallel. The data widths of components data formatter 504, weight formatter 506, and matrix processor 507 are described in further detail with respect to corresponding components data formatter 104, weight formatter 106, and matrix processor 107 of
In various embodiments, the arrows in
In various embodiments, memory 502 is typically static random access memory (SRAM). In some embodiments, memory 502 has a single read port or a limited number of read ports. In some embodiments, the amount of memory 502 dedicated to storing data (e.g., sensor data, image data, etc.), weights (e.g., weight associated with image filters, etc.), and/or other data may be dynamically allocated. For example, memory 502 may be configured to partition more or less memory for data input compared to weight input based on a particular workload. In some embodiments, cache 503 includes one or more cache lines. For example, in some embodiments, cache 503 is a 1 KB cache that includes four cache lines where each cache line is 256 bytes. In various embodiments, the size of the cache may be larger or smaller, with fewer or more cache lines, have larger or smaller cache lines, and may be determined based on expected computation workload.
In various embodiments, hardware data formatters (e.g., data formatter 504 and weight formatter 506) calculate memory addresses to retrieve input values from memory 502 and cache 503 for processing by matrix processor 507. In some embodiments, data formatter 504 and/or weight formatter 506 stream data corresponding to a subset of values stored consecutively in memory 502 and/or cache 503. Data formatter 504 and/or weight formatter 506 may retrieve one or more subsets of values stored consecutively in memory and prepare the data as input values for matrix processor 507. In various embodiments, the one or more subsets of values are not themselves stored consecutively in memory with other subsets. In some embodiments, memory 502 contains a single read port. In some embodiments, memory 502 contains a limited number of read ports and the number of read ports is fewer than the data width of components data formatter 504, weight formatter 506, and matrix processor 507. In some embodiments, hardware data formatters 504, 506 will perform a cache check to determine whether a subset of values is in cache 503 prior to issuing a read request to memory 502. In the event the subset of values is cached, hardware data formatters 504, 506 will retrieve the data from cache 503. In various embodiments, in the event of a cache miss, hardware data formatters 504, 506 will retrieve the entire subset of values from memory 502 and populate a cache line of cache 503 with the retrieved values.
In some embodiments, control unit 501 initiates and synchronizes processing between components of microprocessor system 500, including components memory 502, data formatter 504, weight formatter 506, and matrix processor 507. In some embodiments, control unit 501 coordinates access to memory 502 including the issuance of read requests. In some embodiments, control unit 501 interfaces with memory 502 to initiate read requests. In various embodiments, the read requests are initiated by hardware data formatters 504, 506 via the control unit 501. In various embodiments, control unit 501 synchronizes data that is fed to matrix processor 507 from data formatter 504 and weight formatter 506. In some embodiments, control unit 501 synchronizes the data between different components of microprocessor system 500 including between data formatter 504, weight formatter 506, and matrix processor 507, by utilizing processor specific memory, queue, and/or dequeue operations and/or control signals. Additional functionality performed by control unit 501 is described in further detail with respect to control unit 101 of
In some embodiments, microprocessor system 500 is utilized for performing convolution operations. For example, matrix processor 507 may be used to perform calculations, including dot-product operations, associated with one or more convolution layers of a convolutional neural network. Data formatter 504 and weight formatter 506 may be utilized to prepare matrix and/or vector data in a format for processing by matrix processor 507. Memory 502 may be utilized to store data such as one or more image channels captured by sensors (not shown). Memory 502 may also include weights, including weights in the context of convolution filters, determined by training a machine learning model for autonomous driving.
In various embodiments, microprocessor system 500 may include additional components (not shown in
In various embodiments, a control unit (not shown) such as control unit 101 of
In various embodiments, the output of hardware data formatter 605 is fed as input to a computational array such as matrix processor 107 of
In some embodiments, only a portion of the elements in read buffers 621-632 is utilized as input to a computational array. For example, a two-dimensional 80×80 matrix may only utilize read buffers 621-630 (corresponding to 80 bytes, numbered bytes 0-79) to feed an 80-element row into a matrix processor. In various embodiments, hardware data formatter 605 may perform additional processing on one or more elements of read buffers 621-632 to prepare the elements as input to a computational array. For example, a computational array may be configured to receive 48 16-bit elements instead of 96 8-bit elements and hardware data formatter 605 may be configured to combine pairs of 1-byte elements to form 16-bit elements to prepare a 48-element 16-bit input vector for the computational array.
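As an illustrative sketch, combining pairs of 1-byte elements into 16-bit elements may be expressed as follows. The byte order (first byte of each pair as the low-order byte) and the function name are assumptions, since the description above does not specify them.

```python
def pack_pairs(byte_values):
    """Combine consecutive pairs of 8-bit values into 16-bit values.

    Assumption: the first byte of each pair is the low-order byte.
    """
    if len(byte_values) % 2 != 0:
        raise ValueError("expected an even number of 1-byte elements")
    return [byte_values[i] | (byte_values[i + 1] << 8)
            for i in range(0, len(byte_values), 2)]
```

Applied to 96 one-byte elements, this sketch yields the 48 16-bit elements mentioned above.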
In various embodiments, cache 603 is a memory cache of memory 601. In some embodiments, memory 601 is implemented using static random access memory (SRAM). In some embodiments, cache 603 is a 1 KB memory cache and each cache line 611, 613, 615, and 617 is 256 bytes. In various embodiments, reading data into cache 603 loads an entire cache line of data into one of cache lines 611, 613, 615, and 617. In various embodiments, cache 603 may be larger or smaller and have fewer or more cache lines. Moreover, in various embodiments, the cache lines may be a different size. The size and configuration of cache 603, cache lines 611, 613, 615, and 617, and memory 601 may be sized as appropriate for the particular workload of computational operations. For example, the size and number of image filters used for convolution may dictate a larger or smaller cache line and a larger or smaller cache.
In the example shown, the dotted-lined arrows originating from read buffers 621-632 indicate whether the data requested by hardware data formatter 605 exists as a valid entry in cache 603 and in particular which cache line holds the data. For example, read buffers 621, 622, and 623 request data that is found in cache line 611. Read buffers 626 and 627 request data that is found in cache line 613 and read buffers 630, 631, and 632 request data that is found in cache line 617. In various embodiments, each read buffer stores a subset of values located consecutively in the memory. The subsets of values stored at read buffers 621, 622, and 623 may not be located consecutively in memory with the subsets of values stored at read buffers 626 and 627 and also may not be located consecutively in memory with the subsets of values stored at read buffers 630, 631, and 632. In some scenarios, read buffers referencing the same cache line may store subsets of values that are not located consecutively in memory. For example, two read buffers may reference the same cache line of 256 bytes but different 8-byte subsets of consecutive values.
In the example shown, the data requested for read buffers 624, 625, 628, and 629 are not found in cache 603 and are cache misses. In the example shown, an “X” depicts a cache miss. In various embodiments, cache misses must be resolved by issuing a read for the corresponding subset of data from memory 601. In some embodiments, an entire cache line containing the requested subset of data is read from memory 601 and placed into a cache line of cache 603. Various techniques for cache replacement may be utilized as appropriate. Examples of cache replacement policies for determining the cache line to use include First In First Out, Least Recently Used, etc.
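As an illustrative sketch, the cache check and miss handling described above may be modeled with four 256-byte cache lines and a Least Recently Used replacement policy. The class name, the read-counting attribute, and the restriction that a requested subset does not cross a cache-line boundary are assumptions of this sketch.

```python
from collections import OrderedDict

LINE_SIZE = 256   # bytes per cache line, as in the 1 KB example
NUM_LINES = 4     # four cache lines


class LineCache:
    """Software model of a small line cache with LRU replacement."""

    def __init__(self, memory):
        self.memory = memory          # bytes-like backing store
        self.lines = OrderedDict()    # line index -> bytes, oldest first
        self.memory_reads = 0         # number of reads issued to memory

    def read(self, address, length=8):
        """Return `length` consecutive bytes starting at `address`."""
        line_index, offset = divmod(address, LINE_SIZE)
        if offset + length > LINE_SIZE:
            raise ValueError("subset crosses a cache-line boundary")
        if line_index in self.lines:
            # Cache hit: mark the line as most recently used.
            self.lines.move_to_end(line_index)
        else:
            # Cache miss: evict the least recently used line if the cache
            # is full, then read the entire line from memory.
            if len(self.lines) == NUM_LINES:
                self.lines.popitem(last=False)
            start = line_index * LINE_SIZE
            self.lines[line_index] = self.memory[start:start + LINE_SIZE]
            self.memory_reads += 1
        return self.lines[line_index][offset:offset + length]
```

In this model, two 8-byte requests that fall within the same 256-byte line cost a single memory read, which corresponds to several read buffers being served by one cache line.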
In some embodiments, each of read buffers 621-632 stores a subset of values located consecutively in memory. For example, in the example shown, read buffer 621 is 8 bytes in size and stores a subset of 8 bytes of values stored consecutively in memory. In various embodiments, the values are located consecutively in memory 601 and read as a continuous block of values into a cache line of cache 603. By implementing read buffers using the concept of a subset of values, where each of the values is located consecutively in memory, each read buffer is capable of loading multiple elements (e.g., up to eight elements for an 8-byte read buffer) together. In the example shown, fewer reads are required than the number of elements to populate every read buffer with an element. For example, up to twelve reads are required to load 96 elements into the twelve read buffers 621-632. In many scenarios, even fewer reads are necessary in the event that a cache contains the requested subset of data. Similarly, in some scenarios, a single cache line is capable of storing the data requested for multiple read buffers.
In some embodiments, read buffers 621-632 are utilized by hardware data formatter 605 to prepare input operands such as a vector of inputs for a computational array, such as matrix processor 107 of
As an example, in a scenario with a stride parameter set to two, the initial input elements for a convolution operation are every other element of a row of an input matrix. Depending on the input matrix size, the elements include the 1st, 3rd, 5th, and 7th elements, etc., for the first group of input elements necessary for a convolution operation. Read buffer 621 is configured to read the first 8 elements (1 through 8), and thus elements 2, 4, 6, and 8 are not needed for a stride of two. As another example, using a stride of five, four elements are skipped when determining the start of the next neighboring region. Depending on the size of the input data, the 1st, 6th, 11th, 16th, and 21st elements, etc., are the first input elements necessary for a convolution operation. Elements 2-5 and 7-8 are loaded into read buffer 621 but are not used for calculating the first dot-product component result corresponding to each region and may be filtered out.
In various embodiments, each read buffer loads eight consecutive elements and can satisfy two elements for a stride of five. For example, read buffer 621 initiates a read at element 1 and also reads in element 6, read buffer 622 initiates a read at element 11 and also reads in element 16, read buffer 623 initiates a read at element 21 and also reads in element 26, etc. In some embodiments, the reads are aligned to multiples of the read buffer size. In some embodiments, only the first read buffer is aligned to a multiple of the read buffer size. In various embodiments, only the start of each matrix row must be aligned to a multiple of the read buffer size. Depending on the stride and the size of the input matrix, in various embodiments, only a subset of the read buffers may be utilized. In various embodiments, the elements corresponding to at least twelve regions, one element for each read buffer 621-632, are loaded and fed to a computational array in parallel. In various embodiments, the number of input elements provided in parallel to a computational array is at least the number of read buffers in the hardware data formatter.
In some embodiments, the elements not needed for the particular stride are filtered out and not passed to the computational array. In various embodiments, using, for example, a multiplexer, the input elements conforming to the stride are selected from the loaded read buffers and formatted into an input vector for a computational array. Once the input vector is formatted, hardware data formatter 605 feeds the input vector to the computational array. The unneeded elements may be discarded. In some embodiments, the unneeded elements may be utilized for the next dot-product component in a future clock cycle and are not discarded from read buffers 621-632. In various embodiments, the elements not needed for implementing a particular stride are fed as inputs to a computational array and the computational array and/or post-processing will filter the results to remove them. For example, the elements not needed may be provided as input to a computational array but the computation units corresponding to the unnecessary elements may be disabled.
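As an illustrative, non-limiting sketch, the stride-based selection described above may be modeled in software as follows. The function name `select_strided`, the representation of a read buffer as a `(start_index, values)` pair, and the 1-based element numbering are assumptions made for illustration only; in the embodiments described, the selection is performed in hardware, for example by a multiplexer.

```python
from typing import List, Tuple

def select_strided(buffers: List[Tuple[int, List[int]]],
                   stride: int,
                   first: int = 1) -> List[int]:
    """Keep only the values whose element index conforms to the stride.

    Each buffer is modeled (hypothetically) as (start_index, values), where
    start_index is the 1-based index of the first element held in the buffer.
    """
    selected = []
    for start, values in buffers:
        for offset, value in enumerate(values):
            index = start + offset
            # An element is utilized when it lies a whole number of strides
            # away from the first element of the row.
            if (index - first) % stride == 0:
                selected.append(value)
    return selected

# Two 8-element read buffers starting at elements 1 and 11 (stride of five).
buffers = [
    (1, [10, 20, 30, 40, 50, 60, 70, 80]),
    (11, [110, 120, 130, 140, 150, 160, 170, 180]),
]
input_vector = select_strided(buffers, stride=5)  # elements 1, 6, 11, 16
```

In this sketch, the values at elements 2-5 and 7-8 of the first buffer are simply skipped, which corresponds to the filtering option described above rather than to the option of disabling computation units.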
In some embodiments, hardware data formatter 605 formats the input vector for a computational array to include padding. For example, hardware data formatter 605 may insert padding using read buffers 621-632. In various embodiments, one or more padding parameters may be described by a control unit using a control signal and/or instruction parameter.
In some embodiments, hardware data formatter 605 determines a set of addresses for preparing operands for a computational array. For example, hardware data formatter 605 calculates associated memory locations required to load a subset of values, determines whether the subset is cached, and potentially issues a read to memory for the subset in the event of a cache miss. In some scenarios, a pending read may satisfy a cache miss. In various embodiments, hardware data formatter 605 only processes the memory address associated with the start element and end element of each read buffer 621-632. In various embodiments, each read buffer 621-632 associates the validity of the cache entry for a subset of values with the memory addresses of the start and end values of the corresponding read buffer. In the example shown, read buffer 621 is configured to store 8-bytes corresponding to up to eight elements. In various embodiments, hardware data formatter 605 calculates the address of the first element and the address of the last element of read buffer 621. Hardware data formatter 605 performs a cache check on the first and last element addresses. In the event either of the addresses is a cache miss, hardware data formatter 605 issues a memory read for 8-bytes starting at the address of the first element. In the event that both addresses are a cache hit from the same cache line, hardware data formatter 605 considers every element in the subset to be a valid cache hit and loads the subset of values from the cache via the appropriate cache line. In this manner, an entire row of elements may be loaded by processing the addresses of at most the first and last addresses of each read buffer 621-632 (e.g., at most 24 addresses).
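As an illustrative, non-limiting sketch, the first-and-last address check may be modeled as follows. The names `check_read_buffer` and `check_row`, the representation of the cache as a set of cached line numbers, and the 256-byte line size are assumptions for illustration. The sketch also assumes that a subset straddling two cache lines is treated as a miss, which the description above does not specify, and it does not model pending reads.

```python
LINE_SIZE = 256    # assumed cache line size in bytes
BUFFER_SIZE = 8    # bytes per read buffer, as in the example shown

def check_read_buffer(start_addr, cached_lines,
                      buffer_size=BUFFER_SIZE, line_size=LINE_SIZE):
    """Check only the first and last addresses of one read buffer.

    Returns (hit, (start_addr, end_addr)). The subset counts as a hit only
    when both addresses fall in the same cache line and that line is cached.
    """
    end_addr = start_addr + buffer_size - 1
    start_line = start_addr // line_size
    end_line = end_addr // line_size
    hit = start_line == end_line and start_line in cached_lines
    return hit, (start_addr, end_addr)

def check_row(starts, cached_lines):
    """Check every read buffer of a row; two addresses are processed per buffer."""
    results = [check_read_buffer(s, cached_lines) for s in starts]
    addresses_checked = 2 * len(results)
    reads_needed = [addrs[0] for hit, addrs in results if not hit]
    return addresses_checked, reads_needed

# Twelve contiguous 8-byte read buffers starting at address 0.
starts = [i * BUFFER_SIZE for i in range(12)]
checked, reads = check_row(starts, cached_lines={0})
```

With twelve read buffers, `checked` is 24, matching the bound of at most 24 addresses for a row.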
At 701, one or more matrices may be sliced. In some embodiments, the size of a matrix, for example, a matrix representing a frame of vision data, is larger than will fit in a computational array. In the event the matrix exceeds the size of the computational array, the matrix is sliced into a smaller two-dimensional matrix with a size limited to the appropriate dimensions of the computational array. In some embodiments, the sliced matrix is a smaller matrix with addresses to elements referencing the original matrix. In various embodiments, the sliced matrix is serialized into a vector for processing. In some embodiments, each pass of the process of
At 703, a computational operation is received. For example, a matrix operation is received by the microprocessor system. As one example, a computational operation requesting a convolution of an image with a filter is received. In some embodiments, the operation may include the necessary parameters to perform the computational operation including the operations involved and the operands. For example, the operation may include the size of the input operands (e.g., the size of each input matrix), the start address of each input matrix, a stride parameter, a padding parameter, and/or matrix, vector, and/or post-processing commands. For example, a computational operation may describe an image data size (e.g., 96×96, 1920×1080, etc.) and bit depth (e.g., 8-bits, 16-bits, etc.) and a filter size and bit depth, etc. In some embodiments, the computational operation is received by a control unit such as control unit 101 of
At 705, each hardware data formatter receives a data formatting operation. In some embodiments, the data formatting operation is utilized to prepare input arguments for a computational array such as matrix processor 107 of
At 707, data addresses are processed by one or more hardware data formatters. For example, addresses corresponding to elements of the computational operation are processed by one or more hardware data formatters based on the formatting operations received at 705. In some embodiments, the addresses are processed in order for the hardware data formatter to load the elements (from a cache or memory) and prepare an input vector for a computational array. In various embodiments, a hardware data formatter first calculates a pair of memory addresses for each subset of values to determine whether a subset of elements exists in a cache before issuing a request to memory in the event of a cache miss. In various embodiments, a read request to memory incurs a large latency that may be minimized by reading elements from a cache. In some scenarios, all elements are read from a cache and thus require any cache misses to first populate the cache by issuing a read to memory. To minimize the latency for each read, in various embodiments, the reads are performed on subsets of elements (or values). In some embodiments, memory may only have a limited number of read ports, for example, a single read port, and all reads are processed one at a time. For example, performing 96 independent reads incurs the latency of 96 independent reads for a memory with a single read port. To reduce read latency, subsets of values are read together from memory into corresponding read buffers of a hardware data formatter. For example, using subsets of eight values, at most 12 memory reads are required to read 96 values. In the event some of the subsets are in the cache from previous memory reads, even fewer memory reads are required.
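As an illustrative, non-limiting sketch, the read-count arithmetic above may be expressed as follows. The function name `reads_required` and the parameter `cached_subsets` are hypothetical, and the sketch assumes the requested elements are located consecutively in memory (for example, a stride of one).

```python
import math

def reads_required(num_elements, subset_size=8, cached_subsets=0):
    """Upper bound on memory reads when elements are read in subsets.

    cached_subsets is the (assumed known) number of subsets already in the
    cache from previous memory reads.
    """
    total_subsets = math.ceil(num_elements / subset_size)
    return max(total_subsets - cached_subsets, 0)

independent = reads_required(96, subset_size=1)        # one read per element
grouped = reads_required(96, subset_size=8)            # subsets of eight values
partly_cached = reads_required(96, subset_size=8, cached_subsets=4)
```

For a memory with a single read port, the grouped case serializes at most 12 reads instead of 96.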
In various embodiments, subsets of values are prepared by determining the memory addresses for the start value of each subset (where each value corresponds to an element) and the end value of each subset. For example, to prepare a subset of 8 values each of 1 byte, a cache check is performed using the calculated address of the start value and the calculated address of the end value of the subset. In the event either of the addresses is a cache miss, a memory read is issued to read 8 bytes from memory beginning at the address of the start value. In some embodiments, in addition to reading the requested 8 bytes from memory, an entire cache line of data (corresponding to multiple subsets) is read from memory and stored in the cache. In various embodiments, in the event the start and end addresses of a subset are cached at the same cache line, the entire subset of values is considered cached and no cache check is needed for the remaining elements of the subset. In various embodiments, the processing at 707 determines the addresses of the start value of the subset and the end value of the subset for each subset of values. In various embodiments, one read buffer exists for each subset of values. In various embodiments, read buffers of a hardware data formatter are read buffers 621-632 of hardware data formatter 605 of
In some embodiments, a stride parameter is implemented and non-consecutive subsets of values are loaded into each read buffer. In various embodiments, each subset of continuous values includes one or more elements needed to implement a particular stride parameter. For example, for a stride of one, every value in a subset of values located consecutively in memory is a utilized element. As another example, for a stride of two, every other value located consecutively in memory is utilized and a subset of eight consecutive values includes four utilized elements and four that are not utilized. As another example, for a stride of five, a subset of eight values located consecutively in memory may include two utilized elements and six unused elements. For each subset of elements located consecutively in memory, the memory addresses for the start and end elements of the subset are determined and utilized to perform a cache check at 709. In various embodiments, the start element of the subset is the first element of the subset. In some embodiments, the end element of the subset is the last element of the subset, regardless of whether the element is utilized to implement the stride parameter. In some embodiments, the end element of the subset is the last utilized element and not the last element of the subset.
In various embodiments, once the number of utilized elements that are included in a subset of consecutive elements is determined, the next subset of elements begins with the next element needed to satisfy the stride parameter. The next element may result in a memory location that is located at an address non-consecutive with the address of the last element of the previous subset. As an example, using a stride of five, four elements are skipped when determining the start of the next subset of values. Depending on the size of the input data, the 1st and 6th elements are stored in the first subset of values, 11th and 16th elements in the second subset of values, and 21st and 26th elements in the third subset of values, etc. In various embodiments, the second subset of values starts with the 11th element and the third subset of values starts with the 21st element. Each subset is located in memory at locations non-consecutive with the other subsets. Examples of unused elements in the first subset of values include the elements 2-5 and 7-8. In some embodiments, the first row of each matrix is aligned to a multiple of the subset size. In some embodiments, this alignment restriction is required to prevent gaps of invalid values between rows when a matrix is serialized. In some embodiments, all subsets are aligned to the multiple of the subset size.
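As an illustrative, non-limiting sketch, the determination of where each subset begins for a given stride may be modeled as follows. The function name `plan_subsets` and the 1-based element numbering are assumptions for illustration, and the sketch assumes subsets are not required to be aligned to a multiple of the subset size.

```python
def plan_subsets(stride, num_subsets, subset_size=8, first=1):
    """Return (start_element, utilized_elements) for each subset.

    Each subset covers subset_size consecutive elements. The next subset
    begins at the next element needed to satisfy the stride, which may be
    non-consecutive with the end of the previous subset.
    """
    plans = []
    next_needed = first
    for _ in range(num_subsets):
        start = next_needed
        utilized = list(range(start, start + subset_size, stride))
        plans.append((start, utilized))
        next_needed = utilized[-1] + stride
    return plans

stride_five = plan_subsets(stride=5, num_subsets=3)
first_start, first_used = stride_five[0]
# Elements loaded into the first subset but not utilized for this stride.
unused = [e for e in range(first_start, first_start + 8) if e not in first_used]
```

For a stride of five, the sketch yields subsets starting at the 1st, 11th, and 21st elements, with elements 2-5 and 7-8 unused in the first subset.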
In various embodiments, each subset of values is loaded in a read buffer such as read buffers 621-632 of
In some embodiments, the formatting performed by a hardware data formatter includes converting a matrix into a vector with elements of the vector fed to a computational array over multiple clock cycles. For example, in some embodiments, a matrix corresponding to data (e.g., image data) is formatted to prepare vectors corresponding to sub-regions of the data. In some embodiments, each element fed to a computational array for a particular clock cycle corresponds to the n-th element of a vector associated with a sub-region of the data. As an example, a 3×3 matrix may be formatted into a one-dimensional vector of nine elements. Each of the nine elements may be fed into the same computation unit of a computational array. In various embodiments, feeding the 9 elements requires at least 9 clock cycles.
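As an illustrative, non-limiting sketch, the conversion of sub-regions into vectors and the per-cycle feeding of their n-th elements may be modeled as follows. The function names `region_vectors` and `feed_schedule`, the row-major ordering within a region, and the use of nested lists for the matrix are assumptions for illustration.

```python
def region_vectors(matrix, size, stride=1):
    """Flatten each size-by-size sub-region into a one-dimensional vector.

    Row-major ordering within each region is an assumption of this sketch.
    """
    rows, cols = len(matrix), len(matrix[0])
    vectors = []
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            vectors.append([matrix[r + i][c + j]
                            for i in range(size)
                            for j in range(size)])
    return vectors

def feed_schedule(vectors):
    """Cycle n carries the n-th element of every region vector in parallel."""
    return [list(cycle) for cycle in zip(*vectors)]

# A 4x4 matrix holding the values 0..15 and a 3x3 region size.
matrix = [[4 * r + c for c in range(4)] for r in range(4)]
vectors = region_vectors(matrix, size=3)
schedule = feed_schedule(vectors)
```

Each 3×3 region produces a nine-element vector, so the schedule has nine entries, one per clock cycle, and each entry holds one element per region.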
At 709, a determination is made whether the data corresponding to the addresses determined for each subset at 707 are cached. For example, a cache check is performed on each subset by determining whether the data associated with the address of the start value of the subset and the address of the end value of the subset is in the same cache line. In various embodiments, a cache check is performed for each read buffer, such as read buffers 621-632 of
At 711, each requested subset of data is read into the cache as an entire subset of values. In various embodiments, each subset of data is read into the cache from memory. In some embodiments, the memory is memory 102 of
At 713, matrix processing is performed. For example, a matrix processor performs a matrix operation using the data cached and received by a hardware data formatter. In various embodiments, the cached data is received by the hardware data formatter and processed according to a formatting operation by a hardware data formatter into input values for matrix processing. In some embodiments, the processing by the hardware data formatter includes filtering out a portion of the received cached data. For example, in some embodiments, subsets of values located consecutively in memory are read into the cache and received by the hardware data formatter. In various embodiments, a computational operation may specify a stride and/or padding parameters. For example, to implement a specified stride for convolution, one or more data elements may be filtered from each subset of values. In some embodiments, only a subset of the elements from each of the subsets of values is selected to create an input vector for matrix processing.
In various embodiments, the matrix processor performs the computational operation specified at 703. For example, a matrix processor such as matrix processor 107 of
At 715, vector and/or post-processing operations are performed. For example, vector processing may include the application of an activation function such as a rectified linear unit (ReLU) function. In some embodiments, vector processing includes scaling and/or normalization. In various embodiments, vector processing is performed on one vector of the output of a computational array at a time. In some embodiments, vector processing is performed by a vector processor such as vector engine 111 of
In some embodiments, the process of
At 801, the first subset of data elements located consecutively in memory is processed. In various embodiments, the first consecutive subset of data corresponds to the data element designated for the first read buffer of a hardware data formatter. In some embodiments, the address of the first element must be a multiple of the number of elements in each subset. For example, using an 8-byte read buffer, the address of the first element must be a multiple of eight.
At 803, start and end memory addresses are determined for the current subset. For example, the memory address of the start element of a subset and the memory address of the end element of a subset are determined. In various embodiments, the start and end addresses are determined by a hardware data formatter, such as the hardware data formatters of
At 805, a determination is made on whether the subset of data is cached or pending a read. For example, a determination is made whether the data corresponding to the start and end addresses determined at 803 are cached at the same cache line or will be cached as a result of an already issued memory read. In some embodiments, a pending read for a different subset brings an entire cache line of data into the cache and will result in caching the current subset. In the event the data is not cached and will not be cached as a result of a pending memory read, processing continues to 807. In the event the data is cached or will be cached by a pending memory read, processing continues to 811.
At 807, a determination is made on whether a memory read is already issued. In the event a memory read is already issued, processing completes for the current clock cycle. In the event a memory read has not been issued, processing continues to 809. In some embodiments, the memory is configured with a single read port (e.g., to increase density) and the memory can only process one read at a time. In various embodiments, the determination of whether a memory read has been issued is based on the capability of the memory configuration and/or the availability of memory read ports. Not shown in
At 809, a read is issued to cache a subset of data elements. For example, a block of memory beginning at the start address determined at 803 and extending for the length based on the size of a read buffer is read from memory into the memory cache. In various embodiments, an entire cache line of memory is read into the memory cache. For example, in a scenario with a cache line of 256 bytes and read buffers each capable of storing 8-bytes, a memory read will read 256 bytes of continuous data into a cache line, which corresponds to 32 subsets of non-overlapping 8-byte values. In various embodiments, reading a subset of values as a single memory read request reduces the latency associated with loading each element. Moreover, reading multiple subsets of values together may further reduce the latency by caching other subsets of values that may be associated with other read buffers. In some embodiments, loading multiple subsets of values takes advantage of potential locality between the subsets resulting in lower latency. In some embodiments, the read issued is arbitrated by a hardware arbiter such as arbiter 123 of
At 811, a determination is made on whether there are additional subsets of data elements. In the event that every subset has been processed, processing continues to 813. In the event that there are additional subsets to be processed, processing loops back to 803. In some embodiments, depending on the input size, one or more read buffers of a hardware data formatter may not be utilized.
At 813, a determination is made on whether all the data elements are cached. In the event some elements are not cached, processing completes for the current clock cycle to allow the non-cached data elements to be loaded from memory into the cache. In the event all the data elements are cached, the data elements are all available for processing and processing proceeds to 815.
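As an illustrative, non-limiting sketch, steps 803 through 813 may be simulated cycle by cycle as follows. The function name `cycles_until_ready`, the fixed `read_latency`, the rule of issuing at most one read per clock cycle, and the assumption that each subset lies entirely within one cache line are simplifications made for illustration and are not taken from the description above.

```python
def cycles_until_ready(subset_starts, read_latency=3, line_size=256):
    """Simulate the per-cycle loop until every subset is cached.

    Returns (cycle, reads_issued). Each read is assumed to bring an entire
    cache line into the cache read_latency cycles after it is issued.
    """
    cached = set()     # cache lines currently in the cache
    pending = {}       # cache line -> cycle at which its data arrives
    reads_issued = 0
    cycle = 0
    while True:
        # Retire reads whose data has arrived by this cycle.
        for line in [l for l, ready in pending.items() if ready <= cycle]:
            cached.add(line)
            del pending[line]
        issued_this_cycle = False
        for start in subset_starts:              # 811: loop over subsets
            line = start // line_size            # 803: address of the subset
            if line in cached or line in pending:
                continue                         # 805: cached or pending
            if issued_this_cycle:
                break                            # 807: read already issued
            pending[line] = cycle + read_latency # 809: issue a read
            reads_issued += 1
            issued_this_cycle = True
        if all(s // line_size in cached for s in subset_starts):
            return cycle, reads_issued           # 813: all elements cached
        cycle += 1

same_line = cycles_until_ready([i * 8 for i in range(12)])
two_lines = cycles_until_ready([0, 256])
```

In the first call all twelve subsets share one cache line, so a single read satisfies every read buffer. In the second call the two subsets fall in different cache lines, so the second read is deferred by one clock cycle.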
At 815, matrix processing is performed. For example, the cached data elements are received at one or more hardware data formatters, formatted, and fed as input vector(s) to a computational array for processing. A computational array, such as matrix processor 107 of
In various embodiments, the arrows of
In various embodiments, control unit 901 is communicatively connected to data formatter 911 and control queue 903. In some embodiments, control unit 901 is communicatively connected to arbiter 905, depicted as a dotted line. In various embodiments, control unit 901 sends a control operation corresponding to a computational array operation to be queued in control queue 903. In various embodiments, control unit 901 sends a control signal to data formatter 911. For example, control unit 901 may send a control signal to data formatter 911 describing arguments for formatting corresponding to the computational operation queued at control queue 903. In some embodiments, control unit 901 sends a control signal to arbiter 905 that describes memory access operations corresponding to the queued computational operation. In other embodiments, data formatter 911 sends a control signal to arbiter 905 that describes memory access operations corresponding to the queued computational operation and the data to be formatted, for example, in response to a control signal received from control unit 901.
In various embodiments, control queue 903 is a queue for storing computational array operations. In various embodiments, control queue 903 is a first-in-first-out queue that receives computational array operations from control unit 901 and de-queues computational array operations to computation engine 915. In various embodiments, the de-queue operation is performed in response to a control signal, such as a ready signal, from arbiter 905. For example, once an arbiter grants memory access to a data operand corresponding to the computational array operation queued at control queue 903, control queue 903 de-queues the computational array operation. In various embodiments, the de-queue action is timed so that the data operand retrieved from memory via arbiter 905 is synchronized to arrive at computation engine 915 with the computational array operation. In some embodiments, the ready signal from arbiter 905 is based on a completed read corresponding to a read request. In some embodiments, a computational array operation queued at control queue 903 relies on more than one data operand. For example, a matrix multiplication may require more than one memory access operation. In some embodiments, in the event the computational array operation queued at control queue 903 relies on more than one data operand, the computational array operation is de-queued so that all the data operands are synchronized to arrive at computation engine 915 with the computational array operation. For example, in the event two memory access operations are required and arbiter 905 generates one control signal for each memory access, control queue 903 will only release the computational array operation once the second control signal is received.
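As an illustrative, non-limiting sketch, the release of a queued operation only after a ready signal has been received for each of its data operands may be modeled as follows. The class name `ControlQueue` and its methods `enqueue` and `ready` are hypothetical, and the sketch assumes each ready signal applies to the operation at the head of the queue.

```python
from collections import deque

class ControlQueue:
    """First-in-first-out queue of computational array operations."""

    def __init__(self):
        # Each entry is [operation, ready signals still outstanding].
        self._ops = deque()

    def enqueue(self, operation, operands_needed):
        self._ops.append([operation, operands_needed])

    def ready(self):
        """Record one ready signal; return the released operation or None."""
        if not self._ops:
            return None
        head = self._ops[0]
        head[1] -= 1
        if head[1] == 0:
            self._ops.popleft()
            return head[0]
        return None

queue = ControlQueue()
queue.enqueue("matrix multiply", operands_needed=2)  # e.g., data and weights
queue.enqueue("convolution", operands_needed=1)
released = [queue.ready(), queue.ready(), queue.ready()]
```

The first ready signal releases nothing, the second releases the matrix multiply, and the third releases the convolution, preserving first-in-first-out order.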
In some embodiments, control queue 903 includes additional stages to adjust for the latency required for data operands to be retrieved from memory 907 and formatted by data formatter 911. For example, control queue 903 may include one or more flip-flops to propagate a computational array operation from control queue 903 to computation engine 915. In some embodiments, alternative techniques are utilized to introduce a fixed latency from control queue 903 to computation engine 915 that corresponds to the latency to load a data operand by data formatter 911. In various embodiments, the latency is a fixed number of clock cycles based on the amount of time required to perform a memory read and to format the retrieved data into operands for computation engine 915. Although not depicted in
In various embodiments, the control signal received at control queue 903 to initiate the release of a queued computational array operation may be received (not shown) from one or more data formatters, such as data formatter 911, in response to a control signal received at the data formatter from arbiter 905. For example, instead of arbiter 905 directly sending a ready control signal to control queue 903, the control signal is sent to data formatter 911. In various embodiments, the control signal received at control queue 903 is received indirectly from arbiter 905.
In some embodiments, arbiter 905 is utilized to control access to memory 907. In various embodiments, memory 907 has a limited number of read ports, for example, a single read port capable of only performing a single read at a time. As a result of a limited number of read ports, access to memory 907 must be limited. In various embodiments, arbiter 905 grants read access to read ports (not shown) of memory 907. In the example shown, arbiter 905 includes arbiter control logic 921 for processing memory access requests, such as receiving and queuing read requests, granting memory access to queued read requests, and coordinating memory access with computational array operations. In various embodiments, arbiter 905 is a hardware arbiter. For example, arbiter 905 does not rely on software implementations to synchronize memory access with computational array operations.
In the example shown, arbiter 905 includes read queue 923 for queuing memory read access requests. In various embodiments, memory access requests are read requests to memory, such as memory 907. For example, a request to load data associated with a memory address of a matrix operand is a memory access request. In various embodiments, memory read requests are initiated by a data formatter such as data formatter 911. In various embodiments, one or more data formatters initiate memory access requests. For example, a hardware data formatter corresponding to data, such as sensor data, and a separate hardware data formatter corresponding to weights, such as weights representing a machine learning model, initiate read access requests for memory. The various read requests are queued in read queue 923 and may originate from different components of microprocessor system 900. In some embodiments, additional read queues may exist (not shown), for example, corresponding to different requesters, different memory modules, different read ports, etc. In various embodiments, the memory read requests correspond to the issued memory reads performed at 711 of
In some embodiments, memory 907 is memory used for storing data operands for computation engine 915. For example, memory 907 may be static random access memory (SRAM). In various embodiments, memory 907 is high-density memory with limited read ports. For example, in order to increase the density of memory 907, the number of read ports is limited. In some embodiments, memory 907 includes a cache (not shown). In various embodiments, memory 907 may be dynamically partitioned to allocate portions of memory between data and weights. In various embodiments, memory 907 may be dynamically partitioned to allocate portions of memory for different purposes. In some embodiments, memory 907 is memory 102 of
In some embodiments, data formatter 911 is a hardware data formatter for preparing operands for a computational engine, such as computation engine 915. For example, data formatter 911 may initiate the loading of data operands from memory (and/or cache) and prepare the loaded operands as a group of values for input to a computation engine. In various embodiments, the length of time to load and format a data operand by data formatter 911 is a variable amount of time since the amount of time needed to read data from memory is variable. In some embodiments, the data formatter will issue a read request for data from memory and will stall a variable amount of time as the read request is pending access to memory. In various embodiments, the amount of time to format and send an input operand to computation engine 915 is a fixed amount and only the amount of time required to read an operand from memory is variable.
In various embodiments, one or more data formatters prepare operands for a computation engine. For example, a hardware data formatter 911 may align the data retrieved from memory 907 into a format compatible with computation engine 915. In some embodiments, hardware data formatter 911 inserts padding and/or applies a particular stride parameter to the retrieved data from memory 907. In various embodiments, additional data formatters (not shown) may exist and may be utilized to format additional operands for a computational array operation. For example, a hardware data formatter may exist for formatting data input and a separate hardware data formatter may exist for formatting weight input. In various embodiments, two or more separate hardware data formatter pipelines may exist in a microprocessor system (not shown) and arbiter 905 arbitrates the memory requests issued by each hardware data formatter and synchronizes the granted memory read requests with control operations from control unit 901.
In some embodiments, computation engine 915 is a computational array for performing computational array operations. For example, computation engine 915 receives input operands from one or more data formatters and performs a matrix operation on the formatted operands. In various embodiments, computation engine 915 receives a computational operation from control queue 903. For example, computation engine 915 may receive an operation corresponding to a convolution operation from control queue 903. In some embodiments, the computation operation and the data operands must be synchronized and arrive at computation engine 915 for processing at the same clock cycle. In various embodiments, the output of computation engine 915 is fed into a vector processor (not shown) and/or post-processing processor (not shown). In various embodiments, computation engine 915 is matrix processor 107 of
At 1001, a read memory address is generated. In some embodiments, the memory address is generated by a data formatter. In various embodiments, the address is generated by a hardware data formatter such as data formatter 104 or weight formatter 106 of
At 1003, a memory read is issued. For example, a memory read is issued for the data corresponding to the data address generated at 1001. In various embodiments, the memory read request may be a read for a block of elements starting at an address corresponding to a first element of a subset of elements located consecutively in memory.
At 1005, a control operation is queued. For example, a control operation is queued in a control queue such as control queue 103 or 903 of
At 1007, a determination is made whether memory access is granted. For example, for each memory read request issued, access to memory must be first granted before a memory read can be performed. In some embodiments, the memory has a limited number of read ports and thus a limited number of reads may be performed simultaneously. In some embodiments, the memory has a single read port and only one read can be performed at a time. In various embodiments, reads are queued up and issued by an arbiter, such as arbiter 123 or 905 of
At 1009, the control queue is signaled. In some embodiments, the signal is sent based on a determination that memory access is granted at 1007. In various embodiments, the signal is a ready signal corresponding to a memory access request. Once access to memory is granted, the latency to perform a memory read and/or to format the retrieved data can be determined. In various embodiments, the latency is a fixed amount of time. For example, in some embodiments, the latency to retrieve data from memory once memory access is granted and to format the received data as an operand is a fixed number of clock cycles. By determining in advance the fixed number of clock cycles required to read and format a data operand, a computation operation queued in a control queue can be released and be configured to arrive at a computational array in sync with formatted data operands.
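As an illustrative, non-limiting sketch, a fixed latency between the release of an operation and its arrival at a computational array may be modeled as a chain of stages that advances once per clock cycle. The class name `DelayLine`, its `tick` method, and the choice of three stages are assumptions for illustration; in the embodiments described, the delay corresponds to the fixed number of clock cycles needed to read and format a data operand.

```python
class DelayLine:
    """Chain of stages that delays a value by a fixed number of clock cycles."""

    def __init__(self, stages):
        self._stages = [None] * stages

    def tick(self, value=None):
        """Advance one clock cycle.

        Shifts value into the first stage and returns whatever leaves the
        last stage (None when that stage was empty).
        """
        self._stages.insert(0, value)
        return self._stages.pop()

# Assume reading and formatting an operand takes three clock cycles once
# memory access is granted.
delay = DelayLine(stages=3)
outputs = [delay.tick("convolution"), delay.tick(), delay.tick(), delay.tick()]
```

The operation released on the first clock cycle emerges three cycles later, which is when the formatted operand is assumed to arrive in this sketch.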
At 1011, data is read from memory. In some embodiments, a block of data corresponding to a subset of elements located consecutively in memory is read. In various embodiments, the read is the read issued at 1003.
In an alternative embodiment (not shown), the control queue signaled at 1009 is signaled after the data is read from memory, effectively swapping steps 1009 and 1011. For example, the data is read from memory and once the data is received at a hardware data formatter, the hardware data formatter signals a control queue. In some embodiments, the ready signal received at the control queue is based on a completed memory read instead of a memory access grant (as shown).
At 1013, data is formatted for computation. For example, data is retrieved from memory at 1011 and arrives at a hardware data formatter such as data formatter 104 or weight formatter 106 of
At 1015, a computational array operation is performed. For example, a matrix operation is performed by a computational array. As another example, a convolution operation is performed using a matrix processor. In some embodiments, vector processing and/or post-processing may be performed as well. In various embodiments, a group of values is made available from one or more hardware data formatters along with a computational array operation during the same clock cycle. For example, a group of values is formatted by a hardware data formatter at 1013 and arrives at a computational array in sync with a computational operation via a control queue. The computational array performs the computation specified by the computational operation on the provided data operands.
At 1101, a read request is received by a hardware arbiter. In various embodiments, the read request is a memory read request. For example, a read request may be a memory read request corresponding to one or more elements in memory. As another example, the read request corresponds to a subset of elements located consecutively in memory. In various embodiments, a read request may arrive from one or more different hardware data formatters. For example, a read request may arrive from either a data or a weight data formatter to read data corresponding to data or weights. In some embodiments, a read request is issued by data formatter 104 and/or weight formatter 106 of
At 1103, the read request received at 1101 is queued. In various embodiments, read requests issued from different sources are queued in a single queue. For example, requests from a data hardware data formatter and a weight hardware data formatter are queued in the same queue and arranged based on arrival time. In some embodiments, more than one queue exists, with each queue corresponding to the hardware data formatter requesting the memory read. For example, a separate queue exists for data requests and for weight requests. In various embodiments, having separate queues allows the arbiter to prioritize requests from one queue over another queue, direct requests to different memory read ports, direct requests to different memory regions, etc. In some embodiments, a single queue is used to implement similar functionality by storing metadata associated with the source of the read request.
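The two queueing arrangements can be sketched as follows. This is an illustrative software model only; the `ReadRequest` fields and the class and method names are assumptions introduced for the example, and a hardware arbiter would realize the same ordering with dedicated queue structures.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadRequest:
    source: str   # hypothetical source tag: "data" or "weight"
    address: int  # start address of the consecutive elements
    length: int   # number of consecutive elements requested


class SingleQueue:
    """One queue ordered by arrival time; the source is kept as metadata."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, request: ReadRequest) -> None:
        self._queue.append(request)

    def pending(self):
        return list(self._queue)

    def pending_from(self, source: str):
        # Metadata lets a single queue be filtered like separate queues.
        return [r for r in self._queue if r.source == source]


class PerSourceQueues:
    """A separate queue for each requesting hardware data formatter."""

    def __init__(self):
        self._queues = {"data": deque(), "weight": deque()}

    def enqueue(self, request: ReadRequest) -> None:
        self._queues[request.source].append(request)

    def pending(self, source: str):
        return list(self._queues[source])
```

Either model preserves arrival order within a source; the per-source variant makes it straightforward to favor one source or route it to a particular read port.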
At 1105, a determination is made whether memory access is granted. For example, the pending element of a read queue is examined to determine whether to grant memory access to perform the corresponding memory read. In some embodiments, a determination is made whether an existing memory read is being performed and/or whether an existing memory read has completed. In various embodiments, at step 1105, a determination is made whether memory may be accessed based on the availability of read ports of the memory.
At 1107, in the event the memory is available to service a memory read, processing proceeds to 1109. In the event the memory is not available to service a memory read, processing loops back to 1105 to determine the appropriate time to grant access to read memory for a particular read request.
At 1109, a read request is dequeued from the read queue. In various embodiments, the read request corresponds to a read request queued at 1103. For example, one or more read requests are queued in a read queue at 1103 and the first arrived request is dequeued at 1109. The first arrived request corresponds to the request that arrived the earliest. In some embodiments, the request with the highest priority is dequeued and may not correspond to the request that arrived the earliest. In some embodiments, the request is a memory request for a subset of elements located consecutively in memory. In various embodiments, once a read request is dequeued, the read corresponding to the request is performed to retrieve the data requested from memory.
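The two dequeue policies can be modeled with a single structure, sketched below under the assumption that each request carries a numeric priority (a hypothetical field, where a larger value means more urgent). When every request has the same priority, the model reduces to first-arrived, first-dequeued order.

```python
import heapq
import itertools


class PriorityReadQueue:
    """Illustrative read queue: highest priority first, arrival order breaks ties."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # monotonically increasing arrival stamp

    def enqueue(self, request_id, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), request_id))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```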
At 1111, a ready signal is sent to a control queue corresponding to the read request dequeued at 1109. In some embodiments, the ready signal is sent once the read has completed. In some embodiments, the ready signal is sent when the read request is dequeued. In various embodiments, the latency used to synchronize a control operation with one or more data reads is based on the amount of time (e.g., clock cycles) it takes for the data to be formatted and provided to the computational array. For example, the read request dequeued at 1109 corresponds to a computational operation queued at a control queue. At 1111, the control queue receives a signal from the arbiter that informs the control queue that memory access has been granted for the data associated with a queued computational operation. In various embodiments, once memory access is granted, the data is available in a fixed number of clock cycles. In various embodiments, the signal sent from the arbiter to the control queue informs the control queue to make the corresponding computational operation available after the determined fixed number of clock cycles. As described above and with respect to
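The arbitration loop of steps 1101 through 1111 can be approximated by a cycle-stepped software model. The sketch below assumes the variant in which the ready signal is sent when the read request is dequeued; the class name, the `read_ports` and `read_cycles` parameters, and the string request identifiers are hypothetical.

```python
from collections import deque


class Arbiter:
    """Illustrative model of a hardware arbiter with a limited number of read ports."""

    def __init__(self, read_ports=1, read_cycles=4):
        self.queue = deque()       # 1103: pending read requests in arrival order
        self.read_ports = read_ports
        self.read_cycles = read_cycles
        self.busy_until = []       # completion cycle of each in-flight read
        self.ready_signals = []    # (cycle, request_id) pairs sent to the control queue

    def receive(self, request_id):
        """1101/1103: accept a read request and queue it."""
        self.queue.append(request_id)

    def step(self, cycle):
        """Advance one clock cycle."""
        # Free any read port whose read has completed by this cycle.
        self.busy_until = [done for done in self.busy_until if done > cycle]
        # 1105/1107: grant access only while a read port is available.
        while self.queue and len(self.busy_until) < self.read_ports:
            request_id = self.queue.popleft()                 # 1109: dequeue
            self.busy_until.append(cycle + self.read_cycles)  # read is now in flight
            self.ready_signals.append((cycle, request_id))    # 1111: ready signal
```

With a single read port and an assumed three-cycle read, a second request queued at the same time as the first is granted three cycles later, which is the variable wait that the ready signal communicates to the control queue.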
At 1201, initialization is performed on the control operation and the memory reads. For example, a control operation is initialized using a computational operation and prepared to be issued. As another example, the initialization includes calculating one or more memory addresses corresponding to data operands for a computational array and issuing the corresponding memory read requests. In some embodiments, the step of 1201 may be performed by a control unit and/or a hardware data formatter. Examples of a control unit include control unit 101 and 901 of
At 1211, a memory read corresponding to one or more data operands is queued at an arbiter. For example, a memory read corresponding to sensor data, such as data from a camera, is queued. In some embodiments, the data corresponds to an input channel of sensor data. In some embodiments, the memory read is queued at an arbiter such as arbiter 105 and 905 of
At 1221, a memory read corresponding to one or more weight operands is queued at an arbiter. For example, a memory read corresponding to weight data is queued. In some embodiments, the weight operands are a two-dimensional image filter. In some embodiments, the weight operands are machine learning weights determined by training a machine learning model. In some embodiments, the memory read is queued at an arbiter such as arbiter 105 and 905 of
At 1231, a control operation is queued. For example, a control operation corresponding to a convolution computational array operation is queued. As another example, a control operation corresponding to a matrix operation is queued. In various embodiments, the control operation is queued in a control queue such as control queue 103 and 903 of
At 1213, in the event access to memory is granted for a queued data read, processing proceeds to 1215. In the event access is not granted, processing loops back to 1213 until a later time when memory access is granted. At 1213, once memory access is granted, a data read is dequeued and the memory read for the corresponding data is performed.
At 1223, in the event access to memory is granted for a queued weight read, processing proceeds to 1225. In the event access is not granted, processing loops back to 1223 until a later time when memory access is granted. At 1223, once memory access is granted, a weight read is dequeued and the memory read for the corresponding weight is performed.
At 1215, a signal, such as a ready signal, is sent to the control queue to indicate that memory access has been granted for a data read and that the data element(s) will be read from memory. In various embodiments, the number of clock cycles to read data element(s) is fixed and the signal is used by the control queue to determine the appropriate timing for dequeueing the corresponding control operation for the data element(s) being read. In various embodiments, the signal is sent from the hardware arbiter that grants access for the memory read. In some embodiments, the memory read may be serviced from a cache (not shown). In some embodiments, the signal is sent once a memory read has completed and the data has been retrieved from memory.
At 1225, a signal, such as a ready signal, is sent to the control queue to indicate that memory access has been granted for a weight read and that the weight element(s) will be read from memory. In various embodiments, the number of clock cycles to read the weight element(s) is fixed and the signal is used by the control queue to determine the appropriate timing for dequeueing the corresponding control operation for the weight element(s) being read. In various embodiments, similar to 1213, the signal is sent from the hardware arbiter that grants access for the memory read. In some embodiments, the memory read may be serviced from a cache (not shown). In some embodiments, the signal is sent once a memory read has completed and the weight data has been retrieved from memory.
At 1235, a control queue receives one or more control signals from an arbiter. For example, a control queue receives a ready signal corresponding to a data read being granted access to read from memory. As another example, a control queue receives a ready signal corresponding to a weight read being granted access to read from memory. In various embodiments, the signals are not received at the same time or during the same clock cycle. For example, a memory that services a single memory read at a time will require the first read to complete before a second read can be performed. In some embodiments, at 1235, the control queue waits to receive a signal corresponding to each memory read issued and/or acknowledging that each of the operands has been read from memory (or a cache of the memory). In various embodiments, only once signals have been received for each of the corresponding memory reads of a control operation does processing proceed to 1239.
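The gating behavior at 1235 can be sketched as a queue whose head entry is released only after a ready signal has been received for every associated memory read. The class and method names and the string read identifiers below are hypothetical and serve only to illustrate the condition.

```python
from collections import deque


class ControlQueue:
    """Illustrative control queue that holds an operation until all its reads are signaled."""

    def __init__(self):
        # Each entry is [operation, set of read ids still awaiting a ready signal].
        self._entries = deque()

    def enqueue(self, operation, read_ids):
        self._entries.append([operation, set(read_ids)])

    def signal(self, read_id):
        """Record a ready signal against the oldest entry waiting on this read."""
        for _, pending in self._entries:
            if read_id in pending:
                pending.remove(read_id)
                return

    def try_dequeue(self):
        """Return the head operation if all of its reads are signaled, else None."""
        if self._entries and not self._entries[0][1]:
            return self._entries.popleft()[0]
        return None
```

Because the data and weight signals may arrive in different clock cycles, `try_dequeue` would be polled each cycle and returns the operation only after the last outstanding signal arrives.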
At 1217, a read is dequeued and the corresponding data element(s) are retrieved from memory. In various embodiments, the read corresponds to the next data read in a read queue. In some embodiments, the next read to be dequeued corresponds to the data read that arrives first. For example, the next read is based on the time the data read is queued in the read queue. In some embodiments, the next read is based on the data read with the highest priority.
At 1227, a read is dequeued and the corresponding weight element(s) are retrieved from memory. In various embodiments, the read corresponds to the next weight read in a read queue. In some embodiments, the next read to be dequeued corresponds to the weight read that arrives first. For example, the next read is based on the time the weight read is queued in the read queue. In some embodiments, the next read is based on the weight read with the highest priority.
At 1219, the data element(s) retrieved from memory are formatted for a computational array. For example, the one or more data elements retrieved from memory are formatted by a hardware data formatter into a group of values to be provided together to and operated on by a computational array. For example, formatting may include formatting data arguments as a group of values that make up a portion of a two-dimensional region of sensor data and providing the group of values together to a computational array. In some embodiments, formatting includes formatting the data arguments based on a stride parameter. In some embodiments, formatting includes formatting the data arguments based on a padding parameter. In various embodiments, formatting may be performed by a hardware data formatter such as data formatter 104 of
At 1229, the weight element(s) retrieved from memory are formatted for a computational array. For example, the one or more weight elements retrieved from memory are formatted by a hardware data formatter into a group of values to be provided together to and operated on by a computational array. For example, formatting may include formatting weight arguments as a group of values that make up an image filter and providing the group of values together to a computational array. In some embodiments, formatting includes formatting the weight arguments based on a parameter such as a matrix dimension, stride, padding, etc., as appropriate. In various embodiments, formatting may be performed by a hardware data formatter such as weight formatter 106 of
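One way to picture formatting based on stride and padding parameters is the software sketch below, which gathers square windows from a two-dimensional region into flat groups of values. It assumes convolution-style semantics (zero padding on all four sides, a square window of size `kernel`, and the same stride in both dimensions); these semantics and the function names are assumptions made for illustration and are not stated for the hardware data formatter.

```python
def pad2d(rows, padding, fill=0):
    """Surround a 2D list with `padding` rows and columns of `fill` values."""
    width = len(rows[0]) + 2 * padding
    padded = [[fill] * width for _ in range(padding)]
    for row in rows:
        padded.append([fill] * padding + list(row) + [fill] * padding)
    padded.extend([fill] * width for _ in range(padding))
    return padded


def window_groups(rows, kernel, stride=1, padding=0):
    """Return each kernel x kernel window as one flat group of values."""
    grid = pad2d(rows, padding)
    height, width = len(grid), len(grid[0])
    groups = []
    for top in range(0, height - kernel + 1, stride):
        for left in range(0, width - kernel + 1, stride):
            groups.append(
                [grid[top + dy][left + dx] for dy in range(kernel) for dx in range(kernel)]
            )
    return groups
```

Each returned group corresponds to a set of values that would be provided together to the computational array for one window position.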
At 1239, a control operation is dequeued and provided to a computational array. For example, a control operation corresponding to a computational array operation to be performed on matrix operands is dequeued from a control queue and provided to a computational array in sync with providing operands to the computational array. In some embodiments, a control operation corresponds to a matrix operation. In some embodiments, a control operation corresponds to performing a convolution operation. In various embodiments, the control operation is queued in a control queue and is only dequeued when all associated operands are retrieved or being retrieved from memory once memory access is granted. For example, a control operation associated with two groups of operands is dequeued from a control queue only after a first group of operands has already been retrieved and/or is being streamed from memory (or cache) and when a memory read associated with a second group of operands is granted access to memory. The latency to retrieve and format the second group of operands is a fixed number of clock cycles and the control operation is dequeued and provided to a computational array at the same clock cycle as the different groups of operands.
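The two-group example can be expressed as a small timing calculation. The sketch assumes, purely for illustration, that every read takes the same fixed `LATENCY` cycles from memory grant to formatted operands and that a group that is ready early is held by its formatter until the last group is ready; the constant, its value, and the function name are hypothetical.

```python
LATENCY = 6  # hypothetical fixed cycles from memory grant to formatted operands


def plan_arrival(grant_cycles):
    """Return (arrival_cycle, hold_cycles) for one control operation.

    grant_cycles maps each operand group to the cycle its read was granted.
    """
    # The last grant determines when all operands can be presented together.
    arrival_cycle = max(grant_cycles.values()) + LATENCY
    # Groups granted earlier are held until the common arrival cycle.
    hold_cycles = {
        group: arrival_cycle - (granted + LATENCY)
        for group, granted in grant_cycles.items()
    }
    return arrival_cycle, hold_cycles
```

For example, with the data read granted at cycle 3 and the weight read granted at cycle 7, the operation and both operand groups meet at the array at cycle 13, and the data group is held for 4 cycles.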
At 1251, a computational operation is performed by a computational array. In various embodiments, a control operation corresponding to a computational array operation and the operands retrieved from memory are available at the computational array at the same clock cycle. A computational operation is performed on the computational array operands made available to the computational array. In some embodiments, the computation(s) performed at 1251 correspond to the computation(s) performed at 309 of
In various embodiments, the process of
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application is a continuation of, and claims priority to, U.S. patent application Ser. No. 15/920,150 titled “COMPUTATIONAL ARRAY MICROPROCESSOR SYSTEM WITH VARIABLE LATENCY MEMORY ACCESS” and filed on Mar. 13, 2018, which is now U.S. Pat. No. 11,157,287. U.S. patent application Ser. No. 15/920,150 claims priority to U.S. Provisional Patent Application No. 62/635,399 entitled A COMPUTATIONAL ARRAY MICROPROCESSOR SYSTEM WITH VARIABLE LATENCY MEMORY ACCESS filed Feb. 26, 2018, and this application claims priority to U.S. Provisional Patent Application No. 62/625,251 entitled VECTOR COMPUTATIONAL UNIT filed Feb. 1, 2018, and this application claims priority to U.S. Provisional Patent Application No. 62/536,399 entitled ACCELERATED MATHEMATICAL ENGINE filed Jul. 24, 2017, and this application is a continuation-in-part of U.S. Pat. No. 10,671,349 entitled ACCELERATED MATHEMATICAL ENGINE filed Sep. 20, 2017, which claims priority to U.S. Provisional Patent Application No. 62/536,399 entitled ACCELERATED MATHEMATICAL ENGINE filed Jul. 24, 2017. Each of the above-recited applications is hereby incorporated herein by reference in its entirety for all purposes.
Number | Name | Date | Kind |
---|---|---|---|
5014235 | Morton | May 1991 | A |
5239636 | Fischer | Aug 1993 | A |
5267185 | Akabane | Nov 1993 | A |
5311459 | D'Luna et al. | May 1994 | A |
5333296 | Bouchard | Jul 1994 | A |
5471627 | Means et al. | Nov 1995 | A |
5519864 | Martell | May 1996 | A |
5600843 | Kato et al. | Feb 1997 | A |
5717947 | Gallup et al. | Feb 1998 | A |
5742782 | Ito | Apr 1998 | A |
5850530 | Chen | Dec 1998 | A |
5887183 | Agarwal et al. | Mar 1999 | A |
5909572 | Thayer et al. | Jun 1999 | A |
6122722 | Slavenburg | Sep 2000 | A |
6195674 | Elbourne | Feb 2001 | B1 |
6289138 | Yip et al. | Sep 2001 | B1 |
6425090 | Arimilli | Jul 2002 | B1 |
6446190 | Barry | Sep 2002 | B1 |
6847365 | Miller et al. | Jan 2005 | B1 |
6882755 | Silverstein et al. | May 2005 | B2 |
7209031 | Nakai et al. | Apr 2007 | B2 |
7747070 | Puri | Jun 2010 | B2 |
7904867 | Burch et al. | Mar 2011 | B2 |
7974492 | Nishijima | Jul 2011 | B2 |
8165380 | Choi et al. | Apr 2012 | B2 |
8369633 | Lu et al. | Feb 2013 | B2 |
8406515 | Cheatle et al. | Mar 2013 | B2 |
8509478 | Haas et al. | Aug 2013 | B2 |
8588470 | Rodriguez et al. | Nov 2013 | B2 |
8744174 | Hamada et al. | Jun 2014 | B2 |
8773498 | Lindbersgh | Jul 2014 | B2 |
8912476 | Fogg et al. | Dec 2014 | B2 |
8913830 | Sun et al. | Dec 2014 | B2 |
8924455 | Barman et al. | Dec 2014 | B1 |
8928753 | Han et al. | Jan 2015 | B2 |
8972095 | Furuno et al. | Mar 2015 | B2 |
8976269 | Duong | Mar 2015 | B2 |
9008422 | Eid et al. | Apr 2015 | B2 |
9081385 | Ferguson et al. | Jul 2015 | B1 |
9275289 | Li et al. | Mar 2016 | B2 |
9586455 | Sugai et al. | Mar 2017 | B2 |
9672437 | McCarthy | Jun 2017 | B2 |
9697463 | Ross | Jul 2017 | B2 |
9710696 | Wang et al. | Jul 2017 | B2 |
9738223 | Zhang et al. | Aug 2017 | B2 |
9754154 | Craig et al. | Sep 2017 | B2 |
9767369 | Furman et al. | Sep 2017 | B2 |
9965865 | Agrawal et al. | May 2018 | B1 |
10074051 | Thorson | Sep 2018 | B2 |
10133273 | Linke | Nov 2018 | B2 |
10140252 | Fowers et al. | Nov 2018 | B2 |
10140544 | Zhao et al. | Nov 2018 | B1 |
10146225 | Ryan | Dec 2018 | B2 |
10152655 | Krishnamurthy et al. | Dec 2018 | B2 |
10167800 | Chung et al. | Jan 2019 | B1 |
10169680 | Sachdeva et al. | Jan 2019 | B1 |
10192016 | Ng et al. | Jan 2019 | B2 |
10216189 | Haynes | Feb 2019 | B1 |
10228693 | Micks et al. | Mar 2019 | B2 |
10242293 | Shim et al. | Mar 2019 | B2 |
10248121 | VandenBerg, III | Apr 2019 | B2 |
10262218 | Lee et al. | Apr 2019 | B2 |
10282623 | Ziyaee et al. | May 2019 | B1 |
10296828 | Viswanathan | May 2019 | B2 |
10303961 | Stoffel et al. | May 2019 | B1 |
10310087 | Laddha et al. | Jun 2019 | B2 |
10311312 | Yu et al. | Jun 2019 | B2 |
10318848 | Dijkman et al. | Jun 2019 | B2 |
10325178 | Tang et al. | Jun 2019 | B1 |
10331974 | Zia et al. | Jun 2019 | B2 |
10338600 | Yoon et al. | Jul 2019 | B2 |
10343607 | Kumon et al. | Jul 2019 | B2 |
10359783 | Williams et al. | Jul 2019 | B2 |
10366290 | Wang et al. | Jul 2019 | B2 |
10372130 | Kaushansky et al. | Aug 2019 | B1 |
10373019 | Nariyambut Murali et al. | Aug 2019 | B2 |
10373026 | Kim et al. | Aug 2019 | B1 |
10380741 | Yedla et al. | Aug 2019 | B2 |
10394237 | Xu et al. | Aug 2019 | B2 |
10395144 | Zeng et al. | Aug 2019 | B2 |
10402646 | Klaus | Sep 2019 | B2 |
10402986 | Ray et al. | Sep 2019 | B2 |
10414395 | Sapp et al. | Sep 2019 | B1 |
10417560 | Henry et al. | Sep 2019 | B2 |
10423934 | Zanghi et al. | Sep 2019 | B1 |
10436615 | Agarwal et al. | Oct 2019 | B2 |
10438115 | Henry et al. | Oct 2019 | B2 |
10452905 | Segalovitz et al. | Oct 2019 | B2 |
10460053 | Olson et al. | Oct 2019 | B2 |
10467459 | Chen et al. | Nov 2019 | B2 |
10468008 | Beckman et al. | Nov 2019 | B2 |
10468062 | Levinson et al. | Nov 2019 | B1 |
10470510 | Koh et al. | Nov 2019 | B1 |
10474160 | Huang et al. | Nov 2019 | B2 |
10474161 | Huang et al. | Nov 2019 | B2 |
10474928 | Sivakumar et al. | Nov 2019 | B2 |
10489126 | Kumar et al. | Nov 2019 | B2 |
10489478 | Shalev | Nov 2019 | B2 |
10489972 | Atsmon | Nov 2019 | B2 |
10503971 | Dang et al. | Dec 2019 | B1 |
10514711 | Bar-Nahum et al. | Dec 2019 | B2 |
10528824 | Zou | Jan 2020 | B2 |
10529078 | Abreu et al. | Jan 2020 | B2 |
10529088 | Fine et al. | Jan 2020 | B2 |
10534854 | Sharma et al. | Jan 2020 | B2 |
10535191 | Sachdeva et al. | Jan 2020 | B2 |
10542930 | Sanchez et al. | Jan 2020 | B1 |
10546197 | Shrestha et al. | Jan 2020 | B2 |
10546217 | Albright et al. | Jan 2020 | B2 |
10552682 | Jonsson et al. | Feb 2020 | B2 |
10559386 | Neuman | Feb 2020 | B1 |
10565475 | Lecue et al. | Feb 2020 | B2 |
10567674 | Kirsch | Feb 2020 | B2 |
10568570 | Sherpa et al. | Feb 2020 | B1 |
10572717 | Zhu et al. | Feb 2020 | B1 |
10574905 | Srikanth et al. | Feb 2020 | B2 |
10579058 | Oh et al. | Mar 2020 | B2 |
10579063 | Haynes et al. | Mar 2020 | B2 |
10579897 | Redmon et al. | Mar 2020 | B2 |
10586280 | McKenna et al. | Mar 2020 | B2 |
10591914 | Palanisamy et al. | Mar 2020 | B2 |
10592785 | Zhu et al. | Mar 2020 | B2 |
10599701 | Liu | Mar 2020 | B2 |
10599930 | Lee et al. | Mar 2020 | B2 |
10599958 | He et al. | Mar 2020 | B2 |
10606990 | Tuli et al. | Mar 2020 | B2 |
10609434 | Singhai et al. | Mar 2020 | B2 |
10614344 | Anthony et al. | Apr 2020 | B2 |
10621513 | Deshpande et al. | Apr 2020 | B2 |
10627818 | Sapp et al. | Apr 2020 | B2 |
10628432 | Guo et al. | Apr 2020 | B2 |
10628686 | Ogale et al. | Apr 2020 | B2 |
10628688 | Kim et al. | Apr 2020 | B1 |
10629080 | Kazemi et al. | Apr 2020 | B2 |
10636161 | Uchigaito | Apr 2020 | B2 |
10636169 | Estrada et al. | Apr 2020 | B2 |
10642275 | Silva et al. | May 2020 | B2 |
10645344 | Marman et al. | May 2020 | B2 |
10649464 | Gray | May 2020 | B2 |
10650071 | Asgekar et al. | May 2020 | B2 |
10652565 | Zhang et al. | May 2020 | B1 |
10656657 | Djuric et al. | May 2020 | B2 |
10657391 | Chen et al. | May 2020 | B2 |
10657418 | Marder et al. | May 2020 | B2 |
10657934 | Kolen et al. | May 2020 | B1 |
10661902 | Tavshikar | May 2020 | B1 |
10664750 | Greene | May 2020 | B2 |
10671082 | Huang et al. | Jun 2020 | B2 |
10671349 | Bannon et al. | Jun 2020 | B2 |
10671886 | Price et al. | Jun 2020 | B2 |
10678244 | Iandola et al. | Jun 2020 | B2 |
10678839 | Gordon et al. | Jun 2020 | B2 |
10678997 | Ahuja et al. | Jun 2020 | B2 |
10679129 | Baker | Jun 2020 | B2 |
10685159 | Su et al. | Jun 2020 | B2 |
10685188 | Zhang et al. | Jun 2020 | B1 |
10692000 | Surazhsky et al. | Jun 2020 | B2 |
10692242 | Morrison et al. | Jun 2020 | B1 |
10693740 | Coccia et al. | Jun 2020 | B2 |
10698868 | Guggilla et al. | Jun 2020 | B2 |
10699119 | Lo et al. | Jun 2020 | B2 |
10699140 | Kench et al. | Jun 2020 | B2 |
10699477 | Levinson et al. | Jun 2020 | B2 |
10713502 | Tiziani | Jul 2020 | B2 |
10719759 | Kutliroff | Jul 2020 | B2 |
10725475 | Yang et al. | Jul 2020 | B2 |
10726264 | Sawhney et al. | Jul 2020 | B2 |
10726279 | Kim et al. | Jul 2020 | B1 |
10726374 | Engineer et al. | Jul 2020 | B1 |
10732261 | Wang et al. | Aug 2020 | B1 |
10733262 | Miller et al. | Aug 2020 | B2 |
10733482 | Lee et al. | Aug 2020 | B1 |
10733638 | Jain et al. | Aug 2020 | B1 |
10733755 | Liao et al. | Aug 2020 | B2 |
10733876 | Moura et al. | Aug 2020 | B2 |
10740563 | Dugan | Aug 2020 | B2 |
10740914 | Xiao et al. | Aug 2020 | B2 |
10747844 | Bannon et al. | Aug 2020 | B2 |
10748062 | Rippel et al. | Aug 2020 | B2 |
10748247 | Paluri | Aug 2020 | B2 |
10751879 | Li et al. | Aug 2020 | B2 |
10755112 | Mabuchi | Aug 2020 | B2 |
10755575 | Johnston et al. | Aug 2020 | B2 |
10757330 | Ashrafi | Aug 2020 | B2 |
10762396 | Vallespi et al. | Sep 2020 | B2 |
10768628 | Martin et al. | Sep 2020 | B2 |
10768629 | Song et al. | Sep 2020 | B2 |
10769446 | Chang et al. | Sep 2020 | B2 |
10769483 | Nirenberg et al. | Sep 2020 | B2 |
10769493 | Yu et al. | Sep 2020 | B2 |
10769494 | Xiao et al. | Sep 2020 | B2 |
10769525 | Redding et al. | Sep 2020 | B2 |
10776626 | Lin et al. | Sep 2020 | B1 |
10776673 | Kim et al. | Sep 2020 | B2 |
10776939 | Ma et al. | Sep 2020 | B2 |
10779760 | Lee et al. | Sep 2020 | B2 |
10783381 | Yu et al. | Sep 2020 | B2 |
10783454 | Shoaib et al. | Sep 2020 | B2 |
10789402 | Vemuri et al. | Sep 2020 | B1 |
10789544 | Fiedel et al. | Sep 2020 | B2 |
10790919 | Kolen et al. | Sep 2020 | B1 |
10796221 | Zhang et al. | Oct 2020 | B2 |
10796355 | Price et al. | Oct 2020 | B1 |
10796423 | Goja | Oct 2020 | B2 |
10798368 | Briggs et al. | Oct 2020 | B2 |
10803325 | Bai et al. | Oct 2020 | B2 |
10803328 | Bai et al. | Oct 2020 | B1 |
10803743 | Abari et al. | Oct 2020 | B2 |
10805629 | Liu et al. | Oct 2020 | B2 |
10809730 | Chintakindi | Oct 2020 | B2 |
10810445 | Kangaspunta | Oct 2020 | B1 |
10816346 | Wheeler et al. | Oct 2020 | B2 |
10816992 | Chen | Oct 2020 | B2 |
10817731 | Vallespi et al. | Oct 2020 | B2 |
10817732 | Porter et al. | Oct 2020 | B2 |
10819923 | McCauley et al. | Oct 2020 | B1 |
10824122 | Mummadi et al. | Nov 2020 | B2 |
10824862 | Qi et al. | Nov 2020 | B2 |
10828790 | Nemallan | Nov 2020 | B2 |
10832057 | Chan et al. | Nov 2020 | B2 |
10832093 | Taralova et al. | Nov 2020 | B1 |
10832414 | Pfeiffer | Nov 2020 | B2 |
10832418 | Karasev et al. | Nov 2020 | B1 |
10833785 | O'Shea et al. | Nov 2020 | B1 |
10836379 | Xiao et al. | Nov 2020 | B2 |
10838936 | Cohen | Nov 2020 | B2 |
10839230 | Charette et al. | Nov 2020 | B2 |
10839578 | Coppersmith et al. | Nov 2020 | B2 |
10843628 | Kawamoto et al. | Nov 2020 | B2 |
10845820 | Wheeler | Nov 2020 | B2 |
10845943 | Ansari et al. | Nov 2020 | B1 |
10846831 | Raduta | Nov 2020 | B2 |
10846888 | Kaplanyan et al. | Nov 2020 | B2 |
10853670 | Sholingar et al. | Dec 2020 | B2 |
10853739 | Truong et al. | Dec 2020 | B2 |
10860919 | Kanazawa et al. | Dec 2020 | B2 |
10860924 | Burger | Dec 2020 | B2 |
10867444 | Russell et al. | Dec 2020 | B2 |
10871444 | Al et al. | Dec 2020 | B2 |
10871782 | Milstein et al. | Dec 2020 | B2 |
10872204 | Zhu et al. | Dec 2020 | B2 |
10872254 | Mangla et al. | Dec 2020 | B2 |
10872326 | Garner | Dec 2020 | B2 |
10872531 | Liu et al. | Dec 2020 | B2 |
10885083 | Moeller-Bertram et al. | Jan 2021 | B2 |
10887433 | Fu et al. | Jan 2021 | B2 |
10890898 | Akella et al. | Jan 2021 | B2 |
10891715 | Li | Jan 2021 | B2 |
10891735 | Yang et al. | Jan 2021 | B2 |
10893070 | Wang et al. | Jan 2021 | B2 |
10893107 | Callari et al. | Jan 2021 | B1 |
10896763 | Kempanna et al. | Jan 2021 | B2 |
10901416 | Khanna et al. | Jan 2021 | B2 |
10901508 | Laszlo et al. | Jan 2021 | B2 |
10902551 | Mellado et al. | Jan 2021 | B1 |
10908068 | Amer et al. | Feb 2021 | B2 |
10908606 | Stein et al. | Feb 2021 | B2 |
10909368 | Guo et al. | Feb 2021 | B2 |
10909453 | Myers et al. | Feb 2021 | B1 |
10915783 | Hallman et al. | Feb 2021 | B1 |
10917522 | Segalis et al. | Feb 2021 | B2 |
10921817 | Kangaspunta | Feb 2021 | B1 |
10922578 | Banerjee et al. | Feb 2021 | B2 |
10924661 | Vasconcelos et al. | Feb 2021 | B2 |
10928508 | Swaminathan | Feb 2021 | B2 |
10929757 | Baker et al. | Feb 2021 | B2 |
10930065 | Grant et al. | Feb 2021 | B2 |
10936908 | Ho et al. | Mar 2021 | B1 |
10937186 | Wang et al. | Mar 2021 | B2 |
10942737 | Ivanov | Mar 2021 | B2 |
10943101 | Agarwal et al. | Mar 2021 | B2 |
10943132 | Wang et al. | Mar 2021 | B2 |
10943355 | Fagg et al. | Mar 2021 | B2 |
11157287 | Talpes | Oct 2021 | B2 |
11157441 | Talpes | Oct 2021 | B2 |
11210584 | Brand | Dec 2021 | B2 |
11403069 | Bannon et al. | Aug 2022 | B2 |
11409692 | Das Sarma et al. | Aug 2022 | B2 |
20020169942 | Sugimoto | Nov 2002 | A1 |
20030035481 | Hahm | Feb 2003 | A1 |
20040091135 | Bourg et al. | May 2004 | A1 |
20040148321 | Guevorkian et al. | Jul 2004 | A1 |
20050125369 | Buck et al. | Jun 2005 | A1 |
20050162445 | Sheasby et al. | Jul 2005 | A1 |
20050172106 | Ford et al. | Aug 2005 | A1 |
20060072847 | Chor et al. | Apr 2006 | A1 |
20060224533 | Thaler | Oct 2006 | A1 |
20060280364 | Ma et al. | Dec 2006 | A1 |
20070255903 | Tsadik | Nov 2007 | A1 |
20080209135 | Clark | Aug 2008 | A1 |
20090016571 | Tijerina et al. | Jan 2009 | A1 |
20090113182 | Abernathy | Apr 2009 | A1 |
20090192958 | Todorokihara | Jul 2009 | A1 |
20100017351 | Hench | Jan 2010 | A1 |
20100118157 | Kameyama | May 2010 | A1 |
20110029471 | Chakradhar et al. | Feb 2011 | A1 |
20110239032 | Kato et al. | Sep 2011 | A1 |
20110307681 | Piry et al. | Dec 2011 | A1 |
20120017066 | Vorbach et al. | Jan 2012 | A1 |
20120109915 | Kamekawa et al. | May 2012 | A1 |
20120110491 | Cheung | May 2012 | A1 |
20120134595 | Fonseca et al. | May 2012 | A1 |
20120278376 | Bakos | Nov 2012 | A1 |
20120323832 | Snook et al. | Dec 2012 | A1 |
20130159665 | Kashyap | Jun 2013 | A1 |
20140046995 | Ranous | Feb 2014 | A1 |
20140089232 | Buibas et al. | Mar 2014 | A1 |
20140115278 | Redford | Apr 2014 | A1 |
20140142929 | Seide et al. | May 2014 | A1 |
20140180989 | Krizhevsky et al. | Jun 2014 | A1 |
20140277718 | Tzhikevich et al. | Sep 2014 | A1 |
20140351190 | Levin et al. | Nov 2014 | A1 |
20150046332 | Adjaoute | Feb 2015 | A1 |
20150067273 | Strauss et al. | Mar 2015 | A1 |
20150104102 | Carreira et al. | Apr 2015 | A1 |
20150170021 | Lupon et al. | Jun 2015 | A1 |
20150199272 | Goel | Jul 2015 | A1 |
20150331832 | Minoya | Nov 2015 | A1 |
20160085721 | Abali | Mar 2016 | A1 |
20160132786 | Balan et al. | May 2016 | A1 |
20160328856 | Mannino et al. | Nov 2016 | A1 |
20160342889 | Thorson et al. | Nov 2016 | A1 |
20160342890 | Young | Nov 2016 | A1 |
20160342891 | Ross | Nov 2016 | A1 |
20160342892 | Ross | Nov 2016 | A1 |
20160342893 | Ross et al. | Nov 2016 | A1 |
20160351195 | Falik et al. | Dec 2016 | A1 |
20160364334 | Asaro | Dec 2016 | A1 |
20160379109 | Chung | Dec 2016 | A1 |
20170011281 | Dihkman et al. | Jan 2017 | A1 |
20170017489 | Kimura | Jan 2017 | A1 |
20170052785 | Uliel | Feb 2017 | A1 |
20170060811 | Yang | Mar 2017 | A1 |
20170097884 | Werner | Apr 2017 | A1 |
20170103298 | Ling | Apr 2017 | A1 |
20170103299 | Aydonat | Apr 2017 | A1 |
20170103313 | Ross et al. | Apr 2017 | A1 |
20170103318 | Ross | Apr 2017 | A1 |
20170158134 | Shigemura | Jun 2017 | A1 |
20170193360 | Gao | Jul 2017 | A1 |
20170206434 | Nariyambut et al. | Jul 2017 | A1 |
20170277537 | Grocutt | Sep 2017 | A1 |
20170277658 | Pratas | Sep 2017 | A1 |
20180012411 | Richey et al. | Jan 2018 | A1 |
20180018590 | Szeto et al. | Jan 2018 | A1 |
20180032857 | Lele | Feb 2018 | A1 |
20180039853 | Liu et al. | Feb 2018 | A1 |
20180046900 | Dally | Feb 2018 | A1 |
20180067489 | Oder et al. | Mar 2018 | A1 |
20180068459 | Zhang et al. | Mar 2018 | A1 |
20180068540 | Romanenko et al. | Mar 2018 | A1 |
20180074506 | Branson | Mar 2018 | A1 |
20180107484 | Sebexen | Apr 2018 | A1 |
20180121762 | Han et al. | May 2018 | A1 |
20180150081 | Gross et al. | May 2018 | A1 |
20180157961 | Henry | Jun 2018 | A1 |
20180157962 | Henry | Jun 2018 | A1 |
20180157966 | Henry | Jun 2018 | A1 |
20180189633 | Henry | Jul 2018 | A1 |
20180189639 | Henry | Jul 2018 | A1 |
20180189640 | Henry | Jul 2018 | A1 |
20180189649 | Naranyan | Jul 2018 | A1 |
20180189651 | Henry | Jul 2018 | A1 |
20180197067 | Mody | Jul 2018 | A1 |
20180211403 | Hotson et al. | Jul 2018 | A1 |
20180218260 | Brand | Aug 2018 | A1 |
20180247180 | Cheng | Aug 2018 | A1 |
20180260220 | Lacy | Sep 2018 | A1 |
20180307438 | Huang | Oct 2018 | A1 |
20180307783 | Hah | Oct 2018 | A1 |
20180308012 | Mummadi et al. | Oct 2018 | A1 |
20180314878 | Lee et al. | Nov 2018 | A1 |
20180315153 | Park | Nov 2018 | A1 |
20180336164 | Phelps | Nov 2018 | A1 |
20180357511 | Misra et al. | Dec 2018 | A1 |
20180374105 | Azout et al. | Dec 2018 | A1 |
20190011551 | Yamamoto | Jan 2019 | A1 |
20190023277 | Roger et al. | Jan 2019 | A1 |
20190025773 | Yang et al. | Jan 2019 | A1 |
20190026250 | Das Sarma | Jan 2019 | A1 |
20190042894 | Anderson | Feb 2019 | A1 |
20190042919 | Peysakhovich et al. | Feb 2019 | A1 |
20190042944 | Nair et al. | Feb 2019 | A1 |
20190042948 | Lee et al. | Feb 2019 | A1 |
20190057314 | Julian et al. | Feb 2019 | A1 |
20190065637 | Bogdoll et al. | Feb 2019 | A1 |
20190072978 | Levi | Mar 2019 | A1 |
20190079526 | Vallespi et al. | Mar 2019 | A1 |
20190080602 | Rice et al. | Mar 2019 | A1 |
20190088948 | Rasale | Mar 2019 | A1 |
20190095780 | Zhong et al. | Mar 2019 | A1 |
20190095946 | Azout et al. | Mar 2019 | A1 |
20190101914 | Coleman et al. | Apr 2019 | A1 |
20190108417 | Talagala et al. | Apr 2019 | A1 |
20190122111 | Min et al. | Apr 2019 | A1 |
20190130255 | Yim et al. | May 2019 | A1 |
20190145765 | Luo et al. | May 2019 | A1 |
20190146497 | Urtasun et al. | May 2019 | A1 |
20190147112 | Gordon | May 2019 | A1 |
20190147250 | Zhang et al. | May 2019 | A1 |
20190147254 | Bai et al. | May 2019 | A1 |
20190147255 | Homayounfar et al. | May 2019 | A1 |
20190147335 | Wang et al. | May 2019 | A1 |
20190147372 | Luo et al. | May 2019 | A1 |
20190158784 | Ahn et al. | May 2019 | A1 |
20190179870 | Bannon | Jun 2019 | A1 |
20190180154 | Orlov et al. | Jun 2019 | A1 |
20190185010 | Ganguli et al. | Jun 2019 | A1 |
20190189251 | Horiuchi et al. | Jun 2019 | A1 |
20190197357 | Anderson et al. | Jun 2019 | A1 |
20190204842 | Jafari et al. | Jul 2019 | A1 |
20190205402 | Sernau et al. | Jul 2019 | A1 |
20190205667 | Avidan et al. | Jul 2019 | A1 |
20190217791 | Bradley et al. | Jul 2019 | A1 |
20190227562 | Mohammadiha et al. | Jul 2019 | A1 |
20190228037 | Nicol et al. | Jul 2019 | A1 |
20190230282 | Sypitkowski et al. | Jul 2019 | A1 |
20190235499 | Kazemi et al. | Aug 2019 | A1 |
20190235866 | Das Sarma | Aug 2019 | A1 |
20190236437 | Shin et al. | Aug 2019 | A1 |
20190243371 | Nister et al. | Aug 2019 | A1 |
20190244138 | Bhowmick et al. | Aug 2019 | A1 |
20190250622 | Nister et al. | Aug 2019 | A1 |
20190250626 | Ghafarianzadeh et al. | Aug 2019 | A1 |
20190250640 | O'Flaherty et al. | Aug 2019 | A1 |
20190258878 | Koivisto et al. | Aug 2019 | A1 |
20190266418 | Xu et al. | Aug 2019 | A1 |
20190266610 | Ghatage et al. | Aug 2019 | A1 |
20190272446 | Kangaspunta et al. | Sep 2019 | A1 |
20190276041 | Choi et al. | Sep 2019 | A1 |
20190279004 | Kwon et al. | Sep 2019 | A1 |
20190286652 | Habbecke et al. | Sep 2019 | A1 |
20190286972 | El Husseini et al. | Sep 2019 | A1 |
20190287028 | St Amant et al. | Sep 2019 | A1 |
20190289281 | Badrinarayanan et al. | Sep 2019 | A1 |
20190294177 | Kwon et al. | Sep 2019 | A1 |
20190294975 | Sachs | Sep 2019 | A1 |
20190311253 | Chung | Oct 2019 | A1 |
20190311290 | Huang et al. | Oct 2019 | A1 |
20190318099 | Carvalho et al. | Oct 2019 | A1 |
20190325088 | Dubey et al. | Oct 2019 | A1 |
20190325266 | Klepper et al. | Oct 2019 | A1 |
20190325269 | Bagherinezhad et al. | Oct 2019 | A1 |
20190325580 | Lukac et al. | Oct 2019 | A1 |
20190325595 | Stein et al. | Oct 2019 | A1 |
20190329790 | Nandakumar et al. | Oct 2019 | A1 |
20190332875 | Vallespi-Gonzalez et al. | Oct 2019 | A1 |
20190333232 | Vallespi-Gonzalez et al. | Oct 2019 | A1 |
20190336063 | Dascalu | Nov 2019 | A1 |
20190339989 | Liang et al. | Nov 2019 | A1 |
20190340462 | Pao et al. | Nov 2019 | A1 |
20190340492 | Burger et al. | Nov 2019 | A1 |
20190340499 | Burger et al. | Nov 2019 | A1 |
20190347501 | Kim et al. | Nov 2019 | A1 |
20190349571 | Herman et al. | Nov 2019 | A1 |
20190354782 | Kee et al. | Nov 2019 | A1 |
20190354786 | Lee et al. | Nov 2019 | A1 |
20190354808 | Park et al. | Nov 2019 | A1 |
20190354817 | Shlens et al. | Nov 2019 | A1 |
20190354850 | Watson et al. | Nov 2019 | A1 |
20190370398 | He et al. | Dec 2019 | A1 |
20190370575 | Nandakumar et al. | Dec 2019 | A1 |
20190370645 | Lee | Dec 2019 | A1 |
20190370935 | Chang et al. | Dec 2019 | A1 |
20190373322 | Rojas-Echenique et al. | Dec 2019 | A1 |
20190377345 | Bachrach et al. | Dec 2019 | A1 |
20190377965 | Totolos et al. | Dec 2019 | A1 |
20190378049 | Widmann et al. | Dec 2019 | A1 |
20190378051 | Widmann et al. | Dec 2019 | A1 |
20190382007 | Casas et al. | Dec 2019 | A1 |
20190384303 | Muller et al. | Dec 2019 | A1 |
20190384304 | Towal et al. | Dec 2019 | A1 |
20190384309 | Silva et al. | Dec 2019 | A1 |
20190384994 | Frossard et al. | Dec 2019 | A1 |
20190385048 | Cassidy et al. | Dec 2019 | A1 |
20190385360 | Yang et al. | Dec 2019 | A1 |
20200004259 | Gulino et al. | Jan 2020 | A1 |
20200004351 | Marchant et al. | Jan 2020 | A1 |
20200012936 | Lee et al. | Jan 2020 | A1 |
20200017117 | Milton | Jan 2020 | A1 |
20200025931 | Liang et al. | Jan 2020 | A1 |
20200026282 | Choe et al. | Jan 2020 | A1 |
20200026283 | Barnes et al. | Jan 2020 | A1 |
20200026992 | Zhang et al. | Jan 2020 | A1 |
20200027210 | Haemel et al. | Jan 2020 | A1 |
20200033858 | Xiao | Jan 2020 | A1 |
20200033865 | Mellinger et al. | Jan 2020 | A1 |
20200034148 | Sumbu | Jan 2020 | A1 |
20200034665 | Ghanta et al. | Jan 2020 | A1 |
20200034710 | Sidhu et al. | Jan 2020 | A1 |
20200036948 | Song | Jan 2020 | A1 |
20200039520 | Misu et al. | Feb 2020 | A1 |
20200051550 | Baker | Feb 2020 | A1 |
20200060757 | Ben-Haim et al. | Feb 2020 | A1 |
20200065711 | Clément et al. | Feb 2020 | A1 |
20200065879 | Hu et al. | Feb 2020 | A1 |
20200069973 | Lou et al. | Mar 2020 | A1 |
20200073385 | Jobanputra et al. | Mar 2020 | A1 |
20200074230 | Englard et al. | Mar 2020 | A1 |
20200086880 | Poeppel et al. | Mar 2020 | A1 |
20200089243 | Poeppel et al. | Mar 2020 | A1 |
20200089969 | Lakshmi et al. | Mar 2020 | A1 |
20200090056 | Singhal et al. | Mar 2020 | A1 |
20200097841 | Petousis et al. | Mar 2020 | A1 |
20200098095 | Borcs et al. | Mar 2020 | A1 |
20200103894 | Cella et al. | Apr 2020 | A1 |
20200104705 | Bhowmick et al. | Apr 2020 | A1 |
20200110416 | Hong et al. | Apr 2020 | A1 |
20200117180 | Cella et al. | Apr 2020 | A1 |
20200117889 | Laput et al. | Apr 2020 | A1 |
20200117916 | Liu | Apr 2020 | A1 |
20200117917 | Yoo | Apr 2020 | A1 |
20200118035 | Asawa et al. | Apr 2020 | A1 |
20200125844 | She et al. | Apr 2020 | A1 |
20200125845 | Hess et al. | Apr 2020 | A1 |
20200126129 | Lkhamsuren et al. | Apr 2020 | A1 |
20200134427 | Oh et al. | Apr 2020 | A1 |
20200134461 | Chai et al. | Apr 2020 | A1 |
20200134466 | Weintraub et al. | Apr 2020 | A1 |
20200134848 | El-Khamy et al. | Apr 2020 | A1 |
20200143231 | Fusi et al. | May 2020 | A1 |
20200143279 | West et al. | May 2020 | A1 |
20200148201 | King et al. | May 2020 | A1 |
20200149898 | Felip et al. | May 2020 | A1 |
20200151201 | Chandrasekhar et al. | May 2020 | A1 |
20200151619 | Mopur et al. | May 2020 | A1 |
20200151692 | Gao et al. | May 2020 | A1 |
20200158822 | Owens et al. | May 2020 | A1 |
20200158869 | Amirloo et al. | May 2020 | A1 |
20200159225 | Zeng et al. | May 2020 | A1 |
20200160064 | Wang et al. | May 2020 | A1 |
20200160104 | Urtasun et al. | May 2020 | A1 |
20200160117 | Urtasun et al. | May 2020 | A1 |
20200160178 | Kar et al. | May 2020 | A1 |
20200160532 | Urtasun et al. | May 2020 | A1 |
20200160558 | Urtasun et al. | May 2020 | A1 |
20200160559 | Urtasun et al. | May 2020 | A1 |
20200160598 | Manivasagam et al. | May 2020 | A1 |
20200162489 | Bar-Nahum et al. | May 2020 | A1 |
20200167438 | Herring | May 2020 | A1 |
20200167554 | Wang et al. | May 2020 | A1 |
20200174481 | Van Heukelom et al. | Jun 2020 | A1 |
20200175326 | Shen et al. | Jun 2020 | A1 |
20200175354 | Volodarskiy et al. | Jun 2020 | A1 |
20200175371 | Kursun | Jun 2020 | A1 |
20200175401 | Shen | Jun 2020 | A1 |
20200183482 | Sebot et al. | Jun 2020 | A1 |
20200184250 | Oko | Jun 2020 | A1 |
20200184333 | Oh | Jun 2020 | A1 |
20200192389 | ReMine et al. | Jun 2020 | A1 |
20200193313 | Ghanta et al. | Jun 2020 | A1 |
20200193328 | Guestrin et al. | Jun 2020 | A1 |
20200202136 | Shrestha et al. | Jun 2020 | A1 |
20200202196 | Guo et al. | Jun 2020 | A1 |
20200209857 | Djuric et al. | Jul 2020 | A1 |
20200209867 | Valois et al. | Jul 2020 | A1 |
20200209874 | Chen et al. | Jul 2020 | A1 |
20200210175 | Alexander et al. | Jul 2020 | A1 |
20200210187 | Alexander et al. | Jul 2020 | A1 |
20200210717 | Hou et al. | Jul 2020 | A1 |
20200210769 | Hou et al. | Jul 2020 | A1 |
20200210777 | Valois et al. | Jul 2020 | A1 |
20200216064 | du Toit et al. | Jul 2020 | A1 |
20200218722 | Mai et al. | Jul 2020 | A1 |
20200218979 | Kwon et al. | Jul 2020 | A1 |
20200223434 | Campos et al. | Jul 2020 | A1 |
20200225758 | Tang et al. | Jul 2020 | A1 |
20200226377 | Campos et al. | Jul 2020 | A1 |
20200226430 | Ahuja et al. | Jul 2020 | A1 |
20200238998 | Dasalukunte et al. | Jul 2020 | A1 |
20200242381 | Chao et al. | Jul 2020 | A1 |
20200242408 | Kim et al. | Jul 2020 | A1 |
20200242511 | Kale et al. | Jul 2020 | A1 |
20200245869 | Sivan et al. | Aug 2020 | A1 |
20200249685 | Elluswamy et al. | Aug 2020 | A1 |
20200250456 | Wang et al. | Aug 2020 | A1 |
20200250515 | Rifkin et al. | Aug 2020 | A1 |
20200250874 | Assouline et al. | Aug 2020 | A1 |
20200257301 | Weiser et al. | Aug 2020 | A1 |
20200257306 | Nisenzon | Aug 2020 | A1 |
20200258057 | Farahat et al. | Aug 2020 | A1 |
20200265247 | Musk et al. | Aug 2020 | A1 |
20200272160 | Djuric et al. | Aug 2020 | A1 |
20200272162 | Hasselgren et al. | Aug 2020 | A1 |
20200272859 | Iashyn et al. | Aug 2020 | A1 |
20200273231 | Schied et al. | Aug 2020 | A1 |
20200279354 | Klaiman | Sep 2020 | A1 |
20200279364 | Sarkisian et al. | Sep 2020 | A1 |
20200279371 | Wenzel et al. | Sep 2020 | A1 |
20200285464 | Brebner | Sep 2020 | A1 |
20200286256 | Houts et al. | Sep 2020 | A1 |
20200293786 | Jia et al. | Sep 2020 | A1 |
20200293796 | Sajjadi et al. | Sep 2020 | A1 |
20200293828 | Wang et al. | Sep 2020 | A1 |
20200293905 | Huang et al. | Sep 2020 | A1 |
20200294162 | Shah | Sep 2020 | A1 |
20200294257 | Yoo et al. | Sep 2020 | A1 |
20200294310 | Lee et al. | Sep 2020 | A1 |
20200297237 | Tamersoy et al. | Sep 2020 | A1 |
20200298891 | Liang et al. | Sep 2020 | A1 |
20200301799 | Manivasagam et al. | Sep 2020 | A1 |
20200302276 | Yang et al. | Sep 2020 | A1 |
20200302291 | Hong | Sep 2020 | A1 |
20200302627 | Duggal et al. | Sep 2020 | A1 |
20200302662 | Homayounfar et al. | Sep 2020 | A1 |
20200304441 | Bradley et al. | Sep 2020 | A1 |
20200306640 | Kolen et al. | Oct 2020 | A1 |
20200307562 | Ghafarianzadeh et al. | Oct 2020 | A1 |
20200307563 | Ghafarianzadeh et al. | Oct 2020 | A1 |
20200309536 | Omari et al. | Oct 2020 | A1 |
20200309923 | Bhaskaran et al. | Oct 2020 | A1 |
20200310442 | Halder et al. | Oct 2020 | A1 |
20200311601 | Robinson et al. | Oct 2020 | A1 |
20200312003 | Borovikov et al. | Oct 2020 | A1 |
20200315708 | Mosnier et al. | Oct 2020 | A1 |
20200320132 | Neumann | Oct 2020 | A1 |
20200324073 | Rajan et al. | Oct 2020 | A1 |
20200327192 | Hackman et al. | Oct 2020 | A1 |
20200327443 | Van et al. | Oct 2020 | A1 |
20200327449 | Tiwari et al. | Oct 2020 | A1 |
20200327662 | Liu et al. | Oct 2020 | A1 |
20200327667 | Arbel et al. | Oct 2020 | A1 |
20200331476 | Chen et al. | Oct 2020 | A1 |
20200334416 | Vianu et al. | Oct 2020 | A1 |
20200334495 | Al et al. | Oct 2020 | A1 |
20200334501 | Lin et al. | Oct 2020 | A1 |
20200334551 | Javidi et al. | Oct 2020 | A1 |
20200334574 | Ishida | Oct 2020 | A1 |
20200337648 | Saripalli et al. | Oct 2020 | A1 |
20200341466 | Pham et al. | Oct 2020 | A1 |
20200342350 | Madar et al. | Oct 2020 | A1 |
20200342548 | Mazed et al. | Oct 2020 | A1 |
20200342652 | Rowell et al. | Oct 2020 | A1 |
20200348909 | Das Sarma et al. | Nov 2020 | A1 |
20200350063 | Thornton et al. | Nov 2020 | A1 |
20200351438 | Dewhurst et al. | Nov 2020 | A1 |
20200356107 | Wells | Nov 2020 | A1 |
20200356790 | Jaipuria et al. | Nov 2020 | A1 |
20200356864 | Neumann | Nov 2020 | A1 |
20200356905 | Luk et al. | Nov 2020 | A1 |
20200361083 | Mousavian et al. | Nov 2020 | A1 |
20200361485 | Zhu et al. | Nov 2020 | A1 |
20200364481 | Kornienko et al. | Nov 2020 | A1 |
20200364508 | Gurel et al. | Nov 2020 | A1 |
20200364540 | Elsayed et al. | Nov 2020 | A1 |
20200364746 | Longano et al. | Nov 2020 | A1 |
20200364953 | Simoudis | Nov 2020 | A1 |
20200372362 | Kim | Nov 2020 | A1 |
20200372402 | Kursun et al. | Nov 2020 | A1 |
20200380362 | Cao et al. | Dec 2020 | A1 |
20200380383 | Kwong et al. | Dec 2020 | A1 |
20200393841 | Frisbie et al. | Dec 2020 | A1 |
20200394421 | Yu et al. | Dec 2020 | A1 |
20200394457 | Brady | Dec 2020 | A1 |
20200394495 | Moudgill et al. | Dec 2020 | A1 |
20200394813 | Theverapperuma et al. | Dec 2020 | A1 |
20200396394 | Zlokolica et al. | Dec 2020 | A1 |
20200398855 | Thompson | Dec 2020 | A1 |
20200401850 | Bazarsky et al. | Dec 2020 | A1 |
20200401886 | Deng et al. | Dec 2020 | A1 |
20200402155 | Kurian et al. | Dec 2020 | A1 |
20200402226 | Peng | Dec 2020 | A1 |
20200410012 | Moon et al. | Dec 2020 | A1 |
20200410224 | Goel | Dec 2020 | A1 |
20200410254 | Pham et al. | Dec 2020 | A1 |
20200410288 | Capota et al. | Dec 2020 | A1 |
20200410751 | Omari et al. | Dec 2020 | A1 |
20210004014 | Sivakumar | Jan 2021 | A1 |
20210004580 | Sundararaman et al. | Jan 2021 | A1 |
20210004611 | Garimella et al. | Jan 2021 | A1 |
20210004663 | Park et al. | Jan 2021 | A1 |
20210006835 | Slattery et al. | Jan 2021 | A1 |
20210011908 | Hayes et al. | Jan 2021 | A1 |
20210012116 | Urtasun et al. | Jan 2021 | A1 |
20210012210 | Sikka et al. | Jan 2021 | A1 |
20210012230 | Hayes et al. | Jan 2021 | A1 |
20210012239 | Arzani et al. | Jan 2021 | A1 |
20210015240 | Elfakhri et al. | Jan 2021 | A1 |
20210019215 | Neeter | Jan 2021 | A1 |
20210026360 | Luo | Jan 2021 | A1 |
20210027112 | Brewington et al. | Jan 2021 | A1 |
20210027117 | McGavran et al. | Jan 2021 | A1 |
20210030276 | Li et al. | Feb 2021 | A1 |
20210034921 | Pinkovich et al. | Feb 2021 | A1 |
20210042575 | Firner | Feb 2021 | A1 |
20210042928 | Takeda et al. | Feb 2021 | A1 |
20210046954 | Haynes | Feb 2021 | A1 |
20210048984 | Bannon | Feb 2021 | A1 |
20210049378 | Gautam et al. | Feb 2021 | A1 |
20210049455 | Kursun | Feb 2021 | A1 |
20210049456 | Kursun | Feb 2021 | A1 |
20210049548 | Grisz et al. | Feb 2021 | A1 |
20210049700 | Nguyen et al. | Feb 2021 | A1 |
20210056114 | Price et al. | Feb 2021 | A1 |
20210056306 | Hu et al. | Feb 2021 | A1 |
20210056317 | Golov | Feb 2021 | A1 |
20210056420 | Konishi et al. | Feb 2021 | A1 |
20210056701 | Vranceanu et al. | Feb 2021 | A1 |
20210089316 | Rash et al. | Mar 2021 | A1 |
20210132943 | Valentine et al. | May 2021 | A1 |
20220050806 | Talpes | Feb 2022 | A1 |
20220188123 | Talpes et al. | Jun 2022 | A1 |
20220365753 | Bannon | Nov 2022 | A1 |
20230115874 | Das Sarma | Apr 2023 | A1 |
20230195458 | Das Sarma | Jun 2023 | A1 |
Number | Date | Country |
---|---|---|
2019261735 | Jun 2020 | AU |
2019201716 | Oct 2020 | AU |
2769788 | Sep 2012 | CA |
110599537 | Dec 2010 | CN |
102541814 | Jul 2012 | CN |
102737236 | Oct 2012 | CN |
102771176 | Nov 2012 | CN |
103198512 | Jul 2013 | CN |
103366339 | Oct 2013 | CN |
104835114 | Aug 2015 | CN |
103236037 | May 2016 | CN |
103500322 | Aug 2016 | CN |
106250103 | Dec 2016 | CN |
106419893 | Feb 2017 | CN |
106504253 | Mar 2017 | CN |
107031600 | Aug 2017 | CN |
107169421 | Sep 2017 | CN |
107507134 | Dec 2017 | CN |
107885214 | Apr 2018 | CN |
108122234 | Jun 2018 | CN |
107133943 | Jul 2018 | CN |
107368926 | Jul 2018 | CN |
105318888 | Aug 2018 | CN |
108491889 | Sep 2018 | CN |
108647591 | Oct 2018 | CN |
108710865 | Oct 2018 | CN |
105550701 | Nov 2018 | CN |
108764185 | Nov 2018 | CN |
108845574 | Nov 2018 | CN |
108898177 | Nov 2018 | CN |
109086867 | Dec 2018 | CN |
107103113 | Jan 2019 | CN |
109215067 | Jan 2019 | CN |
109359731 | Feb 2019 | CN |
109389207 | Feb 2019 | CN |
109389552 | Feb 2019 | CN |
106779060 | Mar 2019 | CN |
109579856 | Apr 2019 | CN |
109615073 | Apr 2019 | CN |
106156754 | May 2019 | CN |
106598226 | May 2019 | CN |
106650922 | May 2019 | CN |
109791626 | May 2019 | CN |
109901595 | Jun 2019 | CN |
109902732 | Jun 2019 | CN |
109934163 | Jun 2019 | CN |
109948428 | Jun 2019 | CN |
109949257 | Jun 2019 | CN |
109951710 | Jun 2019 | CN |
109975308 | Jul 2019 | CN |
109978132 | Jul 2019 | CN |
109978161 | Jul 2019 | CN |
110060202 | Jul 2019 | CN |
110069071 | Jul 2019 | CN |
110084086 | Aug 2019 | CN |
110096937 | Aug 2019 | CN |
110111340 | Aug 2019 | CN |
110135485 | Aug 2019 | CN |
110197270 | Sep 2019 | CN |
110310264 | Oct 2019 | CN |
110321965 | Oct 2019 | CN |
110334801 | Oct 2019 | CN |
110399875 | Nov 2019 | CN |
110414362 | Nov 2019 | CN |
110426051 | Nov 2019 | CN |
110473173 | Nov 2019 | CN |
110516665 | Nov 2019 | CN |
110543837 | Dec 2019 | CN |
110569899 | Dec 2019 | CN |
110599864 | Dec 2019 | CN |
110619282 | Dec 2019 | CN |
110619283 | Dec 2019 | CN |
110619330 | Dec 2019 | CN |
110659628 | Jan 2020 | CN |
110688992 | Jan 2020 | CN |
107742311 | Feb 2020 | CN |
110751280 | Feb 2020 | CN |
110826566 | Feb 2020 | CN |
107451659 | Apr 2020 | CN |
108111873 | Apr 2020 | CN |
110956185 | Apr 2020 | CN |
110966991 | Apr 2020 | CN |
111027549 | Apr 2020 | CN |
111027575 | Apr 2020 | CN |
111047225 | Apr 2020 | CN |
111126453 | May 2020 | CN |
111158355 | May 2020 | CN |
107729998 | Jun 2020 | CN |
108549934 | Jun 2020 | CN |
111275129 | Jun 2020 | CN |
111275618 | Jun 2020 | CN |
111326023 | Jun 2020 | CN |
111428943 | Jul 2020 | CN |
111444821 | Jul 2020 | CN |
111445420 | Jul 2020 | CN |
111461052 | Jul 2020 | CN |
111461053 | Jul 2020 | CN |
111461110 | Jul 2020 | CN |
110225341 | Aug 2020 | CN |
111307162 | Aug 2020 | CN |
111488770 | Aug 2020 | CN |
111539514 | Aug 2020 | CN |
111565318 | Aug 2020 | CN |
111582216 | Aug 2020 | CN |
111598095 | Aug 2020 | CN |
108229526 | Sep 2020 | CN |
111693972 | Sep 2020 | CN |
106558058 | Oct 2020 | CN |
107169560 | Oct 2020 | CN |
107622258 | Oct 2020 | CN |
111767801 | Oct 2020 | CN |
111768002 | Oct 2020 | CN |
111783545 | Oct 2020 | CN |
111783971 | Oct 2020 | CN |
111797657 | Oct 2020 | CN |
111814623 | Oct 2020 | CN |
111814902 | Oct 2020 | CN |
111860499 | Oct 2020 | CN |
111881856 | Nov 2020 | CN |
111882579 | Nov 2020 | CN |
111897639 | Nov 2020 | CN |
111898507 | Nov 2020 | CN |
111898523 | Nov 2020 | CN |
111899227 | Nov 2020 | CN |
112101175 | Dec 2020 | CN |
112101562 | Dec 2020 | CN |
112115953 | Dec 2020 | CN |
111062973 | Jan 2021 | CN |
111275080 | Jan 2021 | CN |
112183739 | Jan 2021 | CN |
112232497 | Jan 2021 | CN |
112288658 | Jan 2021 | CN |
112308095 | Feb 2021 | CN |
112308799 | Feb 2021 | CN |
112313663 | Feb 2021 | CN |
112329552 | Feb 2021 | CN |
112348783 | Feb 2021 | CN |
111899245 | Mar 2021 | CN |
202017102235 | May 2017 | DE |
202017102238 | May 2017 | DE |
102017116017 | Jan 2019 | DE |
102018130821 | Jun 2020 | DE |
102019008316 | Aug 2020 | DE |
0 422 348 | Apr 1991 | EP |
0 586 025 | Mar 1994 | EP |
1215626 | Sep 2008 | EP |
2228666 | Sep 2012 | EP |
2420408 | May 2013 | EP |
2723069 | Apr 2014 | EP |
2741253 | Jun 2014 | EP |
3115772 | Jan 2017 | EP |
2618559 | Aug 2017 | EP |
3285485 | Feb 2018 | EP |
2863633 | Feb 2019 | EP |
3113080 | May 2019 | EP |
3525132 | Aug 2019 | EP |
3531689 | Aug 2019 | EP |
3537340 | Sep 2019 | EP |
3543917 | Sep 2019 | EP |
3608840 | Feb 2020 | EP |
3657387 | May 2020 | EP |
2396750 | Jun 2020 | EP |
3664020 | Jun 2020 | EP |
3690712 | Aug 2020 | EP |
3690742 | Aug 2020 | EP |
3722992 | Oct 2020 | EP |
3690730 | Nov 2020 | EP |
3739486 | Nov 2020 | EP |
3501897 | Dec 2020 | EP |
3751455 | Dec 2020 | EP |
3783527 | Feb 2021 | EP |
2402572 | Aug 2005 | GB |
2548087 | Sep 2017 | GB |
2577485 | Apr 2020 | GB |
2517270 | Jun 2020 | GB |
04-295953 | Oct 1992 | JP |
10-143494 | May 1998 | JP |
2578262 | Aug 1998 | JP |
3941252 | Jul 2007 | JP |
4282583 | Jun 2009 | JP |
4300098 | Jul 2009 | JP |
2010-079840 | Apr 2010 | JP |
2015004922 | Jan 2015 | JP |
2015-056124 | Mar 2015 | JP |
5863536 | Feb 2016 | JP |
6044134 | Dec 2016 | JP |
2017-027149 | Feb 2017 | JP |
6525707 | Jun 2019 | JP |
2019101535 | Jun 2019 | JP |
2020101927 | Jul 2020 | JP |
2020173744 | Oct 2020 | JP |
100326702 | Feb 2002 | KR |
101082878 | Nov 2011 | KR |
101738422 | May 2017 | KR |
101969864 | Apr 2019 | KR |
101996167 | Jul 2019 | KR |
102022388 | Aug 2019 | KR |
102043143 | Nov 2019 | KR |
102095335 | Mar 2020 | KR |
102097120 | Apr 2020 | KR |
1020200085490 | Jul 2020 | KR |
102189262 | Dec 2020 | KR |
1020200142266 | Dec 2020 | KR |
200630819 | Sep 2006 | TW |
I294089 | Mar 2008 | TW |
I306207 | Feb 2009 | TW |
WO 9410638 | May 1994 | WO |
WO 02052835 | Jul 2002 | WO |
WO 14025765 | Feb 2014 | WO |
WO 16032398 | Mar 2016 | WO |
WO 16048108 | Mar 2016 | WO |
WO 16099779 | Jun 2016 | WO |
WO 16186811 | Nov 2016 | WO |
WO 16186823 | Nov 2016 | WO |
WO 16207875 | Dec 2016 | WO |
WO 17117186 | Jul 2017 | WO |
WO 17158622 | Sep 2017 | WO |
WO 19005547 | Jan 2019 | WO |
WO 19067695 | Apr 2019 | WO |
WO 19089339 | May 2019 | WO |
WO 19092456 | May 2019 | WO |
WO 19099622 | May 2019 | WO |
WO 19122952 | Jun 2019 | WO |
WO 19125191 | Jun 2019 | WO |
WO 19126755 | Jun 2019 | WO |
WO 19144575 | Aug 2019 | WO |
WO 19182782 | Sep 2019 | WO |
WO 19191578 | Oct 2019 | WO |
WO 19216938 | Nov 2019 | WO |
WO 19220436 | Nov 2019 | WO |
WO 20006154 | Jan 2020 | WO |
WO 20012756 | Jan 2020 | WO |
WO 20025696 | Feb 2020 | WO |
WO 20034663 | Feb 2020 | WO |
WO 20056157 | Mar 2020 | WO |
WO 20076356 | Apr 2020 | WO |
WO 20097221 | May 2020 | WO |
WO 20101246 | May 2020 | WO |
WO 20120050 | Jun 2020 | WO |
WO 20121973 | Jun 2020 | WO |
WO 20131140 | Jun 2020 | WO |
WO 20139181 | Jul 2020 | WO |
WO 20139355 | Jul 2020 | WO |
WO 20139357 | Jul 2020 | WO |
WO 20142193 | Jul 2020 | WO |
WO 20146445 | Jul 2020 | WO |
WO 20151329 | Jul 2020 | WO |
WO 20157761 | Aug 2020 | WO |
WO 20163455 | Aug 2020 | WO |
WO 20167667 | Aug 2020 | WO |
WO 20174262 | Sep 2020 | WO |
WO 20177583 | Sep 2020 | WO |
WO 20185233 | Sep 2020 | WO |
WO 20185234 | Sep 2020 | WO |
WO 20195658 | Oct 2020 | WO |
WO 20198189 | Oct 2020 | WO |
WO 20198779 | Oct 2020 | WO |
WO 20205597 | Oct 2020 | WO |
WO 20221200 | Nov 2020 | WO |
WO 20240284 | Dec 2020 | WO |
WO 20260020 | Dec 2020 | WO |
WO 20264010 | Dec 2020 | WO |
Entry |
---|
Arima et al., Aug. 15, 1994, Recent Topics of Neurochips, System/Control/Information, 38(8):19. |
Iwase et al., May 1, 2002, High-speed processing method in SIMD-type parallel computer, Den Journal of the Institute of Electrical Engineers of Japan C, 122-C(5):878-884. |
Takahashi, Aug. 2, 1989, Parallel Processing Mechanism, First Edition, Maruzen Co., Ltd., pp. 67-77, 259. |
Cornu et al., “Design, Implementation, and Test of a Multi-Model Systolic Neural-Network Accelerator”, Scientific Programming-Parallel Computing Projects of the Swiss Priority Programme, vol. 5, No. 1, Jan. 1, 1996. |
Jouppi et al., Jun. 26, 2017, In-datacenter performance analysis of a tensor processing unit, 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, 28 pp. |
Kim et al., “A Large-scale Architecture for Restricted Boltzmann Machines”, Department of Electrical Engineering Stanford University, 2010 18th IEEE Annual International Symposium on, IEEE, Piscataway, NJ, USA, May 2, 2010. |
Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, downloaded from <http://papers.nips.cc/book/advances-in-neural-information-processing-systems-25-2012>, The 26th annual conference on Neural Information Processing Systems: Dec. 3-8, 2012. |
Kung S: “VLSI Array processors”, IEEE ASSP Magazine, IEEE. US, vol. 2, No. 3, Jul. 1985 (1 pg). |
Oxford Dictionary, Definition of synchronize, retrieved Sep. 12, 2020, https://www.lexico.com/en/definition/synchronize. |
Sato et al., “An in-depth look at Google's first Tensor Processing Unit (TPU)”, posted in Google Cloud Big Data and Machine Learning Blog, downloaded from internet, <URL: https://cloud.google.com/blog/big-data/>, posted May 12, 2017. |
Wikipedia, Accumulator (computing), Version from Jul. 14, 2017, 4 pp. |
International Search Report and Written Opinion dated Oct. 1, 2018, in International Patent Application No. PCT/US18/42959. |
International Search Report and Written Opinion dated Sep. 10, 2018 in application No. PCT/US18/38618. |
Wikipedia, Booth's multiplication algorithm, Version from May 30, 2017, 5 pp. |
Genusov, Oct. 28, 1990, A new type of highly parallel 32-bit floating-point DSP vector signal processor, Modern Radar, 5:106-111. |
Jin et al., Dec. 11, 2006, Design and implementation of floating-point multiply-accumulate processing element under SMVM System, Computer Engineering and Applications, 35:107-109. |
Number | Date | Country |
---|---|---|
20220188123 A1 | Jun 2022 | US |
Number | Date | Country |
---|---|---|
62628212 | Feb 2018 | US |
62625251 | Feb 2018 | US |
62536399 | Jul 2017 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 15920150 | Mar 2018 | US |
Child | 17451989 | | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 15710433 | Sep 2017 | US |
Child | 15920150 | | US |