Computational array microprocessor system using non-consecutive data formatting

Description

BACKGROUND OF THE INVENTION

Processing for machine learning and artificial intelligence typically requires performing mathematical operations on large sets of data and often involves solving multiple convolution layers. Applications of machine learning, such as self-driving and driver-assisted automobiles, often utilize array computational operations to calculate matrix and vector results. For example, array computational operations may be used to compute convolutional layers such as when performing image processing on captured sensor data. In many situations, a large amount of data is required to perform the necessary computational operations. Traditional implementations of these operations often require loading each element of a computational operation from a unique memory address. For a convolution operation, the process typically requires calculating an individual memory address for each element. Moreover, there is a potential to incur an additional delay from the latency involved in reading each data element from memory. These performance penalties are magnified when performing wide convolution operations that involve large input matrices and many matrix elements. Traditional solutions for performing computational operations, such as relying on multiple graphical processing unit (GPU) cores, utilize parallel processing to decrease the time spent computing. However, these solutions are limited in throughput in part due to the latency incurred by reading input data from memory. Therefore, there exists a need for a microprocessor system with increased throughput that performs array computational operations without the need to perform computationally and latency expensive operations for each of the individual elements of the input data.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 is a block diagram illustrating an embodiment of a microprocessor system for performing machine learning processing.

FIG. 2 is a flow diagram illustrating an embodiment of a process for performing machine learning processing.

FIG. 3 is a flow diagram illustrating an embodiment of a process for performing machine learning processing.

FIG. 4 is a block diagram illustrating an embodiment of a computation unit of a computational array.

FIG. 5 is a block diagram illustrating an embodiment of a cache-enabled microprocessor system for performing machine learning processing.

FIG. 6 is a block diagram illustrating an embodiment of a hardware data formatter, cache, and memory components of a microprocessor system.

FIG. 7 is a flow diagram illustrating an embodiment of a process for performing machine learning processing.

FIG. 8 is a flow diagram illustrating an embodiment of a process for retrieving input operands for a computational array.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

A microprocessor system for performing high throughput array computational operations is disclosed. In some embodiments, a microprocessor system includes a computational array (e.g., matrix processor) in communication with a hardware data formatter for aligning the data to minimize data reads and the latency incurred by reading input data for processing. For example, a matrix processor allows a plurality of elements of a matrix and/or vector to be loaded and processed in parallel together. Thus, using data formatted by one or more hardware data formatters, a computational operation such as a convolution operation may be performed by the computational array.

One technique includes loading a large number of consecutive elements (e.g., consecutive in memory) of a matrix/vector together and performing operations on the consecutive elements in parallel using the matrix processor. By loading consecutive elements together, a single memory load and/or cache check for the entire group of elements can be performed, allowing the entire group of elements to be loaded using minimal processing resources. However, requiring the input elements of each processing iteration of the matrix processor to be consecutive elements could potentially require the matrix processor to load a large number of matrix/vector elements that are not utilized. For example, performing a convolution operation using a stride greater than one requires access to matrix elements that are not consecutive. If parallel input elements to the matrix processor are required to be consecutive, each processing iteration of the matrix processor is unable to fully utilize every individual input element for workloads only requiring non-consecutive elements. An alternative technique is to not require every individual input element of the matrix processor to be consecutive (e.g., every individual input element can be independently specified without regard to whether it is consecutive in memory to a previous input element). This technique incurs significant performance costs since each referenced element incurs the cost of determining its memory address and performing a cache check for the individual element, with the potential of an even more expensive load from memory in the case of a cache miss.
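The effect of stride on memory locality can be illustrated with a minimal sketch. It assumes a flat, row-major memory in which adjacent elements of a row occupy adjacent addresses; the function names `operand_addresses` and `is_consecutive` are hypothetical and serve only this illustration.

```python
def operand_addresses(base, count, stride):
    """Addresses of the input elements needed, at one kernel offset,
    for `count` adjacent output positions of a strided convolution.
    Assumes a flat row-major memory (an illustrative assumption)."""
    return [base + k * stride for k in range(count)]


def is_consecutive(addresses):
    """True when every address is exactly one past the previous address."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))


stride_one = operand_addresses(base=0, count=8, stride=1)
stride_two = operand_addresses(base=0, count=8, stride=2)

# Number of elements a single consecutive load would have to cover
# in order to include every element needed at stride two.
covering_span = stride_two[-1] - stride_two[0] + 1
```

In this sketch, a single consecutive load covering the stride-two operands spans 15 elements to obtain the 8 that are used, which is the under-utilization described above.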

In an embodiment of a disclosed microprocessor system, the group of input elements of a matrix processor is divided into a plurality of subsets, wherein elements within each subset are required to be consecutive but the different subsets are not required to be consecutive. This allows the benefit of reduced resources required to load consecutive elements within each subset while providing the flexibility of loading non-consecutive elements across the different subsets. For example, a hardware data formatter loads multiple subsets of elements where the elements of each subset are located consecutively in memory. By loading the elements of each subset together, a memory address calculation and cache check are performed only with respect to the start and end elements of each subset. In the event of a cache miss, an entire subset of elements is loaded together from memory. Rather than incurring a memory lookup penalty on a per-element basis as with the previously discussed technique, the cost is reduced to two cache checks for each subset (the start and end elements) and a single memory read for the entire subset in the event of a cache miss. Computational operations on non-consecutive elements, such as performing a convolution using a stride greater than one, are more efficient since the subsets need not be located consecutively in memory. Using the disclosed system and techniques, computational operations may be performed on non-consecutive elements with increased throughput and a high clock frequency.
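The difference in lookup cost can be made concrete with a short, illustrative count. The group size of 96 follows the 96-operand width used as an example elsewhere in this description; the subset length of 12 is a hypothetical value chosen only for the arithmetic.

```python
def checks_per_element(num_elements):
    """One address calculation and cache check for every element."""
    return num_elements


def checks_per_subset(num_subsets):
    """Two checks (start and end element) for every consecutive subset."""
    return 2 * num_subsets


GROUP_SIZE = 96      # operands per input vector (example width)
SUBSET_LENGTH = 12   # hypothetical subset length, for illustration only

num_subsets = GROUP_SIZE // SUBSET_LENGTH
element_checks = checks_per_element(GROUP_SIZE)
subset_checks = checks_per_subset(num_subsets)

# Worst case (every lookup misses): one memory read per element
# versus one memory read per subset.
worst_case_element_reads = GROUP_SIZE
worst_case_subset_reads = num_subsets
```

Under these assumed sizes, the subset approach performs 16 cache checks instead of 96 and at most 8 memory reads instead of 96.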

In various embodiments, a computational array performs matrix operations involving input vectors and includes a plurality of computation units to receive M operands and N operands from the input vectors. Using a sequence of input vectors, a computational array may perform matrix operations such as a matrix multiplication. In some embodiments, the computation units are sub-circuits that include an arithmetic logic unit, an accumulator, a shadow register, and a shifter for performing operations such as generating dot-products and various processing for convolution. Unlike conventional graphical processing unit (GPU) or central processing unit (CPU) processing cores, where each core is configured to receive its own unique processing instruction, the computation units of the computational array each perform the same computation in parallel in response to an individual instruction received by the computational array.

In various embodiments, the data input to the computational array is prepared using a hardware data formatter. For example, a hardware data formatter is utilized to load and align data elements using subsets of elements where the elements of each subset are located consecutively in memory and the subsets need not be located consecutively in memory. In various embodiments, the various subsets may each have a memory location independent from other subsets. For example, the different subsets may be located non-consecutively in memory from one another. By restricting the data elements within a subset to consecutive data, multiple consecutive data elements are processed together, which minimizes the calculations and delay incurred when preparing the data for a computational array. For example, a subset of data elements may be cached as a consecutive sequence of data elements by performing a cache check on the start and end element and, in the event of a cache miss on either element, a single data read to load the entire subset from memory into a memory cache. Once all the data elements are available, the data may be provided together to the computational array as a group of values to be processed in parallel.

In some embodiments, a microprocessor system comprises a computational array and a hardware data formatter. For example, a microprocessor system includes a matrix processor capable of performing matrix and vector operations. In various embodiments, the computational array includes a plurality of computation units. For example, the computation units may be sub-circuits of a matrix processor that include the functionality for performing one or more multiply, add, accumulate, and shift operations. As another example, computation units may be sub-circuits that include the functionality for performing a dot-product operation. In various embodiments, the computational array includes a sufficient number of computation units for performing multiple operations on the data inputs in parallel. For example, a computational array configured to receive M operands and N operands may include at least M×N computation units. In various embodiments, each of the plurality of computation units operates on a corresponding value formatted by a hardware data formatter and the values operated on by the plurality of computation units are synchronously provided together to the computational array as a group of values to be processed in parallel. For example, values corresponding to elements of a matrix are processed by one or more hardware data formatters and provided to the computational array together as a group of values to be processed in parallel.

In various embodiments, a hardware data formatter is configured to gather the group of values to be processed in parallel by the computational array. For example, a hardware data formatter retrieves the values from memory, such as static random access memory (SRAM), via a cache. In some embodiments, in the event of a cache miss, the hardware data formatter loads the values into the cache from memory and subsequently retrieves the values from the cache. In various embodiments, the values provided to the computational array correspond to computational operands. For example, a hardware data formatter may process M operands as an input vector to a computational array. In various embodiments, a second hardware data formatter may process N operands as a second input vector to the computational array. In some embodiments, each hardware data formatter processes a group of values synchronously provided together to the computational array, where each group of values includes a first subset of values located consecutively in memory and a second subset of values located consecutively in memory, yet the first subset of values is not located consecutively in the memory from the second subset of values. For example, a hardware data formatter loads a first subset of values stored consecutively in memory and a second subset of values also stored consecutively in memory but with a gap in memory between the two subsets of values. Each subset of values is loaded as consecutive values into the hardware data formatter. To prepare an entire vector of inputs for a computational array, the hardware data formatter performs loads based on the number of subsets instead of based on the total number of elements needed for an input operand to a computational array.
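The gathering behavior described above can be sketched in software as follows. This is a simplified model rather than the disclosed hardware: memory is assumed to be a flat list, each subset is described by a hypothetical `(start, length)` pair, and the function name `gather_group` is introduced only for illustration.

```python
from typing import List, Sequence, Tuple

Subset = Tuple[int, int]  # (start address, number of consecutive elements)


def gather_group(memory: Sequence[int], subsets: Sequence[Subset]) -> List[int]:
    """Build one input vector from several consecutive runs of memory.

    One slice (standing in for one load) is performed per subset,
    regardless of how many elements the subset contains.
    """
    group: List[int] = []
    for start, length in subsets:
        if start < 0 or length <= 0 or start + length > len(memory):
            raise ValueError(f"subset ({start}, {length}) is outside memory")
        group.extend(memory[start:start + length])
    return group


# Flat memory whose value at each address is 100 + address.
memory = list(range(100, 164))

# Three subsets of four elements each, with gaps in memory between them.
subsets = [(0, 4), (16, 4), (32, 4)]
group = gather_group(memory, subsets)
```

Here twelve operands are assembled with three loads, one per subset, rather than twelve per-element loads.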

FIG. 1 is a block diagram illustrating an embodiment of a microprocessor system for performing machine learning processing. In the example shown, microprocessor system 100 includes control unit 101, data input 103, data formatter 104, weight input 105, weight formatter 106, matrix processor 107, vector engine 111, and post-processing unit 115. Data input 103 and weight input 105 are input data that are fed to hardware data formatters data formatter 104 and weight formatter 106, respectively. In some embodiments, data input 103 and/or weight input 105 are retrieved from a memory (not shown), which may include a memory cache or buffer to reduce latency when reading data. In the example shown, data formatter 104 and weight formatter 106 are hardware data formatters for preparing data for matrix processor 107. In some embodiments, data formatter 104 and weight formatter 106 include a logic circuit for preparing data for matrix processor 107 and/or a memory cache or buffer for storing and processing input data. For example, data formatter 104 may prepare N operands from a two-dimensional array retrieved from data input 103 that correspond to image data. Weight formatter 106 may prepare M operands retrieved from weight input 105 that correspond to a vector of weight values. Data formatter 104 and weight formatter 106 prepare the N and M operands to be processed by matrix processor 107. In some embodiments, microprocessor system 100, including at least hardware data formatters data formatter 104 and weight formatter 106, matrix processor 107, vector engine 111, and post-processing unit 115, performs the processes described below with respect to FIGS. 2, 3, 7, and 8.

In some embodiments, matrix processor 107 is a computational array that includes a plurality of computation units. For example, a matrix processor receiving M operands and N operands from weight formatter 106 and data formatter 104, respectively, includes M×N computation units. In the figure shown, the small squares inside matrix processor 107 depict that matrix processor 107 includes a logical two-dimensional array of computation units. Computation unit 109 is one of a plurality of computation units of matrix processor 107. In some embodiments, each computation unit is configured to receive one operand from data formatter 104 and one operand from weight formatter 106. In some embodiments, the computation units are configured according to a logical two-dimensional array but the matrix processor is not necessarily fabricated with computation units laid out as a physical two-dimensional array. For example, the i-th operand of data formatter 104 and the j-th operand of weight formatter 106 are configured to be processed by the i-th×j-th computation unit of matrix processor 107.
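The distinction between the logical two-dimensional arrangement and the physical layout can be illustrated with a small index mapping. The flat numbering below is an assumption made for the sketch; the description does not specify any particular physical ordering, and the names `unit_index` and `unit_coords` are hypothetical.

```python
N_DATA = 96    # operands from the data formatter (example width)
M_WEIGHT = 96  # operands from the weight formatter (example width)


def unit_index(i, j, m=M_WEIGHT):
    """Flat identifier of the computation unit that pairs data operand i
    with weight operand j (the flat ordering is an illustrative assumption)."""
    return i * m + j


def unit_coords(index, m=M_WEIGHT):
    """Inverse mapping: logical (i, j) coordinates of a flat identifier."""
    return divmod(index, m)


total_units = N_DATA * M_WEIGHT
all_indices = {unit_index(i, j) for i in range(N_DATA) for j in range(M_WEIGHT)}
```

Every (i, j) pair maps to a distinct unit, so a 96×96 logical array accounts for 9216 computation units however they are physically placed.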

In various embodiments, the data widths of data formatter 104, weight formatter 106, matrix processor 107, vector engine 111, and post-processing unit 115 are wide and include the ability to transfer more than one operand in parallel. In some embodiments, data formatter 104 and weight formatter 106 are each 96-bytes wide. In some embodiments, data formatter 104 is 192-bytes wide and weight formatter 106 is 48-bytes or 96-bytes wide. In various embodiments, the width of data formatter 104 and weight formatter 106 is dynamically configurable. For example, data formatter 104 may be dynamically configured to 96 or 192 bytes and weight formatter 106 may be dynamically configured to 96 or 48 bytes. In some embodiments, the dynamic configuration is controlled by control unit 101. In various embodiments, a data width of 96 bytes allows 96 operands, each 1 byte in size, to be processed in parallel. For example, in an embodiment with data formatter 104 configured to be 96-bytes wide, data formatter 104 can transfer 96 operands to matrix processor 107 in parallel.

In various embodiments, data input 103 and weight input 105 are input data to corresponding hardware data formatters data formatter 104 and weight formatter 106 based on memory addresses calculated by the hardware data formatters. In some embodiments, data formatter 104 and/or weight formatter 106 retrieves via data input 103 and weight input 105, respectively, a stream of data corresponding to one or more subsets of values stored consecutively in memory. Data formatter 104 and/or weight formatter 106 may retrieve one or more subsets of values stored consecutively in memory and prepare the data as input values for matrix processor 107. In various embodiments, the one or more subsets of values are not themselves stored consecutively in memory with other subsets of values. In some embodiments, data input 103 and/or weight input 105 are retrieved from memory (not shown in FIG. 1) that contains a single read port. In some embodiments, the memory contains a limited number of read ports and the number of read ports is fewer than the data width of components data formatter 104, weight formatter 106, matrix processor 107, vector engine 111, and/or post-processing unit 115. In some embodiments, the memory includes a cache, and a hardware data formatter, such as data formatter 104 or weight formatter 106, performs a cache check to determine whether each subset of values is in the cache prior to issuing a read request to memory. In the event the subset of values is cached, the hardware data formatter (e.g., data formatter 104 or weight formatter 106) retrieves the data from the cache. In various embodiments, in the event of a cache miss, the hardware data formatter (e.g., data formatter 104 or weight formatter 106) retrieves the entire subset of values from memory and populates the cache with the retrieved values.
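The cache check and fill behavior can be modeled with a minimal software sketch. The class `SubsetCache` and its counters are hypothetical; the model caches individual addresses and assumes subsets are always requested with the same boundaries, so that a hit on both the start and end elements implies the elements between them are also cached. Actual cache organization (line size, associativity, eviction) is not modeled.

```python
class SubsetCache:
    """Toy model: check the start and end of a subset, and on a miss
    read the whole subset from memory in a single read."""

    def __init__(self, memory):
        self.memory = memory
        self.cached = {}        # address -> value
        self.cache_checks = 0
        self.memory_reads = 0

    def _hit(self, address):
        self.cache_checks += 1
        return address in self.cached

    def load_subset(self, start, length):
        end = start + length - 1
        start_hit = self._hit(start)
        end_hit = self._hit(end)
        if not (start_hit and end_hit):
            # One memory read populates the cache with the entire subset.
            self.memory_reads += 1
            for address in range(start, end + 1):
                self.cached[address] = self.memory[address]
        # Assumes consistent subset boundaries (see lead-in).
        return [self.cached[a] for a in range(start, end + 1)]


memory = list(range(64))
cache = SubsetCache(memory)
first = cache.load_subset(8, 4)    # miss: two checks, one memory read
second = cache.load_subset(8, 4)   # hit: two more checks, no memory read
```

In this model, two requests for the same four-element subset cost four cache checks and one memory read in total.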

In various embodiments, matrix processor 107 is configured to receive N bytes from data formatter 104 and M bytes from weight formatter 106 and includes at least M×N computation units. For example, matrix processor 107 may be configured to receive 96 bytes from data formatter 104 and 96 bytes from weight formatter 106 and includes at least 96×96 computation units. As another example, matrix processor 107 may be configured to receive 192 bytes from data formatter 104 and 48 bytes from weight formatter 106 and includes at least 192×48 computation units. In various embodiments, the dimensions of matrix processor 107 may be dynamically configured. For example, the default dimensions of matrix processor 107 may be configured to receive 96 bytes from data formatter 104 and 96 bytes from weight formatter 106 but the input dimensions may be dynamically configured to 192 bytes and 48 bytes, respectively. In various embodiments, the output size of each computation unit is equal to or larger than the input size. For example, in some embodiments, the input to each computation unit is two 1-byte operands, one corresponding to an operand from data formatter 104 and one from weight formatter 106, and the output of processing the two operands is a 4-byte result. As another example, matrix processor 107 may be configured to receive 96 bytes from data formatter 104 and 96 bytes from weight formatter 106 and output 96 4-byte results. In some embodiments, the output of matrix processor 107 is a vector. For example, a matrix processor configured to receive two 96-wide input vectors, where each element (or operand) of the input vector is one byte in size, can output a 96-wide vector result where each element of the vector result is 4-bytes in size.
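The reason an output wider than the inputs is useful can be shown with a short range calculation. The sketch assumes signed two's-complement operands; the description gives operand sizes in bytes but does not state signedness, so that choice is an assumption.

```python
# Assumed signed ranges for 1-byte operands and 4-byte results.
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def fits(value, low, high):
    """True when value is representable in the inclusive range [low, high]."""
    return low <= value <= high


# Largest magnitude product of two 1-byte operands.
max_product = INT8_MIN * INT8_MIN

# Accumulating 96 such products, as in a 96-term dot product.
sum_of_96 = 96 * max_product

# Number of worst-case products that can be summed before a
# 4-byte accumulator could overflow.
max_terms = INT32_MAX // max_product
```

A single product already exceeds the 1-byte range, while a 4-byte result leaves ample headroom for accumulating many products.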

In various embodiments, each computation unit of matrix processor 107 is a sub-circuit that includes an arithmetic logic unit, an accumulator, and a shadow register. In the example shown, the computation units of matrix processor 107 can perform an arithmetic operation on the M operands and N operands from weight formatter 106 and data formatter 104, respectively. In various embodiments, each computation unit is configured to perform one or more multiply, add, accumulate, and/or shift operations. In some embodiments, each computation unit is configured to perform a dot-product operation. For example, in some embodiments, a computation unit may perform multiple dot-product component operations to calculate a dot-product result. For example, the array of computation units of matrix processor 107 may be utilized to perform convolution steps required for performing inference using a machine learning model. A two-dimensional data set, such as an image, may be formatted and fed into matrix processor 107 using data formatter 104 and data input 103, one vector at a time. In parallel, a filter of weights may be applied to the two-dimensional data set by formatting the weights and feeding them as a vector into matrix processor 107 using weight formatter 106 and weight input 105. Corresponding computation units of matrix processor 107 perform a matrix processor instruction on the corresponding operands of the weight and data inputs in parallel.
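The way a sequence of input vectors yields dot-product results can be sketched as follows. This is an illustrative software model, not the disclosed circuit: each step multiplies one data vector by one weight vector at every (i, j) position and adds the product to that position's accumulator. The function name `accumulate_outer_products` is hypothetical.

```python
def accumulate_outer_products(data_vectors, weight_vectors):
    """Model of an N x M array of accumulating computation units.

    Step k feeds data_vectors[k] and weight_vectors[k]; unit (i, j)
    adds data[i] * weights[j] to its accumulator, so after all steps
    each unit holds a complete dot product.
    """
    if len(data_vectors) != len(weight_vectors):
        raise ValueError("need one weight vector per data vector")
    n = len(data_vectors[0])
    m = len(weight_vectors[0])
    acc = [[0] * m for _ in range(n)]
    for data, weights in zip(data_vectors, weight_vectors):
        for i in range(n):
            for j in range(m):
                acc[i][j] += data[i] * weights[j]
    return acc


# A is 2x3 and B is 3x2; the columns of A are fed as data vectors
# and the rows of B as weight vectors.
data_vectors = [[1, 4], [2, 5], [3, 6]]
weight_vectors = [[7, 8], [9, 10], [11, 12]]
result = accumulate_outer_products(data_vectors, weight_vectors)
```

After three steps the accumulators hold the matrix product of A and B, with each accumulator having summed three dot-product component operations.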

In some embodiments, vector engine 111 is a vector computational unit that is communicatively coupled to matrix processor 107. Vector engine 111 includes a plurality of processing elements including processing element 113. In the figure shown, the small squares inside vector engine 111 depict that vector engine 111 includes a plurality of processing elements arranged as a vector. In some embodiments, the processing elements are arranged in a vector in the same direction as data formatter 104. In some embodiments, the processing elements are arranged in a vector in the same direction as weight formatter 106. In various embodiments, the data size of the processing elements of vector engine 111 is the same size or larger than the data size of the computation units of matrix processor 107. For example, in some embodiments, computation unit 109 receives two operands each 1 byte in size and outputs a result 4 bytes in size. Processing element 113 receives the 4-byte result from computation unit 109 as an input 4 bytes in size. In various embodiments, the output of vector engine 111 is the same size as the input to vector engine 111. In some embodiments, the output of vector engine 111 is smaller in size compared to the input to vector engine 111. For example, vector engine 111 may receive up to 96 elements each 4 bytes in size and output 96 elements each 1 byte in size. As described above, in some embodiments, the communication channel from data formatter 104 and weight formatter 106 to matrix processor 107 is 96-elements wide with each element 1 byte in size and matches the output size of vector engine 111 (96-elements wide with each element 1 byte in size).

In some embodiments, the processing elements of vector engine 111, including processing element 113, each include an arithmetic logic unit (ALU) (not shown). For example, in some embodiments, the ALU of each processing element is capable of performing arithmetic operations. In some embodiments, each ALU of the processing elements is capable of performing in parallel a rectified linear unit (ReLU) function and/or scaling functions. In some embodiments, each ALU is capable of performing a non-linear function including non-linear activation functions. In various embodiments, each processing element of vector engine 111 includes one or more flip-flops for receiving input operands. In some embodiments, each processing element has access to a slice of a vector engine accumulator and/or vector registers of vector engine 111. For example, a vector engine capable of receiving 96-elements includes a 96-element wide accumulator and one or more 96-element vector registers. Each processing element has access to a one-element slice of the accumulator and/or vector registers. In some embodiments, each element is 4-bytes in size. In various embodiments, the accumulator and/or vector registers are sized to fit at least the size of an input data vector. In some embodiments, vector engine 111 includes additional vector registers sized to fit the output of vector engine 111.
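A per-element activation and scaling step of this kind can be sketched in a few lines. The use of a right shift for scaling and saturation to a signed 1-byte maximum of 127 are assumptions made for the illustration; the description states only that ReLU and scaling functions may be performed and that outputs may be narrower than inputs. The function names are hypothetical.

```python
def relu(value):
    """Rectified linear unit: negative inputs become zero."""
    return max(0, value)


def vector_engine_step(accumulators, shift, out_max=127):
    """Apply ReLU, scale by a right shift, and saturate to one byte.

    Each list element stands in for one processing element operating
    on its own slice; in hardware these would run in parallel.
    """
    return [min(relu(value) >> shift, out_max) for value in accumulators]


wide_inputs = [-500, 0, 300, 40000]   # stand-ins for 4-byte results
narrow_outputs = vector_engine_step(wide_inputs, shift=4)
```

With a shift of 4, the negative input is clamped to zero, 300 scales to 18, and 40000 saturates at the assumed 1-byte maximum.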

In some embodiments, the processing elements of vector engine 111 are configured to receive data from matrix processor 107 and each of the processing elements can process the received portion of data in parallel. As one example of a processing element, processing element 113 of vector engine 111 receives data from computation unit 109 of matrix processor 107. In various embodiments, vector engine 111 receives a single vector processor instruction and in turn each of the processing elements performs the processor instruction in parallel with the other processing elements. In some embodiments, the processor instruction includes one or more component instructions, such as a load, a store, and/or an arithmetic logic unit operation. In various embodiments, a no-op operation may be used to replace a component instruction.

In the example shown, the dotted arrows between data formatter 104 and matrix processor 107, weight formatter 106 and matrix processor 107, matrix processor 107 and vector engine 111, and vector engine 111 and post-processing unit 115 depict couplings between the respective pairs of components that are capable of sending multiple data elements such as a vector of data elements. As an example, the communication channel between matrix processor 107 and vector engine 111 may be 96×32 bits wide and support transferring 96 elements in parallel where each element is 32 bits in size. As another example, the communication channel between vector engine 111 and post-processing unit 115 may be 96×1 byte wide and support transferring 96 elements in parallel where each element is 1 byte in size. In various embodiments, data input 103 and weight input 105 are retrieved from a memory module (not shown in FIG. 1). In some embodiments, vector engine 111 is additionally coupled to a memory module (not shown in FIG. 1) and may receive input data from the memory module in addition or alternatively to input from matrix processor 107. In various embodiments, a memory module is typically a static random access memory (SRAM).

In some embodiments, one or more computation units of matrix processor 107 may be grouped together into a lane such that matrix processor 107 has multiple lanes. In various embodiments, the lanes of matrix processor 107 may be aligned with either data formatter 104 or weight formatter 106. For example, a lane aligned with weight formatter 106 includes a set of computation units that are configured to receive as input every operand of weight formatter 106. Similarly, a lane aligned with data formatter 104 includes a set of computation units that are configured to receive as input every operand of data formatter 104. In the example shown in FIG. 1, the lanes are aligned along weight formatter 106 in a vertical column and each lane feeds to a corresponding lane of vector engine 111. In some embodiments, each lane is a vertical column of sub-circuits that include multiply, add and/or accumulate, and shift functionality. In some embodiments, matrix processor 107 includes a matrix of tiles and each tile is a matrix of computation units. For example, a 96×96 matrix processor may include a matrix of 6×6 tiles, where each tile includes 16×16 computation units. In some embodiments, a vertical lane is a single column of tiles. In some embodiments, a horizontal lane is a single row of tiles. In various embodiments, the dimensions of the lane may be configured dynamically and may be utilized for performing alignment operations on the input to matrix processor 107, vector engine 111, and/or post-processing unit 115. In some embodiments, the dynamic configuration is performed by or using control unit 101 and/or using processor instructions and/or control signals controlled by control unit 101.
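The tiling arithmetic in the 96×96 example can be checked with a short sketch that locates a computation unit within its tile. The function name `locate_unit` and the row/column orientation are hypothetical choices made for the illustration.

```python
TILE_SIZE = 16       # computation units per tile side (from the example)
TILES_PER_SIDE = 6   # tiles per side of the matrix processor (from the example)


def locate_unit(row, col, tile_size=TILE_SIZE, tiles_per_side=TILES_PER_SIDE):
    """Return ((tile_row, tile_col), (local_row, local_col)) for a unit."""
    side = tile_size * tiles_per_side
    if not (0 <= row < side and 0 <= col < side):
        raise IndexError(f"unit ({row}, {col}) is outside the {side}x{side} array")
    tile_row, local_row = divmod(row, tile_size)
    tile_col, local_col = divmod(col, tile_size)
    return (tile_row, tile_col), (local_row, local_col)


array_side = TILE_SIZE * TILES_PER_SIDE
first_unit = locate_unit(0, 0)
last_unit = locate_unit(95, 95)
example_unit = locate_unit(17, 40)
```

Six tiles of sixteen units per side give the 96-unit side of the example, and a vertical lane made of a single column of tiles would therefore span sixteen columns of computation units.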

In some embodiments, control unit 101 synchronizes the processing performed by data formatter 104, weight formatter 106, matrix processor 107, vector engine 111, and post-processing unit 115. For example, control unit 101 may send processor specific control signals and/or instructions to each of data formatter 104, weight formatter 106, matrix processor 107, vector engine 111, and post-processing unit 115. In some embodiments, a control signal is utilized instead of a processor instruction. Control unit 101 may send matrix processor instructions to matrix processor 107. A matrix processor instruction may be a computational array instruction that instructs a computational array to perform an arithmetic operation, such as a dot-product or dot-product component, using specified operands from data input 103 and/or weight input 105 that are formatted by data formatter 104 and/or weight formatter 106, respectively. Control unit 101 may send vector processor instructions to vector engine 111. For example, a vector processor instruction may include a single processor instruction with a plurality of component instructions to be executed together by the vector computational unit. Control unit 101 may send post-processing instructions to post-processing unit 115. In various embodiments, control unit 101 synchronizes data that is fed to matrix processor 107 from data formatter 104 and weight formatter 106, to vector engine 111 from matrix processor 107, and to post-processing unit 115 from vector engine 111. In some embodiments, control unit 101 synchronizes the data between different components of microprocessor system 100 including between data formatter 104, weight formatter 106, matrix processor 107, vector engine 111, and/or post-processing unit 115 by utilizing processor specific memory, queue, and/or dequeue operations and/or control signals. In some embodiments, data and instruction synchronization is performed by control unit 101 that includes one or more sequencers to synchronize processing between data formatter 104, weight formatter 106, matrix processor 107, vector engine 111, and/or post-processing unit 115.

In some embodiments, data input 103, data formatter 104, weight input 105, weight formatter 106, matrix processor 107, and vector engine 111 are utilized for processing convolution layers. For example, matrix processor 107 may be used to perform calculations associated with one or more convolution layers of a convolution neural network. Data formatter 104 and weight formatter 106 may be utilized to prepare matrix and/or vector data in a format for processing by matrix processor 107. Data input 103 may include image data such as one or more image channels captured by sensors (not shown), where sensors include, as an example, cameras mounted to a vehicle. Weight input 105 may include weights determined by training a machine learning model for autonomous driving. In some embodiments, vector engine 111 is utilized for performing non-linear functions such as an activation function on the output of matrix processor 107. For example, matrix processor 107 may be used to calculate a dot-product and vector engine 111 may be used to perform an activation function such as a rectified linear unit (ReLU) or sigmoid function. In some embodiments, post-processing unit 115 is utilized for performing pooling operations. In some embodiments, post-processing unit 115 is utilized for formatting and storing the processed data to memory and may be utilized for synchronizing memory writing latency.

FIG. 2 is a flow diagram illustrating an embodiment of a process for performing machine learning processing. In some embodiments, the process of FIG. 2 is utilized to implement a convolutional neural network using sensor input data such as images and learned weights. In various embodiments, the process of FIG. 2 may be repeated for multiple convolution layers by using the output of the process of FIG. 2 as the input for the next convolution layer. In some embodiments, the processing is performed in the context of self-driving or driver-assisted vehicles to identify objects in a scene such as street signs, vehicles, pedestrians, and lane markers, among other objects. Other sensor data, including non-image sensor data, such as ultrasonic, radar, and LiDAR, may also be utilized as input data. In various embodiments, the process of FIG. 2 utilizes a microprocessor system such as microprocessor system 100 of FIG. 1.

At 201, input channels are received as input data to the microprocessor system. For example, vision data is captured using sensors and may include one or more channels corresponding to different color channels for the colors red, green, and blue. In various embodiments, multiple channels may be utilized as the different channels may contain different forms of information. As another example, non-sensor data may be utilized as input data. In various embodiments, the input channels may be loaded from memory via a cache using subsets of consecutively stored data in memory. In some embodiments, the input channels may be retrieved and/or formatted for processing using a hardware data formatter such as data formatter 104 of FIG. 1.

At 203, one or more filters are received for processing the input channels. For example, a filter in the form of a matrix contains learned weights and is used to identify activations in the channels. In some embodiments, the filter is a square matrix kernel smaller than the input channel. In various embodiments, filters may be utilized to identify particular shapes, edges, lines, and other features and/or activations in the input data. In some embodiments, the filters and associated weights that make up the filter are created by training a machine learning model using a training corpus of data similar to the input data. In various embodiments, the received filters may be streamed from memory. In some embodiments, the filters may be retrieved and/or formatted for processing using a hardware data formatter such as weight formatter 106 of FIG. 1.

At 205, one or more feature layers are determined using the received input channels and filters. In various embodiments, the feature layers are determined by performing one or more convolution operations using a computational array such as matrix processor 107 of FIG. 1. In some embodiments, the one or more output feature layers are determined by repeatedly performing a dot-product between different small regions of an input channel and the weights of the filter. In various embodiments, each filter is used to create a single feature layer by performing a two-dimensional convolution using the filter. In some embodiments, the input data is padded to adjust for the size of the output feature layer. In various embodiments, a stride parameter is utilized and may impact the size of the output feature layer. In various embodiments, a bias parameter may be utilized. For example, a bias term may be added to the resulting values of convolution for each element of a feature layer.
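The determination of a feature layer at 205 can be illustrated with a minimal software sketch. The function name conv2d and its parameters are hypothetical and are not part of the described hardware; the sketch assumes a square filter, zero padding, and a single scalar bias, and it computes in software what the computational array would compute in parallel.

```python
def conv2d(channel, kernel, stride=1, padding=0, bias=0):
    """Two-dimensional convolution of one input channel with one square filter."""
    k = len(kernel)  # filter is assumed to be k x k
    width = len(channel[0]) + 2 * padding
    # Zero-pad the input channel on all four sides.
    padded = [[0] * width for _ in range(padding)]
    padded += [[0] * padding + list(row) + [0] * padding for row in channel]
    padded += [[0] * width for _ in range(padding)]
    # Padding and stride together determine the size of the feature layer.
    out_h = (len(padded) - k) // stride + 1
    out_w = (width - k) // stride + 1
    feature = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Dot-product between one small region and the filter weights.
            acc = sum(padded[r + a][c + b] * kernel[a][b]
                      for a in range(k) for b in range(k))
            row.append(acc + bias)  # bias term added to each element
        feature.append(row)
    return feature


channel = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
feature_layer = conv2d(channel, kernel)
```

In this sketch a 3×3 channel and a 2×2 filter with no padding and a stride of one yield a 2×2 feature layer; adding one ring of padding or increasing the stride changes the output size as described above.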

At 207, an activation function is performed on one or more feature layers. For example, an element-wise activation function, such as a rectified linear unit (ReLU) function, is performed using a vector processor such as vector engine 111 of FIG. 1 to create an activation layer. In various embodiments, different non-linear activation functions, such as ReLU and sigmoid, may be utilized to create an activation layer for each feature layer.
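As an illustration of the element-wise activation performed at 207, the following sketch applies ReLU or sigmoid to every element of a feature layer. The helper names are hypothetical, and the sketch models only the arithmetic, not the vector engine itself.

```python
import math


def relu(x):
    """Rectified linear unit: negative values become zero."""
    return x if x > 0 else 0


def sigmoid(x):
    """Logistic sigmoid, mapping any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def activate(feature_layer, fn):
    """Apply an activation function to each element independently."""
    return [[fn(v) for v in row] for row in feature_layer]


feature_layer = [[-3, 0],
                 [2, 5]]
activation_layer = activate(feature_layer, relu)
```

Because each output element depends only on the corresponding input element, the operation lends itself to processing many elements in parallel.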

At 209, pooling is performed on the activation layers created at 207. For example, a pooling layer is generated by a post-processing unit such as post-processing unit 115 of FIG. 1 using the activation layer generated at 207. In some embodiments, the pooling layer is generated to downsample the activation layer. In various embodiments, different filter sizes may be utilized to create a pooling layer based on the desired output size. In various embodiments, different pooling techniques, such as maxpooling, are utilized. In various embodiments, pooling parameters include kernel size, stride, and/or spatial extent, among others. In some embodiments, the pooling layer is an optional layer and may be implemented when appropriate.
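One possible form of the pooling performed at 209 is maxpooling, sketched below. The function name max_pool and the default kernel size and stride of two are assumptions chosen for illustration only.

```python
def max_pool(layer, size=2, stride=2):
    """Downsample a layer by keeping the maximum of each size x size window."""
    out_h = (len(layer) - size) // stride + 1
    out_w = (len(layer[0]) - size) // stride + 1
    return [[max(layer[i * stride + a][j * stride + b]
                 for a in range(size) for b in range(size))
             for j in range(out_w)]
            for i in range(out_h)]


activation_layer = [[1, 3, 2, 0],
                    [4, 6, 5, 1],
                    [0, 2, 9, 7],
                    [1, 1, 3, 8]]
pooled = max_pool(activation_layer)
```

With a 2×2 kernel and a stride of two, the 4×4 activation layer in this sketch is reduced to a 2×2 pooling layer; a stride of one would instead yield a 3×3 result.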

In various embodiments, the process of FIG. 2 is utilized for each layer of a convolution neural network (CNN). Multiple passes of the process of FIG. 2 may be utilized to implement a multi-layer CNN. For example, the output of 209 may be utilized as input channels at 201 to calculate output layers of an intermediate layer. In some embodiments, a CNN is connected to one or more additional non-CNN layers for classification, object detection, object segmentation, and/or other appropriate goals. In some embodiments, the additional non-CNN layers are implemented using a microprocessor system such as microprocessor system 100 of FIG. 1.

FIG. 3 is a flow diagram illustrating an embodiment of a process for performing machine learning processing. In some embodiments, the process of FIG. 3 is utilized to perform inference on sensor data by performing computational operations, such as convolution operations, and element-wise activation functions. In some embodiments, the process of FIG. 3 is performed using a microprocessor system such as microprocessor system 100 of FIG. 1. In various embodiments, steps 301 and 303 are performed at 201 of FIG. 2 using data input 103 and data formatter 104 of FIG. 1, steps 305 and 307 are performed at 203 of FIG. 2 using weight input 105 and weight formatter 106 of FIG. 1, step 309 is performed at 205 of FIG. 2 using matrix processor 107 of FIG. 1, step 311 is performed at 207 of FIG. 2 using vector engine 111 of FIG. 1, and step 313 is performed at 209 of FIG. 2 using post-processing unit 115 of FIG. 1.

At 301, data input is received. For example, data input corresponding to sensor data is received by a hardware data formatter for formatting. In some embodiments, data input is data input 103 of FIG. 1 and is received by data formatter 104 of FIG. 1. In various embodiments, a hardware data formatter requests the data input from memory as read requests based on subsets of values stored consecutively in memory. For example, a hardware data formatter may first check a cache of the memory for the requested data values and in the event of a cache miss, the read request will retrieve the data values from memory. In various embodiments, checking for a cache hit or miss requires calculating the start address and end address of the subset of requested data values. In some embodiments, a data request populates the cache with the requested values along with additional data to fill a cache line. In some embodiments, the data is streamed in from memory and may bypass the cache.
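The cache check described at 301 can be sketched as follows. The class name LineCache, the 256-byte line size, the four-line capacity, and the oldest-first replacement policy are all assumptions made for illustration; the sketch shows only how a start address and an end address identify the cache lines a request touches.

```python
from collections import deque

LINE_BYTES = 256  # assumed cache line size
NUM_LINES = 4     # assumed number of cache lines


class LineCache:
    """Tracks which line-sized blocks of memory are currently cached."""

    def __init__(self):
        # Assumed policy: when full, the oldest line is dropped first.
        self.lines = deque(maxlen=NUM_LINES)

    def read(self, start_addr, num_bytes):
        """Return 'hit' if every requested byte is cached, else fill and return 'miss'."""
        end_addr = start_addr + num_bytes - 1
        # The start and end addresses determine which lines the request spans.
        needed = range(start_addr // LINE_BYTES, end_addr // LINE_BYTES + 1)
        missing = [line for line in needed if line not in self.lines]
        for line in missing:
            # A miss populates the whole line, not only the requested values.
            self.lines.append(line)
        return "miss" if missing else "hit"


cache = LineCache()
first = cache.read(0, 64)    # nothing cached yet
second = cache.read(64, 64)  # falls in the line filled by the first read
```

In this sketch the second read benefits from the additional data brought in to fill the cache line on the first read, which is the effect that consecutive storage of the requested subsets is intended to exploit.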

At 303, data input is formatted using a hardware data formatter. For example, a hardware data formatter such as data formatter 104 of FIG. 1 formats the received data input for processing by a computational array such as matrix processor 107 of FIG. 1. The hardware data formatter may format the received data input into an input vector of operands for a computational array. In some embodiments, the hardware data formatter further performs the requesting of the data received at 301. In some embodiments, the hardware data formatter will format at least one of the operands of a convolution operation. For example, each two-dimensional region corresponding to an input channel of vision data for a convolution operation involving a filter will be formatted by the hardware data formatter into a vector operand for the computational array. The vectors corresponding to the regions are grouped together by their n-th elements and fed to the computational array at a rate of at most one element from each vector per clock cycle. In some embodiments, the hardware data formatter will select the appropriate elements for performing convolution of a filter with the data input by formatting each region of the data input into a vector and feeding each element of the appropriate vector to a corresponding computation unit of a computational array. In some embodiments, a bias parameter is introduced using the hardware data formatter.
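The formatting performed at 303 can be sketched in software as two steps: flattening each two-dimensional region into a vector, and then grouping the vectors by their n-th elements so that each clock cycle delivers one element per vector. The function names and the row-major flattening order are assumptions for illustration.

```python
def format_regions(channel, k, stride=1):
    """Flatten every k x k region of the channel into a row-major vector."""
    vectors = []
    for r in range(0, len(channel) - k + 1, stride):
        for c in range(0, len(channel[0]) - k + 1, stride):
            vectors.append([channel[r + a][c + b]
                            for a in range(k) for b in range(k)])
    return vectors


def feed_schedule(vectors):
    """Group vectors by n-th element: entry n is what the array receives in cycle n."""
    return [list(group) for group in zip(*vectors)]


channel = [[1, 2, 3],
           [4, 5, 6]]
vectors = format_regions(channel, k=2)
schedule = feed_schedule(vectors)
```

Here the 2×3 channel contains two 2×2 regions, giving the vectors [1, 2, 4, 5] and [2, 3, 5, 6]; the schedule then supplies the pair [1, 2] in the first cycle, [2, 3] in the second, and so on.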

At 305, weight input is received. For example, weight input corresponding to machine learning weights of a filter is received by a hardware data formatter for formatting. In some embodiments, weight input is weight input 105 of FIG. 1 and is received by weight formatter 106 of FIG. 1. In various embodiments, a hardware data formatter requests the weight input from memory as read requests based on subsets of values stored consecutively in memory. For example, a hardware data formatter may first check a cache of the memory for the requested weight values and in the event of a cache miss, the read request will retrieve the weight values from memory. In various embodiments, checking for a cache hit or miss requires calculating the start address and end address of the subset of requested weight values. In some embodiments, a weight data request populates the cache with the requested weight values. In some embodiments, the data for weights is streamed in from memory and may bypass the cache. In some embodiments, the weight input includes a bias parameter.

At 307, weight input is formatted using a hardware data formatter. For example, a hardware data formatter such as weight formatter 106 of FIG. 1 formats the received weight input for processing by a computational array such as matrix processor 107 of FIG. 1. The hardware data formatter may format the received weight input into an input vector of operands for a computational array. In some embodiments, the hardware data formatter further performs the requesting of the data received at 305. In some embodiments, the hardware data formatter will format at least one of the operands of a convolution operation. For example, a filter for a convolution operation will be formatted by the hardware data formatter into a vector operand for the computational array. In some embodiments, the hardware data formatter will select the appropriate elements for performing convolution of a filter with the data input by formatting the filter into a vector and feeding each element of the vector to a corresponding computation unit of a computational array. In some embodiments, a bias parameter is introduced using the hardware data formatter.

At 309, matrix processing is performed. For example, the operands formatted at 303 and 307 are received by each of the computation units of a computational array for processing. In some embodiments, the matrix processing is performed using a matrix processor such as matrix processor 107 of FIG. 1. In some embodiments, a dot-product is performed at each appropriate computation unit of the computational array using respective vectors received from hardware data formatters such as data formatter 104 and weight formatter 106 of FIG. 1. In some embodiments, only a subset of the matrix processor's computation units is utilized. For example, a computational array with 96×96 computation units may utilize only 64×64 computation units in the event the data input is 64 vectors and the weight input is 64 vectors. In various embodiments, the number of computation units utilized is based on the size of the data input and/or weight input. In some embodiments, the computation units each perform one or more of multiply, add, accumulate, and/or shift operations; in some embodiments, these operations are performed each clock cycle. In some embodiments, a bias parameter is received and added to the calculated dot-product as part of the matrix processing performed.
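The matrix processing at 309 can be modeled, under simplifying assumptions, as a grid of accumulators in which the unit at row m and column n accumulates the dot-product of weight vector m and data vector n, one element pair per clock cycle. The function name matrix_process and the 96×96 array size used to compute utilization are illustrative only.

```python
ARRAY_ROWS, ARRAY_COLS = 96, 96  # example array size


def matrix_process(data_vectors, weight_vectors):
    """Return acc[m][n] = dot(weight_vectors[m], data_vectors[n])."""
    length = len(data_vectors[0])
    acc = [[0] * len(data_vectors) for _ in weight_vectors]
    for t in range(length):  # one clock cycle per vector element
        for m, w in enumerate(weight_vectors):
            for n, d in enumerate(data_vectors):
                # In hardware, all (m, n) units would update in the same cycle.
                acc[m][n] += w[t] * d[t]
    return acc


data_vectors = [[1, 2, 4, 5], [2, 3, 5, 6]]
weight_vectors = [[1, 0, 0, 1], [1, 1, 1, 1], [0, 1, 0, 0]]
results = matrix_process(data_vectors, weight_vectors)

units_used = len(weight_vectors) * len(data_vectors)
units_total = ARRAY_ROWS * ARRAY_COLS
```

In this sketch three weight vectors and two data vectors occupy only six of the 9216 modeled computation units, which corresponds to utilizing a subset of the array when the inputs are smaller than the array dimensions.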

At 311, vector processing is performed. For example, an element-wise activation function may be performed on the result of the matrix processing performed at 309. In some embodiments, an activation function is a non-linear activation function such as a rectified linear unit (ReLU), sigmoid, or other appropriate function. In some embodiments, the vector processor is utilized to implement scaling, normalization, or other appropriate techniques. For example, a bias parameter may be introduced to the result of a dot-product using the vector processor. In some embodiments, the result of 311 is a series of activation maps or activation layers. In some embodiments, vector processing is performed using a vector engine such as vector engine 111 of FIG. 1.

At 313, post-processing is performed. For example, a pooling layer may be implemented using a post-processing processor such as post-processing unit 115 of FIG. 1. In various embodiments, different post-processing techniques, including different pooling techniques such as maxpooling, may be implemented during the post-processing stage of 313.

In various embodiments, the process of FIG. 3 is utilized for each layer of a convolution neural network (CNN). Multiple passes of the process of FIG. 3 may be utilized to implement a multi-layer CNN. For example, the output of 313 may be utilized as data input for step 301. In some embodiments, the process of FIG. 3 must be repeated one or more times to complete a single layer. For example, in the scenario where the sensor data is larger in dimension than the number of computation units of the computational array, the sensor data may be sliced into smaller regions that fit the computational array and the process of FIG. 3 is repeated on each of the sliced regions.

FIG. 4 is a block diagram illustrating an embodiment of a computation unit of a computational array. In the example shown, computation unit 400 includes input values weight 402, data 404, and ResultIn 406; signals ClearAcc signal 408, Clock signal 410, ResultEnable signal 412, ResultCapture signal 414, and ShiftEn signal 416; components accumulator 424, multiplexer 426, shadow register 428, multiplier 430, and adder 432; logic 434, 436, and 438; and output value ResultOut 450. In some embodiments, logic 434, 436, and 438 are AND gates. In some embodiments, additional signals are included as appropriate. In various embodiments, the computation unit of FIG. 4 is repeated for each of the plurality of computation units, such as computation unit 109, of a computation array such as matrix processor 107 of FIG. 1. Computation unit 400 may be utilized to implement computational operations in parallel. In various embodiments, each computation unit of a computational array performs computations in parallel with the other computation units. In various embodiments, computation unit 400 is a sub-circuit of a matrix processor that includes the functionality for performing one or more multiply, add, accumulate, and/or shift operations. For example, computation unit 400 may be a sub-circuit that includes the functionality for performing a dot-product operation.

In some embodiments, Clock signal 410 is a clock signal received by computation unit 400. In various embodiments, each computation unit of the computational array receives the same clock signal and the clock signal is utilized to synchronize the processing of each computation unit with the other computation units.

In the example shown, multiplier 430 receives and performs a multiplication operation on the input values data 404 and weight 402. The output of multiplier 430 is fed to adder 432. Adder 432 receives and performs an addition on the output of multiplier 430 and the output of logic 434. The output of adder 432 is fed to accumulator 424. In some embodiments, input values data 404 and weight 402 are lines that cross computation units and feed the corresponding data and/or weight to neighboring computation units. For example, in some embodiments, data 404 is fed to all computation units in the same column and weight 402 is fed to all computation units in the same row. In various embodiments, data 404 and weight 402 correspond to input elements fed to computation unit 400 from a data hardware data formatter and a weight hardware data formatter, respectively. In some embodiments, the data hardware data formatter and the weight hardware data formatter are data formatter 104 and weight formatter 106 of FIG. 1, respectively.

In some embodiments, ClearAcc signal 408 clears the contents of accumulator 424. As an example, accumulation operations can be reset by clearing accumulator 424 and used to accumulate the result of multiplier 430. In some embodiments, ClearAcc signal 408 is used to clear accumulator 424 for performing a new dot-product operation. For example, element-wise multiplications are performed by multiplier 430 and the partial dot-product results are added using adder 432 and accumulator 424.

In various embodiments, accumulator 424 is an accumulator capable of accumulating the result of adder 432 and indirectly the result of multiplier 430. For example, in some embodiments, accumulator 424 is configured to accumulate the result of multiplier 430 with the contents of accumulator 424 based on the status of ClearAcc signal 408. As another example, based on the status of ClearAcc signal 408, the current result stored in accumulator 424 may be ignored by adder 432. In the example shown, accumulator 424 is a 32-bit wide accumulator. In various embodiments, accumulator 424 may be sized differently, e.g., 8-bits, 16-bits, 64-bits, etc., as appropriate. In various embodiments, each accumulator of the plurality of computation units of a computational array is the same size. In various embodiments, accumulator 424 may accumulate and save data, accumulate and clear data, or just clear data. In some embodiments, accumulator 424 may be implemented as an accumulation register. In some embodiments, accumulator 424 may include a set of arithmetic logic units (ALUs) that include registers.

In some embodiments, ResultEnable signal 412 is activated in response to a determination that data 404 is valid. For example, ResultEnable signal 412 may be enabled to enable processing by a computation unit such as processing by multiplier 430 and adder 432 into accumulator 424.

In some embodiments, ResultCapture signal 414 is utilized to determine the functionality of multiplexer 426. Multiplexer 426 receives as input ResultIn 406, output of accumulator 424, and ResultCapture signal 414. In various embodiments, ResultCapture signal 414 is used to enable either ResultIn 406 or the output of accumulator 424 to pass through as the output of multiplexer 426. In some embodiments, multiplexer 426 is implemented as an output register. In some embodiments, ResultIn 406 is connected to a computation unit in the same column as computation unit 400. For example, the output of a neighboring computation unit is fed in as an input value ResultIn 406 to computation unit 400. In some embodiments, the input received from the neighboring computation unit is that neighboring computation unit's corresponding ResultOut value.

In some embodiments, shadow register 428 receives as input the output of multiplexer 426. In some embodiments, shadow register 428 is configured to receive the output of accumulator 424 via multiplexer 426 depending on the value of ResultCapture signal 414. In the example shown, the output of shadow register 428 is output value ResultOut 450. In various embodiments, once a result is inserted into shadow register 428, accumulator 424 may be used to commence new calculations. For example, once the final dot-product result is stored in shadow register 428, accumulator 424 may be cleared and used to accumulate and store the partial result and eventually the final result of a new dot-product operation on new weight and data input values. In the example shown, shadow register 428 receives a signal ShiftEn signal 416. In various embodiments, ShiftEn signal 416 is used to enable or disable the storing of values in the shadow register 428. In some embodiments, ShiftEn signal 416 is used to shift the value stored in shadow register 428 to output value ResultOut 450. For example, when ShiftEn signal 416 is enabled, the value stored in shadow register 428 is shifted out of shadow register 428 as output value ResultOut 450. In some embodiments, ResultOut 450 is connected to a neighboring computation unit's input value ResultIn. In some embodiments, the last cell of a column of computation units is connected to the output of the computational array. In various embodiments, the output of the computational array feeds into a vector engine such as vector engine 111 of FIG. 1 for vector processing. For example, the output ResultOut 450 of a computation cell such as computation cell 109 of FIG. 1 may be fed into a processing element of a vector engine such as processing element 113 of vector engine 111 of FIG. 1.

In the example shown, shadow register 428 is 32-bits wide. In various embodiments, shadow register 428 may be sized differently, e.g., 8-bits, 16-bits, 64-bits, etc., as appropriate. In various embodiments, each shadow register of the plurality of computation units of a computational array is the same size. In various embodiments, shadow register 428 is the same size as accumulator 424. In various embodiments, the size of multiplexer 426 is based on the size of accumulator 424 and/or shadow register 428 (e.g., the same size or larger).

In some embodiments, logic 434, 436, and 438 receive signals, such as control signals, to enable and/or configure the functionality of computation unit 400. In various embodiments, logic 434, 436, and 438 are implemented using AND gates and/or functionality corresponding to an AND gate. For example, as described above, logic 434 receives ClearAcc signal 408 and an input value corresponding to the value stored in accumulator 424. Based on ClearAcc signal 408, the output of logic 434 is determined and fed to adder 432. As another example, logic 436 receives ResultEnable signal 412 and Clock signal 410. Based on ResultEnable signal 412, the output of logic 436 is determined and fed to accumulator 424. As another example, logic 438 receives ShiftEn signal 416 and Clock signal 410. Based on ShiftEn signal 416, the output of logic 438 is determined and fed to shadow register 428.
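The interaction of the signals and components of computation unit 400 can be sketched as a cycle-level software model. The class below is a hypothetical illustration and not a description of the actual circuit; in particular, the signal polarity (a true clear_acc value forcing the accumulator feedback to zero) and the modeling of the gated clocks as simple conditionals are assumptions.

```python
class ComputationUnit:
    """Cycle-level model of one computation unit (illustrative only)."""

    def __init__(self):
        self.accumulator = 0  # stands in for accumulator 424
        self.shadow = 0       # stands in for shadow register 428 (ResultOut)

    def tick(self, weight=0, data=0, result_in=0, clear_acc=False,
             result_enable=False, result_capture=False, shift_en=False):
        """Advance one clock cycle and return the value at ResultOut."""
        # Multiplexer 426: ResultCapture selects the accumulator output,
        # otherwise ResultIn from the neighboring unit passes through.
        mux_out = self.accumulator if result_capture else result_in
        # Logic 434: gate the accumulator feedback into adder 432.
        # Assumed polarity: clear_acc=True forces the feedback to zero.
        feedback = 0 if clear_acc else self.accumulator
        total = weight * data + feedback  # multiplier 430, then adder 432
        # Logic 436: the accumulator is clocked only while ResultEnable is set.
        if result_enable:
            self.accumulator = total
        # Logic 438: the shadow register is clocked only while ShiftEn is set.
        if shift_en:
            self.shadow = mux_out
        return self.shadow


unit = ComputationUnit()
data = [1, 2, 3, 4]
weights = [5, 6, 7, 8]
for i, (d, w) in enumerate(zip(data, weights)):
    unit.tick(weight=w, data=d, clear_acc=(i == 0), result_enable=True)

# Capture the finished dot-product while the accumulator starts a new one.
result_out = unit.tick(weight=2, data=10, clear_acc=True, result_enable=True,
                       result_capture=True, shift_en=True)
```

In the final cycle of this sketch the shadow register takes the completed dot-product while the accumulator is simultaneously loaded with the first product of a new operation, illustrating why a separate shadow register allows new calculations to commence before earlier results are shifted out.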

In various embodiments, computation units may perform a multiplication, an addition operation, and a shift operation at the same time, i.e., within a single cycle, thereby doubling the total number of operations that occur each cycle. In some embodiments, results are moved from multiplexer 426 to shadow register 428 in a single clock cycle, i.e., without the need of intermediate execute and save operations. In various embodiments, the clock cycle is based on the signal received at Clock signal 410.

In various embodiments, input values weight 402 and data 404 are 8-bit values. In some embodiments, weight 402 is a signed value and data 404 is unsigned. In various embodiments, weight 402 and data 404 may be signed or unsigned, as appropriate. In some embodiments, ResultIn 406 and ResultOut 450 are 32-bit values. In various embodiments ResultIn 406 and ResultOut 450 are implemented using a larger number of bits than input operands weight 402 and data 404. By utilizing a large number of bits, the results of multiplying multiple pairs of weight 402 and data 404, for example, to calculate a dot-product result, may be accumulated without overflowing the scalar result.
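The headroom provided by a wider result can be estimated with simple arithmetic. The sketch below assumes a signed 8-bit weight, an unsigned 8-bit data value, and a signed two's-complement 32-bit accumulator; the two's-complement representation is an assumption and other number formats would give different bounds.

```python
WEIGHT_MIN, WEIGHT_MAX = -128, 127      # signed 8-bit weight
DATA_MAX = 255                          # unsigned 8-bit data
ACC_MIN, ACC_MAX = -2**31, 2**31 - 1    # assumed signed 32-bit accumulator

largest_negative = WEIGHT_MIN * DATA_MAX  # most negative single product
largest_positive = WEIGHT_MAX * DATA_MAX  # most positive single product

# Number of worst-case products that can be summed without overflow.
safe_terms = min(ACC_MIN // largest_negative, ACC_MAX // largest_positive)

# For contrast, the same bound for a signed 16-bit accumulator.
safe_terms_16 = min((-2**15) // largest_negative,
                    (2**15 - 1) // largest_positive)
```

Under these assumptions a 32-bit accumulator can absorb 65,793 worst-case products, whereas a 16-bit accumulator is guaranteed to hold only one, which illustrates why the result path is implemented with more bits than the input operands.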

In some embodiments, computation unit 400 generates an intermediate and/or final computation result in accumulator 424. The final computation result is then stored in shadow register 428 via multiplexer 426. In some embodiments, multiplexer 426 functions as an output register and stores the output of accumulator 424. In various embodiments, the final computation result is the result of a convolution operation. For example, the final result at ResultOut 450 is the result of convolution between a filter received by computation unit 400 as input values using weight 402 and a two-dimensional region of sensor data received by computation unit 400 as input values using data 404.

As an example, a convolution operation may be performed using computation unit 400 on a 2×2 data input matrix [d0 d1; d2 d3] corresponding to a region of sensor data and a filter corresponding to a 2×2 matrix of weights [w0 w1; w2 w3]. The 2×2 data input matrix has a first row [d0 d1] and a second row [d2 d3]. The filter matrix has a first row [w0 w1] and a second row [w2 w3]. In various embodiments, computation unit 400 receives the data matrix via data 404 as a one-dimensional input vector [d0 d1 d2 d3] one element per clock cycle and weight matrix via weight 402 as a one-dimensional input vector [w0 w1 w2 w3] one element per clock cycle. Using computation unit 400, the dot product of the two input vectors is performed to produce a scalar result at ResultOut 450. For example, multiplier 430 is used to multiply each corresponding element of the input weight and data vectors and the results are stored and added to previous results in accumulator 424. For example, the result of element d0 multiplied by element w0 (e.g., d0*w0) is first stored in cleared accumulator 424. Next, element d1 is multiplied by element w1 and added using adder 432 to the previous result stored in accumulator 424 (e.g., d0*w0) to compute the equivalent of d0*w0+d1*w1. Processing continues to the third pair of elements d2 and w2 to compute the equivalent of d0*w0+d1*w1+d2*w2 at accumulator 424. The last pair of elements is multiplied and the final result of the dot product is now stored in accumulator 424 (e.g., d0*w0+d1*w1+d2*w2+d3*w3). The dot-product result is then copied to shadow register 428. Once stored in shadow register 428, a new dot-product operation may be initiated, for example, using a different region of sensor data. Based on ShiftEn signal 416, the dot-product result stored in shadow register 428 is shifted out of shadow register 428 to ResultOut 450. In various embodiments, the weight and data matrices may be different dimensions than the example above. 
For example, larger dimensions may be used.
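The 2×2 example above can be traced with concrete numbers. In the following sketch the values 1 through 4 stand in for d0 through d3 and the values 5 through 8 stand in for w0 through w3; these values and the variable names are chosen for illustration only.

```python
def flatten(matrix):
    """Convert a two-dimensional matrix to a one-dimensional row-major vector."""
    return [v for row in matrix for v in row]


data_matrix = [[1, 2], [3, 4]]     # stands in for [d0 d1; d2 d3]
weight_matrix = [[5, 6], [7, 8]]   # stands in for [w0 w1; w2 w3]

accumulator = 0                    # cleared before the first pair
partial_sums = []
for d, w in zip(flatten(data_matrix), flatten(weight_matrix)):
    accumulator += d * w           # one multiply and one add per clock cycle
    partial_sums.append(accumulator)

shadow_register = accumulator      # final dot-product copied for shifting out
```

The partial sums after each of the four cycles are 5, 17, 38, and 70, corresponding to d0*w0, d0*w0+d1*w1, and so on up to the complete dot-product.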

In some embodiments, a bias parameter is introduced and added to the dot-product result using accumulator 424. In some embodiments, the bias parameter is received as input at either weight 402 or data 404 along with a multiplication identity element as the other input value. The bias parameter is multiplied against the identity element to preserve the bias parameter and the multiplication result (e.g., the bias parameter) is added to the dot-product result using adder 432. The addition result, a dot-product result offset by a bias value, is stored in accumulator 424 and later shifted out at ResultOut 450 using shadow register 428. In some embodiments, a bias is introduced using a vector engine such as vector engine 111 of FIG. 1.
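Introducing a bias through the multiplier path can be sketched as appending one extra element pair to the input vectors. The sketch assumes the bias is supplied on the weight input and the multiplication identity element, one, on the data input; the function name dot_with_bias is hypothetical.

```python
def dot_with_bias(data, weights, bias):
    """Dot-product offset by a bias that is fed in as one extra element pair."""
    data_in = list(data) + [1]          # multiplication identity element
    weight_in = list(weights) + [bias]  # bias travels on the weight input
    accumulator = 0
    for d, w in zip(data_in, weight_in):
        accumulator += d * w            # final cycle adds bias * 1
    return accumulator


result = dot_with_bias([1, 2, 3, 4], [5, 6, 7, 8], bias=10)
```

The extra pair costs one additional cycle in this model but requires no separate adder, since the bias passes through the same multiplier and adder as every other element.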

FIG. 5 is a block diagram illustrating an embodiment of a cache-enabled microprocessor system for performing machine learning processing. The microprocessor system of FIG. 5 includes hardware data formatters that interface with a cache to prepare input values for a computational array such as a matrix processor. In various embodiments, incorporating a memory cache and using hardware data formatters to populate the cache increases the throughput of the matrix processor and allows the microprocessor system to operate at a higher clock rate than would otherwise be allowed. In the example shown, microprocessor system 500 includes control unit 501, memory 502, cache 503, data formatter 504, weight formatter 506, and matrix processor 507. Input data and weight data are retrieved by hardware data formatters 504, 506 from memory 502 via cache 503. The retrieved input values are formatted using data formatter 504 and weight formatter 506 to prepare vector operands for matrix processor 507. In some embodiments, data formatter 504 and weight formatter 506 include a logic circuit for preparing data for matrix processor 507 and/or a memory cache or buffer for storing and processing input data. For example, data formatter 504 may prepare N operands from a two-dimensional array retrieved from memory 502 via cache 503. Weight formatter 506 may prepare M operands retrieved from memory 502 via cache 503 that correspond to weight values. Data formatter 504 and weight formatter 506 prepare the N and M operands to be processed by matrix processor 507.

In various embodiments, microprocessor system 500 is microprocessor system 100 of FIG. 1 depicted with a memory and memory cache. With respect to microprocessor 100 of FIG. 1, in various embodiments, control unit 501 is control unit 101, data formatter 504 is data formatter 104, weight formatter 506 is weight formatter 106, and matrix processor 507 is matrix processor 107 of FIG. 1. Further, with respect to microprocessor 100 of FIG. 1, in various embodiments, data input 103 and weight input 105 of FIG. 1 are retrieved from memory 502 via cache 503. In some embodiments, microprocessor system 500, including at least hardware data formatter 504, weight formatter 506, and matrix processor 507, performs the processes described with respect to FIGS. 7 and 8 and portions of processes described with respect to FIGS. 2 and 3.

In some embodiments, matrix processor 507 is a computational array that includes a plurality of computation units. For example, a matrix processor receiving M operands and N operands from weight formatter 506 and data formatter 504, respectively, includes M×N computation units. In the figure shown, the small squares inside matrix processor 507 depict that matrix processor 507 includes a logical two-dimensional array of computation units. Computation unit 509 is one of a plurality of computation units of matrix processor 507. In some embodiments, each computation unit is configured to receive one operand from data formatter 504 and one operand from weight formatter 506. Matrix processor 507 and computation unit 509 are described in further detail with respect to matrix processor 107 and computation unit 109, respectively, of FIG. 1. Input values to matrix processor 507 are received from data formatter 504 and weight formatter 506 and described in further detail with respect to inputs from data formatter 104 and weight formatter 106 to matrix processor 107 of FIG. 1.

In the example shown, the dotted arrows between data formatter 504 and matrix processor 507 and between weight formatter 506 and matrix processor 507 depict a coupling between the respective pairs of components that is capable of sending multiple data elements, such as a vector of data elements. In various embodiments, components data formatter 504, weight formatter 506, and matrix processor 507 have wide data widths and include the ability to transfer more than one operand in parallel. The data widths of components data formatter 504, weight formatter 506, and matrix processor 507 are described in further detail with respect to corresponding components data formatter 104, weight formatter 106, and matrix processor 107 of FIG. 1.

In various embodiments, the arrows in FIG. 5 describe the direction data and/or control signals flow from component to component. In some embodiments, the connections depicted by the one-direction arrows in FIG. 5 (e.g., between data formatter 504 and cache 503, between weight formatter 506 and cache 503, and between cache 503 and memory 502) may be bi-directional and thus the data and/or control signals may flow in both directions. For example, in some embodiments, control signals, such as a read request and/or data, can flow from cache 503 to memory 502.

In various embodiments, memory 502 is static random access memory (SRAM). In some embodiments, memory 502 has a single read port or a limited number of read ports. In some embodiments, the amount of memory 502 dedicated to storing data (e.g., sensor data, image data, etc.), weights (e.g., weights associated with image filters, etc.), and/or other data may be dynamically allocated. For example, memory 502 may be configured to partition more or less memory for data input compared to weight input based on a particular workload. In some embodiments, cache 503 includes one or more cache lines. For example, in some embodiments, cache 503 is a 1 KB cache that includes four cache lines where each cache line is 256 bytes. In various embodiments, the cache may be larger or smaller, may have fewer or more cache lines, may have larger or smaller cache lines, and may be sized based on the expected computation workload.

In various embodiments, hardware data formatters (e.g., data formatter 504 and weight formatter 506) calculate memory addresses to retrieve input values from memory 502 and cache 503 for processing by matrix processor 507. In some embodiments, data formatter 504 and/or weight formatter 506 stream data corresponding to a subset of values stored consecutively in memory 502 and/or cache 503. Data formatter 504 and/or weight formatter 506 may retrieve one or more subsets of values stored consecutively in memory and prepare the data as input values for matrix processor 507. In various embodiments, the one or more subsets of values are not themselves stored consecutively in memory with other subsets. In some embodiments, memory 502 contains a single read port. In some embodiments, memory 502 contains a limited number of read ports and the number of read ports is fewer than the data width of components data formatter 504, weight formatter 506, and matrix processor 507. In some embodiments, hardware data formatters 504, 506 will perform a cache check to determine whether a subset of values is in cache 503 prior to issuing a read request to memory 502. In the event the subset of values is cached, hardware data formatters 504, 506 will retrieve the data from cache 503. In various embodiments, in the event of a cache miss, hardware data formatters 504, 506 will retrieve the entire subset of values from memory 502 and populate a cache line of cache 503 with the retrieved values.

In some embodiments, control unit 501 initiates and synchronizes processing between components of microprocessor system 500, including components memory 502, data formatter 504, weight formatter 506, and matrix processor 507. In some embodiments, control unit 501 coordinates access to memory 502 including the issuance of read requests. In some embodiments, control unit 501 interfaces with memory 502 to initiate read requests. In various embodiments, the read requests are initiated by hardware data formatters 504, 506 via the control unit 501. In various embodiments, control unit 501 synchronizes data that is fed to matrix processor 507 from data formatter 504 and weight formatter 506. In some embodiments, control unit 501 synchronizes the data between different components of microprocessor system 500 including between data formatter 504, weight formatter 506, and matrix processor 507, by utilizing processor specific memory, queue, and/or dequeue operations and/or control signals. Additional functionality performed by control unit 501 is described in further detail with respect to control unit 101 of FIG. 1.

In some embodiments, microprocessor system 500 is utilized for performing convolution operations. For example, matrix processor 507 may be used to perform calculations, including dot-product operations, associated with one or more convolution layers of a convolutional neural network. Data formatter 504 and weight formatter 506 may be utilized to prepare matrix and/or vector data in a format for processing by matrix processor 507. Memory 502 may be utilized to store data such as one or more image channels captured by sensors (not shown). Memory 502 may also include weights, including weights in the context of convolution filters, determined by training a machine learning model for autonomous driving.

In various embodiments, microprocessor system 500 may include additional components (not shown in FIG. 5), including processing components, such as a vector processor and a post-processing unit. An example of a vector processor and its associated functionality is vector engine 111 of FIG. 1. An example of a post-processing unit and its associated functionality is post-processing unit 115 of FIG. 1.

FIG. 6 is a block diagram illustrating an embodiment of a hardware data formatter, cache, and memory components of a microprocessor system. In the example shown, the components include memory 601, cache 603, and hardware data formatter 605. Memory 601 is communicatively connected to cache 603 and cache 603 is communicatively connected to hardware data formatter 605. Cache 603 includes four cache lines 611, 613, 615, and 617. Hardware data formatter 605 includes twelve read buffers 621-632. Read buffers 621-632 are each 8-byte read buffers. In various embodiments, the number of and size of the read buffers may be fewer or more than depicted in the embodiment of FIG. 6. For example, read buffers 621-632 are sized to accommodate a 96 element input vector, where each element is 1-byte, to a computational array. In various embodiments, read buffers 621-632 may be implemented as a single wide register, a single memory storage location, individual registers, or individual memory storage locations, among other implementations, as appropriate. In some embodiments, memory 601 and cache 603 are memory 502 and cache 503 of FIG. 5, respectively. In some embodiments, hardware data formatter 605 is data formatter 104 and/or weight formatter 106 of FIG. 1. In some embodiments, hardware data formatter 605 is data formatter 504 and/or weight formatter 506 of FIG. 5.

In various embodiments, a control unit (not shown) such as control unit 101 of FIG. 1 and a computational array (not shown) such as matrix processor 107 of FIG. 1 are components of the microprocessor system. For example, a control unit sends signals to synchronize the processing of computational operations and/or access to memory 601. In various embodiments, a computational array receives input vectors from one or more hardware data formatters as input operands. For example, a matrix processor may receive two vector inputs, one from a data formatter and one from a weight formatter, to perform matrix processing on. As another example, a matrix processor may receive two matrices, one from a data formatter and one from a weight formatter, to perform matrix processing on. In various embodiments, multiple clock cycles are needed to feed an entire matrix into a computational array. For example, in some embodiments, at most one row (and/or column) of a matrix is fed into a computational array each clock cycle.

In various embodiments, the output of hardware data formatter 605 is fed as input to a computational array such as matrix processor 107 of FIG. 1 and matrix processor 507 of FIG. 5. In various embodiments, each element of each read buffer of hardware data formatter 605 is fed into a computation unit of a computational array. For example, the first byte of read buffer 621 is fed into a first computation unit of a computational array, the second byte of read buffer 621 is fed into a second computation unit of a computational array, the third byte of read buffer 621 is fed into a third computation unit of a computational array, and so forth, with the last byte of read buffer 621 (i.e., the eighth byte) feeding into the eighth computation unit of a computational array. The next read buffer then feeds its elements into the next set of computation units. For example, the first byte of read buffer 622 is fed into a ninth computation unit of a computational array and the last byte of read buffer 632 is fed into a ninety-sixth computation unit of a computational array. In various embodiments, the size and number of the read buffers and the number of computation units may vary. As explained above, in the example shown, hardware data formatter 605 includes 12 read buffers 621-632 configured to each store eight consecutive bytes. Hardware data formatter 605 may be configured to feed into a computational array that may receive at least one input vector of 96 1-byte elements.
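
As an illustrative sketch only, the mapping from read buffer bytes to computation units described above can be expressed in Python. The function name and constants below are hypothetical and are not part of the described hardware; the sketch assumes twelve 8-byte read buffers holding 1-byte elements, with zero-based buffer and byte positions and one-based computation unit numbers.

```python
# Hypothetical constants mirroring the example of FIG. 6.
READ_BUFFER_COUNT = 12  # read buffers 621-632
READ_BUFFER_BYTES = 8   # each read buffer holds eight 1-byte elements


def computation_unit_index(buffer_index, byte_index):
    """Return the 1-based computation unit fed by one byte of one read buffer.

    buffer_index: 0 for read buffer 621 through 11 for read buffer 632.
    byte_index:   0 for the first byte through 7 for the eighth byte.
    """
    if not 0 <= buffer_index < READ_BUFFER_COUNT:
        raise ValueError("buffer_index out of range")
    if not 0 <= byte_index < READ_BUFFER_BYTES:
        raise ValueError("byte_index out of range")
    return buffer_index * READ_BUFFER_BYTES + byte_index + 1
```

Under these assumptions, the first byte of read buffer 622 (buffer index 1, byte index 0) maps to the ninth computation unit and the last byte of read buffer 632 (buffer index 11, byte index 7) maps to the ninety-sixth.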

In some embodiments, only a portion of the elements in read buffers 621-632 is utilized as input to a computational array. For example, a two-dimensional 80×80 matrix may utilize only read buffers 621-630 (corresponding to 80 bytes, numbered bytes 0-79) to feed an 80-element row into a matrix processor. In various embodiments, hardware data formatter 605 may perform additional processing on one or more elements of read buffers 621-632 to prepare the elements as input to a computational array. For example, a computational array may be configured to receive 48 16-bit elements instead of 96 8-bit elements and hardware data formatter 605 may be configured to combine pairs of 1-byte elements to form 16-bit elements to prepare an input vector of 48 16-bit elements for the computational array.
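
A minimal sketch of combining pairs of 1-byte elements into 16-bit elements follows. The byte order within each pair is not specified above; the sketch assumes the first byte of each pair is the low-order byte by default and exposes that choice as a hypothetical parameter. The function name is likewise illustrative.

```python
def combine_pairs(elements, little_endian=True):
    """Combine consecutive pairs of 1-byte values into 16-bit values.

    little_endian is an assumption of this sketch: when True, the first
    byte of each pair becomes the low-order byte of the 16-bit element.
    """
    if len(elements) % 2 != 0:
        raise ValueError("an even number of 1-byte elements is required")
    combined = []
    for i in range(0, len(elements), 2):
        first, second = elements[i], elements[i + 1]
        low, high = (first, second) if little_endian else (second, first)
        combined.append((high << 8) | low)
    return combined
```

Applied to the 96 bytes held in twelve 8-byte read buffers, the sketch yields a 48-element vector of 16-bit values.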

In various embodiments, cache 603 is a memory cache of memory 601. In some embodiments, memory 601 is implemented using static random access memory (SRAM). In some embodiments, cache 603 is a 1 KB memory cache and each cache line 611, 613, 615, and 617 is 256 bytes. In various embodiments, reading data into cache 603 loads an entire cache line of data into one of cache lines 611, 613, 615, and 617. In various embodiments, cache 603 may be larger or smaller and have fewer or more cache lines. Moreover, in various embodiments, the cache lines may be a different size. The size and configuration of cache 603, cache lines 611, 613, 615, and 617, and memory 601 may be selected as appropriate for the particular workload of computational operations. For example, the size and number of image filters used for convolution may dictate a larger or smaller cache line and a larger or smaller cache.

In the example shown, the dotted-lined arrows originating from read buffers 621-632 indicate whether the data requested by hardware data formatter 605 exists as a valid entry in cache 603 and in particular which cache line holds the data. For example, read buffers 621, 622, and 623 request data that is found in cache line 611. Read buffers 626 and 627 request data that is found in cache line 613 and read buffers 630, 631, and 632 request data that is found in cache line 617. In various embodiments, each read buffer stores a subset of values located consecutively in the memory. The subsets of values stored at read buffers 621, 622, and 623 may not be located consecutively in memory with the subsets of values stored at read buffers 626 and 627 and also may not be located consecutively in memory with the subsets of values stored at read buffers 630, 631, and 632. In some scenarios, read buffers referencing the same cache line may store subsets of values that are not located consecutively in memory. For example, two read buffers may reference the same cache line of 256 bytes but different 8-byte subsets of consecutive values.

In the example shown, the data requested for read buffers 624, 625, 628, and 629 are not found in cache 603 and are cache misses. In the example shown, an “X” depicts a cache miss. In various embodiments, cache misses must be resolved by issuing a read for the corresponding subset of data from memory 601. In some embodiments, an entire cache line containing the requested subset of data is read from memory 601 and placed into a cache line of cache 603. Various techniques for cache replacement may be utilized as appropriate. Examples of cache replacement policies for determining the cache line to use include First In First Out, Least Recently Used, etc.
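
The following sketch models, in software, a four-line cache with 256-byte cache lines that fills an entire line on a miss and uses a Least Recently Used replacement policy. It is an assumption-laden illustration rather than the described hardware: the class name, the fully associative organization, and the restriction that a request does not cross a cache line boundary are all hypothetical choices made to keep the example short.

```python
from collections import OrderedDict

LINE_BYTES = 256  # size of each cache line in the example
LINE_COUNT = 4    # cache lines 611, 613, 615, and 617


class LineCache:
    """Read-only cache model with whole-line fills and LRU replacement."""

    def __init__(self, memory):
        self.memory = memory        # backing store, e.g. a bytes object
        self.lines = OrderedDict()  # line base address -> line contents
        self.memory_reads = 0       # number of line fills from memory

    def read(self, address, length):
        base = address - address % LINE_BYTES
        if address + length > base + LINE_BYTES:
            raise ValueError("request crosses a cache line boundary")
        if base in self.lines:
            # Cache hit: mark the line as most recently used.
            self.lines.move_to_end(base)
        else:
            # Cache miss: evict the least recently used line if full,
            # then read the entire line from memory.
            if len(self.lines) == LINE_COUNT:
                self.lines.popitem(last=False)
            self.lines[base] = bytes(self.memory[base:base + LINE_BYTES])
            self.memory_reads += 1
        offset = address - base
        return self.lines[base][offset:offset + length]
```

In this model, two 8-byte requests that fall in the same 256-byte line cost a single memory read, and a fifth distinct line displaces whichever line was used least recently. A First In First Out policy could be modeled by omitting the `move_to_end` call.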

In some embodiments, each of read buffers 621-632 stores a subset of values located consecutively in memory. For example, in the example shown, read buffer 621 is 8 bytes in size and stores a subset of 8 bytes of values stored consecutively in memory. In various embodiments, the values are located consecutively in memory 601 and read as a continuous block of values into a cache line of cache 603. By implementing read buffers using the concept of a subset of values, where each of the values is located consecutively in memory, each read buffer is capable of loading multiple elements (e.g., up to eight elements for an 8-byte read buffer) together. In the example shown, fewer reads are required to populate the read buffers than the number of elements loaded. For example, up to twelve reads are required to load 96 elements into the twelve read buffers 621-632. In many scenarios, even fewer reads are necessary in the event that a cache contains the requested subset of data. Similarly, in some scenarios, a single cache line is capable of storing the data requested for multiple read buffers.

In some embodiments, read buffers 621-632 are utilized by hardware data formatter 605 to prepare input operands such as a vector of inputs for a computational array, such as matrix processor 107 of FIG. 1. In some embodiments, the 96 bytes stored in read buffers 621-632 correspond to a 96-element input vector for a computational array. In some embodiments, hardware data formatter 605 selects elements from read buffers 621-632 to accommodate a particular stride when performing a computational operation such as convolution. In some embodiments, hardware data formatter 605 selectively filters out the elements from read buffers 621-632 that are not required for the computational operation. For example, hardware data formatter 605 may utilize only a portion of the elements from each read buffer (e.g., every other byte of a read buffer) as the input vector elements for the computational array. In some embodiments, the filtering is performed using a multiplexer to selectively include elements from read buffers 621-632 when preparing an input vector for a computational operation. In various embodiments, the unused bytes of the read buffer may be discarded.

As an example, in a scenario with a stride parameter set to two, the initial input elements for a convolution operation are every other element of a row of an input matrix. Depending on the input matrix size, the elements include the 1st, 3rd, 5th, and 7th elements, etc., for the first group of input elements necessary for a convolution operation. Read buffer 621 is configured to read the first 8 elements (1 through 8), and thus elements 2, 4, 6, and 8 are not needed for a stride of two. As another example, using a stride of five, four elements are skipped when determining the start of the next neighboring region. Depending on the size of the input data, the 1st, 6th, 11th, 16th, and 21st elements, etc., are the first input elements necessary for a convolution operation. Elements 2-5 and 7-8 are loaded into read buffer 621 but are not used for calculating the first dot-product component result corresponding to each region and may be filtered out.
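
The stride examples above can be sketched as a simple selection over the contents of one read buffer. The function name and its `offset` parameter are hypothetical; the sketch assumes the read buffer begins on an element that is needed for the stride, as read buffer 621 does in the examples.

```python
def select_strided(read_buffer, stride, offset=0):
    """Keep every stride-th element starting at offset; drop the rest."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return list(read_buffer[offset::stride])


# Read buffer 621 holding elements 1 through 8 of a row.
buffer_621 = [1, 2, 3, 4, 5, 6, 7, 8]
stride_two = select_strided(buffer_621, 2)   # elements 1, 3, 5, 7
stride_five = select_strided(buffer_621, 5)  # elements 1, 6
```

With a stride of two the sketch drops elements 2, 4, 6, and 8, and with a stride of five it drops elements 2-5 and 7-8, matching the examples in the text.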

In various embodiments, each read buffer loads eight consecutive elements and can satisfy two elements for a stride of five. For example, read buffer 621 initiates a read at element 1 and also reads in element 6, read buffer 622 initiates a read at element 11 and also reads in element 16, read buffer 623 initiates a read at element 21 and also reads in element 26, etc. In some embodiments, the reads are aligned to multiples of the read buffer size. In some embodiments, only the first read buffer is aligned to a multiple of the read buffer size. In various embodiments, only the start of each matrix row must be aligned to a multiple of the read buffer size. Depending on the stride and the size of the input matrix, in various embodiments, only a subset of the read buffers may be utilized. In various embodiments, the elements corresponding to at least twelve regions, one element for each read buffer 621-632, are loaded and fed to a computational array in parallel. In various embodiments, the number of input elements provided in parallel to a computational array is at least the number of read buffers in the hardware data formatter.

In some embodiments, the elements not needed for the particular stride are filtered out and not passed to the computational array. In various embodiments, using, for example, a multiplexer, the input elements conforming to the stride are selected from the loaded read buffers and formatted into an input vector for a computational array. Once the input vector is formatted, hardware data formatter 605 feeds the input vector to the computational array. The unneeded elements may be discarded. In some embodiments, the unneeded elements may be utilized for the next dot-product component in a future clock cycle and are not discarded from read buffers 621-632. In various embodiments, the elements not needed for implementing a particular stride are fed as inputs to a computational array and the computational array and/or post-processing will filter the results to remove them. For example, the elements not needed may be provided as input to a computational array but the computation units corresponding to the unnecessary elements may be disabled.

In some embodiments, hardware data formatter 605 formats the input vector for a computational array to include padding. For example, hardware data formatter 605 may insert padding using read buffers 621-632. In various embodiments, one or more padding parameters may be described by a control unit using a control signal and/or instruction parameter.

In some embodiments, hardware data formatter 605 determines a set of addresses for preparing operands for a computational array. For example, hardware data formatter 605 calculates associated memory locations required to load a subset of values, determines whether the subset is cached, and potentially issues a read to memory for the subset in the event of a cache miss. In some scenarios, a pending read may satisfy a cache miss. In various embodiments, hardware data formatter 605 only processes the memory address associated with the start element and end element of each read buffer 621-632. In various embodiments, each read buffer 621-632 associates the validity of the cache entry for a subset of values with the memory addresses of the start and end values of the corresponding read buffer. In the example shown, read buffer 621 is configured to store 8 bytes corresponding to up to eight elements. In various embodiments, hardware data formatter 605 calculates the address of the first element and the address of the last element of read buffer 621. Hardware data formatter 605 performs a cache check on the first and last element addresses. In the event either of the addresses is a cache miss, hardware data formatter 605 issues a memory read for 8 bytes starting at the address of the first element. In the event that both addresses are a cache hit from the same cache line, hardware data formatter 605 considers every element in the subset to be a valid cache hit and loads the subset of values from the cache via the appropriate cache line. In this manner, an entire row of elements may be loaded by processing at most the first and last addresses of each read buffer 621-632 (e.g., at most 24 addresses).
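
A minimal sketch of the start-and-end address check described above is shown below. The function name and the representation of the cache as a set of cached line base addresses are hypothetical. The text does not state how a subset whose start and end addresses hit two different cache lines is handled; the sketch assumes, conservatively, that such a subset is treated as not cached.

```python
LINE_BYTES = 256  # cache line size in the example


def subset_is_cached(start_address, subset_bytes, cached_line_bases):
    """Check only the first and last addresses of a subset of values.

    cached_line_bases is a set of base addresses of valid cache lines.
    The subset counts as cached only when both addresses fall in the
    same valid cache line (an assumption for the two-line case).
    """
    end_address = start_address + subset_bytes - 1
    start_line = start_address - start_address % LINE_BYTES
    end_line = end_address - end_address % LINE_BYTES
    return start_line == end_line and start_line in cached_line_bases
```

Because only two addresses are examined per read buffer, twelve read buffers require at most 24 address checks regardless of how many elements each buffer holds.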

FIG. 7 is a flow diagram illustrating an embodiment of a process for performing machine learning processing. The process of FIG. 7 describes a pipeline for slicing one or more matrices to fit a computational array, receiving a computational operation for the sliced matrix or matrices, preparing the data for performing the operation, and computing one or more results associated with the operation. Depending on the application, the process of FIG. 7 may be repeated on different slices of a matrix and the results combined. For example, a frame of image data larger than a computational array may be sliced into smaller matrices and computational operations performed on the sliced matrices. The results of multiple passes of FIG. 7 on different slices may be combined to generate the result of a computational operation on the entire frame. In various embodiments, the process of FIG. 7 is performed by a microprocessor system such as the microprocessor system of FIGS. 1 and 5. In various embodiments, the process of FIG. 7 is utilized to implement applications relying on computational operations such as convolution. For example, the process of FIG. 7 may be utilized to implement a machine learning application that performs inference using a machine learning model. In some embodiments, the process of FIG. 7 is utilized to implement the processes of FIGS. 2 and 3.

At 701, one or more matrices may be sliced. In some embodiments, the size of a matrix, for example, a matrix representing a frame of vision data, is larger than will fit in a computational array. In the event the matrix exceeds the size of the computational array, the matrix is sliced into a smaller two-dimensional matrix with a size limited to the appropriate dimensions of the computational array. In some embodiments, the sliced matrix is a smaller matrix with addresses to elements referencing the original matrix. In various embodiments, the sliced matrix is serialized into a vector for processing. In some embodiments, each pass of the process of FIG. 7 may slice a matrix into a different slice and slices may overlap with previous slices. In various embodiments, a data matrix and a weight matrix may both be sliced, although typically only a data matrix will require slicing. In various embodiments, matrices may be sliced only at boundaries corresponding to multiples of the read buffer size of a hardware data formatter. For example, in the event each read buffer is 8-bytes in size, each row of a sliced matrix must begin with an address having a multiple of eight. In the event a matrix fits within the computational array, no slicing is required (i.e., the matrix slice used for the remaining steps of FIG. 7 is simply the original matrix). In various embodiments, the matrix slice(s) are used as input matrices for the computational operation of 703.

At 703, a computational operation is received. For example, a matrix operation is received by the microprocessor system. As one example, a computational operation requesting a convolution of an image with a filter is received. In some embodiments, the operation may include the necessary parameters to perform the computational operation including the operations involved and the operands. For example, the operation may include the size of the input operands (e.g., the size of each input matrix), the start address of each input matrix, a stride parameter, a padding parameter, and/or matrix, vector, and/or post-processing commands. For example, a computational operation may describe an image data size (e.g., 96×96, 1920×1080, etc.) and bit depth (e.g., 8-bits, 16-bits, etc.) and a filter size and bit depth, etc. In some embodiments, the computational operation is received by a control unit such as control unit 101 of FIG. 1 and 501 of FIG. 5. In some embodiments, a control unit processes the computational operation and performs the necessary synchronization between components of the microprocessor system. In various embodiments, the computational operation is a hardware implementation using control signals. In some embodiments, the computational operation is implemented using one or more processor instructions.

At 705, each hardware data formatter receives a data formatting operation. In some embodiments, the data formatting operation is utilized to prepare input arguments for a computational array such as matrix processor 107 of FIG. 1 and 507 of FIG. 5. For example, each hardware data formatter receives a data formatting operation that includes information necessary to retrieve the data associated with a computational operation (e.g., a start address of a matrix, a matrix size parameter, a stride parameter, a padding parameter, etc.) and to prepare the data to be fed as input into the computational array. In some embodiments, the data formatting operation is implemented using control signals. In some embodiments, the data formatting operation is received by a hardware data formatter such as data formatter 104 and 504 of FIGS. 1 and 5, respectively, and weight formatter 106 and 506 of FIGS. 1 and 5, respectively. In some embodiments, the hardware data formatter is hardware data formatter 605 of FIG. 6. In some embodiments, a control unit such as control unit 101 of FIG. 1 and 501 of FIG. 5 interfaces with a hardware data formatter to process data formatting operations.

At 707, data addresses are processed by one or more hardware data formatters. For example, addresses corresponding to elements of the computational operation are processed by one or more hardware data formatters based on the formatting operations received at 705. In some embodiments, the addresses are processed in order for the hardware data formatter to load the elements (from a cache or memory) and prepare an input vector for a computational array. In various embodiments, a hardware data formatter first calculates a pair of memory addresses for each subset of values to determine whether a subset of elements exists in a cache before issuing a request to memory in the event of a cache miss. In various embodiments, a read request to memory incurs a large latency that may be minimized by reading elements from a cache. In some scenarios, all elements are read from a cache and thus require any cache misses to first populate the cache by issuing a read to memory. To minimize the latency for each read, in various embodiments, the reads are performed on subsets of elements (or values). In some embodiments, memory may only have a limited number of read ports, for example, a single read port, and all reads are processed one at a time. For example, performing 96 independent reads incurs the latency of 96 independent reads for a memory with a single read port. To reduce read latency, subsets of values are read together from memory into corresponding read buffers of a hardware data formatter. For example, using subsets of eight values, at most 12 memory reads are required to read 96 values. In the event some of the subsets are in the cache from previous memory reads, even fewer memory reads are required.
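
The read counts in the preceding paragraph follow from simple arithmetic, sketched below. The function name and the `cached_subsets` parameter are hypothetical. The result is an upper bound, since a single cache line fill may satisfy more than one missing subset.

```python
import math


def memory_reads_needed(element_count, subset_size, cached_subsets=0):
    """Upper bound on memory reads when values are read in subsets."""
    subsets = math.ceil(element_count / subset_size)
    if not 0 <= cached_subsets <= subsets:
        raise ValueError("cached_subsets must be between 0 and the subset count")
    return subsets - cached_subsets
```

For 96 one-byte values, reading each value independently (a subset size of one) gives 96 reads, while subsets of eight values give at most 12. In the example of FIG. 6, where eight of the twelve read buffers are satisfied from the cache, at most four memory reads remain.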

In various embodiments, subsets of values are prepared by determining the memory addresses for the start value of each subset (where each value corresponds to an element) and the end value of each subset. For example, to prepare a subset of eight values of 1 byte each, a cache check is performed using the calculated address of the start value and the calculated address of the end value of the subset. In the event either of the addresses is a cache miss, a memory read is issued to read 8 bytes from memory beginning at the address of the start value. In some embodiments, in addition to reading the requested 8 bytes from memory, an entire cache line of data (corresponding to multiple subsets) is read from memory and stored in the cache. In various embodiments, in the event the start and end addresses of a subset are cached at the same cache line, the entire subset of values is considered cached and no cache check is needed for the remaining elements of the subset. In various embodiments, the processing at 707 determines the addresses of the start value of the subset and the end value of the subset for each subset of values. In various embodiments, one read buffer exists for each subset of values. In various embodiments, read buffers of a hardware data formatter are read buffers 621-632 of hardware data formatter 605 of FIG. 6.

In some embodiments, a stride parameter is implemented and subsets of values that are non-consecutive with one another are loaded into respective read buffers. In various embodiments, each subset of consecutive values includes one or more elements needed to implement a particular stride parameter. For example, for a stride of one, every value in a subset of values located consecutively in memory is a utilized element. As another example, for a stride of two, every other value located consecutively in memory is utilized and a subset of eight consecutive values includes four utilized elements and four that are not utilized. As another example, for a stride of five, a subset of eight values located consecutively in memory may include two utilized elements and six unused elements. For each subset of elements located consecutively in memory, the memory addresses for the start and end elements of the subset are determined and utilized to perform a cache check at 709. In various embodiments, the start element of the subset is the first element of the subset. In some embodiments, the end element of the subset is the last element of the subset, regardless of whether the element is utilized to implement the stride parameter. In some embodiments, the end element of the subset is the last utilized element and not the last element of the subset.

In various embodiments, once the number of utilized elements that are included in a subset of consecutive elements is determined, the next subset of elements begins with the next element needed to satisfy the stride parameter. The next element may result in a memory location that is located at an address non-consecutive with the address of the last element of the previous subset. As an example, using a stride of five, four elements are skipped when determining the start of the next subset of values. Depending on the size of the input data, the 1st and 6th elements are stored in the first subset of values, 11th and 16th elements in the second subset of values, and 21st and 26th elements in the third subset of values, etc. In various embodiments, the second subset of values starts with the 11th element and the third subset of values starts with the 21st element. Each subset is located in memory at locations non-consecutive with the other subsets. Examples of unused elements in the first subset of values include the elements 2-5 and 7-8. In some embodiments, the first row of each matrix is aligned to a multiple of the subset size. In some embodiments, this alignment restriction is required to prevent gaps of invalid values between rows when a matrix is serialized. In some embodiments, all subsets are aligned to the multiple of the subset size.
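
The way each subset begins at the next element needed for the stride can be sketched as follows. The function name and parameters are hypothetical, element numbers are one-based to match the text, and the sketch follows the embodiment in which subsets after the first are not required to be aligned to a multiple of the subset size.

```python
import math


def plan_subsets(stride, subset_size=8, subset_count=12, first_element=1):
    """Return (start element, utilized elements) for each subset of values."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    # Number of utilized elements that fit in one subset of consecutive values.
    per_subset = math.ceil(subset_size / stride)
    plan = []
    for k in range(subset_count):
        # Each subset starts at the next element needed for the stride.
        start = first_element + k * per_subset * stride
        utilized = [start + j * stride for j in range(per_subset)]
        plan.append((start, utilized))
    return plan
```

For a stride of five, the first three entries of the plan are the 1st and 6th elements, the 11th and 16th elements, and the 21st and 26th elements, so the second and third subsets begin at the 11th and 21st elements as described above.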

In various embodiments, each subset of values is loaded in a read buffer such as read buffers 621-632 of FIG. 6. Depending on the particular application (e.g., the stride, the size of the input matrix, the size of the read buffer, the number of read buffers, etc.), some of the read buffers of a hardware data formatter may not be utilized. In some scenarios, the number of input elements provided in parallel to a computational array is at least the number of subsets. For example, a hardware data formatter supporting twelve subsets of values can provide at least twelve elements in parallel to a computational array.

In some embodiments, the formatting performed by a hardware data formatter includes converting a matrix into a vector with elements of the vector fed to a computational array over multiple clock cycles. For example, in some embodiments, a matrix corresponding to data (e.g., image data) is formatted to prepare vectors corresponding to sub-regions of the data. In some embodiments, each element fed to a computational array for a particular clock cycle corresponds to the n-th element of a vector associated with a sub-region of the data. As an example, a 3×3 matrix may be formatted into a one-dimensional vector of nine elements. Each of the nine elements may be fed into the same computation unit of a computational array. In various embodiments, feeding the nine elements requires at least nine clock cycles.
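
The matrix-to-vector formatting can be sketched in software as follows. Row-major ordering and the names `flatten_row_major` and `feed_schedule` are assumptions made for illustration; the text does not specify the element order used by the hardware data formatter.

```python
def flatten_row_major(matrix):
    """Format a 2-D matrix into a one-dimensional vector (assumed row-major)."""
    return [value for row in matrix for value in row]


def feed_schedule(vector):
    """Yield (clock_cycle, element) pairs, one element per clock cycle."""
    for cycle, value in enumerate(vector):
        yield cycle, value


sub_region = [[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]]
vector = flatten_row_major(sub_region)
schedule = list(feed_schedule(vector))
print(len(schedule), "clock cycles to feed", len(vector), "elements")
```

Because one element is delivered to a given computation unit per cycle in this sketch, the nine-element vector occupies nine cycles.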

At 709, a determination is made whether the data corresponding to the addresses determined for each subset at 707 are cached. For example, a cache check is performed on each subset by determining whether the data associated with the address of the start value of the subset and the address of the end value of the subset is in the same cache line. In various embodiments, a cache check is performed for each read buffer, such as read buffers 621-632 of FIG. 6, of a hardware data formatter. In the event the data is cached, the processing continues to 713. In various embodiments, the cache utilized is cache 503 of FIG. 5 and/or 603 of FIG. 6. In the event the data is not cached, processing continues to 711.
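
A minimal software model of the cache check is sketched below. It assumes byte addressing, a 256-byte cache line, and a cache represented as a set of line indices; these choices and the name `subset_is_cached` are illustrative assumptions rather than details of cache 503 or 603.

```python
def subset_is_cached(start_addr, end_addr, cached_lines, line_bytes=256):
    """Return True when the start and end addresses of a subset fall in the
    same cache line and that line is present in the modeled cache.

    cached_lines is a hypothetical model: a set of cache line indices.
    """
    start_line = start_addr // line_bytes
    end_line = end_addr // line_bytes
    return start_line == end_line and start_line in cached_lines


cached = {0}
print(subset_is_cached(0, 7, cached))      # subset inside cached line 0
print(subset_is_cached(256, 263, cached))  # subset in line 1, not cached
```

Only the start and end addresses are examined, so one check covers every element of the subset.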

At 711, each requested subset of data is read into the cache as an entire subset of values. In various embodiments, each subset of data is read into the cache from memory. In some embodiments, the memory is memory 502 of FIG. 5 and 601 of FIG. 6. In some embodiments, an entire cache line is read into the cache. For example, a cache miss for a subset of values results in loading the subset of values into a cache line along with the other data located consecutively with the subset of values in memory. In some scenarios, a single cache line is sufficient to cache multiple subsets.

At 713, matrix processing is performed. For example, a matrix processor performs a matrix operation using the data cached and received by a hardware data formatter. In various embodiments, the cached data is received by the hardware data formatter and processed according to a formatting operation by a hardware data formatter into input values for matrix processing. In some embodiments, the processing by the hardware data formatter includes filtering out a portion of the received cached data. For example, in some embodiments, subsets of values located consecutively in memory are read into the cache and received by the hardware data formatter. In various embodiments, a computational operation may specify a stride and/or padding parameters. For example, to implement a specified stride for convolution, one or more data elements may be filtered from each subset of values. In some embodiments, only a subset of the elements from each of the subsets of values is selected to create an input vector for matrix processing.
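
The filtering step can be sketched as selecting every stride-th value from each subset held in a read buffer. The sketch assumes the first value of each read buffer is a utilized element; `build_input_vector` is a hypothetical name, and lists stand in for read buffer contents.

```python
def build_input_vector(read_buffers, stride):
    """Select the utilized elements from each read buffer and concatenate them.

    Assumption: the first value of every read buffer is utilized.
    """
    vector = []
    for values in read_buffers:
        # Keep every stride-th value; the rest are filtered out.
        vector.extend(values[::stride])
    return vector


buffers = [list(range(10, 18)), list(range(20, 28))]
print(build_input_vector(buffers, 2))
print(build_input_vector(buffers, 5))
```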

In various embodiments, the matrix processor performs the computational operation specified at 703. For example, a matrix processor such as matrix processor 107 of FIG. 1 and 507 of FIG. 5 performs a matrix operation on input vectors received by hardware data formatters. In various embodiments, the matrix processor commences processing once all the input operands are made available. The output of matrix processing is fed to 715 for optional additional processing. In various embodiments, the result of matrix processing is shifted out of a computational array one vector at a time.

At 715, vector and/or post-processing operations are performed. For example, vector processing may include the application of an activation function such as a rectified linear unit (ReLU) function. In some embodiments, vector processing includes scaling and/or normalization. In various embodiments, vector processing is performed on one vector of the output of a computational array at a time. In some embodiments, vector processing is performed by a vector processor such as vector engine 111 of FIG. 1. In various embodiments, post-processing operations may be performed at 715. For example, post-processing operations such as pooling may be performed using a post-processor unit. In some embodiments, post-processing is performed by a post-processing processor such as post-processing unit 115 of FIG. 1. In some embodiments, vector and/or post-processing operations are optional operations.
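
As a simple illustration of vector processing applied one output vector at a time, the sketch below applies a ReLU activation to each vector shifted out of a computational array. The function names and the use of Python lists are assumptions for illustration only.

```python
def relu(vector):
    """Rectified linear unit: negative values become zero."""
    return [max(0, value) for value in vector]


def process_outputs(output_vectors):
    """Apply the activation to one output vector at a time."""
    return [relu(vector) for vector in output_vectors]


print(process_outputs([[-2, 3, 0], [4, -1, 5]]))
```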

FIG. 8 is a flow diagram illustrating an embodiment of a process for retrieving input operands for a computational array. The process of FIG. 8 describes a process for preparing data elements by a hardware data formatter for a computational array. For example, the input data is partitioned into subsets based on the number of read buffers of a hardware data formatter. The process of FIG. 8 is utilized to load the corresponding read buffers with data corresponding to subsets of values located consecutively in memory. By partitioning values into subsets based on memory location and performing a single read on the entire subset instead of an individual read for each element, the latency incurred from accessing memory is reduced. In various embodiments, the process of FIG. 8 is performed by a microprocessor system such as the microprocessor system of FIGS. 1 and 5. In various embodiments, the process of FIG. 8 is implemented at 707, 709, 711, and 713 of FIG. 7. In various embodiments, the memory utilized by the process of FIG. 8 is memory 502 of FIG. 5 and 601 of FIG. 6. In various embodiments, the cache utilized by the process of FIG. 8 is cache 503 of FIG. 5 and 603 of FIG. 6. In various embodiments, the process of FIG. 8 is performed at least in part by a hardware data formatter such as the hardware data formatters of FIGS. 1, 5, and 6. For example, a hardware data formatter may be utilized to perform the steps of 801, 803, 805, 807, 809, 811, 813, and portions of 815. In some embodiments, the process of FIG. 8 is utilized to implement the processes of FIGS. 2 and 3.

In some embodiments, the process of FIG. 8 is performed in parallel on different read buffers and/or subsets of values. For example, in a scenario with eight read buffers, the data to be loaded into the read buffers may be partitioned into at most eight subsets and the process of FIG. 8 is performed on each subset in parallel. In some embodiments, the number of subsets is based on capabilities of the cache and/or the memory. For example, the number of subsets may be based on how many simultaneous cache checks may be performed on the cache and/or the number of simultaneous reads to memory that may be issued.

At 801, the first subset of data elements located consecutively in memory is processed. In various embodiments, the first consecutive subset of data corresponds to the data element designated for the first read buffer of a hardware data formatter. In some embodiments, the address of the first element must be a multiple of the number of elements in each subset. For example, using an 8-byte read buffer, the address of the first element must be a multiple of eight.

At 803, start and end memory addresses are determined for the current subset. For example, the memory address of the start element of a subset and the memory address of the end element of a subset are determined. In various embodiments, the start and end addresses are determined by a hardware data formatter, such as the hardware data formatters of FIGS. 1, 5, and 6.

At 805, a determination is made on whether the subset of data is cached or pending a read. For example, a determination is made whether the data corresponding to the start and end addresses determined at 803 are cached at the same cache line or will be cached as a result of an already issued memory read. In some embodiments, a pending read for a different subset brings an entire cache line of data into the cache and will result in caching the current subset. In the event the data is not cached and will not be cached as a result of a pending memory read, processing continues to 807. In the event the data is cached or will be cached by a pending memory read, processing continues to 811.

At 807, a determination is made on whether a memory read is already issued. In the event a memory read is already issued, processing completes for the current clock cycle. In the event a memory read has not been issued, processing continues to 809. In some embodiments, the memory is configured with a single read port (e.g., to increase density) and the memory can only process one read at a time. In various embodiments, the determination of whether a memory read has been issued is based on the capability of the memory configuration and/or the availability of memory read ports. Not shown in FIG. 8, in some embodiments, in the event an additional memory read is supported for the current clock cycle (despite a pending read), processing continues to 809; otherwise processing completes for the current clock cycle.

At 809, a read is issued to cache a subset of data elements. For example, a block of memory beginning at the start address determined at 803 and extending for the length based on the size of a read buffer is read from memory into the memory cache. In various embodiments, an entire cache line of memory is read into the memory cache. For example, in a scenario with a cache line of 256 bytes and read buffers each capable of storing 8 bytes, a memory read will read 256 bytes of contiguous data into a cache line, which corresponds to 32 subsets of non-overlapping 8-byte values. In various embodiments, reading a subset of values as a single memory read request reduces the latency associated with loading each element. Moreover, reading multiple subsets of values together may further reduce the latency by caching other subsets of values that may be associated with other read buffers. In some embodiments, loading multiple subsets of values takes advantage of potential locality between the subsets resulting in lower latency.
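
The cache line arithmetic in this example can be checked with a short sketch. The constants come from the example above; the helper names `line_base` and `subsets_in_line`, and the assumption that cache lines are aligned to their own size, are illustrative.

```python
CACHE_LINE_BYTES = 256   # cache line size from the example
READ_BUFFER_BYTES = 8    # read buffer size from the example


def line_base(addr, line_bytes=CACHE_LINE_BYTES):
    """Base address of the cache line containing addr (assumes aligned lines)."""
    return addr - (addr % line_bytes)


def subsets_in_line(addr, line_bytes=CACHE_LINE_BYTES, buffer_bytes=READ_BUFFER_BYTES):
    """Start addresses of the non-overlapping subsets cached by one line read."""
    base = line_base(addr, line_bytes)
    return [base + i * buffer_bytes for i in range(line_bytes // buffer_bytes)]


starts = subsets_in_line(0x1234)
print(len(starts), hex(starts[0]), hex(starts[-1]))
```

A single read of the line containing address 0x1234 therefore caches 32 aligned 8-byte subsets, any of which may later satisfy the cache check for another read buffer.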

At 811, a determination is made on whether there are additional subsets of data elements. In the event that every subset has been processed, processing continues to 813. In the event that there are additional subsets to be processed, processing loops back to 803. In some embodiments, depending on the input size, one or more read buffers of a hardware data formatter may not be utilized.

At 813, a determination is made on whether all the data elements are cached. In the event some elements are not cached, processing completes for the current clock cycle to allow the non-cached data elements to be loaded from memory into the cache. In the event all the data elements are cached, the data elements are all available for processing and processing proceeds to 815.

At 815, matrix processing is performed. For example, the cached data elements are received at one or more hardware data formatters, formatted, and fed as input vector(s) to a computational array for processing. A computational array, such as matrix processor 107 of FIG. 1 and 507 of FIG. 5, performs matrix processing on the input vectors.
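
The per-clock-cycle flow of FIG. 8 can be approximated by the following sketch. It is a simplified software model, not the hardware implementation: it assumes a single memory read port, a one-cycle read latency, a 256-byte cache line, and a cache modeled as a set of line indices. The function name `cycles_until_ready` is hypothetical.

```python
def cycles_until_ready(subsets, line_bytes=256):
    """Count clock cycles until every subset is cached (step 813 succeeds).

    subsets is a list of (start_addr, end_addr) pairs, one per read buffer.
    Assumptions: one read may be issued per cycle and the requested cache
    line is available at the start of the next cycle.
    """
    cached_lines = set()
    cycle = 0
    while True:
        cycle += 1
        read_issued = None  # cache line requested during this cycle, if any
        for start_addr, end_addr in subsets:              # 801 / 811: each subset
            line = start_addr // line_bytes               # 803: start and end addresses
            if end_addr // line_bytes != line:
                raise ValueError("subset spans more than one cache line")
            if line in cached_lines or line == read_issued:
                continue                                  # 805: cached or pending
            if read_issued is not None:
                break                                     # 807: read port busy this cycle
            read_issued = line                            # 809: issue the read
        if read_issued is None:
            return cycle                                  # 813: all cached, go to 815
        cached_lines.add(read_issued)                     # assumed one-cycle latency


print(cycles_until_ready([(0, 7), (40, 47), (80, 87)]))
print(cycles_until_ready([(0, 7), (256, 263), (512, 519)]))
```

In this model, three subsets that share a cache line need one read, while three subsets in different cache lines need three reads issued over successive cycles, which illustrates the benefit of locality between subsets.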

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

Claims

1. A microprocessor system, comprising: a computational array that includes a plurality of computation units, wherein each of the plurality of computation units operates on a corresponding value addressed from memory and the values operated by the plurality of computation units are provided to the computational array as a group of values to be processed in parallel, the group of values being utilized as a first input to the computational array; and a hardware data formatter configured to gather the group of values based on a data formatting operation, the data formatting operation identifying at least a stride, wherein the group of values are provided, by the hardware data formatter, to the computational array, and wherein the computational array disables particular computation units based on the stride.
2. The microprocessor system of claim 1, wherein the values operated by the computation units are synchronously provided to the computational array.
3. The microprocessor system of claim 1, wherein the hardware data formatter comprises a plurality of read buffers configured to store respective subsets of the values.
4. The microprocessor system of claim 3, wherein each subset corresponds to values located consecutively in the memory, wherein a number of values from each subset is determined based on the stride, the number of values indicating values of each subset which are to be utilized for processing based on the stride, wherein remaining values of each subset are not utilized, wherein the group of values includes the values of each subset which are to be utilized and the remaining values of each subset which are not utilized.
5. The microprocessor system of claim 4, wherein the computational array disables particular computation units corresponding to the remaining values of each subset which are not utilized.
6. The microprocessor system of claim 1, wherein the group of values includes at least a first subset and a second subset, and wherein the first subset and the second subset are not located consecutively in the memory.
7. The microprocessor system of claim 6, wherein a difference in memory address between the first subset and the second subset is based on the stride.
8. The microprocessor system of claim 6, wherein the hardware data formatter is configured to determine a corresponding start memory address for the first subset and the second subset.
9. The microprocessor system of claim 8, wherein a cache check is performed for each of the first subset and the second subset including by determining whether a value stored at the determined starting memory addresses for the first subset has been cached and determining whether a value stored at the determined starting memory addresses for the second subset has been cached.
10. The microprocessor system of claim 6, wherein the hardware data formatter is configured to determine a corresponding end memory address for the first subset and the second subset.
11. The microprocessor system of claim 1, wherein each computation unit of the plurality of computation units includes an arithmetic logic unit, an accumulator, and a shadow register.
12. The microprocessor system of claim 1, wherein the first input corresponds to image data.
13. The microprocessor system of claim 1, wherein the first input corresponds to ultrasonic or Light Detection and Ranging (LIDAR) data.
14. The microprocessor system of claim 1, wherein a data width of the hardware data formatter is dynamically configurable.
15. The microprocessor system of claim 1, wherein the hardware data formatter is configured to format weight inputs into an input vector and provide the input vector to a subset of the computation units associated with a corresponding subset of the first input.
16. A method comprising: receiving a data formatting operation at a hardware data formatter, the data formatting operation indicating at least a stride; retrieving a first group of values associated with an input data; retrieving a second group of values associated with a weight data; providing in parallel the first group of values and the second group of values to a computational array microprocessor comprising a plurality of computation units arranged as a matrix, wherein the computational array disables particular computation units based on the stride; and processing the first group of values and the second group of values as operands in parallel using the computational array.
17. The method of claim 16, wherein a subset of the first group of values are not utilized based on the stride, and wherein the subset corresponds to the particular computation units.
18. The method of claim 16, wherein the first group of values includes a first subset of values located consecutively in a memory and a second subset of values located consecutively in the memory, and the first subset of values is not located consecutively in the memory from the second subset of values, wherein a number of values from the first subset is determined based on the stride.
19. A microprocessor system, comprising: a computational array that includes a plurality of computation units, wherein each of the plurality of computation units operates on a corresponding value addressed from memory and the values operated by the plurality of computation units are provided to the computational array as a group of values to be processed in parallel, the group of values being utilized as a first input to the computational array, wherein the group of values includes at least 96 values and the group of values includes at least 12 subsets of values;
20. The microprocessor system of claim 19, wherein each read buffer is a single wide register, a single memory storage location, individual registers, or individual memory storage locations.

CROSS REFERENCE TO OTHER APPLICATIONS

This application is a continuation of, and claims priority to, U.S. patent application Ser. No. 15/920,173 titled “COMPUTATIONAL ARRAY MICROPROCESSOR SYSTEM USING NON-CONSECUTIVE DATA FORMATTING” and filed on Mar. 13, 2018, which claims priority to U.S. Provisional Patent Application No. 62/628,212 entitled A COMPUTATIONAL ARRAY MICROPROCESSOR SYSTEM USING NON-CONSECUTIVE DATA FORMATTING filed Feb. 8, 2018, U.S. Provisional Patent Application No. 62/625,251 entitled VECTOR COMPUTATIONAL UNIT filed Feb. 1, 2018, U.S. Provisional Patent Application No. 62/536,399 entitled ACCELERATED MATHEMATICAL ENGINE filed Jul. 24, 2017, U.S. patent application Ser. No. 15/710,433 entitled ACCELERATED MATHEMATICAL ENGINE filed Sep. 20, 2017, which claims priority to U.S. Provisional Patent Application No. 62/536,399 entitled ACCELERATED MATHEMATICAL ENGINE filed Jul. 24, 2017, all of which are incorporated herein by reference for all purposes.

US Referenced Citations (680)

Number	Name	Date	Kind
5239636	Fischer	Aug 1993	A
5267185	Akabane	Nov 1993	A
5311459	D'Luna et al.	May 1994	A
5333296	Bouchard	Jul 1994	A
5471627	Means et al.	Nov 1995	A
5519864	Martell	May 1996	A
5600843	Kato et al.	Feb 1997	A
5717947	Gallup et al.	Feb 1998	A
5742782	Ito	Apr 1998	A
5850530	Chen	Dec 1998	A
5887183	Agarwal et al.	Mar 1999	A
6122722	Slavenburg	Sep 2000	A
6195674	Elbourne	Feb 2001	B1
6425090	Arimilli et al.	Jul 2002	B1
6446190	Barry	Sep 2002	B1
6882755	Silverstein et al.	Apr 2005	B2
7209031	Nakai et al.	Apr 2007	B2
7747070	Puri	Jun 2010	B2
7904867	Burch et al.	Mar 2011	B2
7974492	Nishijima	Jul 2011	B2
8165380	Choi et al.	Apr 2012	B2
8369633	Lu et al.	Feb 2013	B2
8406515	Cheatle et al.	Mar 2013	B2
8509478	Haas et al.	Aug 2013	B2
8588470	Rodriguez et al.	Nov 2013	B2
8744174	Hamada et al.	Jun 2014	B2
8773498	Lindbergh	Jul 2014	B2
8912476	Fogg et al.	Dec 2014	B2
8913830	Sun et al.	Dec 2014	B2
8924455	Barman et al.	Dec 2014	B1
8928753	Han et al.	Jan 2015	B2
8972095	Furuno et al.	Mar 2015	B2
8976269	Duong	Mar 2015	B2
9008422	Eid et al.	Apr 2015	B2
9081385	Ferguson et al.	Jul 2015	B1
9275289	Li et al.	Mar 2016	B2
9586455	Sugai et al.	Mar 2017	B2
9672437	McCarthy	Jun 2017	B2
9697463	Ross	Jul 2017	B2
9710696	Wang et al.	Jul 2017	B2
9738223	Zhang et al.	Aug 2017	B2
9754154	Craig et al.	Sep 2017	B2
9767369	Furman et al.	Sep 2017	B2
9965865	Agrawal et al.	May 2018	B1
10074051	Thorson	Sep 2018	B2
10133273	Linke	Nov 2018	B2
10140252	Fowers et al.	Nov 2018	B2
10140544	Zhao et al.	Nov 2018	B1
10146225	Ryan	Dec 2018	B2
10152655	Krishnamurthy et al.	Dec 2018	B2
10167800	Chung et al.	Jan 2019	B1
10169680	Sachdeva et al.	Jan 2019	B1
10192016	Ng et al.	Jan 2019	B2
10216189	Haynes	Feb 2019	B1
10228693	Micks et al.	Mar 2019	B2
10242293	Shim et al.	Mar 2019	B2
10248121	VandenBerg, III	Apr 2019	B2
10262218	Lee et al.	Apr 2019	B2
10282623	Ziyaee et al.	May 2019	B1
10296828	Viswanathan	May 2019	B2
10303961	Stoffel et al.	May 2019	B1
10310087	Laddha et al.	Jun 2019	B2
10311312	Yu et al.	Jun 2019	B2
10318848	Dijkman et al.	Jun 2019	B2
10325178	Tang et al.	Jun 2019	B1
10331974	Zia et al.	Jun 2019	B2
10338600	Yoon et al.	Jul 2019	B2
10343607	Kumon et al.	Jul 2019	B2
10359783	Williams et al.	Jul 2019	B2
10366290	Wang et al.	Jul 2019	B2
10372130	Kaushansky et al.	Aug 2019	B1
10373019	Nariyambut Murali et al.	Aug 2019	B2
10373026	Kim et al.	Aug 2019	B1
10380741	Yedla et al.	Aug 2019	B2
10394237	Xu et al.	Aug 2019	B2
10395144	Zeng et al.	Aug 2019	B2
10402646	Klaus	Sep 2019	B2
10402986	Ray et al.	Sep 2019	B2
10414395	Sapp et al.	Sep 2019	B1
10423934	Zanghi et al.	Sep 2019	B1
10436615	Agarwal et al.	Oct 2019	B2
10452905	Segalovitz et al.	Oct 2019	B2
10460053	Olson et al.	Oct 2019	B2
10467459	Chen et al.	Nov 2019	B2
10468008	Beckman et al.	Nov 2019	B2
10468062	Levinson et al.	Nov 2019	B1
10470510	Koh et al.	Nov 2019	B1
10474160	Huang et al.	Nov 2019	B2
10474161	Huang et al.	Nov 2019	B2
10474928	Sivakumar et al.	Nov 2019	B2
10489126	Kumar et al.	Nov 2019	B2
10489478	Shalev	Nov 2019	B2
10489972	Atsmon	Nov 2019	B2
10503971	Dang et al.	Dec 2019	B1
10514711	Bar-Nahum et al.	Dec 2019	B2
10528824	Zou	Jan 2020	B2
10529078	Abreu et al.	Jan 2020	B2
10529088	Fine et al.	Jan 2020	B2
10534854	Sharma et al.	Jan 2020	B2
10535191	Sachdeva et al.	Jan 2020	B2
10542930	Sanchez et al.	Jan 2020	B1
10546197	Shrestha et al.	Jan 2020	B2
10546217	Albright et al.	Jan 2020	B2
10552682	Jonsson et al.	Feb 2020	B2
10559386	Neuman	Feb 2020	B1
10565475	Lecue et al.	Feb 2020	B2
10567674	Kirsch	Feb 2020	B2
10568570	Sherpa et al.	Feb 2020	B1
10572717	Zhu et al.	Feb 2020	B1
10574905	Srikanth et al.	Feb 2020	B2
10579058	Oh et al.	Mar 2020	B2
10579063	Haynes et al.	Mar 2020	B2
10579897	Redmon et al.	Mar 2020	B2
10586280	McKenna et al.	Mar 2020	B2
10591914	Palanisamy et al.	Mar 2020	B2
10592785	Zhu et al.	Mar 2020	B2
10599701	Liu	Mar 2020	B2
10599930	Lee et al.	Mar 2020	B2
10599958	He et al.	Mar 2020	B2
10606990	Tull et al.	Mar 2020	B2
10609434	Singhai et al.	Mar 2020	B2
10614344	Anthony et al.	Apr 2020	B2
10621513	Deshpande et al.	Apr 2020	B2
10627818	Sapp et al.	Apr 2020	B2
10628432	Guo et al.	Apr 2020	B2
10628686	Ogale et al.	Apr 2020	B2
10628688	Kim et al.	Apr 2020	B1
10629080	Kazemi et al.	Apr 2020	B2
10636161	Uchigaito	Apr 2020	B2
10636169	Estrada et al.	Apr 2020	B2
10642275	Silva et al.	May 2020	B2
10645344	Marman et al.	May 2020	B2
10649464	Gray	May 2020	B2
10650071	Asgekar et al.	May 2020	B2
10652565	Zhang et al.	May 2020	B1
10656657	Djuric et al.	May 2020	B2
10657391	Chen et al.	May 2020	B2
10657418	Marder et al.	May 2020	B2
10657934	Kolen et al.	May 2020	B1
10661902	Tavshikar	May 2020	B1
10664750	Greene	May 2020	B2
10671082	Huang et al.	Jun 2020	B2
10671349	Bannon et al.	Jun 2020	B2
10671886	Price et al.	Jun 2020	B2
10678244	Iandola et al.	Jun 2020	B2
10678839	Gordon et al.	Jun 2020	B2
10678997	Ahuja et al.	Jun 2020	B2
10679129	Baker	Jun 2020	B2
10685159	Su et al.	Jun 2020	B2
10685188	Zhang et al.	Jun 2020	B1
10692000	Surazhsky et al.	Jun 2020	B2
10692242	Morrison et al.	Jun 2020	B1
10693740	Coccia et al.	Jun 2020	B2
10698868	Guggilla et al.	Jun 2020	B2
10699119	Lo et al.	Jun 2020	B2
10699140	Kench et al.	Jun 2020	B2
10699477	Levinson et al.	Jun 2020	B2
10713502	Tiziani	Jul 2020	B2
10719759	Kutliroff	Jul 2020	B2
10725475	Yang et al.	Jul 2020	B2
10726264	Sawhney et al.	Jul 2020	B2
10726279	Kim et al.	Jul 2020	B1
10726374	Engineer et al.	Jul 2020	B1
10732261	Wang et al.	Aug 2020	B1
10733262	Miller et al.	Aug 2020	B2
10733482	Lee et al.	Aug 2020	B1
10733638	Jain et al.	Aug 2020	B1
10733755	Liao et al.	Aug 2020	B2
10733876	Moura et al.	Aug 2020	B2
10740563	Dugan	Aug 2020	B2
10740914	Xiao et al.	Aug 2020	B2
10748062	Rippel et al.	Aug 2020	B2
10748247	Paluri	Aug 2020	B2
10751879	Li et al.	Aug 2020	B2
10755112	Mabuchi	Aug 2020	B2
10755575	Johnston et al.	Aug 2020	B2
10757330	Ashrafi	Aug 2020	B2
10762396	Vallespi et al.	Sep 2020	B2
10768628	Martin et al.	Sep 2020	B2
10768629	Song et al.	Sep 2020	B2
10769446	Chang et al.	Sep 2020	B2
10769483	Nirenberg et al.	Sep 2020	B2
10769493	Yu et al.	Sep 2020	B2
10769494	Xiao et al.	Sep 2020	B2
10769525	Redding et al.	Sep 2020	B2
10776626	Lin et al.	Sep 2020	B1
10776673	Kim et al.	Sep 2020	B2
10776939	Ma et al.	Sep 2020	B2
10779760	Lee et al.	Sep 2020	B2
10783381	Yu et al.	Sep 2020	B2
10783454	Shoaib et al.	Sep 2020	B2
10789402	Vemuri et al.	Sep 2020	B1
10789544	Fiedei et al.	Sep 2020	B2
10790919	Kolen et al.	Sep 2020	B1
10796221	Zhang et al.	Oct 2020	B2
10796355	Price et al.	Oct 2020	B1
10796423	Goja	Oct 2020	B2
10798368	Briggs et al.	Oct 2020	B2
10803325	Bai et al.	Oct 2020	B2
10803328	Bai et al.	Oct 2020	B1
10803743	Abari et al.	Oct 2020	B2
10805629	Liu et al.	Oct 2020	B2
10809730	Chintakindi	Oct 2020	B2
10810445	Kangaspunta	Oct 2020	B1
10816346	Wheeler et al.	Oct 2020	B2
10816992	Chen	Oct 2020	B2
10817731	Vailespi et al.	Oct 2020	B2
10817732	Porter et al.	Oct 2020	B2
10819923	McCauley et al.	Oct 2020	B1
10824122	Mummadi et al.	Nov 2020	B2
10824862	Qi et al.	Nov 2020	B2
10828790	Nemallan	Nov 2020	B2
10832057	Chan et al.	Nov 2020	B2
10832093	Taralova et al.	Nov 2020	B1
10832414	Pfeiffer	Nov 2020	B2
10832418	Karasev et al.	Nov 2020	B1
10833785	O'Shea et al.	Nov 2020	B1
10836379	Xiao et al.	Nov 2020	B2
10838936	Cohen	Nov 2020	B2
10839230	Charette et al.	Nov 2020	B2
10839578	Coppersmith et al.	Nov 2020	B2
10843628	Kawamoto et al.	Nov 2020	B2
10845820	Wheeler	Nov 2020	B2
10845943	Ansari et al.	Nov 2020	B1
10846831	Raduta	Nov 2020	B2
10846888	Kaplanyan et al.	Nov 2020	B2
10853670	Sholingar et al.	Dec 2020	B2
10853739	Truong et al.	Dec 2020	B2
10860919	Kanazawa et al.	Dec 2020	B2
10860924	Burger	Dec 2020	B2
10867444	Russell et al.	Dec 2020	B2
10871444	Al et al.	Dec 2020	B2
10871782	Milstein et al.	Dec 2020	B2
10872204	Zhu et al.	Dec 2020	B2
10872254	Mangla et al.	Dec 2020	B2
10872326	Garner	Dec 2020	B2
10872531	Liu et al.	Dec 2020	B2
10885083	Moeller-Bertram et al.	Jan 2021	B2
10887433	Fu et al.	Jan 2021	B2
10890898	Akella et al.	Jan 2021	B2
10891715	Li	Jan 2021	B2
10891735	Yang et al.	Jan 2021	B2
10893070	Wang et al.	Jan 2021	B2
10893107	Callari et al.	Jan 2021	B1
10896763	Kempanna et al.	Jan 2021	B2
10901416	Khanna et al.	Jan 2021	B2
10901508	Laszlo et al.	Jan 2021	B2
10902551	Mellado et al.	Jan 2021	B1
10908068	Amer et al.	Feb 2021	B2
10908606	Stein et al.	Feb 2021	B2
10909368	Guo et al.	Feb 2021	B2
10909453	Myers et al.	Feb 2021	B1
10915783	Hallman et al.	Feb 2021	B1
10917522	Segalis et al.	Feb 2021	B2
10921817	Kangaspunta	Feb 2021	B1
10922578	Banerjee et al.	Feb 2021	B2
10924661	Vasconcelos et al.	Feb 2021	B2
10928508	Swaminathan	Feb 2021	B2
10929757	Baker et al.	Feb 2021	B2
10930065	Grant et al.	Feb 2021	B2
10936908	Ho et al.	Mar 2021	B1
10937186	Wang et al.	Mar 2021	B2
10942737	Ivanov	Mar 2021	B2
10943101	Agarwal et al.	Mar 2021	B2
10943132	Wang et al.	Mar 2021	B2
10943355	Fagg et al.	Mar 2021	B2
11157287	Talpes	Oct 2021	B2
11157441	Talpes	Oct 2021	B2
11210584	Brand	Dec 2021	B2
11403069	Bannon et al.	Aug 2022	B2
11409692	Das Sarma et al.	Aug 2022	B2
20020169942	Sugimoto	Nov 2002	A1
20030035481	Hahm	Feb 2003	A1
20050125369	Buck et al.	Jun 2005	A1
20050162445	Sheasby et al.	Jul 2005	A1
20060072847	Chor et al.	Apr 2006	A1
20060224533	Thaler	Oct 2006	A1
20060280364	Ma et al.	Dec 2006	A1
20070255903	Tsadik	Nov 2007	A1
20090016571	Tijerina et al.	Jan 2009	A1
20090113182	Abernathy et al.	Apr 2009	A1
20090192958	Todorokihara	Jul 2009	A1
20100017351	Hench	Jan 2010	A1
20100118157	Kameyama	May 2010	A1
20110029471	Chakradhar et al.	Feb 2011	A1
20110239032	Kato et al.	Sep 2011	A1
20120017066	Vorbach et al.	Jan 2012	A1
20120109915	Krupnik et al.	May 2012	A1
20120110491	Cheung	May 2012	A1
20120134595	Fonseca et al.	May 2012	A1
20120323832	Snook et al.	Dec 2012	A1
20130159665	Kashyap	Jun 2013	A1
20140046995	Ranous	Feb 2014	A1
20140089232	Buibas et al.	Mar 2014	A1
20140115278	Redford	Apr 2014	A1
20140142929	Seide et al.	May 2014	A1
20140180989	Krizhevsky et al.	Jun 2014	A1
20140277718	Tzhikevich et al.	Sep 2014	A1
20140351190	Levin et al.	Nov 2014	A1
20150046332	Adjaoute	Feb 2015	A1
20150104102	Carreira et al.	Apr 2015	A1
20150199272	Goel	Jul 2015	A1
20150331832	Minoya	Nov 2015	A1
20160085721	Abali	Mar 2016	A1
20160132786	Balan et al.	May 2016	A1
20160328856	Mannino et al.	Nov 2016	A1
20160342889	Thorson et al.	Nov 2016	A1
20160342890	Young	Nov 2016	A1
20160342891	Ross	Nov 2016	A1
20160342892	Ross	Nov 2016	A1
20160342893	Ross et al.	Nov 2016	A1
20160364334	Asaro	Dec 2016	A1
20160379109	Chung et al.	Dec 2016	A1
20170011281	Dihkman et al.	Jan 2017	A1
20170052785	Uliel	Feb 2017	A1
20170060811	Yang	Mar 2017	A1
20170097884	Werner	Apr 2017	A1
20170103298	Ling	Apr 2017	A1
20170103299	Aydonat	Apr 2017	A1
20170103313	Ross et al.	Apr 2017	A1
20170103318	Ross	Apr 2017	A1
20170158134	Shigemura	Jun 2017	A1
20170193360	Gao	Jul 2017	A1
20170206434	Nariyambut et al.	Jul 2017	A1
20170277537	Grocutt	Sep 2017	A1
20170277658	Pratas	Sep 2017	A1
20180012411	Richey et al.	Jan 2018	A1
20180018590	Szeto et al.	Jan 2018	A1
20180032857	Lele	Feb 2018	A1
20180039853	Liu et al.	Feb 2018	A1
20180046900	Dally	Feb 2018	A1
20180067489	Oder et al.	Mar 2018	A1
20180068459	Zhang et al.	Mar 2018	A1
20180068540	Romanenko et al.	Mar 2018	A1
20180074506	Branson	Mar 2018	A1
20180107484	Sebexen	Apr 2018	A1
20180121762	Han et al.	May 2018	A1
20180150081	Gross et al.	May 2018	A1
20180157961	Henry	Jun 2018	A1
20180157962	Henry	Jun 2018	A1
20180157966	Henry	Jun 2018	A1
20180189633	Henry	Jul 2018	A1
20180189639	Henry	Jul 2018	A1
20180189640	Henry	Jul 2018	A1
20180189649	Narayan	Jul 2018	A1
20180189651	Henry	Jul 2018	A1
20180197067	Mody	Jul 2018	A1
20180211403	Hotson et al.	Jul 2018	A1
20180218260	Brand	Aug 2018	A1
20180247180	Cheng	Aug 2018	A1
20180260220	Lacy	Sep 2018	A1
20180307438	Huang	Oct 2018	A1
20180307783	Hah	Oct 2018	A1
20180308012	Mummadi et al.	Oct 2018	A1
20180314878	Lee et al.	Nov 2018	A1
20180315153	Park	Nov 2018	A1
20180336164	Phelps	Nov 2018	A1
20180357511	Misra et al.	Dec 2018	A1
20180374105	Azout et al.	Dec 2018	A1
20190011551	Yamamoto	Jan 2019	A1
20190023277	Roger et al.	Jan 2019	A1
20190025773	Yang et al.	Jan 2019	A1
20190026250	Das Sarma	Jan 2019	A1
20190042894	Anderson	Feb 2019	A1
20190042919	Peysakhovich et al.	Feb 2019	A1
20190042944	Nair et al.	Feb 2019	A1
20190042948	Lee et al.	Feb 2019	A1
20190057314	Julian et al.	Feb 2019	A1
20190065637	Bogdoll et al.	Feb 2019	A1
20190072978	Levi	Mar 2019	A1
20190079526	Vallespi et al.	Mar 2019	A1
20190080602	Rice et al.	Mar 2019	A1
20190088948	Rasale	Mar 2019	A1
20190095780	Zhong et al.	Mar 2019	A1
20190095946	Azout et al.	Mar 2019	A1
20190101914	Coleman et al.	Apr 2019	A1
20190108417	Talagala et al.	Apr 2019	A1
20190122111	Min et al.	Apr 2019	A1
20190130255	Yim et al.	May 2019	A1
20190145765	Luo et al.	May 2019	A1
20190146497	Urtasun et al.	May 2019	A1
20190147112	Gordon	May 2019	A1
20190147250	Zhang et al.	May 2019	A1
20190147254	Bai et al.	May 2019	A1
20190147255	Homayounfar et al.	May 2019	A1
20190147335	Wang et al.	May 2019	A1
20190147372	Luo et al.	May 2019	A1
20190158784	Ahn et al.	May 2019	A1
20190179870	Bannon	Jun 2019	A1
20190180154	Orlov et al.	Jun 2019	A1
20190185010	Ganguli et al.	Jun 2019	A1
20190189251	Horiuchi et al.	Jun 2019	A1
20190197357	Anderson et al.	Jun 2019	A1
20190204842	Jafari et al.	Jul 2019	A1
20190205402	Sernau et al.	Jul 2019	A1
20190205667	Avidan et al.	Jul 2019	A1
20190217791	Bradley et al.	Jul 2019	A1
20190227562	Mohammadiha et al.	Jul 2019	A1
20190228037	Nicol et al.	Jul 2019	A1
20190230282	Sypitkowski et al.	Jul 2019	A1
20190235499	Kazemi et al.	Aug 2019	A1
20190235866	Das Sarma	Aug 2019	A1
20190236437	Shin et al.	Aug 2019	A1
20190243371	Nister et al.	Aug 2019	A1
20190244138	Bhowmick et al.	Aug 2019	A1
20190250622	Nister et al.	Aug 2019	A1
20190250626	Ghafarianzadeh et al.	Aug 2019	A1
20190250640	O'Flaherty et al.	Aug 2019	A1
20190258878	Koivisto et al.	Aug 2019	A1
20190266418	Xu et al.	Aug 2019	A1
20190266610	Ghatage et al.	Aug 2019	A1
20190272446	Kangaspunta et al.	Sep 2019	A1
20190276041	Choi et al.	Sep 2019	A1
20190279004	Kwon et al.	Sep 2019	A1
20190286652	Habbecke et al.	Sep 2019	A1
20190286972	El Husseini et al.	Sep 2019	A1
20190287028	St Amant et al.	Sep 2019	A1
20190289281	Badrinarayanan et al.	Sep 2019	A1
20190294177	Kwon et al.	Sep 2019	A1
20190294975	Sachs	Sep 2019	A1
20190311253	Chung	Oct 2019	A1
20190311290	Huang et al.	Oct 2019	A1
20190318099	Carvalho et al.	Oct 2019	A1
20190325088	Dubey et al.	Oct 2019	A1
20190325266	Klepper et al.	Oct 2019	A1
20190325269	Bagherinezhad et al.	Oct 2019	A1
20190325580	Lukac et al.	Oct 2019	A1
20190325595	Stein et al.	Oct 2019	A1
20190329790	Nandakumar et al.	Oct 2019	A1
20190332875	Vallespi-Gonzalez et al.	Oct 2019	A1
20190333232	Vallespi-Gonzalez et al.	Oct 2019	A1
20190336063	Dascalu	Nov 2019	A1
20190339989	Liang et al.	Nov 2019	A1
20190340462	Pao et al.	Nov 2019	A1
20190340492	Burger et al.	Nov 2019	A1
20190340499	Burger et al.	Nov 2019	A1
20190347501	Kim et al.	Nov 2019	A1
20190349571	Herman et al.	Nov 2019	A1
20190354782	Kee et al.	Nov 2019	A1
20190354786	Lee et al.	Nov 2019	A1
20190354808	Park et al.	Nov 2019	A1
20190354817	Shlens et al.	Nov 2019	A1
20190354850	Watson et al.	Nov 2019	A1
20190370398	He et al.	Dec 2019	A1
20190370575	Nandakumar et al.	Dec 2019	A1
20190370645	Lee	Dec 2019	A1
20190370935	Chang et al.	Dec 2019	A1
20190373322	Rojas-Echenique et al.	Dec 2019	A1
20190377345	Bachrach et al.	Dec 2019	A1
20190377965	Totolos et al.	Dec 2019	A1
20190378049	Widmann et al.	Dec 2019	A1
20190378051	Widmann et al.	Dec 2019	A1
20190382007	Casas et al.	Dec 2019	A1
20190384303	Muller et al.	Dec 2019	A1
20190384304	Towal et al.	Dec 2019	A1
20190384309	Silva et al.	Dec 2019	A1
20190384994	Frossard et al.	Dec 2019	A1
20190385048	Cassidy et al.	Dec 2019	A1
20190385360	Yang et al.	Dec 2019	A1
20200004259	Gulino et al.	Jan 2020	A1
20200004351	Marchant et al.	Jan 2020	A1
20200012936	Lee et al.	Jan 2020	A1
20200017117	Milton	Jan 2020	A1
20200025931	Liang et al.	Jan 2020	A1
20200026282	Choe et al.	Jan 2020	A1
20200026283	Barnes et al.	Jan 2020	A1
20200026992	Zhang et al.	Jan 2020	A1
20200027210	Haemel et al.	Jan 2020	A1
20200033858	Xiao	Jan 2020	A1
20200033865	Mellinger et al.	Jan 2020	A1
20200034148	Sumbu	Jan 2020	A1
20200034665	Ghanta et al.	Jan 2020	A1
20200034710	Sidhu et al.	Jan 2020	A1
20200036948	Song	Jan 2020	A1
20200039520	Misu et al.	Feb 2020	A1
20200051550	Baker	Feb 2020	A1
20200060757	Ben-Haim et al.	Feb 2020	A1
20200065711	Clément et al.	Feb 2020	A1
20200065879	Hu et al.	Feb 2020	A1
20200069973	Lou et al.	Mar 2020	A1
20200073385	Jobanputra et al.	Mar 2020	A1
20200074230	Englard et al.	Mar 2020	A1
20200086880	Poeppel et al.	Mar 2020	A1
20200089243	Poeppel et al.	Mar 2020	A1
20200089969	Lakshmi et al.	Mar 2020	A1
20200090056	Singhal et al.	Mar 2020	A1
20200097841	Petousis et al.	Mar 2020	A1
20200098095	Borcs et al.	Mar 2020	A1
20200103894	Cella et al.	Apr 2020	A1
20200104705	Bhowmick et al.	Apr 2020	A1
20200110416	Hong et al.	Apr 2020	A1
20200117180	Cella et al.	Apr 2020	A1
20200117889	Laput et al.	Apr 2020	A1
20200117916	Liu	Apr 2020	A1
20200117917	Yoo	Apr 2020	A1
20200118035	Asawa et al.	Apr 2020	A1
20200125844	She et al.	Apr 2020	A1
20200125845	Hess et al.	Apr 2020	A1
20200126129	Lkhamsuren et al.	Apr 2020	A1
20200134427	Oh et al.	Apr 2020	A1
20200134461	Chai et al.	Apr 2020	A1
20200134466	Weintraub et al.	Apr 2020	A1
20200134848	El-Khamy et al.	Apr 2020	A1
20200143231	Fusi et al.	May 2020	A1
20200143279	West et al.	May 2020	A1
20200148201	King et al.	May 2020	A1
20200149898	Felip et al.	May 2020	A1
20200151201	Chandrasekhar et al.	May 2020	A1
20200151619	Mopur et al.	May 2020	A1
20200151692	Gao et al.	May 2020	A1
20200158822	Owens et al.	May 2020	A1
20200158869	Amirloo et al.	May 2020	A1
20200159225	Zeng et al.	May 2020	A1
20200160064	Wang et al.	May 2020	A1
20200160104	Urtasun et al.	May 2020	A1
20200160117	Urtasun et al.	May 2020	A1
20200160178	Kar et al.	May 2020	A1
20200160532	Urtasun et al.	May 2020	A1
20200160558	Urtasun et al.	May 2020	A1
20200160559	Urtasun et al.	May 2020	A1
20200160598	Manivasagam et al.	May 2020	A1
20200162489	Bar-Nahum et al.	May 2020	A1
20200167438	Herring	May 2020	A1
20200167554	Wang et al.	May 2020	A1
20200174481	Van Heukelom et al.	Jun 2020	A1
20200175326	Shen et al.	Jun 2020	A1
20200175354	Volodarskiy et al.	Jun 2020	A1
20200175371	Kursun	Jun 2020	A1
20200175401	Shen	Jun 2020	A1
20200183482	Sebot et al.	Jun 2020	A1
20200184250	Oko	Jun 2020	A1
20200184333	Oh	Jun 2020	A1
20200192389	ReMine et al.	Jun 2020	A1
20200193313	Ghanta et al.	Jun 2020	A1
20200193328	Guestrin et al.	Jun 2020	A1
20200202136	Shrestha et al.	Jun 2020	A1
20200202196	Guo et al.	Jun 2020	A1
20200209857	Djuric et al.	Jul 2020	A1
20200209867	Valois et al.	Jul 2020	A1
20200209874	Chen et al.	Jul 2020	A1
20200210717	Hou et al.	Jul 2020	A1
20200210769	Hou et al.	Jul 2020	A1
20200210777	Valois et al.	Jul 2020	A1
20200216064	du Toit et al.	Jul 2020	A1
20200218722	Mai et al.	Jul 2020	A1
20200218979	Kwon et al.	Jul 2020	A1
20200223434	Campos et al.	Jul 2020	A1
20200225758	Tang et al.	Jul 2020	A1
20200226377	Campos et al.	Jul 2020	A1
20200226430	Ahuja et al.	Jul 2020	A1
20200238998	Dasalukunte et al.	Jul 2020	A1
20200242381	Chao et al.	Jul 2020	A1
20200242408	Kim et al.	Jul 2020	A1
20200242511	Kale et al.	Jul 2020	A1
20200245869	Sivan et al.	Aug 2020	A1
20200249685	Elluswamy et al.	Aug 2020	A1
20200250456	Wang et al.	Aug 2020	A1
20200250515	Rifkin et al.	Aug 2020	A1
20200250874	Assouline et al.	Aug 2020	A1
20200257301	Weiser et al.	Aug 2020	A1
20200257306	Nisenzon	Aug 2020	A1
20200258057	Farahat et al.	Aug 2020	A1
20200265247	Musk et al.	Aug 2020	A1
20200272160	Djuric et al.	Aug 2020	A1
20200272162	Hasselgren et al.	Aug 2020	A1
20200272859	Iashyn et al.	Aug 2020	A1
20200273231	Schied et al.	Aug 2020	A1
20200279354	Klaiman	Sep 2020	A1
20200279364	Sarkisian et al.	Sep 2020	A1
20200279371	Wenzel et al.	Sep 2020	A1
20200285464	Brebner	Sep 2020	A1
20200286256	Houts et al.	Sep 2020	A1
20200293786	Jia et al.	Sep 2020	A1
20200293796	Sajjad et al.	Sep 2020	A1
20200293828	Wang et al.	Sep 2020	A1
20200293905	Huang et al.	Sep 2020	A1
20200294162	Shah	Sep 2020	A1
20200294257	Yoo et al.	Sep 2020	A1
20200294310	Lee et al.	Sep 2020	A1
20200297237	Tamersoy et al.	Sep 2020	A1
20200298891	Liang et al.	Sep 2020	A1
20200301799	Manivasagam et al.	Sep 2020	A1
20200302276	Yang et al.	Sep 2020	A1
20200302291	Hong	Sep 2020	A1
20200302627	Duggal et al.	Sep 2020	A1
20200302662	Homayounfar et al.	Sep 2020	A1
20200304441	Bradley et al.	Sep 2020	A1
20200306640	Kolen et al.	Oct 2020	A1
20200307562	Ghafarianzadeh et al.	Oct 2020	A1
20200307563	Ghafarianzadeh et al.	Oct 2020	A1
20200309536	Omari et al.	Oct 2020	A1
20200309923	Bhaskaran et al.	Oct 2020	A1
20200310442	Halder et al.	Oct 2020	A1
20200311601	Robinson et al.	Oct 2020	A1
20200312003	Borovikov et al.	Oct 2020	A1
20200315708	Mosnier et al.	Oct 2020	A1
20200320132	Neumann	Oct 2020	A1
20200324073	Rajan et al.	Oct 2020	A1
20200327192	Hackman et al.	Oct 2020	A1
20200327443	Van et al.	Oct 2020	A1
20200327449	Tiwari et al.	Oct 2020	A1
20200327662	Liu et al.	Oct 2020	A1
20200327667	Arbel et al.	Oct 2020	A1
20200331476	Chen et al.	Oct 2020	A1
20200334416	Vianu et al.	Oct 2020	A1
20200334495	Al et al.	Oct 2020	A1
20200334501	Lin et al.	Oct 2020	A1
20200334551	Javidi et al.	Oct 2020	A1
20200334574	Ishida	Oct 2020	A1
20200337648	Saripalli et al.	Oct 2020	A1
20200341466	Pham et al.	Oct 2020	A1
20200342350	Madar et al.	Oct 2020	A1
20200342548	Mazed et al.	Oct 2020	A1
20200342652	Rowell et al.	Oct 2020	A1
20200348909	Das Sarma et al.	Nov 2020	A1
20200350063	Thornton et al.	Nov 2020	A1
20200351438	Dewhurst et al.	Nov 2020	A1
20200356107	Wells	Nov 2020	A1
20200356790	Jaipuria et al.	Nov 2020	A1
20200356864	Neumann	Nov 2020	A1
20200356905	Luk et al.	Nov 2020	A1
20200361083	Mousavian et al.	Nov 2020	A1
20200361485	Zhu et al.	Nov 2020	A1
20200364481	Kornienko et al.	Nov 2020	A1
20200364508	Gurel et al.	Nov 2020	A1
20200364540	Elsayed et al.	Nov 2020	A1
20200364746	Longano et al.	Nov 2020	A1
20200364953	Simoudis	Nov 2020	A1
20200372362	Kim	Nov 2020	A1
20200372402	Kursun et al.	Nov 2020	A1
20200380362	Cao et al.	Dec 2020	A1
20200380383	Kwong et al.	Dec 2020	A1
20200393841	Frisbie et al.	Dec 2020	A1
20200394421	Yu et al.	Dec 2020	A1
20200394457	Brady	Dec 2020	A1
20200394495	Moudgill et al.	Dec 2020	A1
20200394813	Theverapperuma et al.	Dec 2020	A1
20200396394	Zlokolica et al.	Dec 2020	A1
20200398855	Thompson	Dec 2020	A1
20200401850	Bazarsky et al.	Dec 2020	A1
20200401886	Deng et al.	Dec 2020	A1
20200402155	Kurian et al.	Dec 2020	A1
20200402226	Peng	Dec 2020	A1
20200410012	Moon et al.	Dec 2020	A1
20200410224	Goel	Dec 2020	A1
20200410254	Pham et al.	Dec 2020	A1
20200410288	Capota et al.	Dec 2020	A1
20200410751	Omari et al.	Dec 2020	A1
20210004014	Sivakumar	Jan 2021	A1
20210004580	Sundararaman et al.	Jan 2021	A1
20210004611	Garimella et al.	Jan 2021	A1
20210004663	Park et al.	Jan 2021	A1
20210006835	Slattery et al.	Jan 2021	A1
20210011908	Hayes et al.	Jan 2021	A1
20210012116	Urtasun et al.	Jan 2021	A1
20210012210	Sikka et al.	Jan 2021	A1
20210012230	Hayes et al.	Jan 2021	A1
20210012239	Arzani et al.	Jan 2021	A1
20210015240	Elfakhri et al.	Jan 2021	A1
20210019215	Neeter	Jan 2021	A1
20210026360	Luo	Jan 2021	A1
20210027112	Brewington et al.	Jan 2021	A1
20210027117	McGavran et al.	Jan 2021	A1
20210030276	Li et al.	Feb 2021	A1
20210034921	Pinkovich et al.	Feb 2021	A1
20210042575	Firner	Feb 2021	A1
20210042928	Takeda et al.	Feb 2021	A1
20210046954	Haynes	Feb 2021	A1
20210048984	Bannon	Feb 2021	A1
20210049378	Gautam et al.	Feb 2021	A1
20210049455	Kursun	Feb 2021	A1
20210049456	Kursun	Feb 2021	A1
20210049548	Grisz et al.	Feb 2021	A1
20210049700	Nguyen et al.	Feb 2021	A1
20210056114	Price et al.	Feb 2021	A1
20210056306	Hu et al.	Feb 2021	A1
20210056317	Golov	Feb 2021	A1
20210056420	Konishi et al.	Feb 2021	A1
20210056701	Vranceanu et al.	Feb 2021	A1
20220365753	Bannon	Nov 2022	A1

Foreign Referenced Citations (255)

Number	Date	Country
2019261735	Jun 2020	AU
2019201716	Oct 2020	AU
110599537	Dec 2010	CN
102737236	Oct 2012	CN
103366339	Oct 2013	CN
104835114	Aug 2015	CN
103236037	May 2016	CN
103500322	Aug 2016	CN
106419893	Feb 2017	CN
106504253	Mar 2017	CN
107031600	Aug 2017	CN
107169421	Sep 2017	CN
107507134	Dec 2017	CN
107885214	Apr 2018	CN
108122234	Jun 2018	CN
107133943	Jul 2018	CN
107368926	Jul 2018	CN
105318888	Aug 2018	CN
108491889	Sep 2018	CN
108647591	Oct 2018	CN
108710865	Oct 2018	CN
105550701	Nov 2018	CN
108764185	Nov 2018	CN
108845574	Nov 2018	CN
108898177	Nov 2018	CN
109086867	Dec 2018	CN
107103113	Jan 2019	CN
109215067	Jan 2019	CN
109359731	Feb 2019	CN
109389207	Feb 2019	CN
109389552	Feb 2019	CN
106779060	Mar 2019	CN
109579856	Apr 2019	CN
109615073	Apr 2019	CN
106156754	May 2019	CN
106598226	May 2019	CN
106650922	May 2019	CN
109791626	May 2019	CN
109901595	Jun 2019	CN
109902732	Jun 2019	CN
109934163	Jun 2019	CN
109948428	Jun 2019	CN
109949257	Jun 2019	CN
109951710	Jun 2019	CN
109975308	Jul 2019	CN
109978132	Jul 2019	CN
109978161	Jul 2019	CN
110060202	Jul 2019	CN
110069071	Jul 2019	CN
110084086	Aug 2019	CN
110096937	Aug 2019	CN
110111340	Aug 2019	CN
110135485	Aug 2019	CN
110197270	Sep 2019	CN
110310264	Oct 2019	CN
110321965	Oct 2019	CN
110334801	Oct 2019	CN
110399875	Nov 2019	CN
110414362	Nov 2019	CN
110426051	Nov 2019	CN
110473173	Nov 2019	CN
110516665	Nov 2019	CN
110543837	Dec 2019	CN
110569899	Dec 2019	CN
110599864	Dec 2019	CN
110619282	Dec 2019	CN
110619283	Dec 2019	CN
110619330	Dec 2019	CN
110659628	Jan 2020	CN
110688992	Jan 2020	CN
107742311	Feb 2020	CN
110751280	Feb 2020	CN
110826566	Feb 2020	CN
107451659	Apr 2020	CN
108111873	Apr 2020	CN
110956185	Apr 2020	CN
110966991	Apr 2020	CN
111027549	Apr 2020	CN
111027575	Apr 2020	CN
111047225	Apr 2020	CN
111126453	May 2020	CN
111158355	May 2020	CN
107729998	Jun 2020	CN
108549934	Jun 2020	CN
111275129	Jun 2020	CN
111275618	Jun 2020	CN
111326023	Jun 2020	CN
111428943	Jul 2020	CN
111444821	Jul 2020	CN
111445420	Jul 2020	CN
111461052	Jul 2020	CN
111461053	Jul 2020	CN
111488770	Jul 2020	CN
110225341	Aug 2020	CN
111307162	Aug 2020	CN
111488770	Aug 2020	CN
111539514	Aug 2020	CN
111565318	Aug 2020	CN
111582216	Aug 2020	CN
111598095	Aug 2020	CN
108229526	Sep 2020	CN
111693972	Sep 2020	CN
106558058	Oct 2020	CN
107169560	Oct 2020	CN
107622258	Oct 2020	CN
111767801	Oct 2020	CN
111768002	Oct 2020	CN
111783545	Oct 2020	CN
111783971	Oct 2020	CN
111797657	Oct 2020	CN
111814623	Oct 2020	CN
111814902	Oct 2020	CN
111860499	Oct 2020	CN
111881856	Nov 2020	CN
111882579	Nov 2020	CN
111897639	Nov 2020	CN
111898507	Nov 2020	CN
111898523	Nov 2020	CN
111899227	Nov 2020	CN
112101175	Dec 2020	CN
112101562	Dec 2020	CN
112115953	Dec 2020	CN
111062973	Jan 2021	CN
111275080	Jan 2021	CN
112183739	Jan 2021	CN
112232497	Jan 2021	CN
112288658	Jan 2021	CN
112308095	Feb 2021	CN
112308799	Feb 2021	CN
112313663	Feb 2021	CN
112329552	Feb 2021	CN
112348783	Feb 2021	CN
111899245	Mar 2021	CN
202017102235	May 2017	DE
202017102238	May 2017	DE
102017116017	Jan 2019	DE
102018130821	Jun 2020	DE
102019008316	Aug 2020	DE
0 422 348	Apr 1991	EP
1215626	Sep 2008	EP
2228666	Sep 2012	EP
2420408	May 2013	EP
2723069	Apr 2014	EP
2741253	Jun 2014	EP
3115772	Jan 2017	EP
2618559	Aug 2017	EP
3285485	Feb 2018	EP
2863633	Feb 2019	EP
3113080	May 2019	EP
3525132	Aug 2019	EP
3531689	Aug 2019	EP
3537340	Sep 2019	EP
3543917	Sep 2019	EP
3608840	Feb 2020	EP
3657387	May 2020	EP
2396750	Jun 2020	EP
3664020	Jun 2020	EP
3690712	Aug 2020	EP
3690742	Aug 2020	EP
3722992	Oct 2020	EP
3690730	Nov 2020	EP
3739486	Nov 2020	EP
3501897	Dec 2020	EP
3751455	Dec 2020	EP
3783527	Feb 2021	EP
2402572	Aug 2005	GB
2548087	Sep 2017	GB
2577485	Apr 2020	GB
2517270	Jun 2020	GB
04-295953	Oct 1992	JP
2578262	Aug 1998	JP
3941252	Jul 2007	JP
4282583	Jun 2009	JP
4300098	Jul 2009	JP
2010-079840	Apr 2010	JP
2015004922	Jan 2015	JP
2015-056124	Mar 2015	JP
5863536	Feb 2016	JP
6044134	Dec 2016	JP
2017-027149	Feb 2017	JP
6525707	Jun 2019	JP
2019101535	Jun 2019	JP
2020101927	Jul 2020	JP
2020173744	Oct 2020	JP
100326702	Feb 2002	KR
101082878	Nov 2011	KR
101738422	May 2017	KR
101969864	Apr 2019	KR
101996167	Jul 2019	KR
102022388	Aug 2019	KR
102043143	Nov 2019	KR
102095335	Mar 2020	KR
102097120	Apr 2020	KR
1020200085490	Jul 2020	KR
102189262	Dec 2020	KR
1020200142266	Dec 2020	KR
200630819	Sep 2006	TW
I294089	Mar 2008	TW
I306207	Feb 2009	TW
WO 9410638	May 1994	WO
WO 02052835	Jul 2002	WO
WO 14025765	Feb 2014	WO
WO 16032398	Mar 2016	WO
WO 16048108	Mar 2016	WO
WO 16099779	Jun 2016	WO
WO 16186811	Nov 2016	WO
WO 16186823	Nov 2016	WO
WO 16207875	Dec 2016	WO
WO 17117186	Jul 2017	WO
WO 17158622	Sep 2017	WO
WO 19005547	Jan 2019	WO
WO 19067695	Apr 2019	WO
WO 19089339	May 2019	WO
WO 19092456	May 2019	WO
WO 19099622	May 2019	WO
WO 19122952	Jun 2019	WO
WO 19125191	Jun 2019	WO
WO 19126755	Jun 2019	WO
WO 19144575	Aug 2019	WO
WO 19182782	Sep 2019	WO
WO 19191578	Oct 2019	WO
WO 19216938	Nov 2019	WO
WO 19220436	Nov 2019	WO
WO 20006154	Jan 2020	WO
WO 2012756	Jan 2020	WO
WO 20025696	Feb 2020	WO
WO 20034663	Feb 2020	WO
WO 20056157	Mar 2020	WO
WO 20076356	Apr 2020	WO
WO 20097221	May 2020	WO
WO 20101246	May 2020	WO
WO 20120050	Jun 2020	WO
WO 20121973	Jun 2020	WO
WO 20131140	Jun 2020	WO
WO 20139181	Jul 2020	WO
WO 20139355	Jul 2020	WO
WO 20139357	Jul 2020	WO
WO 20142193	Jul 2020	WO
WO 20146445	Jul 2020	WO
WO 20151329	Jul 2020	WO
WO 20157761	Aug 2020	WO
WO 20163455	Aug 2020	WO
WO 20167667	Aug 2020	WO
WO 20174262	Sep 2020	WO
WO 20177583	Sep 2020	WO
WO 20185233	Sep 2020	WO
WO 20185234	Sep 2020	WO
WO 20195658	Oct 2020	WO
WO 20198189	Oct 2020	WO
WO 20198779	Oct 2020	WO
WO 20205597	Oct 2020	WO
WO 20221200	Nov 2020	WO
WO 20240284	Dec 2020	WO
WO 20260020	Dec 2020	WO
WO 20264010	Dec 2020	WO

Non-Patent Literature Citations (14)

Entry
Cornu et al., “Design, Implementation, and Test of a Multi-Model Systolic Neural-Network Accelerator”, Scientific Programming-Parallel Computing Projects of the Swiss Priority Programme, vol. 5, No. 1, Jan. 1, 1996.
Kim et al., “A Large-scale Architecture for Restricted Boltzmann Machines”, Department of Electrical Engineering Stanford University, 2010 18th IEEE Annual International Symposium on, IEEE, Piscataway, NJ, USA, May 2, 2010.
Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, downloaded from <http://papers.nips.co/book/advances-in-neural-information-processing-systems-25-2012>, The 26th annual conference on Neural Information Processing Systems: Dec. 3-8, 2012.
Kung S: “VLSI Array processors”, IEEE ASSP Magazine, IEEE. US, vol. 2, No. 3, Jul. 1985 (1 pg).
Oxford Dictionary, Definition of synchronize, retrieved Sep. 12, 2020, https://www.lexico.com/en/definition/synchronize.
Sato et al., “An in-depth look at Google's first Tensor Processing Unit (TPU)”, posted in Google Cloud Big Data and Machine Learning Blog, downloaded from internet, <URL: https://cloud.google.com/blog/big-data/>, posted May 12, 2017.
Wikipedia, Accumulator (computing), Version from Jul. 14, 2017, 4 pp.
International Search Report and Written Opinion dated Oct. 1, 2018, in International Patent Application No. PCT/US18/42959.
International Search Report and Written Opinion dated Sep. 10, 2018 in application No. PCT/US18/38618.
Jouppi et al., Jun. 26, 2017, In-datacenter performance analysis of a tensor processing unit, 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, 28 pp.
Wikipedia, Booth's multiplication algorithm, Version from May 30, 2017, 5 pp.
Arima et al., Aug. 15, 1994, Recent Topics of Neurochips, System/Control/Information, 38(8):19.
Iwase et al., May 1, 2002, High-speed processing method in SIMD-type parallel computer, Den Journal of the Institute of Electrical Engineers of Japan C, 122-C(5):878-884.
Takahashi, Aug. 2, 1989, Parallel Processing Mechanism, First Edition, Maruzen Co., Ltd., pp. 67-77, 259.

Related Publications (1)

Number	Date	Country
20220050806 A1	Feb 2022	US

Provisional Applications (3)

Number	Date	Country
62628212	Feb 2018	US
62625251	Feb 2018	US
62536399	Jul 2017	US

Continuations (1)

	Number	Date	Country
Parent	15920173	Mar 2018	US
Child	17451984		US

Continuation in Parts (1)

	Number	Date	Country
Parent	15710433	Sep 2017	US
Child	15920173		US

Abstract