Embodiments generally relate to computing systems. More particularly, embodiments relate to improving performance in processing unstructured sparse data, such as three-dimensional (3D) pointcloud data, using tile-based execution and sparsity-aware dataflow optimization.
Understanding the three-dimensional (3D) geometry and semantics of a scene is essential to many real-world systems such as autonomous driving, robotics, remote sensing, augmented reality/virtual reality (AR/VR) systems, and so forth. Conventional solutions may face a number of challenges in processing 3D visual data.
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
Data from 3D sensors or other 3D data sources is known as 3D pointcloud data (or “pointcloud data”), which is characterized by a high-volume but sparse data set. In sparse data sets, much of the data has a value of zero (or near zero); such points are also known as inactive data points. Deep neural network (DNN) methodologies such as, e.g., convolutional neural network (CNN) technology used in two-dimensional (2D) image processing may be considered for various 3D visual and artificial intelligence (AI) applications such as shape classification, object detection, tracking, and scene segmentation. Among several methods proposed for processing 3D data, volumetric projection-based methods may process the neighborhood structure of 3D scenes. These methods face severe challenges, however, in processing 3D visual data due to the high dimensionality and the unstructured nature of 3D data. The volumetric methods involve voxelization, which introduces discretization artifacts and causes information loss. Low-resolution voxel representation can degrade accuracy. On the other hand, maintaining high resolution, such as provided for in high-resolution pointclouds, causes computation and memory requirements to grow in cubic order.
Implementations of 3D sparse convolution have drawbacks as well. For example, CPU- and GPU-based implementations involve data movement in gather and scatter operations, which significantly adds to overall execution time. Because the feature-map size for an entire pointcloud exceeds the capacity of the inner levels of the cache hierarchy, gather and scatter operations require massive data movement across the last-level cache and off-chip memory. In addition, these solutions implement weight stationary (WS), a fixed dataflow for all layers in a neural network, by fetching the weight data only once and re-fetching input feature maps (IFMs) and output feature maps (OFMs) multiple times. Thus, for layers (e.g., initial and last layers in networks) operating over high-resolution 3D pointcloud data, a WS dataflow results in excessively high data accesses because the feature-map size is significantly larger than the weight data size. Since execution time is dominated by these layers, adopting a fixed WS dataflow severely degrades overall performance.
Although tiling may have been used in other applications processing dense 2D/3D data, tiling of 3D spatially sparse data (inherent in 3D pointcloud data) would result in extremely inefficient execution due to excessive memory consumption and uneven work distribution caused by that inherent spatial sparsity. Furthermore, in the case of 3D spatially sparse CNNs, which store spatially sparse data in one-dimensional (1D) compressed data structures, tiling a 1D compressed structure presents several challenges of its own. For example, because the size of a compressed data structure varies per input pointcloud and across different regions within a pointcloud, the tile size requirement may vary significantly and cannot be estimated through mathematical formulation. In addition, storing 3D data in an unordered 1D compressed format results in irregular data accesses, as convolution operations need to be performed on spatially proximate points in 3D space. Accordingly, data accesses cannot be predicted analytically.
An improved computing system as described herein provides technology to optimize (e.g., accelerate) processing of unstructured sparse data, such as 3D pointcloud data, by a compute engine (which may include a neural network such as a convolution neural network (CNN)) through tile-based execution while orchestrating optimal dataflow for the data processing with input-dependent spatial sparsity. The technology may include generation of a locality-aware rulebook, which encodes the receptive/response field for every voxel in the pointcloud; generation of sparsity attributes to represent the sparsity-dependent variation in data accesses and number of operations in spatial regions in the pointcloud data; tiling selection based on 1D compressed pointcloud data; and a sparsity-aware dataflow optimization to choose optimal tiling and loop order for each network layer given architecture parameters (e.g., size of available memory or cache). The technology may provide a rulebook structure to enable maximum spatial reuse of data by performing convolution operations over all voxels in the receptive (or response) field with a single fetch of feature map data.
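For illustration only, a minimal sketch of how an o2i-style locality-aware rulebook might be constructed is shown below. The function and variable names are hypothetical, and a simple 3×3×3 kernel over integer voxel coordinates is assumed; the embodiments are not limited to this form. Each rb-line records one output voxel together with the indices of all active input voxels in its receptive field, so that those features can be fetched once and reused across kernel offsets:

```python
# Illustrative sketch only: build an o2i-style rulebook for a 3x3x3 kernel.
from itertools import product

def build_o2i_rulebook(active_coords, kernel=3):
    # Hashmap from voxel coordinate to its row in the 1D compressed
    # feature array (only active voxels are stored).
    coord_to_idx = {c: i for i, c in enumerate(active_coords)}
    half = kernel // 2
    rulebook = []  # one rb-line per output voxel
    for out_idx, (x, y, z) in enumerate(active_coords):
        in_idxs = []
        for dx, dy, dz in product(range(-half, half + 1), repeat=3):
            nbr = (x + dx, y + dy, z + dz)
            if nbr in coord_to_idx:          # empty space is skipped entirely
                in_idxs.append(coord_to_idx[nbr])
        rulebook.append((out_idx, in_idxs))
    return rulebook

pts = [(0, 0, 0), (0, 0, 1), (5, 5, 5)]
rb = build_o2i_rulebook(pts)
```

In this sketch, the two adjacent voxels appear in each other's receptive fields, while the isolated voxel at (5, 5, 5) sees only itself, so no computation or storage is spent on the empty space between them.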
The technology may also include dividing the process of dataflow optimization into an offline stage and a runtime stage to take advantage of meta-sparsity attributes, which are mostly consistent across pointclouds and thus may be extracted in an offline stage by processing a representative set of sample pointclouds. The technology may provide for optimizing dataflow in an offline stage based on the representative set of sample pointclouds, generating a table of optimal tiling and loop orders for each network layer with a table index based on an average receptive field (ARF) value for each sample pointcloud data set. The technology may further provide for determining, in a runtime stage, an optimal tiling and loop order for an input pointcloud data set through a table look-up based on an ARF value computed for the input pointcloud.
Thus, the technology described herein provides a system and method for three-dimensional (3D) sparse convolution, which avoids cubic growth in compute and memory requirements of other solutions. The technology exploits the inherent spatial sparsity present in 3D scenes to provide more efficient execution and storage by storing 3D sparse data in a one-dimensional (1D) compressed data structure and avoiding computation on free (empty) space.
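As an illustrative sketch of the 1D compressed storage idea (assuming NumPy and hypothetical names; not the claimed data layout), only the active voxels of a dense grid are retained, together with a coordinate-to-row hashmap:

```python
# Illustrative sketch: keep only active voxels of a dense (X, Y, Z, C) grid
# in a 1D compressed structure plus a coordinate-to-row hashmap.
import numpy as np

def compress(dense):
    coords = np.argwhere(np.any(dense != 0, axis=-1))    # active voxel coords
    feats = dense[tuple(coords.T)]                       # (N_active, C) rows
    index = {tuple(c): i for i, c in enumerate(coords)}  # coord -> row
    return coords, feats, index

dense = np.zeros((64, 64, 64, 4), dtype=np.float32)
dense[1, 2, 3] = 1.0
dense[10, 20, 30] = 2.0
coords, feats, index = compress(dense)
# Two active voxel rows are stored instead of 64**3 grid cells.
```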
The system 100 may also include a data sparsity attribute generator 140 that processes the locality-aware rulebook(s) 132 and generates a set of data sparsity attributes 142 representing the sparsity of active data (i.e., active voxels) in the input 3D pointcloud data set 110. The data sparsity attributes 142 are further described herein and with reference to
The system 100 may also include a sparsity-aware dataflow optimizer 150, which processes the locality-aware rulebook(s) 132 and the data sparsity attributes 142 to determine a tile size (e.g., an optimal tile size) and loop order for processing the input 3D pointcloud data set 110. The sparsity-aware dataflow optimizer 150 may include a candidate tile generator 160 to generate candidate tile sizes and a tile and loop order selector 170 to select the optimal tile size and loop order 172 based on one or more optimization criteria for the compute engine 180 to process the input 3D pointcloud data set 110. Network and architecture configuration parameters for the compute engine 180, such as neural network (NN) layer parameters 176 and architecture configuration parameters 178, may also be provided to dataflow optimizer 150. NN layer parameters 176 may include the number of input channels, the number of output channels, the number of filter (kernel) parameters, etc. Architecture configuration parameters 178 may include the available memory capacity (e.g., on-chip or cache memory), etc. Further details regarding the sparsity-aware dataflow optimizer 150 are described herein with reference to the sparsity-aware optimal dataflow and process 500 (as described herein with reference to
The compute engine 180 may implement a neural network such as, e.g., a convolution neural network (CNN), including a 3D CNN, to perform tile-based execution for processing spatially-sparse 3D pointcloud data, and may include tiling control logic 185 to handle selecting input pointcloud data for processing per the selected optimal tile size and loop order 172. The memory 190 may store all or portions of the input feature data associated with each 3D point in the 3D pointcloud data set 110, as well as the locality-aware rulebook(s) 132. The compute engine 180 may fetch data 192, which may include input feature data, network weight data, and partially computed output feature data from previous compute steps along with locality-aware rulebook data, from the memory 190 for processing in accordance with the selected optimal tile size and loop order 172. The compute engine may store in memory 190 the intermediate results 194 from processing the pointcloud data (e.g., on a tile or level basis), which may be used in subsequent data fetches for other levels, tiles, etc. Once all processing is completed for the input 3D pointcloud data set 110, the compute engine may provide an output (e.g., data classification or other result).
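The tile-based execution described above can be sketched as follows. The names are hypothetical, and for brevity the sketch collapses the per-kernel-offset filter weights into a single weight matrix, whereas an actual 3D sparse convolution applies a distinct filter slice per offset recorded in each rb-line:

```python
# Illustrative tile-based execution loop over rb-lines and channel tiles.
import numpy as np

def run_tiled(ifm, weights, rulebook, drb, dic, doc):
    ic, oc = ifm.shape[1], weights.shape[1]
    ofm = np.zeros((len(rulebook), oc), dtype=ifm.dtype)  # intermediate results
    for t in range(0, len(rulebook), drb):        # tile over rb-lines
        tile = rulebook[t:t + drb]
        for c0 in range(0, ic, dic):              # input-channel tile
            for k0 in range(0, oc, doc):          # output-channel tile
                for out_idx, in_idxs in tile:
                    gathered = ifm[in_idxs, c0:c0 + dic]   # gather step
                    ofm[out_idx, k0:k0 + doc] += (
                        gathered.sum(axis=0) @ weights[c0:c0 + dic, k0:k0 + doc]
                    )
    return ofm

ifm = np.ones((3, 4), dtype=np.float32)      # 3 active voxels, 4 channels
weights = np.ones((4, 2), dtype=np.float32)  # shared weight matrix (simplified)
rb_lines = [(0, [0, 1]), (1, [0, 1]), (2, [2])]
out = run_tiled(ifm, weights, rb_lines, drb=2, dic=2, doc=1)
```

Partial output sums accumulated in one channel tile are reused by subsequent channel tiles, mirroring the intermediate results 194 stored in memory 190.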
Some or all components in the system 100 may be implemented using one or more of a central processing unit (CPU), a graphics processing unit (GPU), an artificial intelligence (AI) accelerator, a field programmable gate array (FPGA) accelerator, an application specific integrated circuit (ASIC), and/or via a processor with software, or in a combination of a processor with software and an FPGA or ASIC. More particularly, components of the system 100 may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., in configurable logic such as, for example, programmable logic arrays (PLAs), FPGAs, complex programmable logic devices (CPLDs), in fixed-functionality logic hardware using circuit technology such as, for example, ASIC, complementary metal oxide semiconductor (CMOS) or transistor-transistor logic (TTL) technology, or any combination thereof.
For example, computer program code to carry out operations by the system 100 may be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
Locality-Aware Rulebook Structure (i2o and o2i)
A locality-aware rulebook, such as rulebook(s) 132 generated via locality-aware rulebook generator 130 (
Turning now to
Continuing with
Continuing with
As shown in
Tiling for 3D Spatially Sparse Pointcloud Processing
As shown in
Also shown in
Sparsity Attributes
Sparsity attributes may be generated to represent the sparsity-dependent variation in data accesses and the number of operations in spatial regions of the pointcloud data. Sparsity attributes may be extracted through a single-pass inspection of input pointcloud data. Sparsity attributes may encode the local sparsity structure in the form of memory-size requirements and data accesses over a range of region sizes. The range of region sizes covers a large number of regions in the given input pointcloud, which makes it possible to determine the net data accesses for each valid tiling option without re-processing the input pointcloud multiple times.
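A single-pass extraction of such attributes might look like the following sketch (hypothetical names): for each candidate tile size drb, it records the number of unique input voxels and the number of rulebook entries per tile, normalized by drb, and keeps the maximum and average over tiles:

```python
# Illustrative single-pass sparsity-attribute extraction for an o2i-rulebook.
def sparsity_attributes(rulebook, drb_range):
    attrs = {}
    for drb in drb_range:
        o2i_vals, o2rb_vals = [], []
        for t in range(0, len(rulebook), drb):
            tile = rulebook[t:t + drb]
            di = len(set().union(*(set(idxs) for _, idxs in tile)))  # unique inputs
            rbk = sum(len(idxs) for _, idxs in tile)                 # rb entries
            o2i_vals.append(di / drb)     # input voxels per rb-line
            o2rb_vals.append(rbk / drb)   # rulebook entries per rb-line
        attrs[drb] = {
            "o2i_max": max(o2i_vals),
            "o2i_avg": sum(o2i_vals) / len(o2i_vals),
            "o2rb_max": max(o2rb_vals),
            "o2rb_avg": sum(o2rb_vals) / len(o2rb_vals),
        }
    return attrs

rb_lines = [(0, [0, 1]), (1, [1, 2]), (2, [3])]
attrs = sparsity_attributes(rb_lines, [1, 3])
```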
As discussed above with respect to tiling, for a given drb (the number of rb-lines in a tile) and rulebook, di (the number of input voxels for an o2i-rulebook) or do (the number of output voxels for an i2o-rulebook) may vary across tiles because local sparsity may differ across regions in an input pointcloud. For the k-th tile with drb rb-lines, dok and dik may be expressed as follows:

dok = |⋃l∈tile-k out-voxels(l)| (i2o-rulebook); dik = |⋃l∈tile-k in-voxels(l)| (o2i-rulebook) Equation 1:
where ⋃ denotes the union operator over a set collection of unique elements, rbk represents the size of the local neighborhood (i.e., the total number of rulebook entries in the k-th tile), and i2o/o2i identify the types of rulebooks (i2o-rulebook or o2i-rulebook, respectively). To model these sparsity-dependent values (dik, dok, and rbk) as functions of drb, two sparsity attributes may be defined as follows:

o2ik(drb) = dik/drb Equation 2:

o2rbk(drb) = rbk/drb Equation 3:

(with the i2o-rulebook analogs i2ok(drb) = dok/drb and i2rbk(drb) = rbk/drb defined similarly).
These sparsity attributes may be computed by pre-processing an o2i-rulebook and/or an i2o-rulebook over a range of drb values (i.e., a range of potential tile sizes).
Using these sparsity concepts, a set of sparsity attributes may be defined for an entire pointcloud data set, for use in determining an optimal processing dataflow: the maxima o2imax(drb) and o2rbmax(drb) and the averages o2iavg(drb) and o2rbavg(drb), taken over all tiles k in the pointcloud for the given drb (and, analogously, i2omax, i2rbmax, i2oavg, and i2rbavg for an i2o-rulebook).
Sparsity-Aware Optimal Dataflow
An optimal 3D pointcloud processing dataflow for tile-based execution via the system of
At processing block 510, for the given input pointcloud, the two versions of locality-aware rulebooks (namely, an i2o-rulebook and an o2i-rulebook) may be generated. In some embodiments, only one version (e.g., either an i2o-rulebook or an o2i-rulebook) may be generated for the pointcloud. At processing block 520, the sparsity attributes, already discussed, may be computed.
At processing block 530, for given neural network layer and architecture parameters, tile candidates may be selected such that they fit within the constrained on-chip (or cache) memory. Tile size may be estimated for a candidate tile (drb, dic, doc) using the rulebook-specific max sparsity-attributes (already discussed) o2imax and o2rbmax (for an o2i-rulebook) or i2omax and i2rbmax (for an i2o-rulebook). As discussed previously, a candidate tile is defined by three parameters: drb (subset of rulebook lines), dic (subset of input channels), and doc (subset of output channels). For an o2i-rulebook, estimated tile size may be computed as follows:
sizeo2i(drb, dic, doc) = o2imax(drb) × drb × dic × fm_prec + drb × doc × fm_prec + F × dic × doc × wt_prec + drb × (krb + o2rbmax(drb)) × rb_prec Equation 4:
where F is the number of coefficients in the kernel (i.e., filter), and fm_prec, wt_prec, and rb_prec are the precisions in bytes for feature maps, weights, and rulebook data, respectively. The term krb is a constant to account for the bitmask and other metadata in each rulebook line. The parameters drb, dic, and doc are the tile parameters (already discussed) for each candidate tile. A similar computation may be used to estimate tile sizes for an i2o-rulebook using the attributes i2omax and i2rbmax. Those tiles for which sizeo2i(drb, dic, doc) exceeds the available on-chip or cache memory are eliminated from further consideration (and similarly for tile size estimates for an i2o-rulebook).
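Equation 4 can be transcribed directly. The following sketch (hypothetical names; the default byte precisions and the constant krb value are chosen only for illustration) estimates a candidate tile's footprint and filters out candidates that exceed the memory budget:

```python
# Illustrative transcription of Equation 4 plus memory-budget filtering.
def size_o2i(drb, dic, doc, F, o2i_max, o2rb_max,
             fm_prec=2, wt_prec=1, rb_prec=4, k_rb=8):
    ifm = o2i_max(drb) * drb * dic * fm_prec     # gathered input features
    ofm = drb * doc * fm_prec                    # one output voxel per rb-line
    wts = F * dic * doc * wt_prec                # filter coefficients
    rb = drb * (k_rb + o2rb_max(drb)) * rb_prec  # rb-lines plus metadata
    return ifm + ofm + wts + rb

def feasible_tiles(candidates, budget, **attrs):
    # Keep only (drb, dic, doc, F) candidates whose estimate fits the budget.
    return [t for t in candidates if size_o2i(*t, **attrs) <= budget]

est = size_o2i(4, 8, 8, 27, o2i_max=lambda d: 2.0, o2rb_max=lambda d: 10.0)
kept = feasible_tiles([(4, 8, 8, 27), (64, 32, 32, 27)], 4000,
                      o2i_max=lambda d: 2.0, o2rb_max=lambda d: 10.0)
```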
At processing block 540, the process iterates over tile candidates to determine the number of data accesses for each combination of dataflow (loop order) parameters. The computations in convolution neural networks (CNNs) involve three nested loops: one running over input/output voxel indices in the spatial dimension, one running over input channels, and one running over output channels. These loops may be arranged in different orders, also known as walk-patterns (WPs). The data fetched in outer loops can be reused for calculations in inner loops and can therefore be kept stationary in memory. For example, if the innermost loop runs over input channels, input feature map (IFM) data and weight data are fetched in the innermost loop and output feature map (OFM) data is fetched in an outer loop. In such a case the same OFM data may be reused in the innermost loop, and this loop order is termed Output-Stationary (OS). Similarly, if the innermost loop runs over output channels, the IFM data may be reused in the innermost loop, and this loop order is termed Input-Stationary (IS). If the innermost loop runs over input/output voxel indices, weight data may be reused, and this loop order is termed Weight-Stationary (WS). The total number of data accesses in the computation depends on the size of the data tiles used in each loop and on the order of the loops. In the case of dense data, each tile contains the same amount of data. In the case of spatially sparse data, by contrast, each tile may contain a varying number of input and/or output voxels. For sparse data, the number of data accesses may be estimated based on the average sparsity attributes. For example, the number of data accesses (Acco2i) for an o2i-rulebook for each potential tile size and loop order combination may be estimated based on the o2i average sparsity attributes (o2iavg, o2rbavg) as follows:
where Ic, Oc, and Rb represent the number of input channels, the number of output channels, and the number of rb-lines in the given network layer, respectively; and where WP denotes a candidate walk-pattern (i.e., loop order), which may be chosen from a set of Input-Stationary (IS), Output-Stationary (OS), and/or Weight-Stationary (WS) walk-patterns. These computations are repeated for each combination of tile size (drb, dic, doc) and WP (loop order). For example, for each given tile size (drb, dic, doc), a variety of walk-patterns may be applied, and the number of data accesses may be computed for each combination. Similar computations may be used to estimate the number of data accesses using an i2o-rulebook based on the i2o average sparsity attributes (i2oavg, i2rbavg).
At processing block 550, for given neural network layer and architecture parameters, a tile size and loop order combination is selected to meet the optimization criteria, once the optimizer has explored the potential dataflow combinations for one or both variants of the locality-aware rulebook (block 540). For example, where the optimal dataflow is determined based on the optimization criterion of minimizing data accesses, the tile size and loop order combination that results in the minimum number of data accesses is selected. Once the optimal tile size and loop order combination is selected, the optimal tile size and loop order may be provided to the compute engine (neural network) for processing the pointcloud data set, as illustrated in
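The exploration in blocks 540-550 can be sketched as follows. The per-walk-pattern access formulas below are a simplified illustrative model, not the exact access equations of this disclosure: data kept stationary is fetched once, while each other operand is re-fetched once per tile step of the loop that does not index it. All names are hypothetical:

```python
# Illustrative dataflow exploration: simplified access model per walk-pattern.
from math import ceil

def accesses(wp, drb, dic, doc, Rb, Ic, Oc, F, o2i_avg):
    ifm = o2i_avg(drb) * Rb * Ic          # total gathered IFM volume
    ofm = Rb * Oc                         # total OFM volume
    wts = F * Ic * Oc                     # total weight volume
    n_rb, n_ic, n_oc = ceil(Rb / drb), ceil(Ic / dic), ceil(Oc / doc)
    if wp == "OS":                        # outputs stay resident
        return ofm + ifm * n_oc + wts * n_rb
    if wp == "IS":                        # inputs stay resident
        return ifm + ofm * n_ic + wts * n_rb
    return wts + ifm * n_oc + ofm * n_ic  # "WS": weights stay resident

def best_dataflow(tiles, **layer):
    combos = [(t, wp) for t in tiles for wp in ("IS", "OS", "WS")]
    return min(combos, key=lambda c: accesses(c[1], *c[0], **layer))

params = dict(Rb=16, Ic=16, Oc=16, F=27, o2i_avg=lambda d: 2.0)
best = best_dataflow([(4, 8, 8)], **params)
```

With these example layer parameters the weight volume dominates, so the minimum-access choice is the WS walk-pattern; for layers with large feature maps relative to weights the same search would favor IS or OS instead.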
The process 500 may be implemented in a computing system such as, e.g., the system 100 described herein with reference to
For example, computer program code to carry out operations shown in the process 500 may be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
Meta-Sparsity Attributes and Offline Stage Processing
Sparsity attributes as discussed above may be categorized into two sets: (a) common attributes which are consistent across pointclouds, referred to as Meta-Sparsity Attributes (MSA), and (b) Input-Specific Attributes (ISA), which vary highly across pointclouds. Since the extraction of sparsity attributes and the dataflow exploration as discussed above may be computationally intensive and may therefore add latency overhead, a further improvement in the optimization techniques herein may be obtained by pre-computing meta-sparsity attributes in an offline stage over a representative set of M sample pointclouds for selected binned values of the ISA. The MSA refers to the attributes which remain consistent across a class of pointclouds. Meta-sparsity attributes thus serve as approximations to certain of the actual sparsity attributes of pointcloud data sets.
For example, the behavior of the two types of sparsity attributes, o2i(drb) and o2rb(drb), across a set of pointclouds is illustrated in
An average receptive field (ARF) may be computed for each pointcloud as an input-dependent attribute. The ARF may be computed by averaging the receptive fields in the rulebook, where each rb-line represents the receptive field for a given voxel. That is, the ARF may be calculated by summing the number of entries in each rulebook line over all rulebook lines and dividing by the number of rulebook lines. The ARF may represent o2rbavg for an o2i-rulebook (or i2rbavg for an i2o-rulebook), which is also essentially invariant to the value of drb (i.e., the variation with drb is negligible). The ARF remains consistent within a pointcloud, but varies significantly across pointclouds. Using meta-sparsity attributes, optimal dataflow may be pre-computed in the offline stage over a range of ARFs (e.g., ARF1, . . . ARFm, . . . ARFM) for a set of sample pointclouds, and a table of optimal dataflow selections (tile size and loop order combinations), one for each ARF, may be compiled. The set of ARF values may be selected by sufficiently binning the entire range of ARFs, for example by processing a sufficient number of representative sample pointcloud data sets. As an example, assume that the ARF can vary over a range from 10-25, in steps of 0.5. Then the optimal tile/loop order may be calculated for pointclouds having ARF values of 10, 10.5, . . . 24.5, 25 (a total of approximately 30 such ARF values in this example). Representative pointcloud data sets may be obtained, for example, from the same type of sensor or from similar views. Thus, with sufficient variety in ARFs in the offline stage, once the table is compiled the ARF may be used as an index in a runtime stage to select an optimal tile size and loop order combination for a given input pointcloud of interest having ARFi.
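The ARF computation and the binned table look-up may be sketched as follows (hypothetical names; a 0.5 bin step and a two-entry table are assumed only for illustration):

```python
# Illustrative ARF computation and binned table look-up.
def arf(rulebook):
    entries = sum(len(idxs) for _, idxs in rulebook)
    return entries / len(rulebook)        # mean receptive-field size

def lookup(table, arf_value, step=0.5):
    key = round(arf_value / step) * step  # snap to the nearest binned ARF
    return table[key]

table = {10.0: ("tile_a", "WS"), 10.5: ("tile_b", "OS")}  # offline-compiled
rb_lines = [(0, list(range(10))), (1, list(range(11)))]
choice = lookup(table, arf(rb_lines))     # ARF = 21/2 = 10.5
```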
For purposes of estimating data accesses, the MSA mo2iavg(drb) may be computed over a range of pointclouds (P=1 to M) for an o2i-rulebook as follows:

mo2iavg(drb) = (1/M) × ΣP=1 to M o2iavgP(drb) Equation 6:
A similar computation may be made for the MSA mi2oavg(drb) for an i2o-rulebook. Similar to mo2iavg(drb), another MSA, o2iQ_tile(n), may be defined for tile size estimation, where o2iQ_tile(n) represents the n-th quantile of the attribute o2iavg such that:
Probability(o2iavgP(drb) ≤ o2iQ_tile(n)(drb)) = n, for 1 ≤ P ≤ M Equation 7:
For example, for the 90th quantile (n=0.9), o2iQ_tile(n) is chosen such that it is larger than the sparsity attribute o2iavg(drb) of 90% of pointclouds. A similar computation may be made for the MSA i2oQ_tile(n) for an i2o-rulebook. For example, with n=0.9, 90% of the actual data tiles during a runtime stage are likely to fit within the size estimated based on o2iQ_tile(n)(drb). During the runtime stage, if a data tile exceeds the size allocated based on the estimate, the tile may be split into two or more sub-tiles such that the size requirement for each sub-tile does not exceed the constrained memory size.
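The quantile attribute of Equation 7 and the runtime sub-tile fallback might be sketched as follows (hypothetical names; a simple sorted-order quantile and recursive halving of a tile's rb-lines are assumed):

```python
# Illustrative quantile MSA (Equation 7) and runtime sub-tile splitting.
def quantile_attr(samples, n):
    # samples: the o2i_avg attribute from each of the M sample pointclouds
    s = sorted(samples)
    return s[min(int(n * len(s)), len(s) - 1)]

def split_oversized(tile, fits):
    # Recursively halve a tile of rb-lines until every piece fits.
    if fits(tile) or len(tile) == 1:
        return [tile]
    mid = len(tile) // 2
    return split_oversized(tile[:mid], fits) + split_oversized(tile[mid:], fits)

q = quantile_attr([float(v) for v in range(1, 11)], 0.9)    # 90th quantile
pieces = split_oversized(list(range(8)), lambda t: len(t) <= 3)
```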
Once the optimal tile size and loop order combinations have been determined and compiled (e.g., into a lookup table) in the offline stage, a given input pointcloud of interest may be processed in the runtime stage. The input pointcloud data set may have an average receptive field ARFi, and the optimal tile size and loop order for processing the input pointcloud may be obtained through a table look-up based on ARFi.
The system 600 includes offline stage processing for determining sparsity attributes for a representative set of M sample 3D pointcloud data sets 610 (e.g., pointcloud1, . . . pointcloudm, . . . , pointcloudM, where m may be in the range 1 . . . M). A set of offline procedures 620 may include applying hashmap 120, locality-aware rulebook generator 130, data sparsity attribute generator 140 and ARF compute 630 for each of the M sample 3D pointcloud data sets. ARF compute 630 may compute the average receptive field value for a given pointcloud based on the rb-lines for the respective rulebook(s). Meta-sparsity attribute generator 640 may compute meta-sparsity attributes based on the generation of the rulebooks and the data sparsity attributes via procedures 620. The meta-sparsity attributes are approximations to the actual data sparsity attributes. Sparsity-aware dataflow optimizer 650 may evaluate candidate tile size and loop order combinations in a manner similar to sparsity-aware dataflow optimizer 150 (
The system 600 includes runtime stage processing in which the system 600 may process one or more input 3D pointcloud data sets 110. Each input 3D pointcloud data set 110 may preferably be obtained from a similar sensor type, or a similar data type or source, as reflected by the representative sample pointcloud data sets used in the offline processing stage. The runtime stage for system 600 may include applying a hashmap 120 to the input 3D pointcloud data set 110, then processing with the locality-aware rulebook generator 130 to generate the appropriate rulebooks (o2i and/or i2o rulebook variants). ARF compute 630 may compute the average receptive field value for the input pointcloud based on the rb-lines in the rulebook(s). Once the rulebook(s) and ARF are obtained for the input 3D pointcloud 110, the optimal tile and loop order selector 670 queries the optimal dataflow table 660, based on the ARF, to obtain the optimal tile size and loop order 672 for processing the input pointcloud 110. The optimal tile size and loop order 672 are provided to the compute engine 180 for processing the input pointcloud 110, as described with reference to
Some or all components in the system 600 may be implemented using one or more of a CPU, a GPU, an AI accelerator, an FPGA accelerator, an ASIC, and/or via a processor with software, or in a combination of a processor with software and an FPGA or ASIC. More particularly, components of the system 600 may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., in configurable logic such as, for example, PLAs, FPGAs, CPLDs, in fixed-functionality logic hardware using circuit technology such as, for example, ASIC, CMOS or TTL technology, or any combination thereof.
For example, computer program code to carry out operations by the system 600 may be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
The technology described herein may be applied to various large, unstructured sparse data sets in different scenarios. For example, four-dimensional (4D) pointclouds, which include a fourth dimension for movement over time, may be processed using the systems and processes described above. Similarly, the techniques may be applied to N-dimensional sparse convolutions (N-dimensional CNNs) and to graph neural networks (GNNs).
The processes 701 and/or 702 may be implemented in a computing system such as, e.g., the system 600 described herein with reference to
For example, computer program code to carry out operations shown in the processes 701 and/or 702 may be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
System 10 may also include an input/output (I/O) subsystem 16. I/O subsystem 16 may communicate with, for example, one or more input/output (I/O) devices 17, a network controller 24 (e.g., wired and/or wireless NIC), and storage 22. Storage 22 may be comprised of any appropriate non-transitory machine- or computer-readable memory type (e.g., flash memory, DRAM, SRAM (static random access memory), solid state drive (SSD), hard disk drive (HDD), optical disk, etc.). Storage 22 may include mass storage. In some embodiments, host processor 12 and/or I/O subsystem 16 may communicate with storage 22 (all or portions thereof) via network controller 24. In some embodiments, the system 10 may also include a graphics processor 26 (e.g., graphics processing unit/GPU) and an AI accelerator 27. In an embodiment, the system 10 may also include a vision processing unit (VPU), not shown.
Host processor 12 and I/O subsystem 16 may be implemented together on a semiconductor die as a system on chip (SoC) 11, shown encased in a solid line. SoC 11 may therefore operate as a computing apparatus for optimizing 3D pointcloud data processing. In some embodiments, SoC 11 may also include one or more of system memory 20, network controller 24, and/or graphics processor 26 (shown encased in dotted lines). In some embodiments, SoC 11 may also include other components of system 10.
Host processor 12 and/or I/O subsystem 16 may execute program instructions 28 retrieved from system memory 20 and/or storage 22 to perform one or more aspects of process 500 as described herein with reference to
Computer program code to carry out the processes described above may be written in any combination of one or more programming languages, including an object-oriented programming language such as JAVA, JAVASCRIPT, PYTHON, SMALLTALK, C++ or the like and/or conventional procedural programming languages, such as the “C” programming language or similar programming languages, and implemented as program instructions 28. Additionally, program instructions 28 may include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, microprocessor, etc.).
I/O devices 17 may include one or more input devices, such as a touch-screen, keyboard, mouse, cursor-control device, microphone, digital camera, video recorder, camcorder, and/or biometric scanners and/or sensors; input devices may be used to enter information and interact with system 10 and/or with other devices. I/O devices 17 may also include one or more output devices, such as a display (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display, plasma panels, etc.), speakers, and/or other visual or audio output devices. Input and/or output devices may be used, e.g., to provide a user interface.
Semiconductor apparatus 30 may be constructed using any appropriate semiconductor manufacturing processes or techniques. For example, logic 34 may include transistor channel regions that are positioned (e.g., embedded) within substrate(s) 32. Thus, the interface between logic 34 and substrate(s) 32 may not be an abrupt junction. Logic 34 may also be considered to include an epitaxial layer that is grown on an initial wafer of substrate(s) 32.
Processor core 40 is shown including execution logic 50 having a set of execution units 55-1 through 55-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The illustrated execution logic 50 performs the operations specified by code instructions.
After completion of execution of the operations specified by the code instructions, back end logic 58 retires the instructions of code 42. In one embodiment, the processor core 40 allows out of order execution but requires in order retirement of instructions. Retirement logic 59 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like). In this manner, processor core 40 is transformed during execution of code 42, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 46, and any registers (not shown) modified by the execution logic 50.
Although not illustrated in
The system 60 is illustrated as a point-to-point interconnect system, wherein the first processing element 70 and the second processing element 80 are coupled via a point-to-point interconnect 71. It should be understood that any or all of the interconnects illustrated in
As shown in
Each processing element 70, 80 may include at least one shared cache 99a, 99b. The shared cache 99a, 99b may store data (e.g., instructions) that are utilized by one or more components of the processor, such as the cores 74a, 74b and 84a, 84b, respectively. For example, the shared cache 99a, 99b may locally cache data stored in a memory 62, 63 for faster access by components of the processor. In one or more embodiments, the shared cache 99a, 99b may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof.
While shown with only two processing elements 70, 80, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor. Alternatively, one or more of processing elements 70, 80 may be an element other than a processor, such as an accelerator or a field programmable gate array. For example, additional processing element(s) may include additional processor(s) that are the same as the first processor 70, additional processor(s) that are heterogeneous or asymmetric to the first processor 70, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element. There can be a variety of differences between the processing elements 70, 80 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processing elements 70, 80. For at least one embodiment, the various processing elements 70, 80 may reside in the same die package.
The first processing element 70 may further include memory controller logic (MC) 72 and point-to-point (P-P) interfaces 76 and 78. Similarly, the second processing element 80 may include a MC 82 and P-P interfaces 86 and 88. As shown in
The first processing element 70 and the second processing element 80 may be coupled to an I/O subsystem 90 via P-P interconnects 76 and 86, respectively. As shown in
In turn, I/O subsystem 90 may be coupled to a first bus 65 via an interface 96. In one embodiment, the first bus 65 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited.
As shown in
Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of
Embodiments of each of the above systems, devices, components and/or methods, including system 10, semiconductor apparatus 30, processor core 40, system 60, system 100, system 600, process 500, and/or processes 701-702, and/or any other system components, may be implemented in hardware, software, or any suitable combination thereof. For example, hardware implementations may include configurable logic such as, for example, PLAs, FPGAs, CPLDs, in fixed-functionality logic hardware using circuit technology such as, for example, ASIC, CMOS or TTL technology, or any combination thereof.
Alternatively, or additionally, all or portions of the foregoing systems and/or components and/or methods may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., to be executed by a processor or computing device. For example, computer program code to carry out the operations of the components may be written in any combination of one or more operating system (OS) applicable/appropriate programming languages, including an object-oriented programming language such as PYTHON, PERL, JAVA, SMALLTALK, C++, C# or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
Example 1 includes a computing system, comprising a processor, a memory coupled to the processor to store instructions which, when executed by the processor, cause the processor to generate a locality-aware rulebook based on an input unstructured sparse data set, the locality-aware rulebook storing spatial neighborhood information for active voxels in the input unstructured sparse data set, determine, from a plurality of predetermined tile size and loop order combinations, a tile size and loop order combination for processing the unstructured sparse data set based on an average receptive field (ARF) value, the ARF value computed based on the locality-aware rulebook, wherein the plurality of predetermined tile size and loop order combinations have been derived based on data sparsity attributes, and process by a compute engine the unstructured sparse data set via tile-based execution using the locality-aware rulebook and the tile size and loop order combination.
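As a non-limiting illustration of the selection step recited in Example 1, the runtime flow might be sketched as follows; the helper names, rulebook representation, and table entries below are hypothetical assumptions for illustration, not taken from the embodiments:

```python
# Hypothetical sketch of ARF-based tile size and loop order selection.
# The rulebook is modeled as a list of lines, each listing its active
# neighbor indices; the predetermined table values are illustrative only.

def compute_arf(rulebook):
    """Average receptive field: mean number of active neighbors per rulebook line."""
    return sum(len(line["neighbor_indices"]) for line in rulebook) / len(rulebook)

# Predetermined (offline-derived) table: ARF value -> (tile size, loop order).
PREDETERMINED_TABLE = [
    (4.0, (256, "input-stationary")),
    (9.0, (512, "output-stationary")),
    (18.0, (1024, "weight-stationary")),
]

def select_tile_config(arf):
    """Pick the table entry whose ARF key is closest to the measured ARF."""
    return min(PREDETERMINED_TABLE, key=lambda entry: abs(entry[0] - arf))[1]
```

Under this sketch, a sparser input (lower ARF) maps to a smaller tile with an input-stationary walk, while denser neighborhoods favor larger tiles; the actual mapping would be derived offline as described in Examples 3-5.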
Example 2 includes the system of Example 1, wherein each line of the locality-aware rulebook comprises one of an index of an input voxel representing an offset address for the input voxel data, a bitmask indicating active output voxels in an output response field of the input voxel and bit-locations of convolution weights to be applied, and indices of output voxels in the output response field, or an index of an output voxel representing an offset address for the output voxel data, a bitmask indicating active input voxels in an input receptive field of the output voxel and bit-locations of convolution weights to be applied, and indices of input voxels in the input receptive field.
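The rulebook line layout of Example 2 could be sketched, for the output-voxel variant with a 3×3×3 convolution kernel, roughly as follows; the field names and the 27-bit mask layout are assumptions for illustration:

```python
# Hypothetical encoding of one locality-aware rulebook line (output-voxel
# variant of Example 2): an output-voxel index, a 27-bit mask marking which
# of the 3x3x3 kernel positions have an active input voxel, and the input
# voxel indices listed in mask-bit order.

from dataclasses import dataclass, field
from typing import List

KERNEL_VOLUME = 27  # 3 * 3 * 3 neighborhood

@dataclass
class RulebookLine:
    out_index: int              # offset address of the output voxel's data
    neighbor_mask: int = 0      # bit k set => kernel position k has an active input
    in_indices: List[int] = field(default_factory=list)

    def add_neighbor(self, kernel_pos: int, in_index: int) -> None:
        assert 0 <= kernel_pos < KERNEL_VOLUME
        self.neighbor_mask |= 1 << kernel_pos
        self.in_indices.append(in_index)

    def active_kernel_positions(self) -> List[int]:
        """Bit locations of the convolution weights to be applied."""
        return [k for k in range(KERNEL_VOLUME) if self.neighbor_mask >> k & 1]
```

Because inactive neighbors contribute no entries, each line stores only the active portion of the receptive field, which is consistent with the compression benefit described later in this disclosure.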
Example 3 includes the system of Example 1, wherein the instructions, when executed, further cause the processor to generate, for each of a plurality of sample unstructured sparse data sets, a sample locality-aware rulebook based on the respective sample unstructured sparse data set, each sample locality-aware rulebook storing spatial neighborhood information for active voxels in the respective sample unstructured sparse data set, generate, for each sample locality-aware rulebook, a set of sparsity attributes representing data sparsity within the respective sample unstructured sparse data set, the sparsity attributes computed over a range of a number of rulebook lines per tile, generate a set of meta-sparsity attributes based on the sets of sparsity attributes, the meta-sparsity attributes representing a data sparsity quality for the plurality of sample unstructured sparse data sets, and determine, for each of a plurality of average receptive field (ARF) values, a tile size and loop order combination for processing, by the compute engine, unstructured sparse data based on the set of meta-sparsity attributes and on network and architecture configuration parameters.
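The offline flow of Example 3 might be sketched as below, with the sparsity attributes reduced to a toy "data accesses per tile size" curve and a simple mean as the meta-aggregation; both simplifications are illustrative assumptions, not the actual attribute model:

```python
# Hypothetical sketch of the offline derivation in Example 3: per-sample
# sparsity attributes (here, modeled data accesses over a range of tile
# sizes) are aggregated into meta-sparsity attributes, and the preferred
# tile size minimizes the aggregated access count (per Example 5).

TILE_SIZES = [128, 256, 512, 1024]  # candidate rulebook lines per tile

def sparsity_attributes(sample_access_counts):
    """Attributes for one sample data set: data accesses keyed by tile size."""
    return dict(zip(TILE_SIZES, sample_access_counts))

def meta_sparsity_attributes(per_sample_attrs):
    """Meta attributes: mean accesses per tile size across all samples."""
    return {t: sum(a[t] for a in per_sample_attrs) / len(per_sample_attrs)
            for t in TILE_SIZES}

def best_tile_size(meta_attrs):
    """Tile size minimizing the modeled number of data accesses."""
    return min(meta_attrs, key=meta_attrs.get)
```

In the embodiments this determination would additionally be parameterized by network and architecture configuration parameters and repeated per ARF value to populate the lookup table of Example 4.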
Example 4 includes the system of Example 3, wherein the instructions, when executed, further cause the processor to generate a table including a plurality of tile size and loop order combinations based on each respective determined tile size and loop order combination and the respective ARF value, wherein the tile size and loop order combinations in the table provide the plurality of predetermined tile size and loop order combinations, wherein each of the plurality of ARF values may be computed based on the respective sample locality-aware rulebook.
Example 5 includes the system of Example 4, wherein each respective tile size and loop order combination is determined based on minimizing the number of data accesses required for the compute engine to process an unstructured sparse data set.
Example 6 includes the system of any of Examples 1-5, wherein each of the sample unstructured sparse data sets is a three-dimensional (3D) pointcloud data set, wherein the input unstructured sparse data set is a 3D pointcloud data set, wherein the locality-aware rulebook and each sample locality-aware rulebook is generated from a one-dimensional (1D) compressed data set that includes the coordinates of active voxels in the respective unstructured sparse data set, wherein the sparsity attributes encode local sparsity structure in the form of memory-size requirements and data-accesses over a range of region-sizes in the respective sample unstructured sparse data set, and wherein the tile size includes a number of rulebook lines per tile and the loop order includes one of an input-stationary walk pattern, an output-stationary walk pattern, or a weight-stationary walk pattern.
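Example 6's generation of a rulebook from a 1D compressed list of active-voxel coordinates could be sketched with a coordinate hash map, a common approach in sparse-convolution implementations; the exact generation method of the embodiments is not reproduced here, and all names below are illustrative:

```python
# Hypothetical sketch of building rulebook lines from a compressed list of
# active voxel coordinates (Example 6), using a dict as a coordinate hash
# map. For a 3x3x3 kernel with stride 1, each active voxel's neighbors are
# found by probing the 27 surrounding offsets.

from itertools import product

def build_rulebook(active_coords):
    index_of = {c: i for i, c in enumerate(active_coords)}  # coord -> offset address
    offsets = list(product((-1, 0, 1), repeat=3))           # 27 kernel positions
    rulebook = []
    for out_coord in active_coords:
        line = {"out_index": index_of[out_coord], "neighbors": []}
        for k, (dx, dy, dz) in enumerate(offsets):
            n = (out_coord[0] + dx, out_coord[1] + dy, out_coord[2] + dz)
            if n in index_of:  # only active input voxels contribute
                line["neighbors"].append((k, index_of[n]))  # (weight bit, in index)
        rulebook.append(line)
    return rulebook
```

Note that only active voxels are ever visited, so the cost scales with the number of active voxels rather than with the full (cubic) voxel grid, reflecting the motivation stated in the introduction.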
Example 7 includes a semiconductor apparatus comprising one or more substrates, and logic coupled to the one or more substrates, wherein the logic is implemented at least partly in one or more of configurable logic or fixed-functionality hardware logic, the logic coupled to the one or more substrates to generate a locality-aware rulebook based on an input unstructured sparse data set, the locality-aware rulebook storing spatial neighborhood information for active voxels in the input unstructured sparse data set, compute an average receptive field (ARF) value based on the locality-aware rulebook, and determine, from a plurality of predetermined tile size and loop order combinations, a tile size and loop order combination for processing the unstructured sparse data based on the computed ARF value, wherein the plurality of predetermined tile size and loop order combinations have been derived based on data sparsity attributes, wherein the locality-aware rulebook and the tile size and loop order combination are to be provided to a compute engine, the compute engine to process the unstructured sparse data using the locality-aware rulebook and the tile size and loop order combination.
Example 8 includes the apparatus of Example 7, wherein each line of the locality-aware rulebook comprises one of an index of an input voxel representing an offset address for the input voxel data, a bitmask indicating active output voxels in an output response field of the input voxel and bit-locations of convolution weights to be applied, and indices of output voxels in the output response field, or an index of an output voxel representing an offset address for the output voxel data, a bitmask indicating active input voxels in an input receptive field of the output voxel and bit-locations of convolution weights to be applied, and indices of input voxels in the input receptive field.
Example 9 includes the apparatus of Example 7, wherein the logic is further to generate, for each of a plurality of sample unstructured sparse data sets, a sample locality-aware rulebook based on the respective sample unstructured sparse data set, each sample locality-aware rulebook storing spatial neighborhood information for active voxels in the respective sample unstructured sparse data set, generate, for each sample locality-aware rulebook, a set of sparsity attributes representing data sparsity within the respective sample unstructured sparse data set, the sparsity attributes computed over a range of a number of rulebook lines per tile, generate a set of meta-sparsity attributes based on the sets of sparsity attributes, the meta-sparsity attributes representing a data sparsity quality for the plurality of sample unstructured sparse data sets, and determine, for each of a plurality of average receptive field (ARF) values, a tile size and loop order combination for processing, by the compute engine, unstructured sparse data based on the set of meta-sparsity attributes and on network and architecture configuration parameters.
Example 10 includes the apparatus of Example 9, wherein the logic is further to generate a table including a plurality of tile size and loop order combinations based on each respective determined tile size and loop order combination and the respective ARF value, wherein the tile size and loop order combinations in the table provide the plurality of predetermined tile size and loop order combinations, wherein each of the plurality of ARF values may be computed based on the respective sample locality-aware rulebook.
Example 11 includes the apparatus of Example 10, wherein each respective tile size and loop order combination is determined based on minimizing the number of data accesses required for the compute engine to process an unstructured sparse data set.
Example 12 includes the apparatus of any of Examples 7-11, wherein each of the sample unstructured sparse data sets is a three-dimensional (3D) pointcloud data set, wherein the input unstructured sparse data set is a 3D pointcloud data set, wherein the locality-aware rulebook and each sample locality-aware rulebook is generated from a one-dimensional (1D) compressed data set that includes the coordinates of active voxels in the respective unstructured sparse data set, wherein the sparsity attributes encode local sparsity structure in the form of memory-size requirements and data-accesses over a range of region-sizes in the respective sample unstructured sparse data set, and wherein the tile size includes a number of rulebook lines per tile and the loop order includes one of an input-stationary walk pattern, an output-stationary walk pattern, or a weight-stationary walk pattern.
Example 13 includes the apparatus of Example 7, wherein the logic coupled to the one or more substrates includes transistor channel regions that are positioned within the one or more substrates.
Example 14 includes at least one non-transitory computer readable storage medium comprising a set of instructions which, when executed by a computing system, cause the computing system to generate a locality-aware rulebook based on an input unstructured sparse data set, the locality-aware rulebook storing spatial neighborhood information for active voxels in the input unstructured sparse data set, compute an average receptive field (ARF) value based on the locality-aware rulebook, and determine, from a plurality of predetermined tile size and loop order combinations, a tile size and loop order combination for processing the unstructured sparse data based on the computed ARF value, wherein the plurality of predetermined tile size and loop order combinations have been derived based on data sparsity attributes, wherein the locality-aware rulebook and the tile size and loop order combination are to be provided to a compute engine, the compute engine to process the unstructured sparse data using the locality-aware rulebook and the tile size and loop order combination.
Example 15 includes the at least one non-transitory computer readable storage medium of Example 14, wherein each line of the locality-aware rulebook comprises one of an index of an input voxel representing an offset address for the input voxel data, a bitmask indicating active output voxels in an output response field of the input voxel and bit-locations of convolution weights to be applied, and indices of output voxels in the output response field, or an index of an output voxel representing an offset address for the output voxel data, a bitmask indicating active input voxels in an input receptive field of the output voxel and bit-locations of convolution weights to be applied, and indices of input voxels in the input receptive field.
Example 16 includes the at least one non-transitory computer readable storage medium of Example 14, wherein the instructions, when executed, further cause the computing system to generate, for each of a plurality of sample unstructured sparse data sets, a sample locality-aware rulebook based on the respective sample unstructured sparse data set, each sample locality-aware rulebook storing spatial neighborhood information for active voxels in the respective sample unstructured sparse data set, generate, for each sample locality-aware rulebook, a set of sparsity attributes representing data sparsity within the respective sample unstructured sparse data set, the sparsity attributes computed over a range of a number of rulebook lines per tile, generate a set of meta-sparsity attributes based on the sets of sparsity attributes, the meta-sparsity attributes representing a data sparsity quality for the plurality of sample unstructured sparse data sets, and determine, for each of a plurality of average receptive field (ARF) values, a tile size and loop order combination for processing, by the compute engine, unstructured sparse data based on the set of meta-sparsity attributes and on network and architecture configuration parameters.
Example 17 includes the at least one non-transitory computer readable storage medium of Example 16, wherein the instructions, when executed, further cause the computing system to generate a table including a plurality of tile size and loop order combinations based on each respective determined tile size and loop order combination and the respective ARF value, wherein the tile size and loop order combinations in the table provide the plurality of predetermined tile size and loop order combinations, wherein each of the plurality of ARF values may be computed based on the respective sample locality-aware rulebook.
Example 18 includes the at least one non-transitory computer readable storage medium of Example 17, wherein each respective tile size and loop order combination is determined based on minimizing the number of data accesses required for the compute engine to process an unstructured sparse data set.
Example 19 includes the at least one non-transitory computer readable storage medium of any of Examples 14-18, wherein each of the sample unstructured sparse data sets is a three-dimensional (3D) pointcloud data set, wherein the input unstructured sparse data set is a 3D pointcloud data set, wherein the locality-aware rulebook and each sample locality-aware rulebook is generated from a one-dimensional (1D) compressed data set that includes the coordinates of active voxels in the respective unstructured sparse data set, wherein the sparsity attributes encode local sparsity structure in the form of memory-size requirements and data-accesses over a range of region-sizes in the respective sample unstructured sparse data set, and wherein the tile size includes a number of rulebook lines per tile and the loop order includes one of an input-stationary walk pattern, an output-stationary walk pattern, or a weight-stationary walk pattern.
Example 20 includes a method of optimizing sparse data processing, comprising generating a locality-aware rulebook based on an input unstructured sparse data set, the locality-aware rulebook storing spatial neighborhood information for active voxels in the input unstructured sparse data set, computing an average receptive field (ARF) value based on the locality-aware rulebook, and determining, from a plurality of predetermined tile size and loop order combinations, a tile size and loop order combination for processing the unstructured sparse data based on the computed ARF value, wherein the plurality of predetermined tile size and loop order combinations have been derived based on data sparsity attributes, wherein the locality-aware rulebook and the tile size and loop order combination are provided to a compute engine, the compute engine to process the unstructured sparse data using the locality-aware rulebook and the tile size and loop order combination.
Example 21 includes the method of Example 20, wherein each line of the locality-aware rulebook comprises one of an index of an input voxel representing an offset address for the input voxel data, a bitmask indicating active output voxels in an output response field of the input voxel and bit-locations of convolution weights to be applied, and indices of output voxels in the output response field, or an index of an output voxel representing an offset address for the output voxel data, a bitmask indicating active input voxels in an input receptive field of the output voxel and bit-locations of convolution weights to be applied, and indices of input voxels in the input receptive field.
Example 22 includes the method of Example 20, further comprising generating, for each of a plurality of sample unstructured sparse data sets, a sample locality-aware rulebook based on the respective sample unstructured sparse data set, each sample locality-aware rulebook storing spatial neighborhood information for active voxels in the respective sample unstructured sparse data set, generating, for each sample locality-aware rulebook, a set of sparsity attributes representing data sparsity within the respective sample unstructured sparse data set, the sparsity attributes computed over a range of a number of rulebook lines per tile, generating a set of meta-sparsity attributes based on the sets of sparsity attributes, the meta-sparsity attributes representing a data sparsity quality for the plurality of sample unstructured sparse data sets, and determining, for each of a plurality of average receptive field (ARF) values, a tile size and loop order combination for processing, by the compute engine, unstructured sparse data based on the set of meta-sparsity attributes and on network and architecture configuration parameters.
Example 23 includes the method of Example 22, further comprising generating a table including a plurality of tile size and loop order combinations based on each respective determined tile size and loop order combination and the respective ARF value, wherein the tile size and loop order combinations in the table provide the plurality of predetermined tile size and loop order combinations, wherein each of the plurality of ARF values may be computed based on the respective sample locality-aware rulebook.
Example 24 includes the method of Example 23, wherein each respective tile size and loop order combination is determined based on minimizing the number of data accesses required for the compute engine to process an unstructured sparse data set.
Example 25 includes the method of any of Examples 20-24, wherein each of the sample unstructured sparse data sets is a three-dimensional (3D) pointcloud data set, wherein the input unstructured sparse data set is a 3D pointcloud data set, wherein the locality-aware rulebook and each sample locality-aware rulebook is generated from a one-dimensional (1D) compressed data set that includes the coordinates of active voxels in the respective unstructured sparse data set, wherein the sparsity attributes encode local sparsity structure in the form of memory-size requirements and data-accesses over a range of region-sizes in the respective sample unstructured sparse data set, and wherein the tile size includes a number of rulebook lines per tile and the loop order includes one of an input-stationary walk pattern, an output-stationary walk pattern, or a weight-stationary walk pattern.
Example 26 includes an apparatus comprising means for performing the method of any of Examples 20-24.
Thus, technology described herein improves the performance of computing systems through data acceleration and optimization techniques providing faster, more efficient and more accurate processing of 3D pointcloud data. For example, the technology may achieve up to 90% savings in data accesses and 3× improvements in compute utilization (lower runtimes, lower latency) compared to CPU implementations, with improvements that are consistent across datasets (with varying sparsity) and over several architecture configurations (memory, compute-size/bandwidth ratios). The technology includes an improved rulebook metadata structure that encapsulates all neighborhood voxels in a receptive field or response field and is more compressed than other rulebooks used in CPU/GPU implementations, requiring approximately half of the memory of other rulebooks, while maintaining approximately the same creation time and overhead compared to such rulebooks. The sparsity-aware optimal dataflow outperforms current non-tile-based implementations with significantly lower data accesses.
Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the platform within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrase “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C.
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.