Contemporary imaging and video applications, especially in the automotive, surveillance, and computer vision domains, may access scattered data in an unpredictable and random manner. Such applications may include object detection algorithms, fine-grained motion-based temporal noise reduction, ultra-low-light imaging, various fine-grained image registration applications, example-based super-resolution techniques, various random-sampling machine learning inference algorithms, and the like. For example, object detection algorithms may include face detection and recognition.
The same numbers are used throughout the disclosure and the figures to reference like components and features. Numbers in the 100 series refer to features originally found in FIG. 1, numbers in the 200 series refer to features originally found in FIG. 2, and so on.
As discussed above, current imaging and video applications may access scattered data in an unpredictable and random manner. In particular, a number of applications may use random sample access when processing image or video data. For example, in video-based or image-based object detection and recognition, an object may be detected in any part of an image. The position of the object in the video or image may be unknown before the image is processed. The detection algorithm typically accesses parts of the image and individual feature samples. Since objects are searched for at different sizes, orientations, and the like, the requirements for random sample access may be very high. Many current systems are therefore forced to work with low frame rates and low resolutions.
In addition, motion compensation algorithms, such as temporal noise reduction algorithms, may use random sample access. For example, motion compensation algorithms may fetch data from previous images based on computed motion vectors. The motion vectors may change per frame and therefore require random access. Increasing image quality requires more fine-grained access. Current systems, however, do not enable such fine-grained random sample access.
Furthermore, machine learning and recognition applications may also use random sample access. For example, sparse matrix projections and sampling optimization methods, such as Markov chain Monte Carlo methods, are common elements of many machine learning and recognition algorithms. Some current face recognition algorithms may be based on a sparse matrix multiplication that requires 8 samples per clock from a 16 KB data space.
Currently, some memory architectures may provide efficient fetching of a group of samples, but not of individual samples, and only under strict conditions. For example, some current architectures may efficiently fetch a group of mutually neighboring samples shaped like a monolithic one-dimensional or two-dimensional block, but not individually scattered samples. In particular, high-performance processors based on vector or SIMD (single instruction, multiple data) instruction sets, such as an image processing unit (IPU), may include such architectures. For example, an IPU's vector processor (VP) may be a programmable SIMD core built to allow flexible firmware and thus an after-silicon answer to application needs. However, current imaging and video use already exceeds 4k resolution at 60 frames per second (FPS) in real time, and future processing may use even larger bandwidth, such as 8k at 60 FPS.
A VP may thus include a high-performance architecture with a memory subsystem, vector data path, and vector instruction set designed to reach a peak of 32 or 64 samples per clock (SPC). However, when the required input samples are scattered around the data space, the memory controller may not be able to profit from any sample grouping, and the performance peak may drop to approximately one SPC, since fetching may drop to just a single sample component per clock cycle. The slowdown in fetching may also slow down data flow and thereby all subsequent processing stages. The sample fetching stage is also quite early in the processing pipe, thus affecting the performance of the entire pipe by an approximate factor of 32 or 64, depending on the parallelism available to the processor.
The present disclosure relates generally to techniques for processing scattered samples. Specifically, the techniques described herein include an apparatus, method, and system for processing scattered data using a high-performance fetch for improved random sample access. An example apparatus includes an address buffer to receive a plurality of vector addresses corresponding to input vector data comprising scattered samples to be processed. The apparatus includes a multi-bank memory to receive the input vector data and send output vector data. The apparatus further includes a memory controller comprising an address scheduler to assign an address to each bank of the multi-bank memory. The techniques described herein thus enable fast access to random sample data scattered around a data space. For example, the data space may be an image, either as originally received or down-scaled by any factor. In particular, the use of multiple memory banks may increase the silicon efficiency of memory, as well as the performance of memory during more complex modes of operation. In some examples, the techniques may include a high-performance fetch approaching 32 samples per clock (SPC), achieved with significantly lower latency and with the possibility to pipeline the read requests, making this memory truly high-performance in a steady state. For example, a typical operation may have a run-in period when the address buffer is filled, followed by the steady state, and then followed by a run-out period in which the last vectors of data are retrieved. A new image processing may go through these three phases, and performance during the steady state may be particularly improved with the present techniques. The techniques may thus also remove the bottleneck in the early fetching stage of an image processing pipeline, allowing the data path and instruction set to number-crunch full vectors of data. In some examples, the architecture of the system may be parametric. For example, the system may have two major parameters fixed at design time: the number of vector addresses NVa and the number of memory banks Nb. The architecture may thus allow trade-offs along two axes: achieved peak performance against latency, and achieved peak performance against cost of implementation (power and area). Faster random data access may enable many new applications. Alternatively, the techniques may enable finer-grained random sample access.
The example system 100 includes an address buffer 102, an address scheduler 104, a multi-bank memory subsystem 106, an address logic 108, and a data output buffer 110. For example, the address buffer 102 and data output buffer 110 may both be first-in, first-out (FIFO) buffers. The address buffer 102 and data output buffer 110 can both store a total of an NVa number of vector addresses 112 with an NWAY number of samples per vector word 114, where NWAY refers to the parallelism available to the processor. A vector word, as used herein, thus refers to an NWAY number of samples, each sample having a predetermined number of bits. The multi-bank memory subsystem 106 includes a number Nb of memory banks 116. In some examples, the depth of the buffers, NVa, may be set to a value of 4, and the number of memory banks may be set to a value of 16.
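As a rough illustration of these design-time parameters, consider the following Python sketch of a configuration object (a hypothetical software model; names such as FetchConfig are illustrative and not part of the disclosure):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FetchConfig:
    """Hypothetical model of the fetch unit's design-time parameters."""
    nway: int = 32  # samples per vector word (SIMD width of the vector processor)
    nva: int = 4    # depth of the address and data buffers, in vector addresses
    nb: int = 16    # number of memory banks

    @property
    def buffer_capacity(self) -> int:
        # Both the address buffer and the data output buffer hold
        # NWAY x NVa samples in total.
        return self.nway * self.nva

cfg = FetchConfig()
assert cfg.buffer_capacity == 128  # 32 samples x 4 vector addresses
```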
As shown in FIG. 1, the vector data 120 may be stored within the multi-bank memory subsystem 106. The vector data 120 may be stored in and read from the multi-bank memory subsystem 106 using a memory controller (not shown) including the address scheduler 104 and the address logic 108. In some examples, the memory controller may be a hardware device that features a sophisticated reading and writing scheme with a built-in address history. The memory controller may thus store samples from the vector data 120 in the Nb memory banks 116 of the multi-bank memory subsystem 106. In some examples, the memory banks 116 may be one sample wide. In some examples, the memory banks 116 may be multiple samples wide. In some examples, the address scheduler 104 may be a simple scheduler. For example, the address scheduler 104 may attempt to schedule each vector address to a corresponding memory bank, and if the bank is already occupied, the address may be scheduled for the next clock cycle. In some examples, the address scheduler 104 may use skewed addressing, as described in greater detail below.
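A minimal sketch of such a simple scheduler in Python, assuming samples are interleaved across banks by address modulo Nb (the disclosure does not fix a particular bank mapping):

```python
def schedule_cycle(pending, nb):
    """Greedily issue at most one pending scalar address per memory bank in
    the current clock cycle; conflicting addresses wait for a later cycle."""
    claimed = {}   # bank index -> address issued this cycle
    deferred = []  # addresses that hit an already-occupied bank
    for addr in pending:
        bank = addr % nb  # assumed modulo-interleaved bank mapping
        if bank in claimed:
            deferred.append(addr)
        else:
            claimed[bank] = addr
    return list(claimed.values()), deferred

# Addresses 3, 19, and 35 all map to bank 3 when nb=16, so they serialize
# over three cycles, while 7 and 8 are issued in the first cycle.
issued, waiting = schedule_cycle([3, 19, 35, 7, 8], nb=16)
assert sorted(issued) == [3, 7, 8] and waiting == [19, 35]
```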
The address logic 108 may then read vector data from the multi-bank memory subsystem 106 and write the vector data to the data buffer 110. As mentioned above, the data buffer 110 may also have a capacity of NWAY×NVa samples. The data buffer 110 may then output vector data 122 for further processing.
The system 100 may thus deliver a vector of NWAY samples within a minimal number of clock cycles. In some examples, the memory subsystem 106 may be designed to accommodate an average number of read cycles. For example, the number of clock cycles actually used to output a data vector may follow some distribution based on the randomness of the input vector data. Therefore, the performance and latency of the system 100 may be defined to accommodate average numbers. For example, the number of physically instantiated memory banks 116 may be fixed by design, as well as the depth 114 of the address buffer 102 and the data buffer 110. However, in some examples, at compile time or run time, the actually used depth 114 of the address buffer 102 can be made smaller to minimize latency. For example, the depth 114 can be adjusted to be smaller according to any applied use case. In some examples, depth 114 adjustment may be implemented as part of the instruction set. For example, a flexible time-shape instruction may be used. In some examples, depth 114 adjustment may be implemented using several read instructions following different time shapes. For example, in some cases, using flexible time-shape instructions may not be possible due to limitations of very long instruction word (VLIW) tools. In some examples, the VLIW limitations may be overcome as described below. In some examples, flexible time shapes can be used with CPUs, GPUs, and DSPs in general, where hardware scheduling can provide out-of-order execution. Further, in cases where microthreading is available, out-of-order execution may also be possible. For example, a processor may switch to another thread, providing additional time for the memory to collect data.
In some examples, the specified time shape of an instruction may not match the actual vector data being processed. For example, three scenarios may be possible given a particular distribution of random samples. In the first scenario, the memory subsystem 106 may deliver the vector data exactly according to the specified time shape. Thus, the time shape of the instruction may match the randomness of the distribution exactly. In the second scenario, the memory subsystem 106 may deliver the output vector data in less than the specified number of clock cycles. In this scenario, the memory may wait for the specified time shape and deliver the output vector at the requested clock cycle. In the third scenario, the memory subsystem may use more than the specified number of clock cycles to deliver the output vector. In this scenario, the memory can issue a stall signal until the system 100 is ready to deliver the full vector of data. The processing of the output vector at further stages may thus be delayed. In some examples, the system 100 may output a partial vector instead of issuing the stall signal. Thus, the system 100 may be configured to operate in either a fixed-schedule or a fixed-performance mode, as described at greater length below.
The diagram of FIG. 1 is not intended to indicate that the example system 100 is to include all of the components shown in FIG. 1.
The example system 200 includes a data segment 202 that contains a number of samples 204 that are to be loaded. For example, the data segment 202 may be a region of interest in an image. A set of memory banks 206 is to store the data segment 202, which includes samples 204, some of which form groups 208.
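The skewing illustrated by the figures is not reproduced here, but one common form of skewed addressing rotates the bank index per row so that vertical and rectangular groups of samples spread across different banks; a sketch under that assumption:

```python
def skewed_bank(x, y, width, nb):
    """Skewed bank mapping: add the row index to the linear sample address so
    that a vertical run of samples does not pile up in a single bank."""
    linear = y * width + x
    return (linear + y) % nb  # the '+ y' term is the per-row skew

# Without the skew, samples (x=2, y=0..3) of a 16-wide segment with nb=16
# would all land in bank 2; with the skew they occupy banks 2, 3, 4, and 5.
assert [skewed_bank(2, y, width=16, nb=16) for y in range(4)] == [2, 3, 4, 5]
```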
The diagram of FIG. 2 is not intended to indicate that the example system 200 is to include all of the components shown in FIG. 2.
In some examples, an address history 402 may be included to enable hardware address scheduling. For example, instead of trying to place each address immediately into a memory bank read, a delay of N steps may be introduced to enable denser reading based on a larger number of address and memory bank combinations. In some examples, by taking into account the values of the addresses that were submitted for the current access, and also those within the recent history, the address scheduler 404 may further increase reading efficiency. For example, the address scheduler 404 may thus enable the memory to perform reads from all the banks available.
In some examples, the amount of time (or number of clock cycles) required to fetch the full NWAY vector of samples matched to a vector address may not be constant, and may depend on the actual content of the vector data and the current location of the samples within the region of interest (ROI). However, the number of clock cycles used to fetch all samples within a vector may be predictable within some margins, assuming truly random data.
In some examples, vector addresses may be supplied in NWAY groups. If the number of vectors of addresses is denoted by NVa, then the total number of scalar addresses may be calculated using the equation:
Na = NVa * NWAY (Eq. 1)
where NWAY is equal to the width of the SIMD of the vector processor. In some examples, the vector processor may have a setting of NWAY=32. In other examples, the vector processor may have NWAY=64, NWAY=16, or some other value. The NVa vector addresses may be used to generate a pool of Na addresses 408 that can be entered into the address scheduler 404 in order to pick the Nb 410 addresses 406 that can be submitted to the Nb individual memory banks. For example, the address scheduler 404 may determine a number Nb of scalar addresses that can be read in one clock cycle without bank conflicts. In this way, the address scheduler 404 may increase the use of parallel reading from the Nb memory banks. In some examples, the longer the history (a larger Na, and thereby a larger NVa), and the more banks to operate on (a larger Nb), the better the schedule the address scheduler 404 may be able to generate.
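A sketch of how such pooled scheduling might behave, as a simulation rather than the hardware scheduler itself, reusing the modulo bank mapping assumed earlier:

```python
import random
from collections import deque

def pooled_schedule_cycles(scalar_addrs, nva, nway, nb):
    """Count the clock cycles needed to issue every address when the scheduler
    picks conflict-free addresses from a pool of up to Na = nva*nway entries."""
    incoming = deque(scalar_addrs)
    pool, cycles = [], 0
    while incoming or pool:
        while incoming and len(pool) < nva * nway:
            pool.append(incoming.popleft())  # refill the address pool/history
        claimed, remaining = set(), []
        for addr in pool:
            bank = addr % nb  # assumed modulo-interleaved bank mapping
            if bank in claimed:
                remaining.append(addr)  # bank already used this cycle
            else:
                claimed.add(bank)
        pool = remaining
        cycles += 1
    return cycles

random.seed(0)
addrs = [random.randrange(1 << 14) for _ in range(64)]
shallow = pooled_schedule_cycles(addrs, nva=1, nway=32, nb=16)
deep = pooled_schedule_cycles(addrs, nva=4, nway=32, nb=16)
assert deep <= shallow  # a deeper pool can only densify the schedule
```

Setting nva=1 approximates operation with only a single vector address visible at a time, mirroring the comparison of the reading schedules 500A and 500B discussed below.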
The diagram of FIG. 4 is not intended to indicate that the example address scheduler 404 is to include all of the components shown in FIG. 4.
In the example of 500A, a vector processor may process scattered data without any address buffer for increasing the schedule density. A larger number of clock cycles 504 is therefore used to read the given number of data points 506, as a smaller average number of data points 506 is read in each clock cycle 504. For example, 500A shows that the 64 random addresses 506 are read within 16 clock cycles 504.
In the example of 500B, the vector processor may include an address buffer to increase the schedule density. In some examples, it may take multiple clock cycles to transfer the address data. However, this delay may not be very important, since the transfer may happen in parallel while the previous data is being fetched. In this example, the same 64 addresses 506 may be read within 7 clock cycles 504 in the resulting compressed reading schedule. Although both examples 500A and 500B read the 64 addresses in fewer cycles than the worst-case scenario of 64 cycles at one address per cycle, the example of 500B is able to read the same number of addresses 506 in less than half the clock cycles 504 of example 500A. Therefore, the use of an address buffer, such as the address buffer 102 described with respect to FIG. 1, may significantly compress the reading schedule.
The graph 600 shows that the average number of samples per clock (SPC) 604 grows nearly linearly with an increasing number Nb of memory banks 602. For system configurations that are identical except for the number of memory banks 602, with Nb=8 and Nb=16, the achieved average SPC performance 604 can be SPC=6 and SPC=11, respectively. Thus, a further performance improvement of nearly 2× can be achieved by doubling the number of memory banks 602. Therefore, multiple memory banks may be used to increase the average number of samples read per clock.
The system 700 includes a buffer of input addresses 702, a shuffling stage 704, and an output register 706 including output addresses. The buffers each hold an NVa number of vector addresses 708.
An alternative method may thus be used to avoid having to process the sample data through a shuffle stage 704. For example, while scheduling each sample, the position of each sample within a vector address may be recorded. Recording the sample position may include two components. First, a few bits may be used to record the vector address to which the sample belongs. For example, the number of bits may be log2(NVa). In addition, a few bits may be used to record the location of the sample within the NWAY samples. For example, the number of bits may be log2(NWAY). Together, these bits compose the address within the output stage of the memory where each of the Nb samples is to be written. In some examples, these Nb samples may then be written to a register file consisting of NWAY*NVa samples. In order to write all Nb samples in parallel, the register file may have a total of Nb ports. Given that the capacity of the register file is quite small (NVa*NWAY), the area paid for such an implementation is limited. For example, with NVa=4 and NWAY=32, such an implementation may be equal to 4*32=128 registers. Thus, the additional cost of implementing a hardware shuffling stage 704 may be avoided by using a few bits to record and keep track of sample locations.
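A sketch of this bookkeeping, with log2(NVa) bits naming the vector and log2(NWAY) bits naming the lane (the exact bit layout chosen here is an assumption):

```python
NVA, NWAY = 4, 32
LANE_BITS = NWAY.bit_length() - 1  # log2(NWAY) = 5 bits for the lane
VEC_BITS = NVA.bit_length() - 1    # log2(NVa)  = 2 bits for the vector address

def make_tag(vec_idx, lane):
    """Compose the write address into the NVa*NWAY-entry output register file."""
    assert 0 <= vec_idx < NVA and 0 <= lane < NWAY
    return (vec_idx << LANE_BITS) | lane

def split_tag(tag):
    """Recover (vector index, lane) from a write tag."""
    return tag >> LANE_BITS, tag & (NWAY - 1)

tag = make_tag(vec_idx=2, lane=17)
assert split_tag(tag) == (2, 17)
assert tag < (1 << (VEC_BITS + LANE_BITS))  # fits the 4*32 = 128-entry file
```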
The diagram of FIG. 7 is not intended to indicate that the example system 700 is to include all of the components shown in FIG. 7.
The example system 800 includes an address buffer 802, a data buffer 804, an address scheduler 806, a memory logic 808, and a memory subsystem 810 with a number of memory banks 812. The address buffer 802 receives a vector address 814 and the data buffer 804 receives vector data 816.
As discussed above and below, various methods may be used for reading from a set of random locations. However, similar principles can also be applied to writing data to random locations. For example, NVa address vectors and NVa data vectors may be supplied to the memory. Roughly the same elements can thus be used for writing data to random memory locations.
For example, the address scheduler 806 may similarly use address scheduling to determine the way of accessing the multiple Nb memory banks 812. Corresponding data elements may then be written to the memory banks 812 via the memory logic 808, based on the corresponding schedule. The read operation may use the output data buffer and the address logic to unpack the data that is read from the memory banks 812. For the write operation, the same amount of data may be kept in the input data buffer 804, as depicted in FIG. 8.
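A sketch of the symmetric write scheduling, in which each pending entry carries its data sample along with its address (same modulo bank mapping assumption as in the earlier sketches):

```python
def schedule_write_cycle(pending, nb):
    """Issue at most one (address, sample) pair per bank in this clock cycle;
    pairs that hit an occupied bank are deferred to a later cycle."""
    claimed, deferred = {}, []
    for addr, sample in pending:
        bank = addr % nb  # assumed modulo-interleaved bank mapping
        if bank in claimed:
            deferred.append((addr, sample))
        else:
            claimed[bank] = (addr, sample)
    return list(claimed.values()), deferred

writes = [(3, 0xAA), (19, 0xBB), (8, 0xCC)]  # addresses 3 and 19 share bank 3
cycle1, leftover = schedule_write_cycle(writes, nb=16)
assert leftover == [(19, 0xBB)]  # the conflicting write waits one cycle
```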
The diagram of FIG. 8 is not intended to indicate that the example system 800 is to include all of the components shown in FIG. 8.
The example memory subsystem 900 is receiving a vector address 902 and vector data 904, and outputting vector data 906 and scalar data 908. In some examples, depending on the selected type of the memory, the memory subsystem 900 will have slightly different interfaces. For example, the two types of memory may be fixed-schedule memory and fixed-performance memory.
In both types of memory, a vector address input 902 may have a width of NWAY. The vector address input 902 may include addresses of each of the NWAY requested samples in the vector data input 904. In some examples, the input vector addresses 902 may be provided as byte addresses. In some examples, the input vector addresses 902 may be provided as x and y offsets relative to a reference (0, 0), or the top-left sample, within the region of interest.
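For the (x, y) offset form, the conversion to a linear byte address within the region of interest could look like the following sketch (row-major layout and one byte per sample are assumptions, not requirements of the disclosure):

```python
def roi_offset_to_address(x, y, roi_base, roi_stride):
    """Convert an (x, y) offset, relative to the top-left (0, 0) sample of the
    region of interest, into a linear byte address (row-major, 1 byte/sample)."""
    return roi_base + y * roi_stride + x

# Sample at column 5 of row 3 in an ROI whose rows are 256 bytes apart.
assert roi_offset_to_address(x=5, y=3, roi_base=0x1000, roi_stride=256) == 0x1305
```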
In addition, both types of memory may receive a vector data input 904. The vector data input 904 may also have a width of NWAY. The vector data input 904 may include data samples to be written to the memory at specified memory locations.
Both memory types may further output a vector data output 906. The vector data output 906 may also have a width of NWAY. The vector data output 906 may include data samples that are read out of the memory from specified address locations.
However, the fixed schedule type of memory may also have an additional scalar output 908. The scalar output 908 may be used to indicate how many valid samples are provided at the output of the memory.
The interface of the memory may thus be defined as two inputs and one or two outputs, depending on the type of memory. Within the context of the vector nature of the vector processor (VP) of the image processing unit (IPU), the address vector 902 may be provided as a vector-shaped input. The second input is the vector of samples 904 to be written into the memory subsystem 900. The output 906 is the vector of read samples, corresponding to the addresses specified at the input 902. The operation of these two types of memories is described at greater length below.
The diagram of FIG. 9 is not intended to indicate that the example memory subsystem 900 is to include all of the components shown in FIG. 9.
In some examples, the internal microarchitecture of the memory 1100 may be such that the samples are stored across Nb individual memory banks, where each location contains Np samples. Thus, the memory may be called a multi-sample, multi-bank memory. The vector addresses 1006 of the requested samples may be provided to a memory controller (not shown). The memory controller may record a history of requests, maintaining at all times Na sample addresses, corresponding to NVa vector addresses. In some examples, the data to be read may be localized, or grouped, and the likelihood of fetching a group of valid samples within the same address may thus be larger. Therefore, multiple memory banks coupled with multiple samples per bank may enable better read coverage of such groups of samples scattered around a data region of interest (ROI). When the samples to be fetched are scattered around the ROI, covering them with several addresses, each containing Np samples, may be much more efficient than using one address per sample.
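A sketch of why multiple samples per location improve coverage of grouped data: pending sample addresses can be binned by the (bank, row) word that holds them, and one access to a bank then covers every pending sample in that word (a word-interleaved mapping is an assumption):

```python
def cover_with_words(addrs, nb, np_samples):
    """Bin pending sample addresses by the (bank, row) word containing them;
    one read of a word returns np_samples adjacent samples at once."""
    words = {}
    for addr in addrs:
        word = addr // np_samples      # which np_samples-wide word holds this sample
        key = (word % nb, word // nb)  # (bank index, row within the bank)
        words.setdefault(key, []).append(addr)
    return words

group = [100, 101, 102, 103, 640]  # four neighboring samples plus one outlier
words = cover_with_words(group, nb=16, np_samples=4)
assert len(words) == 2  # two word reads cover all five samples
```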
The diagram of FIG. 11 is not intended to indicate that the example memory 1100 is to include all of the components shown in FIG. 11.
The chart 1200 shows three example data access patterns: a random pattern 1202, a random block pattern 1204, and random groups 1206. As used herein, random groups refer to different irregular shapes in which samples are close to each other. The vertical axis of the chart 1200 represents performance as average samples per clock (SPC).
The chart 1200 shows the performance of three example memory types: single-sample wide memory 1210, multi-sample wide memory 1212 with four-sample wide memory banks and without scheduling, and multi-sample wide memory with scheduling 1214. In the example memory types, the word width (or NWAY of the vector processor) is set to 32 samples, and the image size is set to 256×256 samples. Moreover, skewing of data is enabled in order to allow the random block pattern 1204 and the random groups 1206 to benefit from the skewing feature. The depth of the buffers (NVa) in each example is 4, and each buffer has 32 addresses available. The three example memory types 1210, 1212, 1214 are provided as examples, and different scenarios are possible and described herein.
The multi-sample width configurations 1300A and 1300B of FIG. 13 are likewise provided as examples, and different configurations are possible.
At block 1402, the memory controller receives a load instruction. For example, the load instruction may have a time shape indicating the number of clock cycles to complete the load instruction. In some examples, the time shape of the load instruction may be flexible. For example, the time shape may be configurable, such that different time shapes may be selected depending on factors including performance and latency. In some examples, the memory controller may reduce a depth of the address buffer via a flexible time shape instruction.
At block 1404, the memory controller receives input vector addresses and corresponding vector data comprising scattered samples. For example, the scattered samples may be randomly scattered, grouped in blocks, or organized in random groups.
At block 1406, the memory controller processes an address buffer based on a time shape of the load instruction. For example, if the latency of a function is set to the average latency, then the processor may expect data after that number of cycles. If the data is not there after that number of cycles, then the processor will have to wait, and a stall may result in the processing pipeline. In some examples, the memory controller may perform address skewing to increase efficiency and to provide faster coverage of different 2D shapes. For example, the 2D shapes may be rectangles and squares. In some examples, address scheduling may be implemented based on the time shape of the load instruction. In some examples, the memory controller may process multiple samples in parallel. For example, the memory controller may assign addresses to a multi-bank memory.
At block 1408, the memory controller outputs a partial vector. For example, a subset of the total scattered samples in the input vector data may be output after a predetermined number of clock cycles has completed. Thus, a vector processor may have some output vector data to process at regular intervals. The memory controller may output additional partial vectors at the regular intervals for the vector processor to process.
At block 1410, the memory controller outputs a scalar value indicating the number of valid samples in the partial vector. For example, the number of valid samples in the partial vector may depend on the randomness of the input vector data and the grouping of the input vector data. Since the latency of a fixed-schedule memory is fixed, the method 1400 may be used when latency is more important than data coherency.
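A sketch tying blocks 1402 through 1410 together: run for exactly the cycle budget implied by the time shape, then emit the partial vector and the scalar valid count (a hypothetical simulation using the same modulo bank mapping as the earlier sketches):

```python
def fixed_schedule_read(pending, nb, cycle_budget):
    """Fixed-schedule mode: spend exactly cycle_budget clock cycles, then
    output the partial vector fetched so far and a count of valid samples."""
    fetched = []
    for _ in range(cycle_budget):
        claimed, deferred = {}, []
        for addr in pending:
            bank = addr % nb  # assumed modulo-interleaved bank mapping
            if bank in claimed:
                deferred.append(addr)  # bank conflict: retry in a later cycle
            else:
                claimed[bank] = addr
        fetched.extend(claimed.values())
        pending = deferred
        if not pending:
            break
    return fetched, len(fetched), pending  # partial vector, valid count, leftovers

partial, valid, leftover = fixed_schedule_read([3, 19, 35, 7], nb=16, cycle_budget=2)
assert valid == 3 and leftover == [35]  # 3 and 7 in cycle 1, 19 in cycle 2
```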
This process flow diagram is not intended to indicate that the blocks of the example process 1400 are to be executed in any particular order, or that all of the blocks are to be included in every case. Further, any number of additional blocks not shown may be included within the example process 1400, depending on the details of the specific implementation. For example, for the fixed-schedule type of memory, the memory controller may provide for data coherency during writing. Thus, all samples from the input vector may be written to the memory subsystem. In order to enforce that within the fixed time shape, constraints on the types of accesses may be used. Accesses that do not result in a bank conflict may be allowed, while accesses that result in bank conflicts may not be allowed. Since the memory may be based on Nb banks with one element per bank, all 1D and 2D write accesses are possible, provided that the width of the region is a power-of-two fraction of Nb. In some examples, for a given NWAY and number of memory banks Nb, the number of clock cycles used for a write action may be calculated using the equation:
Nr_write_cycles = NWAY / Nb (Eq. 2)
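For example, under the assumption of a conflict-free write, an NWAY=32 vector written across Nb=16 single-sample banks would take 32/16 = 2 clock cycles.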
At block 1502, the memory controller receives a target number of samples to be output. In some examples, the target number of samples may be the number of samples that were input. In some examples, the target number of samples may be a fraction of the number of samples that were input. For example, in order to reduce latency, the target number of samples may be ½ or ¼ of the total number of input samples. In some examples, the number of samples to be output can be specified by user input. In some examples, the number of samples to be output can be a vector size NWAY. In some examples, other values may be used if latency is more important than the number of samples. For example, an NWAY/2 number of samples may be output. In some examples, the values may be limited to powers of two for easier implementation.
At block 1504, the memory controller receives input vector addresses and corresponding vector data comprising scattered samples. For example, the vector addresses of the scattered samples may be randomly scattered, grouped in blocks, or organized in random groups.
At block 1506, the memory controller processes an address buffer based on the predetermined number of samples to be output. For example, the address buffer may be a FIFO buffer. In some examples, the memory controller may process the address buffer until the specified number of samples is produced at the output. In some examples, the memory controller may statistically calculate the latency to deliver the requested number of samples. The memory controller may then predict the average throughput and performance of this memory, and thus of subsequent components within the image processing pipeline.
At block 1508, the memory controller outputs the predetermined number of samples. For example, the samples may then be processed by additional stages of an image processing pipeline.
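By contrast with the fixed-schedule flow of method 1400, a fixed-performance read keeps issuing cycles until the target sample count is reached, so the cycle count is what varies; a sketch under the same bank mapping assumption:

```python
def fixed_performance_read(pending, nb, target):
    """Fixed-performance mode: issue cycles until the target number of samples
    has been produced; the latency (cycle count) depends on the data."""
    fetched, cycles = [], 0
    while len(fetched) < target and pending:
        claimed, deferred = {}, []
        for addr in pending:
            bank = addr % nb  # assumed modulo-interleaved bank mapping
            if bank in claimed:
                deferred.append(addr)
            else:
                claimed[bank] = addr
        fetched.extend(claimed.values())
        pending = deferred
        cycles += 1
    return fetched[:target], cycles

samples, cycles = fixed_performance_read([3, 19, 35, 7], nb=16, target=4)
assert len(samples) == 4 and cycles == 3  # the three bank-3 conflicts serialize
```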
This process flow diagram is not intended to indicate that the blocks of the example process 1500 are to be executed in any particular order, or that all of the blocks are to be included in every case. Further, any number of additional blocks not shown may be included within the example process 1500, depending on the details of the specific implementation.
Referring now to FIG. 16, a block diagram of an example computing device 1600 that can process scattered data is shown. The computing device 1600 includes a central processing unit (CPU) 1602, a memory device 1604, and a bus 1606 coupling the CPU 1602 to the memory device 1604.
The memory device 1604 can include random access memory (RAM), read only memory (ROM), flash memory, or any other suitable memory systems. For example, the memory device 1604 may include dynamic random access memory (DRAM). The memory device 1604 may include device drivers 1610 that are configured to execute the instructions for device discovery. The device drivers 1610 may be software, an application program, application code, or the like.
The computing device 1600 may also include a graphics processing unit (GPU) 1608. As shown, the CPU 1602 may be coupled through the bus 1606 to the GPU 1608. The GPU 1608 may be configured to perform any number of graphics operations within the computing device 1600. For example, the GPU 1608 may be configured to render or manipulate graphics images, graphics frames, videos, or the like, to be displayed to a user of the computing device 1600.
The CPU 1602 may also be connected through the bus 1606 to an input/output (I/O) device interface 1612 configured to connect the computing device 1600 to one or more I/O devices 1614. The I/O devices 1614 may include, for example, a keyboard and a pointing device, wherein the pointing device may include a touchpad or a touchscreen, among others. The I/O devices 1614 may be built-in components of the computing device 1600, or may be devices that are externally connected to the computing device 1600. In some examples, the memory 1604 may be communicatively coupled to I/O devices 1614 through direct memory access (DMA).
The CPU 1602 may also be linked through the bus 1606 to a display interface 1616 configured to connect the computing device 1600 to a display device 1618. The display device 1618 may include a display screen that is a built-in component of the computing device 1600. The display device 1618 may also include a computer monitor, television, or projector, among others, that is internal to or externally connected to the computing device 1600.
The computing device 1600 also includes a storage device 1620. The storage device 1620 is a physical memory such as a hard drive, an optical drive, a thumbdrive, an array of drives, a solid-state drive, or any combinations thereof. The storage device 1620 may also include remote storage drives.
The computing device 1600 may also include a network interface controller (NIC) 1622. The NIC 1622 may be configured to connect the computing device 1600 through the bus 1606 to a network 1624. The network 1624 may be a wide area network (WAN), local area network (LAN), or the Internet, among others. In some examples, the device may communicate with other devices through a wireless technology. For example, the device may communicate with other devices via a wireless local area network connection. In some examples, the device may connect and communicate with other devices via Bluetooth® or similar technology.
The computing device 1600 further includes an image processing unit 1626. For example, the image processing unit 1626 may include an image processing pipeline. The pipeline may include a number of processing stages. In some examples, the stages may process frames in parallel. For example, the pipeline may include an enhanced prefetch stage for efficient reading of scattered data in images. The image processing unit 1626 may further include a vector processor 1628. For example, the vector processor may be capable of processing an NWAY number of samples in parallel. The image processing unit 1626 may further include a multi-bank memory 1630. In some examples, the multi-bank memory may include a number of memory banks with single-sample widths. In some examples, the multi-bank memory may include memory banks with multi-sample widths. The image processing unit 1626 may also include a memory controller 1632. In some examples, the memory controller may include an address scheduler 1634 to schedule the storing of addresses into the multi-bank memory 1630. In some examples, the memory controller may include an address history of previously stored addresses. For example, the memory controller may use the address history when scheduling addresses. In some examples, the scheduler may further include skewing logic to perform skewing when scheduling the addresses.
The block diagram of FIG. 16 is not intended to indicate that the computing device 1600 is to include all of the components shown in FIG. 16.
The various software components discussed herein may be stored on one or more computer readable media 1700, as indicated in FIG. 17.
The block diagram of FIG. 17 is not intended to indicate that the computer readable media 1700 are to include all of the components shown in FIG. 17.
Example 1 is an apparatus for processing scattered data. The apparatus includes an address buffer to receive a plurality of vector addresses corresponding to input vector data including scattered samples to be processed. The apparatus also includes a multi-bank memory to receive the input vector data and send output vector data. The apparatus further includes a memory controller including an address scheduler to assign an address to each bank of the multi-bank memory.
Example 2 includes the apparatus of example 1, including or excluding optional features. In this example, the multi-bank memory includes single-sample wide memory banks.
Example 3 includes the apparatus of any one of examples 1 to 2, including or excluding optional features. In this example, the multi-bank memory includes multi-sample wide memory banks.
Example 4 includes the apparatus of any one of examples 1 to 3, including or excluding optional features. In this example, the multi-bank memory includes skewed addressing.
Example 5 includes the apparatus of any one of examples 1 to 4, including or excluding optional features. In this example, the plurality of vector addresses include random vector addresses.
Example 6 includes the apparatus of any one of examples 1 to 5, including or excluding optional features. In this example, the plurality of vector addresses include pseudo-random vector addresses.
Example 7 includes the apparatus of any one of examples 1 to 6, including or excluding optional features. In this example, the multi-bank memory includes a number of memory banks corresponding to a number of samples that can be processed in parallel by an associated vector processor.
Example 8 includes the apparatus of any one of examples 1 to 7, including or excluding optional features. In this example, the apparatus is to output a subset of the scattered samples in a predetermined number of cycles.
Example 9 includes the apparatus of any one of examples 1 to 8, including or excluding optional features. In this example, the apparatus is to output a predetermined number of the scattered samples.
Example 10 includes the apparatus of any one of examples 1 to 9, including or excluding optional features. In this example, the apparatus includes an address history, wherein the address scheduler is to assign the address to each bank of the multi-bank memory based on an address history.
Example 11 is a method for processing scattered data. The method includes receiving a load instruction. The method also includes receiving input vector addresses and corresponding vector data including scattered samples. The method further includes processing an address buffer based on a time shape of the load instruction, and outputting a partial vector in a predetermined number of cycles.
Example 12 includes the method of example 11, including or excluding optional features. In this example, the method includes outputting a scalar value indicating a number of valid samples in the partial vector.
Example 13 includes the method of any one of examples 11 to 12, including or excluding optional features. In this example, the method includes reducing a depth of the address buffer via a flexible time shape instruction.
Example 14 includes the method of any one of examples 11 to 13, including or excluding optional features. In this example, the method includes reducing the depth of the address buffer via selecting an alternative time shape instruction.
Example 15 includes the method of any one of examples 11 to 14, including or excluding optional features. In this example, processing the address buffer includes performing address skewing.
Example 16 includes the method of any one of examples 11 to 15, including or excluding optional features. In this example, the load instruction includes a time shape that indicates a number of cycles to complete the load instruction.
Example 17 includes the method of any one of examples 11 to 16, including or excluding optional features. In this example, the partial vector includes a subset of the scattered samples.
Example 18 includes the method of any one of examples 11 to 17, including or excluding optional features. In this example, the method includes outputting additional partial vectors at regular intervals.
Example 19 includes the method of any one of examples 11 to 18, including or excluding optional features. In this example, a number of valid samples in the partial vector depends on the randomness of the input vector data and the grouping of the input vector data.
Example 20 includes the method of any one of examples 11 to 19, including or excluding optional features. In this example, the method includes providing for data coherency during writing.
Example 21 is a method for processing scattered data. The method includes receiving a target number of samples to be output. The method also includes receiving input vector addresses and corresponding vector data including scattered samples. The method further includes processing an address buffer based on the predetermined number of samples to be output. The method also further includes outputting the predetermined number of samples.
Example 22 includes the method of example 21, including or excluding optional features. In this example, the target number of samples to be output includes an NWAY number of samples.
Example 23 includes the method of any one of examples 21 to 22, including or excluding optional features. In this example, the target number of samples to be output includes an NWAY/2 number of samples.
Example 24 includes the method of any one of examples 21 to 23, including or excluding optional features. In this example, processing the address buffer includes processing the address buffer until the specified number of samples is produced at the output.
Example 25 includes the method of any one of examples 21 to 24, including or excluding optional features. In this example, the number of samples to be output is specified by user input.
Example 26 includes the method of any one of examples 21 to 25, including or excluding optional features. In this example, the address buffer includes a first-in, first-out (FIFO) buffer.
Example 27 includes the method of any one of examples 21 to 26, including or excluding optional features. In this example, the method includes processing the predetermined number of samples at an additional stage of an image processing pipeline.
Example 28 includes the method of any one of examples 21 to 27, including or excluding optional features. In this example, the vector addresses of the scattered samples include randomly grouped addresses.
Example 29 includes the method of any one of examples 21 to 28, including or excluding optional features. In this example, the vector addresses of the scattered samples include addresses grouped in blocks.
Example 30 includes the method of any one of examples 21 to 29, including or excluding optional features. In this example, the vector addresses of the scattered samples include randomly scattered addresses.
Example 31 is at least one computer readable medium for processing scattered data having instructions stored therein. The computer-readable medium includes instructions that direct the processor to receive a load instruction and receive input vector addresses and corresponding vector data including scattered samples. The computer-readable medium also includes instructions to process an address buffer based on a time shape of the load instruction. The computer-readable medium further includes instructions to output a partial vector in a predetermined number of clock cycles.
Example 32 includes the computer-readable medium of example 31, including or excluding optional features. In this example, the computer-readable medium includes instructions to output a scalar value indicating a number of valid samples in the partial vector.
Example 33 includes the computer-readable medium of any one of examples 31 to 32, including or excluding optional features. In this example, the computer-readable medium includes instructions to reduce a depth of the address buffer via a flexible time shape instruction.
Example 34 includes the computer-readable medium of any one of examples 31 to 33, including or excluding optional features. In this example, the computer-readable medium includes instructions to reduce the depth of the address buffer via selecting an alternative time shape instruction.
Example 35 includes the computer-readable medium of any one of examples 31 to 34, including or excluding optional features. In this example, the computer-readable medium includes instructions to perform address skewing.
Example 36 includes the computer-readable medium of any one of examples 31 to 35, including or excluding optional features. In this example, the load instruction includes a time shape that indicates a number of cycles to complete the load instruction.
Example 37 includes the computer-readable medium of any one of examples 31 to 36, including or excluding optional features. In this example, the partial vector includes a subset of the scattered samples.
Example 38 includes the computer-readable medium of any one of examples 31 to 37, including or excluding optional features. In this example, the computer-readable medium includes instructions to output additional partial vectors at regular intervals.
Example 39 includes the computer-readable medium of any one of examples 31 to 38, including or excluding optional features. In this example, a number of valid samples in the partial vector depends on the randomness of the input vector data and the grouping of the input vector data.
Example 40 includes the computer-readable medium of any one of examples 31 to 39, including or excluding optional features. In this example, the computer-readable medium includes instructions to provide for data coherency during writing.
Example 41 is a system for processing scattered data. The system includes an address buffer to receive a plurality of vector addresses corresponding to input vector data including scattered samples to be processed. The system also includes a multi-bank memory to receive the input vector data and send output vector data. The system further includes a memory controller including an address scheduler to assign an address to each bank of the multi-bank memory.
Example 42 includes the system of example 41, including or excluding optional features. In this example, the multi-bank memory includes single-sample wide memory banks.
Example 43 includes the system of any one of examples 41 to 42, including or excluding optional features. In this example, the multi-bank memory includes multi-sample wide memory banks.
Example 44 includes the system of any one of examples 41 to 43, including or excluding optional features. In this example, the multi-bank memory includes skewed addressing.
Example 45 includes the system of any one of examples 41 to 44, including or excluding optional features. In this example, the plurality of vector addresses include random vector addresses.
Example 46 includes the system of any one of examples 41 to 45, including or excluding optional features. In this example, the plurality of vector addresses include pseudo-random vector addresses.
Example 47 includes the system of any one of examples 41 to 46, including or excluding optional features. In this example, the multi-bank memory includes a number of memory banks corresponding to a number of samples that can be processed in parallel by an associated vector processor.
Example 48 includes the system of any one of examples 41 to 47, including or excluding optional features. In this example, the apparatus is to output a subset of the scattered samples in a predetermined number of cycles.
Example 49 includes the system of any one of examples 41 to 48, including or excluding optional features. In this example, the apparatus is to output a predetermined number of the scattered samples.
Example 50 includes the system of any one of examples 41 to 49, including or excluding optional features. In this example, the system includes an address history, wherein the address scheduler is to assign the address to each bank of the multi-bank memory based on an address history.
Example 51 is a system for processing scattered data. The system includes means for receiving a plurality of vector addresses corresponding to input vector data including scattered samples to be processed. The system also includes means for receiving the input vector data and sending output vector data. The system further includes means for assigning an address to each bank of the multi-bank memory.
Example 52 includes the system of example 51, including or excluding optional features. In this example, the means for receiving the input vector data include single-sample wide memory banks.
Example 53 includes the system of any one of examples 51 to 52, including or excluding optional features. In this example, the means for receiving the input vector data include multi-sample wide memory banks.
Example 54 includes the system of any one of examples 51 to 53, including or excluding optional features. In this example, the means for receiving the input vector data include skewed addressing.
Example 55 includes the system of any one of examples 51 to 54, including or excluding optional features. In this example, the plurality of vector addresses include random vector addresses.
Example 56 includes the system of any one of examples 51 to 55, including or excluding optional features. In this example, the plurality of vector addresses include pseudo-random vector addresses.
Example 57 includes the system of any one of examples 51 to 56, including or excluding optional features. In this example, the means for receiving the input vector data include a number of memory banks corresponding to a number of samples that can be processed in parallel by an associated vector processor.
Example 58 includes the system of any one of examples 51 to 57, including or excluding optional features. In this example, the system is to output a subset of the scattered samples in a predetermined number of cycles.
Example 59 includes the system of any one of examples 51 to 58, including or excluding optional features. In this example, the system is to output a predetermined number of the scattered samples.
Example 60 includes the system of any one of examples 51 to 59, including or excluding optional features. In this example, the system includes means for assigning the address to each bank of the multi-bank memory based on an address history.
Not all components, features, structures, characteristics, etc. described and illustrated herein need be included in a particular aspect or aspects. If the specification states a component, feature, structure, or characteristic “may”, “might”, “can” or “could” be included, for example, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to “a” or “an” element, that does not mean there is only one of the element. If the specification or claims refer to “an additional” element, that does not preclude there being more than one of the additional element.
It is to be noted that, although some aspects have been described in reference to particular implementations, other implementations are possible according to some aspects. Additionally, the arrangement and/or order of circuit elements or other features illustrated in the drawings and/or described herein need not be arranged in the particular way illustrated and described. Many other arrangements are possible according to some aspects.
In each system shown in a figure, the elements in some cases may each have a same reference number or a different reference number to suggest that the elements represented could be different and/or similar. However, an element may be flexible enough to have different implementations and work with some or all of the systems shown or described herein. The various elements shown in the figures may be the same or different. Which one is referred to as a first element and which is called a second element is arbitrary.
It is to be understood that specifics in the aforementioned examples may be used anywhere in one or more aspects. For instance, all optional features of the computing device described above may also be implemented with respect to either of the methods or the computer-readable medium described herein. Furthermore, although flow diagrams and/or state diagrams may have been used herein to describe aspects, the techniques are not limited to those diagrams or to corresponding descriptions herein. For example, flow need not move through each illustrated box or state or in exactly the same order as illustrated and described herein.
The present techniques are not restricted to the particular details listed herein. Indeed, those skilled in the art having the benefit of this disclosure will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present techniques. Accordingly, it is the following claims including any amendments thereto that define the scope of the present techniques.