Accelerators with on-chip cache locality typically focus on system on chip (SoC) designs that integrate a number of components of a computer or other electronic system into a single chip. The accelerators typically provide acceleration of instructions executed by a processor. The acceleration of instructions results in performance and energy efficiency improvements, for example, for in memory database processes.
Features of the present disclosure are illustrated by way of-example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
For simplicity and illustrative purposes, the present disclosure is described by referring mainly to examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure.
Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but not limited to, the term “including” means including but not limited to. The term “based on” means based at least in part on.
Accelerators that provide acceleration of instructions executed by a processor, for example, for indexing, may be designated as indexing accelerators. Indexing accelerators may include both specialized hardware and dedicated buffers for targeting relatively large data workloads. Such large data workloads may include segments of execution that may not be ideally suited for standard processors due to relatively large amounts of time spent accessing data and waiting on dynamic random-access memory (DRAM) (e.g., time spent chasing pointers through indexing structures). The indexing accelerators may provide an alternate and more energy efficient option for executing these data segments, while also allowing the main processor core to be put into a low power mode.
According to an example, an indexing accelerator that leverages high amounts of memory-level parallelism (MLP) is disclosed herein. The indexing accelerator disclosed herein may generally provide for a processor core to offload database indexing operations. The indexing accelerator disclosed herein may support one or more outstanding memory requests at a time. As described in further detail below, the support for a plurality of outstanding memory requests may be provided, for example, by incorporating MLP support at the indexing accelerator, allowing multiple indexing requests to use the indexing accelerator, allowing execution to move ahead by issuing prefetch requests on-the-fly, and supporting. parallel fetching of multiple probe keys to mitigate and overlap certain index-related on-chip cache miss penalties. The MLP support may allow the indexing accelerator to achieve higher performance than a baseline design without MLP support.
The indexing accelerator disclosed herein may support MLP by generally using inter-query parallelism, or by extracting the parallelism with data structure specific prefetching. MLP may be supported by allowing multiple indexing requests to use the indexing accelerator by including additional configuration registers in the indexing accelerator. Execution of indexing requests for queries may be allowed to move ahead by issuing prefetch requests for a next entry in a hash table chain. Further, the indexing accelerator disclosed herein may support parallel fetching of multiple probe keys to mitigate and overlap certain index-related on-chip cache miss penalties.
The indexing accelerator disclosed herein may generally include a controller that performs the indexing operation, and a relatively small cache data structure used to buffer any data encountered (e.g., touched) during the indexing operation. The controller may handle lookups into an index data structure (e.g., a red-black tree, a B-tree, or a hash table), perform any computation needed for the indexing (e.g., joining between two tables, or matching specific fields), and access to the data being searched for (e.g., database table rows that match a user's query). According to an example, the relatively small cache data structure may be 4-8 KB.
The indexing accelerator disclosed herein may target, for example, data-centric workloads that spend a relatively large amount of time accessing data. Such data-centric workloads may typically include minimal reuse of application data. As a result of the relatively large amounts of data being encountered, the locality of data structure elements (e.g., internal nodes within a tree) may tend to be low, as searches may have a relatively low probability of touching the same data. Data reuse may be useful for metadata such as table headers, schema, and constants that may be used to access raw data or calculate pointer addresses. The buffer of the indexing accelerator disclosed herein may facilitate indexing, for example, by reducing the use of a processor core primary cache for data that may not be used again. The buffer of the indexing accelerator disclosed herein may also capture frequently used metadata in database workloads (e.g., database schema and constants). The indexing accelerator disclosed herein may also provide efficiency for queries that operate on relatively small indexes, for example, by issuing multiple outstanding loads. Therefore, the indexing accelerator disclosed herein may provide acceleration of memory accesses for achieving improvements, for example, in performance and energy efficiency.
The components of the indexing accelerator 100 that perform various other functions in the indexing accelerator 100, may comprise machine readable instructions stored on a non-transitory computer readable medium. In addition, or alternatively, the components of the indexing accelerator 100 may comprise hardware or a combination of machine readable instructions and hardware. For example, the components of the indexing accelerator 100 may be implemented on a SoC.
Referring to
The buffer 122 may be a fully associative cache that stores any data that is encountered during execution of the indexing accelerator 100. For example, the buffer 122 may be a relatively small (e.g., 4-8 KB) fully associative cache. The buffer 122 may provide for the leveraging of spatial and temporal locality.
The indexing accelerator 100 interface may be provided as a library, or as a software (i.e., machine readable instructions) application programming interface (API) of a database management system (DBMS). The indexing accelerator 100 may provide functions such as, for example, index creation and lookup. The library calls may be converted to specific instruction set architecture (ISA) extension instructions to setup and use the indexing accelerator 100. During invocations of the indexing accelerator 100, a processor core 128 executing a thread that is indexing may sleep while the indexing accelerator 100 is performing the indexing operation. Once the indexing operation is complete, the indexing accelerator 100 may push results 130 (e.g., found data in the form of a temporary table) to the processor's cache, and send the processor core 128 an interrupt, allowing the processor core 128 to continue execution. When the indexing accelerator 100 is not being used to index data, the components of the indexing accelerator 100 may be used for other purposes to augment a processor's existing cache hierarchy. Using the indexing accelerator 100 during idle periods may reduce wasted transistors, improve a processor's performance by providing expanded cache capacity, improve a processor's energy consumption by allowing portions of the cache to be shut down, and reduce periods of poor processor utilization by providing a higher level of optimizations.
During idle periods, the request decoder 104, the controller 108, and the computational logic 120 may be shut down, and a processor or higher level cache may be provided access to the buffer 122 of the indexing accelerator 100, For example, the request decoder 104, the controller 108, and the computational logic 120 may individually or in combination provide access to the buffer 122 by the core processor. Moreover, the indexing accelerator 100 may include an internal connector 132 directly connecting the buffer 122 to the processor core 128 for operation during such idle periods.
During idle periods of the indexing accelerator 100, the processor core 128 or higher level cache (e.g., the L2 cache 202 of
Based on receipt of the indication that the indexing tasks are complete, the processor core 128 may direct the indexing accelerator 100 to flush any stale indexing accelerator 100 specific data in the buffer 122. Since the buffer 122 may have been previously used to cache data that the indexing accelerator 100 was using during indexing operations, clean data (e.g., tree nodes within an index, data table tuple entries, etc.) may be flushed out so that the data will not be inadvertently accessed while the indexing accelerator 100 is not being used as an indexing accelerator 100. If dirty or modified data remains in the buffer 122, the buffer 122 may provide for snooping by any lower caches (e.g., the L2 cache 208) such that those lower caches see that modified data and write back that modified data.
After the data has been flushed from the buffer 122, the controller 108 may be disabled. Disabling the controller 108 may prevent the indexing accelerator 100 from functioning as an indexing accelerator, and may instead allow certain components of the indexing accelerator 100 to be used for the various different purposes. For example, after disablement of the controller 108, the indexing accelerator 100 may be used as a victim cache, a miss buffer, a stream buffer, or an optimization buffer, as opposed to an indexing accelerator 100 with MLP (i.e., based on the MLP state of the controller 108). Each of these modes may be used during any idle period that the indexing accelerator 100 is experiencing.
As shown in
According to another example, the buffer 122 may be used as a scratch pad memory such that the indexing accelerator 100, during idle periods, may provide an interface to the processor core 128 to enable specific computations to be performed on the data maintained within the buffer 122. The computations allowed may be operations that are provided by the indexing hardware, such as comparisons or address calculations. This may allow flexibility in the indexing accelerator 100 by providing other ways to reuse the indexing accelerator 100.
As described herein, the indexing accelerator 100 may be used as a victim cache, a miss buffer, a stream buffer, or an optimization buffer during idle periods. However, the indexing accelerator 100 may be used as an indexing accelerator once again, and the processor core 128 may send a signal to the indexing accelerator 100 to perform indexing operations. When the processor core 128 sends a signal to the indexing accelerator 100 to perform indexing operations, the data contained in the buffer 122 may be invalidated. If the data contained in the buffer 122 is clean data, the data may be deleted, written over, or the addresses to the data may be deleted. If the data contained in the buffer 122 is dirty or altered, then that data may be flushed to the caches (e.g., L1 cache 202, L2 cache 208) within the memory hierarchy 200. After the buffer data in the indexing accelerator 100 has been invalidated, the controller 108 may be re-enabled by receipt of a signal from the processor core 128. If the L1 cache 202 had been disabled previously, the L1 cache 202 may also be re-enabled.
In order for the indexing accelerator 100 to provide MLP support, as described herein, the indexing accelerator 100 may generally include the MSHRs 112, the multiple configuration registers (or prefetch buffers) 106 for executing independent indexing requests, and the controller 108 with MLP support.
The MSHRs 112 may provide for the indexing accelerator 100 to issue outstanding loads. The indexing accelerator 100 may include, for example, 4-12 MSHRs 112 to exploit MLP. For the cases where there is no need to support an outstanding load (e.g., speculative loads), the prefetch buffer 114 of the same size may be used to avoid complexities of dependence checking hardware in the MSHRs 112. As the indexing accelerator 100 issues its off-indexing accelerator loads to the L1 cache 202, the number of outstanding misses that the L1 cache 202 can support may also bound the number of the MSHRs 112. The multiple configuration registers 106 may be used during the execution, for example, of indexing requests for multiple queries 102. The configuration register contexts 206 may share the same decoder since the format of the requests is the same. The controller 108 with the MLP support may provide for issuing of prefetch requests via the MSHRs 112 or the prefetch buffers 114. Both tree and hash states of the indexing accelerator 100 may initiate a prefetch request. The controller 108 may force a normal execution mod of the indexing accelerator 100 or cancel the prefetch operations arbitrarily by disabling the controller monitor 116 in the MLP (prefetch) engine 110.
In order to provide for MLP, the indexing accelerator 100 may provide support for multiple indexing requests to use the indexing accelerator 100, allow execution to move ahead by issuing prefetch requests on-the-fly, and support parallel fetching of multiple probe keys to mitigate and overlap certain index misses. Each of these aspects is described with reference to
With respect to providing support for multiple indexing requests to use the indexing accelerator 100, in transaction processing environments, inter-query parallelism may be prevalent as there may be thousands of transactions buffered and waiting for the execution cycles. Therefore, the indexing portion of these queries may be scheduled for the indexing accelerator 100. Even though the indexing accelerator 100 may execute one query at a time, the indexing accelerator 100 may switch its context (e.g., by the controller 108) upon a long-latency miss in the indexing accelerator 100 after issuing a memory request for a query 102. In order to support context switching, the indexing accelerator 100 may employ a configuration register 106 per context.
Referring to
At block 304, one of the received indexing requests (e.g., indexing request based on a first query) may begin execution. The execution of the indexing request may begin by reading the related information from one of the configuration register contexts 206 that has information for the indexing request under execution, Each configuration register context may include index-related information for one indexing request. The indexing request execution may include steps that calculate the address of an index entry and load/read addresses one by one until the requested entry (or entries) is located. The address calculation may include using the address of the base address of an index table, and adding offsets to the base address according the index table layout. Once the address of the index entry is calculated, the address may be read from the memory hierarchy 200. For example, the first entry of the index may be located by reading the base address of the index table and adding the base address with the length of each index entry, where these values may be sent to the indexing accelerator 100 during a configuration stage and reside in the configuration registers 106.
At block 306, the controller 108 may determine if there is a miss in the buffer 122, which means that the requested index entry is to be fetched from processor caches.
At block 308, in response to a determination that there is no miss, the results 130 may be sent to the processor cache if the found entry matches with a searched key.
At block 310, in response to a determination that there is a miss, the controller 108 (i.e., the FSM) may begin count cycles while waiting for the requested data to arrive from the memory hierarchy 200.
At block 312, in response to a determination that the miss has not been served longer than a specified threshold (e.g., hit latency of the L1 cache 202), the controller 108 may begin execution of another indexing request (e.g., based on a second query) with a context switch to another one of the configuration register contexts 206.
At block 314, the context switch operation may save the state of the controller 108 (i.e., the FSM state) to the configuration register 106. of the indexing request based on the first query. The state information may include the last state of the controller 108 and the MSHR 112 number that was used.
At block 316, during execution of the indexing request based on the second query, in response to a determination that there is a long latency miss, again the controller 108 may begin execution of another indexing request (e.g., based on a third query) with a context switch to another one of the configuration register contexts 206.
At block 318, during a context switch, the controller 108 may check the MSHRs 112 to determine if there is a reply to one of the indexing requests.
At block 320, in response to a determination that there is a reply to one of the indexing requests, the corresponding indexing request may be scheduled.
At block 322, in response to a determination that there is no reply to one of the indexing requests, a new indexing request may begin execution.
With respect to context switching, when a context switch is needed, if all the MSHRs 112 are full and/or there is no new query to begin, the execution may stall until one of the outstanding miss is served. Then the controller 108 may resume the corresponding context.
As described herein, in order to provide or MLP, the indexing accelerator 100 may provide support for multiple indexing requests to use the indexing accelerator, allow execution to move ahead by issuing prefetch requests on-the-fly, and support parallel fetching of multiple probe keys to mitigate and overlap certain index misses.
With respect to allowing execution to move ahead by issuing prefetch requests on-the-fly, the index execution may terminate when a searched key is found. In order to determine whether the searched key is found or not, at each level of the index, the comparisons against the found key and the searched key may be performed. The probability of finding the searched key in a first attempt may be considered low. Therefore the indexing accelerator 100 execution may speculatively move ahead and assume that the searched key is not found. The aspect of moving ahead by issuing prefetch requests on-the-fly may be beneficial for hash tables where the links may be accessed ahead of time once the first bucket is found, assuming that the table is organized with multiple arrays that are aligned to each other. Even if the table does not have an aligned layout, if processing each node needs additional computations besides comparing keys (e.g., updating a state in the node, indirectly stored node values, etc.), the indexing accelerator 100 may move ahead by skipping the computation and fetching the next node (i.e., dereferencing next link pointers) upon encounter. Moving ahead may also allow for overlapping of a long-latency load that may occur while moving from one link to another.
Referring to
At block 404, during hash table search, the value (e.g., the key that the indexing request searches for) may be hashed and the bucket may be accessed.
At block 406, before reading the value within the bucket, the next link (which is the entry with the same offset but in a different array) may be issued to one of the MSHRs 112 or to the prefetch buffer 114. Similarly, if the hash table data structures are not aligned (i.e., connected via a pointer), then the indexing accelerator 100 may decide to read and dereference the pointer before reading the value within the bucket.
At block 408, the key may be compared against the null value (i.e., which means there is no such entry in the hash table) and the key used to calculate the bucket address.
At block 410, in response to a determination that one of the comparisons is true, the execution may terminate. This may imply that the last issued prefetch was unnecessary.
At block 412, in response to a determination that none of the comparisons is true, the execution may continue to the next link.
The example of
As described herein, in order to provide for MLR the indexing accelerator 100 may provide support for multiple indexing requests to use the indexing accelerator, allow execution to move ahead by issuing prefetch requests on-the-fly, and support parallel fetching of multiple probe keys to mitigate and overlap certain index misses.
With respect to support for parallel fetching of multiple probe keys to mitigate and overlap certain index misses, the moving ahead technique may provide for prefetching of the links within a single probe operation (i.e., moving ahead may exploit intra-probe parallelism). However, as described herein, the prefetching may start once the bucket header position is found (i.e., once the key is hashed). Therefore, the bucket header read may incur a relatively long--latency miss even with respect to allowing execution to move ahead by issuing prefetch requests on-the-fly.
To mitigate the first bucket header miss, the indexing accelerator 100 may exploit inter-probe parallelism as there may be a plurality (e.g., millions) of keys searched on a single index table for an indexing request (e.g., hash joins in data analytics workloads). To exploit such parallelism, the next probe key may be prefetched and the hash value may be calculated to issue the bucket header's corresponding entry in advance. Prefetching the next probe key may be performed based on the probe key access patterns as these keys are stored in an array in a DBMS and may follow a fixed stride pattern (e.g., add 8 bytes to the previous address). Prefetching the next probe key may be performed in advance so that the value may be hashed and the bucket entry may be prefetched.
Referring to
At block 504, the probe key N+1 may continue normal operation of the indexing accelerator 100 by first hashing the probe key N+1, loading the bucket entry, and carrying out the comparison operations against NULL values (i.e., empty bucket entries), and looking for a possible match.
At block 506, while the probe key N+1 is busy with loads and comparisons, by using logic gates in the computational logic 120, the controller 108 may send the probe key N+2 to the computational logic 120 for hashing (if the probe key N+2 arrived in the meantime). Once the hashing is completed, a prefetch request may be inserted into the MSHRs 112 or to the prefetch buffer 114 to prefetch the bucket entry that corresponds to probe key N+2.
At block 508, when the probe for the probe key N+1 completes, the probe key N+2 may read the bucket entry (which was prefetched) for the comparisons and issue a prefetch request for a probe key N+3.
With respect to parallel fetching of multiple probe keys, the indexing accelerator 100 may use hashing to calculate the bucket position for a probe key. For example, the indexing accelerator 100 may employ additional computational logic 118 for the prefetching purposes or let the controller 108 arbitrate the computation logic 120 among the normal and prefetch operations. The additional computational logic 118 may be employed for prefetching purposes if the prefetch distance is larger than one. The prefetch distance of one may be ideal for hiding the operations with normal operations (i.e., prefetching more than one probe key may use a relatively long normal operation, and otherwise, calculating the prefetch addresses may use excessive execution time of the indexing accelerator 100).
Referring to
At block 604, an indexing request of the received indexing requests may be assigned to a configuration register of the configuration registers. For example, referring to
At block 606, data related to an indexing operation of the controller for responding to the indexing request may be stored. For example, referring to
Referring to
At block 704, an indexing request of the received indexing requests may be assigned to a configuration register of the configuration registers. For example, referring to
At block 706, data related to an indexing operation of the controller for responding to the indexing request may be stored. For example, referring to
At block 708, execution of the indexing request may move ahead by issuing prefetch requests for a next entry in a hash table chain for responding to the indexing request. For example, referring to
At block 710, parallel fetching of multiple probe keys may be implemented. For example, referring to
According to another example, the controller 108 may support MLP by determining if there is a miss during execution of the indexing request, where the execution of the indexing request corresponds to a configuration register context of the configuration register, and where the indexing request is designated a first indexing request, and the configuration register context of the configuration register is designated a first configuration register context of a first configuration register. In response to a determination that there is no miss during the execution of the first indexing request, the indexing accelerator 100 may forward results of the execution of the first indexing request to a processor cache. Further, in response to a determination that there is a miss during the execution of the first indexing request, the controller 108 may begin count cycles, and in response to a determination that the miss has not been served longer than a specified threshold based on the count cycles, the controller 108 may begin execution of another indexing request with a context switch to a configuration register context of another configuration register. According to another example, a state of the controller 108 may be saved to the first configuration register. According to a further example, the MSHRs 112 (or the prefetch buffer 114) may be checked to determine if there is a reply to one of the indexing requests.
According to another example, the controller 108 may implement parallel fetching of multiple probe keys by determining if probing for a probe key N is completed, and in response to a determination that probing for the probe key N is completed, the controller 108 may fetch a probe key N+1, and prefetch a probe key N+2.
The computer system 800 may include a processor 802 that may implement or execute machine readable instructions performing some or all of the methods, functions and other processes described herein. Commands and data from the processor 802 may be communicated to and received from the indexing accelerator 100. Moreover, commands and data from the processor 802 may be communicated over a communication bus 804. The computer system may also include a main memory 806, such as a random access memory (RAM), where the machine readable instructions and data for the processor 802 may reside during runtime, and a secondary data storage 808, which may be non-volatile and stores machine readable instructions and data. The memory and data storage are examples of computer readable mediums.
The computer system 800 may include an I/O device 810, such as a keyboard, a mouse, a display, etc. The computer system may include a network interface 812 for connecting to a network. Other known electronic components may be added or substituted in the computer system.
What has been described and illustrated herein is an example along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2013/053040 | 7/31/2013 | WO | 00 |