I. Field of the Disclosure
The technology of the disclosure relates generally to efficient memory data transfers, particularly memory copies, in microprocessor-based systems.
II. Background
Microprocessors perform computational tasks in a wide variety of applications. A typical microprocessor application includes one or more central processing units (CPUs) that execute software instructions. The software instructions instruct a CPU to fetch data from a location in memory, perform one or more CPU operations using the fetched data, and store or accumulate the result. The memory from which the data is fetched can be local to the CPU, within a memory “fabric,” and/or within a distributed resource to which the CPU is coupled. CPU performance is the processing rate, typically measured as the number of operations that can be performed per unit of time (a typical rating is operations per second). The speed of the CPU can be increased by increasing the CPU clock rate. However, since many CPU applications require fetching data from the memory fabric, increasing the CPU clock rate without a commensurate decrease in memory fabric fetch times only increases the time the CPU spends waiting for fetched data to arrive.
Memory fetch times have been decreased by employing interleaved memory systems. Interleaved memory systems can also be employed for memory systems local to a CPU. In an interleaved memory system, multiple memory controllers are provided that interleave contiguous address lines between different memory banks in the memory. In this manner, contiguous address lines stored in different memory banks can be accessed simultaneously to increase memory access bandwidth. In a non-interleaved memory system, contiguous lines stored in a memory bank can only be accessed serially.
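By way of a non-limiting illustration, the mapping of an address to an interleaved memory bank can be sketched in C as follows; the stride and bank count values below are assumptions chosen only for this sketch:

#include <stdint.h>
#include <stdio.h>

/* Assumed interleave parameters, for illustration only. */
#define INTERLEAVE_STRIDE 1024u  /* bytes per interleaved address block */
#define INTERLEAVE_COUNT  2u     /* number of interleaved memory banks  */

/* Each consecutive stride-sized address block maps to the next bank. */
static unsigned bank_of(uintptr_t addr)
{
    return (unsigned)((addr / INTERLEAVE_STRIDE) % INTERLEAVE_COUNT);
}

int main(void)
{
    /* Contiguous address blocks alternate banks, so two adjacent blocks
       can be accessed simultaneously rather than serially. */
    for (uintptr_t a = 0; a < 4u * INTERLEAVE_STRIDE; a += INTERLEAVE_STRIDE)
        printf("address 0x%05lx -> bank %u\n", (unsigned long)a, bank_of(a));
    return 0;
}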
To further illustrate addresses interleaved among different memory banks,
Even though interleaved memory systems provide a theoretical increase in bulk transfer bandwidth, it is difficult for a CPU to use all of this bandwidth. The address alignments used by the CPU often do not align with the optimal interleaving boundaries of the interleaved memory system. This is because the address alignments used by the CPU are typically created based on the alignments of the memory buffers engaged by the CPU, and not the architecture of the interleaved memory system. Further, data transfers whose sizes are less than the stride of an interleaved memory system may not benefit from the interleaved memory system.
Illustrative embodiments of the present invention that are shown in the drawings are summarized below. These and other embodiments are more fully described in the Detailed Description section. It is to be understood, however, that there is no intention to limit the invention to the forms described in this Summary of the Invention or in the Detailed Description. One skilled in the art can recognize that there are numerous modifications, equivalents, and alternative constructions that fall within the spirit and scope of the invention as expressed in the claims.
Aspects of the invention may be characterized as a method for transferring data on a computing device that includes interleaved memory constructs and is capable of issuing an asynchronous load of data to a cache memory in advance of the data being used by a processor. The method may include receiving a first read address associated with a read data stream and receiving a second address that is associated with a second data stream, the second data stream being one of a read or a write data stream. In addition, a minimum preload offset is obtained that is based, at least in part, upon the speed of the memory constructs. A next available interleaved memory address, which is in a next adjacent interleave to the second address, is calculated by adding the interleave size to the second address, and a minimum preload address is calculated by adding the obtained minimum preload offset to the first read address. In addition, a raw address distance is calculated by subtracting the minimum preload address from the next available interleaved address, and an interleave size mask is calculated, based upon an interleave stride and an interleave count, to strip the higher-order bits from the raw address distance and produce a raw offset from the minimum preload address to a preferred memory preload address. A final preload offset from the first read address is then calculated by adding the minimum preload offset to the calculated raw offset, and the final preload offset is used to address-align memory addresses to prevent the read data stream and the second data stream from simultaneously utilizing the memory constructs, thereby accelerating the transfer of the data.
Other aspects may be characterized as a computing device that includes at least two memory constructs, a cache memory coupled to store data from the memory constructs, and a processor coupled to the cache memory. The processor may include registers to store a first read address associated with a read data stream and a second address that is associated with a second data stream. The processor may also include system registers that store a minimum preload offset, an interleave stride, and an interleave count. In addition, the processor may include raw offset logic to determine a raw offset utilizing the first read address, the second address, the interleave stride, the interleave count, and the minimum preload offset, and logic to add the raw offset to the minimum preload offset to obtain a final preload offset. A data prefetch generation component may be included in the processor that uses the final preload offset to prefetch data that is one interleave away from data being accessed at the second address, to prevent the read data stream and the second data stream from simultaneously utilizing the memory constructs.
With reference now to the drawing figures, several exemplary embodiments of the present disclosure are described. The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.
Embodiments disclosed herein include accelerated interleaved memory data transfers in processor-based devices and systems. Related devices, methods, and computer-readable media are also disclosed. Embodiments disclosed include accelerated large and small memory data transfers. As a non-limiting example, a large data transfer is a data transfer size greater than the interleaved address block size provided in the interleaved memory. As another non-limiting example, a small data transfer is a data transfer size less than the interleaved address block size provided in the interleaved memory.
To efficiently utilize interleaved memory systems for accelerated data transfers, in certain disclosed embodiments, data streams are address-aligned so that they do not access the same memory bank in interleaved memory at the same time during the data transfer. For example, a read data stream involved in a memory data transfer is address-aligned so that the read data stream and a write data stream do not access the same memory bank in interleaved memory at the same time during the data transfer. Address alignment provides increased data transfer efficiency for large data transfers, where the size of the data transfer is greater than the interleaved address block size. To provide further increases in data transfer efficiency, the memory data to be transferred may also be prefetched or preloaded into faster memory (e.g., faster cache memory) before the transfer operation is executed. In this manner, a processor (e.g., a central processing unit (CPU)) can quickly read the data to be transferred from the faster memory when executing the transfer operation, without having to wait for the data to be read from slower memory. Also, a minimum prefetch or preload offset may be employed with the prefetch or preload operation so that the reading of data from slower memory and writing into faster memory is completed before the CPU needs access to the data during the transfer operation.
In other disclosed embodiments, preload-related computations and operations are used to minimize overhead in setting up data transfers in data transfer software functions. For example, the data transfer software function may be included in software libraries that are called upon for data transfers. One non-limiting example of a data transfer software routine is a modified version of the “memcpy” software function in the C programming language. The use of preload-related computations and operations to provide efficient data transfers is designed to be dependent on the data transfer size, and can also vary depending on other parameters of the CPU, including without limitation the number of available internal registers, the size of the registers, and the line size of the memory.
In this regard,
With continuing reference to
With continuing reference to
The data bus 36 may be provided as interleaved data buses 36(0)-36(Y) configured to carry data for interleaved memory address blocks according to the interleaving scheme provided by the memory controllers 42(0)-42(Z). Alternatively, a single data bus 36 can be provided to transfer data serially between interleaved memory address blocks from the memory controllers 42(0)-42(Z) and the bus interface 38.
Providing the interleaved third level cache memory (L3) 30 and interleaved memory controllers 42(0)-42(Z) can increase the memory bandwidth available to the CPU 22 by a multiple of the number of unique interleaved memory banks, but only if the interleaved third level cache memory (L3) 30 and interleaved memory controllers 42(0)-42(Z) handle data transfer operations without waits one hundred percent (100%) of the time. However, the memory address alignments used by the CPU 22 often do not align with the optimal interleaving address boundaries in the interleaved memory. This is because the address alignments used by the CPU 22 are typically created based on the alignments of the memory buffers (e.g., four (4) bytes) engaged by the CPU 22, and not the architecture of the interleaved memory. Further, data transfer sizes that are less than the stride of the interleaved memory may not benefit from the interleaved memory system. For example, with a one (1) KB interleaved address block size aligned at 1 KB boundaries and a sixty-four (64) byte cache line size, the address alignments of two CPU 22 data transfer streams (e.g., a large block of sequentially addressed memory being read/written) have only approximately a three percent (3%) likelihood (64 bytes/2 KB=3.1%) of being aligned to the cache line in the opposite memory bank 40 that fully utilizes the interleaved third level cache memory (L3) 30 and memory controllers 42(0)-42(Z).
To efficiently utilize interleaved memory systems for accelerated data transfers, including those provided in the processor-based system 20 in
To provide further increases in data transfer efficiency, the data to be transferred may also be prefetched or preloaded into faster memory (e.g., faster cache memory) before the transfer operation is executed. By prefetching or preloading, the CPU 22 can quickly read the data to be transferred from the faster memory when executing the transfer operation without having to wait for the data to be read from slower memory. The CPU 22 is typically pipelined such that multiple read and write operations (typically the size of a cache line) can be dispatched by the CPU 22 to memories without stalling the CPU 22 pipeline. Also, a minimum prefetch or preload offset may be employed with the prefetch or preload operation so that data read from slower memory and written to faster memory is completed before the CPU 22 needs access to the data during the transfer operation.
As an example, data stream address alignment for a one (1) KB interleaved address block size at 1 KB boundaries (starting at memory address 0, e.g., boundaries 0x000, 0x400, 0x800, etc.) could be as follows. Consider a memory copy transfer (e.g., the memcpy C language function) where the least significant bits (LSBs) of the starting read memory address are 0x000 and the LSBs of the starting write memory address are 0x000. The third level cache memory (L3) 30 will be used by the CPU 22 to store read data for fast access when the read data is written during the data transfer operation. These read and write memory addresses will access the same memory bank 40 in the third level cache memory (L3) 30 during the data transfer. However, because the stride is one (1) KB, the starting read memory address could be set by the CPU 22 to 0x400 in the third level cache memory (L3) 30 for the memory reads and writes to be aligned for accessing different memory banks 40 during the data transfer. In this example, the starting read memory address could also be set by the CPU 22 to 0xC00 in the third level cache memory (L3) 30 for the memory reads and writes to be aligned for accessing different memory banks 40 during the data transfer. In this example, the stride of the interleaved memory banks 40 controls the optimal alignment distance between the read memory address and the write memory address.
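A minimal sketch of this alignment check, assuming the one (1) KB stride and two banks of this example (the helper function below is hypothetical and is not part of this disclosure):

#include <stdint.h>
#include <stdio.h>

#define STRIDE 0x400u /* one (1) KB interleaved address block size */
#define BANKS  2u

static unsigned bank_of(uintptr_t addr)
{
    return (unsigned)((addr / STRIDE) % BANKS);
}

int main(void)
{
    uintptr_t write_addr = 0x000; /* starting write address in this example */
    uintptr_t reads[] = { 0x000, 0x400, 0xC00 }; /* candidate read addresses */
    for (int i = 0; i < 3; i++)
        printf("read 0x%03lx: %s\n", (unsigned long)reads[i],
               bank_of(reads[i]) == bank_of(write_addr)
                   ? "same bank (conflict)" : "opposite bank (aligned)");
    return 0;
}

As the sketch shows, read starts of 0x400 or 0xC00 place the read stream in the opposite memory bank from the write stream, while a read start of 0x000 does not.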
With continuing reference to
A preload operation is deemed to be an asynchronous operation in this example, because the CPU 22 does not wait for the result of a preload operation. At a later time, the CPU 22 uses a synchronous load/move instruction to read the “current data” from the faster cache (e.g., the third level cache memory (L3) 30) into a CPU 22 register. If a PLD instruction is issued far enough ahead of time, the read latency of the slower memory (e.g., the fabric memory) can be hidden from the CPU 22 pipeline such that the CPU 22 only incurs the latency of the faster cache memory access (e.g., the third level cache memory (L3) 30). Given a sequential stream of addresses, the term “minimum preload offset” is used to describe how many addresses ahead of the current read pointer to preload read data in order to be far enough ahead to overcome the read latency of the slower memory. In this example, the cache memory 26, 28, 30 in which the preloaded data from the fabric memory is loaded can be specified as desired.
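As a non-limiting sketch of this minimum preload offset concept, the loop below issues asynchronous preloads a fixed distance ahead of the synchronous reads; the GCC/Clang __builtin_prefetch intrinsic stands in for a PLD instruction, and the offset value is an assumption chosen only for illustration:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed value; in practice this depends on the slower memory's read latency. */
#define MINIMUM_PLD_OFFSET 256u

/* Copies n bytes while preloading read data MINIMUM_PLD_OFFSET bytes ahead of
   the current read pointer, so the slower-memory read latency is hidden by the
   time the synchronous loads reach those addresses. */
void copy_with_preload(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 64) { /* one cache line per iteration */
        if (i + MINIMUM_PLD_OFFSET < n)
            __builtin_prefetch(src + i + MINIMUM_PLD_OFFSET, 0, 0); /* async read */
        size_t chunk = (n - i < 64) ? (n - i) : 64;
        memcpy(dst + i, src + i, chunk); /* synchronous load/store */
    }
}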
As depicted in
With continuing reference to
With continuing reference to
ADDRESS_MASK=((INTERLEAVE_COUNT*INTERLEAVE_STRIDE*2)−1)
MADDR1=(ADDRESS_MASK “AND” addr1)
MADDR2=(ADDRESS_MASK “AND” addr2)
With continuing reference to
The calculated preload offset (PLD_OFFSET) must always be greater than or equal to the minimum PLD offset (MINIMUM_PLD_OFFSET). The minimum PLD address is labeled “R” in
R=MADDR1+MINIMUM_PLD_OFFSET
With continuing reference to
W(0)=MADDR2+INTERLEAVE_STRIDE
W(1)=W(0)+(INTERLEAVE_COUNT*INTERLEAVE_STRIDE)
W(2)=W(1)+(INTERLEAVE_COUNT*INTERLEAVE_STRIDE)
To select the preferred memory address for alignment, the equation below may be used to identify a preferred memory address (PA):
If (R<=W(0)) then PA=W(0)
Else If (R<=W(1)) then PA=W(1)
Else PA=W(2)
As shown in
PLD_OFFSET=PA−MADDR1
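The selection above can be expressed as the following non-limiting C sketch, under the assumption that the interleave count and stride are powers of two and that all quantities are byte addresses:

#include <stdint.h>

/* addr1: first (read) stream start; addr2: second (read or write) stream start. */
uintptr_t pld_offset_by_selection(uintptr_t addr1, uintptr_t addr2,
                                  uintptr_t interleave_count,
                                  uintptr_t interleave_stride,
                                  uintptr_t minimum_pld_offset)
{
    uintptr_t address_mask = (interleave_count * interleave_stride * 2u) - 1u;
    uintptr_t maddr1 = addr1 & address_mask;
    uintptr_t maddr2 = addr2 & address_mask;
    uintptr_t r  = maddr1 + minimum_pld_offset;          /* minimum PLD address */
    uintptr_t w0 = maddr2 + interleave_stride;           /* next adjacent interleave */
    uintptr_t w1 = w0 + interleave_count * interleave_stride;
    uintptr_t w2 = w1 + interleave_count * interleave_stride;
    uintptr_t pa = (r <= w0) ? w0 : (r <= w1) ? w1 : w2; /* preferred address */
    return pa - maddr1;                                  /* PLD_OFFSET */
}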
It is desirable that the calculated preload offset (PLD_OFFSET) be as small as possible. This is because the PLD_OFFSET determines the efficiency of several embodiments of the invention at the beginning and end of an interleaved acceleration, as well as the minimum size of the streaming operation to which several embodiments of this invention can be applied. For example, the PLD_OFFSET of many of these embodiments will always be less than or equal to:
((INTERLEAVE_COUNT*INTERLEAVE_STRIDE)+MINIMUM_PLD_OFFSET).
As a result, the preload offset (PLD_OFFSET) is calculated such that a preload to the first memory address (addr1+PLD_OFFSET) will address the closest preload address that is also in a different interleaved memory bank than a read or write to the second memory address (addr2); thus, an efficient and usable data transfer is provided from interleaved memory. This efficiency determines the minimum address block size (MINIMUM_BLOCK_SIZE) of the data streaming operation to which many embodiments can be applied. It may also be desired that the first memory address (addr1) preloads do not extend past the end of the first memory address (addr1) stream, to avoid inefficiencies. Therefore, it may be desired to minimize the preload offset (PLD_OFFSET) for smaller data transfer sizes. Also, it may be desired to start preloading data for a data transfer as soon as possible. If the preload offset (PLD_OFFSET) is larger than the calculation in
As depicted in
With continuing reference to
With continuing reference to
With continuing reference to
W(0)=ADDR2+INTERLEAVE_STRIDE
W(1)=ADDR2+((INTERLEAVE_COUNT+1)*INTERLEAVE_STRIDE)
With continuing reference to
R=ADDR1+MINIMUM_PLD_OFFSET
With continuing reference to
D=W(0)−R
As one skilled in the art can appreciate, the modulo of a positive number is a well-defined mathematical operation and is represented by the “%” symbol. However, the modulo of a negative number is not explicitly defined and varies with the hardware or software implementation. A “bitwise AND” is also a well-known logical operation and is represented by the “&” symbol. The term modulo is used to describe a concept of embodiments described herein, but since a bitwise AND is actually used in the calculation, the expected results using either a positive or negative number are well defined.
It should be noted that the high order bits of both of the above “W(0)” and “R” addresses are unknown, and it is therefore unknown whether the result “D” is positive or negative, large or small. This is resolved by using the remainder from a type of modulo operation. The modulo of a power of 2 can be quickly calculated using a bitwise “AND” and can alternatively be expressed as:
X % 2^n == X & (2^n − 1)
Typically, only positive values of X are used in modulo equations; however, embodiments will use the function with both positive and negative values in order to optimize the number of required steps. In the equation above, the “(2^n − 1)” component will be referred to as the “INTERLEAVE_SIZE_MASK.”
As is well known, the CPU 22 stores negative numbers in a “2's complement” format, which uses the highest order bit to represent a negative number. In several embodiments, the highest order bit of the INTERLEAVE_SIZE_MASK will always be 0. Therefore, a bitwise “AND” using the INTERLEAVE_SIZE_MASK applies both a modulo and an absolute value function to “X.” Besides the modulo and absolute value functions, a third property of the computing system is used by the bitwise “AND.” As stated, negative numbers are stored in a 2's complement format. Using a 32-bit CPU 22 as an example, a negative number such as “−X” would be stored in a register as (2^32−X). When a bitwise “AND” of the INTERLEAVE_SIZE_MASK of (2^n−1) is applied to a 2's complement number such as “−X,” it will produce a modulo remainder equal to the positive value (2^n−X). When it is applied to a positive number, it will produce a modulo remainder equal to “X.”
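The following non-limiting C demonstration shows the bitwise “AND” producing a well-defined, non-negative remainder for both signs of “D”; the 0x7FF mask corresponds to an assumed two-bank, one (1) KB stride interleave:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t mask = 0x7FF;      /* INTERLEAVE_SIZE_MASK for 2 banks * 1 KB stride */
    int64_t positive = 0x300;  /* D when W(0) is ahead of R */
    int64_t negative = -0x200; /* D when R has passed W(0) */
    /* 2's complement storage makes (-X) & mask equal to (2^n - X) & mask,
       a positive remainder, with no explicit sign handling required. */
    printf("0x%" PRIx64 "\n", (uint64_t)(positive & mask)); /* prints 300 */
    printf("0x%" PRIx64 "\n", (uint64_t)(negative & mask)); /* prints 600 */
    return 0;
}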
With continuing reference to
The formulas used in Block 614 can be expressed as:
INTERLEAVE_SIZE_MASK=((INTERLEAVE_COUNT*INTERLEAVE_STRIDE)−1)
RAW_OFFSET=D & INTERLEAVE_SIZE_MASK
With continuing reference to
PLD_OFFSET=RAW_OFFSET+MINIMUM_PLD_OFFSET
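Taken together, blocks 608 through 616 can be expressed as the following non-limiting C sketch; unsigned arithmetic is assumed so that the wraparound of the subtraction is well defined, along with power-of-two interleave parameters:

#include <stdint.h>

uintptr_t pld_offset_by_modulo(uintptr_t addr1, uintptr_t addr2,
                               uintptr_t interleave_count,
                               uintptr_t interleave_stride,
                               uintptr_t minimum_pld_offset)
{
    uintptr_t w0 = addr2 + interleave_stride;        /* next available interleave */
    uintptr_t r  = addr1 + minimum_pld_offset;       /* minimum preload address */
    uintptr_t d  = w0 - r;                           /* raw address distance (may wrap) */
    uintptr_t interleave_size_mask =
        (interleave_count * interleave_stride) - 1u;
    uintptr_t raw_offset = d & interleave_size_mask; /* modulo via bitwise AND */
    return raw_offset + minimum_pld_offset;          /* final PLD_OFFSET */
}

For example, with addr1 and addr2 both 0x10000, a one (1) KB stride, two banks, and a 0x100 minimum preload offset, the sketch returns a PLD_OFFSET of 0x400, so preloads to (addr1+PLD_OFFSET) land one interleave away from the addr2 stream.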
With continuing reference to
Many types of hardware prefetchers exist, but embodiments disclosed herein implement novel and unique algorithms which have not previously been used in any hardware. Typical hardware prefetchers search for a recognized pattern of read data and then automatically begin to speculatively preload future data based on the current read pattern. A typical example of a prefetch algorithm is an instruction or data cache fill that causes the next cache line to be prefetched. A strided data prefetcher looks for a constant address stride between several data reads. It then uses that constant stride, multiplied by a predetermined count, to create a prefetch address that it speculates the CPU will read in the future. The automatic strided prefetch operation stops when the read address stride is broken. The common theme of these existing prefetchers is that they prefetch data based on a single stream of reads. They do not take interleaved memory devices or a second data stream into account.
Flowchart blocks 608, 610, 612, 614, and 616 are used to calculate the final PLD_OFFSET. Using the formulas from the flowchart description above, the hardware, as shown in
There are also a number of hints that can be specified with each data PLD instruction. The use of software hints with regard to reading and writing cache data is known in the art. These hints are not required by either the software or hardware embodiment, but some of these software hints could be extended, or new ones created, to also apply to this embodiment. For example, one hint is whether data is to be streamed or not. Streaming data is transient and temporal and may not be expected to be needed again. Thus, streamed data may be removed from cache memory as soon as it is loaded by the CPU. This may make some data streaming operations more efficient by reducing cache use. Another hint is whether data is expected to be read and/or written in the future. These hints can be combined. Also, some or all of these hints may or may not be available depending on the type of CPU chip employed.
In this regard,
There are different approaches that can be used to calculate the preload offset (PLD_OFFSET). Below are three (3) examples that follow the steps outlined above with reference to
INTERLEAVE_STRIDE=1024 bytes
INTERLEAVE_COUNT=2
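As a hypothetical worked computation with these parameters (the starting addresses and minimum offset below are assumptions chosen only for illustration), let addr1=0x10000 (read), addr2=0x10000 (write), and MINIMUM_PLD_OFFSET=0x100:

W(0)=0x10000+0x400=0x10400
R=0x10000+0x100=0x10100
D=0x10400−0x10100=0x300
INTERLEAVE_SIZE_MASK=((2*1024)−1)=0x7FF
RAW_OFFSET=0x300 & 0x7FF=0x300
PLD_OFFSET=0x300+0x100=0x400

Preloads issued to (addr1+0x400) therefore always fall in the opposite one (1) KB memory bank from the accesses at addr2.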
There are several architecture design choices that can influence the calculations described herein. The list below in Example 4 illustrates some of these dependencies.
Below are two further examples of how data transfer memory address alignment can be provided for write and read data streams. Example 5 below illustrates when the first memory address (addr1) is a read stream and the second memory address (addr2) is a write stream. Example 6 below illustrates when the first memory address (addr1) is a read stream and the second memory address (addr2) is a read stream.
It should be noted that the order of the operations in the examples above can vary for different implementations. It is not uncommon for preloads to follow loads in some cases. Also, there may be a varying number of loads, stores, and preloads in the loop, in order to match the total number of bytes loaded or stored with the total number of bytes preloaded each time through the loop. For example, if each load instruction only loads thirty-two (32) bytes (by specifying one or more registers to load that can collectively hold thirty-two (32) bytes), and each preload loads one hundred twenty-eight (128) bytes, there might be four loads in the loop for each preload, as in the sketch following the next paragraph. Many load and store instructions hold as few as four (4) bytes, so many loads and stores are needed for each preload instruction. And there could be multiple preload instructions per loop, for some multiple of preload size (PLDSize) processed per loop.
Also, there may be extra software instructions provided to handle the remainder of data that is not a multiple in size of preload size (PLDSize) (when DataSize modulo PLDSize is not zero). Also, note that some loops decrement a loop counter by one each time through the loop (rather than decrementing the number of bytes)—there are a number of equivalent ways loops can be structured.
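A non-limiting C sketch of such a loop structure follows; the one hundred twenty-eight (128) byte preload size and thirty-two (32) byte load granularity are taken from the example above, __builtin_prefetch stands in for the PLD instruction, and the remainder is handled after the main loop:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PLD_SIZE 128u /* bytes covered by each preload */

void copy_unrolled(uint8_t *dst, const uint8_t *src, size_t n, size_t pld_offset)
{
    size_t i = 0;
    /* Main loop: one preload per PLD_SIZE bytes, four 32-byte load/store pairs. */
    for (; i + PLD_SIZE <= n; i += PLD_SIZE) {
        __builtin_prefetch(src + i + pld_offset, 0, 0);
        memcpy(dst + i,      src + i,      32);
        memcpy(dst + i + 32, src + i + 32, 32);
        memcpy(dst + i + 64, src + i + 64, 32);
        memcpy(dst + i + 96, src + i + 96, 32);
    }
    /* Remainder: the DataSize modulo PLDSize bytes left over, with no further preloads. */
    if (i < n)
        memcpy(dst + i, src + i, n - i);
}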
The embodiments disclosed herein can also be implemented using prefetches in CPU hardware where the PLD instruction is not employed. For example, a CPU and cache may have the ability to “prefetch” data based on the recognition of data patterns. A simple example is when a cache memory reads a line from memory because of a request from the CPU. The cache memory is often designed to read the next cache line in cache memory in anticipation that the CPU will require this data in a subsequent data request. This is termed a speculative operation since the results may never be used.
The CPU hardware could recognize the idiom of a cached streaming read to a first memory address (addr1) register, a cached streaming read/write to a second memory address (addr2) register, and a decrementing register that is used as the data streaming size. The CPU hardware could then automatically convert the first memory address (addr1) stream to an optimized series of preloads, as described by this disclosure. It is also possible that a software hint instruction could be created to indicate to the CPU hardware to engage this operation. The CPU hardware calculates a preload offset (pld_offset) conforming to a set of rules, which, when added to the first memory address (addr1) data stream, can create a stream of preloads to memory which will always access a different interleaved memory bank than the second memory address (addr2) stream. The second memory address (addr2) can be either a read or a write data stream. This allows the interleaved memory banks of cache memory or other memory to be accessed in parallel, thereby increasing bandwidth by utilizing the bandwidth of all interleaved memory banks. If more than two (2) interleaved devices exist, this approach can be applied multiple times.
The accelerated interleaved memory data transfers in microprocessor-based systems according to embodiments disclosed herein may be provided in or integrated into any processor-based device. Examples, without limitation, include a set top box, an entertainment unit, a navigation device, a communications device, a fixed location data unit, a mobile location data unit, a mobile phone, a cellular phone, a computer, a portable computer, a desktop computer, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a digital video player, a video player, a digital video disc (DVD) player, and a portable digital video player.
In this regard,
Other devices can be connected to the system bus 78. As illustrated in
The CPU 72 may also be configured to access the display controller(s) 90 over the system bus 78 to control information sent to one or more displays 94. The display controller(s) 90 sends information to the display(s) 94 to be displayed via one or more video processors 96, which process the information to be displayed into a format suitable for the display(s) 94. The display(s) 94 can include any type of display, including but not limited to a cathode ray tube (CRT), a liquid crystal display (LCD), a plasma display, etc.
Those of skill in the art would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the embodiments disclosed herein may be implemented as electronic hardware, instructions stored in memory or in another computer-readable medium and executed by a processor or other processing device, or combinations of both. The devices described herein may be employed in any circuit, hardware component, integrated circuit (IC), or IC chip, as examples. Memory disclosed herein may be any type and size of memory and may be configured to store any type of information desired. To clearly illustrate this interchangeability, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How such functionality is implemented depends upon the particular application, design choices, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a processor, a DSP, an Application Specific Integrated Circuit (ASIC), an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The embodiments disclosed herein may be embodied in hardware and in instructions that are stored in hardware, and may reside, for example, in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, hard disk, a removable disk, a CD-ROM, or any other form of computer readable medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a remote station. In the alternative, the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
It is also noted that the operational steps described in any of the exemplary embodiments herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary embodiments may be combined. It is to be understood that the operational steps illustrated in the flow chart diagrams may be subject to numerous different modifications as will be readily apparent to one of skill in the art. Those of skill in the art would also understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The present application for patent claims priority to Provisional Application No. 61/606,757 entitled ACCELERATED INTERLEAVED MEMORY DATA TRANSFERS IN MICROPROCESSOR-BASED SYSTEMS, AND RELATED DEVICES, METHODS, AND COMPUTER-READABLE MEDIA filed Mar. 5, 2012, and assigned to the assignee hereof and hereby expressly incorporated by reference herein. The present application is related to U.S. patent application Ser. No. 13/369,548 Docket Number 111094 filed on Feb. 9, 2012 and entitled “DETERMINING OPTIMAL PRELOAD DISTANCE AT RUNTIME,” which is incorporated herein by reference in its entirety.