Embodiments generally relate to memory structures. More particularly, embodiments relate to low overhead memory content estimation.
Natural language processing (NLP) workloads may benefit from the use of hardware acceleration technology. For example, a hardware queue manager (HQM) may schedule the workloads for execution by field programmable gate arrays (FPGAs) and dispatch the workloads to the FPGAs via a high-speed data link such as Compute Express Link (CXL, e.g., Compute Express Link Specification, Rev. 1.1, June 2019). Variable runtime conditions, however, may prevent conventional HQMs from achieving optimal scheduling performance.
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
Turning now to
In one example, the first CPU 20 executes a natural language processing (NLP) application 28 that stores ML data 30 (30a-30b, e.g., associated with one or more tasks submitted by one or more processor cores) to the local memory 24. A first portion 30a of the illustrated ML data 30 is relatively sparse (e.g., 60% of the matrix elements contain zero values), whereas a second portion 30b of the illustrated ML data 30 is relatively dense (e.g., 10% of the matrix elements contain zero values). As will be discussed in greater detail, the first CPU 20 may use a data movement accelerator 34 to sample the ML data 30 from the local memory 24 in accordance with a specified configuration 36. In an embodiment, the specified configuration 36, which may be generated by an AI request scheduler 32, includes a pattern (e.g., random), a number of samples (e.g., 100, 1000), a stride (e.g., memory controller/MC interleaving of four), a memory range (e.g., [A,B]), a function (e.g., sum, hash, average), a destination address (e.g., @H), etc., and/or any combination thereof. Additionally, the NLP application 28 may provide hints to the data movement accelerator 34 and/or the AI request scheduler 32 on how to handle the ML data 30 in accordance with one or more service level agreement (SLA) constraints.
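For purposes of illustration only, the specified configuration 36 might be modeled as a descriptor along the lines of the following C sketch, where the field names, enumerations and types are assumptions made for clarity rather than requirements of the embodiments.

```c
#include <stdint.h>

/* Hypothetical sampling pattern selector (e.g., random vs. strided). */
enum sample_pattern { PATTERN_RANDOM, PATTERN_STRIDED };

/* Hypothetical reduction/summary function selector. */
enum sample_function { FUNC_SUM, FUNC_HASH, FUNC_AVERAGE };

/* One possible layout for the specified configuration 36. */
struct sampling_config {
    enum sample_pattern  pattern;       /* e.g., random                    */
    uint32_t             num_samples;   /* e.g., 100, 1000                 */
    uint32_t             stride;        /* e.g., MC interleaving of four   */
    uint64_t             range_start;   /* memory range [A, B]             */
    uint64_t             range_end;
    enum sample_function function;      /* e.g., sum, hash, average        */
    uint64_t             dest_addr;     /* destination address (e.g., @H)  */
};
```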
A function may be registered (e.g., as a bit stream) and executed on the sampled ML data 30 by, for example, a relatively small accelerator (not shown) in a memory module 42 containing the local memory 24, in a memory controller 44 coupled to the local memory 24, and so forth. Executing the function on the sampled ML data 30 may enable the data movement accelerator 34 to estimate the complexity of the ML data 30 based on one or more thresholds 40 (e.g., sparsity thresholds). For example, the data movement accelerator 34 might determine that the second portion 30b of the sampled ML data 30 is relatively complex and the first portion 30a of the sampled ML data 30 is not relatively complex.
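Purely as a sketch, and assuming the sampled values have already been gathered, the complexity estimate might reduce to comparing a measured sparsity ratio against one of the thresholds 40, as in the following hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: count zero-valued samples and compare the resulting
 * sparsity ratio against a configured sparsity threshold 40. Dense data
 * (few zeros) is treated as relatively complex, consistent with the example
 * of the first portion 30a (sparse) and the second portion 30b (dense).   */
static bool is_relatively_complex(const int32_t *samples, size_t n,
                                  double sparsity_threshold)
{
    size_t zeros = 0;
    for (size_t i = 0; i < n; ++i)
        if (samples[i] == 0)
            ++zeros;

    double sparsity = (n > 0) ? (double)zeros / (double)n : 0.0;
    return sparsity < sparsity_threshold;
}
```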
In an embodiment, the AI request scheduler 32 schedules the task(s) for execution by one or more of the accelerators 14 based on the complexity and telemetry data 46 (e.g., bandwidth measurements) associated with the link 18 to the accelerators 14. Scheduling the task(s) may involve selecting a function implementation from a plurality of function implementations 48 (48a-48b). For example, tasks corresponding to the first portion 30a of the sampled ML data 30 may be scheduled in accordance with a first function implementation 48a, whereas tasks corresponding to the second portion 30b of the sampled ML data 30 might be scheduled in accordance with a second function implementation 48b.
With regard to the telemetry data 46, the AI request scheduler 32 may be able to capture real-time snapshots of the link 18, such as bandwidth measurements.
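As a hypothetical illustration (the specific fields are assumptions and not a defined telemetry format), such a snapshot might carry information similar to the following.

```c
#include <stdint.h>

/* Hypothetical real-time telemetry snapshot for the link 18. */
struct link_telemetry_snapshot {
    uint64_t timestamp_ns;         /* when the snapshot was taken          */
    uint64_t bandwidth_mbps;       /* measured bandwidth on the link       */
    uint32_t utilization_pct;      /* assumed: current link utilization    */
    uint32_t outstanding_requests; /* assumed: work already dispatched     */
};
```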
The illustrated computing system 10 is therefore considered performance-enhanced at least to the extent that the AI request scheduler 32 takes into consideration runtime conditions such as data sparsity and link telemetry when scheduling tasks for execution on the accelerators 14. Moreover, using the low overhead data movement accelerator 34 to sample the ML data 30 and estimate the complexity obviates software-related concerns such as, for example, increased traffic, page table misses, cache misses, cache pollution, coherence traffic (e.g., for read operations) and/or latency. Indeed, during training, the AI request scheduler 32 may be able to automatically select between a long short-term memory (LSTM) function implementation when the dataset/matrix is sparse and a less bandwidth-intensive transformer-based neural network function implementation when the dataset is less sparse.
The local memory 24 may be part of a memory device that includes non-volatile memory and/or volatile memory. Non-volatile memory is a storage medium that does not require power to maintain the state of data stored by the medium. In one embodiment, the memory structure is a block addressable storage device, such as those based on NAND or NOR technologies. A storage device may also include future generation nonvolatile devices, such as a three-dimensional (3D) crosspoint memory device, or other byte addressable write-in-place nonvolatile memory devices. In one embodiment, the storage device may be or may include memory devices that use silicon-oxide-nitride-oxide-silicon (SONOS) memory, electrically erasable programmable read-only memory (EEPROM), chalcogenide glass, multi-threshold level NAND flash memory, NOR flash memory, single or multi-level Phase Change Memory (PCM), a resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), anti-ferroelectric memory, magnetoresistive random access memory (MRAM) that incorporates memristor technology, resistive memory including metal oxide-based, oxygen vacancy-based and conductive bridge Random Access Memory (CB-RAM), or spin transfer torque (STT)-MRAM, a spintronic magnetic junction memory based device, a magnetic tunneling junction (MTJ) based device, a DW (Domain Wall) and SOT (Spin Orbit Transfer) based device, a thyristor based memory device, or a combination of any of the above, or other memory. The term “storage device” may refer to the die itself and/or to a packaged memory product. In some embodiments, 3D crosspoint memory may comprise a transistor-less stackable cross point architecture in which memory cells sit at the intersection of word lines and bit lines and are individually addressable and in which bit storage is based on a change in bulk resistance. In particular embodiments, a memory module with non-volatile memory may comply with one or more standards promulgated by the Joint Electron Device Engineering Council (JEDEC), such as JESD235, JESD218, JESD219, JESD220-1, JESD223B, JESD223-1, or other suitable standard (the JEDEC standards cited herein are available at jedec.org).
Volatile memory is a storage medium that requires power to maintain the state of data stored by the medium. Examples of volatile memory may include various types of random access memory (RAM), such as dynamic random access memory (DRAM) or static random access memory (SRAM). One particular type of DRAM that may be used in a memory module is synchronous dynamic random access memory (SDRAM). In particular embodiments, DRAM of the memory modules complies with a standard promulgated by JEDEC, such as JESD79F for Double Data Rate (DDR) SDRAM, JESD79-2F for DDR2 SDRAM, JESD79-3F for DDR3 SDRAM, or JESD79-4A for DDR4 SDRAM (these standards are available at jedec.org). Such standards (and similar standards) may be referred to as DDR-based standards and communication interfaces of the storage devices that implement such standards may be referred to as DDR-based interfaces.
The sampling logic 130b may implement the flow described earlier. For example, the sampling logic 130b may access N values depending on the specified pattern, interleaving and number of samples. In an embodiment, the sampling logic 130b also performs the specific function provided as a parameter and stores the result to the specified destination address.
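A minimal sketch of such a flow is given below, where the flat addressing model, the source of randomness and the function dispatch are simplifying assumptions rather than a prescribed implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum sample_pattern  { PATTERN_RANDOM, PATTERN_STRIDED };
enum sample_function { FUNC_SUM, FUNC_HASH, FUNC_AVERAGE };

/* Hypothetical sampling pass: read N values from [base, base + len)
 * according to the pattern and stride, apply the requested function, and
 * store the result at the destination address.                            */
static void sample_and_reduce(const int32_t *base, size_t len,
                              enum sample_pattern pattern, size_t stride,
                              size_t num_samples, enum sample_function func,
                              int64_t *dest)
{
    int64_t acc = 0;
    for (size_t i = 0; i < num_samples; ++i) {
        /* Choose the next sample location from the specified pattern. */
        size_t idx = (pattern == PATTERN_RANDOM)
                         ? (size_t)rand() % len
                         : (i * stride) % len;
        acc += base[idx];
    }
    /* FUNC_HASH is omitted here for brevity. */
    *dest = (func == FUNC_AVERAGE && num_samples > 0)
                ? acc / (int64_t)num_samples
                : acc;  /* FUNC_SUM */
}
```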
In one example, the data movement accelerator 130 includes the different supported functions 130c (sum, hash, average, etc.). Optionally, the functions may be registered (e.g., as bit-streams) and executed in a small accelerator in the DIMM or MC. For such an approach, the data movement accelerator 130 may include another interface to register the bit-streams/functions 130c and an interface to enable the application to discover what functions 130c are available (e.g., via an identifier/ID of the function, meta-data on what the function implements, and meta-data on the parameters used by the function). In an embodiment, an accelerator 130d selects and executes the function.
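For illustration only, the registration and discovery interfaces might be modeled as a small table of function descriptors similar to the following sketch; the identifiers and meta-data strings are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for a registered function 130c. */
struct function_descriptor {
    uint32_t    id;            /* identifier/ID used to invoke the function  */
    const char *description;   /* meta-data on what the function implements  */
    const char *parameters;    /* meta-data on the parameters it accepts     */
};

/* Hypothetical registry exposed through the discovery interface. */
static struct function_descriptor registry[16];
static size_t registry_count;

static void register_function(uint32_t id, const char *description,
                              const char *parameters)
{
    if (registry_count < sizeof(registry) / sizeof(registry[0]))
        registry[registry_count++] =
            (struct function_descriptor){ id, description, parameters };
}

static const struct function_descriptor *discover_function(uint32_t id)
{
    for (size_t i = 0; i < registry_count; ++i)
        if (registry[i].id == id)
            return &registry[i];
    return NULL;  /* not registered */
}
```

In such a model, an application might call register_function() for each bit-stream it supplies and later call discover_function() to learn what a given function implements and which parameters it uses.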
The illustrated AI request scheduler 132 includes an interface 132a that facilitates registering the different functions 130c that each of the accelerators connected to the platform exposes (e.g., when a better match is found). The interface 132a enables the registration of the function type for the selected implementation and the sampling configuration used to identify when the function is more suitable. In an embodiment, the sampling configuration is shared across multiple function implementations for a particular function type and is therefore hosted in a separate table (e.g., indexed by an ID). The interface 132a also facilitates the registration of the sampling threshold, which is a value used to decide whether an implementation is appropriate for the current data to be processed (e.g., a Boolean type of rule). Additionally, the interface 132a may be used to register the telemetry rule data that defines when a function is to be chosen based on the current platform telemetry (e.g., if the sampling configuration threshold matches). In one example, the interface 132a is also used to register the consumer or type of accelerator that implements the function.
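One possible (hypothetical) layout for a row of such a registration is sketched below; the telemetry rule is reduced to a single minimum-bandwidth field purely for illustration.

```c
#include <stdint.h>

/* Hypothetical row in the scheduler's configuration table 132d. */
struct implementation_entry {
    uint32_t function_type;       /* function type being registered          */
    uint32_t implementation_id;   /* the selected implementation             */
    uint32_t sampling_config_id;  /* index into the shared sampling table    */
    double   sampling_threshold;  /* Boolean-style rule: suitable when met   */
    uint64_t min_link_bandwidth;  /* assumed form of the telemetry rule      */
    uint32_t consumer_id;         /* accelerator type implementing the fn    */
};
```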
The illustrated AI request scheduler 132 also includes scheduling logic 132b to decide how to execute a particular function. More particularly, the AI request scheduler 132 uses the functionality of the data movement accelerator 130 to estimate the complexity of the data to be processed by the function. Once the data is to be processed, the AI request scheduler 132 uses the returned sampling result, the thresholds and the telemetry (e.g., if part of the rule) to select the best type of implementation for the corresponding function type. The AI request scheduler 132 may therefore also include a telemetry processing component 132c and a configuration table 132d.
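A hypothetical selection routine, reusing an abbreviated form of the table entry from the previous sketch, might proceed as follows; the specific Boolean rules (sparsity at or above the threshold, bandwidth at or above a minimum) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Abbreviated (hypothetical) configuration table entry. */
struct implementation_entry {
    uint32_t function_type;
    uint32_t implementation_id;
    double   sampling_threshold;   /* sparsity threshold for this choice   */
    uint64_t min_link_bandwidth;   /* telemetry rule (if part of the rule) */
};

/* Return the first registered implementation of the requested function type
 * whose sampling threshold and telemetry rule are both satisfied, falling
 * back to the last candidate of that type otherwise.                       */
static uint32_t select_implementation(const struct implementation_entry *table,
                                      size_t n, uint32_t function_type,
                                      double measured_sparsity,
                                      uint64_t link_bandwidth)
{
    uint32_t fallback = 0;
    for (size_t i = 0; i < n; ++i) {
        if (table[i].function_type != function_type)
            continue;
        fallback = table[i].implementation_id;
        if (measured_sparsity >= table[i].sampling_threshold &&
            link_bandwidth >= table[i].min_link_bandwidth)
            return table[i].implementation_id;
    }
    return fallback;
}
```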
Illustrated processing block 142 provides for sampling machine learning data from a local memory in accordance with a specified configuration, wherein the machine learning data is associated with one or more tasks submitted by one or more processor cores. In an embodiment, block 142 includes generating the specified configuration, wherein the specified configuration includes one or more of a pattern, a number of samples, a stride, a memory range, a function, and a destination address. Additionally, block 142 may involve executing a function on the sampled machine learning data, wherein the function is executed on the sampled machine learning data by an accelerator in one or more of a memory module containing the local memory or a memory controller coupled to the local memory. In an embodiment, the machine learning data is sampled by a data movement accelerator.
Block 144 estimates a complexity of the sampled machine learning data based on one or more thresholds (e.g., sparsity thresholds). Block 144 may also be conducted by the data movement accelerator. In one example, block 146 schedules the one or more tasks for execution by one or more accelerators (e.g., hardware accelerators) based on the complexity and telemetry data (e.g., bandwidth measurements) associated with a link to the one or more accelerators. In an embodiment, the task(s) are scheduled by an AI request scheduler and block 146 includes selecting a function implementation from a plurality of function implementations.
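Tying the blocks together, a hypothetical end-to-end sequence for the method 140 might be sketched as follows, where each illustrative stub corresponds loosely to one of blocks 142, 144 and 146 and the returned values are placeholders.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stubs standing in for blocks 142, 144 and 146. */
static double sample_ml_data(void)          { return 0.6; }     /* block 142 */
static bool   estimate_complexity(double s) { return s < 0.5; } /* block 144 */
static void   schedule_tasks(bool complex, uint64_t link_bw)    /* block 146 */
{
    /* Select a function implementation based on complexity and telemetry. */
    (void)complex;
    (void)link_bw;
}

int main(void)
{
    double sparsity = sample_ml_data();                 /* block 142 */
    bool   complex  = estimate_complexity(sparsity);    /* block 144 */
    schedule_tasks(complex, /* link bandwidth */ 1000); /* block 146 */
    return 0;
}
```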
The method 140 enhances performance at least to the extent that block 146 takes into consideration runtime conditions such as data sparsity and link telemetry when scheduling tasks for execution on the accelerators. Moreover, using a low overhead data movement accelerator to sample the ML data and estimate the complexity obviates software-related concerns such as, for example, increased traffic, page table misses, cache misses, cache pollution, coherence traffic (e.g., for read operations) and/or latency. Indeed, during training, the method 140 may be able to automatically select between an LSTM function implementation when the dataset/matrix is sparse and a less bandwidth-intensive transformer-based neural network function implementation when the dataset is less sparse.
In one example, the logic 154 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 152. Thus, the interface between the logic 154 and the substrate 152 may not be an abrupt junction. The logic 154 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate 152.
Example 1 includes a processor comprising one or more substrates and logic coupled to the one or more substrates, wherein the logic is implemented at least partly in one or more of configurable or fixed-functionality hardware logic, the logic coupled to the one or more substrates to sample machine learning data from a local memory in accordance with a specified configuration, wherein the machine learning data is associated with one or more tasks submitted by one or more processor cores, estimate a complexity of the sampled machine learning data based on one or more thresholds, and schedule the one or more tasks for execution by one or more accelerators based on the complexity and telemetry data associated with a link to the one or more accelerators.
Example 2 includes the processor of Example 1, wherein the logic coupled to the one or more substrates is to generate the specified configuration, and wherein the specified configuration includes one or more of a pattern, a number of samples, a stride, a memory range, a function, and a destination address.
Example 3 includes the processor of Example 1, wherein the logic coupled to the one or more substrates is to execute a function on the sampled machine learning data.
Example 4 includes the processor of Example 3, wherein the function is executed on the sampled machine learning data by an accelerator in one or more of a memory module containing the local memory or a memory controller coupled to the local memory.
Example 5 includes the processor of Example 1, wherein the one or more tasks are scheduled by an artificial intelligence (AI) request scheduler.
Example 6 includes the processor of Example 1, wherein to schedule the one or more tasks, the logic coupled to the one or more substrates is to select a function implementation from a plurality of function implementations.
Example 7 includes the processor of any one of Examples 1 to 6, wherein the machine learning data is sampled by a data streaming accelerator, and wherein the complexity is estimated by the data streaming accelerator.
Example 8 includes a performance-enhanced computing system comprising a local memory, one or more processor cores, one or more accelerators, and a processor coupled to the local memory, the one or more processor cores, and the one or more accelerators, wherein the processor includes logic coupled to one or more substrates, the logic to sample machine learning data from the local memory in accordance with a specified configuration, wherein the machine learning data is associated with one or more tasks submitted by the one or more processor cores, estimate a complexity of the sampled machine learning data based on one or more thresholds, and schedule the one or more tasks for execution by the one or more accelerators based on the complexity and telemetry data associated with a link to the one or more accelerators.
Example 9 includes the computing system of Example 8, wherein the logic is to generate the specified configuration, and wherein the specified configuration includes one or more of a pattern, a number of samples, a stride, a memory range, a function, and a destination address.
Example 10 includes the computing system of Example 8, wherein the logic is to execute a function on the sampled machine learning data.
Example 11 includes the computing system of Example 10, further including a memory module containing the local memory and a memory controller coupled to the local memory, wherein the function is executed on the sampled machine learning data by an accelerator in one or more of the memory module or the memory controller.
Example 12 includes the computing system of Example 8, wherein the one or more tasks are scheduled by an artificial intelligence (AI) request scheduler.
Example 13 includes the computing system of Example 8, wherein to schedule the one or more tasks, the logic is to select a function implementation from a plurality of function implementations.
Example 14 includes the computing system of any one of Examples 8 to 13, wherein the machine learning data is sampled by a data streaming accelerator, and wherein the complexity is estimated by the data streaming accelerator.
Example 15 includes a method of operating a performance-enhanced computing system, the method comprising sampling machine learning data from a local memory in accordance with a specified configuration, wherein the machine learning data is associated with one or more tasks submitted by one or more processor cores, estimating a complexity of the sampled machine learning data based on one or more thresholds, and scheduling the one or more tasks for execution by one or more accelerators based on the complexity and telemetry data associated with a link to the one or more accelerators.
Example 16 includes the method of Example 15, further including generating the specified configuration, wherein the specified configuration includes one or more of a pattern, a number of samples, a stride, a memory range, a function, and a destination address.
Example 17 includes the method of Example 15, further including executing a function on the sampled machine learning data, wherein the function is executed on the sampled machine learning data by an accelerator in one or more of a memory module containing the local memory or a memory controller coupled to the local memory.
Example 18 includes the method of Example 15, wherein the one or more tasks are scheduled by an artificial intelligence (AI) request scheduler.
Example 19 includes the method of Example 15, wherein scheduling the one or more tasks includes selecting a function implementation from a plurality of function implementations.
Example 20 includes the method of any one of Examples 15 to 19, wherein the machine learning data is sampled by a data streaming accelerator, and wherein the complexity is estimated by the data streaming accelerator.
Example 21 includes an apparatus comprising means for performing the method of any one of Examples 15 to 20.
Technology described herein therefore provides an effective way for software stacks to use hardware acceleration technology to understand or estimate the complexity of data to be processed. The technology also enables efficient determinations of the most appropriate function implementations to be used.
Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the platform within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrases “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C.
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.