This application relates to the field of information technologies, and in particular, to a memory sharing control method and device, and a system.
With the popularization of big data technologies, applications in various fields have increasing requirements on computing resources. Large-scale computing, represented by applications such as graph computing and deep learning, is the latest direction of application development. In addition, as development of semiconductor technologies slows, application performance can no longer be continuously improved simply through processor upgrades, and multi-core processors have gradually become mainstream.
A multi-core processor system has an increasingly high requirement for memory capacity. As an indispensable component of a server, the memory accounts for 30% to 40% of the total cost of the server. Improving utilization of the memory is therefore an important means of reducing the total cost of ownership (TCO).
This application provides a memory sharing control method and device, a computer device, and a system, to improve utilization of memory resources.
According to a first aspect, this application provides a computer device, including at least two processing units, a memory sharing control device, and a memory pool, where the processing unit is a processor, a core in a processor, or a combination of cores in a processor, the memory pool includes one or more memories, and at least one memory in the memory pool is accessible by different processing units in different time periods.
The at least two processing units in the computer device can access the at least one memory in the memory pool in different time periods via the memory sharing control device, to implement memory sharing by a plurality of processing units, so that utilization of memory resources is improved.
Optionally, that at least one memory in the memory pool is accessible by different processing units in different time periods means that any two of the at least two processing units can separately access the at least one memory in the memory pool in different time periods. For example, the at least two processing units include a first processing unit and a second processing unit. In a first time period, a first memory in the memory pool is accessed by the first processing unit, and the second processing unit cannot access the first memory. In a second time period, the first memory in the memory pool is accessed by the second processing unit, and the first processing unit cannot access the first memory. Optionally, the processor may be a central processing unit (CPU), and one CPU may include two or more cores.
Optionally, one of the at least two processing units may be a processor, a core in a processor, a combination of a plurality of cores in a processor, or a combination of a plurality of cores in different processors. When a combination of a plurality of cores in one processor, or a combination of a plurality of cores in different processors, is used as a processing unit, a plurality of different cores can access a same memory when executing tasks in parallel, so that efficiency of performing parallel computing by the plurality of different cores can be improved.
Optionally, the memory sharing control device may separately allocate a memory from the memory pool to the at least two processing units based on a received control instruction sent by an operating system in the computer device. Specifically, a driver in the operating system may send, to the memory sharing control device over a dedicated channel, the control instruction used to allocate the memory in the memory pool to the at least two processing units. The operating system is implemented by the CPU in the computer device by executing related code. The CPU that runs the operating system has a privilege mode, and in this mode, the driver in the operating system can send the control instruction to the memory sharing control device over a dedicated channel or a specified channel.
Optionally, the memory sharing control device may be implemented by using a field programmable gate array (FPGA) chip, an application-specific integrated circuit (ASIC), or another similar chip. Circuit functions of the ASIC have been defined at the beginning of design, and the ASIC has features of high chip integration, being easy to implement mass tapeouts, low cost of a single tapeout, a small size, and the like.
In some possible implementations, the at least two processing units are connected to the memory sharing control device via a serial bus, and the memory sharing control device is configured to receive, via the serial bus, a first memory access request sent in a serial signal form by a first processing unit in the at least two processing units, where the first memory access request is used to access a first memory allocated to the first processing unit.
The serial bus has characteristics of high bandwidth and low latency. The at least two processing units are connected to the memory sharing control device via the serial bus, so that efficiency of data transmission between the processing unit and the memory sharing control device can be ensured.
Optionally, the serial bus is a memory semantic bus. The memory semantic bus includes but is not limited to a quick path interconnect (QPI), peripheral component interconnect express (PCIe), Huawei cache coherence system (HCCS), or compute express link (CXL) interconnect-based bus.
Optionally, the memory access request generated by the first processing unit is a memory access request in a parallel signal form. The first processing unit may convert the memory access request in the parallel signal form into the first memory access request in the serial signal form through an interface that can implement conversion between a parallel signal and a serial signal, for example, a Serdes interface, and send the first memory access request in the serial signal form to the memory sharing control device via the serial bus.
In some possible implementations, the memory sharing control device includes a processor interface, and the processor interface is configured to: convert the first memory access request into a second memory access request in a parallel signal form, so that the memory sharing control device accesses the first memory based on the second memory access request.
The processor interface converts the first memory access request into the second memory access request in the parallel signal form, so that the memory sharing control device can access the first memory, thereby implementing memory sharing without changing an existing memory access architecture.
Optionally, the processor interface is the interface that can implement the conversion between the parallel signal and the serial signal, for example, may be the Serdes interface.
In some possible implementations, the memory sharing control device includes a control unit, and the control unit is configured to: establish a correspondence between a memory address of the first memory in the memory pool and the first processing unit, to allocate the first memory from the memory pool to the first processing unit.
Optionally, the correspondence between the memory address of the first memory and the first processing unit may be dynamically adjusted as required.
Optionally, the memory address of the first memory may be a segment of consecutive physical memory addresses in the memory pool, which simplifies management of the first memory. Certainly, the memory address of the first memory may alternatively be several segments of inconsecutive physical memory addresses in the memory pool.
Optionally, memory address information of the first memory includes a start address of the first memory and a size of the first memory. The first processing unit has a unique identifier, and establishing a correspondence between the memory address of the first memory and the first processing unit may be establishing a correspondence between the unique identifier of the first processing unit and the memory address information of the first memory.
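For ease of understanding only, the following is a minimal C sketch of such a correspondence; the structure and function names, the table size, and the field layout are assumptions for illustration, not part of this application.

```c
#include <stddef.h>
#include <stdint.h>

/* One correspondence: the unique identifier of a processing unit is
 * bound to the start address and size of the memory allocated to it. */
struct mem_binding {
    uint32_t pu_id;      /* unique identifier of the processing unit */
    uint64_t start_addr; /* start address of the allocated memory */
    uint64_t size;       /* size of the allocated memory, in bytes */
    int      valid;      /* whether this correspondence is established */
};

#define MAX_BINDINGS 16
static struct mem_binding table[MAX_BINDINGS];

/* Establish a correspondence; returns 0 on success, -1 if the table is full. */
int establish_binding(uint32_t pu_id, uint64_t start, uint64_t size)
{
    for (size_t i = 0; i < MAX_BINDINGS; i++) {
        if (!table[i].valid) {
            table[i] = (struct mem_binding){pu_id, start, size, 1};
            return 0;
        }
    }
    return -1;
}
```

Under this sketch, dynamically adjusting the correspondence amounts to clearing or rewriting an entry in the table.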
In some possible implementations, the memory sharing control device includes a control unit, and the control unit is configured to:
Optionally, the first virtual memory device may be allocated to the first processing unit by establishing an access control table. For example, the access control table may include information such as the identifier of the first processing unit, an identifier of the first virtual memory device, and the start address and the size of the memory corresponding to the first virtual memory device. The access control table may further include permission information of accessing the first virtual memory device by the first processing unit, attribute information of a memory to be accessed (including but not limited to information about whether the memory is a persistent memory), and the like.
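As an illustration of such an access control table, the following C sketch uses assumed field names and permission bits; the actual table layout is not limited in this application.

```c
#include <stdbool.h>
#include <stdint.h>

#define PERM_READ  0x1
#define PERM_WRITE 0x2

/* One access control table entry tying a processing unit to a virtual
 * memory device, with permission and attribute information. */
struct acl_entry {
    uint32_t pu_id;      /* identifier of the processing unit */
    uint32_t vmd_id;     /* identifier of the virtual memory device */
    uint64_t start_addr; /* start address of the corresponding memory */
    uint64_t size;       /* size of the corresponding memory */
    uint8_t  perm;       /* access permission bits */
    bool     persistent; /* attribute: whether the memory is persistent */
};

/* Check whether pu_id may write to addr under this entry. */
bool acl_allows_write(const struct acl_entry *e, uint32_t pu_id,
                      uint64_t addr)
{
    return e->pu_id == pu_id && (e->perm & PERM_WRITE) &&
           addr >= e->start_addr && addr < e->start_addr + e->size;
}
```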
In some possible implementations, the control unit is further configured to:
Optionally, the correspondence between the virtual memory device and the processing unit may be dynamically adjusted based on a memory resource requirement of the at least two processing units.
The correspondence between the virtual memory device and the processing unit is dynamically adjusted, so that memory resource requirements of different processing units in different service scenarios can be flexibly adapted, and utilization of memory resources can be improved.
Optionally, the preset condition may be that a memory access requirement of the first processing unit decreases, and a memory access requirement of the second processing unit increases.
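A minimal C sketch of this adjustment is shown below; the demand metric used as the preset condition is an assumption for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

struct vmd {             /* a virtual memory device */
    uint32_t id;
    uint32_t owner_pu;   /* processing unit it is currently allocated to */
    bool     bound;
};

/* Assumed preset condition: the first unit's memory access requirement
 * decreases while the second unit's requirement increases. */
static bool preset_condition_met(uint32_t pu1_demand, uint32_t pu2_demand)
{
    return pu1_demand < pu2_demand;
}

/* Cancel the correspondence with the first processing unit and
 * establish one with the second when the preset condition is met. */
void maybe_rebind(struct vmd *dev, uint32_t pu1, uint32_t pu2,
                  uint32_t pu1_demand, uint32_t pu2_demand)
{
    if (dev->bound && dev->owner_pu == pu1 &&
        preset_condition_met(pu1_demand, pu2_demand))
        dev->owner_pu = pu2;
}
```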
Optionally, the control unit is further configured to: adjust, based on a control instruction sent by a driver in an operating system, the correspondence between the virtual memory device and the processing unit.
In some possible implementations, the memory sharing control device further includes a cache unit, and the cache unit is configured to: cache data read by any one of the at least two processing units from the memory pool, or cache data evicted by any one of the at least two processing units.
Efficiency of accessing the memory data by the processing unit can be further improved by using the cache unit.
Optionally, the cache unit may include a level 1 cache and a level 2 cache. The level 1 cache may be a small-capacity cache with a read/write speed higher than that of the level 2 cache, for example, a 100-megabyte (MB) nanosecond-level cache. The level 2 cache may be a large-capacity cache with a read/write speed lower than that of the level 1 cache, for example, a 1-gigabyte (GB) dynamic random access memory (DRAM). Using both a level 1 cache and a level 2 cache increases the data access speed of the processor through the caches while also increasing cache space, expanding the range of memory that the processor can quickly access through the caches, and generally improving the memory access rate of the processor resource pool.
Optionally, data in the memory may be first cached in the level 2 cache, and the data in the level 2 cache is then cached in the level 1 cache based on a requirement of the processing unit for the memory data. Alternatively, data that is evicted by the processing unit or that temporarily does not need to be processed may be cached in the level 1 cache, and some data evicted from the level 1 cache may be cached in the level 2 cache, to ensure that the level 1 cache has sufficient space for other processing units to cache data.
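The following direct-mapped C sketch illustrates this two-level flow (memory data fills the level 2 cache first, is promoted to level 1 on demand, and level 1 evictions are demoted to level 2); the cache sizes and the mapping scheme are assumptions for illustration.

```c
#include <stdint.h>

#define L1_LINES 8    /* small, fast level 1 cache */
#define L2_LINES 64   /* larger, slower level 2 cache */

struct line { uint64_t tag; int valid; };

static struct line l1[L1_LINES], l2[L2_LINES];

/* Place a line in level 1; a line evicted from level 1 is demoted to
 * level 2 so that level 1 keeps space free for other processing units. */
static void fill_l1(uint64_t addr)
{
    struct line *e = &l1[addr % L1_LINES];
    if (e->valid)
        l2[e->tag % L2_LINES] = *e; /* demote the evicted line */
    e->tag = addr;
    e->valid = 1;
}

/* Access one line-granular address; returns the cache level that served
 * it (1 or 2), or 0 when the memory pool had to be read. */
int cache_access(uint64_t addr)
{
    struct line *e1 = &l1[addr % L1_LINES];
    if (e1->valid && e1->tag == addr)
        return 1;
    struct line *e2 = &l2[addr % L2_LINES];
    if (e2->valid && e2->tag == addr) {
        fill_l1(addr);              /* promote to level 1 on demand */
        return 2;
    }
    l2[addr % L2_LINES] = (struct line){addr, 1}; /* fill level 2 first */
    return 0;
}
```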
In some possible implementations, the memory sharing control device further includes a prefetch engine, and the prefetch engine is configured to: prefetch, from the memory pool, the data that needs to be read by any one of the at least two processing units, and cache the data in the cache unit.
Optionally, the prefetch engine may implement intelligent data prefetching by using a specified algorithm or a related artificial intelligence (AI) algorithm, to further improve efficiency of accessing the memory data by the processing unit.
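As one concrete stand-in for the specified algorithm (the algorithm itself is not fixed here; an AI-based predictor could equally be used), a sequential next-line prefetcher can be sketched as follows, with an assumed cache-line size.

```c
#include <stdint.h>

#define LINE_SIZE 64 /* assumed cache-line size in bytes */

/* Given the address a processing unit just read, return the address the
 * prefetch engine would fetch from the memory pool into the cache unit. */
uint64_t next_prefetch_addr(uint64_t last_addr)
{
    return (last_addr / LINE_SIZE + 1) * LINE_SIZE; /* next sequential line */
}
```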
In some possible implementations, the memory sharing control device further includes a quality of service (QoS) engine, and the QoS engine is configured to implement optimized storage, in the cache unit, of the data that needs to be cached by any one of the at least two processing units. By using the QoS engine, different capabilities of caching the memory data accessed by different processing units in the cache unit can be implemented. For example, a memory access request initiated by a processing unit with a high priority has exclusive cache space in the cache unit. In this way, it can be ensured that the data accessed by such a processing unit can be cached in time, so that service processing quality of this type of processing unit is ensured.
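For illustration, the exclusive cache space for high-priority processing units could be realized by partitioning the cache unit; the partition sizes below are assumptions.

```c
#include <stdint.h>

#define CACHE_LINES    128
#define RESERVED_LINES 32 /* exclusive to high-priority processing units */

/* Map a request to a region of the cache unit based on its priority, so
 * that high-priority data can always be cached in time. */
void cache_region(int high_priority, uint32_t *first, uint32_t *count)
{
    if (high_priority) {  /* exclusive region, never evicted by others */
        *first = 0;
        *count = RESERVED_LINES;
    } else {              /* shared region for all other requests */
        *first = RESERVED_LINES;
        *count = CACHE_LINES - RESERVED_LINES;
    }
}
```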
In some possible implementations, the memory sharing control device further includes a compression/decompression engine, and the compression/decompression engine is configured to: compress or decompress data related to memory access.
Optionally, a function of the compression/decompression engine may be disabled.
The compression/decompression engine may compress, by using a compression algorithm at a granularity of 4 kilobytes (KB) per page, data written by the processing unit into the memory, and then write the compressed data into the memory; or decompress compressed data when the processing unit reads the compressed data in the memory, and then send the decompressed data to the processor. In this way, a data transmission rate can be improved, and efficiency of accessing the memory data by the processing unit can be further improved.
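The write path can be sketched as follows at the 4 KB page granularity; the trivial run-length encoder stands in for whichever compression algorithm the engine actually uses and is purely illustrative (the output buffer must allow up to twice the page size in the worst case).

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Stand-in compression: run-length encode one page into dst and return
 * the compressed size (worst case 2 * PAGE_SIZE bytes). */
size_t compress_page(const uint8_t *src, uint8_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < PAGE_SIZE; ) {
        uint8_t b = src[i];
        size_t run = 1;
        while (i + run < PAGE_SIZE && src[i + run] == b && run < 255)
            run++;
        dst[out++] = (uint8_t)run; /* run length */
        dst[out++] = b;            /* byte value */
        i += run;
    }
    return out;
}

/* Write one page: compress when the engine is enabled, else copy as-is. */
size_t write_page(const uint8_t *page, uint8_t *mem, int engine_enabled)
{
    if (!engine_enabled) { /* the engine's function may be disabled */
        for (size_t i = 0; i < PAGE_SIZE; i++)
            mem[i] = page[i];
        return PAGE_SIZE;
    }
    return compress_page(page, mem);
}
```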
Optionally, the memory sharing control device further includes a storage unit, where the storage unit includes software code of at least one of the QoS engine, the prefetch engine, and the compression/decompression engine. The memory sharing control device may read the code in the storage unit to implement a corresponding function.
Optionally, the at least one of the QoS engine, the prefetch engine, and the compression/decompression engine may be implemented by using control logic of the memory sharing control device.
In some possible implementations, the first processing unit further has a local memory, and the local memory is used for memory access of the first processing unit. Optionally, the first processing unit may preferentially access the local memory. The first processing unit has a higher speed of accessing the local memory, so that the speed of accessing the memory by the first processing unit can be further improved.
In some possible implementations, the plurality of memories included in the memory pool are of different medium types. For example, the memory pool may include at least one of the following memory media: a DRAM, a phase change memory (PCM), a storage class memory (SCM), a static random access memory (SRAM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), a NAND flash memory, a spin-transfer torque random access memory (STT-RAM), or a resistive random access memory (RRAM). The memory pool may further include a dual in-line memory module (DIMM), or a solid-state disk (SSD).
Different memory media can meet memory resource requirements when different processing units process different services. For example, the DRAM has features of a high read/write speed and volatility, and a memory of the DRAM may be allocated to a processing unit that initiates hot data access. The PCM has a non-volatile feature, and a memory of the PCM may be allocated to a processing unit that accesses data that needs to be stored for a long term. In this way, flexibility of memory access control can be improved while a memory resource is shared.
For example, the memory pool includes a volatile DRAM storage medium and a non-volatile PCM storage medium. The DRAM and the PCM in the memory pool may be in a parallel architecture, with no hierarchical levels. Alternatively, a non-parallel architecture may be used in which the DRAM serves as a cache and the PCM serves as a main memory, that is, the DRAM is a first-level storage medium and the PCM is a second-level storage medium. For the architecture in which the DRAM and the PCM are parallel to each other, the control unit may store frequently-accessed hot data in the DRAM, in other words, establish a correspondence between a processing unit that initiates access to frequently-accessed hot data and a virtual memory device corresponding to the memory of the DRAM. In this way, the read/write speed of the memory data and the service life of the main memory system can be improved. The control unit may further establish a correspondence between a processing unit that initiates access to less frequently-accessed cold data and a virtual memory device corresponding to the memory of the PCM, to store the cold data in the PCM. In this way, security of important data can be ensured based on the non-volatile feature of the PCM. For the architecture in which the DRAM and the PCM are not parallel to each other, based on the high integration of the PCM and the low read/write latency of the DRAM, the control unit may use the PCM as a main memory to store various types of data, and use the DRAM as a cache. In this way, memory access efficiency and performance can be further improved.
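For the parallel architecture, the placement decision can be sketched as below; the access-count threshold is an assumed value, not part of this application.

```c
#include <stdint.h>

enum medium { MEDIUM_DRAM, MEDIUM_PCM };

#define HOT_THRESHOLD 100 /* assumed accesses-per-interval threshold */

/* Choose a medium for data based on how frequently it is accessed and
 * whether it must survive power loss (PCM is non-volatile). */
enum medium choose_medium(uint32_t access_count, int needs_persistence)
{
    if (needs_persistence)
        return MEDIUM_PCM;
    return access_count >= HOT_THRESHOLD ? MEDIUM_DRAM : MEDIUM_PCM;
}
```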
According to a second aspect, this application provides a system, including at least two computer devices according to the first aspect, and the at least two computer devices according to the first aspect are connected to each other through a network.
A computer device in the system can not only access its local memory pool via a memory sharing control device, to improve memory utilization, but can also access a memory pool on another computer device through the network. The range of the memory pool is expanded, so that utilization of memory resources can be further improved.
Optionally, the memory sharing control device in a computer device in the system may alternatively have a function of a network adapter, and can send an access request of a processing unit to another computer device in the system through the network, to access a memory of that computer device.
Optionally, a computer device in the system may alternatively include a network adapter having a serial-to-parallel interface (for example, a Serdes interface). The memory sharing control device in the computer device may send, by using the network adapter, a memory access request of a processing unit to another computer device in the system through the network, to access a memory of that computer device.
Optionally, the computer device in the system may be connected through an Ethernet-based network or a unified bus (U-bus)-based network.
According to a third aspect, this application provides a memory sharing control device, where the memory sharing control device includes a control unit, a processor interface, and a memory interface.
The processor interface is configured to receive memory access requests sent by at least two processing units, where the processing unit is a processor, a core in a processor, or a combination of cores in a processor.
The control unit is configured to separately allocate a memory from a memory pool to the at least two processing units, where at least one memory in the memory pool is accessible by different processing units in different time periods.
The control unit is further configured to access, through the memory interface, the memory allocated to the at least two processing units.
Via the memory sharing control device, different processing units can access the at least one memory in the memory pool in different time periods, so that memory resource requirements of the processing units can be met, and utilization of memory resources is improved.
Optionally, that at least one memory in the memory pool is accessible by different processing units in different time periods means that any two of the at least two processing units can separately access the at least one memory in the memory pool in different time periods. For example, the at least two processing units include a first processing unit and a second processing unit. In a first time period, a first memory in the memory pool is accessed by the first processing unit, and the second processing unit cannot access the first memory. In a second time period, the first memory in the memory pool is accessed by the second processing unit, and the first processing unit cannot access the first memory.
Optionally, the memory interface may be a double data rate (DDR) controller, or the memory interface may be a memory controller with a PCM control function.
Optionally, the memory sharing control device may separately allocate a memory from the memory pool to the at least two processing units based on a received control instruction sent by an operating system in the computer device. Specifically, a driver in the operating system may send, to the memory sharing control device over a dedicated channel, the control instruction used to allocate the memory in the memory pool to the at least two processing units. The operating system is implemented by the CPU in the computer device by executing related code. The CPU that runs the operating system has a privilege mode, and in this mode, the driver in the operating system can send the control instruction to the memory sharing control device over a dedicated channel or a specified channel.
Optionally, the memory sharing control device may be implemented by an FPGA chip, an ASIC, or another similar chip.
In some possible implementations, the processor interface is further configured to receive, via a serial bus, a first memory access request sent in a serial signal form by a first processing unit in the at least two processing units, where the first memory access request is used to access a first memory allocated to the first processing unit.
The serial bus has characteristics of high bandwidth and low latency. The first memory access request sent by the first processing unit in the at least two processing units in the serial signal form is received via the serial bus, so that efficiency of data transmission between the processing unit and the memory sharing control device can be ensured.
Optionally, the serial bus is a memory semantic bus. The memory semantic bus includes but is not limited to a QPI, PCIe, HCCS, or CXL protocol interconnect-based bus.
In some possible implementations, the processor interface is further configured to: convert the first memory access request into a second memory access request in a parallel signal form, and send the second memory access request to the control unit.
The control unit is further configured to access the first memory based on the second memory access request through the memory interface.
Optionally, the processor interface is the interface that can implement the conversion between the parallel signal and the serial signal, for example, may be the Serdes interface.
In some possible implementations, the control unit is further configured to establish a correspondence between a memory address of the first memory in the memory pool and the first processing unit, to allocate the first memory from the memory pool to the first processing unit.
Optionally, the correspondence between the memory address of the first memory and the first processing unit may be dynamically adjusted as required.
Optionally, the memory address of the first memory may be a segment of consecutive physical memory addresses in the memory pool, which simplifies management of the first memory. Certainly, the memory address of the first memory may alternatively be several segments of inconsecutive physical memory addresses in the memory pool.
Optionally, memory address information of the first memory includes a start address of the first memory and a size of the first memory. The first processing unit has a unique identifier, and establishing a correspondence between the memory address of the first memory and the first processing unit may be establishing a correspondence between the unique identifier of the first processing unit and the memory address information of the first memory.
In some possible implementations, the control unit is further configured to: virtualize a plurality of virtual memory devices from the memory pool, where a physical memory corresponding to a first virtual memory device in the plurality of virtual memory devices is the first memory; and allocate the first virtual memory device to the first processing unit.
Optionally, the virtual memory device corresponds to a segment of consecutive physical memory addresses in the memory pool, which simplifies management of the virtual memory device. Certainly, the virtual memory device may alternatively correspond to several segments of inconsecutive physical memory addresses in the memory pool.
Optionally, the first virtual memory device may be allocated to the first processing unit by establishing an access control table. For example, the access control table may include information such as the identifier of the first processing unit, an identifier of the first virtual memory device, and the start address and the size of the memory corresponding to the first virtual memory device. The access control table may further include permission information of accessing the first virtual memory device by the first processing unit, attribute information of a memory to be accessed (including but not limited to information about whether the memory is a persistent memory), and the like.
In some possible implementations, the control unit is further configured to: cancel a correspondence between the first virtual memory device and the first processing unit when a preset condition is met, and establish a correspondence between the first virtual memory device and a second processing unit in the at least two processing units.
Optionally, the correspondence between the virtual memory device and the processing unit may be dynamically adjusted based on a memory resource requirement of the at least two processing units.
The correspondence between the virtual memory device and the processing unit is dynamically adjusted, so that memory resource requirements of different processing units in different service scenarios can be flexibly adapted, and utilization of memory resources can be improved.
Optionally, the control unit is further configured to: adjust, based on a control instruction sent by a driver in an operating system, the correspondence between the virtual memory device and the processing unit.
In some possible implementations, the memory sharing control device further includes a cache unit.
The cache unit is configured to: cache data read by any one of the at least two processing units from the memory pool, or cache data evicted by any one of the at least two processing units.
Efficiency of accessing the memory data by the processing unit can be further improved by using the cache unit.
Optionally, the cache unit may include a level 1 cache and a level 2 cache. The level 1 cache may be a small-capacity cache with a read/write speed higher than that of the level 2 cache, for example, a 100-MB nanosecond-level cache. The level 2 cache may be a large-capacity cache with a read/write speed lower than that of the level 1 cache, for example, a 1-GB DRAM. Using both a level 1 cache and a level 2 cache increases the data access speed of the processor through the caches while also increasing cache space, expanding the range of memory that the processor can quickly access through the caches, and generally improving the memory access rate of the processor resource pool.
In some possible implementations, the memory sharing control device further includes a prefetch engine, and the prefetch engine is configured to: prefetch, from the memory pool, the data that needs to be read by any one of the at least two processing units, and cache the data in the cache unit.
Optionally, the prefetch engine may implement intelligent data prefetching by using a specified algorithm or an AI algorithm, to further improve efficiency of accessing the memory data by the processing unit.
In some possible implementations, the memory sharing control device further includes a quality of service (QoS) engine.
The QoS engine is configured to implement optimized storage, in the cache unit, of the data that needs to be cached by any one of the at least two processing units. By using the QoS engine, different capabilities of caching the memory data accessed by different processing units in the cache unit can be implemented. For example, a memory access request initiated by a processing unit with a high priority has exclusive cache space in the cache unit. In this way, it can be ensured that the data accessed by such a processing unit can be cached in time, so that service processing quality of this type of processing unit is ensured.
In some possible implementations, the memory sharing control device further includes a compression/decompression engine.
The compression/decompression engine is configured to: compress or decompress data related to memory access.
Optionally, a function of the compression/decompression engine may be disabled.
Optionally, the compression/decompression engine may compress, by using a compression algorithm at a granularity of 4 KB per page, data written by the processing unit into a memory, and then write the compressed data into the memory; or decompress compressed data when the processing unit reads the compressed data in the memory, and then send the decompressed data to the processor. In this way, a data transmission rate can be improved, and efficiency of accessing the memory data by the processing unit can be further improved.
Optionally, the memory sharing control device may further include a storage unit, where the storage unit includes software code of at least one of the QoS engine, the prefetch engine, and the compression/decompression engine. The memory sharing control device may read the code in the storage unit to implement a corresponding function.
Optionally, the at least one of the QoS engine, the prefetch engine, and the compression/decompression engine may be implemented by using control logic of the memory sharing control device.
According to a fourth aspect, this application provides a memory sharing control method, where the method is applied to a computer device, the computer device includes at least two processing units, a memory sharing control device, and a memory pool, the memory pool includes one or more memories, and the method includes:
The memory sharing control device receives a first memory access request sent by a first processing unit in the at least two processing units, where the processing unit is a processor, a core in a processor, or a combination of cores in a processor;
The memory sharing control device allocates a first memory from the memory pool to the first processing unit, where the first memory is accessible by a second processing unit in the at least two processing units in another time period.
The first processing unit accesses the first memory via the memory sharing control device.
According to the method, different processing units access the at least one memory in the memory pool in different time periods, so that memory resource requirements of the processing units can be met, and utilization of memory resources is improved.
In a possible implementation, the method further includes:
The memory sharing control device receives, via a serial bus, a first memory access request sent in a serial signal form by the first processing unit in the at least two processing units, where the first memory access request is used to access the first memory allocated to the first processing unit.
In a possible implementation, the method further includes:
The memory sharing control device converts the first memory access request into a second memory access request in a parallel signal form, and accesses the first memory based on the second memory access request.
In a possible implementation, the method further includes:
The memory sharing control device establishes a correspondence between a memory address of the first memory in the memory pool and the first processing unit in the at least two processing units.
In a possible implementation, the method further includes:
The memory sharing control device virtualizes a plurality of virtual memory devices from the memory pool, where a physical memory corresponding to a first virtual memory device in the plurality of virtual memory devices is the first memory.
The memory sharing control device further allocates the first virtual memory device to the first processing unit.
In a possible implementation, the method further includes:
The memory sharing control device cancels a correspondence between the first virtual memory device and the first processing unit when a preset condition is met, and establishes a correspondence between the first virtual memory device and the second processing unit in the at least two processing units.
In a possible implementation, the method further includes:
The memory sharing control device caches data read by any one of the at least two processing units from the memory pool, or caches data evicted by any one of the at least two processing units.
In a possible implementation, the method further includes:
The memory sharing control device prefetches, from the memory pool, the data that needs to be read by any one of the at least two processing units, and caches the data.
In a possible implementation, the method further includes:
The memory sharing control device controls optimized storage, in a cache storage medium, of the data that needs to be cached by any one of the at least two processing units.
In a possible implementation, the method further includes:
The memory sharing control device compresses or decompresses data related to memory access.
According to a fifth aspect, an embodiment of this application further provides a chip, and the chip is configured to implement a function implemented by the memory sharing control device according to the third aspect.
According to a sixth aspect, an embodiment of this application further provides a computer-readable storage medium, including program code. The program code includes instructions used to perform some or all of steps in any method provided in the fourth aspect.
According to a seventh aspect, an embodiment of this application further provides a computer program product. When the computer program product runs on a computer, the computer is enabled to perform any method according to the fourth aspect.
It may be understood that any memory sharing control device, computer-readable storage medium, or computer program product provided above is configured to perform a corresponding method provided above. Therefore, for an advantageous effect that can be achieved by the memory sharing control device, the computer-readable storage medium, or the computer program product, refer to an advantageous effect in the corresponding method. Details are not described herein again.
The following briefly describes the accompanying drawings required for describing embodiments. It is clear that the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative efforts.
The following describes embodiments of the present invention with reference to the accompanying drawings.
In the specification, claims, and accompanying drawings of this application, the terms “first”, “second”, and the like are intended to distinguish between similar objects but do not necessarily indicate a specific order or sequence. It should be understood that, the data termed in such a way is interchangeable in proper circumstances, so that embodiments described herein can be implemented in an order other than the order illustrated or described herein. In addition, the terms “first” and “second” are merely intended for a purpose of description, and shall not be understood as an indication or implication of relative importance or implicit indication of a quantity of indicated technical features. Therefore, a feature limited by “first” or “second” may explicitly or implicitly include one or more features.
In the specification and claims of this application, the terms “include”, “have” and any other variants mean to cover a non-exclusive inclusion, for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those expressly listed steps or modules, but may include other steps or modules not expressly listed or inherent to such a process, method, product, or device. Names or numbers of steps in this application do not mean that the steps in the method procedure need to be performed in a time/logical sequence indicated by the names or numbers. An execution sequence of the steps in the procedure that have been named or numbered can be changed based on a technical objective to be achieved, provided that same or similar technical effects can be achieved. Unit division in this application is logical division and may be other division during actual implementation. For example, a plurality of units may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the units may be implemented in electronic or other similar forms. This is not limited in this application. In addition, units or subunits described as separate components may or may not be physically separate, may or may not be physical units, or may be distributed into a plurality of circuit units. Some or all of the units may be selected depending on actual requirements to achieve the objectives of the solutions of this application.
It should be understood that the terms used in the descriptions of the various examples in the specification and claims of this application are merely intended to describe specific examples, but are not intended to limit the examples. The terms “one” (“a” and “an”) and “the” of singular forms used in the descriptions of various examples and the appended claims are also intended to include plural forms, unless otherwise specified in the context clearly.
It should also be understood that the term “and/or” used in the specification and claims of this application indicates and includes any or all possible combinations of one or more items in associated listed items. The term “and/or” describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate the following three cases: Only A exists, both A and B exist, and only B exists. In addition, the character “/” in this application usually indicates an “or” relationship between associated objects.
It should be understood that determining B based on A does not mean that B is determined based only on A. B may alternatively be determined based on A and/or other information.
It should be further understood that the term “include” (also referred to as “includes”, “including”, “comprises”, and/or “comprising”) used in this specification specifies presence of stated features, integers, steps, operations, elements, and/or components, but does not preclude presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should be further understood that the term “if” may be interpreted as a meaning “when” (“when” or “upon”), “in response to determining”, or “in response to detecting”. Similarly, according to the context, the phrase “if it is determined that” or “if (a stated condition or event) is detected” may be interpreted as a meaning of “when it is determined that”, “in response to determining”, “when (a stated condition or event) is detected”, or “in response to detecting (a stated condition or event)”.
It should be understood that “one embodiment”, “an embodiment”, and “a possible implementation” mentioned in the entire specification mean that particular features, structures, or characteristics related to an embodiment or the implementations are included in at least one embodiment of this application. Therefore, “in one embodiment”, “in an embodiment”, or “in a possible implementation” appearing throughout this specification does not necessarily mean a same embodiment. In addition, these specified features, structures, or characteristics may be combined in one or more embodiments in any proper manner.
First, some terms and related technologies in this application are explained and described, to facilitate understanding.
A memory controller is an important component that controls the memory inside a computer system and implements data exchange between the memory and a processor, and is a bridge for communication between the central processing unit and the memory. The memory controller is mainly configured to perform read and write operations on the memory, and may be roughly classified into a conventional memory controller and an integrated memory controller. In a conventional computer system, the memory controller is located in the northbridge chip of the mainboard chipset. In this structure, any data transmission between the CPU and the memory passes through the path "CPU-northbridge-memory-northbridge-CPU"; when the CPU reads or writes memory data, multi-level data transmission is required, causing long latency. The integrated memory controller is located inside the CPU, and data transmission between the CPU and the memory passes through the path "CPU-memory-CPU". In comparison with the conventional memory controller, latency of data transmission is greatly reduced.
A DRAM is a widely used memory medium. Unlike a disk medium, which is accessed sequentially, the DRAM allows the central processing unit to directly and randomly access any byte of the memory. The DRAM has a simple storage structure: each storage cell mainly includes a capacitor and a transistor. When the capacitor is charged, data "1" is stored; a state after the capacitor discharges completely represents data "0".
A PCM is a non-volatile memory that stores information based on a phase change storage material. Each storage unit in the PCM includes a phase change material (for example, a chalcogenide glass) and two electrodes. The phase change material can be converted between a crystalline state and an amorphous state by changing the voltage of the electrodes and the power-on time. In the crystalline state, the medium has low resistance; in the amorphous state, it has high resistance. Therefore, data may be stored by changing the state of the phase change material. The most typical characteristic of the PCM is non-volatility.
A serializer/deserializer (Serdes) converts parallel data into serial data at a transmit end and transmits the serial data to a receive end through a transmission line, where the serial data is converted back into parallel data, so that the quantity of transmission lines can be reduced, and system cost is reduced. The Serdes is a time division multiplexing (TDM), point-to-point communication technology. To be specific, a plurality of low-speed parallel signals (namely, parallel data) at the transmit end are converted into high-speed serial signals (namely, serial data), and the high-speed serial signals are then reconverted into low-speed parallel signals at the receive end through a transmission medium. The Serdes uses differential signals for transmission, so that interference and noise loaded on the two differential transmission lines cancel each other. This improves the transmission speed and also improves signal transmission quality. A parallel interface technology transmits multi-bit data in parallel, with a synchronous clock transmitted to delimit the data bytes; this manner is simple and easy to implement, but is usually used for short-range data transmission because a large quantity of signal lines is required. A serial interface technology is widely applied in long-distance data communication to transmit byte data bit by bit.
With continuous improvement of integrated circuit technologies, and especially of processor architecture design, processor performance has steadily improved. In comparison, memory performance has improved much more slowly. As this gap accumulates, the memory access speed falls seriously behind the computing speed of the processor, and the resulting memory bottleneck makes it difficult to exploit the advantage of a high-performance processor. For example, the memory access speed has become a major restriction on high performance computing (HPC).
In addition, multi-core processors have gradually replaced single-core processors, and the number of accesses to a memory (for example, an off-chip memory, also referred to as a main memory) made by a plurality of cores in a processor executing in parallel also greatly increases. This leads to a corresponding increase in the bandwidth requirement between the processor and the memory.
An access speed and bandwidth between the processor and the memory are usually improved by sharing memory resources.
Depending on whether there is a difference in processor-to-memory access, an architecture in which a plurality of processors share a memory may be divided into a centralized memory sharing system and a distributed memory sharing system. The centralized memory sharing system has features of a small quantity of processors and a single interconnection manner, and the memory is connected to all the processors via a cross switch or a shared bus.
The centralized memory sharing system has a single memory system, and is therefore faced with the problem that the required memory access bandwidth cannot be provided after the quantity of processors reaches a specified scale. This becomes a bottleneck that restricts performance. The distributed memory sharing system effectively resolves this problem.
In the non-uniform memory access (NUMA) system, the address space of the shared memory is managed by the respective processors. Due to a lack of a unified memory management mechanism, when a processor needs to use memory space managed by another processor, the memory resources cannot be shared flexibly, and utilization of the memory resources is low. In addition, when a processor accesses memory address space that it does not manage, long latency is usually caused because the access crosses the interconnect bus.
Embodiments of this application provide a memory sharing control device, a chip, a computer device, a system, and a method, and provide a new memory access architecture, in which a bridge for access between a processor and a shared memory pool (which may also be briefly referred to as a memory pool in embodiments) is established via a memory sharing control device, to improve utilization of memory resources.
The memory sharing control device 200 may be a chip located between a processor (a CPU or a core in a CPU) and a memory (also referred to as a main memory) in a computer device, for example, may be an FPGA chip.
The processor interface 202 is an interface through which the memory sharing control device 200 is connected to the processor 210. The interface can receive a serial signal sent by the processor, and convert the serial signal into a parallel signal. Based on the processor interface 202, the memory sharing control device 200 may be connected to the processor 210 via a serial bus. The serial bus has characteristics of high bandwidth and low latency, to ensure efficiency of data transmission between the processor 210 and the memory sharing control device 200. For example, the processor interface 202 may be a low latency-based Serdes interface. The Serdes interface serving as the processor interface 202 is connected to the processor via the serial bus, to implement conversion between the serial signal and the parallel signal based on serial-to-parallel logic. The serial bus may be a memory semantic bus. The memory semantic bus includes but is not limited to a QPI, PCIe, HCCS, or CXL protocol interconnect-based bus.
During specific implementation, the processor 210 may be connected to the serial bus through the Serdes interface, and is connected to the processor interface 202 (for example, the Serdes interface) of the memory sharing control device 200 via the serial bus. A memory access request initiated by the processor 210 is a memory access request in a parallel signal form. The memory access request in the parallel signal form is converted into a memory access request in a serial signal form through the Serdes interface in the processor 210, and the memory access request in the serial signal form is sent via the serial bus. After receiving the memory access request in the serial signal form from the processor 210 via the serial bus, the processor interface 202 converts the memory access request in the serial signal form into the memory access request in the parallel signal form, and sends the memory access request obtained through conversion to the control unit 201. The control unit 201 may access a corresponding memory based on the memory access request in the parallel signal form, for example, in a parallel manner. In this embodiment of this application, a parallel signal may be a signal that transmits a plurality of bits at a time, and a serial signal may be a signal that transmits one bit at a time.
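The bit-at-a-time idea behind this conversion can be illustrated with the following toy C program; real Serdes logic additionally performs encoding and clock recovery, which are omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Transmit end: emit bit i of a 32-bit parallel word, one bit per cycle. */
static int serialize_bit(uint32_t word, int i)
{
    return (word >> i) & 1;
}

/* Receive end: accumulate received bits back into a parallel word. */
static uint32_t deserialize(const int bits[32])
{
    uint32_t word = 0;
    for (int i = 0; i < 32; i++)
        word |= (uint32_t)bits[i] << i;
    return word;
}

int main(void)
{
    uint32_t request = 0xDEADBEEF; /* an example request word */
    int wire[32];                  /* the serial line, one bit per cycle */
    for (int i = 0; i < 32; i++)
        wire[i] = serialize_bit(request, i);
    assert(deserialize(wire) == request); /* parallel form is recovered */
    return 0;
}
```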
Similarly, when the memory sharing control device 200 returns a response message of the memory access request to the processor 210, the response message in the parallel signal form is converted into a response message in the serial signal form through the processor interface 202 (for example, the Serdes interface), and the response message in the serial signal form is sent to the processor 210 via the serial bus. After receiving the response message in the serial signal form, the processor 210 converts the response message in the serial signal form into a parallel signal, and then performs subsequent processing.
The memory sharing control device 200 may access a corresponding memory in the memory 220 through the memory interface 203 used as a memory controller. For example, when the memory 220 is a shared memory pool including the DRAM, the memory interface 203 is a DDR controller having a DRAM control function, and is configured to implement interface control of a DRAM storage medium. When the memory 220 is a shared memory pool including the PCM, the memory interface 203 is a memory controller having a PCM control function, and is configured to implement interface control of a PCM storage medium.
It should be noted that one processor 210 shown in
One memory 220 shown in
The control unit 201 is configured to control memory access based on the memory access request, including but not limited to dividing memory resources in the shared memory pool into a plurality of independent memory resources, and separately allocating (for example, allocating on demand) the plurality of independent memory resources to the processing units in the processor resource pool. The independent memory resources obtained through division by the control unit 201 may be memory storage space corresponding to a segment of physical addresses in the shared memory pool. The physical addresses of the memory resources may be consecutive or inconsecutive. For example, the memory sharing control device 200 may virtualize a plurality of virtual memory devices based on the shared memory pool, and each virtual memory device corresponds to or manages some memory resources. The control unit 201 respectively allocates, by establishing a correspondence between different virtual memory devices and the processing units, the plurality of independent memory resources obtained through division in the shared memory pool to the processing units in the processor resource pool.
However, a correspondence between the processing unit and the memory resource is not fixed. When a specific condition is met, the correspondence may be adjusted. That is, the correspondence between the processing unit and the memory resource may be dynamically adjusted. That the control unit 201 adjusts the correspondence between the processing unit and the memory resource may include: receiving a control instruction sent by a driver in an operating system, and adjusting the correspondence based on the control instruction. The control instruction includes information about deleting, modifying, or adding the correspondence.
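A hypothetical encoding of such a control instruction is sketched below; the operation codes and fields are assumptions for illustration, and how the dedicated channel carries them is device-specific.

```c
#include <stdint.h>

enum ctrl_op { CTRL_ADD, CTRL_MODIFY, CTRL_DELETE };

/* One control instruction sent by the operating-system driver to adjust
 * the correspondence between processing units and memory resources. */
struct ctrl_instruction {
    enum ctrl_op op;  /* add, modify, or delete a correspondence */
    uint32_t pu_id;   /* processing unit the instruction refers to */
    uint32_t vmd_id;  /* virtual memory device to bind or unbind */
};
```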
For example, a computer device 20 (not shown in the figure) includes the processor 210, the memory sharing control device 200, and the memory 220 shown in
The driver in the operating system may send the control instruction to the memory sharing control device 200 over a dedicated channel or a specified channel. Specifically, when the processor that runs the operating system is in a privilege mode, the driver in the operating system can send the control instruction to the memory sharing control device 200 over the dedicated channel or specified channel. In this way, the driver in the operating system may send, over the dedicated channel, a control instruction for deleting, changing, or adding the correspondence.
The memory sharing control device 200 may be connected to the processor 210 through an interface (for example, the Serdes interface) that supports serial-to-parallel conversion. The processor 210 can communicate with the memory sharing control device 200 via the serial bus. Based on the characteristics of the high bandwidth and the low latency of the serial bus, even if the communication distance between the processor 210 and the memory sharing control device 200 is relatively long, the rate at which the processor 210 accesses the shared memory pool can still be ensured.
In addition, the control unit 201 may be further configured to implement data buffering control, data compression control, data priority control, or the like. Therefore, efficiency and quality of accessing the memory by the processor are further improved.
The following describes, by using an example in which an FPGA is used as a chip for implementing the memory sharing control device 200, an example of an implementation of the memory sharing control device 200 provided in this embodiment of this application.
As a programmable logic device, the FPGA may be classified into three types according to different principles of programmability: a static random access memory (SRAM)-based FPGA (SRAM-type FPGA), an anti-fuse-type FPGA, and a flash-type FPGA. Due to the erasability and volatility of the SRAM, the SRAM-type FPGA can be programmed repeatedly, but configuration data is lost upon a power failure. The anti-fuse-type FPGA can be programmed only once; after programming, its circuit function is fixed and cannot be modified again, and therefore does not change even when no power is supplied.
The following uses the SRAM-type FPGA as an example to describe an internal structure of the FPGA.
A configurable logic block (CLB) mainly includes internal programmable resources such as a lookup table (LUT), a multiplexer, a carry chain, and a D flip-flop. It is configured to implement different logic functions, and is the core of the entire FPGA chip.
A programmable input/output block (IOB) provides an interface between the FPGA and an external circuit and, when internal and external electrical characteristics of the FPGA differ, provides a proper drive for an input/output signal to implement matching. Electronic design automation (EDA) software can configure different electrical standards and physical parameters as required, for example, adjust the drive current and change the resistances of the pull-up and pull-down resistors. Usually, several IOBs are grouped into a bank, and FPGA chips of different series include a different quantity of IOBs in each bank.
A block random access memory (BRAM) is configured to store a large amount of data. To meet different data read/write requirements, the BRAM may be configured as a common storage structure such as a single-port RAM, a dual-port RAM, a content addressable memory (CAM), or a first in first out (FIFO) cache queue, and its storage bit width and depth can be changed based on design requirements. The BRAM extends the application scope of the FPGA and improves flexibility of the FPGA.
A switch matrix (SM) is an important part of the interconnection resources (IRs) inside the FPGA. Switch matrices are mainly distributed at the left end of each resource module; the matrices at the left ends of different modules are similar but not identical, and are configured to connect module resources. The other part of the interconnection resources inside the FPGA is wire segments. The wire segments and the SMs are used together to connect the resources of the entire chip.
The control unit 201 in
The processor interface 202 in
The encoder and the decoder encode and decode data, to ensure direct current balance of the serial data streams and as many signal transitions as possible. For example, an 8b/10b coding solution or a scrambling/descrambling solution may be used. The parallel-to-serial module and the serial-to-parallel module convert data between a parallel form and a serial form. A clock generation circuit generates a conversion clock for the parallel-to-serial circuit and is usually implemented by a phase locked loop. A clock recovery circuit provides a conversion control signal for the serial-to-parallel circuit; it is usually also implemented by a phase locked loop, but may alternatively be implemented by a phase interpolator or the like.
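To make the scrambling option concrete, the following is a minimal C sketch of an additive LFSR scrambler: the data bits are XORed with a pseudo-random sequence so that even a long run of identical bits yields a transition-rich, roughly DC-balanced serial stream. The 7-bit polynomial, the seed, and all names are illustrative assumptions rather than the coding scheme actually used by the Serdes interface; the receiver descrambles by XORing with an identically seeded LFSR.

```c
#include <stdint.h>
#include <stdio.h>

/* Additive LFSR scrambler sketch (polynomial x^7 + x^6 + 1, chosen only
 * for illustration). XORing data with the LFSR output randomizes the
 * stream, ensuring frequent signal transitions and DC balance. */
static uint8_t lfsr = 0x7F; /* any non-zero seed works */

static uint8_t scramble_bit(uint8_t data_bit)
{
    uint8_t fb = (uint8_t)(((lfsr >> 6) ^ (lfsr >> 5)) & 1u);
    lfsr = (uint8_t)(((lfsr << 1) | fb) & 0x7Fu);
    return data_bit ^ fb;
}

int main(void)
{
    /* A run of 16 zero bits becomes a pseudo-random pattern. */
    for (int i = 0; i < 16; i++)
        printf("%u", scramble_bit(0));
    printf("\n");
    return 0;
}
```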
The foregoing merely describes an example of an implementation of the Serdes interface. That is, the Serdes interface shown in
The memory interface 203 in
A control module 502 is configured to control initialization, power-off, and the like of a memory. In addition, the control module 502 may further control the depth of a memory queue used for memory access control, determine whether the memory queue is empty or full, determine whether a memory request is completed, determine an arbitration solution to be used, determine a scheduling manner to be used, and the like.
An address mapping module 503 is configured to implement conversion between an address in an access request and an address identifiable by the memory. For example, a memory address in a DDR4 memory system consists of six parts: Channel, Rank, Bankgroup, Bank, Row, and Column. Different address mapping manners yield different access efficiency.
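As a hedged illustration of such address mapping, the sketch below slices a physical address into the six DDR4 fields named above. The bit widths and field order are assumptions chosen for the example; a real controller picks a mapping that spreads consecutive accesses across channels and bank groups.

```c
#include <stdint.h>
#include <stdio.h>

/* One possible (assumed) DDR4 address layout, from low to high bits:
 * channel | column | bank | bank group | rank | row. */
typedef struct {
    uint32_t channel, column, bank, bankgroup, rank, row;
} Ddr4Addr;

static Ddr4Addr map_address(uint64_t phys)
{
    Ddr4Addr a;
    a.channel   = (uint32_t)(phys & 0x1);            /* 1 bit   */
    a.column    = (uint32_t)((phys >> 1) & 0x3FF);   /* 10 bits */
    a.bank      = (uint32_t)((phys >> 11) & 0x3);    /* 2 bits  */
    a.bankgroup = (uint32_t)((phys >> 13) & 0x3);    /* 2 bits  */
    a.rank      = (uint32_t)((phys >> 15) & 0x1);    /* 1 bit   */
    a.row       = (uint32_t)((phys >> 16) & 0xFFFF); /* 16 bits */
    return a;
}

int main(void)
{
    Ddr4Addr a = map_address(0x12345678ULL);
    printf("ch=%u col=%u bank=%u bg=%u rank=%u row=%u\n",
           a.channel, a.column, a.bank, a.bankgroup, a.rank, a.row);
    return 0;
}
```

Interleaving the channel on the lowest bit, as assumed here, makes adjacent cache lines alternate between channels; other orderings trade that parallelism for better row-buffer locality, which is why different mappings yield different access efficiency.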
A refresh module 504 is configured to implement scheduled refresh on the memory.
A DRAM consists of many repeated cells, and each cell includes a transistor (MOSFET) and a capacitor. The capacitor stores a charge that determines whether the logical state of the DRAM cell is 1 or 0. However, because the capacitor leaks, the charge is gradually lost over time, and consequently the data would be lost. Therefore, the refresh module 504 needs to perform scheduled refresh.
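As a rough worked example of the scheduling involved: if every cell must be refreshed within a retention window and rows are refreshed one at a time, the controller issues one refresh about every retention/rows interval. The 64 ms window and 8192 rows below are common textbook values used purely as assumptions, not values specified by this application.

```c
#include <stdio.h>

/* Derive the average refresh interval from an assumed 64 ms retention
 * window and 8192 rows: 64000 us / 8192 ~= 7.8 us per refresh. */
#define RETENTION_US 64000.0
#define ROW_COUNT    8192.0

int main(void)
{
    printf("issue one refresh about every %.2f us\n",
           RETENTION_US / ROW_COUNT);
    return 0;
}
```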
A scheduling module 505 is configured to place the access requests sent by the address mapping module 503 into different queues based on the request types, where the queues are memory access control queues. For example, the scheduling module may place an access request into a high-priority queue, and select the highest-priority request from the highest-priority queue according to a preset scheduling policy, to complete one round of scheduling. The scheduling policy may be determined based on the time sequence in which requests arrive, where an earlier arrival time indicates a higher priority; or the scheduling policy may be determined based on which request is ready first.
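The following minimal C sketch illustrates the queue-based scheduling just described: requests are placed into fixed-priority queues, and each round serves the oldest request from the highest-priority non-empty queue, so earlier arrival wins within a priority level. The queue count, depth, and all names are illustrative assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_QUEUES  4   /* queue 0 has the highest priority */
#define QUEUE_DEPTH 16

typedef struct {
    uint64_t addr;
    bool     is_write;
} MemRequest;

typedef struct {
    MemRequest entries[QUEUE_DEPTH]; /* ring buffer, arrival order */
    size_t     head, count;
} RequestQueue;

static RequestQueue queues[NUM_QUEUES];

static bool enqueue(int q, MemRequest r)
{
    RequestQueue *rq = &queues[q];
    if (rq->count == QUEUE_DEPTH)
        return false; /* queue full */
    rq->entries[(rq->head + rq->count) % QUEUE_DEPTH] = r;
    rq->count++;
    return true;
}

/* One scheduling round: pop the oldest request from the
 * highest-priority non-empty queue. */
static bool schedule_next(MemRequest *out)
{
    for (int q = 0; q < NUM_QUEUES; q++) {
        if (queues[q].count > 0) {
            *out = queues[q].entries[queues[q].head];
            queues[q].head = (queues[q].head + 1) % QUEUE_DEPTH;
            queues[q].count--;
            return true;
        }
    }
    return false; /* all queues empty */
}

int main(void)
{
    enqueue(2, (MemRequest){0x1000, false});
    enqueue(0, (MemRequest){0x2000, true});
    MemRequest r;
    while (schedule_next(&r))
        ; /* 0x2000 (queue 0) is served before 0x1000 (queue 2) */
    return 0;
}
```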
It should be noted that
The foregoing describes an implementation of the memory sharing control device 200 by using the FPGA as an example. During specific implementation, the memory sharing control device 200 may alternatively be implemented by using another chip or another device that can implement a similar chip function. For example, the memory sharing control device 200 may alternatively be implemented by using an ASIC. The circuit functions of an ASIC are defined at design time, and the ASIC features high chip integration, suitability for volume production, low per-chip cost in volume, a small size, and the like. A specific hardware implementation of the memory sharing control device 200 is not limited in this embodiment of this application.
In this embodiment of this application, a processor connected to the memory sharing control device 200 may be any processor that implements a processor function.
It may be understood that
The following further describes a specific implementation of the memory sharing control device provided in this embodiment of this application.
Specifically, the control unit 301 in
The memory resources connected to the memory sharing control device 300 form a shared memory pool. The control unit 301 may perform unified addressing on the memory resources in the shared memory pool, and divide the unified physical memory address space into several address segments, where each address segment corresponds to one virtual memory device. The address space sizes of the address segments obtained through division may be the same or different. In other words, the sizes of the virtual memory devices may be the same or different.
The virtual memory device is not a device that actually exists, but a segment of memory address space in the shared memory pool that the control unit 301 is configured to identify. The segment of address space is allocated to a processing unit (which may be a processor, a core in a processor, a combination of different cores in a same processor, or a combination of cores in different processors) for memory access (for example, data read/write), and is therefore referred to as a virtual memory device. For example, each virtual memory device may correspond to a segment of memory with consecutive physical addresses. Optionally, one virtual memory device may alternatively correspond to non-consecutive physical address space.
The control unit 301 may allocate one identifier to each virtual memory device, to identify different virtual memory devices.
The control unit 301 may allocate a virtual memory device to a processing unit. To avoid complex logic and a possible traffic storm, when allocating virtual memory devices, the control unit 301 avoids allocating one virtual memory device to a plurality of processors, or to a plurality of cores in one processor. However, for some services in which different cores of the same processor, or cores of different processors, need to execute computing tasks in parallel, the memory corresponding to one virtual memory device may be allocated to a combination of cores, so that service processing efficiency during parallel computing can be improved.
A manner in which the control unit 301 allocates a virtual memory device may be to establish a correspondence between the identifier of the virtual memory device and an identifier of a processing unit. For example, the control unit 301 establishes correspondences between the virtual memory devices and different processing units based on the quantity of processing units connected to the memory sharing control device 300. Optionally, the control unit 301 may alternatively establish a correspondence between the processing units and the virtual memory devices, and a correspondence between the virtual memory devices and different memory resources, thereby establishing a correspondence between the processing units and the different memory resources.
3. Record the correspondence between the virtual memory devices and the allocated processing units.
During specific implementation, the control unit 301 may maintain an access control table (also referred to as a mapping table) to record the correspondence between the virtual memory devices and the processing units. An implementation of the access control table may be shown in Table 1.
In Table 1, Device_ID represents the identifier of a virtual memory device, Address represents the start address of the physical memory managed by or accessible through the virtual memory device, Size represents the size of the memory managed by or accessible through the virtual memory device, Access Attribute represents the access manner, specifically a read operation or a write operation, and Resource_ID represents the identifier of a processing unit.
In Table 1, one Resource_ID usually corresponds to one processing unit. Because a processing unit may be a processor, a core in a processor, a combination of a plurality of cores in a processor, or a combination of a plurality of cores in different processors, the control unit 301 may further maintain a correspondence table between Resource_ID and combinations of cores, to determine the cores or the processor corresponding to each processing unit. For example, Table 2 shows an example of the correspondence between Resource_ID and cores.
In a computer device, the cores in different processors have unified identifiers. Therefore, the core IDs in Table 2 can be used to distinguish between cores in different processors. It may be understood that Table 2 merely shows an example of the correspondence between the Resource_ID of a processing unit and the corresponding cores or processor. The manner in which the memory sharing control device 300 determines the correspondence between Resource_ID and the corresponding cores or processor is not limited in this embodiment of this application.
In another implementation, if the memory connected to the memory sharing control device 300 includes a DRAM and a PCM, then, owing to the non-persistent characteristic of the DRAM storage medium and the persistent characteristic of the PCM storage medium, the access control table maintained by the control unit 301 may further record whether each virtual memory device is a persistent virtual memory device or a non-persistent virtual memory device.
Table 3 shows an implementation of another access control table according to an embodiment of this application.
In Table 3, Persistent Attribute represents a persistent attribute of a virtual memory device, in other words, represents whether memory address space corresponding to the virtual memory device is persistent or non-persistent.
Optionally, the access control table maintained by the control unit 301 may further include other information for further memory access control. For example, the access control table may further include permission information of accessing the virtual memory device by the processing unit, where the permission information includes but is not limited to read-only access or write-only access.
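To make the access control table concrete, the following is a minimal C sketch of the entries described in Table 1 and Table 3, together with a lookup by Resource_ID. All type, field, and function names are illustrative assumptions, not the application's actual implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of one access control table entry, mirroring the columns of
 * Table 1 plus the persistent attribute of Table 3. */
typedef enum { ACCESS_READ, ACCESS_WRITE, ACCESS_READ_WRITE } AccessAttr;

typedef struct {
    uint32_t   device_id;   /* identifier of the virtual memory device  */
    uint64_t   address;     /* start of the managed physical address    */
    uint64_t   size;        /* size of the managed memory               */
    AccessAttr attr;        /* permitted access manner                  */
    bool       persistent;  /* persistent (PCM) or not (DRAM)           */
    uint32_t   resource_id; /* identifier of the owning processing unit */
} AccessControlEntry;

#define TABLE_CAP 64
static AccessControlEntry table_[TABLE_CAP];
static size_t table_len;

/* Find the virtual memory device currently allocated to a processing
 * unit; returns NULL if the unit owns none. */
static AccessControlEntry *lookup_by_resource(uint32_t resource_id)
{
    for (size_t i = 0; i < table_len; i++)
        if (table_[i].resource_id == resource_id)
            return &table_[i];
    return NULL;
}

int main(void)
{
    table_[table_len++] = (AccessControlEntry){
        .device_id = 1, .address = 0x100000000ULL, .size = 1ULL << 30,
        .attr = ACCESS_READ_WRITE, .persistent = false, .resource_id = 7,
    };
    return lookup_by_resource(7) == NULL; /* 0: unit 7 owns device 1 */
}
```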
4. When a memory access request sent by a processing unit is received, determine, based on the correspondence between the virtual memory devices and the processing units recorded in the access control table, the virtual memory device corresponding to the processing unit that sends the memory access request, and access the corresponding memory based on the determined virtual memory device.
For example, a memory access request includes information such as the RESOURCE_ID of a processing unit, address information, and an access attribute. RESOURCE_ID is the ID of a combination of cores, the address information identifies the memory to be accessed, and the access attribute indicates whether the memory access request is a read request or a write request. The control unit 301 may query an access control table (for example, Table 1) based on RESOURCE_ID, to determine at least one virtual memory device corresponding to RESOURCE_ID. For example, the determined virtual memory device is the virtual memory device a shown in
It should be noted that the access control performed by the control unit 301 on a virtual memory device is, in effect, access control over the physical address space of the memory resource corresponding to that virtual memory device.
5. Dynamically adjust the correspondence between the virtual memory device and the processing unit.
The control unit 301 may dynamically adjust a virtual memory device by changing the correspondence between the processing units and the virtual memory devices in the access control table based on a preset condition (for example, different processing units have different requirements for memory resources). For example, the control unit 301 deletes the correspondence between a virtual memory device and a processing unit, in other words, releases the memory resource corresponding to the virtual memory device, and the released memory resource may then be allocated to another processing unit for memory access. Specifically, this may be implemented with reference to the manner in which the control unit 201 dynamically adjusts the correspondence to delete, modify, or add a correspondence in
In an optional implementation, adjustment of the correspondence between the processing units and the memory resources in the shared memory pool may alternatively be implemented by changing the memory resource corresponding to each virtual memory device. For example, when a service processed by a processing unit is in a dormant state and does not need to occupy much memory, a memory resource managed by the virtual memory device corresponding to that processing unit may be allocated to a virtual memory device corresponding to another processing unit, so that the same memory resource is accessed by different processing units in different time periods.
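Continuing the access-control-table sketch above, dynamically adjusting the correspondence can be as simple as rewriting the Resource_ID field of an entry; the function below is a hedged illustration of that idea, not the application's actual logic.

```c
/* Continuation of the AccessControlEntry sketch above: re-pointing a
 * virtual memory device at a different processing unit is a single
 * field rewrite in the table. Name and policy are assumptions. */
static bool reassign_device(uint32_t device_id, uint32_t new_resource_id)
{
    for (size_t i = 0; i < table_len; i++) {
        if (table_[i].device_id == device_id) {
            /* From now on, only new_resource_id may access the memory
             * behind this virtual memory device. */
            table_[i].resource_id = new_resource_id;
            return true;
        }
    }
    return false; /* no such virtual memory device */
}
```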
For example, when the memory sharing control device 300 is implemented by using the FPGA chip shown in
It should be noted that the control unit 301 may virtualize the plurality of virtual memory devices, allocate them to the processing units connected to the memory sharing control device 300, and dynamically adjust the correspondence between the virtual memory devices and the processing units, all based on a received control instruction sent by a driver in an operating system over a dedicated channel. In other words, the driver in the operating system of the computer device in which the memory sharing control device 300 is located sends, over the dedicated channel, an instruction for virtualizing the virtual memory devices, allocating them to the processing units, and dynamically adjusting the correspondence between the virtual memory devices and the processing units, and the control unit 301 implements the corresponding functions based on the received control instruction.
The memory sharing control device 300 is connected to the processor via a serial bus through an interface that supports serial-to-parallel conversion (for example, the Serdes interface), so that long-distance transmission between the memory sharing control device 300 and the processor can be implemented while the speed of memory access by the processor is ensured. Therefore, the processor can quickly access the memory resources in the shared memory pool. Because the memory resources in the shared memory pool can be allocated to different processing units in different time periods for memory access, the utilization of the memory resources is improved.
For example, the control unit 301 in the memory sharing control device 300 can dynamically adjust the correspondence between virtual memory devices and processing units. When a processing unit requires more memory space, the control unit 301 can assign to it virtual memory devices that are unoccupied, or that have been allocated to other processing units but are temporarily idle, that is, establish a correspondence between these idle virtual memory devices and the processing unit that requires more memory. In this way, existing memory resources can be effectively utilized to meet different service requirements of the processing units. This not only satisfies the processing units' requirements for memory space in different service scenarios, but also improves the utilization of the memory resources.
The cache unit 304 may be a random access memory (RAM), and is configured to cache data that a processing unit needs to access during memory access. For example, data to be read by the processing unit may be read from the shared memory pool in advance and cached in the cache unit 304 so that the processing unit can access it quickly, further improving the rate at which the processing unit reads the data. The cache unit 304 may alternatively cache data evicted by the processing unit, for example, Cacheline data evicted by the processing unit. The speed at which the processing unit accesses memory data can thus be further improved by the cache unit 304.
In an optional implementation, the cache unit 304 may include a level 1 cache and a level 2 cache. As shown in
The level 1 cache 3041 may be a cache with a small capacity (for example, at the 100 MB level), may be a nanosecond-level SRAM medium, and caches the Cacheline data evicted from the processing unit.
The level 2 cache 3042 may be a cache with a large capacity (for example, at the 1 GB level), and may be a DRAM medium. The level 2 cache 3042 may cache, at a granularity of a 4 KB page, the Cacheline data evicted from the level 1 cache and data prefetched from a memory 220 (for example, a DDR or PCM medium). The Cacheline data is data in a cache. For example, a cache in the cache unit 304 includes three parts: a valid bit, a tag, and data bits; each row includes these three types of data, and one row forms one Cacheline. When initiating a memory access request, the processing unit matches the information in the memory access request against the corresponding bits in the cache, to read Cacheline data from the cache or write data into the cache.
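The following minimal C sketch shows the three-part Cacheline structure just described and the matching step on a read. The direct-mapped layout and the 64-byte line size are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Each line holds a valid bit, a tag, and a data block, matching the
 * three-part structure described above. Direct-mapped layout and the
 * 64-byte line size are assumptions for this example. */
#define LINE_SIZE 64
#define NUM_LINES 1024

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_SIZE];
} Cacheline;

static Cacheline lines[NUM_LINES];

/* Match an address against the cache: the index selects a line, and
 * the stored tag must agree for a hit. */
static bool cache_read(uint64_t addr, uint8_t out[LINE_SIZE])
{
    uint64_t index = (addr / LINE_SIZE) % NUM_LINES;
    uint64_t tag   = addr / LINE_SIZE / NUM_LINES;
    Cacheline *l = &lines[index];
    if (l->valid && l->tag == tag) {
        memcpy(out, l->data, LINE_SIZE); /* hit */
        return true;
    }
    return false; /* miss: fetch from the shared memory pool instead */
}

int main(void)
{
    /* Install one line, then a read of the same address hits. */
    lines[(0x1000 / LINE_SIZE) % NUM_LINES] =
        (Cacheline){ .valid = true, .tag = 0x1000 / LINE_SIZE / NUM_LINES };
    uint8_t buf[LINE_SIZE];
    return cache_read(0x1000, buf) ? 0 : 1;
}
```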
For example, when the memory sharing control device 300 is implemented by using the FPGA chip shown in
Because the cache unit 304 includes both the level 1 cache 3041 and the level 2 cache 3042, the caches not only improve the data access speed of the processing unit but also increase the cache space, expand the range of memory that the processing unit can quickly access through the caches, and generally further improve the memory access rate of the processor resource pool.
The program code stored in the storage unit 305 may include at least one of a QoS engine 306, a prefetch engine 307, and a compression/decompression engine 308.
The QoS engine 306 is configured to control, based on the RESOURCE_ID in a memory access request, the storage area in the cache unit 304 (the level 1 cache 3041 or the level 2 cache 3042) for the data to be accessed by the processing unit, so that memory data accessed by different processing units has different caching capabilities in the cache unit 304. For example, a memory access request initiated by a high-priority processing unit may have exclusive cache space in the cache unit 304. In this way, the data accessed by that processing unit is guaranteed to be cached in time, so that the service processing quality of this type of processing unit is ensured.
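One simple way to realize such exclusive cache space is way partitioning, sketched below: high-priority processing units allocate into a reserved region of the cache while others share the rest. The priority values, threshold, and masks are illustrative assumptions, not the application's QoS policy.

```c
#include <stdint.h>
#include <stdio.h>

/* Partition an 8-way cache by priority so that high-priority processing
 * units allocate into reserved ways. */
typedef struct {
    uint32_t resource_id;
    int      priority; /* higher value = higher priority */
} QosPolicy;

/* Decide which ways a request may allocate into: high-priority units
 * get an exclusive region; the rest share the remaining ways. */
static unsigned allowed_way_mask(const QosPolicy *p)
{
    if (p->priority >= 2)
        return 0xF0u; /* ways 4..7, exclusive to high priority */
    return 0x0Fu;     /* ways 0..3, shared by everyone else */
}

int main(void)
{
    QosPolicy hp = { .resource_id = 1, .priority = 3 };
    QosPolicy lp = { .resource_id = 2, .priority = 0 };
    printf("high: 0x%02X low: 0x%02X\n",
           allowed_way_mask(&hp), allowed_way_mask(&lp));
    return 0;
}
```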
The prefetch engine 307 is configured to prefetch memory data based on a specific algorithm, that is, to fetch in advance the data to be read by the processing unit. Different prefetch manners affect prefetch precision and memory access efficiency. The prefetch engine 307 implements prefetching with higher precision based on the specified algorithm, to further improve the hit rate when the processing unit accesses memory data. For example, the prefetching implemented by the prefetch engine 307 includes but is not limited to prefetching Cachelines from the level 2 cache to the level 1 cache, or prefetching data from an external DRAM or PCM to the cache.
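The application does not name the prefetch algorithm; as one common possibility, the sketch below implements simple stride detection: when the same address stride is observed twice in a row, the next address is predicted and can be prefetched.

```c
#include <stdint.h>

/* Track the last address and stride per stream; a repeated stride
 * yields a prediction. The algorithm choice is an assumption. */
typedef struct {
    uint64_t last_addr;
    int64_t  last_stride;
} StrideState;

static uint64_t on_access(StrideState *s, uint64_t addr)
{
    int64_t stride = (int64_t)(addr - s->last_addr);
    uint64_t predict = 0;
    if (stride != 0 && stride == s->last_stride)
        predict = addr + (uint64_t)stride; /* stable stride: prefetch */
    s->last_stride = stride;
    s->last_addr = addr;
    return predict; /* 0 means no stable pattern yet */
}

int main(void)
{
    StrideState s = {0, 0};
    (void)on_access(&s, 0x1000);
    (void)on_access(&s, 0x1040);           /* stride 0x40 observed */
    uint64_t next = on_access(&s, 0x1080); /* stride repeats       */
    return next == 0x10C0 ? 0 : 1;         /* predicts 0x10C0      */
}
```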
The compression/decompression engine 308 is configured to compress or decompress memory access data, for example, compress, by using a compression algorithm and at a granularity of a 4 KB page, the data written by the processing unit into a memory and then write the compressed data into the memory; or, when the processing unit reads compressed data in the memory, decompress the data to be read and then send the decompressed data to the processing unit. Optionally, the compression/decompression engine 308 may be disabled; when it is disabled, no compression or decompression is performed when the processing unit accesses the data in the memory.
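As an illustration of page-granularity compression, the sketch below compresses a 4 KB page on the write path and decompresses it on the read path using zlib's compress()/uncompress(); the use of zlib is purely an assumption for the example, not an algorithm specified by this application.

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h> /* build with -lz; zlib is an illustrative choice */

#define PAGE_SIZE 4096

int main(void)
{
    unsigned char page[PAGE_SIZE];
    memset(page, 'A', sizeof(page)); /* a highly compressible page */

    /* Write path: compress the 4 KB page before storing it. */
    unsigned char packed[PAGE_SIZE * 2];
    uLongf packed_len = sizeof(packed);
    if (compress(packed, &packed_len, page, PAGE_SIZE) != Z_OK)
        return 1;

    /* Read path: decompress before returning data to the unit. */
    unsigned char restored[PAGE_SIZE];
    uLongf restored_len = sizeof(restored);
    if (uncompress(restored, &restored_len, packed, packed_len) != Z_OK)
        return 1;

    printf("4096 -> %lu bytes compressed\n", (unsigned long)packed_len);
    return memcmp(page, restored, PAGE_SIZE) != 0; /* 0 on round trip */
}
```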
The QoS engine 306, the prefetch engine 307, and the compression/decompression engine 308 described above are stored in the storage unit 305 as software modules, and the control unit 301 reads the corresponding code in the storage unit to implement the corresponding functions. In an optional implementation, at least one of the QoS engine 306, the prefetch engine 307, and the compression/decompression engine 308 may alternatively be configured directly in the control unit 301 and implemented through the control logic of the control unit 301. In this way, the control unit 301 executes the related control logic to implement the related functions, without needing to read the code in the storage unit 305. For example, when the memory sharing control device 300 is implemented by using the FPGA chip shown in
It may be understood that some of the QoS engine 306, the prefetch engine 307, and the compression/decompression engine 308 may be implemented directly by the control unit 301, while the rest are stored in the storage unit 305, with the control unit 301 reading the software code in the storage unit 305 to execute the corresponding functions. For example, the QoS engine 306 and the prefetch engine 307 are implemented directly through the control logic of the control unit 301, while the compression/decompression engine 308 is software code stored in the storage unit 305, and the control unit 301 reads the software code of the compression/decompression engine 308 in the storage unit 305 to implement its function.
For example, when the memory sharing control device 300 is implemented by using the FPGA chip shown in
An example in which the memory resources connected to the memory sharing control device 300 include a DRAM and a PCM is used below to describe example implementations of memory access performed by the memory sharing control device 300.
In an optional implementation, in a horizontal architecture shown in
It should be noted that, although the cache (the level 1 cache 3041 or the level 2 cache 3042), the QoS engine 306, the prefetch engine 307, and the compression/decompression engine 308 are included in
Based on features of different architectures in
In
Optionally, the memory sharing control device 800a in
Optionally, the memory sharing control device 800a in
The memory sharing control devices 80a in
In the computer device 80a or the computer device 80b, the plurality of processors can quickly access the shared memory pool via the memory sharing control device, which improves the utilization of the memory resources in the shared memory pool. In addition, because the network adapter 830 is connected to the bus through the serial interface, and the data transmission latency between the processor and the network adapter does not increase significantly as the distance increases, the computer device 80a or the computer device 80b may extend, via the memory sharing control device and the network adapter, the memory resources accessible by the processor to those of another device connected to the computer device 80a or the computer device 80b. Therefore, the range of memory resources that can be shared by the processor is further expanded, so that the memory resources are shared in a larger range and their utilization is further improved.
It may be understood that the computer device 80a may alternatively include processors having no local memories. These processors access the shared memory pool via the memory sharing control device 800a, to implement memory access. The computer device 80b may alternatively include processors having local memories. The processors may access the local memories, or may access the memories in the shared memory pool via the memory sharing control device 800b. Optionally, when some processors of the computer device 80b have local memories, most memory access of these processors is implemented in the local memories.
In
The computer device 82a and another computer device M may have a structure similar to that of the computer device 80a. Details are not described again.
In the system 901, the processor 810a may access, via the memory sharing control device 800a, the network adapter 830a, the network 910a, the network adapter 8014a, and the memory sharing control device 8011a, the shared memory pool including the memory 8013a. In other words, the memory resources accessible by the processor 810a include the memory resources in the computer device 80a and the memory resources in the computer device 81a. In a similar manner, the processor 810a may access the memory resources of all the computer devices in the system 901. In this way, when a computer device (for example, the computer device 81a on which the processor 8012a runs) has a low service load and a large quantity of its memory 8013a is idle, but the processor 810a in the computer device 80a needs a large quantity of memory resources to execute an application such as HPC, the memory resources in the computer device 81a may be allocated to the processor 810a in the computer device 80a via the memory sharing control device 800a. In this way, the memory resources in the system 901 are effectively utilized. This not only meets the memory requirements of different computer devices for processing services, but also improves the utilization of the memory resources in the entire system, making the TCO reduction from improved memory utilization more pronounced.
It should be noted that, in the system 901 shown in
The computer device 82b and another computer device M may have a structure similar to that of the computer device 80b. Details are not described again.
In the system 902, the processor 810b may access, via the memory sharing control device 800b, the network adapter 830b, the network 910b, and the network adapter 8014b, the shared memory pool including the memory 8013b. In other words, the memory resources accessible by the processor 810b include the memory resources in the computer device 80b and the memory resources in the computer device 81b. In a similar manner, the processor 810b may access the memory resources in all the computer devices in the system 902, so that the memory resources in the system 902 are used as shared memory resources. In this way, when a computer device (for example, the computer device 81b on which the processor 8012 runs) has a low service load and a large quantity of its memory 8013b is idle, but the processor 810b in the computer device 80b needs a large quantity of memory resources to execute an application such as HPC, the memory resources in the computer device 81b may be allocated to the processor 810b in the computer device 80b via the memory sharing control device 800b. In this way, the memory resources in the system 902 are effectively utilized. This meets the memory requirements of different computer devices for processing services and improves the utilization of the memory resources in the system 902, making the TCO reduction from improved memory utilization more pronounced.
It should be noted that, in the system 902 shown in
It should be noted that, in the system 901 to the system 903, a computer device needs to transmit memory access requests over a network. Because the network adapter 830b is connected to the memory sharing control device via a serial bus through a Serdes interface, the transmission rate and bandwidth of the serial bus can sustain the data transmission rate. Therefore, although network transmission affects the data transmission rate to some extent, this manner improves the utilization of the memory resources while still taking the memory access rate of the processor into account.
The schematic logical diagram in which the computer device 80a in
The schematic logical diagram in which the system 902 shown in
That the at least two processing units 1102 are coupled to the memory sharing control device 1101 means that the at least two processing units 1102 are separately connected to the memory sharing control device 1101, and any one of the at least two processing units 1102 may be directly connected to the memory sharing control device 1101, or may be connected to the memory sharing control device 1101 via another hardware component (for example, another chip).
For specific implementations of the computer device 1100 shown in
The at least two processing units 1102 in the computer device 1100 shown in
Step 1200: The memory sharing control device receives a first memory access request sent by a first processing unit in the at least two processing units, where the processing unit is a processor, a core in a processor, or a combination of cores in a processor.
Step 1202: The memory sharing control device allocates a first memory from the memory pool to the first processing unit, where the first memory is accessible by a second processing unit in the at least two processing units in another time period.
Step 1204: The first processing unit accesses the first memory via the memory sharing control device.
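The following sketch, again building on the access-control-table sketch above, traces steps 1200 to 1204 in C; the first-fit allocation policy and the RESOURCE_NONE sentinel are illustrative assumptions, not the application's specified behavior.

```c
/* Continuation of the AccessControlEntry sketch above. RESOURCE_NONE
 * marks an unallocated entry; first-fit allocation is an assumed
 * policy. */
#define RESOURCE_NONE 0xFFFFFFFFu

static AccessControlEntry *allocate_first_memory(uint32_t resource_id)
{
    for (size_t i = 0; i < table_len; i++) {
        if (table_[i].resource_id == RESOURCE_NONE) {
            table_[i].resource_id = resource_id; /* step 1202 */
            return &table_[i];
        }
    }
    return NULL; /* no free memory in the pool right now */
}

/* Step 1200: a request arrives from a processing unit; step 1204: on
 * success, the unit accesses the memory behind the returned entry. */
static bool handle_request(uint32_t resource_id)
{
    AccessControlEntry *e = lookup_by_resource(resource_id);
    if (e == NULL)
        e = allocate_first_memory(resource_id);
    return e != NULL;
}
```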
Based on the method shown in
Specifically, the method shown in
A person of ordinary skill in the art may be aware that, with reference to the examples described in embodiments disclosed in this specification, units and method steps may be implemented by electronic hardware, computer software, or a combination thereof. To clearly describe the interchangeability between the hardware and the software, the foregoing has generally described compositions and steps of each example according to functions. Whether the functions are performed by hardware or software depends on particular applications and design constraint conditions of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of the present invention.
In the several embodiments provided in this application, the described apparatus embodiments are merely illustrative. For example, unit division is merely logical function division, and may be another division during actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, in other words, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected depending on actual requirements to achieve the objectives of the solutions of embodiments of the present invention.
The foregoing descriptions are merely specific embodiments of the present invention, but are not intended to limit the protection scope of the present invention. Any modification or replacement readily figured out by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
This application is a continuation of International Application PCT/CN2022/080620, filed on Mar. 14, 2022, which claims priority to Chinese Patent Application No. 202110351637.5, filed on Mar. 31, 2021, which claims priority to Chinese Patent Application No. 202110270731.8, filed on Mar. 12, 2021. All of the aforementioned priority patent applications are hereby incorporated by reference in their entirety.
Related parent/child application data: Parent: PCT/CN2022/080620, filed Mar. 2022; Child: U.S. application Ser. No. 18/460,608.