HIERARCHICAL NETWORK FOR STACKED MEMORY SYSTEM

Information

  • Patent Application
  • Publication Number
    20230297269
  • Date Filed
    February 28, 2022
  • Date Published
    September 21, 2023
Abstract
A hierarchical network enables access for a stacked memory system including one or more memory dies that each include multiple memory tiles. The processor die includes multiple processing tiles that are stacked with the one or more memory dies. The memory tiles that are vertically aligned with a processing tile are directly coupled to the processing tile and comprise the local memory block for that processing tile. The hierarchical network provides access paths for each processing tile to access its own local memory block, the local memory block coupled to a different processing tile within the same processor die, memory tiles in a different die stack, and memory tiles in a different device. The ratio of memory bandwidth (bytes) to floating-point operations (B:F) may improve 50x for accessing the local memory block compared with conventional memory. Additionally, the energy consumed to transfer each bit may be reduced by 10x.
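The abstract describes four levels of access: a tile's own local memory block, a peer tile's block in the same processor die, a different die stack in the same device, and a different device. A minimal sketch of that routing decision is shown below; the address layout ([device | stack | tile | offset] fields), field widths, and all names are illustrative assumptions, not taken from the application.

```python
from enum import Enum, auto

class Route(Enum):
    LOCAL_BLOCK = auto()   # memory tiles stacked directly on the requesting tile
    PEER_TILE = auto()     # local block of a different tile in the same processor die
    PEER_STACK = auto()    # a different die stack in the same device
    PEER_DEVICE = auto()   # a die stack in a different device

def route_access(addr: int, my_tile: int, my_stack: int, my_device: int,
                 tiles_per_die: int = 64, stacks_per_device: int = 4,
                 block_bits: int = 26) -> Route:
    """Decode an address into [device | stack | tile | offset] fields and
    pick the hierarchy level that serves it (assumed format; the
    application does not specify an address layout).
    """
    block = addr >> block_bits                          # which local memory block
    tile = block % tiles_per_die
    stack = (block // tiles_per_die) % stacks_per_device
    device = block // (tiles_per_die * stacks_per_device)
    if device != my_device:
        return Route.PEER_DEVICE    # would cross a package communication gateway
    if stack != my_stack:
        return Route.PEER_STACK     # would cross a stack communication gateway
    if tile != my_tile:
        return Route.PEER_TILE      # carried by the tile communication network
    return Route.LOCAL_BLOCK        # served over the vertical conductive paths
```

The point of the hierarchy is that each successive level in this decode is expected to have lower bandwidth and higher energy per bit, so the common case (LOCAL_BLOCK) stays on the short vertical paths.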
Claims
  • 1. A hierarchical network, comprising: conductive paths between each processing tile fabricated in a processor die and a corresponding memory tile fabricated in each memory die of at least one memory die for communication between each processing tile and the corresponding memory tile, wherein the processor die and the at least one memory die comprise a die stack and the corresponding memory tile is stacked on the processing tile; and a tile communication network fabricated in the processor die for transmitting data between a first one of the processing tiles and the memory tile corresponding to a second one of the processing tiles.
  • 2. The hierarchical network of claim 1, further comprising a stack communication gateway fabricated in the processor die for transmitting data between the die stack and at least one additional die stack, wherein a device includes the die stack and the at least one additional die stack.
  • 3. The hierarchical network of claim 2, further comprising a package communication gateway that is coupled to the stack communication gateway and configured for transmitting data between the device and an additional device.
  • 4. The hierarchical network of claim 2, wherein the stack communication gateway is coupled to input/output circuitry located at a perimeter of the processor die.
  • 5. The hierarchical network of claim 3, wherein a bandwidth capacity of the package communication gateway is limited by a capacity of input/output circuits at a perimeter of the device.
  • 6. The hierarchical network of claim 2, wherein a bandwidth capacity of the stack communication gateway is less than a bandwidth capacity of the tile communication network.
  • 7. The hierarchical network of claim 2, wherein a bandwidth capacity of the stack communication gateway is limited by a capacity of input/output circuits at a perimeter of the processor die.
  • 8. The hierarchical network of claim 2, wherein the device is enclosed within an integrated circuit package.
  • 9. The hierarchical network of claim 1, wherein the tile communication network comprises a narrow sub-network for transmitting read requests and write replies and a wide sub-network for transmitting write requests and read replies.
  • 10. The hierarchical network of claim 9, wherein the wide sub-network transmits data for invoking execution of a processing thread using the data.
  • 11. The hierarchical network of claim 1, wherein each processing tile comprises a mapping circuit configured to translate an address generated by the processing tile to a location in a local memory block comprising the corresponding memory tile in each memory die of the at least one memory die.
  • 12. The hierarchical network of claim 1, wherein each processing tile comprises a mapping circuit configured to translate an address generated by the processing tile to a location in one of a local memory block comprising the corresponding memory tile in each memory die of the at least one memory die, the local memory block corresponding with a different processing tile within the processor die, an additional stack of dies that is included within a device, or an additional stack of dies that is external to the device.
  • 13. The hierarchical network of claim 1, wherein the conductive paths comprise a through-die via structure that is fabricated within each one of the at least one memory die.
  • 14. The hierarchical network of claim 13, wherein the through-die via structure comprises at least one of through-silicon vias, solder bumps, or hybrid bonds.
  • 15. The hierarchical network of claim 2, wherein the die stack and the at least one additional die stack are affixed to an interposer substrate to produce a device.
  • 16. The hierarchical network of claim 1, wherein a bandwidth capacity of the conductive paths equals or is greater than a bandwidth capacity of the tile communication network.
  • 17. The hierarchical network of claim 1, wherein the processor die comprises a graphics processing unit.
  • 18. The hierarchical network of claim 1, wherein each processing tile comprises one or more central processing units.
  • 19. The hierarchical network of claim 1, wherein the tile communication network comprises one of a two-dimensional mesh structure, a flattened butterfly structure, or a concentrated mesh structure.
  • 20. A method, comprising: generating a memory access request by a first processing tile of a plurality of processing tiles that are fabricated within a processor die, wherein the processor die and at least one memory die comprise a die stack with at least one memory tile of a plurality of memory tiles fabricated within each memory die stacked with each processing tile to provide local memory for the processing tile; determining whether the memory access request specifies a location in the local memory for the first processing tile; and responsive to determining that the memory access request specifies the location in the local memory for the first processing tile, transmitting the memory access request from the first processing tile to the local memory provided by the at least one memory tile stacked with the first processing tile through conductive paths between the first processing tile and the local memory; or responsive to determining that the memory access request does not specify the location in the local memory for the first processing tile, transmitting the memory access request from the first processing tile to a second processing tile of the plurality of processing tiles and through second conductive paths between the second processing tile and the local memory provided by the at least one memory tile stacked with the second processing tile.
  • 21. The method of claim 20, further comprising transmitting data, through a stack communication gateway fabricated in the processor die, between the die stack and at least one additional die stack, wherein a device package includes the die stack and the at least one additional die stack.
  • 22. The method of claim 21, further comprising transmitting data, through a package communication gateway fabricated in the processor die, between the die stack and the at least one additional die stack within the device package and at least one additional device package.
  • 23. The method of claim 20, wherein at least one of the steps of generating, determining, and transmitting is performed on a server or in a data center to generate an image, and the image is streamed to a user device.
  • 24. The method of claim 20, wherein at least one of the steps of generating, determining, and transmitting is performed within a cloud computing environment.
  • 25. The method of claim 20, wherein at least one of the steps of generating, determining, and transmitting is performed for training, testing, or inferencing with a neural network employed in a machine, robot, or autonomous vehicle.
  • 26. The method of claim 20, wherein at least one of the steps of generating, determining, and transmitting is performed on a virtual machine comprising a portion of a graphics processing unit.
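Method claim 20 reduces to a two-way decision: serve the request over the tile's own vertical conductive paths, or forward it across the tile communication network to the tile whose stacked memory owns the address. The toy model below sketches that flow under stated assumptions; the block size, the `fabric` dictionary standing in for the tile communication network, and all class and field names are hypothetical, not from the application.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemoryRequest:
    addr: int                      # address generated by the processing tile
    data: Optional[bytes] = None   # payload for a write; None marks a read

class ProcessingTile:
    """Toy model of one processing tile with a stacked local memory block.

    Each tile owns one fixed-size local block; `fabric` maps a block
    index to its owning tile, standing in for the tile communication
    network (illustrative assumption).
    """
    BLOCK_SIZE = 1024

    def __init__(self, tile_id: int, fabric: dict):
        self.tile_id = tile_id
        self.local_base = tile_id * self.BLOCK_SIZE
        self.local_mem = bytearray(self.BLOCK_SIZE)
        self.fabric = fabric
        fabric[tile_id] = self

    def access(self, req: MemoryRequest) -> Optional[bytes]:
        # Claim 20's determining step: does the request specify a
        # location in this tile's local memory?
        if self.local_base <= req.addr < self.local_base + self.BLOCK_SIZE:
            # Yes: use the vertical conductive paths to the stacked memory tile.
            return self._access_local(req)
        # No: forward over the tile communication network to the second
        # processing tile whose local memory holds the address.
        owner = self.fabric[req.addr // self.BLOCK_SIZE]
        return owner._access_local(req)

    def _access_local(self, req: MemoryRequest) -> Optional[bytes]:
        off = req.addr - self.local_base
        if req.data is not None:                       # write request
            self.local_mem[off:off + len(req.data)] = req.data
            return None
        return bytes(self.local_mem[off:off + 1])      # one-byte read reply
```

For example, a write issued by tile 0 to an address in tile 1's block is forwarded through `fabric`, and a later read through either tile returns the same byte.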