VECTOR PROCESSING UNIT

Abstract
A vector processing unit is described that includes processor units, each of which includes multiple processing resources. Each processor unit is configured to perform arithmetic operations associated with vectorized computations. The vector processing unit includes a vector memory in data communication with each of the processor units and their respective processing resources. The vector memory includes memory banks configured to store data used by each of the processor units to perform the arithmetic operations. The processor units and the vector memory are tightly coupled within an area of the vector processing unit, so that data is exchanged at high bandwidth based on the placement of the processor units relative to one another and the placement of the vector memory relative to each processor unit.
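The arrangement in the abstract, a banked vector memory feeding a processor unit that performs element-wise arithmetic, can be illustrated with a small software model. This is only a sketch under stated assumptions: the class names `VectorMemory` and `ProcessorUnit`, the low-order address interleaving across banks, and the bank sizes are all hypothetical and are not taken from the text, which does not specify an addressing scheme.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VectorMemory:
    """Toy banked memory (hypothetical model, not the patented design).

    Assumption: consecutive addresses are interleaved across banks,
    so address a lives in bank (a % num_banks) at row (a // num_banks).
    """
    num_banks: int
    bank_depth: int
    banks: List[List[float]] = field(init=False)

    def __post_init__(self) -> None:
        self.banks = [[0.0] * self.bank_depth for _ in range(self.num_banks)]

    def locate(self, address: int) -> Tuple[int, int]:
        """Map a flat address to a (bank, row) pair."""
        if not 0 <= address < self.num_banks * self.bank_depth:
            raise IndexError(f"address {address} out of range")
        return address % self.num_banks, address // self.num_banks

    def write(self, address: int, value: float) -> None:
        bank, row = self.locate(address)
        self.banks[bank][row] = value

    def read(self, address: int) -> float:
        bank, row = self.locate(address)
        return self.banks[bank][row]


@dataclass
class ProcessorUnit:
    """Toy processor unit that reads operands from a vector memory."""
    memory: VectorMemory

    def add(self, a_base: int, b_base: int, out_base: int, length: int) -> None:
        """Element-wise add of two vectors stored in the vector memory."""
        for i in range(length):
            total = self.memory.read(a_base + i) + self.memory.read(b_base + i)
            self.memory.write(out_base + i, total)


def demo() -> List[float]:
    memory = VectorMemory(num_banks=4, bank_depth=8)
    for i, value in enumerate([1.0, 2.0, 3.0, 4.0]):
        memory.write(i, value)            # vector a at addresses 0..3
    for i, value in enumerate([10.0, 20.0, 30.0, 40.0]):
        memory.write(4 + i, value)        # vector b at addresses 4..7
    ProcessorUnit(memory).add(a_base=0, b_base=4, out_base=8, length=4)
    return [memory.read(8 + i) for i in range(4)]
```

With this interleaving, the four elements of each vector fall in four different banks, which is the property that would let real hardware fetch them in parallel; the model itself runs sequentially and says nothing about timing.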
Description
Claims
  • 1. (canceled)
  • 2. A system comprising: a first vector processing unit (VPU) lane; a vector memory co-located with the first VPU lane, the vector memory having a plurality of memory banks; and a second VPU lane that is within a distance to the vector memory co-located with the first VPU lane such that data traverses the distance in a single clock cycle.
  • 3. The system of claim 2, wherein each of the first VPU lane and the second VPU lane includes a respective vector memory having a plurality of memory banks.
  • 4. The system of claim 2, wherein each of the first VPU lane and the second VPU lane is a VPU sub-lane.
  • 5. The system of claim 2, wherein each of the first VPU lane and the second VPU lane is a respective computing resource of an integrated circuit die section of the system.
  • 6. The system of claim 5, wherein each of the first VPU lane and the second VPU lane comprises a plurality of VPU sub-lanes.
  • 7. The system of claim 5, wherein a first resource within a VPU sub-lane of the first VPU lane is within a distance to a second resource within the VPU sub-lane of the first VPU lane such that data traverses the distance in a single clock cycle.
  • 8. The system of claim 7, wherein a first resource within a VPU sub-lane of the second VPU lane is within a distance to a second resource within the VPU sub-lane of the second VPU lane such that data traverses the distance in a single clock cycle.
  • 9. The system of claim 5, further comprising: an external memory coupled to each of the first VPU lane and the second VPU lane; and an inter-chip interconnect that interconnects each of the external memory, the first VPU lane, and the second VPU lane.
  • 10. The system of claim 9, wherein the external memory is external to the integrated circuit die section.
  • 11. The system of claim 9, wherein each of the external memory and the inter-chip interconnect is configured to exchange data with the vector memory and the first VPU lane.
  • 12. The system of claim 5, wherein the vector memory is included in the first VPU lane.
  • 13. The system of claim 12, further comprising: a plurality of second VPU lanes; and a respective vector memory in each second VPU lane of the plurality of second VPU lanes.
  • 14. The system of claim 5, further comprising: a matrix unit configured to perform matrix multiplication on data that is received from the first VPU lane and the second VPU lane.
  • 15. The system of claim 14, wherein: the matrix unit is external to the integrated circuit die section; and the data traverses a distance between the matrix unit and at least the first VPU lane in a single clock cycle.
  • 16. The system of claim 14, wherein the data comprises at least 1024 vector operands.
  • 17. The system of claim 2, wherein the data is represented as a multi-dimensional vector and the system further comprises: a permute unit configured to reshape or rearrange the data with reference to the multi-dimensional vector.
  • 18. The system of claim 2, further comprising: a cross-lane unit configured to move data between the first VPU lane and the second VPU lane.
  • 19. A system comprising: an external memory; an inter-chip interconnect; and at least one vector processing unit (VPU) lane; wherein each VPU lane of the at least one VPU lane comprises corresponding vector memory having a plurality of memory banks, wherein each of the external memory and the inter-chip interconnect is configured to exchange data with the vector memory of the at least one VPU lane, wherein each VPU lane of the at least one VPU lane comprises a plurality of VPU sub-lanes, and wherein each VPU lane of the at least one VPU lane is within a distance to the corresponding vector memory such that data traverses the distance in a single clock cycle.
  • 20. The system of claim 19, further comprising: a matrix unit configured to perform matrix multiplication on vector operands corresponding to the data, wherein the vector operands are received from the first VPU lane and the second VPU lane.
  • 21. The system of claim 20, wherein the data is represented as a multi-dimensional vector and the system further comprises: a permute unit configured to reshape or rearrange the data with reference to the multi-dimensional vector; and a cross-lane unit configured to move data between two or more VPU lanes of the system.
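The permute unit and cross-lane unit recited in claims 17, 18, and 21 can be illustrated in software. The following is a minimal sketch, assuming the multi-dimensional vector is laid out as `data[lane][sublane]`. The function names, the index-pattern form of the permutation, and the choice of a rotation as the cross-lane movement are all hypothetical; the claims do not specify how either unit rearranges or moves data.

```python
from typing import List, Sequence

Vector2D = List[List[float]]  # assumed layout: data[lane][sublane]


def permute_sublanes(data: Vector2D, pattern: Sequence[int]) -> Vector2D:
    """Rearrange elements within every lane (hypothetical permute unit).

    output[lane][i] = data[lane][pattern[i]]; pattern must be a
    permutation of the sub-lane indices.
    """
    if sorted(pattern) != list(range(len(pattern))):
        raise ValueError("pattern must be a permutation of 0..n-1")
    for row in data:
        if len(row) != len(pattern):
            raise ValueError("pattern length must match sub-lane count")
    return [[row[j] for j in pattern] for row in data]


def cross_lane_rotate(data: Vector2D, shift: int) -> Vector2D:
    """Move each lane's data to lane (lane + shift) % num_lanes
    (hypothetical cross-lane unit modeled as a rotation)."""
    num_lanes = len(data)
    return [list(data[(lane - shift) % num_lanes]) for lane in range(num_lanes)]


example = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 lanes, 2 sub-lanes each
swapped = permute_sublanes(example, [1, 0])      # swap the two sub-lanes
rotated = cross_lane_rotate(example, 1)          # lane 0 data moves to lane 1
```

Both functions return new lists rather than modifying their input, which keeps the two operations independent and easy to compose.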
Continuations (4)
Relation  Number    Date      Country
Parent    17327957  May 2021  US
Child     18074990            US
Parent    16843015  Apr 2020  US
Child     17327957            US
Parent    16291176  Mar 2019  US
Child     16843015            US
Parent    15454214  Mar 2017  US
Child     16291176            US