The present invention is described with respect to particular exemplary embodiments thereof and reference is accordingly made to the drawings in which:
An embodiment of a vector processor of the present invention is illustrated schematically in
In operation, the vector control & distribution unit 102 receives vector instructions 106 (e.g., from a control unit), decomposes the vector instructions into vector element operations, and forwards the vector element operations to the lanes 104 for processing. The vector element operations in each lane operate on vector element data 108. Each lane 104 receives a portion of the vector element operations. Each lane proceeds to execute its vector element operations independently of execution of vector element operations in other lanes. As used herein, to execute instructions independently of other lanes means to allow lanes to run ahead of other lanes. For example, if a first lane completes execution of a first vector element operation prior to any other lane completing execution of its first vector element operation received in the same time period, the first lane may proceed to begin executing a second vector element operation while the other lanes continue to execute their first vector element operations.
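The decomposition and distribution described above can be sketched in a few lines. This is a minimal illustrative model, not the claimed hardware: the names `decompose`, `distribute`, `NUM_LANES`, and the round-robin assignment are assumptions for illustration; the key property shown is that each lane owns a private queue of element operations, so one lane may drain its queue ahead of the others.

```python
# Hypothetical sketch of decomposing a vector instruction into
# per-element operations and distributing them round-robin to lanes.
from collections import deque

NUM_LANES = 4  # assumed lane count for illustration

def decompose(opcode, vector_length):
    """Split one vector instruction into one operation per vector element."""
    return [(opcode, elem) for elem in range(vector_length)]

def distribute(element_ops, num_lanes=NUM_LANES):
    """Place each element operation into a lane's private queue."""
    lanes = [deque() for _ in range(num_lanes)]
    for i, op in enumerate(element_ops):
        lanes[i % num_lanes].append(op)
    return lanes

lanes = distribute(decompose("add", 8))
# Each lane holds only its portion; a lane may pop and execute its
# second operation while another lane is still stalled on its first.
assert list(lanes[0]) == [("add", 0), ("add", 4)]
```

Because the queues are private, no lane waits for lockstep completion across lanes, which is the run-ahead behavior described above.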
An embodiment of a system for vector processing of the present invention is illustrated schematically in
Typically in operation, the main memory 204 holds vector instructions and vector data. The host processor 202 forwards the vector instructions and the vector data to the vector processor 206. Alternatively, the vector data may reside in the memory units 210 or in caches (not shown). The host processor 202 may communicate with the vector processor 206 using a point-to-point transport protocol (e.g., HyperTransport Protocol). The vector control & distribution unit decomposes the vector instructions into vector element operations and forwards the vector element operations to the lanes. Each lane proceeds to execute the vector element operations that the lane receives on a portion of the vector data independently of the execution of the vector element operations in other lanes.
An embodiment of a vector processor of the present invention is illustrated schematically in
The crossbar switch 306 provides interconnectivity between components of the vector processor 300. For example, the crossbar switch 306 provides access to any of the memory channels 315 by any of the lanes 304. In an embodiment, each lane 304 has access to a primary memory channel selected from the memory channels 315 in which access by the lane 304 to the primary memory channel is faster than access to others of the memory channels 315.
In operation, the vector processor 300 receives input 334 that includes vector instructions and initial vector data. The initial vector data and other vector data is forwarded to the memory channels 315 (i.e., the cache banks 312, the memory units 314, or a combination of the cache banks 312 and the memory units 314). Vector instructions may also be held in memory channels 315 or may be held in the instruction cache 332. The fetch & control unit 308 forwards the vector instructions to the vector control & distribution unit 302.
The vector control & distribution unit 302 decomposes the vector instructions into vector element operations and forwards the vector element operations to the lanes 304 for processing. The vector control & distribution unit 302 performs a dependency analysis on each vector instruction prior to forwarding its vector element operations to the lanes for processing to determine if the vector instruction is dependent upon an earlier vector instruction. Responsive to the dependency existing, the vector control & distribution unit 302 forwards the vector element operations of the dependent vector instruction to the lanes for execution after forwarding the vector element operations of the vector instruction upon which it depends. Responsive to no dependency, the vector control & distribution unit 302 forwards the vector element operations of the different vector instructions to the lanes for execution independent of a particular order requirement that would be imposed by a dependency. In one example, the vector element operations of the different vector instructions can be forwarded to the lanes 304 at the same time. Particularly for lanes that can execute more than one instruction at a time, this allows for faster execution of the different vector instructions.
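The dependency analysis described above can be sketched as a simple register-overlap check. This is an illustrative assumption, not the claimed circuit: the instruction representation (a destination register and a list of source registers) and the function name `depends_on` are hypothetical, and only read-after-write and write-after-write hazards are modeled.

```python
# Hypothetical sketch of the dependency check performed before
# forwarding a vector instruction's element operations to the lanes.
def depends_on(instr, earlier):
    """True if instr reads the earlier instruction's destination (RAW)
    or writes the same destination (WAW)."""
    return (earlier["dest"] in instr["srcs"]
            or earlier["dest"] == instr["dest"])

i0 = {"dest": "v1", "srcs": ["v2", "v3"]}
i1 = {"dest": "v4", "srcs": ["v1", "v5"]}  # reads v1: dependent on i0
i2 = {"dest": "v6", "srcs": ["v7", "v8"]}  # no overlap: independent

assert depends_on(i1, i0)
assert not depends_on(i2, i0)
# i2's element operations may be forwarded at the same time as i0's;
# i1's are forwarded only after i0's have been forwarded.
```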
The lanes 304 independently execute the vector element operations, which allows some lanes to run ahead of other lanes. Long latency instructions in a particular lane do not prevent other lanes from executing other instructions. For example, a particular lane may encounter a cache miss while others do not. Over a series of vector instructions, various lanes are likely to experience long latency instructions, causing some lanes at first to run ahead of other lanes and then to slow down as these lanes encounter long latency instructions of their own. Thus, independent execution of vector element operations in the lanes 304 is expected to provide more efficient processing because long latency instructions occur randomly among the lanes 304.
The load/store units 320 of the lanes 304 load vector data from the memory channels 315. The floating point unit 316 of each lane 304 performs floating point calculations on floating point data that has been loaded into the floating point registers 322 of each lane 304. The arithmetic logic unit 318 performs logic operations and arithmetic operations on data that has been loaded into the integer registers 326 of each lane 304. The arithmetic logic unit 318 also performs bit matrix multiplications in conjunction with the arithmetic logic units 318 of other lanes on data that has been loaded into the bit matrix multiplication registers 324. An embodiment of a bit matrix multiplication is discussed in more detail below. Resultant data from the lanes 304 form resultant vector data that may be forwarded to the memory channels 315 or may be forwarded to the interface 310 to form output 336.
The cache banks 312 perform several functions, including increasing bandwidth for memory references that fit in the cache banks 312, reducing the power consumed in accessing the memory units 314, which are located off-chip, and acting as buffers for communications between lanes. Use of the cache banks 312 also reduces latency for memory operations.
An embodiment of a bit matrix multiplication of matrices A and B performed on the vector processor 300 performs a logical AND operation of each bit in a row of matrix A and the corresponding bit in a row of matrix B, then performs a logical XOR to find the resultant bit value. This is repeated using one row of A and each row of B to create one output row. The process is then repeated for the other rows of A to create other output rows. Each lane performs a local bitwise AND on its portions of matrices A and B. These intermediate results are combined in a tree-like fashion by all lanes communicating by way of the crossbar switch 306. Synchronization point instructions may be inserted in the vector element operations provided to each lane to ensure proper coordination of the combination of intermediate results.
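The AND-then-XOR procedure described above can be written out directly. The sketch below is a functional model only, with hypothetical names (`bmm`); it shows the single-lane arithmetic (pairwise AND of a row of A with a row of B, then an XOR reduction to one output bit), not the tree-wise cross-lane combination over the crossbar switch.

```python
# Functional sketch of the bit matrix multiplication described above:
# each output bit is the XOR-reduction (parity) of the pairwise AND of
# one row of A with one row of B.
def bmm(A, B):
    out = []
    for a in A:            # one row of A produces one output row
        row = []
        for b in B:        # one bit per row of B
            acc = 0
            for abit, bbit in zip(a, b):
                acc ^= abit & bbit
            row.append(acc)
        out.append(row)
    return out

A = [[1, 0, 1],
     [0, 1, 1]]
B = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]
# Row 0 of A vs. row 0 of B: AND gives [1, 0, 0], XOR-reduce gives 1.
assert bmm(A, B) == [[1, 1, 0], [1, 0, 1]]
```

In the hardware embodiment, each lane would compute this AND/XOR on only its slice of the bits, and the per-lane partial parities would be XOR-combined tree-wise across lanes, which is why the synchronization point instructions mentioned above are needed.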
An exemplary operation of the vector processor 300 is illustrated as a flow chart in
A timing diagram illustrating the exemplary operation 400 is shown in
Impending completion can be computed for fixed-latency functional units (such as arithmetic units) once an element operation has been initiated by adding the functional unit latency to the cycle in which the operation was initiated, producing the cycle in which the result will be available. In practice, this is often implemented by simply pipelining a completion notification through N fewer pipestages than the computed result of the fixed-latency functional unit, starting from the initiation of the computation. This results in a completion notification that is produced N cycles before the result. Impending completion notification more than one cycle in advance of results is often difficult or impossible for variable latency functional units, such as cache memories that may hit or miss. For these units, one-cycle advance notification can still be provided as follows. For example, in the case of a set-associative cache, the fact that a hit has occurred, and the way of the set which hits, is often known a small amount of time before the data is produced, since the way that hits must be used to select the result from among the different ways of the cache. Note that once a cache miss has occurred, if data is being retrieved from DRAM memories instead of another level of cache, the timing characteristics of the DRAMs are known, so once the DRAM access has been initiated the impending availability of the results can be known in advance of the arrival of the result data.
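The fixed-latency case described above reduces to simple cycle arithmetic. The sketch below is an illustrative model with hypothetical names (`completion_cycle`, `notify_cycle`); it shows how pipelining the notification through N fewer stages yields a notification N cycles before the result.

```python
# Illustrative cycle arithmetic for impending-completion notification
# from a fixed-latency functional unit.
def completion_cycle(start_cycle, unit_latency):
    """Cycle in which a fixed-latency unit's result becomes available."""
    return start_cycle + unit_latency

def notify_cycle(start_cycle, unit_latency, advance=1):
    """Completion notification pipelined through `advance` (N) fewer
    stages than the result, so it fires N cycles early."""
    return start_cycle + unit_latency - advance

start, latency = 10, 4
assert completion_cycle(start, latency) == 14
assert notify_cycle(start, latency, advance=1) == 13  # one cycle early
```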
Between times t2 and t3, the vector control & distribution unit 302 releases a third set of vector element operations, add v1A and v2A . . . add v1D and v2D, to the first through fourth lanes, 304A . . . 304D, respectively. The first through fourth lanes, 304A . . . 304D, execute the third set of vector element operations by time t4.
As depicted in the timing diagram 500, the first lane 304A runs ahead of the other lanes when it completes execution of load v1A and begins executing load v2A. Further, the third lane 304C runs ahead of the second and fourth lanes, 304B and 304D, when it completes execution of load v1C and begins executing load v2C. The ability of lanes to run ahead of other lanes accommodates situations where some vector element data of a particular vector is found in cache and remaining vector element data of the particular vector must be retrieved from memory. Because retrieving data from memory has a longer latency than retrieving data from cache, the ability to run ahead allows the lanes that receive data from cache to begin executing next vector element operations ahead of lanes that retrieve data from memory. Over time, it is anticipated that cache misses will be dispersed among the lanes, leading some lanes to run ahead initially and other lanes to catch up with these lanes later.
As depicted in the timing diagram 500, the vector control & distribution unit 302 releases the third vector element operations as a pipeline operation in anticipation of the first lane 304A completing its second vector element operation (i.e., load v2A). Employing the pipeline operation allows each of the first through fourth lanes, 304A . . . 304D, to immediately execute its third vector element operation upon completion of the first and second vector element operations by all of the lanes.
Another embodiment of a vector processor of the present invention is illustrated schematically in
An exemplary operation of the vector processor 600 is illustrated as a flow chart in
A timing diagram illustrating the exemplary operation 700 is shown in
As depicted in the timing diagram 800, the first lane 604A runs ahead of the second through fourth lanes, 604B . . . 604D, when it completes execution of load v1A and begins executing load v2A. The third lane 604C runs ahead of the second and fourth lanes, 604B and 604D, when it completes execution of load v1C and begins executing load v2C. Further, the second and fourth lanes, 604B and 604D, run ahead of the first and third lanes, 604A and 604C, when the second and fourth lanes, 604B and 604D, complete execution of load v2B and load v2D and begin execution of second and fourth lane additions, respectively.
In the vector processor 600, the vector control & distribution unit 602 contributes to resolving a cross-lane dependency requirement. A cross-lane dependency requirement arises where an instruction within a particular lane cannot be executed until an instruction within another lane completes execution. In an embodiment, the vector control & distribution unit 602 resolves the cross-lane dependency requirement by awaiting confirmation of fulfillment or impending fulfillment of the cross-lane dependency requirement prior to releasing vector element operations that depend upon the cross-lane dependency requirement. In another embodiment, the vector control & distribution unit 602 forwards inter-lane dependency instructions to the lane control units 605 that instruct the lanes 604 to await fulfillment or impending fulfillment of an inter-lane dependency requirement prior to the lanes 604 executing vector element operations that depend upon the inter-lane dependency requirement.
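The first resolution strategy described above (the distribution unit withholding dependent operations until the producing operation is fulfilled or impending in every lane) can be sketched as a status poll. The representation is a hypothetical assumption for illustration: per-lane status dictionaries and the names `release_dependent_ops`, `"done"`, and `"impending"` are not from the source.

```python
# Hypothetical sketch: the vector control & distribution unit releases
# dependent element operations only when every lane reports that the
# producing operation is complete or about to complete.
def release_dependent_ops(lane_status, producer_id):
    """True when all lanes confirm fulfillment or impending fulfillment
    of the cross-lane dependency requirement."""
    return all(status[producer_id] in ("done", "impending")
               for status in lane_status)

lane_status = [{"op0": "done"}, {"op0": "impending"},
               {"op0": "done"}, {"op0": "done"}]
assert release_dependent_ops(lane_status, "op0")   # safe to release

lane_status[1]["op0"] = "executing"                # one lane not ready
assert not release_dependent_ops(lane_status, "op0")
```

Under the second strategy, the check would instead be performed inside each lane control unit against inter-lane dependency instructions, but the fulfillment test itself is the same.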
An example depicts operation of the vector processor 600 when a cross lane dependency exists and where the vector control & distribution unit 602 resolves the dependency. The vector control & distribution unit 602 of the vector processor 600 (
In an embodiment of the vector processor 600, the lane control units 605 may independently adjust pipelining of their vector element operations. For example, with reference to the timing diagram 800, the lane control unit 605 of the first lane 604A may reverse the order of load v1A and load v2A.
Another example of independent adjustment of pipelining within a lane is provided as a timing diagram in
Another example of independent adjustment of pipelining within a lane is provided as a timing diagram in
Another embodiment of a vector processor of the present invention is illustrated schematically in
The foregoing detailed description of the present invention is provided for the purposes of illustration and is not intended to be exhaustive or to limit the invention to the embodiments disclosed. Accordingly, the scope of the present invention is defined by the appended claims.