The present disclosure relates to processing units and computer architecture, and in particular to a method and device for providing a vector stream instruction set architecture extension for a central processing unit (CPU).
Generally, memory access latency in computing is high and is often the system performance bottleneck. In fields such as high performance computing (HPC), digital signal processing (DSP), artificial intelligence (AI)/machine learning (ML) and computer vision (CV), similar computation operations are repeatedly performed on data stored in memory, often in the form of streams. In the case of loops or nested loops within an application, similar operations are repeatedly performed on data, which presents a considerable obstacle to performance given the limitations of memory access. To improve the efficiency of memory accesses, instructions and/or data that are likely to be accessed by CPUs are copied from the memory locations where they are stored to faster local memory, such as local cache memory, through an operation known as prefetching.
Array-based memory accesses can be categorized into two types based on the type of index value: direct memory accesses, for which the index is based on an induction variable, and indirect memory accesses, for which the index is based on another array access. With existing solutions, the effectiveness of the prefetching operation is greatly diminished in the case of indirect array-based memory accesses, wherein the array index of an access is itself defined by another array access.
The present disclosure provides a method and device for providing a vector stream instruction set architecture extension for a CPU and for processing vector data streams. An instruction set architecture (ISA) represents an abstract model of a computer; an ISA can be implemented, or realized, in the form of a physical CPU. Both general-purpose CPUs and domain-specific CPUs are designed around an ISA, and the vectorized stream instruction set of the present disclosure may be used with both. In various examples described herein, there is provided a vector ISA extension that operates on multiple data streams configured in vector format in parallel (i.e., concurrently). The vector ISA extension of the present disclosure extends an ISA so that it can process vector data streams, and it maintains the dependency relationships between arrays of the vector data streams. The vector data streams may be processed vectorially, with array-index calculation performed in batches by determining array indices from registers based on the dependency relationships. An explicit instruction to retrieve data from memory for a higher-level dependency stream causes implicit instructions to be performed for one or more of the vector data streams on which the higher-level dependency stream depends. The vector ISA extension of the present disclosure enables the processing unit(s) of a host computing device to issue vector instructions in addition to scalar instructions. The present disclosure also provides a Vector Stream Engine Unit (V-SEU), a hardware processing unit configured to execute vector streams output from the vector processing ISA extensions.
In accordance with a first aspect of the present disclosure, there is provided a method of processing vector data streams by a processing unit, the method comprising: initiating a first vector data stream for a first set of array-based memory accesses, wherein the first vector data stream is associated with a first array index for advancing the first set of array-based memory accesses, wherein the first array index is an induction variable; initiating a second vector data stream for a second set of array-based memory accesses, wherein the second vector data stream is associated with a second array index for advancing the second set of array-based memory accesses, wherein the second array index is dependent on array values of the first set of array-based memory accesses; prefetching a first plurality of data elements requested by the first set of array-based memory accesses from a memory into a first fast memory storage by advancing the first array index by a plurality of increments; prefetching a second plurality of data elements requested by the second set of array-based memory accesses from a vector register file into a second fast memory storage, wherein the second array index is advanced as the first plurality of data elements is used as array values; and processing a plurality of the prefetched second plurality of data elements through an explicit instruction for the second vector data stream, wherein the execution of the explicit instruction causes the processing unit to translate the explicit instruction to an implicit instruction to execute a plurality of the prefetched first plurality of data elements.
In some or all examples of the first aspect, the first plurality of data elements and the second plurality of data elements are prefetched based on stream information stored in a stream configuration table (SCT).
In some or all examples of the first aspect, an initial value and end value of the induction variable and the base address of the first set of array-based memory accesses are stored in the SCT for the first vector data stream.
In some or all examples of the first aspect, the stream information of the SCT includes stream dependency relationship information.
In some or all examples of the first aspect, the method further comprises: determining conflicts in the second plurality of data elements prior to prefetching the second plurality of data elements; and serializing at least the conflicting data elements of the second plurality of data elements in response to detection of a conflict during the prefetching of the second plurality of data elements.
In some or all examples of the first aspect, only the conflicting data elements are serialized during the prefetching of the second plurality of data elements.
In some or all examples of the first aspect, the method further comprises: generating a conflict mask in response to detection of a conflict; wherein the conflicting data elements are serialized using the conflict mask.
In some or all examples of the first aspect, the vector data streams are processed vectorially while maintaining dependency relationships between arrays of the vector data streams, wherein array-index calculation is performed in batches by determining array-index values from registers of the vector register file based on the dependency relationships.
In some or all examples of the first aspect, the method further comprises: converting scalar instructions to vector instructions comprising the first vector data stream and second vector data stream.
In accordance with a second aspect of the present disclosure, there is provided a system, which may be, for example, a vector stream engine unit, comprising: a first fast memory storage for temporarily storing data of vector data streams from a memory for loading into a vector register file; a second fast memory storage for temporarily storing data of the vector data streams from the vector register file for loading into the memory; a prefetcher configured to prefetch data of the vector data streams from the memory into the first fast memory storage, and to prefetch data of the vector data streams from the vector register file into the second fast memory storage; and a stream configuration table (SCT) storing stream information for prefetching data from the vector data streams.
In some or all examples of the second aspect, an initial value and end value of the induction variable and the base address of the first set of array-based memory accesses are stored in the SCT for the first vector data stream.
In some or all examples of the second aspect, the stream information of the SCT includes stream dependency relationship information.
In some or all examples of the second aspect, the first fast memory storage and second fast memory storage are First-In-First-Out (FIFO) buffers.
In some or all examples of the second aspect, the FIFOs have a size based on a prefetching depth.
In some or all examples of the second aspect, the data in each vectorized data stream is accessed in vector batches of a fixed size and the FIFO size is a multiple of a size of the vector batches of the vector data streams.
In some or all examples of the second aspect, multiplexers select signals of the vector stream engine unit and pass the selected signal to the memory or vector register file in accordance with a respective signal type.
In some or all examples of the second aspect, the vector data streams are comprised of a sequence of memory accesses having repeated patterns that are the result of loops and nested loops.
In some or all examples of the second aspect, the vector data streams are classified into two groups consisting of memory streams that define a memory access pattern and induction streams that define a repeating pattern of values.
In some or all examples of the second aspect, the memory streams are dependent on either an induction stream for direct memory access or another memory stream for indirect memory access.
In some or all examples of the second aspect, the vector stream engine unit further comprises a compiler for compiling source code and porting the compiled code to at least one processing unit of a host computing device for execution.
In some or all examples of the second aspect, the vector stream engine unit is configured to perform the methods described above in the first aspect and herein.
In accordance with a further aspect of the present disclosure, there is provided a computing device comprising a processor, a memory and a communication subsystem. The memory has tangibly stored thereon executable instructions for execution by the processor. The executable instructions, in response to execution by the processor, cause the computing device to perform the methods described above and herein.
In accordance with a further aspect of the present disclosure, there is provided a non-transitory machine-readable medium having tangibly stored thereon executable instructions for execution by a processor of a computing device. The executable instructions, in response to execution by the processor, cause the computing device to perform the methods described above and herein.
Other aspects and features of the present disclosure will become apparent to those of ordinary skill in the art upon review of the following description of specific implementations of the application in conjunction with the accompanying figures.
The present disclosure is made with reference to the accompanying drawings, in which embodiments are shown. However, many different embodiments may be used, and thus the description should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this application will be thorough and complete. Wherever possible, the same reference numbers are used in the drawings and the following description to refer to the same elements, and prime notation is used to indicate similar elements, operations or steps in alternative embodiments. Separate boxes or illustrated separation of functional elements of illustrated systems and devices does not necessarily require physical separation of such functions, as communication between such elements may occur by way of messaging, function calls, shared memory space, and so on, without any such physical separation. As such, functions need not be implemented in physically or logically separated platforms, although they are illustrated separately for ease of explanation herein. Different devices may have different designs, such that although some devices implement some functions in fixed function hardware, other devices may implement such functions in a programmable processor with code obtained from a machine-readable medium. Lastly, elements referred to in the singular may be plural and vice versa, except where indicated otherwise either explicitly or inherently by context.
Indirect memory access can be difficult for a prefetcher to process because the actual data element being accessed depends on the values of other arrays. Hence, each memory access may be effectively random in practice, causing substantial misses in the cache hierarchy and resulting in poor prefetching accuracy. Further, when array accesses are converted to assembly language, a number of assembly instructions are required to calculate the memory address from which to load the data. In the case of indirect memory accesses, additional address-calculation instructions are required for each additional level of indirection before the correct data can be accessed and loaded from memory. This address-calculation overhead further exacerbates the performance degradation.
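By way of a minimal sketch in C (the array and function names are illustrative only), a single level of indirection already forces a dependent load before each access:

    /* One level of indirect array access: the address of a[b[i]] cannot be
     * formed until b[i] has been loaded, so each level of indirection adds a
     * dependent load plus address-calculation instructions per element. */
    long sum_indirect(const int *a, const int *b, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            int idx = b[i];   /* first load: fetch the index value             */
            sum += a[idx];    /* dependent load: address is effectively random */
        }
        return sum;
    }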
There are two major drawbacks associated with the approaches of the prior art. Firstly, only scalar operations are supported by the proposed stream instruction set architectures (ISAs). Specifically, the existing proposals for a streaming ISA, or similar ideas, do not go beyond scalar data elements. At each access point, the software loads/stores a single data element of the stream, and correspondingly, the stream is also advanced by a single element at each loop iteration. Thus, the data consumption rate, and thereby the result generation rate, per stream is limited to one element per loop iteration. This significantly limits the gain potentially attainable from the streams prefetched by the hardware.
Secondly, the streams are often consumed by vector instructions, which necessitates additional preparations, precautions, or support. Streams usually occur in hot loops of the application programs and thus they inherently contain repetitive operations on a multitude of data elements in arrays. Consequently, the streams lend themselves well to vector operations, and indeed current compilers vectorize many such loops. However, when vectorized, even if only one of the vector operands is not available due to a cache miss, the entire vector instruction must wait, thereby resulting in additional stall cycles.
The embodiments of the present disclosure provide methods and systems for vector stream instruction processing by a computing device. Implementations of the embodiments and examples described below in the present disclosure may overcome the drawbacks identified above.
Within the present disclosure, the terms arrays and streams are used interchangeably.
The computing device 100 includes at least one processing unit (also known as a processor) 102 such as a central processing unit (CPU) with an optional hardware accelerator, a vector processing unit (also known as an array processing unit), a graphics processing unit (GPU), a tensor processing unit (TPU), a neural processing unit (NPU), a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a dedicated logic circuitry, a dedicated artificial intelligence processor unit, or combinations thereof.
The computing device 100 may also include one or more input/output (I/O) interfaces 104, which may enable interfacing with one or more appropriate input devices 106 and/or output devices 108. In the example shown, the input device(s) 106 (e.g., a keyboard, a mouse, a microphone, a touchscreen, and/or a keypad) and output device(s) 108 (e.g., a display, a speaker and/or a printer) are shown as optional and external to the computing device 100. In other examples, one or more of the input device(s) 106 and/or the output device(s) 108 may be included as a component of the computing device 100. In other examples, there may not be any input device(s) 106 and output device(s) 108, in which case the I/O interface(s) 104 may not be needed.
The computing device 100 may include one or more network interfaces 110 for wired or wireless communication with a network. In example embodiments, network interfaces 110 include one or more wireless interfaces such as transmitters 112 that enable communications in a network. The network interface(s) 110 may include interfaces for wired links (e.g., Ethernet cable) and/or wireless links (e.g., one or more radio frequency links) for intra-network and/or inter-network communications. The network interface(s) 110 may provide wireless communication via one or more transmitters 112 or transmitting antennas, one or more receivers 114 or receiving antennas, and various signal processing hardware and software. In this regard, some network interface(s) 110 may include respective computing systems that are similar to computing device 100. In this example, a single antenna 116 is shown, which may serve as both transmitting and receiving antenna. However, in other examples there may be separate antennas for transmitting and receiving.
The computing device 100 may also include one or more storage units 118, which may include a non-transitory machine-readable medium (or device) such as a solid-state drive, a hard disk drive, a magnetic disk drive and/or an optical disk drive. The computing device 100 includes memory 120, which may include a volatile or non-volatile memory, such as flash memory, random access memory (RAM), and/or a read-only memory (ROM). The storage units 118 and/or memory 120 may store instructions for execution by the processing unit(s) 102 to carry out the methods of the present disclosure as well as other instructions, such as for implementing an operating system or other applications/functions.
The storage devices (e.g., storage units 118 and/or non-transitory memory(ies) 120) may store software source code of the vector ISA extension for a general-purpose processor architecture or domain-specific processor architecture. The memory 120 may also include a Load-Store Queue and L1 Data cache from which data elements are read for stream load operations or to which data elements are written for stream store operations.
In some examples, one or more data sets and/or module(s) may be provided by an external memory (e.g., an external drive in wired or wireless communication with the computing device 100) or may be provided by a transitory or non-transitory computer-readable medium. Examples of non-transitory computer readable media include a RAM, a ROM, an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a flash memory, a CD-ROM, or other portable memory storage.
The computing device 100 may also include a bus 122 providing communication among components of the computing device 100, including the processing unit(s) 102, I/O interface(s) 104, network interface(s) 110, storage unit(s) 118, and memory(ies) 120. The bus 122 may be any suitable bus architecture including, for example, a memory bus, a peripheral bus or a video bus.
The vector register file 220 comprises a plurality of vector registers. Each vector register includes a plurality of elements. The vector register file 220 is configured to perform a method including receiving a read command at a read port of the vector register file 220. The read command specifies a vector register address. The vector register address is decoded by an address decoder to determine a selected vector register of the vector register file 220. An element address is determined for one of the plurality of elements associated with the selected vector register based on a read element counter of the selected vector register. A data element in a memory array of the selected vector register is selected as read data based on the element address. The read data is output from the selected vector register based on the decoding of the vector register address by the address decoder.
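A minimal behavioral sketch of this read path is given below in C; the structure fields, sizes, and function name are illustrative assumptions rather than the actual hardware design:

    /* Behavioral model of a vector register file read port (illustrative only). */
    #define NUM_VREGS  32   /* number of vector registers (assumed power of two) */
    #define VREG_ELEMS 8    /* elements per vector register (assumed)            */

    typedef struct {
        unsigned elems[VREG_ELEMS]; /* memory array of elements in the register  */
        unsigned read_counter;      /* read element counter for this register    */
    } vreg_t;

    static vreg_t vrf[NUM_VREGS];

    /* Decode the vector register address, select the element addressed by the
     * register's read element counter, and output it as read data. */
    unsigned vrf_read(unsigned vreg_addr) {
        vreg_t *vr = &vrf[vreg_addr & (NUM_VREGS - 1)]; /* address decoder       */
        unsigned elem_addr = vr->read_counter % VREG_ELEMS;
        unsigned read_data = vr->elems[elem_addr];      /* selected data element */
        vr->read_counter++; /* behavioral assumption: advance for the next read  */
        return read_data;
    }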
The stream load FIFO 202 and the stream store FIFO 204, collectively referred to as stream FIFOs, are configured to temporarily hold data fetched from memory 120 or from the vector register file 220 in vector format when reading from, or writing to, registers of the vector register file 220. Streams are comprised of a sequence of memory accesses having repeated patterns that are the result of loops and nested loops. Each memory access of a stream may be identified through a register of the vector register file 220, which serves as a special-purpose register identifier that may be used to refer to data within a particular stream. Each register of the vector register file 220 may be defined with a register width that in turn determines the amount of data that may be loaded or stored by each instruction. In instances where the entirety of the register data is not used, a mask may be used to define the useful portion of the register data, as described in more detail below. In vector instructions in accordance with the present disclosure, instruction operands may be delivered from the vector register file 220.
In some embodiments, the streams may be classified into two groups: memory streams that describe a memory access pattern; and induction streams that define a repeating pattern of values. Memory streams may be dependent on either an induction stream (direct memory access) or another memory stream (indirect memory access).
The stream FIFOs 202, 204 are configured to hold data to be consumed, or generated, by vector instructions issued from the processing unit(s) 102 of the computing device 100. It is understood that, despite only two FIFOs being shown, any suitable number of stream FIFOs may be provided.
The vector controller 206 includes a Stream Configuration Table (SCT) module 208, a prefetcher 210, and any additional control logic based on the application. The SCT module 208 is a memory or cache that maintains a stream configuration table. The SCT module 208 and prefetcher 210 may be implemented in firmware. On the memory interface side, appropriate logic for memory-address generation and tracking should be added to perform the prefetch by interacting with the appropriate components of the processing unit(s) 102. This is implementation-dependent, based on the amount of parallel address-calculation circuitry the implementer wishes to dedicate, and should be designed accordingly. On the FIFO interface side, appropriate logic for vectored access should be designed to allow the heads of the FIFOs to be accessed as a corresponding register of the vector register file 220 by the software being run on the processing unit(s) 102, both for data communication between the FIFOs and the vector register file 220 and for writing prefetched data to the tails of the FIFOs (i.e., the prefetch operation).
The SCT is a table that holds the per-stream information necessary to prefetch data elements of the streams. In some embodiments, one row of the SCT is allocated per stream. The per-stream information includes stream dependency relationship information. By way of an example, in the programming code c[b[i]]=a[i]+b[i], wherein a, b, and c are arrays in memory 120 and i is the induction variable for looping through elements of the arrays, a separate stream would be initiated for each of i, a, b, and c. The induction variable stream i is referred to as the base stream. The load streams a[i] and b[i] are each directly dependent on the induction variable stream. The store stream c[b[i]] is directly dependent on stream b[i] and indirectly dependent on the base stream, with an additional level of indirection compared to the dependency of b[i]. Other per-stream information, including the initial value and end value of the induction variable and the base address of each of the arrays a, b, and c, may also be stored in the SCT.
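One possible shape of a per-stream SCT row, under the assumption of one row per stream, is sketched below in C; the field names and types are illustrative only:

    /* Hypothetical SCT row; actual fields and widths are implementation-defined. */
    typedef struct {
        int   stream_id;   /* identifier of this stream                           */
        int   parent_id;   /* stream this one directly depends on (-1 for a base  */
                           /* induction stream)                                   */
        void *base_addr;   /* base address of the array (unused for base streams) */
        long  init_value;  /* initial value of the induction variable             */
        long  end_value;   /* end value of the induction variable                 */
        int   elem_size;   /* element size in bytes                               */
        int   is_store;    /* 0: load stream, 1: store stream                     */
    } sct_row_t;

For the example c[b[i]]=a[i]+b[i], the table would hold four rows: a base row for i, rows for a and b with parent i, and a row for c with parent b.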
Reading from the memory 120 into the stream load FIFO 202 and writing data to the tail of the stream store FIFO 204 are non-speculative operations performed by the prefetcher 210. Speculative prefetching can happen in the case of control-flow operations in the loop, such as if-then-else. The prefetcher 210 uses the per-stream information from the SCT, such as base memory addresses, array induction variables/streams, and vector length, to access data elements from memory 120. For maximum performance gain, prefetching may be performed before the software reaches the point where the data is required. Notably, for store streams, the stream FIFOs of the V-SEU 200 are indifferent to the cache write policies of the processing unit(s) 102. For example, a data block of data elements may be prefetched into the L1 data cache in memory 120 under a write-allocate policy or bypass the cache under a write-around policy. In either case, the stream FIFO 204 maintains the processor-produced data based on the data type being used in software. Because data is read/written from/to caches in blocks, the remaining part of each data block that the processing unit(s) 102 writes to is still prefetched from the memory 120.
At step 302, streams are explicitly initiated or constructed through vector instructions such as a vector_stream_open instruction, whereby the V-SEU 200 is initialized and creates a new vector_stream, which is referred to as a data stream or a vector data stream. Each stream includes a set of array-based memory accesses, and the contents of each stream may be accessed through a register in the vector register file 220. Each of the streams is associated with an array index. Each register in the vector register file 220 may be associated with a respective stream of the stream FIFOs 202, 204. During this phase, sufficient information may be passed to the prefetcher 210 such that data elements are properly prefetched and stored vector-wise in the corresponding stream FIFOs 202, 204 for later load/store by corresponding instructions. The information provided to the prefetcher 210, or stream metadata, is stored in the SCT to enable the start of prefetching operations for the streams. By way of a non-limiting example, in one embodiment, the vector_stream_open instruction is as follows:
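(Shown as a C-style intrinsic for illustration; the operand names and their order are assumptions rather than an exact instruction encoding.)

    /* Open (initiate) a stream and return its stream identifier.
     * parent    : stream this one directly depends on (-1 for a base stream)
     * base_addr : base address of the array in memory (NULL for base streams)
     * init, end : initial and end values of the induction variable
     * elem_size : element size in bytes
     * vl        : vector length (batch size) for vectorized access
     * is_store  : 0 for a load stream, 1 for a store stream                  */
    int vector_stream_open(int parent, void *base_addr, long init, long end,
                           int elem_size, int vl, int is_store);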
One stream may be initiated for each array dependency level. In an example, the array index of an initiated vector data stream may be dependent on array values of another set of array-based memory accesses. By way of an example, in the programming code c[b[i]]=a[i]+b[i], separate streams may be initiated for the base stream induction variable i, the directly dependent streams a[i] and b[i], and the indirectly dependent stream c[b[i]]. The induction variable i is the array index for streams a and b, and the values of array b serve as the array index for array c.
At step 304, data elements from the memory 120 are prefetched by the prefetcher 210 into fast memory storages, such as the stream FIFOs 202, 204, and readied for consumption. In examples, the data elements may be prefetched from the memory to the fast memory storage by advancing the array index by a plurality of increments. While for each individual prefetch a memory address may be calculated and provided to the memory subsystems, the batch prefetching of the present disclosure does the same in parallel for a multitude of addresses, maximally as wide as the number of lanes in the vector system of the processing unit(s) 102. As non-limiting alternative implementations, this could be realized by replicating the address-calculation hardware and other necessary resources such as bus lines, or the resource usage could be reduced by sharing some parts among lanes. The V-SEU 200 may read ahead all entries of a vector stream even though only a subset of those entries may actually be used, through masking, as discussed in more detail below. Some of the data elements may not actually be used due to conditional data access. The V-SEU 200, and more specifically the prefetcher 210, is responsible for prefetching data from memory 120 ahead of execution and without further software intervention after stream initialization, such as by the vector_stream_open instruction. The prefetcher 210 may speculatively prefetch the data and maintain it in the vector stream FIFOs 202 and 204.
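A minimal sketch of the batched address calculation is given below, assuming a unit-stride direct stream and an indirect stream whose indices come from the parent stream's prefetched values; all names are illustrative. In hardware, the per-lane calculations would proceed in parallel across the vector lanes:

    #include <stddef.h>

    /* Compute a batch of vl prefetch addresses in one step. */
    void compute_batch_addresses(char *base_direct, char *base_indirect,
                                 const long *parent_vals, long i,
                                 size_t elem_size, int vl,
                                 char *addr_direct[], char *addr_indirect[]) {
        for (int lane = 0; lane < vl; lane++) {
            /* direct stream: the index is the induction variable itself      */
            addr_direct[lane]   = base_direct + (size_t)(i + lane) * elem_size;
            /* indirect stream: the index comes from the parent stream's data */
            addr_indirect[lane] = base_indirect
                                  + (size_t)parent_vals[lane] * elem_size;
        }
    }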
In one embodiment, by referencing the parent stream (i.e., stream upon which a stream is directly dependent) and the base memory address of the parent stream, prefetching is performed for indirect memory accesses. This saves address-calculation instructions as well as instructions needed for loading index arrays for that batch of indirect memory accesses, and effectively performs the gather/scatter operations, corresponding to indirect load/store operations, fully in hardware, transparent to the software.
The prefetching may be performed on every memory reference or, alternatively, on every cache miss or on positive feedback from prefetched data hits. The prefetched information is stored in the stream FIFOs 202 and 204, and upon reference from a software instruction during execution, the stored data is transferred between the FIFOs 202, 204 and the vector registers, for load and store operations respectively, in batches.
At step 306, the data elements of the streams are consumed or processed by executing software instructions from the application. The data element consumption is primarily facilitated by storing and loading operations. In an example, a plurality of the prefetched second plurality of data elements are processed by executing a first instruction (e.g., an explicit instruction) for the second vector data stream, and the execution of the explicit instruction causes the processing unit to translate the explicit instruction to a second instruction (e.g., an implicit instruction) to execute a plurality of the prefetched first plurality of data elements.
For loading operations, the V-SEU 200 loads the corresponding data elements from a vector stream into a vector register. In one exemplary embodiment, the loading operation may be carried out with a vector instruction of the following form:
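(Shown as a C-style intrinsic for illustration; the name follows the conventions of this disclosure, but the signature is an assumption. vreg_t denotes a vector register value, as in the behavioral model above.)

    /* Transfer the next vector-length batch of prefetched elements from the
     * head of the stream's load FIFO 202 into a destination vector register. */
    vreg_t vector_stream_load(int stream_id);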
For storing operations, the V-SEU 200 retrieves a vector register value and writes it to a vector stream, similar to a vector-transfer operation. In one exemplary embodiment, the storing operation may be performed in response to a vector instruction of the following form:
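(Again a C-style sketch with assumed signatures; the masked variant reflects the masking of unused lanes discussed herein.)

    /* Write a vector register's elements to the tail of the stream's store
     * FIFO 204, from which they are drained to memory.                      */
    void vector_stream_store(int stream_id, vreg_t src);

    /* Masked variant: only lanes whose bit is set in lane_mask are written. */
    void vector_stream_store_masked(int stream_id, vreg_t src,
                                    unsigned lane_mask);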
In certain embodiments, there exists the possibility of duplicate entries in one or more of the stream FIFOs 202, 204 for duplicate addresses in the “index” streams.
To resolve the conflict, as part of the vectorized storing instruction, the V-SEU 200 detects the conflict. In some embodiments, a conflict-detection ISA extension may be used. By way of a non-limiting example, a run-time conflict detection instruction, denoted vec_conf_detect herein and comparable to the conflict-detection instructions introduced in the Advanced Vector Extensions (AVX)-512 extensions to the x86 instruction set architecture (e.g., VPCONFLICTD), may be used to detect conflicts. In this particular embodiment, the instruction determines the conflicts among vector elements and returns a mask that is used in subsequent vector instructions to avoid the conflicting vector elements, as in the following pseudocode:
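(Illustrative pseudocode only, not exact AVX-512 syntax; vec_conf_detect stands in for a hardware conflict-detection instruction, and the stream identifiers follow the running example.)

    idx_vec       = vector_stream_load(s_b);   /* batch of store indices       */
    val_vec       = vector_stream_load(s_a);   /* batch of values to store     */
    conflict_mask = vec_conf_detect(idx_vec);  /* lanes with duplicate indices */
    /* store all conflict-free lanes in parallel */
    vector_stream_store_masked(s_c, val_vec, ~conflict_mask);
    /* serialize the conflicting lanes in original program order */
    for (lane = 0; lane < VL; lane++)
        if (conflict_mask & (1 << lane))
            c[idx_vec[lane]] = val_vec[lane];  /* scalar store, in order       */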
In some embodiments, the vector_stream_store instruction returns a conflict_mask that is taken into account by the software code to properly resolve the detected conflicts.
In resolving detected conflicts, the software code, upon receipt of the conflict_mask, may serialize writing to the conflicting vector elements and operate on each data element in the same order as in the original non-vectorized software code, so that the original semantics are kept intact. In some embodiments, the V-SEU 200 may revert from a vector version to a scalar version of the loop for vector-length iterations upon detecting a conflict. In that case, the scalar version of the loop, which has also been produced by the compiler, is executed. In this conflict resolution method, all elements of the vector are serialized. This may be simpler to implement but imposes unnecessary serialization on conflict-free elements.
In some further embodiments, detected conflicts may be resolved by serializing operations on conflicting elements only. Again, the outcome of the conflict-detection identifies the vector lanes that are conflicting, such as 408D, 408E, and 408G, and only serializes these ones, namely vector lanes corresponding to base stream values 3, 4, and 6. The semantics of the original non-vector code is kept intact, while parallelism is kept for non-conflicting elements.
During data processing (or data consumption), stream steps update the associated induction variables by advancing the register position of all dependent streams. Vector streams are advanced in multiples of the vector length: as data is consumed or produced in vectors, at the end of each round of accesses in the loop, the streams are moved forward by vector-length elements. Similarly, the induction variable of the corresponding vectorized loop is always advanced in multiples of the vector length, thus enabling the advancement of the register position by multiples of the vector length.
When a loop is vectorized, all accesses to the streams are vector-length aligned. Thus, in some embodiments, scalar and vector streams may not be mixed under each base stream, including nested loops wherein in the inner loop accesses are made using the outer loop induction variable. For accesses using separate base-streams, however, the loops can be independently scalar or vector.
At step 308, at the end-of-life of the streams, each stream is closed and its occupied resources in the V-SEU 200 are returned to the system. An example stream close instruction may be as follows:
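(A C-style sketch with an assumed signature, consistent with the other illustrative intrinsics above.)

    /* Close (destruct) a stream and release its FIFO and SCT resources.
     * Closing a base stream also closes all streams that depend on it.      */
    void vector_stream_close(int stream_id);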
The pipeline 500 shows the relative position of the added hardware components in the processing unit(s) 102 with respect to the data path in various stages of the processing pipeline. The components with dashed outlines are the components added so that the vectorized stream instructions may be decoded in accordance with the present disclosure. The stream FIFOs 202, 204 are respectively shown before and after (left and right of) the L2 data cache 528, as data read from memory 120 is stored in the stream load FIFO 202 and data is written to the memory 120 from the stream store FIFO 204. The software-directed vector-stream prefetcher 210 performs prefetching operations by utilizing the stream information stored in the SCT to identify how, and from where, the prefetching should be done for each stream, and issues the necessary control signals to the appropriate units in the processor accordingly. As explained above, the exact operations and signals are implementation-dependent and would differ from processor to processor based on how hardware prefetching is done (if at all) in that processor. An important addition of this disclosure is the batched prefetching, corresponding to maximally a vector-length worth of prefetches: the batch of memory addresses is calculated; if addresses overlap, they are coalesced into a smaller number of memory requests; these memory requests are then passed to the memory subsystem 120 and the returning data is stored in the stream FIFOs 202 and 204.
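For reference in the discussion that follows, the numbered listing below is a hypothetical reconstruction of the example code (the instruction spellings, operands, and the vector_stream_step mnemonic are illustrative assumptions; the line numbers correspond to those referenced in the text). The example corresponds to a loop of the form c[b[i]] = a[i] with N iterations:

     1: s_i = vector_stream_open(-1,  NULL, 0, N, sizeof(int), 4, 0); /* base stream i, VL=4 */
     2: s_a = vector_stream_open(s_i, a,    0, N, sizeof(int), 4, 0); /* a[i], direct        */
     3: s_b = vector_stream_open(s_i, b,    0, N, sizeof(int), 4, 0); /* b[i], direct        */
     4: s_c = vector_stream_open(s_b, c,    0, N, sizeof(int), 4, 1); /* c[b[i]], indirect   */
     5: loop:
     6:     v = vector_stream_load(s_a);    /* load a[i..i+3]                    */
     7:     vector_stream_store(s_c, v);    /* store to c[b[i]..b[i+3]]          */
     8:     vector_stream_step(s_i);        /* i += 4; dependent streams advance */
     9:     /* branch back to line 5 while iterations remain */
    10: vector_stream_close(s_i);           /* closes s_i, s_a, s_b and s_c      */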
As shown in the example, inside the loop there is no need to calculate addresses for the a[i] elements, nor for b[i] and c[b[i]]. In fact, loads from b[i] are eliminated altogether: a load from stream s_a and a store to stream s_c are all that is necessary, at lines 6 and 7, respectively. By executing the explicit load and store operations, the stream b[i] is implicitly executed. This happens because the relation between b[ ] and c[ ] has already been passed to the V-SEU 200, and hence any access to c[ ] implies a corresponding access to b[ ].
In the stream initiation command for the base stream at line 1, the vector length (VL) is set to 4. Thus, to advance to the next iteration of the loop, the induction variable i needs to be incremented by 4. This is done by line 8, which also causes the other three streams s_a, s_b, and s_c to advance to their next 4 elements due to their dependency on s_i as the base stream.
Finally, at line 10, all four streams are closed (or destructed) by instructing the hardware to close the base stream s_i and all its dependent streams.
General
Through the descriptions of the preceding embodiments, the present invention may be implemented by using hardware only, or by using software and a necessary universal hardware platform, or by a combination of hardware and software. The coding of software for carrying out the above-described methods is within the scope of a person of ordinary skill in the art having regard to the present disclosure. Based on such understandings, the technical solution of the present invention may be embodied in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which can be an optical storage medium, flash drive or hard disk. The software product includes a number of instructions that enable a computing device (personal computer, server, or network device) to execute the methods provided in the embodiments of the present disclosure.
All values and sub-ranges within disclosed ranges are also disclosed. Also, although the systems, devices and processes disclosed and shown herein may comprise a specific plurality of elements, the systems, devices and assemblies may be modified to comprise additional or fewer of such elements. Although several example embodiments are described herein, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the elements illustrated in the drawings, and the example methods described herein may be modified by substituting, reordering, or adding steps to the disclosed methods.
Features from one or more of the above-described embodiments may be selected to create alternate embodiments comprised of a subcombination of features which may not be explicitly described above. In addition, features from one or more of the above-described embodiments may be selected and combined to create alternate embodiments comprised of a combination of features which may not be explicitly described above. Features suitable for such combinations and subcombinations would be readily apparent to persons skilled in the art upon review of the present disclosure as a whole.
In addition, numerous specific details are set forth to provide a thorough understanding of the example embodiments described herein. It will, however, be understood by those of ordinary skill in the art that the example embodiments described herein may be practiced without these specific details. Furthermore, well-known methods, procedures, and elements have not been described in detail so as not to obscure the example embodiments described herein. The subject matter described herein and in the recited claims is intended to cover and embrace all suitable changes in technology.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the invention as defined by the appended claims.
The present invention may be embodied in other specific forms without departing from the subject matter of the claims. The described example embodiments are to be considered in all respects as being only illustrative and not restrictive. The present disclosure is intended to cover and embrace all suitable changes in technology. The scope of the present disclosure is, therefore, described by the appended claims rather than by the foregoing description. The scope of the claims should not be limited by the embodiments set forth in the examples but should be given the broadest interpretation consistent with the description as a whole.