The invention relates to parallel processing units and to methods and arrangements for operating a processing pipeline with a parallel processor architecture.
Typical instruction processing pipelines in modern processor architectures have several stages and include at least a fetch stage and a decode stage. The fetch stage loads instructions together with instruction data usable by the instructions (often called immediate values), where the data is passed along with the instructions within the instruction stream. The data and instructions can be retrieved from an instruction memory system and forwarded to a decode stage. The decode stage can expand and split the instructions, assigning portions or segments of the total instruction to individual processing units, and can pass the segments to the execution stage.
One advantage of instruction pipelines is that the complex process of instruction processing can be broken up into separate stages which execute concurrently. That process includes accessing the instruction memories, fetching the instructions, decoding and expanding the instructions, analyzing whether data is scheduled to be written to registers in parallel while other instructions use it, executing the instructions, and writing results back to memories or to register files. Each stage performs a task, e.g., the fetch stage fetches instructions from an instruction memory system. Therefore, pipeline processing enables a system to process a sequence of instructions, one instruction per stage concurrently, to improve processing power due to the concurrent operation of all stages. In a pipeline environment, in one clock cycle one instruction can be fetched by the fetch stage whilst another is decoded in the decode stage and another is executed in the execute stage. Therefore, each instruction needs three processor cycles to propagate through a three-stage pipeline and to be processed in each of the three stages (i.e., one clock cycle each for fetch, decode and execute), assuming one cycle per stage. However, while an instruction is being processed by one stage, the other stages are concurrently processing other instructions.
Therefore, generally, one instruction can be executed by an execute stage each clock cycle. The more stages the instruction processing task can be broken into the faster each stage can operate. Higher clock frequencies can be achieved if the stages can operate faster and hence the system can operate faster. It is a pursuit of designers to design a pipeline with smaller and faster stages even though the pipeline itself may be longer.
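The timing relationship described above can be illustrated with a minimal Python sketch. It assumes an ideal three-stage pipeline with one cycle per stage and no stalls; the names `pipeline_schedule` and `total_cycles` are hypothetical and are introduced here only for illustration.

```python
STAGES = ["fetch", "decode", "execute"]  # three-stage pipeline from the text


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return {cycle: {stage: instruction index}} for an ideal pipeline.

    Assumption: one cycle per stage and no stalls or jumps.
    """
    schedule = {}
    for instr in range(num_instructions):
        for offset, stage in enumerate(stages):
            # Instruction `instr` enters the first stage in cycle `instr`
            # and advances by one stage per cycle.
            schedule.setdefault(instr + offset, {})[stage] = instr
    return schedule


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed to finish all instructions (num_instructions >= 1)."""
    # The first instruction needs num_stages cycles; each further
    # instruction completes one cycle after its predecessor.
    return num_stages + num_instructions - 1


if __name__ == "__main__":
    for cycle, busy in sorted(pipeline_schedule(4).items()):
        print(cycle, busy)
```

Under these assumptions a single instruction still takes three cycles from fetch to execute, but once the pipeline is full one instruction leaves the execute stage in every cycle.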
In pipeline processing, jump conditions can occur, where the instruction stream is not continuous. Because of the jump, instructions must be located and loaded into the pipeline, and the pipeline processing is interrupted. The earlier in the pipeline a jump can be detected, the quicker the system can react to the break in the instruction chain and hence the smaller the latency of the pipeline. On the other hand, if a jump is detected very late in the pipeline, each previous stage has to stall (or be idle) until instructions from the new instruction address(es) requested by the jump condition are loaded to these stages. As instructions are processed sequentially in a pipeline, the reload of the pipeline due to a jump can take several clock cycles. In the case of a jump, very long instruction pipelines are generally less flexible than short pipelines.
Two basic approaches are utilized to prevent a pipeline from stalling in case of a jump. One approach is to completely decouple the fetching of instructions from the pipeline. Whenever a jump occurs, the decoupled fetching system reads the new instructions (the so-called jump-target) from the new address and feeds the instructions starting with the jump-target to the pipeline. One disadvantage of this approach is that conditional jumps are not possible in such designs. A conditional jump is a jump which is only performed if a certain condition evaluates to true. Such an evaluation typically can only be performed by the execute stage, which is typically located in the middle of the pipeline. Another approach is to try to detect a jump very early in the pipeline, and this approach has similar disadvantages. In modern processor architectures, jumps typically are detected in the execute stage, which offers the highest flexibility; however, this arrangement has the drawback that all previous stages would have to stall in case of a jump.
In one embodiment, a method for operating a processing pipeline is disclosed. The method can include fetching an instruction in a first clock cycle, decoding the instruction in a second clock cycle and fetching instruction data associated with the instruction in the second clock cycle. The method can also include associating the instruction data with the instruction and feeding the instruction and the instruction data to a processing unit utilizing the association. The method can also include loading a register with instruction data, wherein the number of bits of instruction data loaded per clock cycle varies based on the amount of instruction data required to execute at least one instruction in a clock cycle. In addition, the method can include executing the instruction utilizing the association and the instruction data. Further, the method can include loading the instruction data into a register, where the instruction data has segments and at least one segment utilizes instruction data to execute the instruction in an execute stage. The instruction can have a first size and the instruction data can have a second size.
In another embodiment an apparatus is disclosed. The apparatus can include a fetch module to fetch instructions to a pipeline in a first clock cycle and to fetch instruction data in a second clock cycle. The apparatus can also include a decode module coupled to the fetch module to decode the instructions in the second clock cycle, an association module to associate the instructions with the instruction data and an execute module coupled to the decode module to execute the instructions utilizing the instruction data. In one embodiment the apparatus can include a forward module to feed instructions to the execute module.
In yet another embodiment the apparatus can include a memory access module to store results from the execute stage and a fetch buffer to store fetched instructions. The instruction word can be fetched and forwarded to a forwarding stage in a single clock cycle.
In another embodiment a computer program product is disclosed. The product can include a computer useable medium having a computer readable program, wherein the computer readable program when executed on a computer causes the computer to fetch an instruction in a first clock cycle, the instruction being assignable to one instruction segment from a plurality of instruction segments, decode the instruction in a second clock cycle, fetch instruction data associated with the instruction in the second clock cycle, associate the instruction data with the instruction, and feed the one instruction segment and the instruction data to a processing unit.
The product can also cause the computer to load the instructions from cache, load the instruction data into a register, the instruction data having segments wherein at least one segment utilizes instruction data in an execute stage, and execute the instruction utilizing the association and the instruction data.
In the following the disclosure is explained in further detail with the use of specific embodiments, which should not be utilized to limit the scope of the invention.
a depicts fetching and decoding of instructions and immediate values in conventional architectures;
b depicts fetching and decoding of instructions and immediate values in a combined fetch/decode stage;
The following is a detailed description of embodiments of the disclosure depicted in the accompanying drawings. The embodiments are in such detail as to clearly communicate the disclosure. However, the amount of detail offered is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure as defined by the appended claims. The descriptions below are designed to make such embodiments obvious to a person of ordinary skill in the art.
While specific embodiments will be described below with reference to particular configurations of hardware and/or software, those of skill in the art will realize that embodiments of the present disclosure may advantageously be implemented with other equivalent hardware and/or software systems. Aspects of the disclosure described herein may be stored or distributed on computer-readable media, including magnetic and optically readable and removable computer disks, as well as distributed electronically over the Internet or over other networks, including wireless networks. Data structures and transmission of data (including wireless transmission) particular to aspects of the disclosure are also encompassed within the scope of the disclosure.
In one embodiment, methods, apparatus and arrangements for executing instructions utilizing multi-unit processors that can execute very long instruction words (VLIWs) are disclosed. The processor can have a plurality of modules which can operate in one or more stages of a pipeline.
Methods and architectures are disclosed for fetching and decoding instruction words and immediate words (alternately called instruction data) in a processor using N parallel processing units. In each processor cycle the processor can execute one processor instruction. A processor instruction can consist of an instruction word group and an immediate word group containing a variable number of immediate values corresponding to the instructions of the instruction word group. Each instruction word can have zero, one, or several immediate values. In each processor cycle the processor can decode the current instruction words and can fetch the current immediate words and the next instruction words.
In case of a jump-hit the instruction cache system can extract the jump-target instruction(s) directly from the appropriate cache-lines and can bypass them to the decode stage whilst in parallel the fetch stage can fetch a next instruction word group and the immediate values which belong to the jump-target instructions.
One advantage of the method and apparatus of the present disclosure is that at least one cycle can be saved in case of a jump-hit compared to conventional solutions. Moreover, the clear structure of the processor instruction combined with a fetching of instruction groups and their assigned immediate values can reduce the complexity of the fetch stage, save chip area, and speed up the fetching task and the entire process.
In operation, in one clock cycle, the fetch stage 32 can fetch an instruction 51 and can write it to a decode register 33. The pipeline also can contain a decode stage 34 which can decode the instruction in a second clock cycle. Whilst the instructions 53 are decoded (in the same clock cycle), the fetch stage can fetch instruction data 52 associated with the instruction 53 being decoded. In other words, in a first clock cycle, the fetch stage 32 can fetch an instruction and in a second cycle the decode stage 34 can decode the instruction 53 while the fetch stage 32 can fetch instruction data 52 associated with the instruction 53. The association module 42 can associate the instruction data 52 with the instruction 53 it belongs to as these operations are performed in different clock cycles. The decoded instruction 53 and the fetched instruction data 52 can be written to a forward register 35.
The forward stage 36 can read data from a forward register 35 and from other modules to provide execution data in a register 37 for the execute stage 38. The execute stage 38 can consist of a multitude of parallel processing units which each can read instructions and data from the execute register 37. The parallel processing units can access a common register file or can access own registers which are not shown in the
In case of a jump, instructions beginning from a new position in the instruction memory may need to be loaded. The instruction at the new position, i.e., the instruction at the jump address, is called jump-target. In some embodiments of the disclosure, jumps in the instruction stream processed in the main pipeline can be detected by a jump possibility detector 41 which can receive signals from registers 35 and/or 37 and/or from the modules 36 and/or 38 and which can send a jump signal to a jump-hit fetcher 40. In the event that the instructions at the jump address are stored in cache 30, (i.e., in case of an instruction cache hit—a so-called jump-hit) a jump-hit fetcher 40 can directly extract a jump target instruction 55 from the cache system 30 and can send the jump target instruction 55 to the decode register 33 bypassing the fetch stage 32. In another embodiment, the jump-hit fetcher 40 can have an additional functionality and the jump-hit fetcher 40 can send the jump target instruction 55 to a forward register 35, and/or an execute register 37. A smaller stage also allows faster clocking. This procedure can save one cycle in case of a jump-hit as explained below in more detail. Hence, the blocks 30, 40, 33 can define a bypass path for the jump target instruction 55 in parallel to the main path of the main pipeline shown by the blocks 30, 31, 32 and 33. In case of a jump-hit, the association module 42 can associate the instruction data to the jump target instruction with a pointer or other relational method. Thus the instruction data can bypass a decode stage and a jump instruction can bypass a fetch stage providing substantial benefit to pipeline operation.
The external memory 170 can utilize bus interface modules 122 and 171 to facilitate such an instruction fetch or instruction retrieval. In one embodiment the processor core 110 can utilize four separate ports to read data from a local arbitration module 105, whereas the local arbitration module 105 can schedule and access the external memory 170 using bus interface modules 103 and 171. In one embodiment, instructions and data are read over a bus or interconnect network from the same memory 170, but this is not a limiting feature; instead, any bus/memory configuration could be utilized, such as a “Harvard” architecture for data and instruction access.
The processor core 110 could also have a periphery bus which can be used to access and control a direct memory access (DMA) controller 130 using the control interface 131, to access a fast scratch pad memory over a control interface 151, and to communicate with external modules over a general purpose input/output (GPIO) interface 160. The DMA controller 130 can access the local arbitration module 105 and read and write data to and from the external memory 170. Moreover, the processor core 110 can access a fast Core RAM 140 to allow faster access to data. The scratch pad memory 150 can be a high speed memory that can be used to store intermediate results or data which is frequently utilized. The fetch and decode method and apparatus according to the disclosure can be implemented in the processor core 110.
Further, data can be loaded from or written to data memories 8 from a register area or register set 7. Generally, data memories can provide data and can save the results of the arithmetic processing provided by the execute stage. The program flow to the parallel processing units 21-24 of the execute stage 3 can be influenced for every clock cycle with the use of at least one control unit 9. The architecture shown provides connections between the control unit 9, processing units, and all of the stages 3, 4 and 5.
The control unit 9 can be implemented as a combinational logic circuit. It can receive instructions from the fetch stage 4 or the decode stage 5 (or any other stage) for the purpose of coupling processing units for specific types of instructions or instruction words, for example for a conditional instruction. In addition, the control unit 9 can receive signals from an arbitrary number of individual or coupled parallel processing units 21-24, which can signal whether conditions are contained in the loaded instructions.
Typical instruction processing pipelines known in the art have a fetch stage 32 and a decode stage 34 as shown in
The pipeline shown in
The instruction processing pipeline can consist of several stages which can be a fetch-decode stage, a forward stage 241, an execute stage 251, a memory and register transfer stage 261, and a post-sync stage 271. It is characteristic of the disclosure that the fetch and the decode modules 231 and 233 are combined in one fetch-decode stage. The fetch-decode stage, hence, performs the functions of both the fetch stage and the decode stage. The fetch stage 231 can write the fetched instructions back to the fetch/decode register 229 and can write the immediate values to the forward register 239. The decode stage 233 can read the fetched instructions from the fetch/decode register 229 and/or from the fetch stage 231 and can write the decoded instructions to the forward register 239.
a shows processing of fetch and decode stages in a conventional or prior art pipeline. At time step 0 the first instructions 1 and the immediate values 1 which can be passed along with the instructions 1 are fetched. The instructions can have one instruction for each processing unit. The immediate values can be associated with the instructions. At time step 1 the instructions 2 and the immediate values 2 which are passed along with the instructions 2 are fetched. Moreover, the instructions 1 and the immediate values 1 are decoded. At time step 2 the instructions 3 and the immediate values 3 which are passed along with the instructions 3 are fetched. Moreover, the instructions 2 and the immediate values 2 are decoded.
b shows processing utilizing a combined fetch and decode stage according to the disclosure. At time step 0 only the instructions 1 are fetched. At time step 1 the instructions 2 and the immediate values 1 are fetched and the instructions 1 are decoded. At time step 2 the instructions 3 and the immediate values 2 are fetched and the instructions 2 are decoded. Note that no decoding is applied to immediate values as they do not require any processing in a decoding step. Conventional pipeline architectures pass immediate values through a decoding step that simply stores the immediate values in registers. It is one of the advantages of the present disclosure that the number of tasks performed by the fetch and decode stages is reduced compared to conventional pipeline designs. According to the embodiment the instruction 1 at time step 0 can be bypassed in a bypass path.
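The two schedules described above can be compared with a small Python sketch. It only tabulates which groups are fetched and decoded at each time step as stated in the text; the function names `conventional_schedule` and `combined_schedule` and the string labels are hypothetical.

```python
def conventional_schedule(steps):
    """Conventional pipeline: immediates are fetched and decoded with their instructions."""
    rows = []
    for t in range(steps):
        fetched = [f"instructions {t + 1}", f"immediate values {t + 1}"]
        # Whatever was fetched in the previous step is decoded now.
        decoded = [f"instructions {t}", f"immediate values {t}"] if t >= 1 else []
        rows.append((t, fetched, decoded))
    return rows


def combined_schedule(steps):
    """Combined fetch/decode stage: immediates are fetched one step later and never decoded."""
    rows = []
    for t in range(steps):
        fetched = [f"instructions {t + 1}"]
        if t >= 1:
            # Immediates of the group being decoded are fetched in the same step.
            fetched.append(f"immediate values {t}")
        decoded = [f"instructions {t}"] if t >= 1 else []
        rows.append((t, fetched, decoded))
    return rows


if __name__ == "__main__":
    for name, fn in (("conventional", conventional_schedule),
                     ("combined", combined_schedule)):
        print(name)
        for t, fetched, decoded in fn(3):
            print(f"  step {t}: fetch {fetched} decode {decoded}")
```

In the combined schedule the decode column never contains immediate values, which reflects the reduced number of tasks noted in the text.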
The module 241 or the decode stage 233 of
The module 251 of
In one embodiment, the module 261 of
The module 271 of
As explained above, the processor core 1 shown in
However, it is characteristic to the disclosure that the instruction words 352 are grouped to instruction word groups. The immediate words 353 are grouped as well. The immediate word groups can be located after the instruction word groups 352 within the processor instruction 350. Grouping of instruction and immediate words in this disclosure means that the words of a type are arranged one after the other in order. In embodiments of the disclosure an additional or a dedicated instruction word can store global instructions which are used to control the processor 100. In other embodiments of the disclosure certain bits of each instruction word can be used for controlling purposes or global instructions to the processor 100.
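The grouping described above can be sketched in Python as follows. This is an illustration only: the function name `pack_processor_instruction`, the textual opcodes, and the assumption that immediate words appear in the same order as the instruction words they belong to are not taken from the disclosure.

```python
from typing import List, Sequence, Tuple


def pack_processor_instruction(
    words: Sequence[Tuple[str, Sequence[int]]]
) -> Tuple[List[str], List[int]]:
    """Split (instruction word, immediates) pairs into two ordered groups.

    Returns the instruction word group followed by the immediate word group,
    mirroring the layout in which immediate words follow the instruction words.
    """
    instruction_group = [opcode for opcode, _ in words]
    # Assumption: immediates keep the order of their owning instruction words.
    immediate_group = [imm for _, imms in words for imm in imms]
    return instruction_group, immediate_group


if __name__ == "__main__":
    group, immediates = pack_processor_instruction(
        [("add", [5]), ("nop", []), ("load", [16]), ("mul", [])]
    )
    # Words of one type are stored one after the other.
    print(group + immediates)
```

Because all instruction words come first, a fetch stage can read a fixed number of instruction words before it knows how many immediate words follow.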
The processor core 110 can contain a number of so-called instruction line buffers.
As described above, each of the processing units can execute one instruction per processor cycle. In the example of
The lines 301-307 show the same instruction stream buffers which have the same processor instructions. The positions of the words are denoted by position indicators 300 for clarity. The instructions are executed from left to right through both instruction line buffers 361 and 362. According to
After a reset or a jump-miss to an address at the beginning of an instruction line the fetch window 330 can be set to the beginning of the instruction line buffer 361 and the four instructions at position 00-03 are fetched denoted by the frame 344. This situation is shown for a first processor cycle in the line 301 of
Line 302 in
Line 303 shows the actions that can be performed in a third processor cycle. The four instructions at positions 07-10 which could have been fetched in the cycle before are decoded, which is denoted by the hatched frame 342, and the fetch window 330 is shifted by seven to the position 11-18. One of the four instruction words at positions 07-10 has an immediate value. This immediate value is at position 11 and is fetched along with the four next instructions at positions 12-15 and, hence, five words of the fetch window are fetched, which is denoted by the hatched frame 344 in line 303. The decoded instruction words of positions 07-10 and the fetched immediate word of position 11 are forwarded to the next stage which can be the forward stage 241 of the pipeline shown in
Line 304 shows the actions that can be performed in a fourth processor cycle in which the decoded instruction words have no immediate values. The four instructions at positions 12-15 which could have been fetched in the cycle before are decoded, which is denoted by the hatched frame 342, and the fetch window 330 is shifted by five to the position 16-23. None of the four instruction words at positions 12-15 has an immediate value. The four next instructions at positions 16-19 are fetched and, hence, four words of the fetch window are fetched, which is denoted by the hatched frame 344 in line 304. The decoded instruction words of positions 12-15 are forwarded (denoted by the bar 340) to the next stage which can be the forward stage 241 of the pipeline shown in
In line 305 the four instructions at positions 16-19 which could have been fetched in the cycle before are decoded which is denoted by the hatched frame 342 and the fetch window 330 is shifted to the position 20-27. All four instruction words at positions 16-19 have an immediate value. These immediate values are at positions 20-23 and are fetched along with four next instructions at positions 24-27 and, hence, all eight words of the fetch window are fetched which is denoted by the hatched frame 344 in line 305. The decoded instruction words of positions 16-19 and the fetched immediate words of positions 20-23 are forwarded to the next stage.
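The movement of the fetch window through the cycles described above can be traced with a short Python sketch. The function name `walk_fetch_window` is hypothetical. The immediate counts per decoded group (3, 1, 0, 4) are an assumption: the values 1, 0 and 4 follow from the text, while the value 3 for the first group is inferred from the stated shift by seven to position 11.

```python
INSTRUCTIONS_PER_GROUP = 4  # four processing units in the described example


def walk_fetch_window(immediates_per_group):
    """Return rows of (cycle, window_start, immediate_positions, instruction_positions).

    immediates_per_group[i] is the number of immediate words that belong to
    the instruction group decoded in cycle i + 2.
    """
    # First cycle after a reset: only the first four instruction words are fetched.
    rows = [(1, 0, [], list(range(INSTRUCTIONS_PER_GROUP)))]
    window_start = INSTRUCTIONS_PER_GROUP
    for cycle, n_imm in enumerate(immediates_per_group, start=2):
        # Current immediates sit at the start of the window ...
        imm_pos = list(range(window_start, window_start + n_imm))
        # ... and are directly followed by the next four instruction words.
        instr_start = window_start + n_imm
        instr_pos = list(range(instr_start, instr_start + INSTRUCTIONS_PER_GROUP))
        rows.append((cycle, window_start, imm_pos, instr_pos))
        # Shift by the words consumed: immediates plus four instructions.
        window_start += n_imm + INSTRUCTIONS_PER_GROUP
    return rows


if __name__ == "__main__":
    for row in walk_fetch_window([3, 1, 0, 4]):
        print(row)
```

With these counts the window starts at positions 4, 11, 16 and 20 in the second to fifth cycles, matching the positions named for lines 303 to 305.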
The lines 306-307 are processed similarly to lines 302-305. The only difference is that the fetch window overlaps the second instruction line buffer 362. A logic which is not drawn has to ensure that at least that part of the second instruction line which is inside the fetch window is completely loaded. In some cases (for example, in case of an instruction cache miss) it is possible that the second instruction line buffer cannot be loaded before the fetch window runs into the buffer. In this case the processor can stall until that part of the instruction line buffer which is inside the fetch window is loaded. In other embodiments the processor can stall until the whole line buffer is filled.
As depicted above, an advantage of the present method and apparatus is that the tasks of fetching and decoding could be broken in small tasks which are fetching of instructions, fetching of immediate values, and decoding of the previously fetched instructions. Another advantage of the present method and apparatus is that in case of a jump-hit one cycle can be saved compared to conventional methods. As stated above, the scenario of
In contrast, a jump-hit means that the jump-target is available in the cache (a so-called cache-hit) and the instructions at the new instruction address can be loaded within a few processor cycles. A jump-hit normally causes the processor, in the best case, to lose a few cycles as the jump-target has to be loaded first; the disclosed arrangements reduce the time delays normally associated with a jump-hit.
The lines 312 to 317 shown in
In the embodiment shown in
The instruction and immediate word fetch is performed similar to the lines 312 and 303 of
As discussed above, jump-misses can be handled by designing a sophisticated instruction cache system. One of the advantages of the current disclosure is that in the case of a jump-hit, i.e., the jump-target is in the cache, that one cycle can be saved compared to the conventional method of processing fetching and decoding which is illustrated in
As described, one of the advantages of the present method and apparatus is that in case of a jump-hit even one processor cycle is saved compared to conventional methods and compared to a jump-miss as shown in
In the description an embodiment of the disclosure with four processing units and instruction words (one for each processing unit) has been presented. Each instruction word could have zero or one immediate word. However, note that the disclosure is not limited to any number of processing units or instruction words, and other embodiments of the disclosure can use instruction words that each can have any number of immediate words. Moreover, in the description two instruction stream buffers 371 and 372 and two instruction line buffers per instruction stream buffer are used (see
The instruction cache system mentioned above is indicated in the pipeline of
The modules 212, 213, 222, 223, and 224 denote the procedure of accessing the cache and updating pointers and instruction line buffers and are discussed in more detail in
The tag to select the appropriate cache-line register 410 can be computed by the “Get Tag” module which is broken up into two modules 212 and 222 in
For a fast forwarding to the next stage the new instruction line can be bypassed (once it is available) to the next stage using a switching logic 469. The switching logic 469 can be controlled by a logic module or a register 421 and 422 which can determine whether the cache access 213 has been a hit. The incremented jump pointers 431, the jump-target instructions 341, and the new instruction line can be forwarded to the expand top module which can be an implementation of the fetch/decode stage 231 and 233 of
In a regular program flow the current instruction words are decoded while the associated current immediate words and the next instruction words are fetched. This is shown, e.g., in the lines 302-307 of
The next instruction words 653 can be determined by the module 613 from the fetch window words 635 using the number of immediate words 633. As the current immediate words 655 of the current instruction words 753 are followed directly by the next instruction words (see also lines 301-307 of
After extraction of the immediate words and the instruction words from the fetch window, the fetch window pointers 751 have to be shifted by the actual number of current immediate words plus a constant four next instruction words, which can be performed by the module 611.
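The extraction and shift just described can be expressed as a minimal Python sketch. The function name `split_fetch_window` and its parameters are hypothetical, and the sketch models words as list elements rather than hardware signals.

```python
def split_fetch_window(window_words, num_immediates, group_size=4):
    """Split a fetch window into current immediates and next instruction words.

    Returns (immediates, next_instructions, shift), where shift is the amount
    by which the fetch window pointers advance for the next cycle.
    """
    if not 0 <= num_immediates <= len(window_words) - group_size:
        raise ValueError("fetch window too small for the requested words")
    # Current immediate words come first in the window ...
    immediates = list(window_words[:num_immediates])
    # ... and are followed directly by the next instruction words.
    next_instructions = list(
        window_words[num_immediates:num_immediates + group_size]
    )
    # Shift = actual number of immediates plus a constant four instructions.
    shift = num_immediates + group_size
    return immediates, next_instructions, shift


if __name__ == "__main__":
    window = ["imm0", "ins0", "ins1", "ins2", "ins3", "w5", "w6", "w7"]
    print(split_fetch_window(window, num_immediates=1))
```

An eight-word window is large enough for the worst case of the described example, in which all four instruction words have one immediate word each.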
The examples of
The logic 607 can raise a stall signal 657 which can signal the processor pipeline to stall if the fetch window approaches a line buffer or words of a line buffer which have not been reloaded yet. This could be the case in a cache-miss, e.g., when loading of that part of the instruction line buffer which is within the fetch window has not been finished. The signal 657 can be used within the module 500 to stall until at least those words of the instruction stream buffer which are within the fetch window are available. A validity signal 641 can tell the module 607 which words of the line buffers have already been updated.
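The stall condition can be sketched in Python as a check of per-word validity flags against the fetch window. The function name `must_stall`, the list of booleans standing in for the validity signal, and the choice to treat positions beyond the buffer as not loaded are assumptions made for illustration.

```python
def must_stall(window_start, valid, window_size=8):
    """Return True if any word inside the fetch window is not yet loaded.

    valid[i] is True once word i of the instruction stream buffer has been
    updated (a stand-in for the validity signal).
    """
    window = valid[window_start:window_start + window_size]
    # Assumption: positions beyond the end of the buffer count as not loaded.
    if len(window) < window_size:
        return True
    return not all(window)


if __name__ == "__main__":
    # First line buffer (16 words) loaded, second line buffer still pending.
    valid = [True] * 16 + [False] * 16
    print(must_stall(4, valid))    # window lies entirely in the loaded buffer
    print(must_stall(11, valid))   # window overlaps the unloaded buffer
```

This corresponds to the embodiment that stalls only until the words inside the fetch window are available; an embodiment that waits for the whole line buffer would test all flags of that buffer instead.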
The expand decoder module 700 of
In case of a jump-hit the expand decoder module 700 combined with the logic of 400 of
The module 707 can be used to choose the correct fetch window pointers 751 for the next cycle among the jump pointers 421, the incremented next fetch window pointers 651, and the incremented jump pointers 431. The incremented next fetch window pointers 651 could have been calculated by the module 600 and point to the next fetch window in a regular program flow (no jump). The jump pointers 421 denote the position of the fetch window 751 at a jump-miss (compare
The module 711 in
At decision block 1706, it can be determined whether the jump was a jump-hit or not. In case a jump-hit is detected both paths starting at blocks 1708 and 1716 can be processed in parallel. The path starting at module 1716 can be called main-path and the path starting at module 1708 can be called a bypass path. As indicated by block 1708, the jump-target instruction can be loaded from an instruction cache. As indicated by block 1710 the loaded jump-target instruction can be handled in a way to bypass the fetch stage. As indicated by block 1712 the bypassed jump-target instruction can be forwarded to a decode stage and decoded in a further step as illustrated by block 1714.
In the main path, the whole instruction line can be loaded from the instruction cache which can contain the jump-target instruction and instructions subsequent to the jump target instruction which is illustrated by block 1716. As illustrated by block 1718 the instruction subsequent to the jump-target instruction can be fetched by the fetch stage.
However, in case a jump-miss has been detected at decision block 1706, the instruction line cannot be loaded from the instruction cache. Instead, the instruction line can be loaded from an external instruction memory which is illustrated by block 1720. As illustrated by block 1722 the fetch stage (and in one embodiment all subsequent pipeline stages) can wait until at least the fetch window portion of the instruction line is loaded. In other embodiments of the disclosure not only the fetch window but even the whole instruction line can be requested to continue processing. As illustrated by block 1724, the jump target can be fetched by the fetch stage from the instruction line once at least a fetch window portion is available.
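The decision flow of blocks 1706 to 1724 can be summarized in a small Python sketch. It only lists the steps of each path in order and does not model timing; the function name `handle_jump`, the set of cached line identifiers, and the step strings are hypothetical.

```python
def handle_jump(jump_target_line, cached_lines):
    """Return the steps taken for a jump, keyed by path.

    On a jump-hit the bypass path and the main path proceed in parallel;
    on a jump-miss only the main path exists and it waits for external memory.
    """
    if jump_target_line in cached_lines:  # decision block 1706: jump-hit
        return {
            "hit": True,
            "bypass_path": [
                "load jump-target from instruction cache (1708)",
                "bypass the fetch stage (1710)",
                "forward jump-target to decode stage (1712)",
                "decode jump-target (1714)",
            ],
            "main_path": [
                "load instruction line from instruction cache (1716)",
                "fetch instruction subsequent to jump-target (1718)",
            ],
        }
    # Jump-miss: the line has to come from the external instruction memory.
    return {
        "hit": False,
        "bypass_path": [],
        "main_path": [
            "load instruction line from external memory (1720)",
            "wait until fetch window portion is loaded (1722)",
            "fetch jump-target from instruction line (1724)",
        ],
    }


if __name__ == "__main__":
    print(handle_jump(7, {3, 7})["bypass_path"])
    print(handle_jump(9, {3, 7})["main_path"])
```

The sketch makes visible that the jump-target reaches the decode stage through the bypass path without passing through the fetch stage, while the fetch stage already works on the subsequent instruction.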
Each process disclosed herein can be implemented with a software program. The software programs described herein may be operated on any type of computer, such as personal computer, server, etc. Any programs may be contained on a variety of signal-bearing media. Illustrative signal-bearing media include, but are not limited to: (i) information permanently stored on non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive); (ii) alterable information stored on writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive); and (iii) information conveyed to a computer by a communications medium, such as through a computer or telephone network, including wireless communications. The latter embodiment specifically includes information downloaded from the Internet, intranet or other networks. Such signal-bearing media, when carrying computer-readable instructions that direct the functions of the present disclosure, represent embodiments of the present disclosure.
The disclosed embodiments can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In one embodiment, the arrangements can be implemented in software, which includes but is not limited to firmware, resident software, microcode, etc. Furthermore, the disclosure can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The control module can retrieve instructions from an electronic storage medium. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk—read only memory (CD-ROM), compact disk—read/write (CD-R/W) and DVD. A data processing system suitable for storing and/or executing program code can include at least one processor, logic, or a state machine coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
It will be apparent to those skilled in the art having the benefit of this disclosure that the present disclosure contemplates methods, systems, and media that facilitate pipeline processing. It is understood that the form of the arrangements shown and described in the detailed description and the drawings are to be taken merely as examples. It is intended that the following claims be interpreted broadly to embrace all the variations of the example embodiments disclosed.