This invention relates to data processing, and more particularly to a modified-SIMD data processing architecture.
Signal and media processing (also referred to herein as “data processing”) is pervasive in today's electronic devices. This is true for cell phones, media players, personal digital assistants, gaming devices, personal computers, home gateway devices, and a host of other devices. From video, image, and audio processing to telecommunications processing, many of these devices must perform several if not all of these tasks, often at the same time.
For example, a typical “smart” cell phone may require functionality to demodulate, decrypt, and decode incoming telecommunications signals, and encode, encrypt, and modulate outgoing telecommunication signals. If the smart phone also functions as an audio/video player, the smart phone may require functionality to decode and process the audio/video data. Similarly, if the smart phone includes a camera, the device may require functionality to process and store the resulting image data. Other functionality may be required for gaming, wired or wireless network connectivity, general-purpose computing, and the like. The device may be required to perform many if not all of these tasks simultaneously.
Similarly, a “home gateway” device may provide basic services such as broadband connectivity, Internet connection sharing, and/or firewall security. The home gateway may also perform bridging/routing and protocol and address translation between external broadband networks and internal home networks. The home gateway may also provide functionality for applications such as voice and/or video over IP, audio/video streaming, audio/video recording, online gaming, wired or wireless network connectivity, home automation, VPN connectivity, security surveillance, or the like. In certain cases, home gateway devices may enable consumers to remotely access their home networks and control various devices over the Internet.
Depending on the device, many of the tasks it performs may be processing-intensive and require some specialized hardware or software. In some cases, devices may utilize a host of different components to provide some or all of these functions. For example, a device may utilize certain chips or components to perform modulation and demodulation, while utilizing other chips or components to perform video encoding and processing. Other chips or components may be required to process images generated by a camera. This may require wiring together and integrating a significant amount of hardware and software.
Currently, there is no unified architecture or platform that can efficiently perform many or all of these functions, or at least be programmed to perform many or all of these functions. Thus, what is needed is a unified platform or architecture that can efficiently perform tasks such as data modulation, demodulation, encryption, decryption, encoding, decoding, transcoding, processing, analysis, or the like, for applications such as video, audio, telecommunications, and the like. Further needed is a unified platform or architecture that can be easily programmed to perform any or all of these tasks, possibly simultaneously. Such a platform or architecture would be highly useful in home gateways or other integrated devices, such as mobile phones, PDAs, video/audio players, gaming devices, or the like.
In order that the advantages of the invention will be readily understood, a more particular description of the invention briefly described above will be rendered by reference to specific examples illustrated in the appended drawings. Understanding that these drawings depict only typical examples of the invention and are not therefore to be considered limiting of its scope, the invention will be described and explained with additional specificity and detail through use of the accompanying drawings.
The present invention provides an apparatus and method for processing data that overcome various shortcomings of the prior art. The features and advantages of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.
In a first embodiment, an apparatus for processing data includes an array of processing elements to simultaneously perform operations on multiple data elements using a single instruction. A grouping module assigns each processing element within the array to one of several groups. A modification module designates how each group of processing elements should handle the single instruction. This enables each group of processing elements to handle the single instruction differently. Each processing element is configured to handle the single instruction based on the group the processing element belongs to.
In selected embodiments, the grouping module uses a processing element (PE) map to designate which group each processing element belongs to. Similarly, in selected embodiments, the modification module uses an instruction modifier to designate how a group of processing elements should handle the single instruction. In certain embodiments, the instruction modifier designates how to modify one or more operands, such as source operands and/or destination operands, of the single instruction.
In another embodiment in accordance with the invention, a method for processing data includes simultaneously performing, with an array of processing elements, operations on multiple data elements using a single instruction. The method further includes assigning each processing element within the array to one of multiple groups and designating how each group of processing elements should handle the single instruction. This enables each group to handle the single instruction differently. The method may then include handling, with each processing element, the single instruction based on the group the processing element belongs to.
In yet another embodiment, an apparatus for processing data includes an array of processing elements to simultaneously perform operations on multiple data elements using a single instruction. A modification module designates how each processing element should handle the single instruction, thereby enabling each processing element to handle the single instruction differently.
In yet another embodiment, a method for processing data includes simultaneously performing, with an array of processing elements, operations on multiple data elements using a single instruction. The method further includes designating how each processing element should handle the single instruction, thereby enabling each processing element to handle the single instruction differently.
It will be readily understood that the components of the present invention, as generally described and illustrated in the Figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the apparatus and methods of the present invention, as represented in the Figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention.
Many of the functional units described in this specification are shown as modules (or functional blocks) in order to emphasize their implementation independence. For example, a module may be implemented as a hardware circuit comprising custom VLSI circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. A module may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like.
Modules may also be implemented in software for execution by various types of processors. An identified module of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions which may, for instance, be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may comprise disparate instructions stored in different locations which, when joined logically together, comprise the module and achieve the stated purpose of the module.
Indeed, a module of executable code could be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Similarly, operational data may be identified and illustrated herein within modules, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices, and may exist, at least partially, merely as electronic signals on a system or network.
Reference throughout this specification to “one embodiment,” “an embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, specific details may be provided, such as examples of programming, software modules, user selections, or the like, to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods or components. In other instances, well-known structures or operations are not shown or described in detail to avoid obscuring aspects of the invention.
The illustrated embodiments of the invention will be best understood by reference to the drawings, wherein like parts are designated by like numerals throughout. The following description is intended only by way of example, and simply illustrates certain selected embodiments of apparatus and methods that are consistent with the invention as claimed herein.
Referring to the accompanying drawings, in certain embodiments, the data processing architecture 100 may include one or more groups 102, each containing one or more clusters of processing elements, as will be shown and described in more detail hereafter.
The data processing architecture 100 may also be configured to perform certain tasks (e.g., demodulation, decryption, decoding) simultaneously. For example, certain groups and/or clusters within each group may be configured for demodulation while others may be configured for decryption or decoding. In other cases, different clusters may be configured to perform different steps of the same task, such as performing different steps in a pipeline for encoding or decoding video data. The data processing architecture 100 may thereby provide a unified platform for performing various tasks without the need for additional supporting hardware.
In certain embodiments, the data processing architecture 100 may include one or more processors 104, memory 106, memory controllers 108, interfaces 110, 112 (such as PCI interfaces 110 and/or USB interfaces 112), and sensor interfaces 114. A bus 116, such as a crossbar switch 116, may be used to connect the components together. A crossbar switch 116 may be useful because it provides a scalable interconnect that can mitigate possible throughput and contention issues.
In operation, data, such as video data, may be streamed through the interfaces 110, 112 into a data buffer memory 106. This data may be streamed from the data buffer memory 106 to group memories 206, described in more detail hereafter.
A host processor 104 (e.g., a MIPS processor 104) may control and manage the actions of each of the components 102, 108, 110, 112, 114 and act as a supervisor for the data processing architecture 100. A sensor interface 114 may interface with various sensors (e.g., an IRDA sensor) which may receive commands from various control devices (e.g., a remote control). The host processor 104 may receive the commands from the sensor interface 114 and take appropriate action. For example, if the data processing architecture 100 is configured to decode television channels and the host processor 104 receives a command to begin decoding a particular television channel, the processor 104 may determine the current load of each of the groups 102 and decide where to start the new process. For instance, the host processor 104 may distribute the new process over several groups 102, keep the process within a single group 102, or distribute it across all of the groups 102. In this way, the host processor 104 may perform load balancing between the groups 102 and determine where particular processes are performed within the data processing architecture 100.
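By way of illustration only, the following Python sketch expresses one possible load-balancing policy of the kind described above. The load metric, threshold, and function names are hypothetical and are not drawn from any particular embodiment.

    # Illustrative sketch only: one possible policy for assigning a new
    # process to groups of processing elements. The load metric and the
    # spread threshold are hypothetical.
    def place_process(group_loads, cost, spread_threshold=0.75):
        """Return the indices of the groups that should host a new process."""
        least = min(range(len(group_loads)), key=lambda g: group_loads[g])
        if group_loads[least] + cost <= spread_threshold:
            return [least]                       # keep the process in one group
        return list(range(len(group_loads)))     # otherwise spread across all groups

    loads = [0.30, 0.55, 0.20, 0.90]             # hypothetical utilization per group
    print(place_process(loads, cost=0.25))       # [2] -> least-loaded group
    print(place_process(loads, cost=0.70))       # [0, 1, 2, 3] -> distribute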
Referring to the next of the accompanying drawings, each group 102 may include one or more clusters 200 of processing elements. In certain embodiments, each group 102 may also include a group processor 204 to manage the operation of the clusters 200, as well as a group memory 206 to store data for processing by the clusters 200.
Referring to the next of the accompanying drawings, each cluster 200 may include an array 300 of processing elements, referred to herein as a VPU array 300, to simultaneously perform operations on multiple data elements using a single instruction. The cluster 200 may also include a VPC 302 associated with the VPU array 300, as well as an instruction memory 304 to store instructions for execution on the cluster 200.
The VPC 302 may have associated therewith a scalar ALU 306 which may perform scalar algorithm computations, perform control-related functions, and manage the operation of the VPU array 300. For example, the scalar ALU 306 may reconfigure the processing elements by modifying the groups that the processing elements belong to or designating how the processing elements should handle instructions based on the group they belong to.
The cluster 200 may also include a data memory 308 storing vectors having a defined number (e.g., sixteen) of elements. In certain embodiments, the number of elements in each vector may be equal to the number of processing elements in the VPU array 300. Similarly, in selected embodiments, each vector element may include a defined number (e.g., sixteen) of bits. The number of bits in each element may be equal to the width (e.g., sixteen bits) of the data path between the data memory 308 and each processing element. It follows that if the data path between the data memory 308 and each processing element is 16 bits wide, the data ports (i.e., the read and write ports) of the data memory 308 may be 256 bits wide (16 bits for each of the 16 processing elements). These numbers are presented only by way of example and are not intended to be limiting.
In selected embodiments, the cluster 200 may include an address generation unit 310 to generate real addresses when reading data from the data memory 308 or writing data back to the data memory 308. As will be explained in more detail hereafter, the address generation unit 310 may generate these real addresses internally, making address generation transparent to code executing on the VPU array 300.
In selected embodiments, instructions fetched from the instruction memory 304 may include a multiple-slot instruction (e.g., a three-slot instruction). For example, where a three-slot instruction is used, up to two (i.e., 0, 1, or 2) instructions may be sent to each processing element and up to one (i.e., 0 or 1) instruction may be sent to the scalar ALU 306. Instructions sent to the scalar ALU 306 may, for example, be used to change the grouping of processing elements, change how each group of processing elements should handle a particular instruction, or change the configuration of a permutation engine 318. In certain embodiments, the processing elements within the VPU array 300 may be considered parallel-semantic, variable-length VLIW (very long instruction word) processors, where the packet length is at least two instructions. Thus, in certain embodiments, the processing elements in the VPU array 300 may execute at least two instructions in parallel in a single clock cycle.
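By way of illustration only, the following Python sketch models the dispatch of a three-slot instruction packet: up to two slots are issued to the processing elements and up to one slot to the scalar ALU 306. The packet encoding and opcode names are hypothetical.

    # Illustrative sketch only: dispatching a hypothetical three-slot
    # instruction packet. A slot of None means the slot is unused.
    def dispatch(packet):
        pe_ops = [op for op in packet[:2] if op is not None]   # 0, 1, or 2 ops
        scalar_op = packet[2]                                  # 0 or 1 op
        return pe_ops, scalar_op

    pe_ops, scalar_op = dispatch(("ADD", "MUL", "SET_PE_MAP"))  # hypothetical opcodes
    print(pe_ops)      # ['ADD', 'MUL'] -> issued to every processing element
    print(scalar_op)   # 'SET_PE_MAP'  -> issued to the scalar ALU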
In certain embodiments, the cluster 200 may further include a parameter memory 314 to store parameters of various types. For example, the parameter memory 314 may store a processing element (PE) map to designate which group each processing element belongs to. The parameters may also include an instruction modifier designating how each group of processing elements should handle a particular instruction. In selected embodiments, the instruction modifier may designate how to modify at least one operand of the instruction, such as a source operand, destination operand, or the like. This concept will be explained in more detail hereafter.
In selected embodiments, the cluster 200 may be configured to execute multiple threads simultaneously in an interleaved fashion. In certain embodiments, the cluster 200 may have a certain number (e.g., two) of active threads and a certain number (e.g., two) of dormant threads resident on the cluster 200 at any given time. Once an active thread has finished executing, a cluster scheduler 316 may determine the next thread to execute. In selected embodiments, the cluster scheduler 316 may use a Petri net or other tree structure to determine the next thread to execute, and to ensure that any necessary conditions are satisfied prior to dispatching a new thread. As previously mentioned, in certain embodiments, one or more of the group processors 204 may also participate in scheduling threads for execution on the clusters 200.
Because a cluster 200 may execute and finish threads very rapidly, it is important that threads can be scheduled in an efficient manner. In certain embodiments, an interrupt may be generated each time a thread has finished executing so that a new thread may be initiated and executed. Where threads are relatively short, the interrupt rate may become so high that thread scheduling has the potential to undesirably reduce the processing efficiency of the cluster 200. Thus, apparatus and methods are needed to improve scheduling efficiency and ensure that scheduling does not create bottlenecks in the system. To address this concern, in selected embodiments, the cluster scheduler 316 may be implemented in hardware as opposed to software. This may significantly increase the speed of the cluster scheduler 316 and ensure that new threads are dispatched in an expeditious manner. Nevertheless, in certain cases, the cluster hardware scheduler 316 may be bypassed and scheduling may be managed by other components (e.g., the group processor 204).
In certain embodiments, the cluster 200 may include a permutation engine 318 to realign data that is read from or written to the data memory 308. The permutation engine 318 may be programmable to allow data to be reshuffled into a desired order before or after it is processed by the VPU array 300. In certain embodiments, the programming for the permutation engine 318 may be stored in the parameter memory 314. The permutation engine 318 may permute data having a width (e.g., 256 bits) corresponding to the width of the data path between the data memory 308 and the VPU array 300. In certain embodiments, the permutation engine 318 may be configured to permute data with a desired level of granularity. For example, the permutation engine 318 may reshuffle data on a byte-by-byte basis or other desired level of granularity.
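By way of illustration only, the following Python sketch models a byte-granular permutation of the kind the permutation engine 318 might perform. The permutation table is hypothetical; in practice it might be read from the parameter memory 314.

    # Illustrative sketch only: a byte-granular reshuffle of the kind the
    # permutation engine might perform. Output byte i is input byte table[i].
    def permute(data: bytes, table) -> bytes:
        return bytes(data[src] for src in table)

    data = bytes(range(8))              # 8-byte example; a real line might be 32 bytes
    table = [7, 6, 5, 4, 3, 2, 1, 0]    # byte reversal, one possible pattern
    print(permute(data, table).hex())   # '0706050403020100'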
Referring to the next of the accompanying drawings, in selected embodiments, the VPU array 300 may include sixteen processing elements (PE 0 through PE 15) arranged in a four-by-four array, with each processing element configured to communicate with its adjacent processing elements.
Referring to the next of the accompanying drawings, each processing element may include one or more internal registers 500 to store data being operated on. To enable communication between processing elements, pairs of adjacent processing elements may be coupled by exchange registers 502, each having a write port coupled to one processing element and a read port coupled to the other.
For example, in selected embodiments, an exchange register 502a may have a read port that is coupled to PE 0 and a write port that is coupled to PE 4, allowing data to be transferred from PE 4 to PE 0. Similarly, an exchange register 502b may have a read port that is coupled to PE 4 and a write port that is coupled to PE 0, allowing data to be transferred from PE 0 to PE 4. This enables two-way communication between adjacent processing elements PE 0 and PE 4.
Similarly, processing elements on the edge of the array 300 may be configured for “wrap-around” communication. For example, in selected embodiments, an exchange register 502c may have a write port that is coupled to PE 0 and a read port that is coupled to PE 12, allowing data to be transferred from PE 0 to PE 12. Similarly, an exchange register 502d may have a write port that is coupled to PE 12 and a read port that is coupled to PE 0, allowing data to be transferred from PE 12 to PE 0. Similarly, exchange registers 502e, 502f may enable two-way communication between processing elements PE 0 and PE 3 and exchange registers 502g, 502h may enable two-way communication between processing elements PE 0 and PE 1.
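By way of illustration only, the following Python sketch models the exchange registers as a rotation of values across the four-by-four array, using the row-major PE numbering described above (so PE 0 receives from PE 4, and PE 12 receives from PE 0 on wrap-around). The function name and values are hypothetical.

    # Illustrative sketch only: nearest-neighbor exchange on a 4x4 array
    # with wrap-around, using row-major PE numbering (PE 0..PE 15). Each
    # PE receives the value held by the PE one row below it, so PE 0
    # receives from PE 4 and PE 12 receives from PE 0 (wrap-around).
    def shift_up(values):
        return [values[(i + 4) % 16] for i in range(16)]

    values = list(range(16))   # PE i initially holds the value i
    shifted = shift_up(values)
    print(shifted[0])          # 4 -> PE 0 received PE 4's value
    print(shifted[12])         # 0 -> PE 12 received PE 0's value (wrap-around)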
In certain embodiments, the cluster 200 may be configured such that data may be loaded from data memory 308 into either the internal registers 500 or the exchange registers 502 of the VPU array 300. The cluster 200 may also be configured such that data may be loaded from the data memory 308 into the internal registers 500 and exchange registers 502 simultaneously. Similarly, the cluster 200 may also be configured such that data may be transferred from either the internal registers 500 or the exchange registers 502 to data memory 308.
Referring to the next of the accompanying drawings, in selected embodiments, a grouping module may use a processing element (PE) map 602 to designate which group each processing element in the VPU array 300 belongs to. In certain embodiments, the PE map 602 may be stored in a register 600 that may be read by each processing element in the array 300.
In selected embodiments, the modification module 614 may include an instruction modifier 604 to designate how each group should handle an instruction 606. Like the PE map 602, this instruction modifier 604 may, in certain embodiments, be stored in a register 600 that may be read by each processing element in the array 300. For example, consider a VPU array 300 where the PE map 602 designates that PE 0 through PE 7 belong to “group 0” and PE 8 through PE 15 belong to “group 1.” An instruction modifier 604 may designate that group 0 should handle an ADD instruction as an ADD instruction, while group 1 should handle the ADD instruction as a SUB instruction. This will allow each group to handle the ADD instruction differently. Although the ADD instruction is used in this example, this feature may be used for a host of different instructions.
In certain embodiments, the instruction modifier 604 may also be configured to modify a source operand 608 and/or a destination operand 610 of an instruction 606. For example, if an ADD instruction is designed to add the contents of a first source register (R1) to the contents of a second source register (R2) and to store the result in a third destination register (R3), the instruction modifier 604 may be used to modify any or all of these source and/or destination operands. For example, the instruction modifier 604 for a group may modify the above-described instruction such that a processing element uses the source operand in register R5 instead of R1 and writes the result to destination register R8 instead of R3. In this way, different processing elements may use different source and/or destination operands 608, 610 depending on the group they belong to.
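By way of illustration only, the following Python sketch models group-dependent instruction handling using the PE map 602 and instruction modifier 604 described above: PE 0 through PE 7 execute the ADD as issued, while PE 8 through PE 15 treat it as a SUB with remapped operands. The dictionary encoding is an illustrative assumption, not a description of any particular hardware format.

    # Illustrative sketch only: group-dependent handling of one instruction.
    PE_MAP = [0] * 8 + [1] * 8                       # group of each of 16 PEs

    MODIFIERS = {
        0: {},                                       # group 0: execute as issued
        1: {"opcode": {"ADD": "SUB"},                # group 1: ADD behaves as SUB
            "src": {"R1": "R5"},                     # ...reading R5 instead of R1
            "dst": {"R3": "R8"}},                    # ...writing R8 instead of R3
    }

    def effective_instruction(pe, opcode, src, dst):
        m = MODIFIERS[PE_MAP[pe]]
        return (m.get("opcode", {}).get(opcode, opcode),
                m.get("src", {}).get(src, src),
                m.get("dst", {}).get(dst, dst))

    print(effective_instruction(3, "ADD", "R1", "R3"))    # ('ADD', 'R1', 'R3')
    print(effective_instruction(12, "ADD", "R1", "R3"))   # ('SUB', 'R5', 'R8')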
Referring to the next of the accompanying drawings, in selected embodiments, the address generation unit 310 may allow code executing on the cluster 200 to access the data memory 308 by way of logical “connections” 708 rather than by specifying real addresses 706 directly.
In selected embodiments, a “connection” 708 may be identified by a connection_ID 700. Thus, whenever code attempts to read from or write to the data memory 308, the code may identify a connection_ID 700 as opposed to a real address 706. In certain embodiments, the connection_ID 700 may be composed of both a buffer_ID 702 and a port_ID 704. The buffer_ID 702 and port_ID 704 may correspond to a buffer and port, respectively. In general, the buffer may identify one or more regions in data memory 308 in which to read or write data. The port, on the other hand, may identify an access pattern for reading or writing data to the buffer. Various different types of buffers and ports will be explained in more detail hereafter.
In selected embodiments, the connection_ID 700 may be made up of a pre-defined number of bits (e.g., sixteen bits). Accordingly, the buffer_ID 702 and port_ID 704 may use some portion of the pre-defined number of bits. For example, where the connection_ID 700 is sixteen bits, the buffer_ID 702 may make up the lower seven bits of the connection_ID 700 and the port_ID 704 may make up the upper nine bits of the connection_ID 700. This allows for 2⁷ (i.e., 128) buffers and 2⁹ (i.e., 512) ports.
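By way of illustration only, the following Python sketch shows how such a sixteen-bit connection_ID 700 might be packed and unpacked under the bit layout just described. The function names and example values are hypothetical.

    # Illustrative sketch only: packing and unpacking a 16-bit connection_ID
    # with buffer_ID in the lower 7 bits and port_ID in the upper 9 bits.
    def make_connection_id(buffer_id, port_id):
        assert 0 <= buffer_id < 2**7 and 0 <= port_id < 2**9
        return (port_id << 7) | buffer_id

    def split_connection_id(conn_id):
        return conn_id & 0x7F, conn_id >> 7    # (buffer_id, port_id)

    conn = make_connection_id(buffer_id=5, port_id=300)
    print(hex(conn))                   # 0x9605
    print(split_connection_id(conn))   # (5, 300)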
Referring to the next of the accompanying drawings, in selected embodiments, the address generation unit 310 may include a buffer descriptor memory 800 and a port descriptor memory 802 to store descriptors for the buffers and ports, respectively.
In selected embodiments, the buffer descriptor memory 800 may contain a buffer descriptor table 804 containing buffer records 808. In certain embodiments, the buffer records 808 are indexed by buffer_ID 702, although other indexing methods are also possible. Along with other information, the buffer records 808 may include a type 810, which may describe the type of buffer associated with the buffer_ID 702. In selected embodiments, buffer types may include but are not limited to “point-to-point,” “broadcast,” “scatter,” and “gather” buffer types, which will be explained in more detail hereafter.
The buffer records 808 may also store attributes 812 associated with the buffers. These attributes 812 may include, among other information, the size of the buffer, a data available indicator (indicating whether data is available that may be read from the buffer), a space available indicator (indicating whether space is available in the buffer to write data), or the like. In selected embodiments, the buffer record 808 may also include a buffer base address 814. Using the buffer base address 814 and an offset 822 (as will be described in more detail hereafter), the address generation unit 310 may calculate real addresses in the data memory 308 when reading or writing thereto. The address generation unit 310 may generate the real addresses internally, eliminating the need for external code to specify real addresses for reading and writing.
Similarly, in selected embodiments, the port descriptor memory 802 may store a port descriptor table 806 containing port records 816. In certain embodiments, the port records 816 are also indexed by port_ID 704. In certain embodiments, the port records 816 may store a type 818, which may describe the type of port associated with the port_ID 704. In selected embodiments, port types may include but are not limited to “FIFO,” “matrix transform,” “nested loop,” “end point pattern” (EPP), and “non-recursive pattern” (NRP) port types, various of which will be explained in more detail hereafter.
The port records 816 may also store attributes 820 of the ports they describe. These attributes 820 may vary depending on the type of port. For example, attributes 820 for a “nested loop” port may include, among other information, the number of times the nested loops are repeated, the step size of the loops, the dimensions of the two-dimensional data structure (to support wrapping in each dimension), or the like. Similarly, for an “end point pattern” port, the attributes 820 may include, among other information, the end points to move between when scanning the vectors in a buffer, the step size between the end points, and the like. Similarly, for a “matrix transform” port, the attributes 820 may include the matrix that is used to generate real addresses, or the like. The attributes 820 may also indicate whether the port is a “read” or “write” port.
In general, the attributes 820 may include the rules or parameters required to advance the offset 822 as vectors are read from or written to the buffer. The rules may follow either a “FIFO,” “matrix transform,” “nested loop,” “end point pattern” (EPP), or “non-recursive pattern” model, as previously discussed, depending on the type 818 of port. The offset 822 may be defined as the distance from the base address 814 of the buffer where data is read from or written to memory 308 (depending on whether the port is a “read” or “write” port). The offset 822 may be updated in the port descriptor 816a when data is read from or written to the data memory 308 using the port 816a. The address generation unit 310 may advance and keep track of the offset 822 internally, making it transparent to code executed on the VPU array 300.
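By way of illustration only, the following Python sketch models this address generation: a port record holds an offset 822 that is added to the buffer base address 814 to form a real address, and the offset is then advanced according to the port's rules (a simple FIFO rule is shown). The record layout, field names, and sizes are hypothetical.

    # Illustrative sketch only: generating real addresses from buffer and
    # port descriptors. Field names and sizes are hypothetical.
    VECTOR_BYTES = 32   # one 256-bit vector, matching the example widths above

    buffers = {5: {"base": 0x1000, "size": 8 * VECTOR_BYTES}}   # keyed by buffer_ID
    ports = {300: {"buffer": 5, "offset": 0, "kind": "FIFO"}}   # keyed by port_ID

    def next_real_address(port_id):
        port = ports[port_id]
        buf = buffers[port["buffer"]]
        addr = buf["base"] + port["offset"]         # real address = base + offset
        if port["kind"] == "FIFO":                  # advance offset with wrap-around
            port["offset"] = (port["offset"] + VECTOR_BYTES) % buf["size"]
        return addr

    for _ in range(3):
        print(hex(next_real_address(300)))   # 0x1000, 0x1020, 0x1040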
Referring to the next of the accompanying drawings, different types of buffers may support different models of data transfer between writers and readers.

As illustrated in the first of these examples, a “point-to-point” buffer may transfer data from a single writer to a single reader.

As shown in the next example, a “broadcast” buffer may transfer the same data from a single writer to multiple readers.

As shown in the next example, a “scatter” buffer may distribute data from a single writer across multiple readers, with each reader receiving a portion of the data.

As shown in the last of these examples, a “gather” buffer may collect data from multiple writers for consumption by a single reader.
Referring to the next of the accompanying drawings, in selected embodiments, a buffer 900 in the data memory 308 may store data as a series of vectors 1000.
In selected applications, the buffer 900 may be used to store a multi-dimensional data structure, such as a two-dimensional data structure (e.g., two-dimensional video data). The VPU array 300 may operate on the multi-dimensional data structure. In such an embodiment, each of the vectors 1000 may represent some portion of the multi-dimensional data structure. For example, where the multi-dimensional data structure is a two-dimensional data structure, each of the vectors 1000 may represent a 4×4 block of pixels, where each element of a vector 1000 represents a pixel within the 4×4 block.
For example, in the illustrated embodiment, successive vectors 1000 in the buffer 900 may store successive 4×4 blocks of pixels of a two-dimensional image, allowing each 4×4 block to be processed by the sixteen processing elements of the VPU array 300.
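By way of illustration only, the following Python sketch shows one way a two-dimensional image might be decomposed into sixteen-element vectors, each holding one 4×4 block of pixels in row-major order. The 8×8 test image and function name are hypothetical.

    # Illustrative sketch only: storing a two-dimensional image as
    # sixteen-element vectors, each holding one 4x4 block of pixels.
    def image_to_vectors(img, width, height):
        vectors = []
        for by in range(0, height, 4):         # block row
            for bx in range(0, width, 4):      # block column
                vectors.append([img[y][x]
                                for y in range(by, by + 4)
                                for x in range(bx, bx + 4)])
        return vectors

    img = [[y * 8 + x for x in range(8)] for y in range(8)]   # 8x8 test image
    vecs = image_to_vectors(img, 8, 8)
    print(len(vecs), len(vecs[0]))   # 4 16 -> four vectors of sixteen elements
    print(vecs[0][:4])               # [0, 1, 2, 3] -> top row of the first block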
As previously mentioned, different “ports” may be used to access (i.e., read and/or write) data in a buffer 900 in different patterns. It has been found that processing video data may require the data to be accessed in different patterns. Some of these ports, more particularly the “FIFO,” “nested loop,” “matrix transform,” and “end point pattern” ports previously discussed, will be explained in more detail hereafter.
An access pattern for a FIFO port (also known as “raster scan” access) may simply include an address increment with wrap-around. For example, the offset 822 may be incremented by a fixed amount each time a vector 1000 is read from or written to the buffer 900, returning to the beginning of the buffer 900 once the end of the buffer 900 is reached.
An access pattern for a “nested loop” port may be generated by a set of nested loops, each advancing the offset 822 by a programmable step size for a programmable number of iterations. In certain embodiments, the loops do not inherit the starting points of the previous loops. However, in other embodiments, the loops may be configured to inherit the starting points of the previous loops. The parameters (i.e., the step size and number of iterations of each loop) may be varied to generate various types of access patterns. Thus, the nested-loop port may be used to generate a wide variety of access patterns from a small set of parameters.
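By way of illustration only, the following Python sketch generates offsets for a two-level nested-loop access pattern in which each loop has a programmable iteration count and step size, and the loops do not inherit starting points. The parameter values are hypothetical.

    # Illustrative sketch only: a two-level nested-loop access pattern.
    # Each loop has an iteration count and a step size; loops do not
    # inherit the starting points of previous loops.
    def nested_loop_offsets(counts, steps):
        for i in range(counts[0]):             # outer loop
            for j in range(counts[1]):         # inner loop
                yield i * steps[0] + j * steps[1]

    # Outer loop: 4 iterations stepping by 16; inner loop: 4 stepping by 1.
    print(list(nested_loop_offsets((4, 4), (16, 1))))
    # [0, 1, 2, 3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51]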
Ports having the “matrix transform” type may have a counter multiplied by a transform matrix to determine a buffer offset 822. The matrix multiplication of a FIFO pointer (or simple counter) by a transform matrix creates a new programmable access pattern. An 8-bit offset may require a 64-entry (i.e., 8×8) transform matrix, where each entry is one bit. Since the matrix elements are single bits, the multiplication reduces to an AND operation, while the addition reduces to an XOR operation, as shown in the following equation:

offset(i) = (M(i,0) AND c(0)) XOR (M(i,1) AND c(1)) XOR . . . XOR (M(i,7) AND c(7)),

where c(j) is the j-th bit of the counter and M(i,j) is the entry in row i and column j of the transform matrix. The port descriptor may contain both the offset information as well as the transform matrix.
The matrix transform may be used to support recursive access patterns, such as a U-order scan.
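By way of illustration only, the following Python sketch shows the AND/XOR form of the matrix transform applied to an 8-bit counter. The example matrix, which merely swaps the two least significant counter bits, is hypothetical; a different single-bit matrix would yield a different access pattern.

    # Illustrative sketch only: the AND/XOR matrix transform on an 8-bit
    # counter. offset bit i = XOR over j of (M[i][j] AND counter bit j).
    def transform(matrix, counter):
        offset = 0
        for i in range(8):
            bit = 0
            for j in range(8):
                bit ^= matrix[i][j] & ((counter >> j) & 1)
            offset |= bit << i
        return offset

    # Hypothetical matrix: identity with the two least significant bits swapped.
    m = [[1 if i == j else 0 for j in range(8)] for i in range(8)]
    m[0][0], m[0][1], m[1][0], m[1][1] = 0, 1, 1, 0
    print([transform(m, c) for c in range(4)])   # [0, 2, 1, 3]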
Ports having the “end point pattern” type may be used to support non-recursive access patterns, such as a wiper scan.
For example, in a wiper scan access pattern, the offset 822 may be swept back and forth between a pair of end points in the buffer 900, moving by a programmable step size between the end points. As previously discussed, the end points and step size may be stored as attributes 820 of the port.
The “end point pattern” port type is useful for generating many access patterns that may be difficult or impossible to generate using other port types. This port type may also be useful in many mathematical operations, particularly fast search algorithms used to improve encoding efficiency.
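By way of illustration only, the following Python sketch shows one plausible reading of a wiper scan under the end point pattern model: the offset sweeps forward from one end point to the other, then back, with a programmable step size. This interpretation and all parameter values are assumptions for illustration only.

    # Illustrative sketch only: one plausible wiper-scan pattern. The
    # offset sweeps between two end points, alternating direction.
    def wiper_scan(low, high, step, sweeps):
        offsets = []
        for s in range(sweeps):
            if s % 2 == 0:
                offsets.extend(range(low, high + 1, step))    # forward sweep
            else:
                offsets.extend(range(high, low - 1, -step))   # backward sweep
        return offsets

    print(wiper_scan(low=0, high=6, step=2, sweeps=2))   # [0, 2, 4, 6, 6, 4, 2, 0]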
A “non-recursive pattern” (NRP) port may be used to support non-recursive access patterns that are not achievable or supported using the matrix transform port or other types of ports. In general, the non-recursive pattern port may be similar to the “nested loop” port except that it may use consecutive loops (i.e., sequential loops) instead of nested loops to generate addresses.
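By way of illustration only, the following Python sketch contrasts the non-recursive pattern port with the nested-loop port by running its loops consecutively, each loop resuming from the current offset. The loop parameters are hypothetical.

    # Illustrative sketch only: consecutive (sequential) loops rather than
    # nested loops; each loop resumes from the current offset.
    def consecutive_loop_offsets(loops, start=0):
        offset = start
        for count, step in loops:      # each loop is a (count, step) pair
            for _ in range(count):
                yield offset
                offset += step

    # Three consecutive loops: coarse forward, fine forward, then back.
    print(list(consecutive_loop_offsets([(3, 8), (4, 1), (2, -4)])))
    # [0, 8, 16, 24, 25, 26, 27, 28, 24]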
The invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described examples are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.