The present disclosure is in the field of data exchange and electronics technology, and in particular relates to a multifunctional data reorganization network.
Current mainstream parallel computing architectures, such as CPUs and GPUs, provide parallel computing capability in the form of large-scale arrays of computational units. In such architectures, the array of computational units can only perform regular batch data processing, and the transmission path of the data stream between main memory and the computational units is relatively fixed, so these architectures can support only a limited number of communication and computation patterns. However, in many scientific computing and engineering applications today, the data being processed is often not a simple data structure, but is organized and stored in the form of matrices, tensors, or even graphs. Such computations, known as non-regular computations, are one of the major challenges faced by current computer technology. In non-regular computational problems, the operation on each data item is often no longer a simple numerical operation; instead, different operations must be completed depending on the item's attributes within the data structure, such as its location or its size relative to other data, e.g., dynamically reorganizing the permutation order of a set of operands and mapping them onto different computational units. Such non-regular computation requires the architecture to flexibly reassemble and adjust data to adapt to dynamically variable computing patterns. Existing parallel computing architectures lack flexible and efficient data reorganization capabilities, with both data transmission and computation patterns being relatively fixed; they therefore suffer from inefficient data transmission and computation when dealing with such non-regular computational problems, resulting in severe performance bottlenecks.
In view of this, the present disclosure provides a multifunctional data reorganization network including a binary switching unit and a recursive shuffle network, both of which can enable bidirectional transmission of data, and the data reorganization network completes data reorganization by controlling the transmission direction of a signal in the network.
By the above solution, a multifunctional data reorganization network is implemented based on binary non-blocking switching network technology. By employing multifunctional binary switching units and data stream control, reorganization can be realized during data transmission. This approach is important for resolving the major performance bottleneck that non-regular computing poses for modern computer technology.
The present disclosure is described in further detail below with reference to the accompanying drawings.
In one embodiment, a multifunctional data reorganization network is disclosed, the network includes a binary switching unit and a recursive shuffle network (RSN), wherein both the binary switching unit and recursive shuffle network can enable bidirectional transmission of data, and the data reorganization network completes data reorganization by controlling the transmission direction of a signal in the network.
In terms of this embodiment, referring to
In another embodiment, the binary switching unit includes a basic switching unit and a reduction switching unit.
In terms of this embodiment, the SOM network is constructed based on binary non-blocking switching network technology, whose basic functional modules are multifunctional binary switching units. Each switching unit has two input ports and two output ports. The switching unit may route the two input signals onto the two output ports in different ways. As shown in
In another embodiment, the input signal of the binary switching unit includes a tag and a data payload, wherein the data payload is the data content actually needed to be transferred, and the tag is the corresponding routing information.
In another embodiment, the binary switching unit has two modes of operation: a self-routing mode, and a mode in which routing follows externally input routing information. In the self-routing mode, the routing method is determined according to the value of the tag or the data payload of the input signal.
In terms of this embodiment,
The specific meaning of the individual signals of the binary switching unit is listed in Table 1. Both the input signal and the output signal contain two parts: the tag and the data payload. The bit width of both can be adjusted according to the actual application scenario. The bit width of the tag may generally be set to log2(k)+1, where k is the number of input signals of the whole network. The bit width of the data payload may be adjusted depending on the type of data being transferred; typical data types include an 8-bit fixed-point number, a 32-bit fixed-point number, a 32-bit single-precision floating-point number, and the like. When the CW_en signal is 1, the binary switching unit selects the routing method using the signal on the CW_in input. When the CW_en signal is 0, the binary switching unit operates in the self-routing mode, performs routing calculations according to the configuration of the Mod signal, and selects a routing method according to the calculation results. For example, when the Mod signal is set to 010, the binary switching unit compares the tag values of the two input signals and sends the input signal with the larger tag value to the first output port and the input signal with the smaller tag value to the second output port. Thus, if the tag value of the first input port is greater than the tag value of the second input port, the binary switching unit selects a "pass-through" route; conversely, a "cross-over" route is selected.
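The self-routing behavior described above can be modeled in software. The sketch below is a behavioral model only, not the circuit: the Python names for the Table 1 signals are illustrative, and it assumes, for illustration, that Mod = 010 selects tag comparison and that CW_in = 1 selects the cross-over route.

```python
# Behavioral sketch of one binary switching unit (assumed signal
# encodings; not taken verbatim from the disclosure).

def switch_unit(in0, in1, cw_en=0, cw_in=0, mod="010"):
    """Route two (tag, payload) signals to two output ports.

    Returns (out0, out1). A "pass-through" route keeps in0 -> out0;
    a "cross-over" route swaps the two inputs.
    """
    if cw_en == 1:
        # Externally controlled routing: CW_in selects the route directly
        # (assumption: 1 means cross-over).
        cross = bool(cw_in)
    elif mod == "010":
        # Self-routing: the signal with the larger tag goes to the first port.
        cross = in1[0] > in0[0]
    else:
        cross = False  # other Mod encodings are omitted in this sketch
    return (in1, in0) if cross else (in0, in1)

a = (5, "A")  # (tag, payload)
b = (9, "B")
print(switch_unit(a, b))  # tag 9 is larger, so the inputs cross over
```

With Mod = 010 and CW_en = 0, the pair is effectively a compare-and-exchange element, which is what lets a network of such units sort by tag.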
In another embodiment, the recursive shuffle network RSN is obtained by successively superimposing smaller-scale bidirectional perfect shuffle networks.
In terms of this embodiment, a plurality of binary switching units constitute a Recursive Shuffle Network (RSN) by way of a hierarchical recursive topology. The topology of the RSN is recursive, with its basic topology in the form of “Perfect Shuffle”. As shown in
An RSN of k=2n scale may be constructed by cascading a bidirectional perfect shuffle network of k=2n scale and two parallel RSNs of k=n scale as shown in
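The recursive construction above can be traced in software. The sketch below assumes the usual definition of a perfect shuffle (the two halves of the input vector are interleaved) and follows the text's recursion: one shuffle stage of scale 2n feeding two parallel sub-networks of scale n. The function names are illustrative, and all switches are held in pass-through, so only the wiring is modeled.

```python
# Topology sketch of the RSN recursion (wiring only, switches pass-through).

def perfect_shuffle(xs):
    """Interleave the two halves: [x0..x(n-1), xn..x(2n-1)] -> [x0, xn, x1, xn+1, ...]."""
    n = len(xs) // 2
    out = []
    for a, b in zip(xs[:n], xs[n:]):
        out.extend([a, b])
    return out

def rsn_wiring(xs):
    """Trace signals through an RSN built as: shuffle stage, then two
    parallel half-scale RSNs on the upper and lower halves."""
    if len(xs) <= 2:
        return list(xs)
    shuffled = perfect_shuffle(xs)
    half = len(xs) // 2
    return rsn_wiring(shuffled[:half]) + rsn_wiring(shuffled[half:])

print(perfect_shuffle([0, 1, 2, 3, 4, 5, 6, 7]))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

Because every stage is a permutation of the wires, the recursion only rearranges signals; the data-reorganization functions come from how the switches along these wires are configured.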
In another embodiment, a SOM transport network is built up recursively based on RSNs, with all RSN networks in each recursion level being treated as a whole functional Block.
With this embodiment, the SOM transport network may be built up recursively based on RSNs. As shown in
In this solution, all the RSN networks in each recursive level are treated as a whole functional block, called Block. As shown in
The internal structure of each Block is shown in
It is further noted that the SOM transport network does not involve a reduction function, so the binary switching units used in the base network are all basic switching units.
In another embodiment, the SOM transport network or the SOM reduction network respectively provides independent configuration signals for each Block.
For this embodiment, a SOM transport network of scale k=2^r contains r Blocks (Block 0, Block 1, . . . , Block r−1). Any Block-i contains 2^(r−i−1) parallel RSN networks of scale 2^(i+1), independent of each other; each RSN network contains i+1 stages of switching units (S0, S1, . . . , Si), with each stage containing 2^i switching units, for a total of (i+1)·2^i switching units per RSN. The SOM network provides independent configuration signals for each Block separately. For Block-i, the required control signals are shown in Table 2.
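The structural parameters above are easy to check arithmetically. The helper below (an illustrative name, not from the disclosure) computes them for Block-i of a 2^r-input network; note that every Block carries all k signals, since 2^(r−i−1) RSNs of scale 2^(i+1) cover 2^r ports.

```python
# Arithmetic check of the Block structure of a 2**r-input SOM transport network.

def block_params(r, i):
    """For Block-i, return (parallel RSNs, RSN scale, stages per RSN,
    switching units per RSN)."""
    n_rsns = 2 ** (r - i - 1)        # parallel, mutually independent RSNs
    rsn_scale = 2 ** (i + 1)         # inputs per RSN
    stages = i + 1                   # switching-unit stages S0..Si
    units_per_rsn = stages * 2 ** i  # (i+1) * 2**i switching units
    return n_rsns, rsn_scale, stages, units_per_rsn

# An 8-input network (r = 3) has Blocks 0..2:
for i in range(3):
    print(i, block_params(3, i))
```

For r = 3 this gives four 2-scale RSNs in Block 0, two 4-scale RSNs in Block 1, and one 8-scale RSN (3 stages, 12 units) in Block 2.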
In another embodiment, the SOM transport network or the SOM reduction network further configures the data stream direction between Blocks by configuring a selector at the input port of each Block.
With this embodiment, in addition to the configuration signal of each Block, the SOM network also needs to configure the data stream direction between Blocks. As can be seen from
In another embodiment, a specific implementation of an 8-input SOM transport network is shown.
By flexibly configuring the direction of transmission of each Block, as well as the direction of data stream between Blocks, a number of different data reorganization functions can be implemented on the input signal.
In another embodiment, numerical sorting is shown.
In another embodiment, numerical resorting is shown.
In another embodiment, numerical multicasting is shown.
In another embodiment, compression and decompression of non-zero numerical values is shown.
In an embodiment of non-zero numerical compression, the upper bit of the tag of each non-zero input data item is set to 0 and the lower bits are set to the position of that element in the vector. The upper tag bits of the remaining input data are set to 1. The SOM network sorts the tags in ascending order, thereby rearranging the non-zero elements to the front of the output ports while keeping the relative order of the non-zero elements unchanged. The data stream goes through the SOM network in the order Block 0-1-2, and the propagation direction of each Block is forward. All binary switching units of the SOM network are set to the self-routing mode of operation: the tags of the input signals are compared, and routing is based on the comparison results. The externally input routing signal CW_in is not used.
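The tag construction and its effect can be modeled behaviorally: a stable ascending sort on the tags stands in for the network, since the tag's upper bit separates non-zero from zero elements and the lower position bits preserve relative order. The function name is illustrative.

```python
# Behavioral model of non-zero compression: tags = (is_zero << width) | position,
# then an ascending sort on the tag moves non-zero elements to the front.

def compress_nonzero(vec):
    width = max(1, (len(vec) - 1).bit_length())  # lower bits hold the position
    tagged = []
    for pos, val in enumerate(vec):
        upper = 0 if val != 0 else 1             # upper tag bit: 0 for non-zero
        tagged.append(((upper << width) | pos, val))
    tagged.sort(key=lambda tv: tv[0])            # the network's ascending tag sort
    return [val for _, val in tagged]

print(compress_nonzero([0, 3, 0, 7, 5, 0, 0, 2]))  # [3, 7, 5, 2, 0, 0, 0, 0]
```

Because the position bits make every tag distinct, the sort is automatically order-preserving within the non-zero group, matching the text's claim that relative order is unchanged.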
In embodiments of non-zero numerical value decompression, the tag of each non-zero numerical value is set to its position in the original vector.
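Decompression is the inverse operation: with each non-zero value tagged by its original position, routing by tag scatters the values back into place. The behavioral sketch below models this directly (names and the explicit `positions` argument are illustrative; in the network the positions travel in the tags).

```python
# Behavioral model of non-zero decompression: each value returns to the
# position recorded in its tag; untouched ports output zero.

def decompress_nonzero(nonzeros, positions, length):
    out = [0] * length
    for val, pos in zip(nonzeros, positions):
        out[pos] = val
    return out

print(decompress_nonzero([3, 7, 5, 2], [1, 3, 4, 7], 8))  # [0, 3, 0, 7, 5, 0, 0, 2]
```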
In another embodiment, post-multicast packet resorting is shown.
In another embodiment, synchronization of multiple SOM networks is shown.
By sharing the routing signals, multiple SOM networks may simultaneously complete the same data reorganization.
In such a synchronization relationship, the SOM network that provides the routing information is referred to as the Actor Network, and the SOM network that receives the routing information is referred to as the Tracker Network. Since transmitting the Actor Network's routing signals to the Tracker Network requires a one-clock-cycle delay, the Actor Network's data stream is always one clock cycle ahead of the Tracker Network's data stream. It is further noted that one Actor Network may correspond to multiple Tracker Networks simultaneously. An example of a matrix change is given in
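The Actor/Tracker relationship can be viewed as permutation sharing: the Actor's routing decisions define a permutation, and each Tracker applies the identical permutation to its own payloads. The sketch below models the Actor's routing as an argsort of its tags and abstracts away the one-cycle delay; function names are illustrative.

```python
# Permutation-sharing model of Actor/Tracker synchronization.

def actor_routing(tags):
    """The Actor sorts by tag; the recorded routing is the resulting permutation."""
    return sorted(range(len(tags)), key=lambda i: tags[i])

def apply_routing(routing, payloads):
    """A Tracker replays the Actor's routing on its own payloads."""
    return [payloads[i] for i in routing]

tags = [2, 0, 3, 1]
route = actor_routing(tags)
print(apply_routing(route, tags))          # the Actor's stream comes out sorted
print(apply_routing(route, list("abcd")))  # a Tracker's stream is permuted identically
```

Since the routing is just a list of decisions, one Actor can drive any number of Trackers, consistent with the one-to-many relationship noted above.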
In another embodiment, on the basis of the SOM transport network, for each Block, the SOM reduction network may be constructed by replacing the first switching unit of the first stage of each RSN network contained therein with a reduction switching unit.
With respect to this embodiment, based on the SOM transport network, a SOM reduction network may be further constructed. It is constructed by replacing, for each Block on the basis of the SOM transport network, the first switching unit of the first stage of each RSN network contained therein with a reduction switching unit. This is shown in
In another embodiment, the SOM reduction network is a SOM network with an input size of k having k−1 adders embedded therein, and the position of each adder is shown in
For this embodiment, as shown in
The SOM reduction network may enable packet reduction at different scales by adjusting the data stream order between Blocks. Specifically, for a randomly distributed set of input signals, the SOM network may gather, group by group, elements that belong to different groups and are distributed at random locations, then perform the reduction calculation for each group separately, and output each group's result at a specified location. In this "aggregation-reduction" computing mode, the "aggregation" function is achieved by reverse use of the RSN, and the "reduction" function is achieved by forward use of the adder trees embedded in the SOM network. By adjusting the RSN scales used for "aggregation" and "reduction", group reduction at different scales can be achieved.
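The net effect of the "aggregation-reduction" mode can be modeled end to end: inputs carry a group identifier (playing the role of the tag), elements of each group are gathered from arbitrary ports, summed, and each group's sum is emitted at a fixed slot. This models behavior only, not the Block-level data flow; the function name is illustrative.

```python
# Behavioral model of group reduction: gather by group id, sum each group.

def group_reduce(group_ids, values, num_groups):
    sums = [0] * num_groups
    for g, v in zip(group_ids, values):
        sums[g] += v            # the "reduction" done by the embedded adders
    return sums                 # one output slot per group id

# 8 randomly interleaved inputs forming two groups of four:
print(group_reduce([0, 1, 1, 0, 0, 1, 0, 1], [1, 2, 3, 4, 5, 6, 7, 8], 2))
# group 0: 1 + 4 + 5 + 7 = 17, group 1: 2 + 3 + 6 + 8 = 19
```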
In another embodiment, a concrete implementation of an 8-input SOM reduction network is shown.
In another embodiment, a group reduction of scale 4 is shown.
In another embodiment, a group reduction of scale 2 is shown.
In another embodiment, the solution is highly scalable, mainly in the following respects:
(1) The SOM network has a recursive setup, and its scale can be arbitrarily extended to size 2^r.
(2) The configuration signals and routing methods of each binary switching unit in a SOM network may be extended to enable more complex routing methods. For example, the bit width of the tag can be expanded or compressed as required. It is also possible to have each switching unit decide its routing method based on the value of a designated bit in the input signal's tag, so that each input data item can store its complete routing path in the tag bits, without requiring each binary switching unit to route based on a comparator result.
(3) The data stream of the SOM network may be made even more flexible. By implementing a flexible data path configuration between the stages of switching units, similar to that between Blocks, the data stream would no longer pass through a Block in only a fixed "forward" or "backward" direction; rather, the order in which it passes through the binary switching unit stages within a Block could be selected more flexibly. With this extension, more topology types can be implemented, thereby supporting more complex data rearrangement functions.
(4) The reduction function of the SOM network may be further extended. The reduction operation of the reduction switching unit can be extended from addition to other operations, such as multiplication, shift, max/min, or logical operations such as AND, OR, and NOT.
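Point (4) can be modeled by making the combining operation of the reduction tree a parameter. The sketch below uses an illustrative name; a k-input binary tree uses k−1 combining units, matching the k−1 adders of the SOM reduction network.

```python
# Configurable reduction tree: the combining operator is a parameter.

def tree_reduce(op, xs):
    """Reduce a power-of-two-length list with a binary tree of `op` units
    (k inputs use k-1 units in total)."""
    assert len(xs) & (len(xs) - 1) == 0, "network scales are powers of two"
    while len(xs) > 1:
        xs = [op(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
    return xs[0]

print(tree_reduce(lambda a, b: a + b, [1, 2, 3, 4, 5, 6, 7, 8]))  # 36
print(tree_reduce(max, [3, 9, 2, 7]))                             # 9
```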
In another embodiment, the present solution has broad applications, mainly in the following respects:
(1) The SOM network can be used to transfer data between a Cache and registers in a general SIMD computing architecture. During transfer from the Cache to the registers, the data reorganization function provided by the SOM network lets the data better match the format required by SIMD instructions, reducing the number of SIMD instructions needed for computation. Moreover, when computation results are written back from the registers to the Cache, the data reorganization and flexible reduction functions provided by the SOM network allow flexible post-processing, such as group summation or non-zero element compression of the computation results. Because this post-processing is completed during data transfer, the number of SIMD instructions can be further reduced.
(2) The SOM network may be used in the data pre- and post-processing modules of a Domain-Specific Architecture (DSA). Depending on the data stream needed for the specialized computation, a SOM network may be specially tailored to remove unneeded functions, simplifying its circuit complexity while better matching certain types of specialized data structures.
(3) The SOM network may be used for access pre- and post-processing of bulk data storage media such as DDR. Owing to the high scalability of the SOM network, its scale can be extended to handle data transfer and processing in large blocks. For example, the dynamic compression and decompression functions of the SOM network may effectively reduce storage access bandwidth and improve access efficiency.
Although embodiments of the present disclosure have been described above with reference to the accompanying drawings, the disclosure is not limited to the specific embodiments and fields of application described above, which are merely illustrative and instructive rather than restrictive. Those of ordinary skill in the art, in light of the present description and without departing from the scope of the present disclosure as claimed, may make various modifications and variations, all of which fall within the scope of the present disclosure.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202011585089.4 | Dec 2020 | CN | national |
This application is a bypass continuation of PCT application no. PCT/CN2021/073039. This application claims priority from PCT Application PCT/CN2021/073039, filed Jan. 21, 2021, and from Chinese patent application 202011585089.4, filed Dec. 28, 2020, the contents of which are incorporated herein by reference in their entirety.
| Number | Date | Country | |
|---|---|---|---|
| Parent | PCT/CN2021/073039 | Jan 2021 | US |
| Child | 17685167 | US |