Embodiments of the present invention relate to the field of matrix multiplication technologies, and more specifically, to a system and a method for matrix multiplication.
Dedicated matrix multiplication acceleration is widely used in high performance computing (HPC) and artificial intelligence (AI), but these two areas have different requirements for size types: integer32 (int32)/integer64 (int64) is mainly used for HPC, while int8/int16 is commonly used for AI. For hardware manufacturers, this leaves two strategic directions: creating separate accelerators for AI and HPC, or minimizing development costs by creating accelerators suitable for both areas. A common way to support different size types in hardware is to implement a separate multiply accumulator (MAC) unit for each integer length (as in AMD® CDNA™ Matrix Core Technology). However, because part of the hardware is unused in any given mode and power consumption is higher, this usually makes the final product less competitive than a matrix multiplication accelerator dedicated to a particular area.
Embodiments of the present application provide a system and a method for matrix multiplication, which provide a reconfigurable multi-size matrix multiplication support.
According to a first aspect, provided is a system for matrix multiplication, including a forward conversion unit, a modular arithmetic unit and a reverse conversion unit,
As an example, the size type includes one of the following types: a 64-bit type, a 32-bit type or a 16-bit type.
In addition, as an example, the reverse conversion unit is configured to perform a reverse conversion on the operation results according to the control signal and the Chinese Remainder Theorem (CRT) to obtain the binary output values. Optionally, the CRT can be replaced with the extended CRT (EXCRT) or Mixed-Radix Conversion (MRC).
According to the system proposed by the present application, a residue number system (RNS) is used to enable multi-size support in matrix multiplication hardware acceleration. The forward conversion unit and the reverse conversion unit enable the multi-size support under a dedicated control signal. An extension of the RNS forward conversion unit allows a 64-bit input value to be interpreted differently depending on the control signal value, so that modulo reduction operations are performed for a single 64-bit value, two 32-bit values, or four 16-bit values. An extension of the RNS reverse conversion unit, based on the Chinese Remainder Theorem (CRT) (merely an example), allows values of different size types to be reconstructed using efficient hierarchical controlled accumulation. As a result, the proposed system provides multi-size support for matrix multiplication without increasing hardware area and with lower power consumption.
In some implementation manners, the forward conversion unit includes M multiplexers which correspond to M modulo reduction unit groups, where the system supports value processing of multiple size types, the multiple size types include a first size type and a second size type, and the first size type corresponds to a maximum size that is supported by the system;
In some implementation manners, each of the M modulo reduction unit groups comprises P modulo reduction units, where P is an integer greater than or equal to 2.
As an example, M=4 and P=4, that is, the forward conversion unit includes four modulo reduction unit groups corresponding to the four multiplexers, and each of the four modulo reduction unit groups includes four modulo reduction units.
In addition, M·P is the number of moduli contained in a modulo set corresponding to the RNS of the proposed application.
In some implementation manners, where the forward conversion unit includes four multiplexers and four modulo reduction unit groups, the four multiplexers are in one-to-one correspondence with the four modulo reduction unit groups; each of the four modulo reduction unit groups includes four modulo reduction units, and the four modulo reduction unit groups comprise sixteen modulo reduction units in all, where the sixteen modulo reduction units are in one-to-one correspondence with sixteen moduli of the RNS;
According to this embodiment, an extension of the RNS forward conversion unit allows each 64-bit input value to be interpreted as a single 64-bit value based on the control signal value, and the forward conversion unit performs a forward conversion on the input 64-bit value to obtain residues of the input 64-bit value.
It should be noted that sixteen is used as an example in this embodiment because it is easily divided by four; the total number of moduli used can actually be any number other than sixteen. Furthermore, there is no strict requirement that different groups have the same number of moduli; for example, a group may contain three, four, five or another number of moduli. The only requirement is that the product of the moduli be higher than the binary range to be covered.
In some implementation manners, where the forward conversion unit includes four multiplexers and four modulo reduction unit groups, the four multiplexers are in one-to-one correspondence with the four modulo reduction unit groups; each of the four modulo reduction unit groups includes the four modulo reduction units, and the four modulo reduction unit groups comprise the sixteen modulo reduction units in all, where the sixteen modulo reduction units are in one-to-one correspondence with sixteen moduli of the RNS;
According to this embodiment, the extension of the RNS forward conversion unit allows each 64-bit input value to be interpreted as two 32-bit values based on the control signal value, and the forward conversion unit performs the forward conversion on the two 32-bit values, respectively, to obtain the residues of the two 32-bit values.
In some implementation manners, where the forward conversion unit includes four multiplexers and four modulo reduction unit groups, the four multiplexers are in one-to-one correspondence with the four modulo reduction unit groups; each of the four modulo reduction unit groups includes the four modulo reduction units, and the four modulo reduction unit groups comprise the sixteen modulo reduction units in all, where the sixteen modulo reduction units are in one-to-one correspondence with sixteen moduli of the RNS;
According to this embodiment, the extension of the RNS forward conversion unit allows each 64-bit input value to be interpreted as four 16-bit values based on the control signal value, and the forward conversion unit performs the forward conversion on the four 16-bit values, respectively, to obtain the residues of the four 16-bit values.
In some implementation manners, where the modular arithmetic unit includes sixteen multiply accumulator (MAC) units, the sixteen MAC units are in one-to-one correspondence with the sixteen moduli, and the sixteen MAC units are in one-to-one correspondence with the sixteen modulo reduction units;
In some implementation manners, where the reverse conversion unit supports reverse conversion of multiple size types, and the multiple size types further include a third size type, and the third size type corresponds to the minimum size supported by the system;
As an example, M=4, P=4, and V=4, the first multiplexer set includes 16 multiplexers, and the second multiplexer set includes 4 multiplexers.
In some implementation manners, where the first multiplexer set includes sixteen multiplexers, and the sixteen multiplexers are in one-to-one correspondence with the sixteen MAC units, where the sixteen multiplexers are four multiplexer groups, and each of the four multiplexer groups includes the four multiplexers;
In some implementation manners, where the pre-computed three constant values of a i-th multiplexer of the sixteen multiplexers contained in the first multiplexer set comprise: Mm
In some implementation manners, a set of the sixteen moduli of the RNS is {m1, m2, . . . , m16}, where the sixteen moduli meet the following criteria:
In some implementation manners, where the four modulo reduction unit groups are in one-to-one correspondence with four modulo groups consisting of the sixteen moduli, and the four modulo groups are: {m1, m5, m9, m13}, {m2, m6, m10, m14}, {m3, m7, m11, m15}, and {m4, m8, m12, m16}, where {m1, m2, . . . , m16}={512, 511, 509, 489, 487, 485, 481, 479, 473, 467, 431, 167, 151, 139, 137, 131}.
It should be noted that the values of the sixteen moduli are given merely as an example, and the criteria that the sixteen moduli must meet have been listed above.
In some implementation manners, where the four multiplexers contained in the second multiplexer set are specifically configured as follows:
In some implementation manners, where the four multiplexers contained in the second multiplexer set are specifically configured as follows:
In some implementation manners, where the four multiplexers comprised in the second multiplexer set are specifically configured to:
According to a second aspect, provided is a method for matrix multiplication, including:
In some implementation manners, where the obtaining the element values according to the input values and the control signal further includes:
In some implementation manners, where a set of sixteen moduli of the RNS is {m1, m2, . . . , m16}, and the sixteen moduli meet the following criteria:
According to a third aspect, a computing device is provided. The computing device has a function of implementing the method in the second aspect and any possible implementation manner of the second aspect. The function may be implemented by using hardware or software. The hardware or the software includes one or more units corresponding to the foregoing function. Optionally, the computing device is a chip or a chip system.
According to a fourth aspect, a chip is provided. The chip includes a plurality of circuits, where the plurality of circuits are configured to have functions of units included in a system according to the first aspect and any possible implementation manners of the first aspect.
One or more embodiments are exemplarily described by corresponding accompanying drawings, and these exemplary illustrations and accompanying drawings constitute no limitation on the embodiments. Elements with the same reference numerals in the accompanying drawings are illustrated as similar elements, and the drawings are not limiting to scale, in which:
In order to understand features and technical contents of embodiments of the present disclosure in detail, implementations of the embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, and the attached drawings are only for reference and illustration purposes, and are not intended to limit the embodiments of the present disclosure. In the following technical descriptions, for ease of explanation, numerous details are set forth to provide a thorough understanding of the disclosed embodiments. One or more embodiments, however, may be practiced without these details. In other cases, well-known structures and apparatuses may be shown simplified in order to simplify the drawings.
An RNS is defined by a set of k integers {m1, m2, . . . , mk} called moduli. The k moduli are generally supposed to be pairwise coprime, i.e., GCD(mi, mj)=1, i, j∈[1, k], i≠j. Let M be the product of all mi: M=m1×m2× . . . ×mk; then an integer x in the range [0, M−1] is represented in the RNS by the set of its residues x→{x1, x2, . . . , xk} under Euclidean division by the moduli; that is, xi=|x|mi=x mod mi, for i∈[1, k].
One of the most useful features of the RNS is that additions and multiplications on pairs of corresponding residues are independent and can be performed in parallel. In the present application, it is proposed to use the RNS to enable hardware-efficient reconfigurable support for matrix multiplication with integrated support for different integer types, including int64, int32 and int16.
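As a purely illustrative sketch of this property (not the claimed hardware), the following Python fragment converts two integers into residues and then adds and multiplies them channel by channel. The four small moduli and the function names to_rns, rns_add and rns_mul are assumptions chosen only for readability.

```python
from math import gcd, prod

# Small illustrative moduli (hypothetical, chosen only for readability).
MODULI = (7, 11, 13, 16)

def to_rns(x, moduli=MODULI):
    """Represent x by its residues x_i = x mod m_i."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    """Add residue-wise; no carry crosses from one channel to another."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    """Multiply residue-wise; every channel is independent of the others."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

# The moduli must be pairwise coprime.
assert all(gcd(p, q) == 1 for i, p in enumerate(MODULI) for q in MODULI[i + 1:])

M = prod(MODULI)   # size of the representable range [0, M-1]
x, y = 123, 45     # x * y = 5535, which is below M
assert rns_mul(to_rns(x), to_rns(y)) == to_rns(x * y)
assert rns_add(to_rns(x), to_rns(y)) == to_rns(x + y)
```

Because each channel only ever handles values smaller than its own modulus, the per-channel multipliers stay narrow regardless of how wide the represented integer is.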
A common solution to support various integer types is to add dedicated multiply accumulator (MAC) units for each size. An advantage of this solution is that it can perform calculations of various integer types and sizes in parallel. However, when only large (e.g., HPC) or only small (e.g., AI) integer calculations are required, equipping a dedicated MAC for each size does not improve hardware efficiency. A similar idea for supporting multi-size computing is to use smaller multipliers to create a larger one. Many ways can be used to build a D×D multiplier by
multipliers. Implementation of 64-bit multiplication via 4×32-bit multiplications increases 17-20% of additional hardware and consumes 23-25% of power compared with dedicated 64-bit multiplication. The typical implementation of a dedicated 64-bit multiplier is 3.2-3.9 times larger than the area of a 32-bit multiplier, and in order to implement a 64-bit multiplier, an additional 128-bit accumulator is required. A reconfigurable multi-size feature by splitting D×D multiplier into
multipliers requires a final value to be fully reconstructed as result of a multiplication operation, which has a significant additional cost in the hardware.
In view of this, a system for matrix multiplication is proposed. The proposed system provides multi-size integer matrix multiplication support without increasing hardware area and achieves lower power consumption than regular 64-bit integer matrix multiplication.
The present application proposes to use the independent per-residue computation of the RNS to provide multi-size support for matrix multiplication. The main idea of the proposed solution is to split a 64-bit multiplication into 16 modular multiplications, which has a number of advantages. The proposed solution is intended to be used in matrix multiplication hardware by switching all multiplication and accumulation operations to the RNS. Although switching to the RNS incurs additional forward conversion and reverse conversion costs, in the case of matrix multiplication these costs can be neglected because the number of conversion operations is one order of magnitude smaller than the number of multiplications and additions.
The following describes the proposed solution of the present application.
An input value A (64 bits) represents a single 64-bit value, or two 32-bit values or four 16-bit values depending on a control signal (as CTRL shown in
the input value A represents a single 64-bit value, which is represented here as A[63:0];
the input value A represents two 32-bit values, which are represented here as {A[63:32], A[31:0]};
the input value A represents four 16-bit values, which are represented here as {A[63:48], A[47:32], A[31:16], A[15:0]}.
It should be understood that if the input value represents two 32-bit values, the two 32-bit values are the low 32 bits and the high 32 bits of the 64-bit input value, respectively; if the input value represents four 16-bit values, the four 16-bit values are four 16-bit segments of the 64-bit input value. In addition, A is only one example of an input value; in practice, the two matrices to be multiplied contain a large number of input values.
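The three interpretations of the input value can be sketched in Python as follows. This is a minimal illustration only: the function name split_lanes is hypothetical, and the encoding 0, 1, 2 of the control signal is the one used in the examples of this description.

```python
def split_lanes(a, ctrl):
    """Split a 64-bit word into element values, lowest bits first.

    ctrl = 0: one 64-bit value, ctrl = 1: two 32-bit values,
    ctrl = 2: four 16-bit values.
    """
    if not 0 <= a < (1 << 64):
        raise ValueError("input must fit in 64 bits")
    width = {0: 64, 1: 32, 2: 16}[ctrl]
    mask = (1 << width) - 1
    return [(a >> shift) & mask for shift in range(0, 64, width)]

a = 0x0123456789ABCDEF
assert split_lanes(a, 0) == [a]                               # A[63:0]
assert split_lanes(a, 1) == [0x89ABCDEF, 0x01234567]          # A[31:0], A[63:32]
assert split_lanes(a, 2) == [0xCDEF, 0x89AB, 0x4567, 0x0123]  # A[15:0] .. A[63:48]
```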
As shown in
An example of the set of the sixteen moduli {m1, m2, . . . , m16} is given below.
Consider that the following set of pairwise co-prime moduli m1=512, m2=511, m3=509, m4=489, m5=487, m6=485, m7=481, m8=479, m9=473, m10=467, m11=431, m12=167, m13=151, m14=139, m15=137, m16=131 has the following features:
Therefore, the entire set of moduli can be used for 64-bit multiplication. When CTRL=0, all the sixteen modulo reduction units will get A. In this case, outputs of the four multiplexers, that is, A0[63:0], A1[63:0], A2[63:0] and A3[63:0], are the same, that is, A.
Two subsets of the moduli can be used for 32-bit multiplication, that is, a first subset is represented as {m1, m3, m5, m7, m9, m11, m13, m15}=[512, 509, 487, 481, 473, 431, 151, 137] and a second subset is represented as {m2, m4, m6, m8, m10, m12, m14, m16}=[511, 489, 485, 479, 467, 167, 139, 131]. When CTRL=1, the first subset is handled by the first and the third multiplexers, both of which return the low 32 bits of the input value A, and the second subset is handled by the second and the fourth multiplexers, both of which return the high 32 bits of the input value A. In this case, the first and the third groups will get A[31:0], and the second and the fourth groups will get A[63:32].
Four subsets of the moduli can be used for 16-bit multiplication, that is, {m1, m5, m9, m13}=[512, 487, 473, 151], {m3, m7, m11, m15}=[509, 481, 431, 137], {m2, m6, m10, m14}=[511, 485, 467, 139], {m4, m8, m12, m16}=[489, 479, 167, 131]. When CTRL=2, each multiplexer will return the corresponding 16 bits of the input value A, and each group of modulo reduction units will process a single 16-bit value. In this case, the first group will get A[15:0], the second group will get A[31:16], the third group will get A[47:32], and the fourth group will get A[63:48].
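The suitability of the example moduli for each size type can be checked with a short Python sketch. It assumes that a product of two w-bit values needs a range of 2^(2w), so the moduli serving a 64-bit, 32-bit or 16-bit value must have a product above 2^128, 2^64 or 2^32, respectively; any additional growth caused by accumulating many products is not modelled here. The helper name covers is hypothetical.

```python
from itertools import combinations
from math import gcd, prod

# m1 .. m16 as listed in the example.
MODULI = [512, 511, 509, 489, 487, 485, 481, 479,
          473, 467, 431, 167, 151, 139, 137, 131]

# Groups of four: {m1,m5,m9,m13}, {m2,m6,m10,m14}, {m3,m7,m11,m15}, {m4,m8,m12,m16}.
GROUPS = [MODULI[g::4] for g in range(4)]

def covers(moduli, bits):
    """True if the product of the moduli exceeds 2**bits."""
    return prod(moduli) > (1 << bits)

pairwise_coprime = all(gcd(p, q) == 1 for p, q in combinations(MODULI, 2))

ok_64 = covers(MODULI, 128)                     # one 64-bit value
ok_32 = (covers(GROUPS[0] + GROUPS[2], 64)      # odd-indexed moduli
         and covers(GROUPS[1] + GROUPS[3], 64)) # even-indexed moduli
ok_16 = all(covers(group, 32) for group in GROUPS)

print(pairwise_coprime, ok_64, ok_32, ok_16)
```

All four checks hold for the example set, which is consistent with using the whole set for 64-bit values, two halves for 32-bit values and four quarters for 16-bit values.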
In the above different cases, the modulo reduction units perform a modulo reduction operation on values received from the corresponding multiplexer to obtain converted values in the RNS corresponding to input values. It should be noted that the converted values actually are RNS residues of the input values.
For example, when CTRL=0, the input value is a single 64-bit value, and each modulo reduction unit gets the single 64-bit value and then processes it based on its corresponding modulo. Modulo reduction unit 1 performs the modulo reduction operation on the single 64-bit value based on its corresponding modulo, that is, m1, modulo reduction unit 5 performs the modulo reduction operation on the single 64-bit value based on its corresponding modulo, that is, m5, and so on. The sixteen modulo reduction units will return sixteen residues, that is {x1, x2, x3, . . . , x16} shown in
When CTRL=1, the input value is two 32-bit values. The modulo reduction units of the first and the third groups get the low 32 bits of the input value, which is called a first 32-bit value, and the modulo reduction units of the second and the fourth groups get the high 32 bits of the input value, which is called a second 32-bit value. Each modulo reduction unit of the first and the third groups performs the modulo reduction operation on the first 32-bit value based on its corresponding modulo to obtain a corresponding residue. Each modulo reduction unit of the second and the fourth groups performs the modulo reduction operation on the second 32-bit value based on its corresponding modulo to obtain a corresponding residue. As a result, the first 32-bit value is converted into a set of residues in the RNS by the set of moduli {m1, m5, m9, m13, m3, m7, m11, m15}, where the set of residues is {x1, x5, x9, x13, x3, x7, x11, x15}, and the second 32-bit value is converted into a set of residues in the RNS by the set of moduli {m2, m6, m10, m14, m4, m8, m12, m16}, where the set of residues is {x2, x6, x10, x14, x4, x8, x12, x16}.
When CTRL=2, the input value is four 16-bit values. Each group of modulo reduction units will get a corresponding single 16-bit value, and then process the single 16-bit value. Specifically, the first group performs the modulo reduction operation by a set of moduli, that is {m1, m5, m9, m13}, to obtain a set of residues, that is {x1, x5, x9, x13}, and the second group performs the modulo reduction operation by a set of moduli, that is {m3, m7, m11, m15}, to obtain a set of residues, that is, {x3, x7, x11, x15}, and so on.
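The behaviour of the multiplexers and the modulo reduction units in the three modes can be modelled in software as follows. This is a hedged sketch rather than the hardware design: the function names mux_select and forward_convert are hypothetical, groups are indexed 0 to 3, and the assignment of moduli to groups is assumed to be {m1, m5, m9, m13}, {m2, m6, m10, m14}, {m3, m7, m11, m15}, {m4, m8, m12, m16}.

```python
MODULI = [512, 511, 509, 489, 487, 485, 481, 479,
          473, 467, 431, 167, 151, 139, 137, 131]
GROUPS = [MODULI[g::4] for g in range(4)]  # assumed group-to-moduli mapping

def mux_select(a, ctrl, group):
    """Value the multiplexer of `group` forwards, zero-extended to 64 bits."""
    if ctrl == 0:
        return a                                         # A[63:0] for every group
    if ctrl == 1:
        # Groups 0 and 2 receive A[31:0]; groups 1 and 3 receive A[63:32].
        return (a >> (32 * (group % 2))) & 0xFFFFFFFF
    if ctrl == 2:
        return (a >> (16 * group)) & 0xFFFF              # A[16g+15 : 16g]
    raise ValueError("unsupported control value")

def forward_convert(a, ctrl):
    """Residues per group: four lists of four residues each."""
    return [[mux_select(a, ctrl, g) % m for m in GROUPS[g]] for g in range(4)]

a = 0x0123456789ABCDEF
res64 = forward_convert(a, 0)   # all sixteen units reduce the same 64-bit value
res32 = forward_convert(a, 1)   # two 32-bit values, eight residues each
res16 = forward_convert(a, 2)   # four 16-bit values, four residues each
```

Every modulo reduction unit always receives a 64-bit operand, so the same reduction logic serves all three modes; only the multiplexer selection changes.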
In addition, in
Independent of the operation mode (the CTRL signal), the delay of the forward conversion remains unchanged. This is mainly because all the modulo reduction units work with 64-bit input values regardless of the CTRL signal value. When only 16 bits are to be converted, the most significant bits are simply filled with 0.
It should be understood that, the architecture shown in
Other structures of the system proposed by the present application will be described below by taking the architecture shown in
It should be understood that each modulo multiplier makes a calculation of zi=|xi×yi|mi, and each modulo adder makes a calculation of zi=|xi+yi|mi, where 1≤i≤16, and i is an integer. Here, xi is an i-th residue of a residue set of an element value of a matrix A, and yi is an i-th residue of an element value of a matrix B.
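A single modular MAC channel can be sketched in Python as below. The function name modular_mac and the sample row and column are illustrative assumptions; the sketch only shows that reducing after every multiplication and addition gives the same residue as reducing the full dot product once.

```python
def modular_mac(xs, ys, m):
    """Accumulate sum(x * y) modulo m, reducing after every step."""
    acc = 0
    for x, y in zip(xs, ys):
        z = (x * y) % m        # modulo multiplier: z = |x * y|_m
        acc = (acc + z) % m    # modulo adder:      acc = |acc + z|_m
    return acc

m = 511                                   # one modulo of the example set
a_row = [1000, 2000, 3000]                # element values of a row of matrix A
b_col = [7, 8, 9]                         # element values of a column of matrix B
x_res = [a % m for a in a_row]            # residues delivered by forward conversion
y_res = [b % m for b in b_col]

expected = sum(a * b for a, b in zip(a_row, b_col)) % m
assert modular_mac(x_res, y_res, m) == expected
```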
The sixteen MAC units are implicitly grouped based on the control signal value, processing the single 64-bit value, or two 32-bit values or four 16-bit values. For example, if CTRL=0, the sixteen MAC units form a group, and process the single 64-bit value, and if CTRL=1, the sixteen MAC units form two groups, one of the two groups includes eight MAC units corresponding to 8 moduli, that is, m1, m5, m9, m13, m3, m7, m11, and m15, and the other group includes eight MAC units corresponding to the other 8 moduli, that is, m2, m6, m10, m14, m4, m8, m12 and m16.
according to a control signal. Inputs for the reverse conversion unit are the arithmetic operation results zi, which are first multiplied by a pre-computed constant Mi·Mi^(−1); the constant is different for each modulo and each CTRL signal value. The CRT for reverse conversion is merely taken as an example. The diagram of
Specifically, as shown in
The reverse conversion unit further includes a multiplier set, and the multiplier set includes sixteen multipliers, where the sixteen multipliers are in one-to-one correspondence with the sixteen multiplexers, and the sixteen multipliers are in one-to-one correspondence with the sixteen MAC units. Each multiplier of the sixteen multipliers contained in the multiplier set is configured to perform a multiplication operation on an output of the corresponding multiplexer and an output of the corresponding MAC unit.
The reverse conversion unit further includes three adder sets. A first adder set of the three adder sets includes four adders, where the four adders are in one-to-one correspondence with the four multiplexer groups, and an input of each adder of the four adders includes the outputs of the four multipliers of the corresponding multiplexer group.
A second adder set of the three adder sets includes two adders, where a first adder of the two adders is configured to receive an output of a first adder and an output of a second adder contained in the first adder set, and a second adder of the two adders is configured to receive an output of a third adder and an output of a fourth adder contained in the first adder set.
A third adder set of the three adder sets includes an adder, where the adder is configured to receive outputs of the two adders contained in the second adder set.
The second multiplexer set includes four multiplexers, where an input of each multiplexer is an operation result by performing a modulo reduction operation on outputs of the four adders contained in the first adder set, or outputs of the two adders contained in the second adder set, or an output of the adder contained in the third adder set.
The four multiplexers included in the second multiplexer set are configured to receive the control signal, and determine a corresponding output value according to the control signal value.
As shown in
The final set of multiplexers (that is, the second multiplexer set) forms correct output values according to a CTRL signal value, that is, the single 128-bit value, or two 64-bit values or four 32-bit output values.
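A software model of a CRT-based reverse conversion with mode-dependent constants is sketched below. It is an illustration under stated assumptions, not the circuit itself: the names LANES, crt_constants and reverse_convert are hypothetical, the constant for each modulo is taken as Mi·(Mi^(−1) mod mi) with Mi being the product of the other moduli serving the same value, the pairing of groups per mode is assumed from the forward conversion, and the second and third adder levels are collapsed into a single sum. The modular inverse uses pow(Mi, -1, m), available in Python 3.8 and later.

```python
from math import prod

MODULI = [512, 511, 509, 489, 487, 485, 481, 479,
          473, 467, 431, 167, 151, 139, 137, 131]
GROUPS = [MODULI[g::4] for g in range(4)]

# Assumed: which groups jointly represent one value in each mode.
LANES = {0: [[0, 1, 2, 3]], 1: [[0, 2], [1, 3]], 2: [[0], [1], [2], [3]]}

def crt_constants(ctrl):
    """Per-modulo constants Mi * (Mi^-1 mod mi) and the range of each lane."""
    consts = [[0] * 4 for _ in range(4)]
    ranges = []
    for lane in LANES[ctrl]:
        lane_range = prod(m for g in lane for m in GROUPS[g])
        ranges.append(lane_range)
        for g in lane:
            for j, m in enumerate(GROUPS[g]):
                Mi = lane_range // m
                consts[g][j] = Mi * pow(Mi, -1, m)
    return consts, ranges

def reverse_convert(z, ctrl):
    """z[g][j] is the result residue for GROUPS[g][j]; returns one value per lane."""
    consts, ranges = crt_constants(ctrl)
    # Multiplier set and first adder level: one partial sum per group.
    level1 = [sum(r * c for r, c in zip(z[g], consts[g])) for g in range(4)]
    # Higher adder levels merge the groups of a lane, then a final reduction.
    return [sum(level1[g] for g in lane) % rng
            for lane, rng in zip(LANES[ctrl], ranges)]

x, y = 2**64 - 1, 2**64 - 3
z = [[(x * y) % m for m in group] for group in GROUPS]
assert reverse_convert(z, 0) == [x * y]   # a single 128-bit result
```

In this model the 16-bit mode stops after the first accumulation level, the 32-bit mode merges two groups, and the 64-bit mode merges all four, which mirrors the hierarchical accumulation described for the adder sets.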
It can be seen from
The four multiplexers contained in the second multiplexer set are specifically configured as follows:
Alternatively, an output matrix will have 2× larger size type for the elements, which can be configured to properly handle overflows or implement fixed-point data types.
This application has the following advantages:
No dedicated multi-size support is needed in the modular multipliers, except for dedicated handling of 8-bit multiplication, which can simply disable modular reduction in the present application. The logic of the other size types is fully handled by the RNS-to-BNS and BNS-to-RNS converters. A separate control signal is added to the converters to maintain all possible input/output types that a matrix multiplier can handle. Depending on the control signal value, an input 64-bit value can be interpreted as 1×64-bit value, or 2×32-bit values, or 4×16-bit values or 16×8-bit values.
The RNS forward conversion is implemented so as to find, for each modulo, the residue of a 64-bit value. This logic can be fully reused for the different size types by operating on 64-bit input values; smaller sizes simply need to be extended to 64 bits. The latency of the forward conversion remains the same for all the integer size types. A proper grouping of moduli is defined architecturally and fixed in the hardware. For the 8-bit input type, forward conversion is simply bypassed.
Reverse conversion is based on hierarchical accumulation of intermediate values, where the 64-bit size type requires accumulation across all the levels, while the 32-bit size type avoids the last level of the accumulation, the 16-bit size type avoids the last two levels of the accumulation, and the 8-bit size type bypasses the reverse conversion.
Forward conversion and reverse conversion take more area and consume more power than modulo multipliers; however, they are only needed to convert the values of the input matrices and to convert the values of the output matrix back. For an n×n matrix, the number of forward converters needed is 2×n^2 and the number of reverse converters needed is n^2, while, assuming that the matrices are multiplied in one cycle, the total number of modular arithmetic units is n^3, which allows all costs of forward/reverse conversion to be hidden under the benefits of small modular multipliers. If a matrix multiplication hardware accelerator operates on a single row and a single column in one cycle, the ratio between converters and modular arithmetic units remains the same, that is, 2×n forward converters, n reverse converters, and n^2 modular arithmetic units in total.
RNS multipliers (for the whole modulo set covering the [0, 2^128−1] range) can be implemented with 40% less area than a single 64-bit multiplier while consuming 50% less power. Assuming that the hardware area occupied by a 64×64 multiplier and its power consumption are each 1, we can build the following table with the costs of RNS-based matrix multiplication (all the numbers below are results of actual experiments using the best known algorithms for forward/reverse conversion and modulo reduction).
Therefore, the RNS matrix multiplication area is 2×0.7n^2+0.6n^3+1.6n^2 versus
When n=8, the RNS matrix multiplication area is 0.975 of 64×64 matrix multiplication area. When n=16, the RNS matrix multiplication area is 0.79 of 64×64 matrix multiplication area.
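The area figures can be reproduced with a short calculation. The sketch below assumes that the coefficients 0.7, 0.6 and 1.6 are the relative areas of one forward converter, one modular arithmetic unit and one reverse converter, and that the baseline consists of n^3 conventional 64×64 multipliers of area 1 each; both readings are inferred from the surrounding text, and the function names are hypothetical.

```python
def rns_area(n):
    """Area of an n x n RNS matrix multiplier, in units of one 64x64 multiplier."""
    forward = 2 * 0.7 * n**2   # 2*n^2 forward converters (assumed 0.7 each)
    arithmetic = 0.6 * n**3    # n^3 modular arithmetic units (assumed 0.6 each)
    reverse = 1.6 * n**2       # n^2 reverse converters (assumed 1.6 each)
    return forward + arithmetic + reverse

def area_ratio(n):
    """RNS area relative to an assumed baseline of n^3 multipliers of area 1."""
    return rns_area(n) / n**3

for n in (8, 16, 32):
    print(n, round(area_ratio(n), 4))
```

Under these assumptions the ratio is about 0.975 for n=8 and about 0.7875 for n=16, and it approaches 0.6 as n grows because the n^2 conversion terms become small relative to the n^3 arithmetic term.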
The RNS matrix multiplication power consumption is 2×0.5n^2+0.5n^3+1.7n^2 versus
When n=8, the RNS matrix multiplication power consumption is 0.775 of 64×64 matrix multiplication power consumption. When n=16, the RNS matrix multiplication power consumption is 0.64 of 64×64 matrix multiplication power consumption. It should be noted that these numbers can differ depending on the actual process technology.
Accordingly, this application also provides a method for matrix multiplication.
Step 610: receiving binary input values and a control signal, where the control signal is configured to indicate a size type of the element values of two matrices performing matrix multiplication.
As an example, the size type includes one of the following types: a 64-bit type, a 32-bit type or a 16-bit type.
Step 620: obtaining the element values according to the input values and the control signal, for example, obtaining the element values by determining each input value as a single 64-bit element value when the control signal value is equal to a first value, or obtaining the element values by splitting each input value into two 32-bit element values when the control signal value is equal to a second value, or obtaining the element values by splitting each input value into four 16-bit element values when the control signal value is equal to a third value.
Step 630: making a forward conversion on the element values to obtain converted values of the element values, where a converted value of each element value is a set of residues of the element value in a residue number system (RNS).
Step 640: performing a modular arithmetic operation on the converted values to obtain operation results, where the modular arithmetic operation includes matrix multiplication.
Step 650: making a reverse conversion on the operation results according to the control signal to obtain binary output values, where the output values and the input values have the same size type, and an output matrix will consist of the output values.
As an example, the system makes the reverse conversion on the operation results by using the Chinese remainder theorem (CRT), EXCRT or MRC according to the control signal to obtain binary output values.
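Steps 630 to 650 can be tied together in a small end-to-end Python sketch for 16-bit element values. It is illustrative only: it uses one four-modulo group of the example set, a CRT reverse conversion, and hypothetical function names, and it assumes that every accumulated dot product stays below the product of the four moduli.

```python
from math import prod

MODULI = (512, 487, 473, 151)   # one group of the example set, used for 16-bit values
M = prod(MODULI)                # results must stay below this range

def forward(x):
    """Step 630: residues of one element value."""
    return [x % m for m in MODULI]

def reverse(residues):
    """Step 650: CRT reconstruction of one output value."""
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # pow(..., -1, m) needs Python 3.8+
    return total % M

def rns_matmul(A, B):
    """Multiply matrices A (n x k) and B (k x p) entirely on residues."""
    Ar = [[forward(v) for v in row] for row in A]
    Br = [[forward(v) for v in row] for row in B]
    C = []
    for i in range(len(A)):
        out_row = []
        for j in range(len(B[0])):
            acc = [0] * len(MODULI)
            for t in range(len(B)):        # step 640: modular multiply-accumulate
                acc = [(s + x * y) % m
                       for s, x, y, m in zip(acc, Ar[i][t], Br[t][j], MODULI)]
            out_row.append(reverse(acc))
        C.append(out_row)
    return C

A = [[60000, 123], [456, 65535]]
B = [[65535, 1], [2, 65535]]
plain = [[sum(A[i][t] * B[t][j] for t in range(2)) for j in range(2)] for i in range(2)]
assert rns_matmul(A, B) == plain
```

Conversions happen once per input and output element, whereas the modular multiply-accumulate step runs once per term of every dot product.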
All related contents of steps in a method embodiment have been described in detail in the foregoing system embodiment, and in order to avoid repetition, details are not repeated here.
The present application also provides a chip or chip system including a plurality of circuits, and the plurality of circuits are configured to have functions of units included in a system for matrix multiplication as provided in the present application.
In the several embodiments provided in this application, it should be understood that the disclosed system and method may be implemented in other manners. For example, the described apparatus embodiment is merely an example. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may be or may not be physically separate, and parts displayed as units may be or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of this application may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
The foregoing descriptions are merely specific implementations of this application, but are not intended to limit the protection scope of this application. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
This application is a continuation of International Application No. PCT/RU2022/000293, filed on Sep. 27, 2022, the disclosure of which is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
Parent | PCT/RU2022/000293 | Sep 2022 | WO |
Child | 19090058 | US |