MATRIX MULTIPLIER AND OPERATION METHOD OF MATRIX MULTIPLY DEVICE INCLUDING THE SAME

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to and the benefit of Korean Patent Application No. 10-2024-0019936 filed in the Korean Intellectual Property Office on Feb. 8, 2024, and Korean Patent Application No. 10-2023-0143978 filed in the Korean Intellectual Property Office on Oct. 25, 2023, the entire contents of each of which are incorporated herein by reference.

BACKGROUND

Various example embodiments relate to a semiconductor device. More particularly, the present disclosure relates to a matrix multiplier performing matrix multiplication, and a matrix multiply device including the same.

As artificial intelligence technology has recently developed, a calculation amount of an artificial intelligence model is rapidly increasing. Accordingly, various technologies are being researched to shorten a running time of artificial intelligence models.

Generally, most of or a significant amount of the operation time of the artificial intelligence model is spent on matrix multiplication. For example, the artificial intelligence model may spend most of the running time calculating the output matrix by performing multiplication of an input matrix and a weight matrix. Accordingly, various algorithms, such as binary coding quantization (BCQ), are being researched to perform the multiplication of the input matrix and the weight matrix with fewer calculations.

SUMMARY

Various example embodiments attempt to provide a matrix multiplier configured to perform matrix multiplication at a faster speed and/or with fewer calculations, and/or a matrix multiply device including the same.

According to various example embodiments, a matrix multiplier may include an input vector scaler configured to generate a first quantization scaled input vector based on a first input vector, on a plurality of common scale coefficients, and on first-to-R-th multiplication scale coefficients, where R is an integer greater than or equal to 2, a first data type converter configured to generate a first fixed point quantization scaled input vector based on the first quantization scaled input vector, a processing element array comprising a first processing element configured to generate a first fixed point output element based on the first fixed point quantization scaled input vector and on a first plurality of quantization sign bits, and a second processing element configured to generate a second fixed point output element based on the first fixed point quantization scaled input vector and on a second plurality of quantization sign bits, and a second data type converter configured to generate first and second output elements by converting data types of the first and second fixed point output elements, respectively, and configured to output a first output vector including the first and second output elements.

Alternatively or additionally according to various example embodiments, an operation method of a matrix multiply device may include receiving first-to-Nth weights from an external device, where N is an integer greater than or equal to 2, generating first-to-Nth common scale coefficients, first-to-Rth multiplication scale coefficients, and first-to-(N×R)th quantization sign bits by performing uniform binary coding quantization for the first-to-Nth weights, where R is an integer greater than or equal to 2, receiving first-to-Nth input elements from the external device, generating first-to-(N×R)th quantization scaled input elements by performing quantization scaling for the first-to-Nth input elements based on the first-to-Nth common scale coefficients and on the first-to-Rth multiplication scale coefficients, and outputting a first output element generated based on the first-to-(N×R)th quantization sign bits and the first-to-(N×R)th quantization scaled input elements.

Alternatively or additionally according to various example embodiments, a matrix multiplier may include an input vector scaler configured to generate a first multiplication scaled input vector based on a first input vector and first-to-Rth multiplication scale coefficients, where R is an integer greater than or equal to 2, a first data type converter configured to generate a first fixed point multiplication scaled input vector based on the first multiplication scaled input vector, a processing element array including a first processing element configured to generate a first fixed point partial product based on the first fixed point multiplication scaled input vector and on a first plurality of quantization sign bits, a second data type converter configured to generate a first partial product by converting data type of the first fixed point partial product, and a common scaler configured to generate a first output element based on a product of the first partial product with a first common scale coefficient, and configured to output a first output vector including the first output element.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a matrix multiply device according to some example embodiments.

FIG. 2 is a diagram illustrating an operation of a matrix multiply device implemented to directly multiply an input matrix and a weight matrix of FIG. 1.

FIGS. 3 and 4 are diagrams illustrating an operation of a uniform BCQ circuit of FIG. 1.

FIG. 5 is a diagram illustrating the operation of the uniform BCQ circuit of FIG. 1, which performs a binary coding quantization operation for each column of the weight matrix.

FIG. 6 is a diagram illustrating an operation of a matrix multiplier according to the embodiment of FIG. 5.

FIG. 7 is a block diagram illustrating a configuration of the matrix multiplier of FIG. 1 according to the embodiment of FIG. 4.

FIG. 8 is a block diagram illustrating a configuration of an input vector scaler of FIG. 7.

FIG. 9 is a diagram illustrating an operation of a multiplication scaling circuit of FIG. 8 in more detail.

FIG. 10 is a diagram illustrating in more detail a method of performing, by the multiplication scaling circuit of FIG. 8, a multiplication scaling operation.

FIG. 11 is a diagram illustrating a configuration of a first data type converter of FIG. 7.

FIG. 12 is a diagram illustrating an operation of an exponent extraction circuit of FIG. 11.

FIG. 13 is a diagram illustrating an operation of a data type conversion circuit of FIG. 11.

FIG. 14 is a block diagram illustrating in more detail a configuration of a processing element array of FIG. 7.

FIG. 15 is a block diagram illustrating in more detail some operations of the matrix multiplier of FIG. 7.

FIG. 16 is a block diagram illustrating in more detail an operation of a first processing element row of FIG. 15.

FIG. 17 is a diagram illustrating a configuration of one of the processing elements of FIG. 16 implemented according to an embodiment.

FIG. 18 is a diagram illustrating an operation of the second data type converter of FIG. 7.

FIG. 19 is a flowchart illustrating an operation of the matrix multiply device of FIG. 1.

FIG. 20 is a flowchart illustrating in more detail step S150 of FIG. 19.

FIG. 21 is a flowchart illustrating the operation of the matrix multiply device of FIG. 1.

FIG. 22 is a flowchart illustrating in more detail step S250 of FIG. 21.

FIG. 23 is a diagram illustrating the operation of the BCQ circuit of FIG. 1 according to some example embodiments.

FIG. 24 is a diagram illustrating an approximated weight matrix according to the embodiment of FIG. 23.

FIG. 25 is a diagram illustrating quantization sign bit matrices of FIG. 24.

FIG. 26 is a block diagram illustrating a configuration of the matrix multiplier of FIG. 1 implemented according to some example embodiments.

FIG. 27 is a block diagram illustrating a configuration of an input vector scaler of FIG. 26 according to an embodiment.

FIG. 28 is a diagram illustrating in more detail an operation of a common scaling circuit of FIG. 27.

FIG. 29 is a block diagram illustrating a configuration of the input vector scaler of FIG. 26 according to an embodiment.

FIG. 30 is a diagram illustrating in more detail the operation of the common scaling circuit of FIG. 29.

FIG. 31 is a diagram illustrating in more detail an operation of a multiplication scaling circuit of FIG. 29.

FIG. 32 is a block diagram illustrating a configuration of the input vector scaler of FIG. 26 according to an embodiment.

FIG. 33 is a diagram illustrating in more detail an operation of the multiplication scaling circuit of FIG. 32.

FIG. 34 is a diagram illustrating in more detail an operation of a quantization scaling circuit of FIG. 32.

FIG. 35 is a block diagram illustrating a configuration of the first data type converter of FIG. 26.

FIG. 36 is a block diagram illustrating in more detail a configuration of a processing element array of FIG. 26.

FIG. 37 is a block diagram illustrating in more detail an operation of the processing elements of FIG. 36.

FIG. 38 is a diagram illustrating a configuration of one of the processing elements of FIG. 36 implemented according to the embodiment.

FIG. 39 is a diagram illustrating an operation of the second data type converter of FIG. 26.

FIG. 40 is a flowchart illustrating the operation of the matrix multiply device of FIG. 1.

FIG. 41 is a flowchart illustrating in more detail step S350 of FIG. 40.

FIG. 42 is a flowchart illustrating the operation of the matrix multiply device of FIG. 1.

FIGS. 43 to 45 are flowcharts illustrating in more detail step S440 of FIG. 42 implemented according to the embodiment.

FIG. 46 is a flowchart illustrating in more detail step S450 of FIG. 42.

FIG. 47 is a diagram illustrating the operation of the BCQ circuit of FIG. 1 according to the embodiment.

FIG. 48 is a block diagram illustrating a configuration of the matrix multiplier of FIG. 1 implemented according to the embodiment.

FIG. 49 is a block diagram illustrating a configuration of the input vector scaler of FIG. 48 according to the embodiment.

FIG. 50 is a diagram illustrating in more detail an operation of the multiplication scaling circuit of FIG. 49.

FIG. 51 is a diagram illustrating in more detail an operation of the common scaling circuit of FIG. 49.

FIG. 52 is a block diagram illustrating a configuration of the input vector scaler of FIG. 48 according to the embodiment.

FIG. 53 is a diagram illustrating in more detail an operation of the common scaling circuit of FIG. 52.

FIG. 54 is a diagram illustrating in more detail an operation of the multiplication scaling circuit of FIG. 52 according to the embodiment.

FIG. 55 is a block diagram illustrating a configuration of the input vector scaler of FIG. 48 according to various example embodiments.

FIG. 56 is a diagram illustrating in more detail an operation of the multiplication scaling circuit of FIG. 55.

FIG. 57 is a diagram illustrating in more detail an operation of the quantization scaling circuit of FIG. 55.

FIG. 58 is a diagram illustrating in more detail an operation of the processing element array of FIG. 48.

FIG. 59 is a block diagram illustrating the processing element array of FIG. 26 implemented in a systolic array manner.

FIG. 60 is a diagram illustrating in more detail a configuration of the processing element of FIG. 59.

FIG. 61 is a diagram illustrating an operation of the matrix multiply device of FIG. 1 according to various example embodiments.

FIG. 62 is a diagram illustrating a full-input matrix of FIG. 61.

FIG. 63 is a diagram illustrating a full-weight matrix of FIG. 61.

FIG. 64 is a diagram illustrating a full-output matrix of FIG. 61.

FIG. 65 is a block diagram illustrating a neural processing system implemented according to an embodiment.

FIG. 66 is a block diagram illustrating an artificial intelligence model driven by the neural processing system of FIG. 65.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

Hereinafter, some example embodiments will be described clearly and in detail to the extent that those skilled in the art can easily practice the present disclosure. Details such as detailed configurations and structures are provided merely to facilitate a general understanding of various example embodiments. Therefore, modifications of various example embodiments described herein may be made by those of ordinary skill in the art without departing from the scope of inventive concepts. Moreover, descriptions of certain functions and structures, such as well-known functions and structures, may be omitted for clarity and simplicity.

Components in the following drawings or detailed description may be connected with components other than those shown in the drawings or described in the detailed description. The terms used in the text are terms defined in consideration of the functions of the present disclosure, and are not limited to specific functions. Definitions of terms may be determined based on the details described in the detailed description. Components described with reference to terms such as a driver or a block used in the detailed description may be implemented in the form of software, hardware, or a combination thereof. Illustratively, the software may be machine code, firmware, embedded code, or application software. For example, the hardware may include an electrical circuit, an electronic circuit, a processor, a computer, IC cores, a pressure sensor, an inertial sensor, a micro electromechanical system (MEMS), passive devices, or combinations thereof.

Hereinafter, for a more concise explanation, a matrix is denoted with brackets “[”, “]”, a set is denoted with curly braces “{”, “}”, and a vector may be denoted with an over-arrow. However, the scope of inventive concepts is not limited to this notation method.

FIG. 1 is a block diagram illustrating a matrix multiply device according to some example embodiments. Referring to FIG. 1, a matrix multiply device MMD (e.g. a matrix multiplier device or matrix multiplication device) may include a matrix multiplier 100 and a uniform BCQ circuit UBC. The matrix multiply device MMD may receive an input matrix XM.

The input matrix XM may include a plurality of input vectors. Each of the plurality of input vectors may include a plurality of input elements. For example, the input matrix XM may be expressed as Equation 1 below.

$\begin{matrix} XM = [\begin{matrix} \vec{X_{1}} \\ \vec{X_{2}} \\ ⋮ \\ \vec{X_{h}} \end{matrix}] = [\begin{matrix} x_{11} & x_{12} & \dots & x_{1 n} \\ x_{21} & x_{22} & \dots & x_{2 n} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ x_{h 1} & x_{h 2} & \dots & x_{hn} \end{matrix}] & (Equation 1) \end{matrix}$

Referring to Equation 1, XM may represent the input matrix XM, $\vec{X_{1}}$ to $\vec{X_{h}}$ may represent first to (h)-th input vectors, respectively, and $x_{11}$ to $x_{hn}$ may represent different input elements. For example, $x_{11}$ to $x_{1n}$ may represent input elements included in a first input vector (e.g., $\vec{X_{1}}$), and $x_{h1}$ to $x_{hn}$ may represent input elements included in the (h)-th input vector (e.g., $\vec{X_{h}}$).

Hereinafter, for a more concise description, example embodiments in which the dimension of each of the input vectors included in the input matrix XM is ‘n’ will be representatively described. That is, hereinafter, example embodiments in which each of the input vectors includes ‘n’ input elements will be representatively described. For example, hereinafter, example embodiments in which the input matrix XM includes ‘n’ columns will be representatively described. Here, ‘h’ may be the same as, greater than, or less than ‘n’.

In some example embodiments, each of the input elements included in the input matrix XM may have a 16-bit floating point (FP16) and/or 32-bit floating point (FP32) data type. However, example embodiments are not limited thereto.

The matrix multiply device MMD may receive a weight matrix WM. The weight matrix WM may include a plurality of weights. For example, the weight matrix WM may be expressed as Equation 2 below.

$\begin{matrix} WM = [\begin{matrix} w_{11} & w_{12} & \dots & w_{1 m} \\ w_{21} & w_{22} & \dots & w_{2 m} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ w_{n 1} & w_{n 2} & \dots & w_{nm} \end{matrix}] & (Equation 2) \end{matrix}$

Referring to Equation 2, WM may represent a weight matrix WM, and each of $w_{11}$ to $w_{nm}$ may represent different weights. For example, $w_{ij}$ may be a weight arranged in an (i)-th row and a (j)-th column of the weight matrix WM. The weight matrix WM may have ‘n’ rows and ‘m’ columns. Here, ‘m’ may be the same as, greater than, or less than ‘n’.

In some example embodiments, each of the weights included in the weight matrix WM may have a 16-bit floating point (FP16) or 32-bit floating point (FP32) data type. However, example embodiments are not limited thereto.

The uniform BCQ circuit UBC may perform uniform binary coding quantization (BCQ) for the weight matrix WM. For example, the uniform BCQ circuit UBC may determine a plurality of common scale coefficients CSC, a plurality of multiplication scale coefficients MSC, and a plurality of quantization sign bits QSB based on the weight matrix WM.

More specifically, the uniform BCQ circuit UBC may convert each of a plurality of weights to a plurality of combinations of a common scale coefficient CSC, multiplication scale coefficients MSC, and quantization sign bits QSB. For example, the uniform BCQ circuit UBC may approximate each of the weights of the weight matrix WM based on the plurality of combinations of a common scale coefficient CSC, multiplication scale coefficients MSC, and quantization sign bits QSB. That is, the uniform BCQ circuit may determine, for each weight, a common scale coefficient CSC, multiplication scale coefficients MSC and quantization sign bits QSB that represent a quantized value of the weight. The common scale coefficient CSC, multiplication scale coefficients MSC and quantization sign bits QSB can be used to determine the quantized value of the weight. This will be described in more detail below.

In some example embodiments, each of the plurality of quantization sign bits QSB may represent ‘0’ or ‘1’.

In various example embodiments, each of the plurality of common scale coefficients CSC may have the same data type as the weight of the weight matrix WM. The data type may define a set of possible values, a set of allowed operations on the values, and/or a representation of the values. For example, each of the plurality of common scale coefficients CSC may have an FP16 or FP32 data type. However, example embodiments are not limited thereto. The operation of the uniform BCQ circuit UBC will be described in more detail below with reference to FIG. 3.

The matrix multiplier 100 may receive the plurality of common scale coefficients CSC, the plurality of multiplication scale coefficients MSC, and the plurality of quantization sign bits QSB. The matrix multiplier 100 may perform matrix multiplication on the input matrix XM and the weight matrix WM based on the plurality of common scale coefficients CSC, the plurality of multiplication scale coefficients MSC, and the plurality of quantization sign bits QSB. For example, the matrix multiplier 100 may multiply the input matrix XM by the weight matrix WM approximated by the plurality of common scale coefficients CSC, the plurality of multiplication scale coefficients MSC, and the plurality of quantization sign bits QSB, to generate an output matrix YM.

The output matrix YM may include a plurality of output vectors. Each of the plurality of output vectors may include a plurality of output elements. For example, the output matrix YM may be expressed as Equation 3 below.

$\begin{matrix} YM = [\begin{matrix} \vec{Y_{1}} \\ \vec{Y_{2}} \\ ⋮ \\ \vec{Y_{h}} \end{matrix}] = [\begin{matrix} y_{11} & y_{12} & \dots & y_{1 m} \\ y_{21} & y_{22} & \dots & y_{2 m} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ y_{h 1} & y_{h 2} & \dots & y_{hm} \end{matrix}] & (Equation 3) \end{matrix}$

In this case, YM may represent the output matrix YM, and $y_{11}$ to $y_{hm}$ each may represent different output elements. For example, $y_{11}$ to $y_{1m}$ may represent output elements included in a first output vector (e.g., $\vec{Y_{1}}$), and $y_{h1}$ to $y_{hm}$ may represent output elements included in an (h)-th output vector (e.g., $\vec{Y_{h}}$).

In some example embodiments, ‘n’ and ‘m’ may be the same integer. For example, the weight matrix WM may be implemented as a square matrix. In this case, the dimension of the output vectors included in the output matrix YM may be the same as the dimension of the input vector. However, example embodiments are not limited thereto.

In some example embodiments, when the matrix multiply device MMD calculates the output matrix YM by directly multiplying the input matrix XM and the weight matrix WM, the matrix multiply device MMD may have to use a very large amount of calculation resources and/or time. In this case, the operation speed of the matrix multiply device MMD may decrease. The operation of the matrix multiply device MMD, which calculates the output matrix YM by directly multiplying the input matrix XM and the weight matrix WM, is described in more detail below with reference to FIG. 2.

On the other hand, when the matrix multiply device MMD calculates the output matrix YM by multiplying the input matrix XM and the weight matrix WM which is approximated by the plurality of common scale coefficients CSC, the plurality of multiplication scale coefficients MSC, and the plurality of quantization sign bits QSB, the calculation amount of the matrix multiply device MMD may be reduced, e.g., may be greatly reduced. The operation of the matrix multiply device MMD for calculating the output matrix YM based on the plurality of common scale coefficients CSC, the plurality of multiplication scale coefficients MSC, and the plurality of quantization sign bits QSB will be described in more detail below with reference to drawings.

FIG. 2 is a diagram illustrating an operation of the matrix multiply device implemented to directly multiply the input matrix and the weight matrix of FIG. 1. Referring to FIGS. 1 and 2, the matrix multiply device MMD may calculate the output matrix YM by directly multiplying the input matrix XM and the weight matrix WM.

To calculate one output element included in the output matrix YM, the matrix multiply device MMD may perform floating point multiplication ‘n’ times, and then perform floating point summation ‘n−1’ times. For example, the matrix multiply device MMD may calculate one (e.g., y₁₁) of the output elements included in the output matrix YM in a manner similar to Equation 4 below.

$\begin{matrix} y_{11} = x_{11} \times w_{11} + x_{12} \times w_{21} + \dots + x_{1 n} \times w_{n 1} & (Equation 4) \end{matrix}$

In this way, to calculate one output vector (e.g., $\vec{Y_{1}}$) corresponding to one input vector (e.g., $\vec{X_{1}}$), the matrix multiply device MMD may have to perform the floating point multiplication operation ‘m×n’ times, and perform the floating point summation operation ‘m×(n−1)’ times. In this case, the operation speed of the artificial intelligence model driven based on the matrix multiply device MMD may decrease due to the excessive calculation amount performed by the matrix multiply device MMD.
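For illustration only, the operation counts stated above may be checked with the following minimal Python sketch. The function name direct_matvec and the sample values are hypothetical and are not part of the disclosure; the sketch simply applies Equation 4 to every output element of one output vector and counts the operations.

```python
def direct_matvec(x, w):
    """Multiply a length-n input vector x by an n-by-m weight matrix w.

    Each output element follows Equation 4. Returns the output vector
    together with the number of multiplications and additions performed.
    """
    n, m = len(w), len(w[0])
    assert len(x) == n
    mults = adds = 0
    y = []
    for j in range(m):
        acc = x[0] * w[0][j]          # first product, no addition needed
        mults += 1
        for i in range(1, n):
            acc += x[i] * w[i][j]     # one multiplication and one addition
            mults += 1
            adds += 1
        y.append(acc)
    return y, mults, adds


x = [1.0, 2.0, 3.0]                   # n = 3 input elements
w = [[1.0, 0.5],                      # n = 3 rows, m = 2 columns
     [2.0, -1.0],
     [0.0, 4.0]]
y, mults, adds = direct_matvec(x, w)
print(y, mults, adds)
```

With n equal to 3 and m equal to 2, the counters come out to m×n = 6 multiplications and m×(n−1) = 4 additions.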

FIGS. 3 and 4 are diagrams illustrating an operation of the uniform BCQ circuit of FIG. 1. FIG. 3 is illustrated as a tree, such as a binary tree; example embodiments are not limited thereto. First, referring to FIGS. 1 and 3, a horizontal axis represents a size of a weight included in the weight matrix WM.

The uniform BCQ circuit UBC may approximate one or more of the weights included in the weight matrix WM to one of a plurality of quantum levels QL.

The uniform BCQ circuit UBC may determine the number of the plurality of quantum levels QL based on a predetermined BCQ resolution. For example, the uniform BCQ circuit UBC may approximate each of the plurality of weights to one of $2^{R}$ (where R is the BCQ resolution) quantum levels QL. Each of the $2^{R}$ quantum levels QL may be determined based on a combination of a zero point value ZPV (null point value), R quantization scale coefficients QSC, and R quantization sign bits QSB.

Hereinafter, for a more concise explanation, example embodiments in which R is ‘3’ will be representatively described; however, example embodiments are not limited thereto. For example, the uniform BCQ circuit UBC may approximate one or more of the weights included in the weight matrix WM to first to eighth quantum levels QL1 to QL8. In this case, each of the first to eighth quantum levels QL1 to QL8 may be determined based on Equation 5 below.

$\begin{matrix} QL = ZPV + α_{1} \times {(- 1)}^{b_{1}} + α_{2} \times {(- 1)}^{b_{2}} + α_{3} \times {(- 1)}^{b_{3}} & (Equation 5) \end{matrix}$

Referring to Equation 5, QL may represent a quantum level, ZPV may represent the zero point value ZPV of the uniform BCQ, α₁ to α₃ may represent first to third quantization scale coefficients QSC1 to QSC3, respectively, and b₁ to b₃ may represent first to third quantization sign bits QSB1 to QSB3, respectively. In this case, the first to eighth quantum levels QL1 to QL8 may correspond to different combinations of the first to third quantization sign bits QSB1 to QSB3.

For example, further referring to FIG. 4, the first quantum level QL1 may correspond to the case where all of the first to third quantization sign bits QSB1 to QSB3 are ‘1’. In this case, the first quantum level QL1 may correspond to “ZPV−α₁−α₂−α₃”. Similarly, the eighth quantum level QL8 may correspond to the case where all of the first to third quantization sign bits QSB1 to QSB3 are ‘0’. In this case, the eighth quantum level QL8 may correspond to “ZPV+α₁+α₂+α₃”. In this way, the seventh quantum level QL7 may correspond to the case where the first to third quantization sign bits QSB1 to QSB3 are ‘1’, ‘0’, ‘0’ respectively. In this case, the seventh quantum level QL7 may correspond to “ZPV−α₁+α₂+α₃”. However, example embodiments are not limited thereto.
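For illustration only, the correspondence between sign-bit combinations and quantum levels in Equation 5 may be enumerated with the following minimal Python sketch. The function name quantum_levels and the numeric values chosen for ZPV and α₁ to α₃ are hypothetical.

```python
from itertools import product


def quantum_levels(zpv, alphas):
    """Map every sign-bit tuple (b1, ..., bR) to its level per Equation 5."""
    levels = {}
    for bits in product((0, 1), repeat=len(alphas)):
        # A sign bit of 0 adds the coefficient, a sign bit of 1 subtracts it.
        levels[bits] = zpv + sum(a * (-1) ** b for a, b in zip(alphas, bits))
    return levels


levels = quantum_levels(zpv=0.0, alphas=(0.5, 1.0, 2.0))  # illustrative values
ordered = sorted(levels.items(), key=lambda kv: kv[1])     # lowest level first
for bits, value in ordered:
    print(bits, value)
```

With these illustrative values, the lowest level corresponds to sign bits (1, 1, 1), the highest level to (0, 0, 0), and the second-highest level to (1, 0, 0), matching the first, eighth, and seventh quantum levels described above.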

Continuing to refer to FIG. 3, the intervals between the plurality of quantum levels QL may be uniform. For example, the intervals between adjacent quantum levels QL may be equal to each other. In this case, the plurality of quantization scale coefficients QSC may form a geometric sequence with a common ratio of ‘2’. For example, the second quantization scale coefficient QSC2 may be twice the first quantization scale coefficient QSC1, and the third quantization scale coefficient QSC3 may be twice the second quantization scale coefficient QSC2.

Each of the plurality of quantization scale coefficients QSC may be expressed as the product of the same common scale coefficient CSC but different multiplication scale coefficients MSC. For example, the first quantization scale coefficient QSC1 may be expressed as the product of the common scale coefficient CSC and a first multiplication scale coefficient MSC1, the second quantization scale coefficient QSC2 may be expressed as the product of the common scale coefficient CSC and the second multiplication scale coefficient MSC2, and the third quantization scale coefficient QSC3 may be expressed as the product of the common scale coefficient CSC and the third multiplication scale coefficient MSC3.

The multiplication scale coefficients MSC corresponding to the plurality of quantization scale coefficients QSC may be different consecutive powers of 2. For example, a k-th multiplication scale coefficient MSCk may be $2^{k-1}$. For a more detailed example, the first multiplication scale coefficient MSC1 may be 2⁰, the second multiplication scale coefficient MSC2 may be 2¹, and the third multiplication scale coefficient MSC3 may be 2².

Accordingly, each of the $2^{R}$ quantum levels QL may be determined based on a combination of the zero point value ZPV, one common scale coefficient CSC, R multiplication scale coefficients MSC, and R quantization sign bits QSB. For example, Equation 5 described above may also or alternatively be expressed as Equation 6 below.

$\begin{matrix} QL = ZPV + 2^{0} \times s \times {(- 1)}^{b_{1}} + 2^{1} \times s \times {(- 1)}^{b_{2}} + 2^{2} \times s \times {(- 1)}^{b_{3}} & (Equation 6) \end{matrix}$

Referring to Equations 5 and 6, 2⁰ to 2² may represent first to third multiplication scale coefficients MSC1 to MSC3 that are different powers of 2, and s may represent the common scale coefficient CSC.
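As a worked illustration of Equation 6 with R equal to 3, each term contributes plus or minus a power of 2 multiplied by s, so the offset of each quantum level from the zero point value ZPV is an odd multiple of s:

$\begin{matrix} QL - ZPV = s \times (\pm 2^{0} \pm 2^{1} \pm 2^{2}) \in \{ - 7s, - 5s, - 3s, - s, + s, + 3s, + 5s, + 7s\} \end{matrix}$

Adjacent quantum levels therefore differ by 2s, which is consistent with the uniform intervals between the quantum levels QL. For instance, the sign bits ‘1’, ‘0’, ‘0’ of the seventh quantum level QL7 give ZPV − s + 2s + 4s = ZPV + 5s.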

The uniform BCQ circuit UBC may approximate each of the weights included in the weight matrix WM to the quantum level having the nearest value among the plurality of quantum levels QL. For example, when the size of the weight “w₁₁” is nearest to the seventh quantum level QL7 from among the first to eighth quantum levels QL1 to QL8, the uniform BCQ circuit UBC may approximate the weight “w₁₁” to the seventh quantum level QL7. In this case, the uniform BCQ circuit UBC may represent the weight “w₁₁” as the combination of a zero point value ZPV, a common scale coefficient CSC, a plurality of multiplication scale coefficients MSC, and a plurality of quantization sign bits QSB corresponding to the seventh quantum level QL7. Similarly, the uniform BCQ circuit UBC may approximate the weights included in the weight matrix WM based on the zero point value ZPV, a plurality of common scale coefficients CSC, a plurality of multiplication scale coefficients MSC, and a plurality of quantization sign bits QSB.
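For illustration only, the nearest-level approximation described above may be sketched in Python as follows. The function name nearest_level, the exhaustive search over all sign-bit combinations, and the sample values are assumptions made for this sketch; the disclosure does not state how the uniform BCQ circuit UBC performs the search.

```python
from itertools import product


def nearest_level(weight, zpv, s, r=3):
    """Return (sign_bits, level) of the Equation 6 level closest to weight."""
    best = None
    for bits in product((0, 1), repeat=r):
        # k-th term: multiplication scale coefficient 2**k (k = 0 .. r-1),
        # common scale coefficient s, and the sign selected by bit b.
        level = zpv + sum((2 ** k) * s * (-1) ** b for k, b in enumerate(bits))
        if best is None or abs(weight - level) < abs(weight - best[1]):
            best = (bits, level)
    return best


bits, level = nearest_level(weight=0.6, zpv=0.0, s=0.125)
print(bits, level)
```

With s equal to 0.125, the quantum levels are the odd multiples of 0.125 from −0.875 to 0.875, so a weight of 0.6 is approximated to 0.625, which corresponds to the sign bits (1, 0, 0) of the seventh quantum level QL7.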

However, hereinafter, for a more concise description, example embodiments in which zero point value ZPV is ‘0’ will be representatively described. For example, the following will typically describe example embodiments in which the uniform BCQ circuit UBC performs the uniform binary coding quantization on each of the weights included in the weight matrix WM symmetrically. In particular, example embodiments in which the uniform BCQ circuit UBC approximates the weight matrix WM based on a plurality of common scale coefficients CSC, a plurality of multiplication scale coefficients MSC, and a plurality of quantization sign bits QSB will be described. The operation of the matrix multiply device MMD when the zero point value ZPV is not ‘0’ (e.g., the operation of the matrix multiply device MMD when the uniform BCQ circuit UBC performs uniform binary coding quantization asymmetrically) will be described in more detail with reference to FIGS. 47 to 58 below.
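For illustration only, when the zero point value ZPV is ‘0’, each approximated weight is a signed sum of the input-independent terms of Equation 6, so one output element may be obtained by adding or subtracting scaled input elements according to the quantization sign bits. The following minimal Python sketch shows this arithmetic only. The function name bcq_dot, the use of a single common scale coefficient for all weights, and the sample values are assumptions, and the sketch does not represent the circuit structure described with reference to the drawings.

```python
def bcq_dot(x, sign_bits, s, r=3):
    """Approximate sum(x[i] * w[i]) for w[i] = s * sum(2**k * (-1)**b[i][k])."""
    acc = 0.0
    for xi, bits in zip(x, sign_bits):
        for k in range(r):
            # Input element scaled by the multiplication scale coefficient
            # 2**k and by the common scale coefficient s.
            scaled = (2 ** k) * s * xi
            # The sign bit selects subtraction (1) or addition (0).
            acc += -scaled if bits[k] else scaled
    return acc


x = [1.0, 2.0]
sign_bits = [(1, 0, 0), (0, 1, 1)]    # approximated weights +5*s and -5*s
s = 0.125
approx = bcq_dot(x, sign_bits, s)
exact = sum(xi * wi for xi, wi in zip(x, [0.625, -0.625]))
print(approx, exact)
```

In this sketch the result obtained from the sign bits equals the product computed directly with the approximated weights 0.625 and −0.625.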

For a more concise description, in FIG. 3, example embodiments in which the BCQ resolution is ‘3’ are representatively described, but example embodiments are not limited thereto. For example, the BCQ resolution may be determined to be an integer greater than or equal to ‘2’ depending on the type of artificial intelligence model driven based on the matrix multiply device MMD.

In example embodiments, when the artificial intelligence model driven based on the matrix multiply device MMD is or includes (or is included in) a large language model (LLM), the BCQ resolution may be ‘3’. However, example embodiments are not limited thereto.

In example embodiments, when the artificial intelligence model driven based on the matrix multiply device MMD is or includes (or is included in) an image object identification model, the BCQ resolution may be ‘2’. However, example embodiments are not limited thereto.

In example embodiments, the uniform BCQ circuit UBC may perform the uniform binary coding quantization operation for each column of the weight matrix WM. For example, the uniform BCQ circuit UBC may determine different common scale coefficients CSC for each column of the weight matrix WM. In this case, the common scale coefficient CSC for the weights included in the first column of the weight matrix WM may be different from the common scale coefficient CSC for the weights included in the second column of the weight matrix WM. Example embodiments in which the uniform BCQ circuit UBC performs the binary coding quantization operation for each column of the weight matrix WM will be described in more detail below with reference to FIGS. 5 to 22. However, example embodiments are not limited thereto.
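For illustration only, determining a different common scale coefficient CSC for each column may be sketched in Python as follows. The rule used here (the largest absolute weight of the column divided by 2^R − 1, so that the outermost quantum level coincides with that weight) is an assumption made for this sketch and is not stated in the disclosure; the function name column_scales and the sample weights are likewise hypothetical.

```python
def column_scales(wm, r=3):
    """Return one common scale coefficient per column of the weight matrix wm.

    Assumed rule (not taken from the disclosure): max |w| of the column
    divided by 2**r - 1, the largest odd multiple reachable with r sign bits.
    """
    n, m = len(wm), len(wm[0])
    top = 2 ** r - 1                  # 7 when r is 3
    return [max(abs(wm[i][j]) for i in range(n)) / top for j in range(m)]


wm = [[1.75, -3.5],
      [-0.25, 1.0]]
scales = column_scales(wm)
print(scales)
```

For the sample weight matrix, the first column receives a common scale coefficient of 0.25 and the second column receives 0.5, so the two columns are quantized with different quantum levels.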

In various example embodiments, the uniform BCQ circuit UBC may perform the uniform binary coding quantization operation for each row of the weight matrix WM. For example, the uniform BCQ circuit UBC may determine different common scale coefficients CSC for each row of the weight matrix WM. In this case, the common scale coefficient CSC for the weights included in a first row of the weight matrix WM may be different from the common scale coefficient CSC for the weights included in a second row of the weight matrix WM. Example embodiments in which the uniform BCQ circuit UBC performs the binary coding quantization operation for each row of the weight matrix WM will be described in more detail below with reference to FIGS. 23 to 58. However, example embodiments are not limited thereto.

FIG. 5 is a diagram illustrating the operation of the uniform BCQ circuit of FIG. 1, which performs the binary coding quantization operation for each column of the weight matrix. Referring to FIGS. 1 to 5, the uniform BCQ circuit UBC may perform the uniform binary coding quantization operation for each column of the weight matrix WM.

First, the weight matrix WM may be expressed as Equation 7 below.

$\begin{matrix} WM = [\begin{matrix} w_{11} & w_{12} & \dots & w_{1 m} \\ w_{21} & w_{22} & \dots & w_{2 m} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ w_{n 1} & w_{n 2} & \dots & w_{nm} \end{matrix}] = [\begin{matrix} \vec{w_{c 1}} & \vec{w_{c 2}} & \dots & \vec{w_{cm}} \end{matrix}] & (Equation 7) \end{matrix}$

In this case, {right arrow over (w_cj)} may represent a j-th column vector of the weight matrix WM. For example, {right arrow over (w_cj)} may include w_1j to w_nj. ‘j’ may be an integer greater than or equal to 1 and less than or equal to m.

The BCQ circuit 200 may perform a binary coding quantization operation for each column of the weight matrix WM based on Equation 8 below.

$\begin{matrix} \vec{w_{cj}} \approx \sum_{k = 1}^{R} (α_{k_cj} \times \vec{B_{k_cj}}) & (Equation 8) \end{matrix}$

Referring to Equation 8, R may represent the BCQ resolution. The α_{k_cj} may represent a k-th quantization scale coefficient corresponding to the j-th column vector of the weight matrix WM.

The {right arrow over (B_{k_cj})} may represent a quantization sign vector corresponding to α_{k_cj}. For example, {right arrow over (B_{k_cj})} may represent a quantization sign vector corresponding to the k-th quantization scale coefficient of the j-th column vector of the weight matrix WM. The {right arrow over (B_{k_cj})} may include a plurality of values equal to −1 raised to the power of a plurality of corresponding quantization sign bits corresponding to the k-th quantization scale coefficient of the j-th column vector. These values may be either −1 or 1. More specifically, {right arrow over (B_{k_cj})} may be expressed as Equation 9 below.

$\begin{matrix} \vec{B_{k_cj}} = [\begin{matrix} {(- 1)}^{b_{1_k_cj}} \\ {(- 1)}^{b_{2_k_cj}} \\ ⋮ \\ {(- 1)}^{b_{n_k_cj}} \end{matrix}] & (Equation 9) \end{matrix}$

Each of the b_{1_k_cj} to b_{n_k_cj} may represent different quantization sign bits QSB. More specifically, each of the b_{1_k_cj} to b_{n_k_cj} may be a k-th quantization sign bit for weights arranged in different rows of the weight matrix WM. b_{i_k_cj} may represent the k-th quantization sign bit for the weight at the i-th row and j-th column of the weight matrix WM. Each of the b_{1_k_cj} to b_{n_k_cj} may be ‘0’ or ‘1’. For each weight in the weight matrix WM, there may be R quantization scale coefficients and R quantization sign bits QSB.

The calculation of Equation 8 may be implemented through the use of circuitry (e.g. an arithmetic logic unit) that adds or subtracts values based on the values of the quantization sign bits. Accordingly, to implement this, the circuitry may receive a quantization sign bit vector {right arrow over (QSB_{k_cj})}. The {right arrow over (QSB_{k_cj})} may include a plurality of quantization sign bits corresponding to the k-th quantization scale coefficient of the j-th column vector. More specifically, {right arrow over (QSB_{k_cj})} may be expressed as Equation 9.1 below.

$\begin{matrix} \vec{{QSB}_{k_cj}} = [\begin{matrix} b_{1_k_cj} \\ b_{2_k_cj} \\ ⋮ \\ b_{n_k_cj} \end{matrix}] & (Equation 9.1) \end{matrix}$
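As a minimal sketch of the relationship between Equation 9 and Equation 9.1, the following Python example shows that adding or subtracting values according to the quantization sign bits gives the same result as multiplying by the quantization sign vector. The function names and the sample values are hypothetical and are used only for illustration.

```python
def sign_vector(bits):
    """Map quantization sign bits (0 or 1) to signs (+1 or -1), as in Equation 9."""
    return [(-1) ** b for b in bits]


def add_sub_accumulate(values, bits):
    """Add a value when its sign bit is 0 and subtract it when the bit is 1."""
    acc = 0.0
    for v, b in zip(values, bits):
        acc = acc - v if b else acc + v
    return acc


qsb = [0, 1, 1, 0]           # hypothetical quantization sign bit vector
x = [1.5, 2.0, -0.5, 4.0]    # hypothetical operand values

# Reference result using explicit multiplication by +1 or -1.
dot = sum(v * s for v, s in zip(x, sign_vector(qsb)))

print(sign_vector(qsb))             # [1, -1, -1, 1]
print(add_sub_accumulate(x, qsb))   # 4.0
print(dot)                          # 4.0
```

In this sketch no multiplication is needed in `add_sub_accumulate`; each sign bit only selects between an addition and a subtraction.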

Meanwhile, α_{k_cj} may be expressed as the product of the common scale coefficient and the multiplication scale coefficient. For example, α_{k_cj} may be expressed as 2^{k-1} × s_cj. Accordingly, the above-described Equation 8 may alternatively or additionally be expressed as the following Equation 10.

$\begin{matrix} \vec{w_{cj}} \approx s_{cj} \times \sum_{k = 1}^{R} (2^{k - 1} \times \vec{B_{k_cj}}) & (Equation 10) \end{matrix}$

Referring to Equation 10, 2^{k-1} may represent the k-th multiplication scale coefficient, and s_cj may represent the common scale coefficient for the (j)-th column vector of the weight matrix WM.

For example, the uniform BCQ circuit UBC may approximate each weight of the weight matrix WM based on one or more common scale coefficients CSC (e.g., “s” values), the plurality of multiplication scale coefficients MSC (e.g., power of 2), and the plurality of quantization sign bits QSB (e.g., “b” values). The same common scale coefficient CSC may be used for each column, but different common scale coefficients CSC may be used for different columns.
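As an illustrative sketch of Equation 10, the following Python example reconstructs the approximated value of one weight from a common scale coefficient and R quantization sign bits, and enumerates every value that can be represented when the BCQ resolution is ‘3’. The function name and the value chosen for the common scale coefficient are assumptions made for illustration.

```python
from itertools import product


def reconstruct_weight(s, bits):
    """Approximate one weight as s * sum_k 2**(k-1) * (-1)**b_k (Equation 10).

    bits[k-1] is the k-th quantization sign bit of this weight.
    """
    # enumerate() starts at 0, so 2**k here equals 2**(k-1) for k = 1..R.
    return s * sum((2 ** k) * (-1) ** b for k, b in enumerate(bits))


R = 3          # BCQ resolution
s_cj = 0.25    # hypothetical common scale coefficient for one column

levels = sorted({reconstruct_weight(s_cj, bits)
                 for bits in product((0, 1), repeat=R)})
print(levels)
# [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]
```

The 2^R representable values are evenly spaced with a step of 2 × s_cj, which is consistent with the quantization being uniform.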

In this way, for various example embodiments the uniform BCQ circuit UBC may generate the plurality of common scale coefficients CSC, the plurality of multiplication scale coefficients MSC, and the plurality of quantization sign bits QSB based on the weight matrix WM. The uniform BCQ circuit UBC may provide the plurality of common scale coefficients CSC, the plurality of multiplication scale coefficients MSC, and the plurality of quantization sign bits QSB to the matrix multiplier 100. Hereinafter, the operation of the matrix multiplier 100 performing the matrix multiplication operation will be described based on the plurality of common scale coefficients CSC, the plurality of multiplication scale coefficients MSC, and the plurality of quantization sign bits QSB.

FIG. 6 is a diagram illustrating an operation of a matrix multiplier according to various example embodiments of FIG. 5. Referring to FIGS. 1 to 6, the matrix multiplier 100 may calculate the output matrix YM based on the plurality of common scale coefficients CSC, the plurality of multiplication scale coefficients MSC, and the plurality of quantization sign bits QSB.

Hereinafter, for a more concise explanation, the operation of the matrix multiplier 100 calculating the output element (e.g., y₁₁) of the first row and first column of the output matrix YM by multiplying the first input vector (e.g., {right arrow over (X₁)}) and the first column vector (e.g., {right arrow over (w_c1)}) of the approximated weight matrix WM based on the plurality of common scale coefficients CSC, the plurality of multiplication scale coefficients MSC, and the plurality of quantization sign bits QSB will be representatively described. However, example embodiments are not limited thereto.

Referring to Equation 10, the matrix multiplier 100 may calculate y₁₁ according to Equation 11 below.

$\begin{matrix} \begin{matrix} y_{11} \approx \vec{X_{1}} \times s_{c 1} \times (2^{0} \times \vec{B_{1_c 1}} + 2^{1} \times \vec{B_{2_c 1}} + \dots + 2^{R - 1} \times \vec{B_{R_c 1}}) \\ \approx s_{c 1} \times \{ (2^{0} \times \vec{X_{1}}) \times \vec{B_{1_c 1}} + (2^{1} \times \vec{X_{1}}) \times \vec{B_{2_c 1}} + \dots + (2^{R - 1} \times \vec{X_{1}}) \times \vec{B_{R_c 1}} \} \end{matrix} & (Equation 11) \end{matrix}$

Referring to Equation 11, first, the matrix multiplier 100 may multiply each input vector (e.g., {right arrow over (X₁)}) by each of the plurality of multiplication scale coefficients (e.g., powers of 2). For example, the matrix multiplier 100 may multiply each multiplication scale coefficient by the plurality of input elements (e.g., x₁₁ to x_1n) included in the input vector to generate a multiplication scaled input vector (hereinafter referred to as “multiplication scaled input vector MSX”). In this case, the multiplication scaled input vector MSX may include a plurality of multiplication scaled input elements (hereinafter, referred to as “multiplication scaled input elements MSIE”).

The data type of the plurality of input elements included in the input vector may be floating point. In this case, the matrix multiplier 100 may have to perform a total of (R×n) times of floating point multiplication operations for a power of 2, in order to calculate one output element (for example, y₁₁).

In various example embodiments, the floating point multiplication operation for the power of 2 may be performed with less calculation amount (e.g. reduced computation). For example, the matrix multiplier 100 may perform the floating point multiplication operation for the power of 2 by changing an exponent part of the floating point data type. Where the input elements are represented in floating point using base 2, multiplication with a multiplication scale coefficient that is a power of 2 can be achieved efficiently through simple addition of the respective exponents. This therefore allows efficient multiplication. The operation of the matrix multiplier 100, which performs the floating point multiplication operation for the power of 2, is described in more detail below with reference to FIG. 10.
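As a minimal sketch of this idea, the following Python example uses the standard library functions `math.frexp` and `math.ldexp` to multiply a floating point value by the k-th multiplication scale coefficient by changing only the exponent. The function name is hypothetical, and the sketch assumes a normal (not subnormal) value whose scaled result does not overflow.

```python
import math


def scale_by_power_of_two(x, k):
    """Multiply x by the k-th multiplication scale coefficient 2**(k-1)
    by adjusting only the exponent of x."""
    mantissa, exponent = math.frexp(x)    # x == mantissa * 2**exponent
    return math.ldexp(mantissa, exponent + (k - 1))


x = 0.3
for k in (1, 2, 3):
    scaled = scale_by_power_of_two(x, k)
    # The mantissa is unchanged; only the exponent grows by k - 1.
    assert math.frexp(scaled)[0] == math.frexp(x)[0]
    assert math.frexp(scaled)[1] == math.frexp(x)[1] + (k - 1)
    assert scaled == x * 2 ** (k - 1)
```

Because the mantissa is left untouched, this scaling introduces no rounding error under the stated assumptions.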

Thereafter, the matrix multiplier 100 may calculate a partial product PSP by adding the results of multiplying each of the plurality of scaled input vectors MSX by different quantization sign vectors (e.g., one of {right arrow over (B)}). Hereinafter, for a more concise description, the partial product for the j-th column of the weight matrix WM and the i-th input vector (e.g., {right arrow over (X₁)}) will be referred to as “PSPij”.

In some example embodiments, the product of one multiplication scaled input vector MSX and one quantization sign vector (e.g., one {right arrow over (B)}) may be calculated by sequentially adding or subtracting the values of the plurality of multiplication scaled input elements MSIE based on the plurality of quantization sign bits QSB. This may be based on a received plurality of quantization sign bits (e.g. in the form of a quantization sign bit vector {right arrow over (QSB)}). Thereafter, the matrix multiplier 100 may calculate a partial product PSP by adding the results of multiplying the multiplication scaled input vector MSX and the quantization sign vectors. Therefore, in order to calculate one partial product PSP, the matrix multiplier 100 may have to perform the floating point summation operation ‘R×n’ times to obtain the R products and the floating point summation operation ‘R−1’ times to add the R products together.

In various example embodiments, when the data type of each of the plurality of input elements is converted to fixed point, the matrix multiplier 100 may calculate the partial product PSP with fewer calculations. For example, when the data type of each of the plurality of input elements is the fixed point data type corresponding to the same exponent value, the matrix multiplier 100 may have to perform the fixed point summation operation ‘R×n’ times and the fixed point summation operation ‘R−1’ times to calculate the partial product PSP. The detailed operation of the matrix multiplier 100, which converts the data type of each of the plurality of input elements to fixed point, will be described in more detail with reference to FIGS. 7 to 22 below.

The matrix multiplier 100 may calculate one output element (e.g., y₁₁) by multiplying a partial product PSP11 and a common scale coefficient (e.g., s_c1) for the first column vector of the weight matrix WM. For example, the matrix multiplier 100 may calculate the output element (e.g., y₁₁) by performing the floating point multiplication of the partial product PSP11 and the common scale coefficient (e.g., s_c1) once.
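The calculation order of Equation 11 may be sketched in Python as follows. This is an illustrative model only, not the circuit itself: the function name, the operation counters, and the sample values are assumptions, and the power-of-2 scaling is modeled with `math.ldexp`.

```python
import math


def output_element(x, sign_bits, s, R):
    """Compute y ~= s * sum_k (2**(k-1) * x) . B_k and count operation types.

    x         : input vector of n floats
    sign_bits : R lists of n quantization sign bits (sign_bits[k-1][i])
    s         : common scale coefficient of the column
    """
    counts = {"exponent_shift": 0, "add_sub": 0,
              "combine_add": 0, "full_multiply": 0}
    per_k = []
    for k in range(1, R + 1):
        acc = 0.0
        for xi, b in zip(x, sign_bits[k - 1]):
            scaled = math.ldexp(xi, k - 1)   # power-of-2 scaling, exponent only
            counts["exponent_shift"] += 1
            acc = acc - scaled if b else acc + scaled
            counts["add_sub"] += 1
        per_k.append(acc)

    psp = per_k[0]                           # partial product PSP
    for r in per_k[1:]:
        psp += r
        counts["combine_add"] += 1

    y = s * psp                              # the only full multiplication
    counts["full_multiply"] += 1
    return y, counts


x1 = [1.0, -2.0, 0.5]              # hypothetical input vector, n = 3
bits = [[0, 1, 0], [1, 1, 0]]      # hypothetical sign bits, R = 2
y11, counts = output_element(x1, bits, s=0.5, R=2)
print(y11)      # 3.25
print(counts)   # exponent_shift 6, add_sub 6, combine_add 1, full_multiply 1
```

With these sample values the approximated column is (−0.5, −1.5, 1.5), so a direct dot product also gives 3.25; the sketch reaches the same value with R×n exponent adjustments, R×n additions or subtractions, R−1 combining additions, and one full floating point multiplication.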

As noted above, the R×n floating point multiplications for the power of 2 can be performed efficiently through simple addition of exponents. As a result, the matrix multiplier 100 may calculate one output element (e.g., y₁₁) by performing full floating point multiplication only once. In this case, unlike previously described with reference to FIG. 2, the number of times of floating point multiplication operation performed by the matrix multiplier 100 may be reduced or minimized. Therefore, according to various example embodiments illustrated in FIGS. 5 and 6, the operation speed of the matrix multiply device MMD may be improved.

FIG. 7 is a block diagram illustrating a configuration of the matrix multiplier of FIG. 1 according to various example embodiments relative to FIG. 4. Referring to FIGS. 1 to 7, the matrix multiplier 100 may include a multiplication scale coefficient buffer 110, an input vector scaler 120, a first data type converter 130, a quantization sign bit buffer 140, a processing element array 150, a second data type converter 160, a common scale coefficient buffer 170, and a common scaler 180.

The multiplication scale coefficient buffer 110 may store the plurality of multiplication scale coefficients MSC provided from the uniform BCQ circuit UBC. The multiplication scale coefficient buffer 110 may provide the plurality of multiplication scale coefficients MSC to the input vector scaler 120.

The input vector scaler 120 may receive the input matrix XM. For example, the input vector scaler 120 may receive the plurality of input vectors (e.g., {right arrow over (X₁)} to {right arrow over (X_h)}) each including a plurality of input elements.

The input vector scaler 120 may perform multiplication scaling on the input matrix XM based on the plurality of multiplication scale coefficients MSC. For example, the input vector scaler 120 may generate the plurality of multiplication scaled input vectors MSX based on the plurality of input vectors. In this case, the plurality of multiplication scaled input vectors MSX may correspond to the plurality of input vectors (e.g., {right arrow over (X₁)} to {right arrow over (X_h)}), respectively. Hereinafter, the multiplication scaled input vectors corresponding to {right arrow over (X₁)} to {right arrow over (X_h)} will be referred to as {right arrow over (MSX₁)} to {right arrow over (MSX_h)}, respectively.

In example embodiments, the plurality of multiplication scaled input vectors MSX may be included in a multiplication scaled input matrix. In this case, a row size of the multiplication scaled input matrix may be an integer multiple of a row size of the input matrix XM, and a column size of the multiplication scaled input matrix may be the same as a column size of the input matrix XM. However, example embodiments are not limited thereto.

Each of the plurality of multiplication scaled input vectors MSX may be implemented as a row vector whose dimension is ‘R’ times that of the corresponding input vector. For example, the first multiplication scaled input vector (e.g., {right arrow over (MSX₁)}) corresponding to the first input vector (e.g., {right arrow over (X₁)}) may include ‘R×n’ multiplication scaled input elements MSIE as illustrated in Equation 12 below.

$\begin{matrix} \vec{{MSX}_{1}} ∋ (2^{0} \times x_{11}), (2^{1} \times x_{11}), \dots, (2^{R - 1} \times x_{11}), (2^{0} \times x_{12}), (2^{1} \times x_{12}), \dots, (2^{R - 1} \times x_{12}), \dots, (2^{0} \times x_{1 n}), (2^{1} \times x_{1 n}), \dots, (2^{R - 1} \times x_{1 n}) & (Equation 12) \end{matrix}$

Referring to Equation 12, the plurality of multiplication scaled input elements MSIE included in the first multiplication scaled input vector {right arrow over (MSX₁)} may be generated by multiplying each of the plurality of input elements included in the first input vector by a number of multiplication scale coefficients equal to the BCQ resolution (e.g., R multiplication scale coefficients).
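As a short sketch of Equation 12, the following Python example generates the ‘R×n’ multiplication scaled input elements in the order listed in the equation. The function name and the sample input vector are hypothetical.

```python
def multiplication_scaled_input_vector(x, R):
    """Return the R*n elements of Equation 12 in the listed order:
    (2**0 * x_1), ..., (2**(R-1) * x_1), (2**0 * x_2), ..."""
    return [(2 ** (k - 1)) * xi for xi in x for k in range(1, R + 1)]


x1 = [1.5, -0.25]     # hypothetical first input vector, n = 2
msx1 = multiplication_scaled_input_vector(x1, R=3)
print(msx1)           # [1.5, 3.0, 6.0, -0.25, -0.5, -1.0]
print(len(msx1))      # 6, that is R * n
```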

In this way, each of the second to (h)-th multiplication scaled input vectors (e.g., {right arrow over (MSX₂)} to {right arrow over (MSX_h)}) may include ‘R×n’ multiplication scaled input elements. For concise description, detailed description of the multiplication scaled input elements included in each of the second to (h)-th scaled input vectors (e.g., {right arrow over (MSX₂)} to {right arrow over (MSX_h)}) will be omitted. The configuration and operation of the input vector scaler 120 will be described in more detail with reference to FIGS. 8 to 10 below.

In example embodiments, the data type of each multiplication scaled input element MSIE may be floating point. However, example embodiments are not limited thereto.

The first data type converter 130 may receive the plurality of multiplication scaled input vectors MSX. For example, the first data type converter 130 may receive first to (h)-th multiplication scaled input vectors (e.g., {right arrow over (MSX₁)} to {right arrow over (MSX_h)}). Each of the first to (h)-th multiplication scaled input vectors may include the plurality of multiplication scaled input elements MSIE.

The first data type converter 130 may extract an exponent EXP from each of the plurality of multiplication scaled input vectors MSX. For example, the first data type converter 130 may extract a first exponent from the first multiplication scaled input vector {right arrow over (MSX₁)} and extract a second exponent from a second multiplication scaled input vector {right arrow over (MSX₂)}.

The first data type converter 130 may provide the extracted exponents EXP to the second data type converter 160. The first data type converter 130 may convert the data type of the plurality of multiplication scaled input vectors MSX to fixed point. For example, the first data type converter 130 may receive the plurality of multiplication scaled input vectors MSX and output a plurality of fixed point multiplication scaled input vectors MSXfxp. In some examples, the first data type converter 130 may generate first to (h)-th fixed point multiplication scaled input vectors (e.g., {right arrow over (MSX′₁)} to {right arrow over (MSX′_h)}), respectively, based on the first to (h)-th multiplication scaled input vectors.

More specifically, the first data type converter 130 converts the data type of each of the plurality of multiplication scaled input elements MSIE included in the multiplication scaled input vector MSX to fixed point based on the extracted exponent. In this case, the first fixed point multiplication scaled input vector (e.g., {right arrow over (MSX′₁)}) may include elements of Equation 12 described above that are converted into fixed point form (hereinafter, the elements may be referred to as the fixed point multiplication scaled input elements MSIEfxp). The configuration and operation of the first data type converter 130 will be described in more detail below with reference to FIGS. 11 to 13.

The quantization sign bit buffer 140 may store the plurality of quantization sign bits QSB provided from the uniform BCQ circuit UBC. The quantization sign bit buffer 140 may provide the plurality of quantization sign bits QSB to the processing element array 150.

The processing element array 150 may receive the plurality of quantization sign bits QSB and the plurality of fixed point multiplication scaled input vectors MSXfxp. The processing element array 150 may calculate a plurality of fixed point partial products PSPfxp based on the plurality of fixed point multiplication scaled input elements MSIEfxp included in the plurality of fixed point multiplication scaled input vectors MSXfxp and on the plurality of quantization sign bits QSB.

The processing element array 150 may include the plurality of processing elements arranged in a row direction and a column direction. Each of the plurality of processing elements may calculate different fixed point partial products PSPfxp. The configuration and operation of the processing element array 150 are described in more detail below with reference to FIGS. 14 to 17.

The second data type converter 160 may receive a plurality of exponents EXP from the first data type converter 130. The second data type converter 160 may receive the plurality of fixed point partial products PSPfxp from the processing element array 150. The second data type converter 160 may convert the data type of the plurality of fixed point partial products PSPfxp to floating point based on the plurality of exponents EXP. In some examples, the second data type converter 160 may output the plurality of partial products PSP having the floating point data type. The detailed configuration and operation of the second data type converter 160 will be described in more detail below with reference to FIG. 18.

The common scale coefficient buffer 170 may store the plurality of common scale coefficients CSC provided from the uniform BCQ circuit UBC. The common scale coefficient buffer 170 may provide the plurality of common scale coefficients CSC to the common scaler 180.

The common scaler 180 may receive the plurality of common scale coefficients CSC and the plurality of partial products PSP. The common scaler 180 may scale the plurality of partial products PSP based on the plurality of common scale coefficients CSC. For example, the common scaler 180 may generate the plurality of output vectors included in the output matrix YM by multiplying each of the plurality of partial products PSP with a corresponding common scale coefficient CSC. The detailed configuration and operation of the common scaler 180 will be described in more detail below with reference to FIG. 15.
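The data flow through the blocks of FIG. 7 may be sketched in Python for a single output element as follows. This is a simplified software model under stated assumptions, not the disclosed circuitry: the function name, the fixed point fraction width `FRAC_BITS`, and the rounding method are assumptions, and the shared exponent is taken to be the largest element exponent, as described for the exponent extraction circuit with reference to FIG. 12.

```python
import math

FRAC_BITS = 16   # assumed fixed point fraction width; not specified in the text


def matmul_element(x, sign_bits, s, R, frac_bits=FRAC_BITS):
    """Model one output element, following the block order of FIG. 7."""
    # Input vector scaler 120: power-of-2 scaling of every input element.
    msx = [math.ldexp(xi, k - 1) for xi in x for k in range(1, R + 1)]

    # First data type converter 130: shared exponent, then fixed point integers.
    exp = max(math.frexp(v)[1] for v in msx)
    msx_fxp = [round(math.ldexp(v, frac_bits - exp)) for v in msx]

    # Processing element: integer add or subtract selected by the sign bits.
    psp_fxp = 0
    for i in range(len(x)):
        for k in range(1, R + 1):
            v = msx_fxp[i * R + (k - 1)]
            psp_fxp += -v if sign_bits[k - 1][i] else v

    # Second data type converter 160: back to floating point using the exponent.
    psp = math.ldexp(float(psp_fxp), exp - frac_bits)

    # Common scaler 180: one floating point multiplication.
    return s * psp


y11 = matmul_element([1.0, -2.0, 0.5], [[0, 1, 0], [1, 1, 0]], s=0.5, R=2)
print(y11)   # 3.25
```

In this model all accumulation is done on integers, and floating point arithmetic is needed only for the final common scaling. For inputs that are not exactly representable with the assumed fraction width, the conversion to fixed point introduces a rounding error.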

FIG. 8 is a block diagram illustrating a configuration of an input vector scaler of FIG. 7. Referring to FIGS. 1 to 8, the input vector scaler 120 may include first to (h)-th multiplication scaling circuits 121 to 12h.

Each of the first to (h)-th multiplication scaling circuits 121 to 12h may receive different input vectors. For example, each of the first to (h)-th multiplication scaling circuits 121 to 12h may receive the first to (h)-th input vectors (e.g., {right arrow over (X₁)} to {right arrow over (X_h)}), respectively.

Each of the first to (h)-th multiplication scaling circuits 121 to 12h may sequentially receive the plurality of input elements. For example, the first multiplication scaling circuit 121 may sequentially receive x₁₁ to x_1n, and the (h)-th multiplication scaling circuit 12h may sequentially receive x_h1 to x_hn. The multiplication scaling circuits 121 to 12h may operate in parallel to each other.

Each of the first to (h)-th multiplication scaling circuits 121 to 12h may sequentially receive the plurality of multiplication scale coefficients MSC from the multiplication scale coefficient buffer 110. For example, each of the first to (h)-th multiplication scaling circuits 121 to 12h may receive the first-to-Rth multiplication scale coefficients MSC1 to MSCR (e.g., 2⁰ to 2^{R-1}).

The first to (h)-th multiplication scaling circuits 121 to 12h may generate first to (h)-th multiplication scaled input vectors (e.g., {right arrow over (MSX₁)} to {right arrow over (MSX_h)}) respectively. For example, the first multiplication scaling circuit 121 may generate the first multiplication scaled input vector (e.g., {right arrow over (MSX₁)}) based on x₁₁ to x_1n and the first-to-Rth multiplication scale coefficients MSC1 to MSCR.

FIG. 9 is a diagram illustrating in more detail an operation of the multiplication scaling circuit of FIG. 8. Hereinafter, for a more concise description, the operation of the first multiplication scaling circuit 121 will be representatively described with reference to FIGS. 1 to 9. However, example embodiments are not limited thereto, and the second to (h)-th multiplication scaling circuits 122 to 12h may operate in a similar manner.

The first multiplication scaling circuit 121 may receive the first input vector (e.g., {right arrow over (X₁)}). In particular, the first multiplication scaling circuit 121 may sequentially receive x₁₁ to x_1n. The first multiplication scaling circuit 121 may sequentially receive the first-to-Rth multiplication scale coefficients MSC1 to MSCR. The first multiplication scaling circuit 121 may generate the plurality of multiplication scaled input elements MSIE by multiplying each of the received input elements with the first-to-Rth multiplication scale coefficients MSC1 to MSCR.

For example, the first multiplication scaling circuit 121 may generate the plurality of multiplication scaled input elements MSIE11_1 to MSIE11_R by multiplying the input element “x₁₁” with the first-to-Rth multiplication scale coefficients MSC1 to MSCR, respectively. This is illustrated in FIG. 9 as a diagonal stripe pattern.

Similarly, the first multiplication scaling circuit 121 may generate the multiplication scaled input elements MSIE12_1 to MSIE12_R by multiplying the input element “x₁₂” by the first-to-Rth multiplication scale coefficients MSC1 to MSCR, respectively. This is illustrated in FIG. 9 as a dot pattern.

In this way, the first multiplication scaling circuit 121 may sequentially calculate the plurality of multiplication scaled input elements MSIE corresponding to x₁₃ to x_1n.

The first multiplication scaling circuit 121 may sequentially output the plurality of multiplication scaled input elements MSIE. For example, the first multiplication scaling circuit 121 may provide the plurality of multiplication scaled input elements MSIE to the first data type converter 130.

In particular, according to various example embodiments of the present disclosure, the first multiplication scaling circuit 121 may generate the plurality of multiplication scaled input elements (e.g., multiplication scaled input elements (MSIEs for x₁₁) illustrated as diagonal stripes) based on one input element (e.g., x₁₁). In some cases, the first multiplication scaling circuit 121 may generate the plurality of multiplication scaled input elements MSIE by repeatedly using one input element. This may be achieved through the use of a buffer or memory. Therefore, according to various example embodiments, the input reuse of the matrix multiplier 100 may be increased or maximized, and thus, the number of times the matrix multiplier 100 receives input elements from the outside may be reduced or minimized. In this case, the number of times the matrix multiplier 100 accesses an external memory device that stores the input elements may be reduced or minimized, so the operation efficiency and operating speed of the matrix multiply device MMD may be improved.
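The input reuse described above may be sketched in Python as follows. In this illustrative model, each input element is fetched once and then used R times; the generator names and the fetch log standing in for accesses to an external memory device are hypothetical.

```python
def multiplication_scaling_circuit(input_elements, coefficients):
    """Yield the scaled elements for each input element, fetching each input once."""
    for x in input_elements:          # one fetch per input element
        for msc in coefficients:      # the held element is reused R times
            yield x * msc


def counting_source(values, fetch_log):
    """Hypothetical input source that records every fetch of an element."""
    for v in values:
        fetch_log.append(v)
        yield v


fetch_log = []
msc = [2 ** (k - 1) for k in range(1, 4)]    # R = 3: coefficients 1, 2, 4
msie = list(multiplication_scaling_circuit(
    counting_source([1.5, -0.25], fetch_log), msc))

print(msie)             # [1.5, 3.0, 6.0, -0.25, -0.5, -1.0]
print(len(fetch_log))   # 2
```

Six multiplication scaled input elements are produced from two fetches, so the number of fetches grows with n rather than with R×n.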

FIG. 10 is a diagram illustrating in more detail a method of performing, by the multiplication scaling circuit of FIG. 8, a multiplication scaling operation. Hereinafter, for a more concise description, a method in which the first multiplication scaling circuit 121 generates one multiplication scaled input element MSIE will be representatively described with reference to FIGS. 1 to 10. However, example embodiments are not limited thereto.

The first multiplication scaling circuit 121 may receive an input element IE and the multiplication scale coefficient MSC.

The input element IE may have the floating point data type. Floating point arithmetic may represent a number using an integer of fixed precision (a significand or mantissa) and an integer exponent of a fixed base (e.g. base 2). For example, the input element IE may include a sign part SP (e.g. representing whether the input element IE is positive or negative), an exponent part EXPP (e.g. representing an exponent of the input element IE), and a mantissa part MTSP (e.g. representing a mantissa (otherwise known as a significand) of the input element IE). The input element IE may be in a floating point data type of base 2.

The multiplication scale coefficient MSC may be a power of 2. For example, the multiplication scale coefficient MSC may be one of 2⁰ to 2^{R-1}.

The first multiplication scaling circuit 121 may generate the multiplication scaled input element MSIE by multiplying the input element IE and the multiplication scale coefficient MSC. In this case, the first multiplication scaling circuit 121 may change the value of the exponent part EXPP of the input element IE to generate the multiplication scaled input element MSIE. For example, the first multiplication scaling circuit 121 may perform the floating point multiplication operation on the power of 2 by increasing the value of the exponent part EXPP of the input element IE as much as the size corresponding to the multiplication scale coefficient MSC.

For example, when the first multiplication scaling circuit 121 calculates the product of the input element IE and the k-th multiplication scale coefficient MSCk (e.g., 2^{k-1}), the first multiplication scaling circuit 121 may generate the multiplication scaled input element MSIE by increasing the value of the exponent part EXPP of the input element IE by ‘k−1’.

For a more detailed example, when the lowest three bits of the exponent part EXPP of the input element IE are ‘100’ (e.g. representing 2⁴) and the first multiplication scaling circuit 121 calculates the product of the input element IE and the second multiplication scale coefficient MSC2 (e.g., 2¹), the first multiplication scaling circuit 121 may generate the multiplication scaled input element MSIE by changing the lowest three bits of the exponent part EXPP of the input element IE to ‘101’ (e.g., by increasing the value of the exponent part EXPP by ‘1’), representing 2⁵.
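A bit-level sketch of this exponent adjustment is given below in Python. It assumes, for illustration only, that the input element IE uses the IEEE 754 single precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits); the text does not specify a particular floating point format. Because that layout stores a biased exponent, the stored bit pattern may differ from the ‘100’ example above. The function name is hypothetical, and zero, subnormal, and overflowing values are rejected rather than handled.

```python
import struct


def scale_float32_by_exponent(x, k):
    """Multiply a float32 value by 2**(k-1) by adding k-1 to its exponent field."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    exponent_field = (bits >> 23) & 0xFF          # exponent part EXPP
    new_exponent = exponent_field + (k - 1)
    if exponent_field == 0 or new_exponent >= 0xFF:
        raise ValueError("zero, subnormal, or overflowing input is not handled")
    # Keep the sign part and mantissa part, replace only the exponent part.
    bits = (bits & 0x807FFFFF) | (new_exponent << 23)
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    return y


print(scale_float32_by_exponent(1.5, 2))      # 3.0
print(scale_float32_by_exponent(-0.375, 3))   # -1.5
```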

Therefore, according to various example embodiments of the present disclosure, the input vector scaler 120 will be able to generate the first to (h)-th multiplication scaled input vectors (that is, {right arrow over (MSX₁)} to {right arrow over (MSX_h)}) with a very small computational amount. However, example embodiments are not limited thereto.

FIG. 11 is a diagram illustrating a configuration of a first data type converter of FIG. 7. Referring to FIGS. 1 to 11, the first data type converter 130 may include first to (h)-th exponent extraction circuits 131_1 to 131_h and first to (h)-th data type conversion circuits 132_1 to 132_h.

Each of the first to (h)-th exponent extraction circuits 131_1 to 131_h may receive different multiplication scaled input vectors MSX. For example, the first to (h)-th exponent extraction circuits 131_1 to 131_h may receive first to (h)-th multiplication scaled input vectors (e.g., {right arrow over (MSX₁)} to {right arrow over (MSX_h)}) respectively.

Each of the first to (h)-th exponent extraction circuits 131_1 to 131_h may extract an exponent from a plurality of received multiplication scaled input elements MSIE. For example, the first exponent extraction circuit 131_1 may extract a first exponent EXP1 from the multiplication scaled input elements MSIE included in the first multiplication scaled input vector (e.g., {right arrow over (MSX₁)}), and the second exponent extraction circuit 131_2 may extract a second exponent EXP2 from the multiplication scaled input elements MSIE included in the second multiplication scaled input vector (e.g., {right arrow over (MSX₂)}). In this way, the first to (h)-th exponent extraction circuits 131_1 to 131_h may extract first to (h)-th exponents EXP1 to EXPh, respectively.

The first to (h)-th exponent extraction circuits 131_1 to 131_h may provide the extracted exponents to the first to (h)-th data type conversion circuits 132_1 to 132_h, respectively. Alternatively or additionally, each of the first to (h)-th exponent extraction circuits 131_1 to 131_h may provide the extracted exponent to the second data type converter 160. The detailed operation of the first to (h)-th exponent extraction circuits 131_1 to 131_h will be described in more detail below with reference to FIG. 12.

The first to (h)-th data type conversion circuits 132_1 to 132_h may receive the first to (h)-th exponents EXP1 to EXPh, respectively. The first to (h)-th data type conversion circuits 132_1 to 132_h may receive the first to (h)-th multiplication scaled input vectors (e.g., {right arrow over (MSX₁)} to {right arrow over (MSX_h)}), respectively.

Each of the first to (h)-th data type conversion circuits 132_1 to 132_h may convert the data type of the received multiplication scaled input vector into fixed point based on the received exponent. The first to (h)-th data type conversion circuits 132_1 to 132_h may output the first to (h)-th fixed point multiplication scaled input vectors (e.g., {right arrow over (MSX′₁)} to {right arrow over (MSX′_h)}), respectively. For example, the first data type conversion circuit 132_1 may convert each of the multiplication scaled input elements MSIE included in the first multiplication scaled input vector MSX into the fixed point data type based on the first exponent EXP1. The detailed operation of the first to (h)-th data type conversion circuits 132_1 to 132_h will be described in more detail below with reference to FIG. 13.

FIG. 12 is a diagram illustrating an operation of an exponent extraction circuit of FIG. 11. Hereinafter, for a more concise description, the operation of the first exponent extraction circuit 131_1 will be representatively described. However, example embodiments are not limited thereto.

Referring to FIGS. 1 to 12, the first exponent extraction circuit 131_1 may receive the plurality of multiplication scaled input elements MSIE. For example, the first exponent extraction circuit 131_1 may receive multiplication scaled input elements MSIE11_1 to MSIE1n_R described above with reference to FIG. 9.

The data type of each of the plurality of multiplication scaled input elements MSIE may be floating point. For example, each of the multiplication scaled input elements MSIE11_1 to MSIE1n_R may include the sign part SP, the exponent part EXPP, and the mantissa part MTSP. The floating point format may use base 2.

The first exponent extraction circuit 131_1 may identify the largest value among the values of the exponent part EXPP of each of the received plurality of multiplication scaled input elements MSIE. In this case, the first exponent extraction circuit 131_1 may determine the identified value as the first exponent EXP1. That is, the first exponent extraction circuit 131_1 may extract the largest exponent among the exponents of the multiplication scaled input elements MSIE11_1 to MSIE1n_R as the first exponent EXP1.
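As a non-limiting illustration, the selection of the largest exponent may be sketched in Python as follows. The function name `extract_shared_exponent` and the sample values are hypothetical, and `math.frexp` is assumed here as a stand-in for reading the exponent part EXPP of each element; it returns an unbiased base-2 exponent rather than the stored (biased) field of a hardware format.

```python
import math

def extract_shared_exponent(elements):
    """Return the largest base-2 exponent among the given floating point values.

    Assumption: math.frexp(x) gives (m, e) with abs(x) = abs(m) * 2**e and
    0.5 <= abs(m) < 1, which is used in place of the exponent part EXPP.
    """
    # Zero carries no meaningful exponent, so it is skipped.
    exponents = [math.frexp(x)[1] for x in elements if x != 0.0]
    return max(exponents) if exponents else 0

# Hypothetical multiplication scaled input elements of one vector.
msie = [0.75, -6.0, 0.125, 3.5]
exp1 = extract_shared_exponent(msie)  # plays the role of the first exponent EXP1
```

In this sketch the element -6.0 determines the shared exponent, because it has the largest magnitude in the vector.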

For a more concise description, various example embodiments in which the first exponent extraction circuit 131_1 extracts the largest value among the values of the exponent part EXPP of the plurality of multiplication scaled input elements MSIE are representatively described with reference to FIG. 12, but the scope of the disclosure is not limited thereto.

For example, the first exponent extraction circuit 131_1 may be implemented to extract the smallest value among the values of the exponent part EXPP of the plurality of multiplication scaled input elements MSIE.

FIG. 13 is a diagram illustrating an operation of a data type conversion circuit of FIG. 11. Hereinafter, for a more concise description, the operation of the first data type conversion circuit 132_1 will be representatively described. However, example embodiments are not limited thereto.

Referring to FIGS. 1 to 13, the first data type conversion circuit 132_1 may receive the plurality of multiplication scaled input elements MSIE. For example, the first data type conversion circuit 132_1 may receive the multiplication scaled input elements MSIE11_1 to MSIE1n_R.

The first data type conversion circuit 132_1 may receive the first exponent EXP1. The first data type conversion circuit 132_1 may convert the data type of the plurality of received multiplication scaled input elements MSIE to fixed point based on the first exponent EXP1. That is, the first data type conversion circuit 132_1 may convert each of the multiplication scaled input elements MSIE11_1 to MSIE1n_R to fixed point multiplication scaled input elements MSIEfxp11_1 to MSIEfxp1n_R. However, hereinafter, for a more concise description, the operation of the first data type conversion circuit 132_1 that converts the multiplication scaled input element MSIE11_1 to a fixed point multiplication scaled input element MSIEfxp11_1 will be representatively described.

The first data type conversion circuit 132_1 may shift the mantissa part of the multiplication scaled input element MSIE11_1 (hereinafter, referred to as a first mantissa part MTSPa) towards a least significant bit (LSB) direction, by as much as the difference between the exponent of the multiplication scaled input element MSIE11_1 and the first exponent EXP1. For example, when the difference between the exponent of the multiplication scaled input element MSIE11_1 and the first exponent EXP1 is ‘4’, the first data type conversion circuit 132_1 may insert four ‘0’ bits at the most significant bit (MSB) side of the first mantissa part MTSPa.

The first data type conversion circuit 132_1 may determine a mantissa part of the fixed point multiplication scaled input element MSIEfxp11_1 (hereinafter, referred to as a second mantissa part MTSPb) based on the shifted first mantissa part MTSPa. For example, the first data type conversion circuit 132_1 may cut off lower bits of the shifted first mantissa part MTSPa according to a code length of the second mantissa part MTSPb. Alternatively, the first data type conversion circuit 132_1 may round the lower bits of the shifted first mantissa part MTSPa according to the code length of the second mantissa part MTSPb, based on various types of rounding algorithms such as “nearest even rounding”. That is, the mantissa part of the fixed point multiplication scaled input element MSIEfxp11_1 may be set to a predefined maximum length, which may require the shifted first mantissa part MTSPa to be shortened and/or rounded. However, example embodiments are not limited thereto.

In example embodiments, the code length of the first mantissa part MTSPa may be ‘10-bit’ or ‘23-bit’. However, example embodiments are not limited thereto.

In example embodiments, the code length of the second mantissa part MTSPb may be ‘7-bit’. However, example embodiments are not limited thereto.

In example embodiments, the data type of each of the plurality of fixed point multiplication scaled input elements MSIEfxp may be INT8 (8-bit integer). However, example embodiments are not limited thereto. The fixed point multiplication scaled input elements MSIEfxp may each include the sign part of the corresponding multiplication scaled input elements MSIE11_1 to MSIE1n_R and the second mantissa part.
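As a non-limiting illustration, the shift-and-cut-off conversion may be sketched in Python as follows. The function name `to_fixed_point`, the sample values, and the use of `math.frexp` to obtain a mantissa and exponent are assumptions made for illustration; in particular, the mantissa returned by `math.frexp` already includes the leading ‘1’ bit, which may differ from how a hardware format stores the first mantissa part MTSPa. The sketch uses cut-off (truncation) rather than rounding.

```python
import math

MANTISSA_BITS = 7  # example code length of the second mantissa part MTSPb

def to_fixed_point(x, shared_exp, mantissa_bits=MANTISSA_BITS):
    """Return a signed integer whose implied scale is 2**(shared_exp - mantissa_bits)."""
    if x == 0.0:
        return 0
    m, e = math.frexp(abs(x))           # abs(x) = m * 2**e with 0.5 <= m < 1
    shift = shared_exp - e              # difference between the two exponents
    if shift < 0:
        raise ValueError("shared_exp is assumed to be the largest exponent")
    # Scale the mantissa to an integer, then shift towards the LSB;
    # bits shifted out on the right are cut off.
    magnitude = int(m * (1 << mantissa_bits)) >> shift
    return -magnitude if x < 0 else magnitude

# Hypothetical elements sharing the exponent 3 (set by the element -6.0).
msie = [0.75, -6.0, 0.125, 3.5]
msie_fxp = [to_fixed_point(x, 3) for x in msie]
```

Each resulting integer, multiplied by 2**(3 - 7), reproduces the original element exactly for these sample values; elements with more significant bits than the second mantissa part can hold would lose their lower bits.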

FIG. 14 is a block diagram illustrating in more detail a configuration of a processing element array of FIG. 7. Referring to FIGS. 1 to 14, the processing element array 150 may include a plurality of processing elements PE arranged in a row direction and a column direction. Hereinafter, for a more concise description, it is assumed that the plurality of processing elements PE are arranged along (h) rows and (m) columns. In addition, the processing element disposed in the (i)-th row and (j)-th column of the processing element array 150 will be referred to as “PEij”. For example, the processing element arranged in the first row and second column of processing element array 150 will be referred to as “PE12.”

The processing element array 150 may include first to (h)-th processing element rows PER1 to PERh. Each of the first to (h)-th processing element rows PER1 to PERh may include the plurality of processing elements PE. For example, the first processing element row PER1 may include processing elements PE11 to PE1m.

The processing element array 150 may include first to (m)-th processing element columns PEC1 to PECm. Each of the first to (m)-th processing element columns PEC1 to PECm may include the plurality of processing elements PE. For example, the first processing element column PEC1 may include processing elements PE11 to PEh1.

Each of the plurality of processing element rows PER may receive a different fixed point multiplication scaled input vector MSXfxp. For example, the first to (h)-th processing element rows PER1 to PERh may receive the first to (h)-th fixed point multiplication scaled input vectors (e.g., {right arrow over (MSX′₁)} to {right arrow over (MSX′_h)}), respectively.

The processing elements included in the same processing element row may receive the same fixed point multiplication scaled input vector. For example, each of the processing elements PE11 to PE1m may receive the first fixed point multiplication scaled input vector (e.g., {right arrow over (MSX′₁)}).

Each of the plurality of processing element columns PEC may receive a different quantization sign bit vector {right arrow over (QSB)} from the quantization sign bit buffer 140. For example, the first to (m)-th processing element columns PEC1 to PECm may receive the first to (m)-th quantization sign bit vectors (e.g., {right arrow over (QSB₁)} to {right arrow over (QSB_m)}) respectively.

The first to (m)-th quantization sign bit vectors (e.g., {right arrow over (QSB₁)} to {right arrow over (QSB_m)}) may include a plurality of quantization sign bits QSB for different columns of the weight matrix WM. For example, the j-th quantization sign bit vector (e.g., {right arrow over (QSB_j)}) may include a plurality of quantization sign bits QSB for the (j)-th column vector of the weight matrix WM. More specifically, the j-th quantization sign bit vector (e.g., {right arrow over (QSB_j)}) may include the quantization sign bits QSB included in {right arrow over (QSB_{1_cj})} to {right arrow over (QSB_{R_cj})} described with reference to Equation 9.1 above. Each of {right arrow over (QSB_{1_cj})} to {right arrow over (QSB_{R_cj})} may include n quantization sign bits QSB. Each quantization sign bit vector (e.g., {right arrow over (QSB_j)}) may therefore include n×R quantization sign bits QSB. The first to (m)-th quantization sign bit vectors (e.g., {right arrow over (QSB₁)} to {right arrow over (QSB_m)}) are described in more detail below with reference to FIG. 16.

The processing elements included in the same processing element column may receive the same quantization sign bit vector. For example, each of the processing elements PE11 to PEh1 may receive the first quantization sign bit vector (e.g., {right arrow over (QSB₁)}), and each of the processing elements PE12 to PEh2 may receive the second quantization sign bit vector (e.g., {right arrow over (QSB₂)}). That is, the processing elements included in the same processing element column may receive the same quantization sign bits QSB.

Each of the plurality of processing elements PE may calculate a different fixed point partial product PSPfxp based on the received fixed point multiplication scaled input vector MSXfxp and the quantization sign bit vector. That is, according to some example embodiments, one processing element PE may calculate one fixed point partial product PSPfxp. For example, the processing element PEij may calculate a fixed point partial product PSPfxpij. Hereinafter, the method in which the fixed point partial product PSPfxp is calculated in each processing element PE will be described in more detail.
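As a non-limiting illustration, the way the rows share input vectors and the columns share quantization sign bit vectors may be sketched in Python as follows. The function name `pe_array`, the array sizes, and the sample values are hypothetical; a sign bit of ‘0’ is taken to add an element and a sign bit of ‘1’ to subtract it.

```python
def pe_array(msx_fxp_rows, qsb_cols):
    """Return an h x m grid; entry [i][j] models the result held by PEij.

    msx_fxp_rows: h fixed point multiplication scaled input vectors (one per row).
    qsb_cols:     m quantization sign bit vectors (one per column).
    """
    def partial_product(elements, sign_bits):
        # Add the element when its sign bit is 0, subtract it when the bit is 1.
        return sum(-e if b else e for e, b in zip(elements, sign_bits))

    return [[partial_product(row_vec, col_bits) for col_bits in qsb_cols]
            for row_vec in msx_fxp_rows]

# Hypothetical sizes: h = 2 rows, m = 3 columns, n x R = 4 elements per vector.
msx_fxp = [[12, -96, 2, 56],
           [1, 2, 3, 4]]
qsb = [[0, 0, 0, 0],
       [1, 1, 1, 1],
       [0, 1, 0, 1]]
psp_fxp = pe_array(msx_fxp, qsb)
```

Every entry in a given row of `psp_fxp` is computed from the same input vector, and every entry in a given column from the same sign bit vector, which mirrors the sharing described above.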

FIG. 15 is a block diagram illustrating in more detail some operations of the matrix multiplier of FIG. 7. Hereinafter, the operation of the matrix multiplier 100 on the first fixed point multiplication scaled input vector (e.g., {right arrow over (MSX′₁)}) will be representatively described, but example embodiments are not limited thereto.

For example, the matrix multiplier 100 may operate in a similar manner for the second to (h)-th fixed point multiplication scaled input vectors (e.g., {right arrow over (MSX′₂)} to {right arrow over (MSX′_h)}).

Referring to FIGS. 1 to 15, the first processing element row PER1 may receive the first fixed point multiplication scaled input vector (that is, {right arrow over (MSX′₁)}). For example, each of the processing elements PE11 to PE1m may receive fixed point multiplication scaled input elements MSIEfxp11_1 to MSIEfxp1n_R.

Each processing element PE11 to PE1m may receive a corresponding quantization sign bit vector. For example, the processing elements PE11 to PE1m may receive the first to (m)-th quantization sign bit vectors (e.g., {right arrow over (QSB₁)} to {right arrow over (QSB_m)}) respectively.

Each processing element PE11 to PE1m may calculate a corresponding fixed point partial product PSPfxp. For example, the processing elements PE11 to PE1m may calculate fixed point partial products PSPfxp11 to PSPfxp1m respectively.

The second data type converter 160 may receive the first exponent EXP1. The second data type converter 160 may receive the fixed point partial products PSPfxp11 to PSPfxp1m from the first processing element row PER1. The second data type converter 160 may convert the fixed point partial products PSPfxp11 to PSPfxp1m to the floating point data type based on the first exponent EXP1. For example, the second data type converter 160 may convert the fixed point partial products PSPfxp11 to PSPfxp1m to the partial products PSP11 to PSP1m respectively.

The common scaler 180 may include first to m-th multiplication operation circuits MUL1 to MULm. The first to m-th multiplication operation circuits MUL1 to MULm may receive partial products PSP11 to PSP1m, respectively. The first to m-th multiplication operation circuits MUL1 to MULm may receive first to m-th common scale coefficients CSC1 to CSCm, respectively, from the common scale coefficient buffer 170. Each of the first to m-th common scale coefficients CSC1 to CSCm may correspond to a different column vector of the weight matrix WM. For example, the first to m-th common scale coefficients CSC1 to CSCm may correspond to s_c1 to s_cm, respectively, described with reference to Equations 9 to 11 above.

Each of the first to m-th multiplication operation circuits MUL1 to MULm may calculate an output element based on the received partial product and common scale coefficient. For example, the first multiplication operation circuit MUL1 may calculate y₁₁ by multiplying a partial product PSP11 and a first common scale coefficient CSC1. Similarly, the second multiplication operation circuit MUL2 may calculate y₁₂ by multiplying a partial product PSP12 and a second common scale coefficient CSC2.

In this way, the matrix multiplier 100 may use the processing elements included in the first processing element row PER1 to calculate the output elements (e.g., y₁₁ to y_1m) corresponding to the first input vector (e.g., {right arrow over (X₁)}). Similarly, the matrix multiplier 100 may use the processing elements included in the second processing element row PER2 to calculate the output elements (e.g., y₂₁ to y_2m) corresponding to the second input vector (e.g., {right arrow over (X₂)}).
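As a non-limiting illustration, the per-column multiplication performed by the common scaler 180 may be sketched in Python as follows. The function name `common_scale` and all numerical values are hypothetical.

```python
def common_scale(partial_products, common_scale_coefficients):
    """Multiply PSP1j by CSCj for every column j, as MUL1 to MULm would."""
    if len(partial_products) != len(common_scale_coefficients):
        raise ValueError("one common scale coefficient is expected per column")
    return [psp * csc
            for psp, csc in zip(partial_products, common_scale_coefficients)]

psp_row1 = [-1.625, 1.625, 3.375]  # hypothetical partial products PSP11 to PSP13
csc = [0.5, 2.0, 4.0]              # hypothetical common scale coefficients CSC1 to CSC3
y_row1 = common_scale(psp_row1, csc)  # models y11 to y13
```

Because each common scale coefficient belongs to one column of the weight matrix WM, the same list `csc` could be reused for the partial products of any other processing element row.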

In example embodiments, the first to m-th multiplication operation circuits MUL1 to MULm may also generate a plurality of output elements by multiplying the first to m-th common scale coefficients CSC1 to CSCm with partial products generated based on the second processing element row PER2 (e.g., partial products PSP21 to PSP2m) respectively. For example, the first multiplication operation circuit MUL1 may generate the output element “y₂₁” by multiplying the partial product PSP21 and the first common scale coefficient CSC1. In some examples, the first to m-th multiplication operation circuits MUL1 to MULm may multiply the first to m-th common scale coefficients CSC1 to CSCm by the partial products generated based on the first to m-th processing element columns PEC1 to PECm, respectively. That is, the first to m-th multiplication operation circuits MUL1 to MULm may operate on the partial products output from each processing element row (e.g., each multiplication operation circuit may operate on a corresponding column of the processing element array). However, example embodiments are not limited to a specific implementation method of the common scaler 180.

In example embodiments, when the number of columns of the weight matrix WM is greater than the number of columns of the processing element array 150 or the number of rows of the weight matrix WM is greater than the number of rows of the processing element array 150, the matrix multiplier 100 may calculate the output elements based on various tiling techniques. The tiling technique will be described in more detail below with reference to FIGS. 61 to 64.

FIG. 16 is a block diagram illustrating in more detail an operation of a first processing element row of FIG. 15. That is, hereinafter, the operation of the first processing element row PER1 will be representatively described with reference to FIGS. 1 to 16. However, example embodiments are not limited thereto, and the second to (h)-th processing element rows PER2 to PERh may also operate in a similar manner.

The first processing element row PER1 may receive the first fixed point multiplication scaled input vector (e.g., {right arrow over (MSX′₁)}). For example, the first processing element row PER1 may sequentially receive the fixed point multiplication scaled input elements MSIEfxp11_1 to MSIEfxp1n_R. In this case, the plurality of fixed point multiplication scaled input elements MSIEfxp11_1 to MSIEfxp1n_R may correspond to each of the plurality of multiplication scaled input elements MSIE11_1 to MSIE1n_R described above with reference to FIG. 9.

The processing elements PE11 to PE1m may receive the first to (m)-th quantization sign bit vectors (e.g., {right arrow over (QSB₁)} to {right arrow over (QSB_m)}), respectively. Each of the first to (m)-th quantization sign bit vectors (e.g., {right arrow over (QSB₁)} to {right arrow over (QSB_m)}) may include a plurality of quantization sign bits QSB generated based on different column vectors of the weight matrix WM.

For a more detailed example, the first quantization sign bit vector (e.g., {right arrow over (QSB₁)}) may include b_{1_1_c1} to b_{n_R_c1}. In this case, b_{1_1_c1} to b_{1_R_c1} may respectively correspond to fixed point multiplication scaled input elements MSIEfxp11_1 to MSIEfxp11_R (e.g., fixed point multiplication scaled input elements for x₁₁); and b_{2_1_c1} to b_{2_R_c1} may respectively correspond to fixed point multiplication scaled input elements MSIEfxp12_1 to MSIEfxp12_R (e.g., fixed point multiplication scaled input elements for x₁₂). Similarly, b_{n_1_c1} to b_{n_R_c1} may respectively correspond to fixed point multiplication scaled input elements MSIEfxp1n_1 to MSIEfxp1n_R (e.g., fixed point multiplication scaled input elements for x_1n).

The processing element PE11 may sequentially receive a plurality of pairs, each pair comprising a quantization sign bit QSB and the corresponding fixed point multiplication scaled input element MSIEfxp. For example, the processing element PE11 may sequentially receive b_{1_1_c1} to b_{1_R_c1} corresponding to the fixed point multiplication scaled input elements MSIEfxp for x₁₁, and then sequentially receive b_{2_1_c1} to b_{2_R_c1} corresponding to the fixed point multiplication scaled input elements MSIEfxp for x₁₂. In this way, the processing element PE11 may sequentially receive b_{1_1_c1} to b_{n_R_c1} corresponding to the fixed point multiplication scaled input elements MSIEfxp for x₁₁ to x_1n.

The processing element PE11 may calculate the fixed point partial product PSPfxp11 based on the order of the quantization sign bits QSB and the plurality of fixed point multiplication scaled input elements MSIEfxp being received. For example, the processing element PE11 may calculate the fixed point partial product PSPfxp11 by subtracting a value obtained by summing the fixed point multiplication scaled input elements corresponding to the quantization sign bits QSB representing ‘1’ among the fixed point multiplication scaled input elements MSIEfxp11_1 to MSIEfxp1n_R, from a value obtained by summing the fixed point multiplication scaled input elements corresponding to the quantization sign bits QSB representing ‘0’ among the fixed point multiplication scaled input elements MSIEfxp11_1 to MSIEfxp1n_R.

Similarly, the processing element PE1j may receive the quantization sign bits (e.g., b_{1_1_cj} to b_{n_R_cj}) included in the j-th quantization sign bit vector (e.g., {right arrow over (QSB_j)}). The processing element PE1j may receive the plurality of fixed point multiplication scaled input elements MSIEfxp included in the first fixed point multiplication scaled input vector (e.g., {right arrow over (MSX′₁)}). The processing element PE1j may calculate the fixed point partial product PSPfxp1j based on the j-th quantization sign bit vector (e.g., {right arrow over (QSB_j)}) and the first fixed point multiplication scaled input vector (e.g., {right arrow over (MSX′₁)}).

For a more detailed example, the processing element PE may calculate the fixed point partial product PSPfxp according to Equation 13 below.

$\begin{matrix} PSPfxp = \sum MSIEfxp \times {(- 1)}^{QSB} & (Equation 13) \end{matrix}$

Referring to Equation 13, MSIEfxp and QSB may represent a fixed point multiplication scaled input element MSIEfxp and quantization sign bit QSB pair provided to the processing element PE. That is, the processing element PE may calculate the fixed point partial product PSPfxp by sequentially performing calculations of adding or subtracting the value of the fixed point multiplication scaled input element MSIEfxp based on the received quantization sign bit QSB. However, example embodiments are not limited thereto.
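As a purely illustrative numerical instance of Equation 13, assume (hypothetically) that a processing element PE receives the fixed point multiplication scaled input elements 12, -96, 2, and 56 paired with the quantization sign bits 0, 1, 0, and 1, respectively. The fixed point partial product may then be evaluated as:

$\begin{matrix} PSPfxp = 12 \times (-1)^{0} + (-96) \times (-1)^{1} + 2 \times (-1)^{0} + 56 \times (-1)^{1} = 12 + 96 + 2 - 56 = 54 \end{matrix}$

In this instance, no multiplier is needed inside the processing element PE, since each term reduces to adding or subtracting the received element.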

FIG. 17 is a diagram illustrating a configuration of one of the processing elements of FIG. 16 implemented according to various example embodiments. Referring to FIGS. 1 to 17, the processing element PE may include an arithmetic logic unit ALU and an accumulation register REG_ACC.

The arithmetic logic unit ALU may include first to third input terminals TI1 to TI3 and an output terminal TO. The first input terminal TI1 may sequentially receive the plurality of fixed point multiplication scaled input elements MSIEfxp. The second input terminal TI2 may sequentially receive the plurality of quantization sign bits QSB. The third input terminal TI3 may be connected to the accumulation register REG_ACC.

When the value received by the second input terminal TI2 is ‘0’, the arithmetic logic unit ALU may store the sum of the values received by the first input terminal TI1 and the third input terminal TI3 in the accumulation register REG_ACC through the output terminal TO. On the other hand, when the value received by the second input terminal TI2 is ‘1’, the arithmetic logic unit ALU may store a value obtained by subtracting the value provided to the first input terminal TI1 from the value provided to the third input terminal TI3 in the accumulation register REG_ACC through the output terminal TO. That is, when receiving a quantization sign bit QSB equal to ‘0’ at the second input terminal TI2, the arithmetic logic unit ALU may calculate the sum of the values input into the first and third input terminals TI1 and TI3 and output the sum through the output terminal TO. Similarly, when receiving a quantization sign bit QSB equal to ‘1’ at the second input terminal TI2, the arithmetic logic unit ALU may subtract the value input into the first input terminal TI1 from the value input into the third input terminal TI3 and output the resulting value through the output terminal TO. The accumulation register REG_ACC may store the most recently received value output from the arithmetic logic unit ALU and feed that value back to the third input terminal TI3.

In this way, the processing element PE may store the fixed point partial product PSPfxp calculated according to Equation 13 described above in the accumulation register REG_ACC. In this case, the fixed point partial product PSPfxp stored in the accumulation register REG_ACC may be provided to the second data type converter 160. However, example embodiments are not limited to a specific method in which the processing element PE performs the calculation and a specific configuration method of the processing element PE.
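As a non-limiting illustration, the feedback loop between the arithmetic logic unit ALU and the accumulation register REG_ACC may be modeled in Python as follows. The class name `ProcessingElement`, the method name `step`, the initial register value of 0, and the sample values are assumptions made for illustration; the bit width of the register is not modeled.

```python
class ProcessingElement:
    """Hypothetical software model of one processing element PE."""

    def __init__(self):
        self.reg_acc = 0  # accumulation register REG_ACC, assumed to start at 0

    def step(self, msie_fxp, qsb):
        """One ALU operation: TI1 = msie_fxp, TI2 = qsb, TI3 = current REG_ACC."""
        if qsb == 0:
            result = self.reg_acc + msie_fxp  # add when the sign bit is 0
        elif qsb == 1:
            result = self.reg_acc - msie_fxp  # subtract when the sign bit is 1
        else:
            raise ValueError("a quantization sign bit must be 0 or 1")
        self.reg_acc = result                 # TO writes the result back to REG_ACC
        return result

pe = ProcessingElement()
# Hypothetical (element, sign bit) pairs received one after another.
trace = [pe.step(e, b) for e, b in zip([12, -96, 2, 56], [0, 1, 0, 1])]
```

After the last pair has been processed, `pe.reg_acc` holds the value that Equation 13 gives for the same pairs.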

In example embodiments, the processing element PE may not include any ‘1-bit adder’ to perform the multiplication operation on the multiplication scale coefficient MSC. That is, according to some example embodiments, since the multiplication scaled input element MSIE is provided to the processing element array 150, each processing element PE may be implemented not to perform the calculation on the scale coefficient MSC. In this case, instead of each processing element PE including the circuit element for performing the multiplication operation for the multiplication scale coefficient MSC, only one multiplication scaling circuit 121 may be included for each processing element row PER, so the size and production cost of the matrix multiplier 100 may be reduced.

FIG. 18 is a diagram illustrating an operation of the second data type converter of FIG. 7. Hereinafter, for a more concise description, the operation of the second data type converter 160, which receives one fixed point partial product PSPfxp generated based on the first input vector (e.g., {right arrow over (X₁)}) and outputs one partial product PSP, will be representatively described. However, example embodiments are not limited thereto.

Referring to FIGS. 1 to 18, the second data type converter 160 may receive the first exponent EXP1 from the first data type converter 130. The second data type converter 160 may receive a fixed point partial product PSPfxp from one processing element PE. In this case, the fixed point partial product PSPfxp may include a sign part SP and a mantissa part MTSP.

The second data type converter 160 may generate a partial product PSP by converting the data type of the fixed point partial product PSPfxp to floating point. For example, the second data type converter 160 may add (e.g. insert) an exponent part EXPP to the fixed point partial product PSPfxp. The second data type converter 160 may determine the exponent part EXPP of the partial product PSP so that the exponent part EXPP of the partial product PSP corresponds to the first exponent EXP1.

The second data type converter 160 may add a plurality of ‘0’ bits to a least significant bit (LSB) place of the mantissa part MTSP of the partial product PSP according to the difference in code lengths of the mantissa part of the partial product PSP and the mantissa part of the fixed point partial product PSPfxp. However, example embodiments are not limited thereto.
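As a non-limiting illustration, reattaching the shared exponent to a fixed point partial product may be sketched in Python as follows. The function name `to_floating_point` and the sample values are hypothetical, and the scale convention (a fixed point value of 1 representing 2**(shared exponent - mantissa bits)) is an assumption chosen so that the sketch is self-consistent; an actual circuit may place the binary point differently.

```python
import math

def to_floating_point(psp_fxp, shared_exp, mantissa_bits=7):
    """Return psp_fxp * 2**(shared_exp - mantissa_bits) as a floating point value."""
    # math.ldexp(x, i) computes x * 2**i without any rounding of the mantissa.
    return math.ldexp(float(psp_fxp), shared_exp - mantissa_bits)

# Hypothetical fixed point partial product 54 with the shared exponent 3.
psp = to_floating_point(54, 3)
```

Under the stated assumption, the integer 54 accumulated from elements that share the exponent 3 corresponds to the floating point partial product 3.375.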

FIG. 19 is a flowchart illustrating the operation of the matrix multiply device of FIG. 1. Hereinafter, referring to FIGS. 1 to 19, the operation of the matrix multiply device MMD that receives one input vector (e.g., {right arrow over (X₁)}) and outputs one output vector (e.g., {right arrow over (Y₁)}) will be described.

In operation S110, the matrix multiply device MMD may receive the weight matrix WM. For example, the uniform BCQ circuit UBC may receive a plurality of weights (e.g., w₁₁ to w_nm) included in the weight matrix WM.

In operation S120, the matrix multiply device MMD may perform the uniform binary coding quantization for the weight matrix WM to generate the plurality of multiplication scale coefficients MSC, the plurality of common scale coefficients CSC, and the plurality of quantization sign bits QSB. For example, the uniform BCQ circuit UBC may generate a plurality of multiplication scale coefficients MSC, one common scale coefficient CSC, and a plurality of quantization sign bits QSB for each weight. The uniform BCQ circuit UBC may provide the plurality of multiplication scale coefficients MSC, the plurality of common scale coefficients CSC, and the plurality of quantization sign bits QSB to the matrix multiplier 100. In this case, the plurality of multiplication scale coefficients MSC may be stored in the multiplication scale coefficient buffer 110, the plurality of common scale coefficients CSC may be stored in the common scale coefficient buffer 170, and the plurality of quantization sign bits QSB may be stored in the quantization sign bit buffer 140. However, example embodiments are not limited thereto.

In operation S130, the matrix multiply device MMD may receive an input vector (e.g., {right arrow over (X₁)}). For example, the matrix multiplier 100 may receive the plurality of input elements (e.g., x₁₁ to x_1n) included in the input vector.

In example embodiments, the matrix multiply device MMD may perform operation S130 regardless of the order of operations S110 and S120. For example, the matrix multiply device MMD may perform operation S130 followed by operations S110 and S120, or perform operation S130 between operations S110 and S120.

In operation S140, the matrix multiply device MMD may generate the multiplication scaled input vector MSX by performing the multiplication scaling for the input vector based on the plurality of multiplication scale coefficients MSC. For example, the input vector scaler 120 may perform multiplication scaling on each of the plurality of input elements based on the plurality of multiplication scale coefficients MSC provided from the multiplication scale coefficient buffer 110. More specifically, the input vector scaler 120 may multiply each of the plurality of input elements with the plurality of multiplication scale coefficients MSC to generate the plurality of multiplication scaled input elements MSIE.
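As a non-limiting illustration, the multiplication scaling of operation S140 may be sketched in Python as follows. The function name `multiplication_scale` and the sample values are hypothetical; the output order (all ‘R’ scaled elements for one input element, followed by those for the next input element) is assumed from the element naming MSIE11_1 to MSIE1n_R.

```python
def multiplication_scale(input_elements, msc):
    """Return n x R multiplication scaled input elements.

    The order is MSIE11_1..MSIE11_R, MSIE12_1..MSIE12_R, ..., MSIE1n_1..MSIE1n_R.
    """
    return [x * s for x in input_elements for s in msc]

x1 = [1.5, -2.0]        # hypothetical input elements x11 and x12 (n = 2)
msc = [1.0, 0.5, 0.25]  # hypothetical multiplication scale coefficients (R = 3)
msie = multiplication_scale(x1, msc)
```

The scaling is performed once per input vector, so the same list `msie` may be supplied to every processing element of a processing element row.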

In operation S150, the matrix multiply device MMD may generate the plurality of partial products PSP based on the elements included in the multiplication scaled input vector MSX (e.g., the plurality of multiplication scaled input elements MSIE) and the plurality of quantization sign bits QSB. Operation S150 is described in more detail below with reference to FIG. 20.

In operation S160, the matrix multiply device MMD may generate the output vector (e.g., {right arrow over (Y₁)}) by multiplying the plurality of partial products PSP and the plurality of common scale coefficients CSC, respectively. For example, the matrix multiplier 100 may generate one output element (e.g., y₁₁) by multiplying the partial product PSP11 with the first common scale coefficient CSC1. Similarly, the matrix multiplier 100 may generate one output element (e.g., y₁₂) by multiplying the partial product PSP12 with the second common scale coefficient CSC2. In this way, the matrix multiplier 100 may generate the plurality of output elements included in the output vector.

FIG. 20 is a flowchart illustrating in more detail operation S150 of FIG. 19. Referring to FIGS. 1 to 20, operation S150 may include the following operations S151 to S153.

In operation S151, the matrix multiplier 100 may convert the data type of the multiplication scaled input vector MSX to fixed point and generate the fixed point multiplication scaled input vector MSXfxp. For example, the first data type converter 130 may convert each data type of the plurality of multiplication scaled input elements MSIE to fixed point, and generate the plurality of fixed point multiplication scaled input elements MSIEfxp.

In operation S152, the matrix multiplier 100 may generate the plurality of fixed point partial products PSPfxp based on the plurality of quantization sign bits QSB and the fixed point multiplication scaled input vector MSXfxp. For example, the matrix multiplier 100 may generate the fixed point partial product PSPfxp by sequentially adding or subtracting each of the fixed point multiplication scaled input elements MSIEfxp based on the plurality of quantization sign bits QSB.

In operation S153, the matrix multiplier 100 may change the data type of the plurality of fixed point partial products PSPfxp to floating point. For example, the second data type converter 160 may generate the plurality of partial products PSP based on the plurality of fixed point partial products PSPfxp.

FIG. 21 is a flowchart illustrating the operation of the matrix multiply device of FIG. 1. Hereinafter, referring to FIGS. 1 to 18 and 21, the operation of the matrix multiply device MMD that receives one input vector (e.g., {right arrow over (X₁)}) and generates one output element (e.g., y₁₁) will be described.

In operation S210, the matrix multiply device MMD may receive the first to (n)-th weights. For example, the uniform BCQ circuit UBC may receive the weights (e.g., w₁₁ to w_n1) corresponding to one column of the weight matrix WM.

In operation S220, the matrix multiply device MMD may generate the first to (R)-th multiplication scale coefficients MSC, one common scale coefficient CSC, and the first-to-(N×R)th quantization sign bits QSB by performing the uniform binary coding quantization for the first to (n)-th weights.

In operation S230, the matrix multiply device MMD may receive the first to (n)-th input elements. For example, the matrix multiplier 100 may receive the plurality of input elements (e.g., x₁₁ to x_1n) included in one input vector.

In example embodiments, the matrix multiply device MMD may perform operation S230 regardless of the order of operations S210 and S220. For example, the matrix multiply device MMD may perform operation S230 followed by operations S210 and S220, or perform operation S230 between operations S210 and S220.

In operation S240, the matrix multiply device MMD may generate the first-to-(N×R)th multiplication scaled input elements by performing the multiplication scaling on the first to (n)-th input elements based on the first-to-Rth multiplication scale coefficients MSC. For example, the input vector scaler 120 may generate ‘R’ multiplication scaled input elements MSIE for each single input element based on ‘R’ multiplication scale coefficients MSC.

In operation S250, the matrix multiply device MMD may generate one partial product PSP based on the first-to-(N×R)th multiplication scaled input elements and the first-to-(N×R)th quantization sign bits QSB. For example, the processing element PE may calculate the partial product PSP by adding or subtracting each of the first-to-(N×R)th multiplication scaled input elements based on the first-to-(N×R)th quantization sign bits QSB.

In operation S260, the matrix multiply device MMD may output the output element (e.g., y₁₁) generated by multiplying the partial product PSP by the common scale coefficient CSC. For example, the common scaler 180 may generate one output element by multiplying the partial product PSP by the common scale coefficient CSC. In this case, the generated output element may correspond to a summation of multiplications of the weights and input elements previously received through operations S210 and S230, respectively.
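
As an illustration only, operations S240 to S260 for a single output element can be sketched in software as follows. The function name `output_element`, its argument layout, and the use of Python floats are assumptions made for this sketch and do not describe the circuit itself; the sign convention (a quantization sign bit of ‘0’ adds, ‘1’ subtracts) follows the description of operation S252.

```python
def output_element(inputs, sign_bits, common_scale, resolution):
    """Hypothetical software sketch of operations S240 to S260.

    inputs:       the n input elements of one input vector
    sign_bits:    sign_bits[i][k] is the quantization sign bit for the i-th
                  weight and the multiplication scale coefficient 2**k
                  (0 means add, 1 means subtract)
    common_scale: the single common scale coefficient CSC of the weight column
    resolution:   the BCQ resolution R
    """
    partial = 0.0
    for i, x in enumerate(inputs):
        for k in range(resolution):
            scaled = x * (2 ** k)       # S240: multiplication scaling
            if sign_bits[i][k] == 0:    # S250: add or subtract by sign bit
                partial += scaled
            else:
                partial -= scaled
    return partial * common_scale       # S260: common scaling


# Two weights approximated with R = 2 and CSC = 0.5:
# w1 = 0.5 * (+1 + 2) = 1.5 and w2 = 0.5 * (-1 + 2) = 0.5
y11 = output_element([1.0, 2.0], [[0, 0], [1, 0]], 0.5, 2)
```

In this sketch, only one multiplication by the common scale coefficient is performed per output element; the inner loop uses additions and subtractions of power-of-two scaled inputs.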

FIG. 22 is a flowchart illustrating in more detail operation S250 of FIG. 21. Referring to FIGS. 1 to 18 and 21 to 22, operation S250 may include operations S251 to S253.

In operation S251, the matrix multiplier 100 may generate the first-to-(N×R)th fixed point multiplication scaled input elements MSIEfxp by converting the data type of the first-to-(N×R)th multiplication scaled input elements MSIE to fixed point. For example, the first data type converter 130 may convert the first-to-(N×R)th multiplication scaled input elements MSIE to the first-to-(N×R)th fixed point multiplication scaled input elements MSIEfxp, respectively.

In operation S252, the matrix multiplier 100 may calculate the fixed point partial product PSPfxp by subtracting the sum of the fixed point multiplication scaled input elements MSIEfxp corresponding to the quantization sign bits QSB representing ‘1’ from the sum of the fixed point multiplication scaled input elements MSIEfxp corresponding to the quantization sign bits QSB representing ‘0’. For example, the processing element PE may calculate the fixed point partial product PSPfxp in the manner described above with reference to Equation 13.
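
A minimal sketch of this add-or-subtract accumulation is given below, assuming the fixed point elements are modeled as Python integers. The function name `fixed_point_partial_product` is hypothetical and is used only for illustration.

```python
def fixed_point_partial_product(elements_fxp, sign_bits):
    """Sum of elements whose sign bit is 0, minus sum of those whose bit is 1."""
    plus = sum(e for e, b in zip(elements_fxp, sign_bits) if b == 0)
    minus = sum(e for e, b in zip(elements_fxp, sign_bits) if b == 1)
    return plus - minus


# (3 + 7) - (5 + 2)
psp_fxp = fixed_point_partial_product([3, 5, 7, 2], [0, 1, 0, 1])
```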

In operation S253, the matrix multiplier 100 may output the partial product PSP generated by converting the data type of the fixed point partial product PSPfxp to floating point. For example, the second data type converter 160 may provide the partial product PSP to the common scaler 180.

FIG. 23 is a diagram illustrating the operation of the BCQ circuit of FIG. 1 according to some example embodiments. Referring to FIGS. 1 to 4 and 15, the uniform BCQ circuit UBC may perform the binary coding quantization operation for each row of the weight matrix WM.

First, the weight matrix WM may be expressed as Equation 14 below.

$\begin{matrix} WM = [\begin{matrix} w_{11} & w_{12} & \dots & w_{1 m} \\ w_{21} & w_{22} & \dots & w_{2 m} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ w_{n 1} & w_{n 2} & \dots & w_{nm} \end{matrix}] = [\begin{matrix} \vec{w_{r 1}} \\ \vec{w_{r 2}} \\ ⋮ \\ \vec{w_{rn}} \end{matrix}] & (Equation 14) \end{matrix}$

Referring to Equation 14, {right arrow over (w_ri)} may represent an i-th row vector of the weight matrix WM. For example, {right arrow over (w_ri)} may include w_i1 to w_im.

The uniform BCQ circuit UBC may perform the binary coding quantization operation for each row of the weight matrix WM based on Equation 15 below.

$\begin{matrix} \vec{w_{ri}} \approx \sum_{k = 1}^{R} (α_{k_ri} \times \vec{B_{k_ri}}) & (Equation 15) \end{matrix}$

Referring to Equation 15, R may represent the BCQ resolution. The α_{k_ri} may represent a k-th quantization scale coefficient for an i-th row vector of the weight matrix WM. The {right arrow over (B_{k_ri})} may represent the quantization sign vector corresponding to α_{k_ri}. For example, {right arrow over (B_{k_ri})} may represent the quantization sign vector corresponding to the k-th quantization scale coefficient for the i-th row vector of the weight matrix WM. {right arrow over (B_{k_ri})} may be calculated in a similar manner to {right arrow over (B_{k_cj})} (see Equation 8); however, instead of being calculated for a column of WM, it is calculated for a row of WM.

Meanwhile, α_{k_ri} may be expressed as the product of the common scale coefficient and the multiplication scale coefficient. For example, α_{k_ri} may be expressed as 2^(k-1)×s_ri. Accordingly, the above-described Equation 15 may be expressed as the following Equation 16.

$\begin{matrix} \vec{w_{ri}} \approx s_{ri} \times \sum_{k = 1}^{R} (2^{k - 1} \times \vec{B_{k_ri}}) & (Equation 16) \end{matrix}$

Referring to Equation 16, 2^(k-1) may represent the k-th multiplication scale coefficient MSCk, and s_ri may represent the common scale coefficient CSC for the (i)-th row vector.

FIG. 24 is a diagram illustrating an approximated weight matrix according to various example embodiments of FIG. 23. Referring to FIGS. 1 to 4 and 23 and 24, the uniform BCQ circuit UBC may perform the uniform binary coding quantization for each row of the weight matrix WM.

The quantization sign vector {right arrow over (B_{k_ri})} may be implemented as a row vector having the same dimension as the number of columns of the weight matrix WM. For example, the quantization sign vector {right arrow over (B_{k_ri})} may be expressed as Equation 17 below.

$\begin{matrix} \vec{B_{k_ri}} = [\begin{matrix} {(- 1)}^{b_{1_k_ri}} & {(- 1)}^{b_{2_k_ri}} & \dots & {(- 1)}^{b_{m_k_ri}} \end{matrix}] & (Equation 17) \end{matrix}$

Each of the b_{1_k_ri} to b_{m_k_ri} may represent a different quantization sign bit QSB. More specifically, the b_{1_k_ri} to b_{m_k_ri} may be the quantization sign bits QSB for weights arranged in different columns of the weight matrix WM. Each of the b_{1_k_ri} to b_{m_k_ri} may be ‘0’ or ‘1’.

As with the embodiment of FIG. 5 and FIG. 6, the multiplication based on the quantization sign bits QSB may be implemented through adding or subtracting values based on the corresponding quantization sign bits QSB. Accordingly, the matrix multiplier 100 may receive, as an input, a quantization sign bit vector {right arrow over (QSB_{k_ri})}. The {right arrow over (QSB_{k_ri})} may include a plurality of quantization sign bits corresponding to the k-th quantization scale coefficient of the i-th row vector. More specifically, {right arrow over (QSB_{k_ri})} may be expressed as Equation 17.1 below.

$\begin{matrix} \vec{{QSB}_{k_ri}} = [\begin{matrix} b_{1_k_ri} & b_{2_k_ri} & \dots & b_{m_k_ri} \end{matrix}] & (Equation 17.1) \end{matrix}$

That is, the uniform BCQ circuit UBC may approximate each row vector included in the weight matrix WM based on the combinations of the plurality of quantization sign bits QSB, one common scale coefficient CSC, and the plurality of multiplication scale coefficients MSC. In this case, the matrix multiplier 100 may calculate the output matrix YM corresponding to the product of the input matrix XM and the weight matrix WM with a smaller computational amount.

Hereinafter, for a more concise description, the operation of the matrix multiplier 100 that calculates the output element (e.g., y₁₁) of the first row and first column of the output matrix YM will be representatively described. However, example embodiments are not limited thereto.

The uniform BCQ circuit UBC may provide the plurality of quantization sign bits QSB (e.g., “b” values), the plurality of common scale coefficients CSC (e.g., “s” values), and the plurality of multiplication scale coefficients MSC (e.g., powers of 2) to the matrix multiplier 100.

The matrix multiplier 100 may calculate the output matrix YM based on the plurality of quantization sign bits QSB, the plurality of common scale coefficients CSC, and the plurality of multiplication scale coefficients MSC. For example, the matrix multiplier 100 may calculate the output matrix YM by multiplying the weight matrix WM approximated in the manner described above with reference to Equations 14 to 17.1 and FIG. 24 with the input matrix XM.

For a more detailed example, the matrix multiplier 100 may calculate y₁₁ according to Equation 18 below.

$\begin{matrix} y_{11} \approx x_{11} \times s_{r 1} \times (2^{0} \times {(- 1)}^{b_{1_1_r 1}} + 2^{1} \times {(- 1)}^{b_{1_2_r 1}} + \dots + 2^{R - 1} \times {(- 1)}^{b_{1_R_r 1}}) \\ + x_{12} \times s_{r 2} \times (2^{0} \times {(- 1)}^{b_{1_1_r 2}} + 2^{1} \times {(- 1)}^{b_{1_2_r 2}} + \dots + 2^{R - 1} \times {(- 1)}^{b_{1_R_r 2}}) \\ + \dots \\ + x_{1 n} \times s_{rn} \times (2^{0} \times {(- 1)}^{b_{1_1_rn}} + 2^{1} \times {(- 1)}^{b_{1_2_rn}} + \dots + 2^{R - 1} \times {(- 1)}^{b_{1_R_rn}}) & (Equation 18) \end{matrix}$

In example embodiments, the matrix multiplier 100 may calculate the output element by first performing the quantization scaling on the input element based on the common scale coefficient CSC and the multiplication scale coefficient MSC, and then accumulating the quantization scaled input element (hereinafter, referred to as “quantization scaled input element QSIE”) based on the plurality of quantization sign bits QSB. In this case, the matrix multiplier 100 may use the same quantization scaled input element QSIE to calculate different output elements included in one output vector (e.g., {right arrow over (Y₁)}). For example, the matrix multiplier 100 may use “2⁰×s_r1×x₁₁” for each of the calculations of y₁₁ to y_1m. Accordingly, according to various example embodiments of the present disclosure, the computational amount of the matrix multiplier 100 and the number of accesses to memory devices external to the matrix multiplier 100 may be reduced or minimized. The detailed configuration and operation of the matrix multiplier 100 are described in more detail below with reference to FIGS. 25 to 46.
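
The reuse of quantization scaled input elements across the output elements of one output vector can be illustrated with the following sketch. The function name `output_vector` and the nested-list layout `bits[k][i][j]` (the quantization sign bit in the (i)-th row and (j)-th column of the (k+1)-th quantization sign bit matrix) are assumptions chosen for this example, not features of the disclosed circuit.

```python
def output_vector(x, s, bits, resolution):
    """Hypothetical sketch: compute one output vector with row-wise uniform BCQ.

    x:    the n input elements of one input vector
    s:    n common scale coefficients, one per row of the weight matrix
    bits: bits[k][i][j] is the sign bit at row i, column j for scale 2**k
    """
    n = len(x)
    m = len(bits[0][0])
    # Quantization scaling is performed once per input element ...
    qsie = [[(2 ** k) * s[i] * x[i] for k in range(resolution)] for i in range(n)]
    # ... and the same scaled elements are reused for every output column j.
    y = []
    for j in range(m):
        acc = 0.0
        for i in range(n):
            for k in range(resolution):
                acc += qsie[i][k] if bits[k][i][j] == 0 else -qsie[i][k]
        y.append(acc)
    return y


# n = 2, m = 2, R = 2; the approximated weights are
# [[1.5, 0.5], [-0.25, 0.75]]
bits = [
    [[0, 1], [0, 0]],  # sign bits for scale 2**0
    [[0, 0], [1, 0]],  # sign bits for scale 2**1
]
y1 = output_vector([1.0, 2.0], [0.5, 0.25], bits, 2)
```

In this sketch, the number of multiplications depends only on n and R, while each additional output column adds only additions and subtractions.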

Meanwhile, continuing to refer to FIG. 24 and Equations 14 to 17.1, the quantization sign bit vectors (e.g., {right arrow over (QSB_{1_r1})} to {right arrow over (QSB_{1_rn})}) used for the multiplication of the quantization sign vectors (e.g., {right arrow over (B_{1_r1})} to {right arrow over (B_{1_rn})}) with the first quantization scale coefficients (e.g., α_{1_r1} to α_{1_rn}) for the first to (n)-th rows of the weight matrix WM may be expressed as a first quantization sign bit matrix QSBM_1. For example, the first quantization sign bit matrix QSBM_1 may be expressed as Equation 19 below.

$\begin{matrix} QSBM_1 = [\begin{matrix} \vec{{QSB}_{1_r 1}} \\ \vec{{QSB}_{1_r 2}} \\ ⋮ \\ \vec{{QSB}_{1_rn}} \end{matrix}] & (Equation 19) \end{matrix}$

In a similar manner, the quantization sign bit vectors (e.g., {right arrow over (QSB_{R_r1})} to {right arrow over (QSB_{R_rn})}) that are used for the multiplication of the quantization sign vectors (e.g., {right arrow over (B_{R_r1})} to {right arrow over (B_{R_rn})}) with the (R)-th quantization scale coefficients (e.g., α_{R_r1} to α_{R_rn}) for the first to (n)-th rows of the weight matrix WM may be expressed as an (R)-th quantization sign bit matrix QSBM_R.

In example embodiments, the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R may be implemented with the same number of rows and columns as the weight matrix WM. For example, the number of rows of the first quantization sign bit matrix QSBM_1 may be ‘n’ and the number of columns may be ‘m’. The first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R are described in more detail below with reference to FIG. 25.

FIG. 25 is a diagram illustrating the quantization sign bit matrices of FIG. 24. Referring to FIGS. 1 to 4 and 23 to 25, the uniform BCQ circuit UBC may perform the uniform binary coding quantization for the weight matrix WM with the plurality of quantization sign bit matrices QSBM, the plurality of common scale coefficients CSC, and the plurality of multiplication scale coefficients MSC.

The number of quantization sign bit matrices QSBM may be determined according to the BCQ resolution (e.g., ‘R’). For example, the uniform BCQ circuit UBC may generate the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R by performing the binary coding quantization for the weight matrix WM.

Each of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R may be implemented with the same number of rows and columns as the weight matrix WM. For example, the number of rows of each of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R may be ‘n’, and the number of columns may be ‘m’. In this case, the quantization sign bit QSB arranged in the (i)-th row and (j)-th column of the (k)-th quantization sign bit matrix QSBM_k may be “b_{j_k_ri}”.

Each weight of the weight matrix WM may be approximated by the quantization sign bits QSB arranged at corresponding positions of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R. For example, the weight arranged in the (i)-th row and (j)-th column of the weight matrix WM (e.g., w_ij) may be approximated based on the quantization sign bits arranged in the (i)-th row and (j)-th column of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R (e.g., b_{j_1_ri} to b_{j_R_ri}).

Each of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R may correspond to a plurality of common scale coefficients CSC. That is, as described with reference to FIGS. 23 and 24, when the uniform BCQ circuit UBC performs the uniform binary coding quantization operation for each row of the weight matrix WM, different rows of the weight matrix WM will be approximated based on different common scale coefficients CSC and the weights included in the same row of the weight matrix WM will be approximated based on the same common scale coefficient CSC.

Accordingly, all of the quantization sign bits included in the (i)-th row of the (k)-th quantization sign bit matrix QSBM_k (e.g., b_{1_k_ri} to b_{m_k_ri}) may correspond to the common scale coefficient “s_ri”. In this way, the first to (n)-th rows of each of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R may correspond to the common scale coefficients “s_r1” to “s_rn” respectively. For a more detailed example, the first to (n)-th rows of the first quantization sign bit matrix QSBM_1 may correspond to “s_r1” to “s_rn” respectively.

Each of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R may correspond to different multiplication scale coefficients MSC. For example, each of the plurality of quantization sign bits QSB included in the first quantization sign bit matrix QSBM_1 may correspond to the first multiplication scale coefficient MSC1 (e.g., 2⁰), and each of the plurality of quantization sign bits QSB included in the second quantization sign bit matrix QSBM_2 may correspond to the second multiplication scale coefficient MSC2 (e.g., 2¹).

In this way, the weight arranged in the (i)-th row and the (j)-th column of the weight matrix WM (e.g., w_ij) may be approximated based on the quantization sign bits arranged in the (i)-th row and the (j)-th column of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R; one common scale coefficient CSC corresponding to the (i)-th row of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R; and the plurality of multiplication scale coefficients MSC respectively corresponding to the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R.

More specifically, each of the plurality of weights in the (i)-th row may be approximated based on R multiplication scale coefficients MSC, one common scale coefficient CSC (e.g., corresponding to the (i)-th row), and R quantization sign bits QSB, according to Equation 20 below.

$\begin{matrix} w_{ij} \approx 2^{0} \times s_{ri} \times {(- 1)}^{b_{j_1_ri}} + 2^{1} \times s_{ri} \times {(- 1)}^{b_{j_2_ri}} + \dots + 2^{R - 1} \times s_{ri} \times {(- 1)}^{b_{j_R_ri}} & (Equation 20) \end{matrix}$
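
As a small worked illustration of Equation 20, the sketch below reconstructs one approximated weight from its R quantization sign bits and the common scale coefficient of its row. The function name `approximate_weight` and the list ordering (bit for 2⁰ first) are assumptions of this example.

```python
def approximate_weight(bits_ij, s_ri):
    """Evaluate Equation 20 for one weight.

    bits_ij: [b_{j_1_ri}, ..., b_{j_R_ri}], the sign bits for scales 2**0 .. 2**(R-1)
    s_ri:    the common scale coefficient of the i-th row
    """
    return sum((2 ** k) * s_ri * (-1) ** b for k, b in enumerate(bits_ij))


# R = 3, s_ri = 0.5: 0.5 * (+1 - 2 + 4)
w = approximate_weight([0, 1, 0], 0.5)
```

Because each term contributes plus or minus a distinct power of two, the values representable in this sketch are s_ri multiplied by the odd integers from −(2^R − 1) to 2^R − 1.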

The operation of the matrix multiplier 100 based on weights approximated based on the multiplication scale coefficient MSC, the common scale coefficient CSC, and the quantization sign bit QSB will be described in more detail below with reference to FIGS. 26 to 46.

FIG. 26 is a block diagram illustrating a configuration of the matrix multiplier of FIG. 1 implemented according to some example embodiments. Referring to FIGS. 1 to 4 and 23 to 26, the matrix multiplier 100 of FIG. 1 may be implemented as a matrix multiplier 200 of FIG. 26.

The matrix multiplier 200 may include a multiplication scale coefficient buffer 210, a common scale coefficient buffer 270, an input vector scaler 220, a first data type converter 230, a quantization sign bit buffer 240, a processing element array 250, and a second data type converter 260.

The multiplication scale coefficient buffer 210 may store the plurality of multiplication scale coefficients MSC provided from the uniform BCQ circuit UBC. The multiplication scale coefficient buffer 210 may provide the plurality of multiplication scale coefficients MSC to the input vector scaler 220.

The common scale coefficient buffer 270 may store the plurality of common scale coefficients CSC provided from the uniform BCQ circuit UBC. The common scale coefficient buffer 270 may provide the plurality of common scale coefficients CSC to the input vector scaler 220.

The input vector scaler 220 may receive the input matrix XM. For example, the input vector scaler 220 may receive the plurality of input vectors (e.g., {right arrow over (X₁)} to {right arrow over (X_h)}) including the plurality of input elements.

The input vector scaler 220 may perform the quantization scaling for the input matrix XM based on the plurality of common scale coefficients CSC and the plurality of multiplication scale coefficients MSC. For example, the input vector scaler 220 may generate the plurality of quantization scaled input vectors QSX based on the plurality of input vectors. In this case, the plurality of quantization scaled input vectors QSX may correspond to the plurality of input vectors (e.g., {right arrow over (X₁)} to {right arrow over (X_h)}), respectively. Hereinafter, the quantization scaled input vectors QSX corresponding to {right arrow over (X₁)} to {right arrow over (X_h)} may be referred to as {right arrow over (QSX₁)} to {right arrow over (QSX_h)}, respectively.

In example embodiments, the plurality of quantization scaled input vectors QSX may be included in the quantization scaled input matrix. In this case, a row size of the quantization scaled input matrix may be an integer multiple of a row size of the input matrix XM, and a column size of the quantization scaled input matrix may be the same as a column size of the input matrix XM. However, example embodiments are not limited thereto.

Each of the plurality of quantization scaled input vectors QSX may be implemented as a row vector having a dimension ‘R’ times that of the corresponding input vector. For example, the first quantization scaled input vector (e.g., {right arrow over (QSX₁)}) corresponding to the first input vector (e.g., {right arrow over (X₁)}) may include “R×n” quantization scaled input elements QSIE as illustrated in Equation 21 below.

$\begin{matrix} \vec{{QSX}_{1}} ∋ (2^{0} \times s_{r 1} \times x_{1 1}), (2^{1} \times s_{r 1} \times x_{1 1}), \dots, (2^{R - 1} \times s_{r 1} \times x_{1 1}), \\ (2^{0} \times s_{r 2} \times x_{1 2}), (2^{1} \times s_{r 2} \times x_{1 2}), \dots, (2^{R - 1} \times s_{r 2} \times x_{1 2}), \\ \dots, \\ (2^{0} \times s_{rn} \times x_{1 n}), (2^{1} \times s_{rn} \times x_{1 n}), \dots, (2^{R - 1} \times s_{rn} \times x_{1 n}) & (Equation 21) \end{matrix}$

Referring to Equation 21, the plurality of quantization scaled input elements QSIE included in the first quantization scaled input vector {right arrow over (QSX₁)} may be generated by multiplying each of the plurality of input elements included in the first input vector with corresponding multiplication scale coefficient MSC and corresponding common scale coefficient CSC.
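
The construction of Equation 21 can be sketched as follows. The function name `quantization_scaled_vector` and the flat output ordering (all R scaled copies of the first input element, then those of the second, and so on) are assumptions made for illustration.

```python
def quantization_scaled_vector(x, s, resolution):
    """Build the R x n quantization scaled input elements of Equation 21.

    x: the n input elements x_11 .. x_1n of one input vector
    s: the n common scale coefficients s_r1 .. s_rn
    """
    return [
        (2 ** k) * s_i * x_i
        for x_i, s_i in zip(x, s)
        for k in range(resolution)
    ]


# n = 2 and R = 2 give R x n = 4 scaled elements
qsx1 = quantization_scaled_vector([1.0, 3.0], [0.5, 2.0], 2)
```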

In this way, each of the second to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₂)} to {right arrow over (QSX_h)}) may include ‘R×n’ quantization scaled input elements QSIE. For concise description, detailed description of the quantization scaled input elements QSIE included in each of the second to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₂)} to {right arrow over (QSX_h)}) will be omitted. The configuration and operation of the input vector scaler 220 will be described in more detail below with reference to FIGS. 27 to 34.

In example embodiments, the data type of the plurality of input elements may be floating point. In this case, the data type of each of the plurality of quantization scaled input elements QSIE may be floating point. However, example embodiments are not limited thereto.

In example embodiments, the code lengths of each of the plurality of input elements may be 16-bit or 32-bit. However, example embodiments are not limited thereto.

In example embodiments, the code lengths of each of the plurality of common scale coefficients CSC may be 16-bit or 32-bit. However, example embodiments are not limited thereto.

The first data type converter 230 may receive the plurality of quantization scaled input vectors QSX. For example, the first data type converter 230 may receive first to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₁)} to {right arrow over (QSX_h)}). Each of the first to (h)-th quantization scaled input vectors may include the plurality of quantization scaled input elements QSIE.

The first data type converter 230 may extract an exponent EXP from each of the plurality of quantization scaled input vectors QSX. For example, the first data type converter 230 may extract the first exponent from the first quantization scaled input vector {right arrow over (QSX₁)} and extract the second exponent from the second quantization scaled input vector {right arrow over (QSX₂)}. The first data type converter 230 may provide the extracted exponents EXP to the second data type converter 260.

The first data type converter 230 may convert the data type of the plurality of quantization scaled input vectors QSX to fixed point. For example, the first data type converter 230 may receive the plurality of quantization scaled input vectors QSX and output the plurality of fixed point quantization scaled input vectors QSXfxp. That is, the first data type converter 230 may generate the first to (h)-th fixed point quantization scaled input vectors (e.g., {right arrow over (QSX′₁)} to {right arrow over (QSX′_h)}), respectively, based on the first to (h)-th quantization scaled input vectors.

More specifically, the first data type converter 230 may convert the data type of each of the plurality of quantization scaled input elements QSIE included in the plurality of quantization scaled input vectors QSX to fixed point based on the extracted exponent. In this case, the first fixed point quantization scaled input vector (e.g., {right arrow over (QSX′₁)}) may include elements of Equation 21 described above converted into fixed point form (hereinafter, they may be referred to as the fixed point quantization scaled input elements QSIEfxp). The configuration and operation of the first data type converter 230 will be described in more detail below with reference to FIG. 35.
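
One possible way to model a per-vector exponent in software is sketched below. This is not the circuit of FIG. 35; the choice of the largest element exponent as the shared exponent, the `frac_bits` parameter, and the function names `to_fixed_point` and `to_floating_point` are all assumptions of this sketch.

```python
import math


def to_fixed_point(vector, frac_bits=8):
    """Assumed scheme: one shared exponent per vector (the largest one)."""
    shared_exp = max(math.frexp(v)[1] for v in vector)
    # Each element becomes an integer multiple of 2**(shared_exp - frac_bits).
    fixed = [round(math.ldexp(v, frac_bits - shared_exp)) for v in vector]
    return fixed, shared_exp


def to_floating_point(value_fxp, shared_exp, frac_bits=8):
    """Undo the scaling applied by to_fixed_point using the same exponent."""
    return math.ldexp(float(value_fxp), shared_exp - frac_bits)


fixed, exp = to_fixed_point([0.5, 1.0, 6.0, 12.0])
# Integer accumulation, then a single conversion back to floating point
result = to_floating_point(fixed[0] + fixed[1] + fixed[2] - fixed[3], exp)
```

Under these assumptions, the accumulation between the two conversions uses only integer additions and subtractions, and the exponent is applied once to the accumulated value.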

The quantization sign bit buffer 240 may store the plurality of quantization sign bits QSB provided from the uniform BCQ circuit UBC. The quantization sign bit buffer 240 may provide the plurality of quantization sign bits QSB to the processing element array 250.

The processing element array 250 may receive the plurality of quantization sign bits QSB and the plurality of fixed point quantization scaled input vectors QSXfxp. The processing element array 250 may generate the fixed point output matrix YMfxp based on the plurality of fixed point quantization scaled input elements QSIEfxp and the plurality of quantization sign bits QSB. The fixed point output matrix YMfxp may be expressed as Equation 22 below.

$\begin{matrix} YMfxp = [\begin{matrix} \vec{Y_{1}^{'}} \\ \vec{Y_{2}^{'}} \\ ⋮ \\ \vec{Y_{h}^{'}} \end{matrix}] = [\begin{matrix} y_{11}^{'} & y_{12}^{'} & \dots & y_{1 m}^{'} \\ y_{21}^{'} & y_{22}^{'} & \dots & y_{2 m}^{'} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ y_{h 1}^{'} & y_{h 2}^{'} & \dots & y_{hm}^{'} \end{matrix}] & (Equation 22) \end{matrix}$

Referring to Equation 22, YMfxp may represent the fixed point output matrix, and {right arrow over (Y′₁)} to {right arrow over (Y′_h)} may represent different output vectors having fixed point data types. Each of y′₁₁ to y′_hm may represent different output elements having fixed point data types.

The processing element array 250 may include the plurality of processing elements arranged in the row direction and the column direction. Each of the plurality of processing elements may calculate and output different fixed point output elements of Equation 22 described above. The configuration and operation of the processing element array 250 will be described in more detail below with reference to FIGS. 36 to 38.

The second data type converter 260 may receive a plurality of exponents EXP from the first data type converter 230. The second data type converter 260 may receive the fixed point output matrix YMfxp from the processing element array 250. For example, the second data type converter 260 may receive the plurality of fixed point output elements from the processing element array 250.

The second data type converter 260 may convert a data type of the fixed point output matrix YMfxp to floating point based on the plurality of exponents EXP. That is, the second data type converter 260 may output the output matrix YM having the floating point data type. For example, the second data type converter 260 may generate the plurality of output elements by converting the data type of each of the plurality of received fixed point output elements to floating point. The detailed configuration and operation of the second data type converter 260 will be described in more detail below with reference to FIG. 39.

FIG. 27 is a block diagram illustrating a configuration of the input vector scaler of FIG. 26 according to various example embodiments. Referring to FIGS. 1 to 4 and 23 to 27, the input vector scaler 220 may include a multiplication scaler SCL_multiple and a common scaler SCL_common. The multiplication scaler SCL_multiple may include first to (h)-th multiplication scaling circuits 221_1 to 221_h. The common scaler SCL_common may include first to (h)-th common scaling circuits 222_1 to 222_h.

Each of the first to (h)-th multiplication scaling circuits 221_1 to 221_h may receive different input vectors. For example, the first to (h)-th multiplication scaling circuits 221_1 to 221_h may receive the first to (h)-th input vectors (e.g., {right arrow over (X₁)} to {right arrow over (X_h)}), respectively.

That is, each of the first to (h)-th multiplication scaling circuits 221_1 to 221_h may sequentially receive the plurality of input elements for the corresponding input vector. For example, the first multiplication scaling circuit 221_1 may sequentially receive x₁₁ to x_1n, and the (h)-th multiplication scaling circuit 221_h may sequentially receive x_h1 to x_hn.

The multiplication scaler SCL_multiple may sequentially receive the plurality of multiplication scale coefficients MSC from the multiplication scale coefficient buffer 210. For example, each of the first to (h)-th multiplication scaling circuits 221_1 to 221_h may sequentially receive the plurality of multiplication scale coefficients MSC from the multiplication scale coefficient buffer 210. For a more detailed example, each of the first to (h)-th multiplication scaling circuits 221_1 to 221_h may receive the first-to-Rth multiplication scale coefficients MSC1 to MSCR (e.g., 2⁰ to 2^(R-1)).

Each of the first to (h)-th multiplication scaling circuits 221_1 to 221_h may perform multiplication scaling for the received input element based on the plurality of multiplication scale coefficients MSC. That is, the first to (h)-th multiplication scaling circuits 221_1 to 221_h may generate the first to (h)-th multiplication scaled input vectors (e.g., {right arrow over (MSX₁)} to {right arrow over (MSX_h)}), respectively. For example, the first multiplication scaling circuit 221_1 may generate the first multiplication scaled input vector (e.g., {right arrow over (MSX₁)}) based on the x₁₁ to x_1n and the first-to-Rth multiplication scale coefficients MSC1 to MSCR. In this case, each of the first to (h)-th multiplication scaled input vectors (e.g., {right arrow over (MSX₁)} to {right arrow over (MSX_h)}) may include the plurality of multiplication scaled input elements (hereinafter referred to as “MSIE”). The specific configuration and operation of each of the first to (h)-th multiplication scaling circuits 221_1 to 221_h are similar to the configuration and operation of the first multiplication scaling circuit 121 as described above with reference to FIGS. 9 to 10, and therefore, detailed description thereof will be omitted.

In example embodiments, the first to (h)-th multiplication scaling circuits 221_1 to 221_h may calculate the multiplication scaled input elements by increasing the exponent part EXPP of the input element, as described above with reference to FIG. 10. In this case, the computational amount of the first to (h)-th multiplication scaling circuits 221_1 to 221_h may be reduced or minimized.
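
The idea of scaling by a power of two through the exponent alone can be illustrated with the standard `math.ldexp` function, which multiplies a floating point value by 2^k without a general-purpose multiplication. The function name `multiplication_scale` is hypothetical, and the sketch models only the arithmetic effect, not the circuit of FIG. 10.

```python
import math


def multiplication_scale(x, resolution):
    """Return x * 2**0, x * 2**1, ..., x * 2**(R-1) via exponent adjustment."""
    return [math.ldexp(x, k) for k in range(resolution)]


msie = multiplication_scale(1.5, 3)
# The mantissa stays the same; only the exponent grows by one per step.
parts = [math.frexp(v) for v in msie]
```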

Each of the first to (h)-th common scaling circuits 222_1 to 222_h may receive different multiplication scaled input vectors. For example, the first to (h)-th common scaling circuits 222_1 to 222_h may receive the first to (h)-th multiplication scaled input vectors (e.g., {right arrow over (MSX₁)} to {right arrow over (MSX_h)}), respectively.

Each of the first to (h)-th common scaling circuits 222_1 to 222_h may sequentially receive the plurality of multiplication scaled input elements for the respective multiplication scaled input vector. For example, the first common scaling circuit 222_1 may sequentially receive multiplication scaled input elements MSIE11_1 to MSIE1n_R described above with reference to FIG. 9.

The common scaler SCL_common may sequentially receive the plurality of common scale coefficients CSC from the common scale coefficient buffer 270. For example, each of the first to (h)-th common scaling circuits 222_1 to 222_h may sequentially receive the plurality of common scale coefficients CSC from the common scale coefficient buffer 270.

The common scale coefficients CSC that each of the first to (h)-th common scaling circuits 222_1 to 222_h receives from the common scale coefficient buffer 270 may be the same. For example, the common scale coefficients CSC sequentially received by the first common scaling circuit 222_1 may be the same as the common scale coefficients CSC sequentially received by the second common scaling circuit 222_2.

The order in which each of the first to (h)-th common scaling circuits 222_1 to 222_h receives the common scale coefficients CSC may be the same. For example, the common scale coefficient CSC received first by the first common scaling circuit 222_1 may be the same as the common scale coefficient CSC received first by the second common scaling circuit 222_2. Similarly, the common scale coefficient CSC received second by the first common scaling circuit 222_1 may be the same as the common scale coefficient CSC received second by the second common scaling circuit 222_2.

Each of the first to (h)-th common scaling circuits 222_1 to 222_h may perform common scaling on the received multiplication scaled input element MSIE based on the plurality of common scale coefficients CSC. That is, each of the first to (h)-th common scaling circuits 222_1 to 222_h may generate the first to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₁)} to {right arrow over (QSX_h)}), respectively, based on the plurality of received multiplication scaled input elements MSIE and the plurality of common scale coefficients CSC. For example, the first common scaling circuit 222_1 may generate the first quantization scaled input vector (e.g., {right arrow over (QSX₁)}) based on the multiplication scaled input elements MSIE11_1 to MSIE1n_R and the plurality of common scale coefficients CSC described above with reference to FIG. 9. The operation of the first to (h)-th common scaling circuits 222_1 to 222_h will be described in more detail below with reference to FIG. 28.

FIG. 28 is a diagram illustrating in more detail the operation of the common scaling circuit of FIG. 27. Hereinafter, for a more concise description, the operation of the first common scaling circuit 222_1 will be representatively described with reference to FIGS. 1 to 4 and 23 to 28. However, example embodiments are not limited thereto, and the second to (h)-th common scaling circuits 222_2 to 222_h may also operate in a similar manner.

The first common scaling circuit 222_1 may receive the first multiplication scaled input vector (e.g., {right arrow over (MSX₁)}). That is, the first common scaling circuit 222_1 may sequentially receive the multiplication scaled input elements MSIE11_1 to MSIE1n_R.

The first common scaling circuit 222_1 may sequentially receive the plurality of common scale coefficients CSC. For example, the first common scaling circuit 222_1 may sequentially receive S_r1 to S_rn. In this case, each of S_r1 to S_rn may correspond to a different input element. For example, S_r1 to S_rn may correspond to x₁₁ to x_1n, respectively.

The first common scaling circuit 222_1 may perform the common scaling for the first multiplication scaled input vector (e.g., {right arrow over (MSX₁)}) based on the plurality of common scale coefficients CSC to generate the first quantization scaled input vector (e.g., {right arrow over (QSX₁)}). That is, the first common scaling circuit 222_1 may multiply each of the multiplication scaled input elements MSIE11_1 to MSIE1n_R by the corresponding common scale coefficient CSC to generate the plurality of quantization scaled input elements QSIE. For example, the first common scaling circuit 222_1 may multiply each of the multiplication scaled input elements MSIE11_1 to MSIE11_R by the common scale coefficient “S_r1” to generate the plurality of quantization scaled input elements QSIE for x₁₁ (e.g., “α_{1_r1}×x₁₁” to “α_{R_r1}×x₁₁”) (illustrated as diagonal stripe).

Similarly, the first common scaling circuit 222_1 may multiply each of the multiplication scaled input elements MSIE12_1 to MSIE12_R by the common scale coefficient “S_r2” to generate the plurality of quantization scaled input elements QSIE for x₁₂ (e.g., “α_{1_r2}×x₁₂” to “α_{R_r2}×x₁₂”) (illustrated as a dot pattern).

In this way, the first common scaling circuit 222_1 may sequentially calculate the plurality of quantization scaled input elements QSIE corresponding to x₁₃ to x_1n.

The first common scaling circuit 222_1 may sequentially output the plurality of quantization scaled input elements QSIE. For example, the first common scaling circuit 222_1 may provide the plurality of quantization scaled input elements QSIE to the first data type converter 230.

That is, according to various example embodiments of FIGS. 27 and 28, the input vector scaler 220 may generate the quantization scaled input vector by performing the multiplication scaling for each of the plurality of input vectors, and then performing the common scaling.
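By way of a non-limiting illustration, the order of operations of FIGS. 27 and 28 (multiplication scaling first, common scaling second) may be sketched as follows for one input vector. The function and parameter names are hypothetical, and plain Python floats stand in for whatever number format the circuits actually use.

```python
def multiplication_then_common_scaling(x, common_coeffs, mult_coeffs):
    """Return the quantization scaled input elements (QSIE) for one input vector.

    x             -- input elements x_11 .. x_1n
    common_coeffs -- common scale coefficients S_r1 .. S_rn (one per input element)
    mult_coeffs   -- multiplication scale coefficients MSC1 .. MSCR
    """
    qsie = []
    for x_j, s_j in zip(x, common_coeffs):
        # Multiplication scaling: R multiplication scaled input elements per x_j.
        msie = [x_j * m for m in mult_coeffs]
        # Common scaling: every MSIE derived from x_j shares the coefficient s_j.
        qsie.extend(e * s_j for e in msie)
    return qsie
```

The output is ordered as described above: the R elements for x₁₁ come first, followed by the R elements for x₁₂, and so on.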

FIG. 29 is a block diagram illustrating a configuration of the input vector scaler of FIG. 26 according to various example embodiments. Referring to FIGS. 1 to 4, 23 to 26, and 29, the input vector scaler 220 may include a common scaler SCL_common and a multiplication scaler SCL_multiple. The common scaler SCL_common may include first to (h)-th common scaling circuits 223_1 to 223_h. The multiplication scaler SCL_multiple may include first to (h)-th multiplication scaling circuits 224_1 to 224_h.

Each of the first to (h)-th common scaling circuits 223_1 to 223_h may receive different input vectors. For example, the first to (h)-th common scaling circuits 223_1 to 223_h may receive the first to (h)-th input vectors (e.g., {right arrow over (X₁)} to {right arrow over (X_h)}), respectively.

Each of the first to (h)-th common scaling circuits 223_1 to 223_h may sequentially receive the plurality of input elements for the corresponding input vector. For example, the first common scaling circuit 223_1 may sequentially receive x₁₁ to x_1n, and the (h)-th common scaling circuit 223_h may sequentially receive x_h1 to x_hn.

The common scaler SCL_common may sequentially receive the plurality of common scale coefficients CSC from the common scale coefficient buffer 270. For example, each of the first to (h)-th common scaling circuits 223_1 to 223_h may sequentially receive the plurality of common scale coefficients CSC from the common scale coefficient buffer 270.

The common scale coefficients CSC that each of the first to (h)-th common scaling circuits 223_1 to 223_h receives from the common scale coefficient buffer 270 may be the same. The order in which each of the first to (h)-th common scaling circuits 223_1 to 223_h receives the common scale coefficients CSC may be the same. The relationship between the common scale coefficients CSC received by each of the first to (h)-th common scaling circuits 223_1 to 223_h is similar to that described above with reference to FIG. 27, and therefore, detailed description thereof will be omitted.

Each of the first to (h)-th common scaling circuits 223_1 to 223_h may perform common scaling on the received input vector based on the plurality of common scale coefficients CSC. That is, the first to (h)-th common scaling circuits 223_1 to 223_h may generate the first to (h)-th common scaled input vectors (e.g., {right arrow over (CSX₁)} to {right arrow over (CSX_h)}), respectively, based on the plurality of received input elements and the plurality of common scale coefficients CSC. For example, the first common scaling circuit 223_1 may generate the plurality of common scaled input elements (hereinafter referred to as “CSIE”) included in the first common scaled input vector (e.g., {right arrow over (CSX₁)}) by multiplying x₁₁ to x_1n by different common scale coefficients CSC, respectively. The operation of the first to (h)-th common scaling circuits 223_1 to 223_h will be described in more detail below with reference to FIG. 30.

Each of the first to (h)-th multiplication scaling circuits 224_1 to 224_h may receive different common scaled input vectors. For example, the first to (h)-th multiplication scaling circuits 224_1 to 224_h may receive the first to (h)-th common scaled input vectors (e.g., {right arrow over (CSX₁)} to {right arrow over (CSX_h)}), respectively.

Each of the first to (h)-th multiplication scaling circuits 224_1 to 224_h may sequentially receive the plurality of common scaled input elements for the corresponding common scaled input vector. For example, the first multiplication scaling circuit 224_1 may sequentially receive the plurality of common scaled input elements CSIE included in the first common scaled input vector (e.g., {right arrow over (CSX₁)}).

The multiplication scaler SCL_multiple may sequentially receive the plurality of multiplication scale coefficients MSC from the multiplication scale coefficient buffer 210. For example, each of the first to (h)-th multiplication scaling circuits 224_1 to 224_h may sequentially receive the plurality of multiplication scale coefficients MSC from the multiplication scale coefficient buffer 210. For a more detailed example, each of the first to (h)-th multiplication scaling circuits 224_1 to 224_h may receive the first-to-Rth multiplication scale coefficients MSC1 to MSCR (e.g., 2⁰ to 2^{R-1}).

Each of the first to (h)-th multiplication scaling circuits 224_1 to 224_h may perform multiplication scaling on the received common scaled input vector based on the plurality of multiplication scale coefficients MSC. That is, the first to (h)-th multiplication scaling circuits 224_1 to 224_h may perform the multiplication scaling on the first to (h)-th common scaled input vectors to generate the first to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₁)} to {right arrow over (QSX_h)}), respectively.

For example, the first multiplication scaling circuit 224_1 may generate the first quantization scaled input vector (e.g., {right arrow over (QSX₁)}) by multiplying each of the plurality of common scaled input elements CSIE included in the first common scaled input vector (e.g., {right arrow over (CSX₁)}) by the first-to-Rth multiplication scale coefficients MSC1 to MSCR. The operation of the first to (h)-th multiplication scaling circuits 224_1 to 224_h will be described in more detail below with reference to FIG. 31.

FIG. 30 is a diagram illustrating in more detail the operation of the common scaling circuit of FIG. 29. Hereinafter, for a more concise description, the operation of the first common scaling circuit 223_1 will be representatively described with reference to FIGS. 1 to 4, 23 to 26, and 29 and 30. However, example embodiments are not limited thereto, and the second to (h)-th common scaling circuits 223_2 to 223_h may also operate in a similar manner.

The first common scaling circuit 223_1 may receive the first input vector (e.g., {right arrow over (X₁)}). That is, the first common scaling circuit 223_1 may sequentially receive x₁₁ to x_1n.

The first common scaling circuit 223_1 may sequentially receive the plurality of common scale coefficients CSC. For example, the first common scaling circuit 223_1 may sequentially receive S_r1 to S_rn.

The first common scaling circuit 223_1 may generate the first common scaled input vector (e.g., {right arrow over (CSX₁)}) by performing the common scaling on the first input vector (e.g., {right arrow over (X₁)}) based on the plurality of common scale coefficients CSC. For example, the first common scaling circuit 223_1 may generate common scaled input elements CSIE11 to CSIE1n by multiplying x₁₁ to x_1n by S_r1 to S_rn, respectively.

FIG. 31 is a diagram illustrating in more detail an operation of a multiplication scaling circuit of FIG. 29. Hereinafter, for a more concise description, the operation of the first multiplication scaling circuit 224_1 will be representatively described with reference to FIGS. 1 to 4, 23 to 26, and 29 and 31. However, example embodiments are not limited thereto, and the second to (h)-th multiplication scaling circuits 224_2 to 224_h may also operate in a similar manner.

The first multiplication scaling circuit 224_1 may receive the first common scaled input vector (e.g., {right arrow over (CSX₁)}). That is, the first multiplication scaling circuit 224_1 may sequentially receive the common scaled input elements CSIE11 to CSIE1n described above with reference to FIG. 30. The first multiplication scaling circuit 224_1 may sequentially receive the first-to-Rth multiplication scale coefficients MSC1 to MSCR.

The first multiplication scaling circuit 224_1 may generate the plurality of quantization scaled input elements QSIE by multiplying each of the received common scaled input elements CSIE11 to CSIE1n with the first-to-Rth multiplication scale coefficients MSC1 to MSCR.

For example, the first multiplication scaling circuit 224_1 may generate the plurality of quantization scaled input elements QSIE for x₁₁ by multiplying the common scaled input element CSIE11 by each of the first-to-Rth multiplication scale coefficients MSC1 to MSCR. In more detail, the first multiplication scaling circuit 224_1 may generate the plurality of quantization scaled input elements QSIE respectively corresponding to “α_{1_r1}×x₁₁” to “α_{R_r1}×x₁₁” by multiplying the common scaled input element CSIE11 by each of 2⁰ to 2^{R-1} (illustrated as diagonal stripe).

Similarly, the first multiplication scaling circuit 224_1 may generate the plurality of quantization scaled input elements QSIE for x₁₂ by multiplying the common scaled input element CSIE12 by each of the first-to-Rth multiplication scale coefficients MSC1 to MSCR (illustrated as a dot pattern).

In this way, the first multiplication scaling circuit 224_1 may sequentially calculate the plurality of quantization scaled input elements QSIE corresponding to x₁₃ to x_1n.

In example embodiments, the first to (h)-th multiplication scaling circuits 224_1 to 224_h may calculate the quantization scaled input elements by increasing the exponent part EXPP of the common scaled input element, in a manner similar to that described above with reference to FIG. 10. In this case, the computational amount of the first to (h)-th multiplication scaling circuits 224_1 to 224_h may be reduced or minimized.

The first multiplication scaling circuit 224_1 may sequentially output the plurality of quantization scaled input elements QSIE. For example, the first multiplication scaling circuit 224_1 may provide the plurality of quantization scaled input elements QSIE to the first data type converter 230.

That is, according to various example embodiments of FIGS. 29 to 31, the input vector scaler 220 may perform the common scaling on each of the plurality of input vectors and then perform the multiplication scaling to generate the plurality of quantization scaled input vectors QSX.
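By way of a non-limiting illustration, the order of operations of FIGS. 29 to 31 (common scaling first, multiplication scaling second) may be sketched as follows for one input vector. The function and parameter names are hypothetical, and plain Python floats stand in for the actual number format.

```python
def common_then_multiplication_scaling(x, common_coeffs, mult_coeffs):
    """Return the quantization scaled input elements (QSIE) for one input vector.

    x             -- input elements x_11 .. x_1n
    common_coeffs -- common scale coefficients S_r1 .. S_rn (one per input element)
    mult_coeffs   -- multiplication scale coefficients MSC1 .. MSCR
    """
    # Common scaling: one common scaled input element (CSIE) per input element.
    csie = [x_j * s_j for x_j, s_j in zip(x, common_coeffs)]
    # Multiplication scaling: each CSIE is expanded into R QSIE.
    return [c * m for c in csie for m in mult_coeffs]
```

In this ordering only n common-scaling multiplications are performed per input vector, and the remaining n×R operations are multiplications by the multiplication scale coefficients.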

FIG. 32 is a block diagram illustrating a configuration of the input vector scaler of FIG. 26 according to various example embodiments. Referring to FIGS. 1 to 4, 23 to 26, and 32, the input vector scaler 220 may include a multiplication scaling circuit 225 and the quantization scaler SCL_quantization. The quantization scaler SCL_quantization may include first to (h)-th quantization scaling circuits 226_1 to 226_h.

The multiplication scaling circuit 225 may sequentially receive the plurality of multiplication scale coefficients MSC from the multiplication scale coefficient buffer 210. For example, the multiplication scaling circuit 225 may sequentially receive the first-to-Rth multiplication scale coefficients MSC1 to MSCR (e.g., 2⁰ to 2^{R-1}).

The multiplication scaling circuit 225 may sequentially receive the plurality of common scale coefficients CSC from the common scale coefficient buffer 270. For example, the multiplication scaling circuit 225 may sequentially receive S_r1 to S_rn.

The multiplication scaling circuit 225 may generate the plurality of quantization scale coefficients QSC based on the plurality of multiplication scale coefficients MSC and the plurality of common scale coefficients CSC. For example, the multiplication scaling circuit 225 may generate the plurality of quantization scale coefficients QSC by multiplying each of the plurality of common scale coefficients CSC by each of the plurality of multiplication scale coefficients MSC. The operation of the multiplication scaling circuit 225 is described in more detail below with reference to FIG. 33.

The quantization scaler SCL_quantization may receive the plurality of quantization scale coefficients QSC. For example, each of the first to (h)-th quantization scaling circuits 226_1 to 226_h may receive the plurality of quantization scale coefficients QSC from the multiplication scaling circuit 225. In this case, the quantization scale coefficients QSC received by each of the first to (h)-th quantization scaling circuits 226_1 to 226_h may be the same.

Each of the first to (h)-th quantization scaling circuits 226_1 to 226_h may receive different input vectors. For example, the first to (h)-th quantization scaling circuits 226_1 to 226_h may receive the first to (h)-th input vectors (e.g., {right arrow over (X₁)} to {right arrow over (X_h)}), respectively. That is, each of the first to (h)-th quantization scaling circuits 226_1 to 226_h may sequentially receive the plurality of input elements for the corresponding input vector. For example, the first quantization scaling circuit 226_1 may sequentially receive x₁₁ to x_1n.

Each of the first to (h)-th quantization scaling circuits 226_1 to 226_h may perform quantization scaling on the received input vector based on the plurality of quantization scale coefficients QSC. For example, the first to (h)-th quantization scaling circuits 226_1 to 226_h may generate the first to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₁)} to {right arrow over (QSX_h)}), respectively. The operation of the first to (h)-th quantization scaling circuits 226_1 to 226_h will be described in more detail below with reference to FIG. 34.

FIG. 33 is a diagram illustrating in more detail an operation of a multiplication scaling circuit of FIG. 32. Referring to FIGS. 1 to 4, 23 to 26, and 32 and 33, the multiplication scaling circuit 225 may sequentially receive the first-to-Rth multiplication scale coefficients MSC1 to MSCR (e.g., 2⁰ to 2^{R-1}). The multiplication scaling circuit 225 may sequentially receive S_r1 to S_rn.

The multiplication scaling circuit 225 may generate the plurality of quantization scale coefficients QSC (e.g., α_{1_r1} to α_{R_rn}) by multiplying each of S_r1 to S_rn by each of the multiplication scale coefficients MSC.

For example, the multiplication scaling circuit 225 may generate the quantization scale coefficients QSC (e.g., α_{1_r1} to α_{R_r1}) for the first row vector (e.g., {right arrow over (w_r1)}) of the weight matrix WM by multiplying S_r1 by the first-to-Rth multiplication scale coefficients MSC1 to MSCR (illustrated as diagonal stripe).

Similarly, the multiplication scaling circuit 225 may generate the quantization scale coefficients QSC (e.g., α_{1_r2} to α_{R_r2}) for the second row vector (e.g., {right arrow over (w_r2)}) of the weight matrix WM by multiplying S_r2 by the first-to-Rth multiplication scale coefficients MSC1 to MSCR (illustrated as a dot pattern).

In this way, the multiplication scaling circuit 225 may sequentially calculate the plurality of quantization scale coefficients QSC corresponding to S_r3 to S_rn.

In example embodiments, the multiplication scaling circuit 225 may calculate the quantization scale coefficient QSC by increasing the exponent part EXPP of the common scale coefficient CSC, similar to that described above with reference to FIG. 10.

FIG. 34 is a diagram illustrating in more detail the operation of the quantization scaling circuit of FIG. 32. Hereinafter, for a more concise description, the operation of the first quantization scaling circuit 226_1 will be representatively described with reference to FIGS. 1 to 4, 23 to 26, and 32 to 34. However, example embodiments are not limited thereto, and the second to (h)-th quantization scaling circuits 226_2 to 226_h may also operate in a similar manner.

The first quantization scaling circuit 226_1 may receive the first input vector (e.g., {right arrow over (X₁)}). That is, the first quantization scaling circuit 226_1 may sequentially receive x₁₁to x_1n.

The first quantization scaling circuit 226_1 may receive the plurality of quantization scale coefficients QSC. For example, the first quantization scaling circuit 226_1 may sequentially receive α_{1_r1} to α_{R_rn} described above with reference to FIG. 33.

The first quantization scaling circuit 226_1 may perform the quantization scaling on the plurality of input elements based on the plurality of quantization scale coefficients QSC. For example, the first quantization scaling circuit 226_1 may generate the plurality of quantization scaled input elements QSIE by multiplying each of the plurality of quantization scale coefficients QSC by the corresponding input element.

For example, the first quantization scaling circuit 226_1 may generate the plurality of quantization scaled input elements QSIE for x₁₁ by multiplying α_{1_r1} to α_{R_r1} by x₁₁ (illustrated as diagonal stripe).

Similarly, the first quantization scaling circuit 226_1 may generate the plurality of quantization scaled input elements QSIE for x₁₂ by multiplying α_{1_r2} to α_{R_r2} by x₁₂ (illustrated as a dot pattern).

In this way, the first quantization scaling circuit 226_1 may generate the plurality of quantization scaled input elements QSIE based on x₁₃ to x_1n.

That is, according to various example embodiments of FIGS. 32 to 34, the input vector scaler 220 may generate the plurality of quantization scaled input vectors by generating the plurality of quantization scale coefficients, and then performing quantization scaling for each of the plurality of input vectors based on the plurality of generated quantization scale coefficients. In this case, unlike various example embodiments described above with reference to FIGS. 27 to 31, the input vector scaler 220 may include only one multiplication scaling circuit. However, example embodiments are not limited thereto.
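By way of a non-limiting illustration, the approach of FIGS. 32 to 34 may be sketched as follows: the quantization scale coefficients QSC are computed once and then shared by every input vector. The function and parameter names are hypothetical, and plain Python floats stand in for the actual number format.

```python
def quantization_scale_coefficients(common_coeffs, mult_coeffs):
    """Compute QSC once: alpha_{k_rj} = S_rj * MSC_k, ordered alpha_{1_r1} .. alpha_{R_rn}."""
    return [s * m for s in common_coeffs for m in mult_coeffs]


def quantization_scaling(x, qsc, R):
    """Scale one input vector with precomputed QSC.

    x   -- input elements of one input vector
    qsc -- output of quantization_scale_coefficients (n * R values)
    R   -- number of multiplication scale coefficients
    """
    # Input element j is paired with the R coefficients alpha_{1_rj} .. alpha_{R_rj}.
    return [qsc[j * R + r] * x_j for j, x_j in enumerate(x) for r in range(R)]
```

A single coefficient list can then be reused for every input vector, which mirrors the single multiplication scaling circuit 225 feeding the first to (h)-th quantization scaling circuits:

```python
qsc = quantization_scale_coefficients([0.5, 0.25], [1, 2])
qsx_1 = quantization_scaling([1.0, 2.0], qsc, R=2)
qsx_2 = quantization_scaling([4.0, 8.0], qsc, R=2)
```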

Referring to FIGS. 27 to 34, the input vector scaler 220 may generate the plurality of quantization scaled input elements (e.g., quantization scaled input elements QSIE illustrated as diagonal stripe) based on one input element (e.g., x₁₁). In other words, the input vector scaler 220 may generate the plurality of quantization scaled input elements QSIE by using one input element repeatedly. This may be achieved through the use of a buffer or memory. Therefore, according to various example embodiments, the input reuse of the matrix multiplier 200 may be increased or maximized, and thus, the number of times the matrix multiplier 200 receives input elements from the outside may be reduced or minimized. In this case, the number of times the matrix multiplier 200 accesses an external memory device where the input elements are stored may be reduced or minimized, so the operation efficiency and operating speed of the matrix multiply device MMD may be improved.

FIG. 35 is a block diagram illustrating a configuration of the first data type converter of FIG. 26. Referring to FIGS. 1, 4, and 23 to 35, the first data type converter 230 may include first to (h)-th exponent extraction circuits 231_1 to 231_h and first to (h)-th data type conversion circuits 232_1 to 232_h.

Each of the first to (h)-th exponent extraction circuits 231_1 to 231_h may receive different quantization scaled input vectors QSX. For example, the first to (h)-th exponent extraction circuits 231_1 to 231_h may receive the first to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₁)} to {right arrow over (QSX_h)}), respectively.

Each of the first to (h)-th exponent extraction circuits 231_1 to 231_h may extract the exponent from the plurality of quantization scaled input elements QSIE included in the received quantization scaled input vector QSX. For example, the first exponent extraction circuit 231_1 may sequentially receive the plurality of quantization scaled input elements QSIE included in the first quantization scaled input vector {right arrow over (QSX₁)}. The first exponent extraction circuit 231_1 may extract the first exponent EXP1 from the plurality of quantization scaled input elements QSIE included in the received first quantization scaled input vector {right arrow over (QSX₁)}. In this way, the first to (h)-th exponent extraction circuits 231_1 to 231_h may extract first to (h)-th exponents EXP1 to EXPh, respectively.

The first to (h)-th exponent extraction circuits 231_1 to 231_h may provide the extracted exponents to the first to (h)-th data type conversion circuits 232_1 to 232_h, respectively. In addition, each of the first to (h)-th exponent extraction circuits 231_1 to 231_h may provide the extracted exponent to the second data type converter 260.

A detailed method by which each of the first to (h)-th exponent extraction circuits 231_1 to 231_h extracts the exponent from the received elements is similar to the operation of the exponent extraction circuit described above with reference to FIGS. 11 and 12, so that a detailed description thereof is omitted.

The first to (h)-th data type conversion circuits 232_1 to 232_h may receive the first to (h)-th exponents EXP1 to EXPh, respectively. The first to (h)-th data type conversion circuits 232_1 to 232_h may receive the first to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₁)} to {right arrow over (QSX_h)}), respectively.

Each of the first to (h)-th data type conversion circuits 232_1 to 232_h may convert a data type of the received quantization scaled input vector into fixed point based on the received exponent. That is, the first to (h)-th data type conversion circuits 232_1 to 232_h may output the first to (h)-th fixed point quantization scaled input vectors (e.g., {right arrow over (QSX′₁)} to {right arrow over (QSX′_h)}), respectively. For example, the first data type conversion circuit 232_1 may convert each of the quantization scaled input elements QSIE included in the first quantization scaled input vector (e.g., {right arrow over (QSX₁)}) into the fixed point format based on the first exponent EXP1.

A specific method by which each of the first to (h)-th data type conversion circuits 232_1 to 232_h converts the data type of the quantization scaled input element QSIE to fixed point is similar to the data type conversion operation described above with reference to FIGS. 11 and 13, so that a detailed description thereof is omitted.
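By way of a non-limiting illustration, a conversion in which all elements of one vector share a single exponent may be sketched as follows. The details of FIGS. 11 to 13 are not reproduced here, so the sketch rests on assumptions that are not taken from the description: the shared exponent is taken to be the largest binary exponent among the elements, the fixed point word carries a hypothetical number of fractional bits (frac_bits), and all function names are hypothetical. The inverse function illustrates why the extracted exponent may also be provided to the second data type converter 260.

```python
import math


def extract_shared_exponent(elements):
    """Assumed rule: the largest binary exponent among the non-zero elements."""
    exps = [math.frexp(e)[1] for e in elements if e != 0.0]
    return max(exps) if exps else 0


def to_fixed_point(elements, shared_exp, frac_bits=8):
    """Express every element as an integer multiple of 2**(shared_exp - frac_bits)."""
    scale = 2.0 ** (frac_bits - shared_exp)
    return [int(round(e * scale)) for e in elements]


def from_fixed_point(values, shared_exp, frac_bits=8):
    """Inverse conversion, using the same shared exponent."""
    scale = 2.0 ** (shared_exp - frac_bits)
    return [v * scale for v in values]
```

Because every element of the vector uses the same scale, the resulting integers can be added and subtracted directly, and the exponent only needs to be reapplied once when converting the result back.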

FIG. 36 is a block diagram illustrating in more detail a configuration of a processing element array of FIG. 26. Referring to FIGS. 1 to 4 and 23 to 36, the processing element array 250 may include a plurality of processing elements PE arranged in a row direction and a column direction. Hereinafter, for a more concise description, it is assumed that the plurality of processing elements PE are arranged in (h) rows and (m) columns. In addition, the processing element arranged in the (i)-th row and (j)-th column of the processing element array 250 will be referred to as “PEij”. For example, the processing element arranged in the first row and second column of the processing element array 250 will be referred to as “PE12.”

The processing element array 250 may include the first to (h)-th processing element rows PER1 to PERh. Each of the first to (h)-th processing element rows PER1 to PERh may include the plurality of processing elements PE. For example, the first processing element row PER1 may include processing elements PE11 to PE1m.

The processing element array 250 may include first to (m)-th processing element columns PEC1 to PECm. Each of the first to (m)-th processing element columns PEC1 to PECm may include the plurality of processing elements PE. For example, the first processing element column PEC1 may include processing elements PE11 to PEh1.

Different processing element rows may receive different fixed point quantization scaled input vectors QSXfxp. For example, the first to (h)-th processing element rows PER1 to PERh may receive the first to (h)-th fixed point quantization scaled input vectors (e.g., {right arrow over (QSX′₁)} to {right arrow over (QSX′_h)}) respectively.

The processing elements included in the same processing element row may receive the same fixed point quantization scaled input vector QSXfxp. For example, each of the processing elements PE11 to PE1m may receive the first fixed point quantization scaled input vector (e.g., {right arrow over (QSX′₁)}).

Different processing element columns may receive a plurality of different quantization sign bits QSBs. For example, the first to (m)-th processing element columns PEC1 to PECm may receive the first to (m)-th plurality of quantization sign bits QSBs_1 to QSBs_m, respectively.

Processing elements arranged in the same processing element column may receive the same plurality of quantization sign bits QSBs. For example, each of the processing elements PE11 to PEh1 may receive the first plurality of quantization sign bits QSBs_1, and each of the processing elements PE12 to PEh2 may receive the second plurality of quantization sign bits QSBs_2.

Each of the plurality of processing elements PE may calculate different fixed point output elements based on the received plurality of fixed point quantization scaled input elements QSIEfxp and the plurality of quantization sign bits QSBs. That is, according to some example embodiments, one processing element PE may calculate one fixed point output element. For example, the processing element PEij may calculate y′_ij. Hereinafter, the fixed point output element calculated in each processing element PE will be described in more detail.

Each of the first to (h)-th processing element rows PER1 to PERh may calculate different fixed point output vectors. For example, the first to (h)-th processing element rows PER1 to PERh may output the first to (h)-th fixed point output vectors (e.g., {right arrow over (Y′₁)} to {right arrow over (Y′_h)}) respectively.

The processing elements arranged in the same processing element row and different processing element columns may calculate different fixed point output elements. For example, the processing elements PE11 to PE1m may calculate y′₁₁ to y′_1m, respectively. In a similar manner, the processing elements PE21 to PE2m may calculate y′₂₁ to y′_2m, respectively, and the processing elements PEh1 to PEhm may calculate y′_h1 to y′_hm, respectively.

FIG. 37 is a block diagram illustrating in more detail an operation of the processing elements of FIG. 36. Hereinafter, the operation of the first processing element row PER1 will be representatively described with reference to FIGS. 1 to 4 and 23 to 37. However, example embodiments are not limited thereto, and the second to (h)-th processing element rows PER2 to PERh may also operate in a similar manner.

The first processing element row PER1 may receive the first fixed point quantization scaled input vector (e.g., {right arrow over (QSX′₁)}). For example, the first processing element row PER1 may sequentially receive the plurality of fixed point quantization scaled input elements QSIEfxp. That is, the first processing element row PER1 may sequentially receive, in fixed point format, the plurality of quantization scaled input elements QSIE described above with reference to FIGS. 28, 31, and 34.

The processing elements PE11 to PE1m may receive the first plurality of quantization sign bits QSBs_1 to the (m)-th plurality of quantization sign bits QSBs_m, respectively. For example, the processing element PE11 may receive the first plurality of quantization sign bits QSBs_1, and the processing element PE12 may receive the second plurality of quantization sign bits QSBs_2.

The first plurality of quantization sign bits QSBs_1 may include quantization sign bits QSBs arranged in a first column of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R described above with reference to FIG. 25. For example, the first plurality of quantization sign bits QSBs_1 may be the quantization sign bits arranged in the first column of each of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R (e.g., b_{1_1_r1} to b_{1_R_rn}).

More specifically, the processing element PE11 may sequentially receive the quantization sign bits arranged in the first row and first column of each of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R (e.g., b_{1_1_r1} to b_{1_R_r1}), and then sequentially receive the quantization sign bits arranged in the second row and first column of each of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R (e.g., b_{1_1_r2} to b_{1_R_r2}). In this way, the processing element PE11 may finally receive, in sequence, the quantization sign bits arranged in the (n)-th row and first column of each of the first to (R)-th quantization sign bit matrices QSBM_1 to QSBM_R (e.g., b_{1_1_rn} to b_{1_R_rn}).

That is, the processing element PE11 may sequentially receive the pair of quantization sign bit QSB and fixed point quantization scaled input element QSIEfxp corresponding to each other.
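The receiving order described above may be illustrated with a short sketch. The nested-list layout `qsbm[k][i][j]`, the function name `sign_bit_stream`, and the example bit values are assumptions made only for illustration and do not appear in the drawings.

```python
def sign_bit_stream(qsbm, column):
    """Flatten the sign bits of one column into the order a PE receives them.

    qsbm[k][i][j] is the bit in row i, column j of the (k+1)-th quantization
    sign bit matrix (0-based indices; this storage layout is an assumption).
    """
    num_matrices = len(qsbm)    # R
    num_rows = len(qsbm[0])     # n
    stream = []
    for i in range(num_rows):           # row 1, then row 2, ..., row n
        for k in range(num_matrices):   # QSBM_1, QSBM_2, ..., QSBM_R
            stream.append(qsbm[k][i][column])
    return stream


# Two 3x2 sign bit matrices (R = 2, n = 3, m = 2); bit values are arbitrary.
QSBM = [
    [[0, 1], [1, 1], [0, 0]],
    [[1, 0], [0, 1], [1, 0]],
]
stream_pe11 = sign_bit_stream(QSBM, column=0)
```

Under these assumptions, the stream for the first column has n×R entries, one for each fixed point quantization scaled input element QSIEfxp received by the processing element PE11.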

The processing element PE11 may calculate the fixed point output element (e.g., y′₁₁) based on the order in which the quantization sign bits QSB and the plurality of fixed point quantization scaled input elements QSIEfxp are received. For example, the processing element PE11 may calculate the fixed point output element y′₁₁ by subtracting a second value from a first value, where the first value is obtained by summing the fixed point quantization scaled input elements QSIEfxp corresponding to quantization sign bits QSB representing ‘0’, and the second value is obtained by summing the fixed point quantization scaled input elements QSIEfxp corresponding to quantization sign bits QSB representing ‘1’.

In this way, the processing element PE1j may receive the j-th plurality of quantization sign bits QSBs_j (e.g., b_{1_1_cj} to b_{n_R_cj}). The processing element PE1j may receive the plurality of fixed point quantization scaled input elements QSIEfxp included in the first fixed point quantization scaled input vector (e.g., {right arrow over (QSX′₁)}). The processing element PE1j may calculate the fixed point output element “y′_1j” based on the j-th plurality of quantization sign bits QSBs_j and the first fixed point quantization scaled input vector (e.g., {right arrow over (QSX′₁)}).

For a more detailed example, the processing element PE may calculate the fixed point output element OEfxp according to Equation 23 below.

$\begin{matrix} OEfxp = \sum QSIEfxp \times {(- 1)}^{QSB} & (Equation 23) \end{matrix}$

Referring to Equation 23, OEfxp may represent the fixed point output element, and QSIEfxp and QSB may represent the pair of quantization sign bit QSB and fixed point quantization scaled input element QSIEfxp provided to the processing element PE. That is, the processing element PE may calculate the fixed point output element by sequentially performing calculations of adding or subtracting the value of the fixed point quantization scaled input element QSIEfxp based on the received quantization sign bit QSB. However, example embodiments are not limited thereto.
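Equation 23 may be sketched as follows. The function names and the integer example values are assumptions introduced for illustration. The second function shows that the signed sum of Equation 23 is equal to the value obtained by subtracting the sum of the elements paired with ‘1’ from the sum of the elements paired with ‘0’.

```python
def fixed_point_output_element(qsie_fxp, qsb):
    """Equation 23: sum of QSIEfxp * (-1)**QSB over the received pairs."""
    if len(qsie_fxp) != len(qsb):
        raise ValueError("each input element needs exactly one sign bit")
    return sum(x * (-1) ** b for x, b in zip(qsie_fxp, qsb))


def sum_zero_minus_sum_one(qsie_fxp, qsb):
    """Same value, written as (sum where QSB = 0) - (sum where QSB = 1)."""
    first_value = sum(x for x, b in zip(qsie_fxp, qsb) if b == 0)
    second_value = sum(x for x, b in zip(qsie_fxp, qsb) if b == 1)
    return first_value - second_value


elements = [5, -3, 12, 7]   # illustrative integer-valued QSIEfxp
bits = [0, 1, 1, 0]         # illustrative QSB stream
result = fixed_point_output_element(elements, bits)   # 5 + 3 - 12 + 7
```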

FIG. 38 is a diagram illustrating a configuration of one of the processing elements of FIG. 36 implemented according to various example embodiments. Referring to FIGS. 1 to 4 and 23 to 38, the processing element PE may include an arithmetic logic unit ALU and an accumulation register REG_ACC.

The arithmetic logic unit ALU may include first to third input terminals TI1 to TI3 and an output terminal TO. The first input terminal TI1 may sequentially receive a plurality of fixed point quantization scaled input elements QSIEfxp. The second input terminal TI2 may sequentially receive a plurality of quantization sign bits QSB. The third input terminal TI3 may be connected to the accumulation register REG_ACC.

In example embodiments, each fixed point quantization scaled input element QSIEfxp may have a code length of 8-bit, and the arithmetic logic unit ALU may be configured to receive data in 8-bit units through the first input terminal TI1.

When the value received by the second input terminal TI2 is ‘0’, the arithmetic logic unit ALU may store the sum of the values received by the first input terminal TI1 and the third input terminal TI3 in the accumulation register REG_ACC through the output terminal TO. On the other hand, when the value received by the second input terminal TI2 is ‘1’, the arithmetic logic unit ALU may store, in the accumulation register REG_ACC, a value obtained by subtracting the value provided to the first input terminal TI1 from the value provided to the third input terminal TI3. That is, when receiving a quantization sign bit QSB equal to ‘0’ at the second input terminal TI2, the arithmetic logic unit ALU may calculate the sum of the values input into the first and third input terminals TI1 and TI3 and output the sum through the output terminal TO. Similarly, when receiving a quantization sign bit QSB equal to ‘1’ at the second input terminal TI2, the arithmetic logic unit ALU may subtract the value input into the first input terminal TI1 from the value input into the third input terminal TI3 and output the resulting value through the output terminal TO. The accumulation register REG_ACC may store the most recently received value output from the arithmetic logic unit ALU and feed that value back to the third input terminal TI3.
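The add-or-subtract behavior of the arithmetic logic unit ALU and the feedback through the accumulation register REG_ACC may be modeled behaviorally as below. This is a minimal software sketch, not a circuit description. The class name, the method name `step`, and the assumption that the register is cleared before the first pair is received are illustrative.

```python
class ProcessingElement:
    """Behavioral model of one PE: an add/subtract ALU plus REG_ACC."""

    def __init__(self):
        self.reg_acc = 0  # accumulation register, assumed cleared at start

    def step(self, ti1, ti2):
        """One cycle: ti1 is a QSIEfxp value, ti2 is its sign bit QSB."""
        ti3 = self.reg_acc      # value fed back from the accumulation register
        if ti2 == 0:
            to = ti3 + ti1      # QSB '0': add
        elif ti2 == 1:
            to = ti3 - ti1      # QSB '1': subtract
        else:
            raise ValueError("QSB must be 0 or 1")
        self.reg_acc = to       # register stores the latest ALU output
        return to


pe = ProcessingElement()
for value, bit in zip([5, -3, 12, 7], [0, 1, 1, 0]):
    pe.step(value, bit)
# pe.reg_acc now holds the accumulated fixed point output element.
```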

In this way, the processing element PE may be able to store the fixed point output element OEfxp calculated according to Equation 23 described above in the accumulation register REG_ACC. In this case, the fixed point output element OEfxp stored in the accumulation register REG_ACC may be provided to the second data type converter 260. However, example embodiments are not limited to the specific method in which the processing element PE performs the calculation and the specific configuration method of the processing element PE.

In example embodiments, the fixed point output element OEfxp may have a code length of 8-bit or more. For example, the fixed point output element OEfxp may have a code length long enough to represent the accumulated size of the plurality of fixed point quantization scaled input elements QSIEfxp.

In example embodiments, the accumulation register REG_ACC may have a size of 8-bit or more. For example, the accumulation register REG_ACC may have a size large enough to store the fixed point output element OEfxp.

In example embodiments, the fixed point output element OEfxp may have a code length of 10-bit to 12-bit or more. However, example embodiments are not limited thereto.

In example embodiments, the accumulation register REG_ACC may have a size of 10-bit to 12-bit. However, example embodiments are not limited thereto.
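As a rough guide to how the register size relates to the number of accumulated elements, the following sketch computes a conservative worst-case width. It assumes two's-complement signed inputs and that every input may take its extreme value. Neither assumption is stated in the example embodiments, and a practical design may use fewer bits when the value range is known to be narrower.

```python
def accumulator_bits(input_bits, num_terms):
    """Smallest two's-complement width that holds any +/- sum of the inputs."""
    # Largest magnitude of one signed input, e.g. 128 for an 8-bit input.
    max_magnitude = num_terms * (1 << (input_bits - 1))
    # bit_length() bits for the magnitude, plus one bit for the sign.
    return max_magnitude.bit_length() + 1


width_for_4_terms = accumulator_bits(8, 4)     # magnitude up to 4 * 128
width_for_16_terms = accumulator_bits(8, 16)   # magnitude up to 16 * 128
```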

In example embodiments, the plurality of fixed point quantization scaled input elements QSIEfxp received by the arithmetic logic unit ALU may correspond to the same exponent value. In this case, the arithmetic logic unit ALU may be able to perform the operation of Equation 23 described above without aligning the place values of the plurality of fixed point quantization scaled input elements QSIEfxp. Accordingly, the arithmetic logic unit ALU may calculate the fixed point output element OEfxp with a reduced or minimum amount of calculation.

In example embodiments, the processing element PE may not include a ‘1-bit adder’ to perform the multiplication operation on the multiplication scale coefficient MSC. That is, according to some example embodiments, since the quantization scaled input element QSIE, which has already been scaled by the multiplication scale coefficient MSC, is provided to the processing element array 250, each processing element PE may be implemented so as not to perform the calculation on the multiplication scale coefficient MSC. In this case, instead of each processing element PE including a circuit element for performing the multiplication calculation for the multiplication scale coefficient MSC, only one multiplication scaling circuit may be included for each processing element row PER (e.g., when the input vector scaler 220 is implemented according to various example embodiments described above with reference to FIGS. 27 to 31), or only one multiplication scaling circuit may exist for the whole processing element array 250 (e.g., when the input vector scaler 220 is implemented according to various example embodiments described above with reference to FIGS. 32 to 34). Therefore, the size and production cost of the matrix multiplier 200 may be reduced.

FIG. 39 is a diagram illustrating an operation of the second data type converter of FIG. 26. Hereinafter, for a more concise description, the operation of the second data type converter 260 for one fixed point output element OEfxp will be representatively described. However, example embodiments are not limited thereto. For example, the second data type converter 260 may operate in a similar manner for any fixed point output element OEfxp.

Referring to FIGS. 1 to 4 and 23 to 39, the second data type converter 260 may receive the fixed point output element OEfxp. For example, the second data type converter 260 may receive one of y′₁₁ to y′_hm.

The second data type converter 260 may receive the first to (h)-th exponents EXP1 to EXPh from the first data type converter 230. In this case, the first to (h)-th exponents EXP1 to EXPh may correspond to the first fixed point output vector (e.g., {right arrow over (Y′₁)}) to the (h)-th fixed point output vector (e.g., {right arrow over (Y′_h)}) respectively.

The fixed point output element OEfxp may include a sign part SP and a mantissa part MTSP. The second data type converter 260 may generate the output element OE by converting the data type of the fixed point output element OEfxp to floating point.

The method in which the second data type converter 260 converts the data type of the fixed point output element OEfxp is similar to the method in which the second data type converter 160 converts the data type of the fixed point partial product PSPfxp described above with reference to FIG. 18, and therefore, detailed description thereof will be omitted.

FIG. 40 is a flowchart illustrating the operation of the matrix multiply device of FIG. 1. Hereinafter, referring to FIGS. 1 to 4, and 23 to 40, the operation of the matrix multiply device MMD that receives one input vector (e.g., {right arrow over (X₁)}) and outputs one output vector (e.g., {right arrow over (Y₁)}) will be described.

In operation S310, the matrix multiply device MMD may receive the weight matrix WM. For example, the uniform BCQ circuit UBC may receive the plurality of weights (e.g., w₁₁ to w_nm) included in the weight matrix WM.

In operation S320, the matrix multiply device MMD may generate a plurality of multiplication scale coefficients MSC, a plurality of common scale coefficients CSC, and a plurality of quantization sign bits QSB by performing uniform binary coding quantization for the weight matrix WM. For example, the uniform BCQ circuit UBC may approximate each weight based on a plurality of multiplication scale coefficients MSC, one common scale coefficient CSC, and a plurality of quantization sign bits QSBs. The uniform BCQ circuit UBC may provide the plurality of generated multiplication scale coefficients MSC, the plurality of common scale coefficients CSC, and the plurality of quantization sign bits QSBs to the matrix multiplier 200. In this case, the plurality of multiplication scale coefficients MSC may be stored in the multiplication scale coefficient buffer 210, the plurality of common scale coefficients CSC may be stored in the common scale coefficient buffer 270, and the plurality of quantization sign bits QSB may be stored in the quantization sign bit buffer 240. However, example embodiments are not limited thereto.

In operation S330, the matrix multiply device MMD may receive the input vector (e.g., {right arrow over (X₁)}). For example, the matrix multiplier 200 may receive the plurality of input elements (e.g., x₁₁ to x_1n) included in the input vector.

In example embodiments, the matrix multiply device MMD may perform operation S330 regardless of the order of operations S310 to S320. For example, the matrix multiply device MMD may perform operation S330 followed by operations S310 to S320, or perform operation S330 between operations S310 and S320.

In operation S340, the matrix multiply device MMD may generate a quantization scaled input vector QSX by performing the quantization scaling for the input vector based on the plurality of multiplication scale coefficients MSC and the plurality of common scale coefficients CSC. For example, the input vector scaler 220 may generate the plurality of quantization scaled input elements QSIE by performing the quantization scaling for each of the plurality of input elements based on the plurality of multiplication scale coefficients MSC and the plurality of common scale coefficients CSC.

In operation S350, the matrix multiply device MMD may generate the output vector (e.g., {right arrow over (Y₁)}) based on the plurality of quantization scaled input elements QSIE included in the quantization scaled input vector QSX and the plurality of quantization sign bits QSBs. For example, the matrix multiplier 200 may generate one output element (e.g., y₁₁) by performing calculations of sequentially adding or subtracting the plurality of quantization scaled input elements QSIE included in the quantization scaled input vector QSX based on the first plurality of quantization sign bits QSBs_1. In this way, the matrix multiplier 200 may generate the plurality of output elements included in the output vector.

FIG. 41 is a flowchart illustrating in more detail operation S350 of FIG. 40. Referring to FIGS. 1 to 4 and 23 to 41, operation S350 may include operations S351 to S353.

In operation S351, the matrix multiplier 200 may generate the fixed point quantization scaled input vector QSXfxp by converting the data type of the quantization scaled input vector QSX to fixed point. That is, the matrix multiplier 200 may generate the fixed point quantization scaled input vector QSXfxp based on the quantization scaled input vector QSX. For example, the first data type converter 230 may generate the plurality of fixed point quantization scaled input elements QSIEfxp by converting each data type of the plurality of quantization scaled input elements QSIE to fixed point.

In operation S352, the matrix multiplier 200 may generate the fixed point output vector (e.g., {right arrow over (Y′₁)}) based on the plurality of quantization sign bits QSB and the plurality of fixed point quantization scaled input elements QSIEfxp. For example, the processing element array 250 may generate the fixed point output vector by performing the calculations of sequentially adding or subtracting the plurality of fixed point quantization scaled input elements QSIEfxp based on the plurality of quantization sign bits QSBs.

More specifically, the processing element array 250 may generate one fixed point output element (e.g., y′₁₁) by performing the calculations of sequentially adding or subtracting the plurality of fixed point quantization scaled input elements QSIEfxp based on the first plurality of quantization sign bits QSBs_1. Similarly, the processing element array 250 may generate another fixed point output element (e.g., y′₁₂) by performing the calculations of sequentially adding or subtracting the plurality of fixed point quantization scaled input elements QSIEfxp based on the second plurality of quantization sign bits QSBs_2. In this way, the processing element array 250 may calculate the plurality of fixed point output elements OEfxp included in the fixed point output vector (e.g., {right arrow over (Y′₁)}).

In operation S353, the matrix multiplier 200 may convert the data type of the fixed point output vector (e.g., {right arrow over (Y′₁)}) to floating point. For example, the second data type converter 260 may receive the plurality of fixed point output elements OEfxp included in the fixed point output vector. The second data type converter 260 may convert the plurality of fixed point output elements OEfxp into the plurality of output elements OE respectively.
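Operations S351 to S353 may be sketched end to end as follows. The sketch assumes that the first data type converter represents all elements of one vector as signed integers sharing a single exponent, and that the second data type converter rescales the accumulated integer by that exponent. The function names, the 8-bit mantissa width, the rounding rule, and the example values are illustrative assumptions rather than the specific converter implementations.

```python
import math


def to_shared_exponent_fixed_point(values, mantissa_bits=8):
    """S351 (sketch): quantize floats to signed integers sharing one exponent."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return [0] * len(values), 0
    # Choose exp so that largest / 2**exp fits in a signed mantissa_bits integer.
    limit = (1 << (mantissa_bits - 1)) - 1   # 127 for 8 bits
    exp = math.ceil(math.log2(largest / limit))
    return [round(v / 2.0 ** exp) for v in values], exp


def accumulate(mantissas, sign_bits):
    """S352: add when the sign bit is 0, subtract when it is 1."""
    acc = 0
    for m, b in zip(mantissas, sign_bits):
        acc = acc - m if b else acc + m
    return acc


def to_floating_point(acc, exp):
    """S353 (sketch): rescale the integer accumulator by the shared exponent."""
    return math.ldexp(acc, exp)


qsie = [0.5, -1.25, 3.0, 0.75]   # illustrative quantization scaled input elements
qsb = [0, 1, 1, 0]               # illustrative quantization sign bits
mantissas, shared_exp = to_shared_exponent_fixed_point(qsie)
y = to_floating_point(accumulate(mantissas, qsb), shared_exp)
```

Because every mantissa shares one exponent in this sketch, the accumulation in the second step uses only integer additions and subtractions, and the exponent is applied once at the end.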

FIG. 42 is a flowchart illustrating the operation of the matrix multiply device of FIG. 1. Hereinafter, referring to FIGS. 1 to 4 and 23 to 42, the operation of the matrix multiply device MMD that receives one input vector (e.g., {right arrow over (X₁)}) and generates one output element (e.g., y₁₁) will be described.

In operation S410, the matrix multiply device MMD may receive the first to (n)-th weights. Operation S410 is similar to operation S210 described above with reference to FIG. 21, and therefore, detailed description thereof will be omitted.

In operation S420, the matrix multiply device MMD may generate the first to (R)-th multiplication scale coefficients MSC, the first to (n)-th common scale coefficients CSC, and the first-to-(N×R)th quantization sign bits QSB by performing the uniform binary coding quantization for the first to (n)-th weights. That is, the uniform BCQ circuit UBC may generate the first to (n)-th common scale coefficients CSC based on the first to (n)-th weights, unlike S220 described above with reference to FIG. 21 (which only generates a single common scale coefficient CSC for the first to n-th weights).

In operation S430, the matrix multiply device MMD may receive the first to (n)-th input elements. Operation S430 is similar to operation S230 described above with reference to FIG. 21, and therefore, detailed description thereof will be omitted.

In example embodiments, the matrix multiply device MMD may perform operation S430 regardless of the order of operations S410 to S420. For example, the matrix multiply device MMD may perform operation S430 followed by operations S410 and S420, or perform operation S430 between operations S410 and S420.

In operation S440, the matrix multiply device MMD may generate the first-to-(N×R)th quantization scaled input elements QSIE by performing the quantization scaling for the first-to-Nth input elements based on the first to (R)-th multiplication scale coefficients MSC and the first to (n)-th common scale coefficients CSC. For example, the input vector scaler 220 may generate the first-to-(N×R)th quantization scaled input elements QSIE by performing the quantization scaling for the first-to-Nth input elements in various ways as described above with reference to FIGS. 27 to 34. The more detailed operation of the input vector scaler 220 will be described in more detail below with reference to FIGS. 43 to 45.

In operation S450, the matrix multiply device MMD may generate one output element (e.g., y₁₁) based on the first-to-(N×R)th quantization scaled input elements QSIE and the first-to-(N×R)th quantization sign bits QSB. For example, the matrix multiplier 200 may generate one output element (e.g., y₁₁) by calculating a value obtained by subtracting a value obtained by summing the quantization scaled input elements QSIE corresponding to the quantization sign bits QSBs representing ‘1’ among the first-to-(N×R)th quantization scaled input elements QSIE, from the sum of the quantization scaled input elements QSIE corresponding to the quantization sign bit QSB representing ‘0’ among the first-to-(N×R)th quantization scaled input elements QSIE.

FIGS. 43 to 45 are flowcharts illustrating in more detail operation S440 of FIG. 42 implemented according to example embodiments. Hereinafter, operation S440 when the input vector scaler 220 is implemented according to various example embodiments described above with reference to FIGS. 27 to 28 will be described with reference to FIG. 43, operation S440 when the input vector scaler 220 is implemented according to various example embodiments described above with reference to FIGS. 29 to 31 will be described with reference to FIG. 44, and operation S440 when the input vector scaler 220 is implemented according to various example embodiments described above with reference to FIGS. 32 to 34 will be described with reference to FIG. 45.

First, referring to FIGS. 1 to 4 and 23 to 43, operation S440 may be implemented as the following operation S440a. Operation S440a may include operations S441a and S442a.

In operation S441a, the input vector scaler 220 may generate the first-to-(N×R)th multiplication scaled input elements MSIE by multiplying each of the first-to-Nth input elements with the first to (R)-th multiplication scale coefficients MSC. For example, the first multiplication scaling circuit 221_1 may generate the first to (R)-th multiplication scaled input elements MSIE by multiplying the first input element with the first to (R)-th multiplication scale coefficients MSC1 to MSCR, and may generate the (R+1)th-to-(2R)th multiplication scaled input elements MSIE by multiplying the second input element with the first to (R)-th multiplication scale coefficients MSC1 to MSCR.

In operation S442a, the input vector scaler 220 may multiply each of the first-to-(N×R)th multiplication scaled input elements MSIE by the corresponding one of the first to (n)-th common scale coefficients CSC to generate the first-to-(N×R)th quantization scaled input elements QSIE. For example, the first common scaling circuit 222_1 may generate the first to (R)-th quantization scaled input elements QSIE (e.g., quantization scaled input elements for x₁₁) by multiplying the first-to-Rth multiplication scaled input elements MSIE (e.g., multiplication scaled input elements MSIE11_1 to MSIE11_R) with the first common scale coefficient CSC (e.g., S_r1). Similarly, the first common scaling circuit 222_1 may generate the (R+1)th-to-(2R)th quantization scaled input elements QSIE (e.g., quantization scaled input elements for x₁₂) by multiplying the (R+1)th-to-(2R)th multiplication scaled input elements MSIE (e.g., multiplication scaled input elements MSIE12_1 to MSIE12_R) with the second common scale coefficient CSC (e.g., S_r2).

Next, referring to FIGS. 1 to 4, 23 to 42, and 44, operation S440 may be implemented as the following operation S440b. Operation S440b may include operations S441b and S442b.

In operation S441b, the input vector scaler 220 may generate the first to (n)-th common scaled input elements CSIE by multiplying the first-to-Nth input elements with the first to (n)-th common scale coefficients CSC, respectively. For example, the first common scaling circuit 223_1 may generate the first to (n)-th common scaled input elements CSIE by multiplying the first-to-Nth input elements with the first to (n)-th common scale coefficients CSC, respectively.

In operation S442b, the input vector scaler 220 may generate the first-to-(N×R)th quantization scaled input elements QSIE by multiplying each of the first-to-Nth common scaled input elements CSIE with the first to (R)-th multiplication scale coefficients MSC. For example, the first multiplication scaling circuit 224_1 may generate the first to (R)-th quantization scaled input elements QSIE by multiplying the first common scaled input element CSIE with the first to (R)-th multiplication scale coefficients MSC, and may generate the (R+1)th-to-(2R)th quantization scaled input elements QSIE by multiplying the second common scaled input element CSIE with the first to (R)-th multiplication scale coefficients MSC.

Next, referring to FIGS. 1 to 4, 23 to 42, and 45, operation S440 may be implemented as the following operation S440c. Operation S440c may include operations S441c and S442c.

In operation S441c, the input vector scaler 220 may generate the first-to-(N×R)th quantization scale coefficients QSC by multiplying each of the first to (R)-th multiplication scale coefficients MSC with the first to (n)-th common scale coefficients CSC. For example, the multiplication scaling circuit 225 may generate the first-to-(N×R)th quantization scale coefficients QSC (e.g., α_{1_r1} to α_{R_rn}) by calculating the products of different combinations (e.g., each combination) of the first to (R)-th multiplication scale coefficients MSC and the first to (n)-th common scale coefficients CSC.

In operation S442c, the input vector scaler 220 may generate the first-to-(N×R)th quantization scaled input elements QSIE by multiplying each of the first-to-(N×R)th quantization scale coefficients QSC with the corresponding one of the first to (n)-th input elements. For example, the first quantization scaling circuit 226_1 may generate the first to (R)-th quantization scaled input elements QSIE by multiplying the first input element (e.g., x₁₁) with the first to (R)-th quantization scale coefficients QSC (e.g., α_{1_r1} to α_{R_r1}). Similarly, the first quantization scaling circuit 226_1 may generate the (R+1)th-to-(2R)th quantization scaled input elements QSIE by multiplying the second input element (e.g., x₁₂) with the (R+1)th-to-(2R)th quantization scale coefficients QSC (e.g., α_{1_r2} to α_{R_r2}).
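The three implementations of operation S440 differ only in the order of the multiplications, so each may be expected to produce the same N×R quantization scaled input elements QSIE. The following sketch illustrates this. The function names and example values are assumptions, and the multiplication scale coefficients are taken here to be powers of two purely for illustration. With general floating point values, the three orders may differ slightly in rounding.

```python
def scale_s440a(x, msc, csc):
    """S441a then S442a: multiply by each MSC first, then by the matching CSC."""
    msie = [xi * mk for xi in x for mk in msc]            # N*R elements
    r = len(msc)
    return [e * csc[idx // r] for idx, e in enumerate(msie)]


def scale_s440b(x, msc, csc):
    """S441b then S442b: multiply by the matching CSC first, then by each MSC."""
    csie = [xi * si for xi, si in zip(x, csc)]            # N elements
    return [ci * mk for ci in csie for mk in msc]


def scale_s440c(x, msc, csc):
    """S441c then S442c: form QSC = MSC * CSC first, then multiply the inputs."""
    qsc = [[mk * si for mk in msc] for si in csc]         # N groups of R
    return [xi * a for xi, row in zip(x, qsc) for a in row]


x = [1.5, -2.0, 0.25]     # illustrative input elements (N = 3)
msc = [1.0, 2.0, 4.0]     # illustrative MSC values 2**0, 2**1, 2**2 (R = 3)
csc = [0.5, 0.25, 2.0]    # illustrative CSC, one per input element
qsie = scale_s440a(x, msc, csc)
```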

FIG. 46 is a flowchart illustrating in more detail operation S450 of FIG. 42. Referring to FIGS. 1 to 4 and 23 to 46, operation S450 may include operations S451 to S453.

In operation S451, the matrix multiplier 200 may generate the first-to-(N×R)th fixed point quantization scaled input elements QSIEfxp by converting the data type of the first-to-(N×R)th quantization scaled input elements QSIE to fixed point. For example, the first data type converter 230 may convert the first-to-(N×R)th quantization scaled input elements QSIE to the first-to-(N×R)th fixed point quantization scaled input elements QSIEfxp, respectively.

In operation S452, the matrix multiplier 200 may calculate one fixed point output element OEfxp by subtracting a value obtained by summing the fixed point quantization scaled input elements QSIEfxp corresponding to the quantization sign bits QSB representing ‘1’ among the first-to-(N×R)th fixed point quantization scaled input elements QSIEfxp, from a sum of the fixed point quantization scaled input elements QSIEfxp corresponding to the quantization sign bits QSB representing ‘0’ among the first-to-(N×R)th fixed point quantization scaled input elements QSIEfxp. For example, the processing element PE11 may calculate one fixed point output element OEfxp (e.g., y′₁₁) in the manner described above with reference to Equation 23.

In operation S453, the matrix multiplier 200 may generate one output element by converting the data type of the fixed point output element OEfxp to floating point. For example, the second data type converter 260 may receive the fixed point output element OEfxp (e.g., y′₁₁) and output the output element OE (e.g., y₁₁).

FIG. 47 is a diagram illustrating the operation of the BCQ circuit of FIG. 1 according to various example embodiments. Referring to FIGS. 1 to 4 and 47, the uniform BCQ circuit UBC may perform the binary coding quantization operation for each row of the weight matrix WM.

The uniform BCQ circuit UBC may approximate each row vector of the weight matrix WM with the sum of a zero point vector (e.g., {right arrow over (ZP)}) and the products of the plurality of quantization scale coefficients (e.g., α) and the plurality of quantization sign vectors (e.g., {right arrow over (B)}). For example, the uniform BCQ circuit UBC may perform the uniform binary coding quantization operation for each row of the weight matrix WM based on Equation 24 below. Hereinafter, the difference from the operation of the matrix multiply device MMD described above with reference to FIGS. 23 to 46 will be mainly described.

$\begin{matrix} \vec{w_{ri}} \approx \vec{{ZP}_{ri}} + \sum_{k = 1}^{R} (α_{k_ri} \times \vec{B_{k_ri}}) & (Equation 24) \end{matrix}$

Referring to Equation 24, {right arrow over (ZP_ri)} may represent the zero point vector for the (i)-th row vector of the weight matrix WM. {right arrow over (ZP_ri)} may be implemented as a row vector having the same dimension as the number of columns of the weight matrix WM. Each element of {right arrow over (ZP_ri)} may have the same value. For example, {right arrow over (ZP_ri)} may be expressed as Equation 25 below.

$\begin{matrix} \vec{{ZP}_{ri}} = [\begin{matrix} ZPVi & ZPVi & \dots & ZPVi \end{matrix}] & (Equation 25) \end{matrix}$

Referring to Equation 25, ZPVi may represent the zero point value included in {right arrow over (ZP_ri)}. That is, the zero point value ZPVi corresponding to the weights included in the i-th row of the weight matrix WM may be the same.

Accordingly, each of the plurality of weights included in the weight matrix WM may be approximated based on the plurality of quantization scale coefficients QSC and the plurality of quantization sign bits QSB according to Equation 26 below.

$\begin{matrix} (Equation 26) \end{matrix}$

$W_{ij} \approx ZPVi + 2^{0} \times s_{ri} \times {(- 1)}^{b_{j_1_ri}} + 2^{1} \times s_{ri} \times {(- 1)}^{b_{j_2_ri}} + \dots + 2^{R - 1} \times s_{ri} \times {(- 1)}^{b_{j_R_ri}}$

That is, according to various example embodiments of the present disclosure, one weight may be approximated based on one zero point value ZPV, R multiplication scale coefficients MSC, one common scale coefficient CSC, and R quantization sign bits QSB.
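Equation 26 may be illustrated with a short sketch that rebuilds one approximated weight from its zero point value, its common scale coefficient, and its R quantization sign bits. The function names and numeric values are assumptions made for illustration. The second function lists every value that can be represented for a given R, showing that the levels are evenly spaced around the zero point value.

```python
from itertools import product


def approximate_weight(zpv, s, sign_bits):
    """Equation 26: ZPV + sum over k of 2**(k-1) * s * (-1)**b_k, k = 1..R."""
    # enumerate() starts at 0, so 2**k runs over 2**0, 2**1, ..., 2**(R-1).
    return zpv + sum((2 ** k) * s * (-1) ** b for k, b in enumerate(sign_bits))


def representable_levels(zpv, s, num_bits):
    """All weight values reachable with num_bits sign bits, in sorted order."""
    return sorted(
        approximate_weight(zpv, s, bits) for bits in product((0, 1), repeat=num_bits)
    )


w = approximate_weight(zpv=0.5, s=0.25, sign_bits=[1, 0, 0])   # 0.5 - 0.25 + 0.5 + 1.0
levels = representable_levels(zpv=0, s=1, num_bits=2)
```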

In this way, the uniform BCQ circuit UBC may approximate the weight matrix WM based on the plurality of zero point values ZPV, the plurality of multiplication scale coefficients MSC, the plurality of common scale coefficients CSC, and the plurality of quantization sign bits QSB. In this case, unlike previously described with reference to FIG. 1, the uniform BCQ circuit UBC may further provide the plurality of zero point values ZPV to the matrix multiplier 100.

The operation of the matrix multiplier 100 based on the weights approximated based on the zero point value ZPV, the multiplication scale coefficient MSC, the common scale coefficient CSC, and the quantization sign bit QSB will be described in more detail below with reference to FIGS. 48 to 58.

FIG. 48 is a block diagram illustrating a configuration of the matrix multiplier of FIG. 1 implemented according to various example embodiments. Referring to FIGS. 1 to 4 and 47 and 48, the matrix multiplier 100 of FIG. 1 may be implemented as the matrix multiplier 300 of FIG. 48.

The matrix multiplier 300 may include a multiplication scale coefficient buffer 310, a common scale coefficient buffer 370, an input vector scaler 320, a first data type converter 330, a quantization sign bit buffer 340, a processing element array 350, and a second data type converter 360.

Each of the components of the matrix multiplier 300 may perform operations similar to the components of the matrix multiplier 200 described above with reference to FIGS. 23 to 46. Hereinafter, the difference between the matrix multiplier 300 and the matrix multiplier 200 described above with reference to FIGS. 23 to 46 will be mainly described.

The multiplication scale coefficient buffer 310 may store the plurality of multiplication scale coefficients MSC and a plurality of zero point scale coefficients ZPSC provided from the uniform BCQ circuit UBC. The multiplication scale coefficient buffer 310 may provide the plurality of multiplication scale coefficients MSC and the plurality of zero point scale coefficients ZPSC to the input vector scaler 320.

In example embodiments, each of the plurality of zero point scale coefficients ZPSC may be ‘1’. Hereinafter, for a more concise description, various example embodiments in which each of the plurality of zero point scale coefficients ZPSC is ‘1’ will be representatively described. However, example embodiments are not limited thereto.

The common scale coefficient buffer 370 may store the plurality of common scale coefficients CSC and the plurality of zero point values ZPV provided from the uniform BCQ circuit UBC. The common scale coefficient buffer 370 may provide the plurality of common scale coefficients CSC and the plurality of zero point values ZPV to the input vector scaler 320.

The input vector scaler 320 may receive the input matrix XM. The input vector scaler 320 may generate the plurality of quantization scaled input vectors QSX based on the plurality of common scale coefficients CSC, the plurality of zero point values ZPV, the plurality of zero point scale coefficients ZPSC, and the plurality of multiplication scale coefficients MSC. That is, the input vector scaler 320 may generate the plurality of quantization scaled input vectors QSX further based on the plurality of zero point values ZPV and the plurality of zero point scale coefficients ZPSC.

Each of the plurality of quantization scaled input vectors QSX may be implemented as a row vector whose dimension is ‘R+1’ times that of the corresponding input vector. For example, the first quantization scaled input vector (e.g., {right arrow over (QSX₁)}) corresponding to the first input vector (e.g., {right arrow over (X₁)}) may include “(R+1)×n” quantization scaled input elements QSIE as illustrated in Equation 27 below.

$\begin{matrix} (Equation 27) \end{matrix}$

$\vec{{QSX}_{1}} ∋ (1 \times {ZPV}_{1} \times x_{11}), (2^{0} \times s_{r 1} \times x_{1 1}), (2^{1} \times s_{r 1} \times x_{1 1}), \dots, (2^{R - 1} \times s_{r 1} \times x_{1 1}),$

$(1 \times {ZPV}_{2} \times x_{12}), (2^{0} \times s_{r 2} \times x_{1 2}), (2^{1} \times s_{r 2} \times x_{1 2}), \dots, (2^{R - 1} \times s_{r 2} \times x_{1 2}),$

$\dots,$

$(1 \times {ZPV}_{n} \times x_{1 n}), (2^{0} \times s_{rn} \times x_{1 n}), (2^{1} \times s_{rn} \times x_{1 n}), \dots, (2^{R - 1} \times s_{rn} \times x_{1 n})$

Referring to Equation 27, some of the plurality of quantization scaled input elements QSIE included in the first quantization scaled input vector may correspond to values obtained by multiplying each of the plurality of input elements included in the first input vector with the corresponding zero point value ZPV.
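As a non-limiting illustration, the element ordering of Equation 27 may be sketched in Python as follows. The function name `build_qsx` and the integer example values are hypothetical and are chosen only for readability. The sketch assumes that the zero point scale coefficient ZPSC is ‘1’ and that the first-to-Rth multiplication scale coefficients are 2^0 to 2^(R-1), as written in Equation 27.

```python
def build_qsx(x, zpv, csc, R):
    """Return the (R+1)*n quantization scaled input elements for one input vector.

    x   : input elements x_11 .. x_1n
    zpv : zero point values ZPV_1 .. ZPV_n (one per weight row)
    csc : common scale coefficients s_r1 .. s_rn (one per weight row)
    R   : number of multiplication scale coefficients
    """
    assert len(x) == len(zpv) == len(csc)
    qsx = []
    for x_j, zpv_j, s_j in zip(x, zpv, csc):
        # Zero point term; the leading factor 1 is the assumed ZPSC.
        qsx.append(1 * zpv_j * x_j)
        # R terms scaled by 2^0 .. 2^(R-1) and the common scale coefficient.
        for i in range(R):
            qsx.append((2 ** i) * s_j * x_j)
    return qsx


# Hypothetical values: n = 2 input elements, R = 2.
qsx_1 = build_qsx(x=[3, 5], zpv=[2, 7], csc=[10, 100], R=2)
print(qsx_1)  # [6, 30, 60, 35, 500, 1000]
```

In this sketch, the first and fourth elements are the products of the zero point values and the input elements, and the vector length is (R+1)×n = 6.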

Similarly, the input vector scaler 320 may generate the second to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₂)} to {right arrow over (QSX_h)}). For concise description, detailed description of the quantization scaled input elements QSIE included in each of the second to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₂)} to {right arrow over (QSX_h)}) will be omitted. A specific method in which the input vector scaler 320 generates the plurality of quantization scaled input vectors QSX will be described in more detail below with reference to FIGS. 49 to 57.

The first data type converter 330 may receive the plurality of quantization scaled input vectors QSX. The first data type converter 330 may convert the data type of the plurality of quantization scaled input vectors QSX to fixed point. That is, the first data type converter 330 may generate the first to (h)-th fixed point quantization scaled input vectors (e.g., {right arrow over (QSX′₁)} to {right arrow over (QSX′_h)}).

The quantization sign bit buffer 340 may store the plurality of quantization sign bits QSB and a plurality of zero point correction bits ZPCB provided from the uniform BCQ circuit UBC. The quantization sign bit buffer 340 may provide the plurality of quantization sign bits QSB and the plurality of zero point correction bits ZPCB to the processing element array 350.

In example embodiments, each of the plurality of zero point correction bits ZPCB may be ‘0’. Hereinafter, for a more concise description, various example embodiments in which each of the plurality of zero point correction bits ZPCB is ‘0’ will be representatively described. However, example embodiments are not limited thereto.

The processing element array 350 may receive the plurality of quantization sign bits QSB, the plurality of zero point correction bits ZPCB, and the plurality of fixed point quantization scaled input vectors QSXfxp. The processing element array 350 may generate the fixed point output matrix YMfxp based on the plurality of quantization sign bits QSB, the plurality of zero point correction bits ZPCB, and the plurality of fixed point quantization scaled input elements QSIEfxp.

The processing element array 350 may include the plurality of processing elements arranged in the row direction and the column direction. Each of the plurality of processing elements may calculate different fixed point output elements. The detailed configuration and operation of the processing element array 350 will be described in more detail below with reference to FIG. 58.

The second data type converter 360 may receive the plurality of fixed point output elements from the processing element array 350. The second data type converter 360 may convert the data type of the fixed point output matrix YMfxp to floating point based on the plurality of exponents EXP. That is, the second data type converter 360 may output the output matrix YM having floating point data type.

FIG. 49 is a block diagram illustrating a configuration of the input vector scaler of FIG. 48 according to various example embodiments. Referring to FIGS. 1 to 4 and 47 to 49, the input vector scaler 320 may include a multiplication scaler SCL_multiple and a common scaler SCL_common. The multiplication scaler SCL_multiple may include first to (h)-th multiplication scaling circuits 321_1 to 321_h. The common scaler SCL_common may include first to (h)-th common scaling circuits 322_1 to 322_h.

The components of the input vector scaler 320 may perform operations similar to those of the input vector scaler 220 described above with reference to FIGS. 27 and 28. Hereinafter, the differences between the input vector scaler 320 and the input vector scaler 220 described above with reference to FIGS. 23 to 46 will be mainly described.

The multiplication scaler SCL_multiple may sequentially receive the plurality of multiplication scale coefficients MSC and the plurality of zero point scale coefficients ZPSC from the multiplication scale coefficient buffer 310. For example, each of the first to (h)-th multiplication scaling circuits 321_1 to 321_h may sequentially receive the plurality of multiplication scale coefficients MSC and the plurality of zero point scale coefficients ZPSC from the multiplication scale coefficient buffer 310.

Each of the first to (h)-th multiplication scaling circuits 321_1 to 321_h may perform multiplication scaling on the received input vector based on the plurality of multiplication scale coefficients MSC and the plurality of zero point scale coefficients ZPSC. For example, the first to (h)-th multiplication scaling circuits 321_1 to 321_h may generate the first to (h)-th multiplication scaled input vectors (e.g., {right arrow over (MSX₁)} to {right arrow over (MSX_h)}), respectively. The operation of the first to (h)-th multiplication scaling circuits 321_1 to 321_h will be described in more detail below with reference to FIG. 50.

The common scaler SCL_common may sequentially receive the plurality of common scale coefficients CSC and the plurality of zero point values ZPV from the common scale coefficient buffer 370. For example, each of the first to (h)-th common scaling circuits 322_1 to 322_h may sequentially receive the plurality of common scale coefficients CSC and the plurality of zero point values ZPV from the common scale coefficient buffer 370.

Each of the first to (h)-th common scaling circuits 322_1 to 322_h may perform common scaling on the received multiplication scaled input vector MSX based on the plurality of common scale coefficients CSC and the plurality of zero point values ZPV. For example, the first to (h)-th common scaling circuits 322_1 to 322_h may generate the first to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₁)} to {right arrow over (QSX_h)}), respectively. The operation of the first to (h)-th common scaling circuits 322_1 to 322_h will be described in more detail below with reference to FIG. 51.

FIG. 50 is a diagram illustrating in more detail an operation of a multiplication scaling circuit of FIG. 49. Hereinafter, for a more concise description, the operation of the first multiplication scaling circuit 321_1 will be representatively described with reference to FIGS. 1 to 4 and 47 to 50. However, example embodiments are not limited thereto, and the second to (h)-th multiplication scaling circuits 321_2 to 321_h may also operate in a similar manner.

The first multiplication scaling circuit 321_1 may receive the first input vector (e.g., {right arrow over (X₁)}). That is, the first multiplication scaling circuit 321_1 may sequentially receive a plurality of input elements (e.g., x₁₁ to x_1n).

The first multiplication scaling circuit 321_1 may sequentially receive the zero point scale coefficient ZPSC and the first-to-Rth multiplication scale coefficients MSC1 to MSCR.

The first multiplication scaling circuit 321_1 may generate the multiplication scaled input elements MSIE11_0 to MSIE1n_R by multiplying each of the plurality of received input elements with the zero point scale coefficient ZPSC and the first-to-Rth multiplication scale coefficients MSC1 to MSCR.

For example, the first multiplication scaling circuit 321_1 may generate the multiplication scaled input element MSIE11_0 by multiplying the input element “x₁₁” with the zero point scale coefficient ZPSC, and generate the multiplication scaled input elements MSIE11_1 to MSIE11_R by multiplying “x₁₁” with the first-to-Rth multiplication scale coefficients MSC1 to MSCR (illustrated as diagonal stripe).

Similarly, the first multiplication scaling circuit 321_1 may generate the multiplication scaled input element MSIE12_0 by multiplying the input element “x₁₂” with the zero point scale coefficient ZPSC, and generate the multiplication scaled input elements MSIE12_1 to MSIE12_R by multiplying “x₁₂” with the first-to-Rth multiplication scale coefficients MSC1 to MSCR (illustrated as a dot pattern).

In this way, the first multiplication scaling circuit 321_1 may sequentially calculate the plurality of multiplication scaled input elements MSIE corresponding to x₁₃ to x_1n. In this case, as the zero point scale coefficient ZPSC is equal to ‘1’, the multiplication scaled input elements generated based on the zero point scale coefficient ZPSC (e.g., the multiplication scaled input elements MSIE11_0 to MSIE1n_0) may be the same as the corresponding input elements. For example, the multiplication scaled input elements MSIE11_0 to MSIE1n_0 may be the same as x₁₁ to x_1n, respectively.
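As a non-limiting illustration, the multiplication scaling described with reference to FIG. 50 may be sketched in Python as follows. The function name `multiplication_scale` and the example coefficient values are hypothetical. The sketch models only the order of the generated elements, not the circuit timing.

```python
def multiplication_scale(x, zpsc, msc):
    """Return MSIE1j_0 .. MSIE1j_R for every input element x_1j, in order.

    x    : input elements x_11 .. x_1n
    zpsc : zero point scale coefficient ZPSC
    msc  : multiplication scale coefficients MSC1 .. MSCR
    """
    msx = []
    for x_j in x:
        msx.append(zpsc * x_j)      # MSIE1j_0
        for m in msc:
            msx.append(m * x_j)     # MSIE1j_1 .. MSIE1j_R
    return msx


# Hypothetical values: n = 2, R = 3, ZPSC = 1.
msx_1 = multiplication_scale(x=[3, 5], zpsc=1, msc=[1, 2, 4])
print(msx_1)  # [3, 3, 6, 12, 5, 5, 10, 20]
```

Because ZPSC is ‘1’ in this sketch, the elements at positions 0 and 4 (MSIE11_0 and MSIE12_0) are identical to the input elements 3 and 5.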

FIG. 51 is a diagram illustrating in more detail the operation of the common scaling circuit of FIG. 49. Hereinafter, for a more concise description, the operation of the first common scaling circuit 322_1 will be representatively described with reference to FIGS. 1 to 4 and 47 to 51. However, example embodiments are not limited thereto, and the second to (h)-th common scaling circuits 322_2 to 322_h may also operate in a similar manner.

The first common scaling circuit 322_1 may receive the first multiplication scaled input vector (e.g., {right arrow over (MSX₁)}). That is, the first common scaling circuit 322_1 may sequentially receive the multiplication scaled input elements MSIE11_0 to MSIE1n_R described above with reference to FIG. 50.

The first common scaling circuit 322_1 may sequentially receive the plurality of zero point values ZPV and the plurality of common scale coefficients CSC. For example, the first common scaling circuit 322_1 may sequentially receive the first zero point value ZPV1 and R instances of S_r1 for the first row vector (e.g., {right arrow over (w_r1)}) of the weight matrix WM, and may sequentially receive the second zero point value ZPV2 and R instances of S_r2 for the second row vector (e.g., {right arrow over (w_r2)}) of the weight matrix WM.

The first common scaling circuit 322_1 may generate the first quantization scaled input vector (e.g., {right arrow over (QSX₁)}) based on the order in which the multiplication scaled input elements MSIE11_0 to MSIE1n_R are received, and the order in which the plurality of zero point values ZPV and the plurality of common scale coefficients CSC are received.

For example, the first common scaling circuit 322_1 may generate one quantization scaled input element QSIE based on the product of the multiplication scaled input element MSIE11_0 and the first zero point value ZPV1; and may generate R quantization scaled input elements QSIE based on the product of S_r1 and each of the multiplication scaled input elements MSIE11_1 to MSIE11_R (illustrated as diagonal stripe).

Similarly, the first common scaling circuit 322_1 may generate one quantization scaled input element QSIE based on the product of the multiplication scaled input element MSIE12_0 and the second zero point value ZPV2; and may generate R quantization scaled input elements QSIE based on the product of S_r2 and each of the multiplication scaled input elements MSIE12_1 to MSIE12_R (illustrated as a dot pattern).

As a result, the first quantization scaled input vector (e.g., {right arrow over (QSX₁)}) may include the quantization scaled input elements QSIE corresponding to the product of the plurality of zero point values ZPV and the plurality of input elements.
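As a non-limiting illustration, the common scaling described with reference to FIG. 51 may be sketched in Python as follows. The function name `common_scale` and the example values are hypothetical. The sketch assumes that the multiplication scaled input elements arrive in groups of R+1 per input element, with the zero point position first in each group.

```python
def common_scale(msx, zpv, csc, R):
    """Apply ZPV_j to MSIE1j_0 and S_rj to MSIE1j_1 .. MSIE1j_R.

    msx : multiplication scaled input elements, (R+1) per input element
    zpv : zero point values ZPV1 .. ZPVn
    csc : common scale coefficients S_r1 .. S_rn
    """
    n = len(zpv)
    assert len(csc) == n and len(msx) == (R + 1) * n
    qsx = []
    for j in range(n):
        group = msx[j * (R + 1):(j + 1) * (R + 1)]
        qsx.append(group[0] * zpv[j])              # zero point element
        qsx.extend(e * csc[j] for e in group[1:])  # R scaled elements
    return qsx


# Hypothetical input: x = [3, 5] already scaled with ZPSC = 1 and MSC = [1, 2].
msx_1 = [3, 3, 6, 5, 5, 10]
qsx_1 = common_scale(msx_1, zpv=[2, 7], csc=[10, 100], R=2)
print(qsx_1)  # [6, 30, 60, 35, 500, 1000]
```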

FIG. 52 is a block diagram illustrating a configuration of the input vector scaler of FIG. 48 according to various example embodiments. Referring to FIGS. 1 to 4, 47 and 48, and 52, the input vector scaler 320 may include the common scaler SCL_common and the multiplication scaler SCL_multiple. The common scaler SCL_common may include the first to (h)-th common scaling circuits 323_1 to 323_h. The multiplication scaler SCL_multiple may include the first to (h)-th multiplication scaling circuits 324_1 to 324_h.

The components of the input vector scaler 320 may perform operations similar to those of the input vector scaler 220 described above with reference to FIGS. 29 and 31. Hereinafter, the differences between the input vector scaler 320 and the input vector scaler 220 described above with reference to FIGS. 29 to 31 will be mainly described.

The first to (h)-th common scaling circuits 323_1 to 323_h may receive the first to (h)-th input vectors (e.g., {right arrow over (X₁)} to {right arrow over (X_h)}), respectively. For example, the first common scaling circuit 323_1 may sequentially receive x₁₁ to x_1n, and the h-th common scaling circuit 323_h may sequentially receive x_h1 to x_hn.

The common scaler SCL_common may sequentially receive the plurality of common scale coefficients CSC and the plurality of zero point values ZPV from the common scale coefficient buffer 370. For example, each of the first to (h)-th common scaling circuits 323_1 to 323_h may sequentially receive the plurality of common scale coefficients CSC and the plurality of zero point values ZPV from the common scale coefficient buffer 370.

Each of the first to (h)-th common scaling circuits 323_1 to 323_h may perform common scaling for the received input vector based on the plurality of common scale coefficients CSC and the plurality of zero point values ZPV. For example, the first to (h)-th common scaling circuits 323_1 to 323_h may generate the first to (h)-th common scaled input vectors (e.g., {right arrow over (CSX₁)} to {right arrow over (CSX_h)}), respectively. The operation of the first to (h)-th common scaling circuits 323_1 to 323_h will be described in more detail below with reference to FIG. 53.

The multiplication scaler SCL_multiple may sequentially receive the plurality of multiplication scale coefficients MSC and the plurality of zero point scale coefficients ZPSC from the multiplication scale coefficient buffer 310. For example, each of the first to (h)-th multiplication scaling circuits 324_1 to 324_h may sequentially receive the plurality of multiplication scale coefficients MSC and the plurality of zero point scale coefficients ZPSC from the multiplication scale coefficient buffer 310.

Each of the first to (h)-th multiplication scaling circuits 324_1 to 324_h may perform multiplication scaling on the received common scaled input vector CSX based on the plurality of multiplication scale coefficients MSC and the plurality of zero point scale coefficients ZPSC. For example, the first to (h)-th multiplication scaling circuits 324_1 to 324_h may generate the first to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₁)} to {right arrow over (QSX_h)}), respectively. The operation of the first to (h)-th multiplication scaling circuits 324_1 to 324_h will be described in more detail below with reference to FIG. 54.

FIG. 53 is a diagram illustrating in more detail the operation of the common scaling circuit of FIG. 52. Hereinafter, for a more concise description, the operation of the first common scaling circuit 323_1 will be representatively described with reference to FIGS. 1 to 4, 47 and 48, and 52 and 53. However, example embodiments are not limited thereto, and the second to (h)-th common scaling circuits 323_2 to 323_h may also operate in a similar manner.

The first common scaling circuit 323_1 may receive the first input vector (e.g., {right arrow over (X₁)}). That is, the first common scaling circuit 323_1 may sequentially receive x₁₁ to x_1n. The first common scaling circuit 323_1 may sequentially receive the plurality of zero point values ZPV and the plurality of common scale coefficients CSC. For example, the first common scaling circuit 323_1 may sequentially receive the first zero point value ZPV1 and R instances of S_r1 for the first row vector (e.g., {right arrow over (w_r1)}) of the weight matrix WM, and may sequentially receive the second zero point value ZPV2 and R instances of S_r2 for the second row vector (e.g., {right arrow over (w_r2)}) of the weight matrix WM.

The first common scaling circuit 323_1 may generate the first common scaled input vector (e.g., {right arrow over (CSX₁)}) based on the order in which the plurality of input elements, the plurality of zero point values ZPV, and the plurality of common scale coefficients CSC are received.

For example, the first common scaling circuit 323_1 may generate the common scaled input elements CSIE11_0 to CSIE1n_R. More specifically, the first common scaling circuit 323_1 may generate one common scaled input element CSIE11_0 based on the product of x₁₁ and the first zero point value ZPV1, and generate R common scaled input elements CSIE11_1 to CSIE11_R based on the product of x₁₁ and R instances of S_r1 (illustrated as diagonal stripe).

Similarly, the first common scaling circuit 323_1 may generate one common scaled input element CSIE12_0 based on the product of x₁₂ and the second zero point value ZPV2, and generate R common scaled input elements CSIE12_1 to CSIE12_R based on the product of x₁₂ and R instances of S_r2 (illustrated as a dot pattern).

As a result, the first common scaled input vector (e.g., {right arrow over (CSX₁)}) may include the common scaled input elements CSIE11_0 to CSIE1n_0 corresponding to the products of the plurality of zero point values ZPV and the plurality of input elements.

FIG. 54 is a diagram illustrating in more detail an operation of the multiplication scaling circuit of FIG. 52 according to various example embodiments. Hereinafter, for a more concise description, the operation of the first multiplication scaling circuit 324_1 will be representatively described with reference to FIGS. 1 to 4, 47 and 48, and 52 to 54. However, example embodiments are not limited thereto, and the second to (h)-th multiplication scaling circuits 324_2 to 324_h may also operate in a similar manner.

The first multiplication scaling circuit 324_1 may receive the first common scaled input vector (e.g., {right arrow over (CSX₁)}). For example, the first multiplication scaling circuit 324_1 may receive the plurality of common scaled input elements CSIE11_0 to CSIE1n_R described above with reference to FIG. 53. The first multiplication scaling circuit 324_1 may receive the zero point scale coefficient ZPSC and the first-to-Rth multiplication scale coefficients MSC1 to MSCR repeatedly.

The first multiplication scaling circuit 324_1 may generate the first quantization scaled input vector (e.g., {right arrow over (QSX₁)}) based on the order in which the plurality of common scaled input elements CSIE11_0 to CSIE1n_R are received, and the order in which the zero point scale coefficient ZPSC and the first-to-Rth multiplication scale coefficients MSC1 to MSCR are received.

For example, the first multiplication scaling circuit 324_1 may generate one quantization scaled input element QSIE (e.g., “ZPV1×x₁₁”) based on the product of the common scaled input element CSIE11_0 and the zero point scale coefficient ZPSC; and generate R quantization scaled input elements QSIE by multiplying the common scaled input elements CSIE11_1 to CSIE11_R and the first-to-Rth multiplication scale coefficients MSC1 to MSCR, respectively (illustrated as diagonal stripe).

Similarly, the first multiplication scaling circuit 324_1 may generate one quantization scaled input element QSIE (e.g., “ZPV2×x₁₂”) based on the product of the common scaled input element CSIE12_0 and the zero point scale coefficient ZPSC; and generate R quantization scaled input elements QSIE by multiplying the common scaled input elements CSIE12_1 to CSIE12_R and the first-to-Rth multiplication scale coefficients MSC1 to MSCR, respectively (illustrated as a dot pattern).
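As a non-limiting illustration, the two stages described with reference to FIGS. 53 and 54 (common scaling first, multiplication scaling second) may be sketched in Python as follows. The function names `common_scale_first` and `multiplication_scale_second` and the example values are hypothetical. The sketch assumes that ZPSC and MSC1 to MSCR are applied repeatedly in the same order for every group of R+1 common scaled input elements.

```python
def common_scale_first(x, zpv, csc, R):
    """Return CSIE1j_0 .. CSIE1j_R for every input element x_1j."""
    csx = []
    for x_j, z_j, s_j in zip(x, zpv, csc):
        csx.append(z_j * x_j)         # CSIE1j_0 = ZPV_j * x_1j
        csx.extend([s_j * x_j] * R)   # CSIE1j_1 .. CSIE1j_R = S_rj * x_1j
    return csx


def multiplication_scale_second(csx, zpsc, msc):
    """Multiply each group of R+1 elements by ZPSC, MSC1, .. MSCR in turn."""
    coeffs = [zpsc] + list(msc)       # repeated for every group
    return [e * coeffs[k % len(coeffs)] for k, e in enumerate(csx)]


# Hypothetical values: n = 2, R = 2, ZPSC = 1, MSC = [2^0, 2^1].
csx_1 = common_scale_first(x=[3, 5], zpv=[2, 7], csc=[10, 100], R=2)
qsx_1 = multiplication_scale_second(csx_1, zpsc=1, msc=[1, 2])
print(csx_1)  # [6, 30, 30, 35, 500, 500]
print(qsx_1)  # [6, 30, 60, 35, 500, 1000]
```

With these hypothetical values, the resulting vector has the element ordering of Equation 27, which suggests that the order of the two scaling stages does not change the quantization scaled input elements when the scaling is exact.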

FIG. 55 is a block diagram illustrating a configuration of the input vector scaler of FIG. 48 according to various example embodiments. Referring to FIGS. 1 to 4, 47 and 48, and 55, the input vector scaler 320 may include the multiplication scaling circuit 325 and the quantization scaler SCL_quantization. The quantization scaler SCL_quantization may include first to (h)-th quantization scaling circuits 326_1 to 326_h.

The components of the input vector scaler 320 may perform operations similar to those of the input vector scaler 220 described above with reference to FIGS. 32 to 34. Hereinafter, the differences between the input vector scaler 320 and the input vector scaler 220 described above with reference to FIGS. 32 to 34 will be mainly described.

The multiplication scaling circuit 325 may sequentially receive the plurality of multiplication scale coefficients MSC and the plurality of zero point scale coefficients ZPSC from the multiplication scale coefficient buffer 310.

The multiplication scaling circuit 325 may sequentially receive the plurality of common scale coefficients CSC and the plurality of zero point values ZPV from the common scale coefficient buffer 370.

Each of the first to (h)-th quantization scaling circuits 326_1 to 326_h may receive a different input vector. For example, the first to (h)-th quantization scaling circuits 326_1 to 326_h may receive the first to (h)-th input vectors (e.g., {right arrow over (X₁)} to {right arrow over (X_h)}), respectively. For a more detailed example, the first quantization scaling circuit 326_1 may sequentially receive x₁₁ to x_1n.

Each of the first to (h)-th quantization scaling circuits 326_1 to 326_h may perform quantization scaling on the received input vector based on the plurality of quantization scale coefficients QSC and the plurality of zero point values ZPV. For example, the first to (h)-th quantization scaling circuits 326_1 to 326_h may generate the first to (h)-th quantization scaled input vectors (e.g., {right arrow over (QSX₁)} to {right arrow over (QSX_h)}), respectively. The operation of the first to (h)-th quantization scaling circuits 326_1 to 326_h will be described in more detail below with reference to FIG. 57.

FIG. 56 is a diagram illustrating in more detail an operation of a multiplication scaling circuit of FIG. 55. Referring to FIGS. 1 to 4, 47 and 48, and 55 and 56, the multiplication scaling circuit 325 may repeatedly receive the zero point scale coefficient ZPSC and the first-to-Rth multiplication scale coefficients MSC1 to MSCR.

The multiplication scaling circuit 325 may sequentially receive the plurality of zero point values ZPV and the plurality of common scale coefficients CSC. For example, the multiplication scaling circuit 325 may sequentially receive the first zero point value ZPV1 and R instances of S_r1 for the first row vector (e.g., {right arrow over (w_r1)}) of the weight matrix WM, and may sequentially receive the second zero point value ZPV2 and R instances of S_r2 for the second row vector (e.g., {right arrow over (w_r2)}) of the weight matrix WM.

The multiplication scaling circuit 325 may sequentially output the plurality of quantization scale coefficients QSC and the plurality of zero point values ZPV based on an order in which the plurality of multiplication scale coefficients MSC and the plurality of zero point scale coefficients ZPSC are received, and an order in which the plurality of common scale coefficients CSC and the plurality of zero point values ZPV are received.

For example, the multiplication scaling circuit 325 may generate the first zero point value ZPV1 by multiplying the zero point scale coefficient ZPSC with the first zero point value ZPV1; and generate the plurality of quantization scale coefficients QSC (e.g., α_{1_r1} to α_{R_r1}) for the first row vector (e.g., {right arrow over (w_r1)}) of the weight matrix WM by multiplying the first-to-Rth multiplication scale coefficients MSC1 to MSCR with R instances of S_r1, respectively (illustrated as diagonal stripe).

Similarly, the multiplication scaling circuit 325 may generate the second zero point value ZPV2 by multiplying the zero point scale coefficient ZPSC with the second zero point value ZPV2; and generate the plurality of quantization scale coefficients QSC (e.g., α_{1_r2} to α_{R_r2}) for the second row vector (e.g., {right arrow over (w_r2)}) of the weight matrix WM by multiplying the first-to-Rth multiplication scale coefficients MSC1 to MSCR with R instances of S_r2, respectively (illustrated as a dot pattern).

In this way, the multiplication scaling circuit 325 may sequentially output the plurality of zero point values ZPV and the plurality of quantization scale coefficients QSC.
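As a non-limiting illustration, the coefficient stream produced by the multiplication scaling circuit 325 of FIG. 56 may be sketched in Python as follows. The function name `merge_coefficients` and the example values are hypothetical. The sketch models only the output order (one zero point value followed by R quantization scale coefficients per weight row).

```python
def merge_coefficients(zpsc, msc, zpv, csc):
    """Return ZPV_j followed by alpha_{1_rj} .. alpha_{R_rj} for each weight row j.

    zpsc : zero point scale coefficient ZPSC
    msc  : multiplication scale coefficients MSC1 .. MSCR
    zpv  : zero point values ZPV1 .. ZPVn
    csc  : common scale coefficients S_r1 .. S_rn
    """
    stream = []
    for z_j, s_j in zip(zpv, csc):
        stream.append(zpsc * z_j)              # unchanged when ZPSC is 1
        stream.extend(m * s_j for m in msc)    # alpha_{i_rj} = MSC_i * S_rj
    return stream


# Hypothetical values: R = 2, two weight rows, ZPSC = 1.
coeffs = merge_coefficients(zpsc=1, msc=[1, 2], zpv=[2, 7], csc=[10, 100])
print(coeffs)  # [2, 10, 20, 7, 100, 200]
```

Because the coefficient products depend only on the weight matrix, one such circuit may serve all of the first to (h)-th quantization scaling circuits in this sketch.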

FIG. 57 is a diagram illustrating in more detail the operation of the quantization scaling circuit of FIG. 55. Hereinafter, for a more concise description, the operation of the first quantization scaling circuit 326_1 will be representatively described with reference to FIGS. 1 to 4, 47 and 48, and 55 to 57. However, example embodiments are not limited thereto, and the second to (h)-th quantization scaling circuits 326_2 to 326_h may also operate in a similar manner.

The first quantization scaling circuit 326_1 may receive the first input vector (e.g., {right arrow over (X₁)}). That is, the first quantization scaling circuit 326_1 may sequentially receive x₁₁ to x_1n.

The first quantization scaling circuit 326_1 may sequentially receive the plurality of zero point values ZPV and the plurality of quantization scale coefficients QSC. For example, the first quantization scaling circuit 326_1 may sequentially receive the plurality of zero point values ZPV and the plurality of quantization scale coefficients QSC described above with reference to FIG. 56.

The first quantization scaling circuit 326_1 may scale the first input vector (e.g., {right arrow over (X₁)}) based on the order in which the plurality of zero point values ZPV and the plurality of quantization scale coefficients QSC are received to generate the first quantization scaled input vector (e.g., {right arrow over (QSX₁)}).

For example, the first quantization scaling circuit 326_1 may generate the plurality of quantization scaled input elements QSIE for x₁₁ by multiplying x₁₁ with the first zero point value ZPV1 and α_{1_r1} to α_{R_r1} (illustrated as diagonal stripe).

Similarly, the first quantization scaling circuit 326_1 may generate the plurality of quantization scaled input elements QSIE for x₁₂ by multiplying x₁₂ with the second zero point value ZPV2 and α_{1_r2} to α_{R_r2} (illustrated as a dot pattern).

As a result, the first quantization scaled input vector (e.g., {right arrow over (QSX₁)}) may include the quantization scaled input elements QSIE corresponding to the products of the plurality of zero point values ZPV and the plurality of input elements.
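As a non-limiting illustration, the quantization scaling described with reference to FIG. 57 may be sketched in Python as follows. The function name `quantization_scale` and the example values are hypothetical. The sketch assumes that the coefficient stream carries, for each input element, one zero point value followed by R quantization scale coefficients.

```python
def quantization_scale(x, coeffs, R):
    """Multiply each input element x_1j by its R+1 consecutive coefficients.

    x      : input elements x_11 .. x_1n
    coeffs : per row j, ZPV_j followed by alpha_{1_rj} .. alpha_{R_rj}
    """
    assert len(coeffs) == (R + 1) * len(x)
    # Coefficient k belongs to input element k // (R + 1).
    return [x[k // (R + 1)] * c for k, c in enumerate(coeffs)]


# Hypothetical values: n = 2, R = 2.
qsx_1 = quantization_scale(x=[3, 5], coeffs=[2, 10, 20, 7, 100, 200], R=2)
print(qsx_1)  # [6, 30, 60, 35, 500, 1000]
```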

FIG. 58 is a diagram illustrating in more detail an operation of the processing element array of FIG. 48. Referring to FIGS. 1 to 4 and FIGS. 48 to 58, the processing element array 350 may include the plurality of processing element rows PER. Each of the plurality of processing element rows PER may include the plurality of processing elements. Hereinafter, for a more concise description, the operation of the first processing element row PER1 will be representatively described. However, example embodiments are not limited thereto.

The first processing element row PER1 may include the processing elements PE11 to PE1m. The first processing element row PER1 may receive the first fixed point quantization scaled input vector (e.g., {right arrow over (QSX′₁)}). For example, each of the processing elements PE11 to PE1m may sequentially receive the plurality of fixed point quantization scaled input elements QSIEfxp. That is, the processing elements PE11 to PE1m may sequentially receive the plurality of quantization scaled input elements QSIE described above with reference to FIGS. 51, 54, and 57 in fixed point format. In this case, some of the plurality of fixed point quantization scaled input elements QSIEfxp received by each of the processing elements PE11 to PE1m may correspond to the products of the zero point value ZPV and the input element.

Each of the processing elements PE11 to PE1m may receive the plurality of quantization sign bits QSB and the plurality of zero point correction bits ZPCB. For example, the processing element PE11 may receive the first plurality of quantization sign bits QSBs_1 and the plurality of zero point correction bits ZPCB. The processing element PE12 may receive the second plurality of quantization sign bits QSBs_2 and the plurality of zero point correction bits ZPCB.

For a more detailed example, the processing element PE11 may receive one zero point correction bit ZPCB, and then receive the R quantization sign bits QSB (e.g., b_{1_1_r1} to b_{1_R_r1}) corresponding to x₁₁. Thereafter, the processing element PE11 may receive one zero point correction bit ZPCB, and then receive the R quantization sign bits QSB (e.g., b_{1_1_r2} to b_{1_R_r2}) corresponding to x₁₂. In this way, the processing element PE11 may sequentially receive the first plurality of quantization sign bits QSBs_1 and the plurality of zero point correction bits ZPCB.

The processing element PE11 may calculate the fixed point output element (e.g., y′₁₁) based on an order in which the quantization sign bits QSB and the plurality of zero point correction bits ZPCB are received, and an order in which the plurality of fixed point quantization scaled input elements QSIEfxp are received. For example, similar to that described above with reference to FIGS. 37 and 38, the processing element PE11 may calculate the fixed point output element (e.g., y′₁₁) based on whether the quantization sign bit QSB and the zero point correction bit ZPCB corresponding to the received fixed point quantization scaled input element QSIEfxp represent ‘0’ or ‘1’.
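As a non-limiting illustration, the accumulation performed by one processing element may be sketched in Python as follows. The bit convention used here is an assumption made only for this sketch and is not taken from the description above: a bit of ‘0’ is assumed to add the corresponding element and a bit of ‘1’ is assumed to subtract it, so that a zero point correction bit ZPCB of ‘0’ adds the zero point term. The function name `processing_element` and the example values are hypothetical.

```python
def processing_element(qsie_fxp, bits):
    """Accumulate one fixed point output element from a stream of elements and bits.

    qsie_fxp : fixed point quantization scaled input elements, in arrival order
    bits     : per input element, one ZPCB followed by R quantization sign bits
    """
    assert len(qsie_fxp) == len(bits)
    acc = 0
    for e, b in zip(qsie_fxp, bits):
        # ASSUMPTION: bit 0 adds the element, bit 1 subtracts it.
        acc += -e if b else e
    return acc


# Hypothetical stream for n = 2, R = 2 (integers stand in for fixed point values).
qsie = [6, 30, 60, 35, 500, 1000]
bits = [0, 0, 1, 0, 1, 0]   # ZPCB, b1, b2 for x11, then ZPCB, b1, b2 for x12
y_11 = processing_element(qsie, bits)
print(y_11)  # 511

# Cross-check: weights reconstructed under the same assumed convention,
# using ZPV = [2, 7], S = [10, 100], and scale factors 2^0 and 2^1.
w = [2 + 10 - 20, 7 - 100 + 200]
assert y_11 == 3 * w[0] + 5 * w[1]
```

In this sketch, the processing element uses only additions and subtractions; the multiplications by the zero point values are already contained in the received elements.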

Therefore, according to some example embodiments, the product of the zero point value ZPV and the input element may be calculated in the matrix multiplier 300. In this case, the matrix multiply device MMD may be able to calculate the product of the input matrix XM and the weight matrix WM even if it does not include a separate multiplier for calculating the product of the zero point value ZPV and the input element.

In other words, according to some example embodiments, even if the uniform BCQ circuit UBC performs the uniform binary coding quantization for the weight matrix WM asymmetrically, the matrix multiplier 300 may be able to calculate the product of the input matrix XM and the weight matrix WM with a small amount of computation. Therefore, according to some example embodiments, the versatility of the matrix multiply device MMD may increase.
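The accumulation performed by a processing element such as PE11 on the interleaved stream described above may be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `build_streams` and `pe_accumulate` are hypothetical, fixed point values are modeled as Python integers, and a bit of ‘1’ is assumed to mean "add" while a bit of ‘0’ is assumed to mean "subtract" for both the quantization sign bits QSB and the zero point correction bits ZPCB. The actual bit encoding is the one described with reference to FIGS. 37 and 38.

```python
def build_streams(x, scale_coeffs, zpv, sign_bits, zp_bits):
    """Model the upstream scaling: for each input element x_j, emit ZPV * x_j
    followed by the R scaled values, paired with one ZPCB and R QSBs.
    All names here are illustrative assumptions, not taken from the figures."""
    values, bits = [], []
    for j, xj in enumerate(x):
        values.append(zpv * xj)          # zero point product for x_j
        bits.append(zp_bits[j])          # one zero point correction bit
        for r, alpha in enumerate(scale_coeffs):
            values.append(alpha * xj)    # r-th quantization scaled input element
            bits.append(sign_bits[j][r]) # r-th quantization sign bit for x_j
    return values, bits


def pe_accumulate(values, bits):
    """Processing element model: only add or subtract, never multiply.
    Assumption: bit 1 adds the received value, bit 0 subtracts it."""
    acc = 0
    for value, bit in zip(values, bits):
        acc += value if bit == 1 else -value
    return acc


# Two input elements, R = 2 scale coefficients, zero point value 1.
values, bits = build_streams(
    x=[3, -2], scale_coeffs=[4, 2], zpv=1,
    sign_bits=[[1, 0], [0, 0]], zp_bits=[1, 1],
)
y = pe_accumulate(values, bits)
```

In this sketch the effective weights are 4 − 2 + 1 = 3 and −4 − 2 + 1 = −5, so the accumulated value equals 3·3 + (−2)·(−5) = 19 even though the processing element model contains no multiplier.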

FIG. 59 is a block diagram illustrating the processing element array of FIG. 26 implemented in a systolic array manner. Referring to FIGS. 1 to 4, 23 to 48, and 59, the processing element array 250 may include the plurality of processing elements PE arranged in a row direction and a column direction. The plurality of processing elements PE may operate in the systolic array scheme.

The processing element array 250 may be implemented to sequentially propagate the plurality of fixed point quantization scaled input elements QSIEfxp in the row direction. For example, the first processing element row PER1 may sequentially propagate the plurality of fixed point quantization scaled input elements QSIEfxp included in the first fixed point quantization scaled input vector (e.g., {right arrow over (QSX′₁)}) in the row direction.

For a more detailed example, the processing element PE11 may receive one fixed point quantization scaled input element QSIEfxp at a first time point. The processing element PE11 may transfer the fixed point quantization scaled input element QSIEfxp to the processing element PE12 disposed adjacent to the processing element PE11 in the row direction at a second time point after the first time point. In this way, the processing elements PE included in the first processing element row PER1 may sequentially transfer the plurality of fixed point quantization scaled input elements QSIEfxp provided from the first data type converter 230 to adjacent processing elements.

The processing element array 250 may be implemented to sequentially propagate the plurality of quantization sign bits QSB in the column direction. For example, the first processing element column PEC1 may sequentially propagate the first plurality of quantization sign bits QSBs_1 in the column direction.

For a more detailed example, the processing element PE11 may receive one quantization sign bit QSB at the first time point. The processing element PE11 may transfer the quantization sign bit QSB to the processing element PE21 disposed adjacent to the processing element PE11 in the column direction at the second time point after the first time point. In this way, the processing elements PE included in the first processing element column PEC1 may sequentially transfer the first plurality of quantization sign bits QSBs_1 provided from the quantization sign bit buffer 240 to adjacent processing elements.

The plurality of processing elements PE may generate different fixed point output elements OEfxp. The processing element array 250 may be implemented to sequentially propagate the fixed point output element OEfxp in the column direction. For example, the fixed point output element OEfxp (e.g., y′₁₁) calculated by the processing element PE11 may be sequentially transferred in the column direction. In this way, the fixed point output element OEfxp may be transferred to the second data type converter 260. The method of propagating the fixed point output element OEfxp is similar to the method of propagating the quantization sign bit QSB described above, and therefore, detailed description thereof will be omitted.

That is, each of the processing elements included in the processing element array 250 may be implemented to receive one or more of the fixed point output element OEfxp, the quantization sign bit QSB, and the fixed point quantization scaled input element QSIEfxp from adjacently disposed processing elements. On the other hand, each of the processing elements included in the processing element array 250 may be implemented to transfer one or more of the fixed point output element OEfxp, the quantization sign bit QSB, and the fixed point quantization scaled input element QSIEfxp to adjacently disposed processing elements. A more detailed configuration of the processing element PE operating in the systolic array scheme will be described in more detail below with reference to FIG. 60.

For a more concise description, FIG. 59 representatively illustrates various example embodiments in which the fixed point output element OEfxp, the quantization sign bit QSB, and the fixed point quantization scaled input element QSIEfxp are each propagated in the systolic array scheme, but example embodiments are not limited thereto. For example, the processing element array 250 may be implemented to propagate only one or two of the fixed point output elements OEfxp, the quantization sign bits QSB, and the fixed point quantization scaled input elements QSIEfxp in the systolic array scheme.

In example embodiments, each of the processing elements included in the processing element array 250 may operate in response to the same control clock signal. In this case, each of the plurality of processing elements may propagate the fixed point output element OEfxp, the quantization sign bit QSB, and/or the fixed point quantization scaled input element QSIEfxp to another processing element at the same time point. However, example embodiments are not limited thereto.
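The row-direction and column-direction propagation described above may be illustrated with a small cycle-level model. The sketch below is an assumption-laden simplification rather than the circuit of FIG. 59: the function name `systolic_accumulate` is hypothetical, every processing element is driven by one shared clock, a sign bit of ‘1’ is assumed to add and ‘0’ to subtract, and the input streams are assumed to be skewed by one cycle per row and per column so that matching values and bits meet in each processing element. The propagation of the fixed point output elements toward the second data type converter is omitted.

```python
def systolic_accumulate(values, bits):
    """values[i][t]: t-th scaled input element fed to processing element row i.
    bits[t][j]:   t-th sign bit fed to processing element column j.
    Returns acc[i][j], the value accumulated by the element in row i, column j."""
    h, T, m = len(values), len(values[0]), len(bits[0])
    val_reg = [[None] * m for _ in range(h)]  # per-element input registers
    bit_reg = [[None] * m for _ in range(h)]  # per-element sign bit registers
    acc = [[0] * m for _ in range(h)]         # per-element accumulators

    for cycle in range(T + h + m - 2):
        new_val = [[None] * m for _ in range(h)]
        new_bit = [[None] * m for _ in range(h)]
        for i in range(h):
            for j in range(m):
                if j == 0:   # left edge: take from the (skewed) input stream
                    t = cycle - i
                    new_val[i][j] = values[i][t] if 0 <= t < T else None
                else:        # otherwise: take from the left neighbor
                    new_val[i][j] = val_reg[i][j - 1]
                if i == 0:   # top edge: take from the (skewed) bit stream
                    t = cycle - j
                    new_bit[i][j] = bits[t][j] if 0 <= t < T else None
                else:        # otherwise: take from the upper neighbor
                    new_bit[i][j] = bit_reg[i - 1][j]
        val_reg, bit_reg = new_val, new_bit   # all registers update together
        for i in range(h):
            for j in range(m):
                v, b = val_reg[i][j], bit_reg[i][j]
                if v is not None and b is not None:
                    acc[i][j] += v if b == 1 else -v
    return acc


result = systolic_accumulate(
    values=[[1, 2, 3], [4, 5, 6]],
    bits=[[1, 0], [0, 1], [1, 1]],
)
```

With these inputs the element in row 1, column 1 accumulates 1 − 2 + 3 = 2, and the complete result is [[2, 4], [5, 7]], which matches a direct signed sum over each row and column pair.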

For a more concise description, FIG. 59 representatively illustrates example embodiments in which the processing element array 250 described with reference to FIGS. 23 to 48 operates in the systolic array scheme, but example embodiments are not limited thereto. For example, according to some example embodiments, the processing element array 150 described with reference to FIGS. 5 to 22 or the processing element array 350 described with reference to FIGS. 49 to 58 may also operate in the systolic array scheme.

FIG. 60 is a diagram illustrating in more detail a configuration of the processing element of FIG. 59. Referring to FIGS. 1 to 4, 23 to 48, 59, and 60, each processing element PE may include the arithmetic logic unit ALU, the accumulation register REG_ACC, a quantization scaled input element register REG_QSIE, a quantization sign bit register REG_QSB, and an output register REG_OUT. For a more concise description, detailed descriptions of the configuration and operation of the arithmetic logic unit ALU and the accumulation register REG_ACC described above with reference to FIG. 38 will be omitted.

Hereinafter, various example embodiments in which the arithmetic logic unit ALU, the accumulation register REG_ACC, the quantization scaled input element register REG_QSIE, the quantization sign bit register REG_QSB, and the output register REG_OUT operate in response to the same control clock signal are representatively described. However, example embodiments are not limited thereto.

The quantization scaled input element register REG_QSIE may sequentially receive the plurality of fixed point quantization scaled input elements QSIEfxp. The quantization scaled input element register REG_QSIE may receive one fixed point quantization scaled input element QSIEfxp, and transfer the received fixed point quantization scaled input element QSIEfxp to the adjacent processing element PE and the first input terminal TI1 after one cycle of the control clock signal has elapsed.

The quantization sign bit register REG_QSB may sequentially receive the plurality of quantization sign bits QSB. The quantization sign bit register REG_QSB may receive one quantization sign bit QSB and transfer the received one quantization sign bit QSB to the adjacent processing element PE and the second input terminal TI2 after one cycle of the control clock signal has elapsed.

The output register REG_OUT may receive the fixed point output element OEfxp from the accumulation register REG_ACC. For example, the output register REG_OUT may transfer the fixed point output element OEfxp to the adjacent processing element PE or the second data type converter 260.

That is, the output register REG_OUT may provide the fixed point output element OEfxp to the output register of another adjacent processing element or directly provide the fixed point output element OEfxp to the second data type converter 260. For example, the fixed point output element OEfxp (e.g., y′₁₁) calculated by the processing element PE11 may be sequentially transferred to the second data type converter 260 through the processing elements PE21 to PEh1. However, example embodiments are not limited thereto.

For a more concise description, various example embodiments in which each of the registers included in the processing element PE receives and outputs data for each cycle of the control clock signal are described with reference to FIG. 60, but example embodiments are not limited to the specific operation method of the registers in response to the control clock signal.
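The register behavior described with reference to FIG. 60 may be modeled for a single processing element as follows. This is a hedged behavioral sketch, not the disclosed circuit: the class name `ProcessingElement` and its method names are hypothetical, each register is assumed to hold its content for exactly one cycle of the control clock signal, and the arithmetic logic unit is assumed to add on a sign bit of ‘1’ and subtract on ‘0’.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ProcessingElement:
    reg_qsie: Optional[int] = None  # models REG_QSIE
    reg_qsb: Optional[int] = None   # models REG_QSB
    reg_acc: int = 0                # models REG_ACC
    reg_out: Optional[int] = None   # models REG_OUT

    def clock(self, qsie_in: Optional[int], qsb_in: Optional[int]
              ) -> Tuple[Optional[int], Optional[int]]:
        """One cycle: the values latched in the previous cycle go to the ALU
        (terminals TI1 and TI2) and to the adjacent elements, then the
        registers latch the new inputs."""
        qsie_out, qsb_out = self.reg_qsie, self.reg_qsb
        if qsie_out is not None and qsb_out is not None:
            # Assumed encoding: bit 1 adds, bit 0 subtracts.
            self.reg_acc += qsie_out if qsb_out == 1 else -qsie_out
        self.reg_qsie, self.reg_qsb = qsie_in, qsb_in
        return qsie_out, qsb_out  # forwarded to the adjacent elements

    def latch_output(self) -> int:
        """Copy the accumulated value into the output register."""
        self.reg_out = self.reg_acc
        return self.reg_out


pe = ProcessingElement()
forwarded = [pe.clock(5, 1), pe.clock(3, 0), pe.clock(None, None)]
final_value = pe.latch_output()
```

Each input appears at the forwarded outputs one cycle after it is received, so `forwarded` is `[(None, None), (5, 1), (3, 0)]`, and the accumulated value is 5 − 3 = 2.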

FIG. 61 is a diagram illustrating the operation of the matrix multiply device of FIG. 1 according to various example embodiments. Referring to FIGS. 1 to 48 and 61, the matrix multiply device MMD may receive a full-input matrix FXM. The full-input matrix FXM may include the input matrix XM described above with reference to FIGS. 1 to 48. For example, the full-input matrix FXM may include a plurality of input matrices XM.

The matrix multiply device MMD may receive a full-weight matrix FWM. The full-weight matrix FWM may include the weight matrix WM described above with reference to FIGS. 1 to 48. For example, the full-weight matrix FWM may include a plurality of weight matrices WM.

The uniform BCQ circuit UBC may perform the uniform binary coding quantization operation for each of the plurality of weight matrices WM included in the full-weight matrix FWM. For example, the uniform BCQ circuit UBC may generate the plurality of quantization sign bits QSB, the plurality of common scale coefficients CSC, and the plurality of multiplication scale coefficients MSC from each of the plurality of weight matrices WM.

The matrix multiplier 100 may receive the plurality of quantization sign bits QSB, the plurality of common scale coefficients CSC, and the plurality of multiplication scale coefficients MSC. The matrix multiplier 100 may perform the matrix multiplication on the full-input matrix FXM and the full-weight matrix FWM based on the plurality of quantization sign bits QSB, the plurality of common scale coefficients CSC, and the plurality of multiplication scale coefficients MSC.

The matrix multiplier 100 may perform the matrix multiplication on the full-input matrix FXM and the full-weight matrix FWM through one of various tiling techniques. For example, the matrix multiplier 100 may calculate the full-output matrix FYM by sequentially calculating the products of the plurality of input matrices XM and the plurality of weight matrices WM and then combining the calculated results.

FIG. 62 is a diagram illustrating the full-input matrix of FIG. 61. Referring to FIGS. 1 to 48 and 61 and 62, the full-input matrix FXM may include the plurality of input matrices XM. In other words, the full-input matrix FXM may be tiled with the plurality of input matrices XM arranged in the row direction and column direction. Hereinafter, for more concise description, the input matrix arranged in the (i)-th row and (j)-th column of the full-input matrix FXM will be referred to as “XM_ij”.

In example embodiments, the input matrix XM described above with reference to FIGS. 1 to 48 may be one of the plurality of input matrices XM included in the full-input matrix FXM.

In example embodiments, the plurality of input matrices XM included in the full-input matrix FXM may have the same row size and column size. For example, each of the plurality of input matrices XM may include ‘n’ input elements for each row. Each of the plurality of input matrices XM may include ‘h’ input elements for each column.

The row size of the full-input matrix FXM may be an integer multiple of the row size of each of the plurality of input matrices XM. For example, one row of the full-input matrix FXM may include ‘N’ input elements. In this case, ‘N’ may be an integer multiple of ‘n’.

The column size of the full-input matrix FXM may be an integer multiple of the column size of each of the plurality of input matrices XM. For example, one column of the full-input matrix FXM may include ‘H’ input elements. In this case, ‘H’ may be an integer multiple of ‘h’.

FIG. 63 is a diagram illustrating the full-weight matrix of FIG. 61. Referring to FIGS. 1 to 48, 61, and 63, the full-weight matrix FWM may include the plurality of weight matrices WM. In other words, the full-weight matrix FWM may be tiled with the plurality of weight matrices WM arranged in the row direction and column direction. Hereinafter, for more concise description, the weight matrix arranged in the (i)-th row and (j)-th column of the full-weight matrix FWM will be referred to as “WM_ij”.

In example embodiments, the weight matrix WM described above with reference to FIGS. 1 to 48 may be one of the plurality of weight matrices WM included in the full-weight matrix FWM.

In example embodiments, each of the plurality of weight matrices WM included in the full-weight matrix FWM may have the same row size and column size. For example, each of the plurality of weight matrices WM may include ‘m’ weights for each row. Each of the plurality of weight matrices WM may include ‘n’ weights for each column.

The row size of the full-weight matrix FWM may be an integer multiple of the row size of each of the plurality of weight matrices WM. For example, one row of the full-weight matrix FWM may include ‘M’ weights. In this case, ‘M’ may be an integer multiple of ‘m’.

The column size of the full-weight matrix FWM may be an integer multiple of the column size of each of the plurality of weight matrices WM. For example, one column of the full-weight matrix FWM may include ‘N’ weights. In this case, ‘N’ may be an integer multiple of ‘n’.

The uniform BCQ circuit UBC may perform the uniform binary coding quantization operation for each of the plurality of weight matrices WM included in the full-weight matrix FWM. In this case, the plurality of common scale coefficients CSC generated based on the weight matrix WM_11 may be different from the plurality of common scale coefficients CSC generated based on the weight matrix WM_12. Likewise, the plurality of quantization sign bits QSB generated based on the weight matrix WM_11 may be different from the plurality of quantization sign bits QSB generated based on the weight matrix WM_12. The specific method in which the uniform BCQ circuit UBC performs the uniform binary coding quantization operation for each weight matrix WM is similar to that described above with reference to FIGS. 1 to 48, and therefore, detailed description thereof will be omitted.

FIG. 64 is a diagram illustrating the full-output matrix of FIG. 61. Referring to FIGS. 1 to 48 and 61 and 64, the full-output matrix FYM may correspond to the product of the full-input matrix FXM and the full-weight matrix FWM.

The full-output matrix FYM may include a plurality of sub-matrices FYM_sub arranged in the row direction and the column direction. Hereinafter, for more concise description, the sub-matrix arranged in the (i)-th row and (j)-th column of the full-output matrix FYM will be referred to as “FYM_sub_ij”.

Each of the plurality of sub-matrices FYM_sub may have the same row size and column size. The row size of each of the plurality of sub-matrices FYM_sub may be the same as the row size of the weight matrix WM. The column size of each of the plurality of sub-matrices FYM_sub may be the same as the column size of the input matrix XM. For example, each of the plurality of sub-matrices FYM_sub may include ‘m’ output elements for each row. Each of the plurality of sub-matrices FYM_sub may include ‘h’ output elements for each column.

The row size of the full-output matrix FYM may be the same as the row size of the full-weight matrix FWM. For example, the row size of the full-output matrix FYM may be ‘M’.

The column size of the full-output matrix FYM may be the same as the column size of the full-input matrix FXM. For example, the column size of the full-output matrix FYM may be ‘H’.

The matrix multiply device MMD may calculate the full-output matrix FYM in units of sub-matrix FYM_sub. For example, the matrix multiply device MMD may calculate one sub-matrix FYM_sub by adding the products of a plurality of tiled input matrices XM and the plurality of tiled weight matrices WM, respectively.

For a more detailed example, when ‘N’ is three times ‘n’, the matrix multiplier 100 may calculate the sub-matrix FYM_sub_11 by sequentially calculating and adding the product of the input matrix XM_11 and the weight matrix WM_11, the product of the input matrix XM_12 and the weight matrix WM_21, and the product of the input matrix XM_13 and the weight matrix WM_31. In this case, each product of the tiled input matrix and the tiled weight matrix may correspond to the output matrix YM described above with reference to FIGS. 1 to 48. In some examples, the matrix multiply device MMD may calculate a first output matrix based on a product of the input matrix XM_11 and the weight matrix WM_11, calculate a second output matrix based on a product of the input matrix XM_12 and the weight matrix WM_21, and calculate a third output matrix based on a product of the input matrix XM_13 and the weight matrix WM_31. Thereafter, the matrix multiply device MMD may calculate the sub-matrix FYM_sub_11 by adding the above-described first to third output matrices. However, example embodiments are not limited thereto.

That is, the matrix multiply device MMD may be implemented to calculate one sub-matrix (e.g., a portion of the full-output matrix FYM) by accumulating the plurality of output matrices. For example, the matrix multiply device MMD may be implemented to temporarily store the plurality of output matrices in a memory device (e.g., an external volatile memory device such as an SRAM device) and then accumulate the stored output matrices to calculate one sub-matrix. In this way, the matrix multiply device MMD may be able to sequentially calculate the plurality of sub-matrices FYM_sub to calculate the full-output matrix FYM.
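The tile-by-tile accumulation described above may be sketched in software as follows. The sketch is illustrative only and makes several assumptions: matrices are modeled as nested Python lists of integers, the helper names `matmul`, `tile`, and `tiled_matmul` are hypothetical, and an exact tile product stands in for the quantized multiplication performed by the matrix multiplier 100. Following the sizes given above, each input matrix XM has h rows and n columns, each weight matrix WM has n rows and m columns, and each sub-matrix FYM_sub has h rows and m columns.

```python
def matmul(a, b):
    """Plain matrix product, standing in for one output matrix YM."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def tile(mat, i, j, rows, cols):
    """Return the tile in tile-row i and tile-column j (0-based)."""
    return [row[j * cols:(j + 1) * cols]
            for row in mat[i * rows:(i + 1) * rows]]


def tiled_matmul(fxm, fwm, h, n, m):
    """Compute FYM = FXM x FWM one sub-matrix at a time."""
    H, N, M = len(fxm), len(fxm[0]), len(fwm[0])
    assert len(fwm) == N and H % h == 0 and N % n == 0 and M % m == 0
    fym = [[0] * M for _ in range(H)]
    for i in range(H // h):          # tile-row of FYM
        for j in range(M // m):      # tile-column of FYM
            for k in range(N // n):  # accumulate XM_ik x WM_kj
                ym = matmul(tile(fxm, i, k, h, n), tile(fwm, k, j, n, m))
                for r in range(h):
                    for c in range(m):
                        fym[i * h + r][j * m + c] += ym[r][c]
    return fym


fxm = [[1, 2, 3, 4], [5, 6, 7, 8]]    # H = 2, N = 4
fwm = [[1, 0], [0, 1], [1, 1], [2, 0]]  # N = 4, M = 2
fym = tiled_matmul(fxm, fwm, h=1, n=2, m=1)
```

Here ‘N’ is twice ‘n’, so each sub-matrix is the sum of two tile products, and the result [[12, 5], [28, 13]] equals the product computed without tiling.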

FIG. 65 is a block diagram illustrating a neural processing system implemented according to example embodiments. Referring to FIG. 65, the neural processing system 2000 may include a central processing unit 2100, a neural processing unit 2200, a volatile memory device 2300, a nonvolatile memory device 2400, and a user interface 2500. The central processing unit 2100, the neural processing unit 2200, the volatile memory device 2300, the nonvolatile memory device 2400, and the user interface 2500 may be connected to each other through a bus BUS.

The central processing unit 2100 may control an overall operation of the neural processing system 2000. For example, the central processing unit 2100 may control each component of the neural processing system 2000 to operate the artificial intelligence model.

In example embodiments, the artificial intelligence model that the neural processing system 2000 executes may be or include or be included in one or more of any type of the artificial intelligence model such as a language model, an image identification model, an image generation model, a weather analysis model, or the like. For example, the artificial intelligence model that neural processing system 2000 executes may be or include or be included in one or more of any type of the artificial intelligence model such as GPT-3, GPT-4, Pangu, GShard, Megatron-LM, or the like. However, example embodiments are not limited thereto.

In example embodiments, the artificial intelligence model executed by the neural processing system 2000 may perform an inference operation and/or a training operation. However, example embodiments are not limited thereto.

Each artificial intelligence model may include a plurality of processing layers. Each of the plurality of processing layers may be implemented to receive layer input data to generate layer output data. In this case, the generated layer output data may be used as layer input data for another processing layer. For example, layer output data generated from a first processing layer may be used as layer input data for a second processing layer. The artificial intelligence model and the processing layers will be described in more detail with reference to FIG. 66.

Each of the plurality of processing layers may transform layer input data into layer output data based on matrix multiplication calculation. For example, each of the plurality of processing layers may generate the output matrix corresponding to the layer output data by multiplying the input matrix corresponding to the layer input data, by the weight matrix. However, the range of the present disclosure is not limited thereto, and each of the plurality of processing layers may generate the output data by converting the input matrix corresponding to the layer input data in any manner. For example, each of the plurality of processing layers may be implemented to generate the layer output data by sequentially multiplying the input matrix corresponding to the layer input data by the plurality of weight matrices, or to convert the input matrix to the layer output data based on any conversion parameter. In other words, example embodiments are not limited to the specific manner in which each of the plurality of processing layers transforms the layer input data.

The neural processing unit 2200 may include a matrix multiplication device 2210. The matrix multiplication device 2210 may execute at least some of calculations included in the plurality of processing layers. For example, the matrix multiplication device 2210 may perform a matrix multiplication calculation included in the plurality of processing layers.

In example embodiments, the matrix multiplication calculation may occupy most of a processing load required/used by the neural processing system 2000 to execute each of the plurality of processing layers.

In example embodiments, the matrix multiplication device 2210 may be implemented as the matrix multiplication device MMD described above with reference to FIGS. 1 to 58. In this case, the neural processing unit 2200 may execute the calculations included in the plurality of processing layers with a smaller computation amount. Therefore, the neural processing system 2000 including the matrix multiplication device MMD according to various example embodiments of the present disclosure may drive the artificial intelligence model at a faster speed and using less power.

The volatile memory device 2300 may be used as an operating memory of the neural processing unit 2200. For example, the volatile memory device 2300 may temporarily store data generated during an operation of the neural processing unit 2200.

In example embodiments, the neural processing unit 2200 may access the volatile memory device 2300 to execute the calculations included in the plurality of processing layers. For example, the neural processing unit 2200 may be implemented to read a parameter stored in the volatile memory device 2300 to perform a calculation for layer input data, or may be implemented to temporarily store intermediate data generated during the calculation in the volatile memory device 2300.

In example embodiments, a calculation speed of the neural processing unit 2200 may be higher than an access speed of the neural processing unit 2200 to the volatile memory device 2300. Accordingly, a bottleneck phenomenon may occur in an operating speed of the artificial intelligence model due to a communication speed between the neural processing unit 2200 and the volatile memory device 2300.

In example embodiments, when the matrix multiply device 2210 is implemented as the matrix multiply device MMD described above with reference to FIGS. 1 to 58, one output element may be calculated by one processing element PE included in the matrix multiply device MMD. In this case, since one output element may be generated without combining the output results of the plurality of processing elements PE, the number of accesses by the neural processing unit 2200 to the volatile memory device 2300 may be reduced or minimized, and the artificial intelligence model may be driven based on a smaller processing element array (e.g., fewer processing elements PE). Therefore, according to some example embodiments, the operation speed of the artificial intelligence model may be improved, and the production cost of the neural processing system 2000 may be reduced or minimized.

In example embodiments, when the matrix multiply device 2210 is implemented as the matrix multiply device MMD described above with reference to FIGS. 1 to 58, each processing element PE may not include a ‘1-bit adder’ to perform the multiplication operation for the multiplication scale coefficient MSC. According to some example embodiments, the result of the multiplication scaling circuit scaling the input element is provided to the processing element array, so the size and production cost of each processing element PE included in the matrix multiply device MMD may be reduced.

In example embodiments, the volatile memory device 2300 may be implemented with any type of volatile memory such as one or more of a dynamic random access memory (DRAM), a static random access memory (SRAM), or the like.

In example embodiments, the volatile memory device 2300 may be used as a buffer memory, an operating memory, or a cache memory of the central processing unit 2100. However, example embodiments are not limited thereto.

The nonvolatile memory device 2400 may store data for the operation of the neural processing system 2000. For example, the nonvolatile memory device 2400 may store various types of data such as a parameter for an operating system (OS) of the neural processing system 2000, a parameter for driving the artificial intelligence model, and the like. However, example embodiments are not limited thereto.

The central processing unit 2100 may communicate, e.g., with one or more users through the user interface 2500. The central processing unit 2100 may provide model input data provided by the user, to the volatile memory device 2300 or the neural processing unit 2200, through the user interface 2500. The central processing unit 2100 may return model output data generated by the artificial intelligence model based on the model input data to the user through the user interface 2500.

FIG. 66 is a block diagram illustrating an artificial intelligence model driven by the neural processing system of FIG. 65. Referring to FIGS. 65 and 66, the neural processing system 2000 may drive an artificial intelligence model AIM.

The artificial intelligence model AIM may receive the model input data MID. The artificial intelligence model AIM may include first to L-th processing layers PL_1-PL_L.

The artificial intelligence model AIM may generate the model output data MOD by sequentially converting the model input data MID through the first to L-th processing layers PL_1-PL_L. For example, the first processing layer PL_1 may receive the model input data MID, and may generate second layer input data LID_2. The second processing layer PL_2 may receive the second layer input data LID_2, and may generate third layer input data LID_3. In this way, the L-th processing layer PL_L may receive L-th layer input data LID_L, and may generate the model output data MOD.

Each of the first to L-th processing layers PL_1-PL_L may transform (or convert) the received data into data to be output through various types of calculations. For example, a matrix multiplication calculation may be included in calculations performed by the first processing layer PL_1 to transform the model input data MID to the second layer input data LID_2. Similarly, each of the first to L-th processing layers PL_1-PL_L may have to perform the matrix multiplication calculation to transform the received layer input data. However, the range is not limited thereto, and some of the first to L-th processing layers PL_1-PL_L may not perform the matrix multiplication calculation.
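The serial data flow through the first to L-th processing layers may be illustrated with the short sketch below. It is a simplification under stated assumptions: every processing layer is modeled as a single matrix multiplication by one weight matrix, the function names `matmul` and `run_model` are hypothetical, and an exact product is used in place of the matrix multiplication device 2210.

```python
def matmul(a, b):
    """Plain matrix product used as a stand-in for the matrix multiplication
    calculation performed within a processing layer."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def run_model(mid, weight_matrices):
    """Pass the model input data MID through the layers in order.
    The output of layer l is used as the layer input data of layer l + 1;
    the output of the last layer is the model output data MOD."""
    lid = mid
    for wm in weight_matrices:
        lid = matmul(lid, wm)
    return lid


mid = [[1, 2]]
layers = [
    [[1, 0], [0, 1]],  # layer 1: identity, so LID_2 equals MID
    [[2], [3]],        # layer 2: reduces to a single output element
]
mod = run_model(mid, layers)
```

In this example the second layer input data LID_2 is [[1, 2]] and the model output data is [[8]], since 1·2 + 2·3 = 8.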

In example embodiments, the matrix multiplication calculation performed by each of the first to L-th processing layers PL_1-PL_L may be performed through the matrix multiplication device 2210.

In example embodiments, if the matrix multiplication device 2210 is implemented as the matrix multiplication device MMD described above with reference to FIGS. 1 to 48, the matrix multiplication device 2210 may output a result of the matrix multiplication calculation at a faster speed. Therefore, according to various example embodiments, an operation speed of the artificial intelligence model AIM may be improved.

For a more concise description, various example embodiments in which the artificial intelligence model AIM includes the plurality of processing layers operating in series are representatively described with reference to FIG. 66, but example embodiments are not limited thereto. For example, the artificial intelligence model AIM may further include a processing layer that operates in parallel with at least some of the first to L-th processing layers PL_1-PL_L described above. That is, example embodiments are not limited to the specific implementation method of the artificial intelligence model AIM.

Any of the elements and/or functional blocks disclosed above may include or be implemented in processing circuitry such as hardware including logic circuits; a hardware/software combination such as a processor executing software; or a combination thereof. For example, the processing circuitry more specifically may include, but is not limited to, a central processing unit (CPU), an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a System-on-Chip (SoC), a programmable logic unit, a microprocessor, application-specific integrated circuit (ASIC), etc. The processing circuitry may include electrical components such as at least one of transistors, resistors, capacitors, etc. The processing circuitry may include electrical components such as logic gates including at least one of AND gates, OR gates, NAND gates, NOT gates, etc.

The contents described above are specific example embodiments for implementing various inventive concepts. Inventive concepts may include not only the above-described example embodiments but also other example embodiments that may be simply changed in design or may be easily modified. Additionally, inventive concepts may also include technologies that may be easily modified and implemented using various example embodiments. Therefore, the scope of inventive concepts should not be limited to the above-described example embodiments, but should be defined by the claims described below as well as equivalents of the claims. Furthermore, example embodiments are not necessarily mutually exclusive with one another. For example, some example embodiments may include one or more features described with reference to one or more figures, and may also include one or more other features described with reference to one or more other figures.

Number	Date	Country	Kind
10-2023-0143978	Oct 2023	KR	national
10-2024-0019936	Feb 2024	KR	national
