This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2008-294828 filed on Nov. 18, 2008, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to an error judging circuit and a shared memory system.
As methods for keeping cache coherency in a multiprocessor system or a shared memory system, there are the SMP (Symmetric Multi Processing) system and the ccNUMA (cache coherent Non-Uniform Memory Access) system.
In the ccNUMA system, a full directory system is generally employed, in which the main storage memory is divided into blocks of a given size and information for keeping cache coherency, called a directory, is stored in the main storage memory for each block. In many cases, the block size is equal to the cache line size of a cache memory.
There is a method of providing an area for storing the directory that is separate from the area of the main storage memory. However, when a memory different from the main storage memory is provided in order to store the directory, the area that is used as the main storage memory is preserved, but the cost of the system is increased.
On the other hand, the main storage memory of a system used as a server is generally a DIMM (Dual Inline Memory Module) that stores parity information or a check bit of an error correction code in, for example, the one part of an 8:1 ratio. There is also a method in which that one part of the 8:1 ratio is used as the area for storing a directory instead of parity information. However, for the main storage memory of a system used as a server, that is, the DIMM, it is important to assure the reliability of stored data (that is, that there is no error in the data).
In a system other than the ccNUMA system, that is, in the SMP system or the like, in which nothing other than data needs to be stored in the main storage memory, either SEC-DED ECC (Single Error Correct, Double Error Detect, Error Checking and Correcting) is stored in the one-byte portion, or an S8EC-D8ED ECC (Single-8-bit-group Error Correct, Double-8-bit-group Error Detect, Error Checking and Correcting) system is employed by checking data in units of a plurality of reads. Particularly, in the latter case, even if one chip fails in a commercially available DIMM constituted by, for example, 18 chips, the DIMM can continue to be used by means of data correction.
Accordingly, when employing the ccNUMA system with a full directory in a system used as a server, it is desirable to secure the area for the directory while providing data integrity or data assurance that tolerates a failure of one chip of the DIMM, as in the SMP system.
On the other hand, an architecture such as a main frame has been proposed in which the main storage memory is partitioned into units each called a page, and information called a storage key is stored for each of the pages. By storing key information of, for example, 4 bits in the area of the main storage memory, a memory protect function that allows access only when the same key information is presented may be provided. However, in recent multiprocessor systems, the capacity of the main storage memory has increased. Accordingly, the number of pages has increased in proportion to the capacity of the main storage memory, and the total amount of key information has also increased. It is common to use the DIMM, which is cheap and standardized, for the main storage memory. Accordingly, it is difficult to separately establish an area for storing the key information at a moderate price.
Consequently, a part of the area of the main storage memory established by the DIMM is diverted to serve as a storage area for the key information. As a result, an area which would normally be used as the main storage memory is used for storing key information, so that it is difficult to improve the utilization efficiency of the main storage memory.
In a conventional multi processor system or a shared memory system, a part of the area of the main storage memory is used for storing information for keeping cache coherency, information for data assurance, information for protecting the memory, or the like, so that there is a problem in that it is difficult to execute data assurance while improving utilization efficiency of the area of the main storage memory without increasing the cost of the system.
According to an aspect of the invention, an error judging circuit includes a first EOR circuit tree that generates a check bit of a correction code by polynomial remainder calculation with respect to a polynomial expression of an original code which is to be protected from an error, with respect to data in units of m-bit blocks, by addition in a Galois extension field $GF(2^m)$ in SmEC-DmED using a Reed-Solomon code; a second EOR circuit tree that generates syndromes from $S_n = Y(\alpha^n)$ with respect to a code $C(x)$ in which the check bit is added to the original code, where $Y(x)$ is a polynomial representation of a code which is subject to error detection and in which an error may be mixed; and an error detection circuit unit that detects whether there is a one block error, a two block error, or no error based on whether or not the syndrome equation $S_1^2 = S_0 S_2$ is satisfied.
The object and advantages of the embodiments will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the embodiments, as claimed.
The disclosed error judging circuit and shared memory system use the following, where $P(x)$ is a primitive polynomial of order $m$ over the Galois field $GF(2)$, $\alpha$ is a primitive element of the Galois extension field $GF(2^m)$, and the roots of $P(x) = 0$ are $\alpha^i$ ($i = 0, \ldots, m-1$): a first EOR circuit tree for generating a check bit of a correction code by the polynomial remainder calculation $C(x) = x^2 I(x) \bmod P(x)$ with respect to a polynomial expression $I(x)$ of an original code to be protected from an error, with respect to data in units of m-bit blocks, by addition in $GF(2^m)$ ($m$ is a natural number not less than 8) in SmEC-DmED using a $(k, k-3)$ Reed-Solomon code ($k$ is a natural number not more than $2^m$); a second EOR circuit tree for generating syndromes $S_0$, $S_1$, $S_2$ from $S_n = Y(\alpha^n)$ ($n = 0, 1, 2$) with respect to a code $C(x)$ in which the check bit is added to the original code, where $Y(x)$ is a polynomial representation of a code which is subject to error detection and in which an error may be mixed; and an error detection circuit unit for detecting whether there is a one block error, a two block error, or no error based on whether or not $S_1^2 = S_0 S_2$ is satisfied, and for detecting a position $p$ of a block error from $S_0 \alpha^p = S_1$ in $GF(2^m)$.
By executing an encoding and decoding processing using the Reed-Solomon code as described above to the data stored in and read from the main storage memory, a space area of the main storage memory may be generated. Data assurance may be executed while improving the utilization efficiency of the area of the main storage memory by diverting the generated area for another application and by keeping error correction intensity to the same degree as the conventional degree.
Hereinafter, each embodiment of an error judging circuit and a shared memory system of the invention will be described with reference to the accompanying drawings.
As described below, an encoding circuit includes a first EOR circuit tree for generating a check bit of a correction code by the polynomial remainder calculation $C(x) = x^2 I(x) \bmod P(x)$ with respect to a polynomial expression $I(x)$ of an original code to be protected from an error, with respect to data in units of m-bit blocks, by addition in a Galois extension field $GF(2^m)$ ($m$ is a natural number not less than 8) in SmEC-DmED using a $(k, k-3)$ Reed-Solomon code ($k$ is a natural number not more than $2^m$), where $P(x)$ is a primitive polynomial of order $m$ over the Galois field $GF(2)$, $\alpha$ is a primitive element of $GF(2^m)$, and the roots of $P(x) = 0$ are $\alpha^i$ ($i = 0, \ldots, m-1$). On the other hand, a decoding circuit includes a second EOR circuit tree for generating syndromes $S_0$, $S_1$, $S_2$ from $S_n = Y(\alpha^n)$ ($n = 0, 1, 2$) with respect to a code $C(x)$ in which the check bit is added to the original code, where $Y(x)$ is a polynomial representation of a code which is subject to error detection and in which an error may be mixed, and an error detection circuit unit for detecting whether there is a one block error, a two block error, or no error based on whether or not $S_1^2 = S_0 S_2$ is satisfied and for detecting a position $p$ of a block error from $S_0 \alpha^p = S_1$ in $GF(2^m)$. As described below, an error judging circuit includes the first EOR circuit tree, the second EOR circuit tree, and the error detection circuit unit.
Note that in the embodiment, for the sake of description, k=36. In this case, (k, k−3) Reed-Solomon code is (36, 33) Reed-Solomon code.
Further, when m is a natural number less than 8, the $(k, k-3)$ Reed-Solomon code does not work out, so that it is important that m is a natural number not less than 8. In the embodiment, for the sake of description, m=8. However, it goes without saying that m may be, for example, 16 or the like.
The main storage memory 13 includes a plurality of DIMMs 130. Each DIMM 130 includes 18 memory chips 131-1 to 131-18 as illustrated in
The main storage memory 13 using the DIMMs 130 capable of correcting data by using an ECC has a data width of, for example, 8 bytes+1 byte (72 bits). In this case, assuming that the condition for assuring reliability of data is that the DIMM 130 can tolerate a fault of one chip, it is sufficient to execute S4EC-D4ED. For example, it has been known that S4EC-D4ED may be provided by providing a check bit of 16 bits with respect to data of 16 bytes. Since the ratio of the data and the check bit in this case is 8:1, as illustrated in
In the embodiment, two reads with 2-way interleaving are regarded as one unit, so that with an S8EC-D8ED code even a failure of one chip is treated as a single error of an 8-bit block. Herewith, an area of 8 bits (1 byte) may be secured for storing extra data, and data may be corrected when there is no error in another chip.
In the embodiment, an encoding circuit and a decoding circuit capable of high speed processing are constituted using the (36, 33) Reed-Solomon code in the Galois extension field $GF(2^8)$ as an S8EC-D8ED code. For example, $P(x) = x^8 + x^4 + x^3 + x^2 + 1$ is used as the 8th order primitive polynomial $P(x)$ over the Galois field $GF(2)$. Assuming that the primitive element of the Galois extension field $GF(2^8)$ (also the primitive element of Galois field GF(2)) is $\alpha$, since the primitive polynomial $P(x)$ is an 8th order polynomial, there are 8 roots of $P(x) = 0$, and the roots become $\alpha^i$ ($i = 0, \ldots, 7$).
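As an illustration of the field described above, the following minimal sketch represents an element of the Galois extension field as an 8-bit integer and takes the primitive element to be the value 2 (the polynomial x). The bit-mask encoding 0x11D of P(x) and the names `times_alpha` and `alpha_powers` are assumptions of this sketch and do not appear in the embodiment. It checks that the 255 powers of the primitive element are all distinct and that the 255th power returns to 1.

```python
PRIM_POLY = 0x11D  # assumed bit mask of P(x) = x^8 + x^4 + x^3 + x^2 + 1


def times_alpha(v):
    """Multiply an 8-bit field element by alpha and reduce modulo P(x)."""
    v <<= 1
    if v & 0x100:          # an x^8 term appeared, replace it using P(alpha) = 0
        v ^= PRIM_POLY
    return v


def alpha_powers():
    """Return [alpha^0, ..., alpha^254] together with alpha^255."""
    powers = []
    v = 1
    for _ in range(255):
        powers.append(v)
        v = times_alpha(v)
    return powers, v


POWERS, ALPHA_255 = alpha_powers()
print(hex(POWERS[8]))                # alpha^8 after reduction
print(len(set(POWERS)), ALPHA_255)   # number of distinct powers, alpha^255
```

With this representation, alpha^8 reduces to the bit pattern of x^4 + x^3 + x^2 + 1.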
Encoding Circuit
S8EC-D8ED is a code capable of correcting a 1-byte error and detecting a 2-byte error. A Reed-Solomon code in which the total of data and error correction code is 36 bytes becomes a (36, 33) Reed-Solomon code whose data portion is 33 bytes and whose error correction code is 3 bytes when the distance between codes is 3. The generator polynomial $G(x)$ of the (36, 33) Reed-Solomon code may be expressed by the following expression.
$G(x) = (x + \alpha^0)(x + \alpha^1)(x + \alpha^2)$
In this case, assuming that the polynomial of the original code to be protected from an error (that is, the error protection target) is $I(x)$, the polynomial $C(x)$ of the error correction code is expressed by the polynomial remainder calculation $C(x) = x^2 I(x) \bmod P(x)$, and the protected code $F(x)$ is expressed by $F(x) = x^2 I(x) + C(x)$.
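For reference, the generator polynomial G(x) given above can be expanded numerically. The sketch below assumes that field elements are 8-bit integers reduced by P(x) = x^8 + x^4 + x^3 + x^2 + 1 (mask 0x11D) and that the primitive element is the value 2. The helper names `gf_mul` and `poly_mul` are hypothetical.

```python
PRIM_POLY = 0x11D  # assumed bit mask of P(x) = x^8 + x^4 + x^3 + x^2 + 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8), reduced modulo P(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def poly_mul(p, q):
    """Multiply two polynomials with GF(2^8) coefficients, highest degree first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, pc in enumerate(p):
        for j, qc in enumerate(q):
            out[i + j] ^= gf_mul(pc, qc)
    return out


ALPHA = 0x02
g = [1]
root = 1                          # alpha^0
for _ in range(3):
    g = poly_mul(g, [1, root])    # multiply by (x + alpha^n)
    root = gf_mul(root, ALPHA)

print(g)  # coefficients of x^3, x^2, x^1, x^0
```

Under these assumptions the expansion is x^3 + 7x^2 + 14x + 8, where each coefficient is the integer form of a field element.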
When the encoding processing is implemented as a high-speed hardware circuit, the difficulty lies in the aforementioned polynomial remainder calculation. If the polynomial remainder calculation can be expressed in the form of a generator matrix, as in the existing SEC-DED ECC, the encoding processing becomes easy to implement in hardware.
The data bit string $D = \{d_{263}, d_{262}, \ldots, d_0\}$ of 33 bytes of the original code is expressed by the following polynomial. In the following polynomial, $i$ indicates the byte position (byte 0 to 32) in the block of 33 bytes, $j$ indicates the bit position (bit 0 to 7) in the byte block, and each $d$ is 0 or 1.
Herein, when the data bit string D in which only a single bit is 1 is defined as $D_i$ = {bit string in which only the $(263-i)$-th bit is 1 and the others are 0}, and $I_i(x)$ is the polynomial corresponding to each data bit string $D_i$, the polynomial $I(x)$ of the original code is expressed, by linearity, as the linear sum of the polynomials $I_i(x)$ corresponding to the data bits for which $d_i = 1$. In $D_i$, $I_i(x)$, and $d_i$, $i$ is an integer from 0 to 263. Consequently, the polynomial $C(x)$ of the error correction code is expressed by the linear sum of $C_i(x) = x^2 I_i(x) \bmod P(x)$, that is, by $C(x) = \sum C_i(x)$. Accordingly, the polynomial $C_i(x)$ of the error correction code for each data bit string $D_i$ is obtained in advance, and a generator matrix can be constituted from the polynomials $C_i(x)$.
From the property $P(\alpha) = 0$ of the primitive polynomial $P(x)$, in the embodiment, the degree of $\alpha$ is kept at not more than 7 by repeatedly using $\alpha^8 = \alpha^4 + \alpha^3 + \alpha^2 + 1$. Herewith, the polynomial $C_i(x)$ of the error correction code is expressed as below.
In the polynomial $C_i(x)$ of the error correction code, $k = 0, 1, 2$; in $C_i[n]$ (where $n = 8k + j$ in the aforementioned polynomial), the subscript $i$ is the index of $C$ and $[n]$ is the bit index within $C_i$; each $C_i[n]$ is 0 or 1; and $i$ is 0 to 263. The linear combination of $\{C_i[n]\}$ ($n = 23, \ldots, 0$) becomes the check bit string $\{C_n\}$ ($n = 23, \ldots, 0$).
Since the above described equation is an addition in the Galois extension field $GF(2^8)$, the equation is substantially an EOR (exclusive OR) operation.
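The linearity argument above can be checked with a small sketch. It precomputes the 24 check bits for each of the 264 single-bit data strings (the columns of a generator matrix) and then obtains the check bits of arbitrary data purely by EOR of the selected columns. As an assumption of this sketch, the check symbols are computed in the conventional systematic Reed-Solomon form, that is, as the remainder of x^3 I(x) divided by G(x) = x^3 + 7x^2 + 14x + 8, which may differ from the exact expression used in the embodiment. The bit ordering and the names `check_bits`, `COLUMNS`, and `check_bits_by_xor` are also hypothetical.

```python
PRIM_POLY = 0x11D   # assumed bit mask of P(x) = x^8 + x^4 + x^3 + x^2 + 1
G = [1, 7, 14, 8]   # assumed expansion of (x+1)(x+alpha)(x+alpha^2), highest degree first


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8), reduced modulo P(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def check_bits(data_bits):
    """Return 24 check bits for 264 data bits by direct polynomial division."""
    rem = [0, 0, 0]                               # coefficients of x^2, x^1, x^0
    for i in range(32, -1, -1):                   # byte 32 first
        symbol = (data_bits >> (8 * i)) & 0xFF
        fb = symbol ^ rem[0]
        rem = [rem[1] ^ gf_mul(fb, G[1]),
               rem[2] ^ gf_mul(fb, G[2]),
               gf_mul(fb, G[3])]
    return (rem[0] << 16) | (rem[1] << 8) | rem[2]


# One 24-bit column per data bit, obtained in advance.
COLUMNS = [check_bits(1 << i) for i in range(264)]


def check_bits_by_xor(data_bits):
    """Return the same 24 check bits using only EOR of precomputed columns."""
    acc = 0
    for i, column in enumerate(COLUMNS):
        if (data_bits >> i) & 1:
            acc ^= column
    return acc


sample = int.from_bytes(bytes(range(33)), "big")  # arbitrary 264-bit test data
print(check_bits(sample) == check_bits_by_xor(sample))
```

Because each check bit is then an EOR of a fixed subset of data bits, it maps directly onto an EOR circuit tree.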
Note that, in order to improve the ability to detect data destruction in which all data bits become 0, a protection measure may be taken. For example, some of the $C_n$ are inverted and stored in the DIMM 130. Note that such a protection measure itself is well known, so that the description will be omitted.
As illustrated in
Data and the CheckBit of an error correction code are stored in the main storage memory 13 (DIMM 130) as illustrated with
Decoding Circuit
The EOR circuit tree 31 has the same structure as the EOR tree 21 illustrated in
In the embodiment, three syndromes of 8 bits each are generated for S8EC-D8ED. Formally, this is the same as generating one syndrome of 24 bits. However, to simplify the description of the error detection and correction processing, the syndromes are denoted $S_0$, $S_1$, $S_2$. Assuming that the code read from the main storage memory 13 (DIMM 130) is expressed by a polynomial representation $Y(x)$ of 36 bytes covering both the data and the check bits, the syndromes $S_0$, $S_1$, $S_2$ are expressed, as an abstract expression of the syndrome generator matrix, by $S_n = Y(\alpha^n)$ ($n = 0, 1, 2$). That is, $Y(x)$ is a polynomial representation of the code which is the target of error detection. Herein, the read data string $D'$ is expressed by $D' = \{d'_{263}, \ldots, d'_0, c'_{23}, \ldots, c'_0\}$. Since data may be corrupted or an error may be mixed in while the data is stored in the DIMM 130, a prime mark (′) is attached to the data after reading to distinguish it from the data before storage, and the read code is expressed by the following polynomial representation $Y(x)$.
Herein, similarly to the case in which the aforementioned generator matrix is made, the syndromes $S_0$, $S_1$, $S_2$ are also expressed by a linear combination of the syndromes of the bit strings, formed by the read data and check bits, in which only one bit is 1. Also in this case, similarly to the case of encoding, the generator matrix for obtaining the syndromes $S_0$, $S_1$, $S_2$ is obtained by lowering the degree by using $\alpha^8 = \alpha^4 + \alpha^3 + \alpha^2 + 1$. Although the check bits are involved, the method for obtaining the matrix for the syndromes $S_0$, $S_1$, $S_2$ is similar to the method used when encoding, so that the description will be omitted.
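The syndrome definition Sn = Y(α^n) described above can be illustrated by direct polynomial evaluation. The sketch below builds a 36-symbol code word as a multiple of G(x), which is an assumption made only to obtain a valid code word quickly and is not the systematic layout of the embodiment. It evaluates Y(x) at α^0, α^1, and α^2 by Horner's rule. Field elements are 8-bit integers reduced by the mask 0x11D, the primitive element is taken to be 2, and all function and variable names are hypothetical.

```python
PRIM_POLY = 0x11D   # assumed bit mask of P(x) = x^8 + x^4 + x^3 + x^2 + 1
G = [1, 7, 14, 8]   # assumed expansion of (x+1)(x+alpha)(x+alpha^2), highest degree first


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8), reduced modulo P(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def poly_mul(p, q):
    """Multiply two polynomials with GF(2^8) coefficients, highest degree first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, pc in enumerate(p):
        for j, qc in enumerate(q):
            out[i + j] ^= gf_mul(pc, qc)
    return out


def syndromes(code):
    """Return [S0, S1, S2] with Sn = Y(alpha^n); code[0] is the highest-degree symbol."""
    result = []
    point = 1                             # alpha^0
    for _ in range(3):
        acc = 0
        for symbol in code:               # Horner's rule
            acc = gf_mul(acc, point) ^ symbol
        result.append(acc)
        point = gf_mul(point, 2)          # next power of alpha
    return result


message = list(range(1, 34))              # 33 arbitrary symbols (hypothetical data)
codeword = poly_mul(message, G)           # 36 symbols, divisible by G(x)
corrupted = list(codeword)
corrupted[5] ^= 0x5A                      # damage one 8-bit block

print(syndromes(codeword))                # all zero for an undamaged code word
print(syndromes(corrupted))               # S0 equals the error pattern 0x5A
```

Since S0 is the plain EOR of all 36 symbols, a single damaged block leaves its error pattern in S0.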
In the following matrix representation, $S_0 = \sum S_{0i}\alpha^i$, $S_1 = \sum S_{1i}\alpha^i$, $S_2 = \sum S_{2i}\alpha^i$, and each $S_{0i}$, $S_{1i}$, $S_{2i}$ has the value 0 or 1.
When the values of all three syndromes $S_0$, $S_1$, $S_2$ obtained as described above are zero, there is no error. When at least one of the syndromes is not zero, there is some sort of error. When at least one syndrome is not zero, if the equation $S_1^2 = S_0 S_2$ is satisfied in the Galois extension field $GF(2^8)$, there is a one block error that can be restored. As a property of the Reed-Solomon code, when the equation is not satisfied, there is a two block error. The case in which the equation is satisfied and $S_0 = 0$ would require an error in three or more blocks and is assumed never to happen. When the equation is not satisfied, two or more blocks are in error, and since this exceeds the performance of S8EC, it is impossible to correct (restore) the data. However, it is possible to detect the error by the performance of D8ED. When correction of data is impossible in this manner, it is important for the system to report the error or to execute appropriate processing such as error mark processing. Further, it is also important to take into account that $\alpha^{255} = 1$ here.
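The judgment described above can be sketched as a small function. For brevity the sketch adds errors to the all-zero code word, which is a valid code word of any linear code, so the syndromes depend only on the error pattern. The additional requirement that S0 and S1 are both nonzero is an assumption of this sketch, reflecting that a single block error always makes both nonzero. The names `error_syndromes` and `classify` are hypothetical.

```python
PRIM_POLY = 0x11D  # assumed bit mask of P(x) = x^8 + x^4 + x^3 + x^2 + 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8), reduced modulo P(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def gf_pow(a, n):
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def error_syndromes(errors):
    """Syndromes of the all-zero code word with errors {position: pattern} added."""
    s = [0, 0, 0]
    for pos, pattern in errors.items():
        for n in range(3):
            s[n] ^= gf_mul(pattern, gf_pow(2, n * pos))   # pattern * alpha^(n*pos)
    return s


def classify(s0, s1, s2):
    """Judge the error state from the three syndromes."""
    if s0 == 0 and s1 == 0 and s2 == 0:
        return "no error"
    if s0 != 0 and s1 != 0 and gf_mul(s1, s1) == gf_mul(s0, s2):
        return "one block error"
    return "two or more block errors"


print(classify(*error_syndromes({})))
print(classify(*error_syndromes({7: 0x5A})))
print(classify(*error_syndromes({7: 0x5A, 20: 0x01})))
```

For two block errors with patterns e1, e2 at positions p1, p2, the quantity S1^2 + S0 S2 equals e1 e2 (α^p1 + α^p2)^2, which is never zero, so the equation cannot hold by accident.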
As described above, since the syndromes $S_0$, $S_1$, $S_2$ are expressed as sums of terms in $\alpha$ of degree not more than 7, namely $S_0 = \sum S_{0i}\alpha^i$, $S_1 = \sum S_{1i}\alpha^i$, $S_2 = \sum S_{2i}\alpha^i$, it is not easy to evaluate the equation $S_1^2 = S_0 S_2$ in this form. Consequently, the logarithmic conversion illustrated in
In this manner, based on syndrome0[7:0], syndrome1[7:0], Syndrome2[7:0] output from the EOR circuit tree 31, the three decode/encode circuits 41 output the syndromes Alog_Syndrome0[7:0], Alog_Syndrome1[7:0], Alog_Syndrome2[7:0] after logarithmic conversion, and SingleBlockErrorData[7:0] indicating one block error is output.
The adder circuit 42 includes an accumulator 421, an AND circuit 422, an OR circuit 423, and an increment circuit 424. A carry bit of the accumulator 421 is input in the OR circuit 423. The carry bit of the increment circuit 424 is ignored. The adder circuit 42 executes remainder calculation including mod 255 operation in order to calculate judgment of a block error and the position of the block error.
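The mod 255 addition mentioned above can be sketched as an 8-bit addition with end-around carry: the carry out of the 8-bit sum is added back into the low byte, and a carry out of that increment is ignored. This is a generic formulation and is not claimed to reproduce the exact gate structure of the adder circuit 42. The name `add_mod255` is hypothetical. Note that the result may be 255, which denotes the same exponent as 0.

```python
def add_mod255(a, b):
    """Add two 8-bit exponents modulo 255 using end-around carry.

    The return value is in the range 0..255, where 255 stands for 0.
    """
    total = a + b
    carry = total >> 8             # carry out of the 8-bit addition
    low = total & 0xFF
    return (low + carry) & 0xFF    # a carry out of this increment is ignored


print(add_mod255(200, 100))        # 300 mod 255
print(add_mod255(100, 155))        # the sum is exactly 255, an alias of 0
```

The first call gives 45 and the second gives 255, so a circuit comparing exponents has to treat 255 and 0 as equal.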
In
Further, in
The error judging circuit 43 includes an accumulator 431, a matching detection circuit 432, an inverter circuit 433, an AND circuit 434, and a NOR circuit 435. Syndrome0[7:0], Syndrome1[7:0], Syndrome2[7:0] output from the EOR tree 31 are input in the accumulator 431. Bit No_Error indicating that there is no error is output from the accumulator 431. Alog_Syndrome1[6:7,0] output from a rearrangement circuit 45 that executes rearrangement of bit of the syndrome Alog_Syndrome1[7:0] output from the second decode/encode circuit 41 from the top, and Alog_Syn2_PL_Syn0_mod [7:0] of the adder circuit 42 at the lower side are input in the matching detection circuit 432 as illustrated in
Since the correction data CorrectData[263:0] output from the error correction circuit unit 33 is correct data that has undergone the error correction processing (that is, data after correction), the correction data may be used when there is no two block error. Further, when the correction check bit CorrectCB[23:0] output from the error correction circuit unit 33 is not used after the processing, the circuit portion of the error correction circuit unit 33 that generates the correction check bit CorrectCB[23:0] may be omitted. Further, when the value of SingleBlockErrorData[7:0] indicating the position p of a one block error is not within the range of 0 to 35, that is, is not less than 36, it is judged to be a multi-block error and is handled in the same way as a two block error.
If the equation $S_1^2 = S_0 S_2$ is satisfied, the syndrome $S_0$ indicates the correction data, and the byte position $p$ which is to be corrected is obtained from $S_0 \alpha^p = S_1$ in the Galois extension field $GF(2^8)$. Accordingly, an error is corrected by applying EOR with the syndrome $S_0$ to the 8-bit data at the block position $p$ of the read data, and the performance of S8EC may be assured.
Note that since $\alpha^{255} = 1$ ($= \alpha^0$) is satisfied, when exponents are added in a circuit that detects whether the equation $S_1^2 = S_0 S_2$ holds in the Galois extension field $GF(2^8)$ or that detects the error position, it goes without saying that it is important to take into account that the result of the exponent addition may be not only 0 but also 255.
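The correction step described above can be sketched as follows. The sketch stores the symbol of position p as the coefficient of x^p, obtains p from the difference of the logarithms of S1 and S0, and restores the block by EOR with S0. The code word used for the demonstration is built as a multiple of G(x), which is an assumption made only to obtain a valid code word. The table layout and all names are hypothetical.

```python
PRIM_POLY = 0x11D            # assumed bit mask of P(x) = x^8 + x^4 + x^3 + x^2 + 1
EXP = [0] * 510              # EXP[i] = alpha^i, doubled in length to skip a modulo
LOG = [0] * 256              # LOG[alpha^i] = i (LOG[0] is unused)
_v = 1
for _i in range(255):
    EXP[_i] = EXP[_i + 255] = _v
    LOG[_v] = _i
    _v <<= 1
    if _v & 0x100:
        _v ^= PRIM_POLY


def gf_mul(a, b):
    """Multiply two field elements through the logarithm tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def poly_mul(p, q):
    """Multiply polynomials with GF(2^8) coefficients, lowest degree first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, pc in enumerate(p):
        for j, qc in enumerate(q):
            out[i + j] ^= gf_mul(pc, qc)
    return out


def syndromes(code):
    """Return [S0, S1, S2] where code[p] is the coefficient of x^p."""
    return [
        _xor_all(gf_mul(sym, EXP[(n * p) % 255]) for p, sym in enumerate(code))
        for n in range(3)
    ]


def _xor_all(values):
    acc = 0
    for v in values:
        acc ^= v
    return acc


def correct_single_block(code):
    """Return (corrected code, position), or raise ValueError if uncorrectable."""
    s0, s1, s2 = syndromes(code)
    if s0 == 0 and s1 == 0 and s2 == 0:
        return list(code), None
    if s0 == 0 or s1 == 0 or gf_mul(s1, s1) != gf_mul(s0, s2):
        raise ValueError("more than one block is in error")
    p = (LOG[s1] - LOG[s0]) % 255          # solves S0 * alpha^p = S1
    if p >= len(code):
        raise ValueError("error position out of range")
    fixed = list(code)
    fixed[p] ^= s0                         # S0 is the error pattern
    return fixed, p


G_LOW = [8, 14, 7, 1]                      # assumed G(x), lowest degree first
codeword = poly_mul(list(range(1, 34)), G_LOW)
received = list(codeword)
received[17] ^= 0xA7
fixed, position = correct_single_block(received)
print(position, fixed == codeword)
```

The range check on p corresponds to treating a position of 36 or more as a multi-block error.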
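The point about 0 and 255 can be made concrete with a sketch that evaluates the equation using exponents only. Doubling the exponent of S1 modulo 255 is done here by rotating the 8-bit exponent left by one bit, and the exponents of S0 and S2 are added with end-around carry. Treating the rotation as the doubling step is an assumption of this sketch about how such a circuit may be arranged, and the names are hypothetical. The comparison reduces both sides modulo 255 so that 255 and 0 are treated as the same exponent.

```python
PRIM_POLY = 0x11D            # assumed bit mask of P(x) = x^8 + x^4 + x^3 + x^2 + 1
EXP = [0] * 510              # EXP[i] = alpha^i
LOG = [0] * 256              # LOG[alpha^i] = i (LOG[0] is unused)
_v = 1
for _i in range(255):
    EXP[_i] = EXP[_i + 255] = _v
    LOG[_v] = _i
    _v <<= 1
    if _v & 0x100:
        _v ^= PRIM_POLY


def gf_mul(a, b):
    """Multiply two field elements through the logarithm tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def rotl8(v):
    """Rotate an 8-bit value left by one bit, which doubles it modulo 255."""
    return ((v << 1) | (v >> 7)) & 0xFF


def add_end_around(a, b):
    """8-bit addition with end-around carry; the result 255 stands for 0."""
    total = a + b
    return ((total & 0xFF) + (total >> 8)) & 0xFF


def equation_holds(s0, s1, s2):
    """Check S1^2 == S0*S2 for nonzero syndromes using exponents only."""
    lhs = rotl8(LOG[s1])
    rhs = add_end_around(LOG[s0], LOG[s2])
    return lhs % 255 == rhs % 255          # 255 and 0 both denote alpha^0


s0, s1, s2 = EXP[1], EXP[0], EXP[254]      # S1 = 1 and S0 * S2 = alpha^255 = 1
print(LOG[s0] + LOG[s2], rotl8(LOG[s1]), equation_holds(s0, s1, s2))
```

In this example the exponent sum on the right side is 255 while the left side is 0, yet the equation holds, which is exactly the case the note above warns about.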
Next, the circuit scale and latency of the encoding circuit and the decoding circuit of the aforementioned embodiment will be described.
As is understood from the size of the generator matrix, the circuit scale of the encoding circuit is 24 EOR circuits of about 140 bits. The latency of the encoding circuit is about 9 steps of EOR circuits of 2 bits since an EOR circuit tree is used.
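The latency estimate above follows from the depth of a balanced tree of 2-input EOR gates. The sketch below reduces a list of bits level by level and counts the levels. The function name `xor_tree` is hypothetical, and the input counts used in the example (140 and 264) are taken from the figures mentioned in the text as illustrative values only.

```python
def xor_tree(bits):
    """Reduce bits with 2-input EORs arranged as a balanced tree.

    Returns (parity, number of gate levels).
    """
    level = list(bits)
    depth = 0
    while len(level) > 1:
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                 # an odd input passes through unchanged
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


print(xor_tree([1] * 140))   # a check bit that depends on about 140 inputs
print(xor_tree([1] * 264))   # worst case: a check bit that depends on all data bits
```

A tree over 140 inputs needs 8 levels and one over all 264 data bits needs 9, which is in line with the estimate of about 9 steps.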
In the decoding circuit, the scale of the circuit for generating syndromes (syndrome S0 is slightly smaller than the other syndromes because its matrix is sparse) is 24 EOR circuits of about 140 bits. The latency of the circuit for generating syndromes is about 9 steps of EOR circuits of 2 bits since an EOR circuit tree is used similarly to the encoding circuit. The error detection circuit unit and the error correction circuit unit require at least three circuits for logarithmic conversion from 8 bits to 8 bits in order to execute the error correction processing from a generated syndrome, one 8-bit accumulator and matching detection circuit for error judgment, and one circuit for correcting a single block (an AND-EOR circuit corresponding to 264 bits). The 8-bit decode/encode circuit that functions as a logarithmic conversion circuit is constituted by an AND-OR circuit. Accordingly, it should be noted that the latency of the error detection circuit unit and the error correction circuit unit is slightly longer than that of the circuit that generates a syndrome.
The check bit of ECC that is simplified as in the aforementioned embodiment makes it possible to correct data even when one chip among the DIMM 130 is broken down. In the case of aforementioned example, 33 bytes among 36 bytes becomes a data area (32 bytes is original data, one byte is a space area that is newly ensured for extra data) as illustrated in
In this manner, in the case of the aforementioned embodiment, a space area capable of storing extra data (or information) of one byte may be secured in the main storage memory 13 by reducing the number of check bits required while maintaining the same error correction and detection performance as in the past with respect to 32-byte data. By using the space area of one byte, directory information of ccNUMA, key information of the main frame, and the like may be stored. That is, a part of the area of the main storage memory 13 may be used for storing information for keeping cache coherency, information for assuring data, information for protecting a memory, and the like while keeping the same error correction and detection performance as in the past. Accordingly, it becomes possible to improve the utilization efficiency of the area of the main storage memory 13 and to assure data without increasing the cost of the system.
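The byte budget behind this result can be written out as simple arithmetic. The sketch assumes that one unit consists of two reads with 2-way interleaving of a 72-bit (9-byte) wide memory, which is one reading of the description of the embodiment, and compares the conventional allocation of 16 check bits per 16 data bytes with the 3 check bytes of the (36, 33) Reed-Solomon code. All variable names are hypothetical.

```python
BYTES_PER_READ = 9        # 8 bytes + 1 byte, that is, 72 bits
READS_PER_UNIT = 2        # assumed: two reads form one unit
INTERLEAVE = 2            # assumed: 2-way interleaving

unit_bytes = BYTES_PER_READ * READS_PER_UNIT * INTERLEAVE   # size of one code word
data_bytes = 32                                             # original data per unit

# Conventional scheme: 16 check bits (2 bytes) for every 16 data bytes.
conventional_check_bytes = (data_bytes // 16) * 2
# (36, 33) Reed-Solomon code: 3 check bytes per unit.
rs_check_bytes = 3

spare_bytes = unit_bytes - data_bytes - rs_check_bytes
print(unit_bytes, conventional_check_bytes, rs_check_bytes, spare_bytes)
```

Under these assumptions the unit is 36 bytes, the conventional scheme uses 4 of them for check bits, and the Reed-Solomon scheme uses 3, leaving the one spare byte described above.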
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a depicting of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2008-294828 | Nov 2008 | JP | national |
Number | Date | Country | |
---|---|---|---|
20100125771 A1 | May 2010 | US |