This application claims priority to Chinese Patent Application No. 202310068186.3 filed on Jan. 16, 2023 and Chinese Patent Application No. 202310070527.0 filed on Jan. 16, 2023 before the China National Intellectual Property Administration (CNIPA), the entire disclosures of which are incorporated herein by reference in their entirety.
Embodiments of the present disclosure relate to the field of computer technology, particularly to the field of lossless data compression, and more particularly to erasing-based lossless compression and decompression methods for floating-point data.
The advance of sensing devices and the Internet of Things has brought about an explosion of time series data. A significant portion of time series data consists of floating-point values produced at an unprecedentedly high rate in a streaming fashion. If these huge volumes of floating-point time series data (hereinafter “time series” or “time series data”) are transmitted and stored in their original format, they take up a lot of network bandwidth and storage space, which not only causes expensive overhead, but also reduces system efficiency and further affects the usability of some critical applications. Therefore, when processing or storing floating-point data, it is necessary to first compress the data according to a certain algorithm while meeting certain accuracy requirements. The compressed floating-point data occupy less storage space and fewer computing and transmission resources.
Normally, there are two categories of compression methods specifically for floating-point time series data, i.e., lossy compression algorithms and lossless compression algorithms. The former would lose some information, and thus it is not suitable for scientific calculation, data management or other critical scenarios, in which any error could result in disastrous consequences. To this end, lossless floating-point time series compression has attracted extensive interest for decades. One representative lossless algorithm is based on the XOR operation.
As shown in
Gorilla (see Pelkonen, T., Franklin, S., Teller, J., Cavallaro, P., Huang, Q., Meza, J., Veeraraghavan, K.: Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment 8(12), 1816-1827 (2015)) and Chimp (see Liakos, P., Papakonstantinopoulou, K., Kotidis, Y.: Chimp: efficient lossless floating point compression for time series databases. Proceedings of the VLDB Endowment 15(11), 3058-3070 (2022)) are two state-of-the-art XOR-based lossless floating-point compression methods. Gorilla assumes that the XORed result of two consecutive floating-point values is likely to have both many leading zeros and trailing zeros. However, the XORed result actually has very few trailing zeros in most cases. As shown in
However, increasing the number of trailing zeros of the XORed results plays a significant role in improving the compression ratio for time series.
Embodiments of the present disclosure propose an erasing-based lossless compression method for floating-point values, an erasing-based lossless decompression method for floating-point values, an electronic device, and a non-transitory computer readable storage medium.
In a first aspect, some embodiments of the present disclosure provide an erasing-based lossless compression method for floating-point values. The method includes: acquiring a floating-point value, and calculating a decimal place count of the floating-point value; transforming the floating-point value into a binary format, where the floating-point value in the binary format is composed of a digit on a sign bit, digits on exponent bits, and digits on mantissa bits; determining, in the mantissa bits, a reference mantissa bit based on the decimal place count and the digits on the exponent bits; performing erasing operation on bits following the reference mantissa bit by setting corresponding digits on the bits following the reference mantissa bit to be zero, to obtain a value in the binary format, and using the value in the binary format obtained by the erasing operation as a mantissa prefix number of the floating-point value; inputting the mantissa prefix number of the floating-point value into an eXclusive OR (XOR) based compressor, to obtain an XORed result, and storing the XORed result.
In a second aspect, some embodiments of the present disclosure provide an erasing-based lossless decompression method for floating-point values. The method includes: acquiring an XORed result and a modified decimal significand count of a floating-point value, where the XORed result is obtained during compression of the floating-point value by performing XOR operation on a mantissa prefix number of the floating-point value and a mantissa prefix number of a previous floating-point value; performing XOR operation on the XORed result and the mantissa prefix number of the previous floating-point value, to obtain a mantissa prefix number of the floating-point value; calculating a decimal place count of the floating-point value based on the modified decimal significand count of a floating-point value; and recovering the floating-point value based on the mantissa prefix number of the floating-point value and the decimal place count of the floating-point value.
In a third aspect, some embodiments of the present disclosure provide an electronic device. The electronic device includes at least one processor; and a memory communicatively connected to the at least one processor; where, the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to perform the method described in any one of the embodiments described in the first and second aspects.
In a fourth aspect, some embodiments of the present disclosure provide a non-transitory computer readable storage medium, storing computer instructions thereon, where the computer instructions are used to cause the computer to perform the method described in any one of the embodiments described in the first and second aspects.
After reading detailed descriptions of non-limiting embodiments given with reference to the following accompanying drawings, other features, objectives and advantages of the present disclosure will be more apparent:
Embodiments of the present disclosure are further described below in detail in combination with the accompanying drawings. It may be appreciated that the specific embodiments described herein are merely used for explaining the relevant disclosure, rather than limiting the disclosure. In addition, it should be noted that, for ease of description, only the parts related to the relevant disclosure are shown in the accompanying drawings.
It should be noted that, in the specification, the expressions such as “first,” “second” and “third” are only used to distinguish one feature from another, rather than represent any limitations to the features. It should be further understood that the terms “comprise,” “comprising,” “having,” “include” and/or “including,” when used in the specification, specify the presence of stated features, elements and/or components, but do not exclude the presence or addition of one or more other features, elements, components and/or combinations thereof. In addition, expressions such as “at least one of,” when preceding a list of listed features, modify the entire list of features rather than an individual element in the list. Further, the use of “may,” when describing the implementations of the present disclosure, relates to “one or more implementations of the present disclosure”. Also, the term “exemplary” is intended to refer to an example or illustration.
It should be noted that embodiments in the present disclosure and the features in the embodiments may be combined with each other on a non-conflict basis. Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings.
S1: acquiring a floating-point value, and calculating a decimal place count of the floating-point value.
In the embodiment, the floating point value may be a single-precision floating-point value or a double-precision floating point value. The decimal place count of the floating-point value is the count of decimal place(s) in the floating-point value in the decimal format. For example, for a floating-point value 3.17, the decimal place count thereof is 2. For another example, for a floating-point value −0.0314, the decimal place count thereof is 4. For another example, for a floating-point value 314.0, the decimal place count thereof is 1.
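As a non-limiting illustration, the decimal place count of a double may be obtained from its shortest round-trip decimal string. The sketch below assumes a finite input and relies on Python's `repr`; the helper name `decimal_place_count` is hypothetical and is not part of the disclosure.

```python
from decimal import Decimal

def decimal_place_count(v: float) -> int:
    """Count the digits after the decimal point in the shortest decimal form of v."""
    # repr() yields the shortest decimal string that round-trips to the same double.
    exponent = Decimal(repr(v)).as_tuple().exponent
    # Assumption: a value printed without a fractional part (e.g. 1e+22) is
    # treated as "x.0", so its decimal place count is 1.
    return max(1, -exponent)

for value in (3.17, -0.0314, 314.0):
    print(value, decimal_place_count(value))
```

The three values printed are the examples given above for 3.17, −0.0314 and 314.0.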
S2: transforming the floating-point value into a binary format, wherein the floating-point value in the binary format is composed of a digit on a sign bit, digits on exponent bits, and digits on mantissa bits.
In the embodiment, the floating-point value is transformed from the decimal format into the binary format. The floating-point value in the binary format may be a double floating-point value occupying 64 bits, which include 1 sign bit, 11 exponent bits, and 52 mantissa bits, just as illustrated in
S3: determining, in the mantissa bits, a reference mantissa bit based on the decimal place count and the digits on the exponent bits.
In the embodiment, a reference mantissa bit (the mantissa bits after this reference mantissa bit will be erased) is determined in the mantissa bits of the floating-point value, based on the decimal place count and the digits on the exponent bits.
In an alternative implementation of the embodiment, the floating-point value in the binary format may be a floating-point value occupying 64 bits, and then the reference mantissa bit may be determined by:
$g(\alpha)=\lceil\alpha\times\log_2 10\rceil+e-1023$, with $e=(e_1e_2\ldots e_{11})_2=\sum_{i=1}^{11}e_i\times 2^{11-i}$,
where α denotes the decimal place count of the floating-point value, g(α) denotes the place of the reference mantissa bit in the mantissa bits of the floating-point value, and $e_i$ denotes a digit on the i-th exponent bit in the exponent bits of the floating-point value. The operator $\lceil x\rceil$ means rounding x up. That is, the digit on the reference mantissa bit is $m_{g(\alpha)}$, the digits $\langle m_{g(\alpha)+1},\ldots,m_{52}\rangle$ on the mantissa bits after the reference mantissa bit g(α) are set to zero, and thus the mantissa bits after the reference mantissa bit g(α) are erased.
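The determination of the reference mantissa bit for a 64-bit value may be sketched as follows. This is a minimal illustration, assuming the formula g(α)=⌈α×log2 10⌉+e−1023 used in the worked example of this disclosure; the function names `double_fields` and `reference_mantissa_bit` are hypothetical.

```python
import math
import struct

def double_fields(v: float) -> tuple[int, int, int]:
    """Split a double into (sign bit, biased exponent e, 52-bit mantissa)."""
    bits = struct.unpack(">Q", struct.pack(">d", v))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def reference_mantissa_bit(v: float, alpha: int) -> int:
    """g(alpha) = ceil(alpha * log2(10)) + e - 1023 for a 64-bit double."""
    _, e, _ = double_fields(v)
    return math.ceil(alpha * math.log2(10)) + e - 1023

sign, e, mantissa = double_fields(3.17)
print(sign, e, format(mantissa, "052b"))
print(reference_mantissa_bit(3.17, 2))
```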
In an alternative implementation of the embodiment, the floating-point value in the binary format may be a floating-point value occupying 32 bits, and then the reference mantissa bit is determined by:
where α denotes the decimal place count of the floating-point value, g(α) denotes the place of the reference mantissa bit in the mantissa bits of the floating-point value, and $e_i$ denotes a digit on the i-th exponent bit in the exponent bits of the floating-point value. Therefore, the digit on the reference mantissa bit is $m_{g(\alpha)}$, the digits $\langle m_{g(\alpha)+1},\ldots,m_{23}\rangle$ on the mantissa bits after the reference mantissa bit g(α) are set to zero, and thus the mantissa bits after the reference mantissa bit g(α) are erased.
S4: performing erasing operation on bits following the reference mantissa bit by setting corresponding digits on the bits following the reference mantissa bit to be zero, to obtain a value in the binary format, and using the value in the binary format obtained by the erasing operation as a mantissa prefix number of the floating-point value.
In the embodiment, the bits following the reference mantissa bit g(α) are erased by setting corresponding digits on the bits following the reference mantissa bit g(α) to zero. A value in the binary format is obtained by the erasing, and this value is used as the mantissa prefix number of the floating-point value. For example, given the floating-point value 3.17, the decimal place count thereof is α=2, and the floating-point value 3.17 is transformed into the binary format, i.e., “0 10000000000 1001010111000010100011110101110000101000111101011100”. It may then be calculated that $e=(e_1e_2\ldots e_{11})_2=\sum_{i=1}^{11}e_i\times 2^{11-i}=1\times 2^{10}=1024$ and $g(\alpha)=\lceil\alpha\times\log_2 10\rceil+e-1023=8$, which indicates that the 8th mantissa bit is determined as the reference mantissa bit. The mantissa bits after the 8th mantissa bit are then erased from the binary format of the floating-point value 3.17 to obtain the value “0 10000000000 1001010100000000000000000000000000000000000000000000”, which may be used as the mantissa prefix number of the floating-point value 3.17.
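The erasing operation of the example above may be sketched as follows. This is an illustration only, assuming a positive normal double; the helper names and the clamping of g(α) to the range 0 to 52 are assumptions of this sketch rather than features recited in the disclosure.

```python
import math
import struct

MANTISSA_BITS = 52

def to_bits(v: float) -> int:
    return struct.unpack(">Q", struct.pack(">d", v))[0]

def from_bits(bits: int) -> float:
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def mantissa_prefix(v: float, alpha: int) -> float:
    """Zero the mantissa bits after the reference mantissa bit g(alpha)."""
    bits = to_bits(v)
    e = (bits >> MANTISSA_BITS) & 0x7FF
    g = math.ceil(alpha * math.log2(10)) + e - 1023
    g = max(0, min(MANTISSA_BITS, g))      # assumption: clamp to a valid position
    erased = MANTISSA_BITS - g             # number of low mantissa bits to clear
    return from_bits((bits >> erased) << erased)

v_prime = mantissa_prefix(3.17, 2)
print(v_prime, format(to_bits(v_prime), "064b"))
```

The sign bit and the exponent bits are untouched because at most the 52 mantissa bits are shifted out and back in.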
S5: inputting the mantissa prefix number of the floating-point value into an eXclusive OR (XOR) based compressor, to obtain an XORed result, and storing the XORed result.
In the embodiment, the mantissa prefix number of the floating-point value is input into an eXclusive OR (XOR) based compressor, to obtain an XORed result. In an alternative implementation, the mantissa prefix number of the floating-point value may be input into the XOR-based compressor, which performs an XOR operation on the mantissa prefix number of the floating-point value and a mantissa prefix number of a previous floating-point value, to obtain the XORed result. For example, as illustrated in
The compression method transforms a floating-point value into another one with more trailing zeros under a guaranteed bound, so it can substantially improve the compression ratio of most XOR-based compression methods.
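The effect of erasing on the XORed result can be illustrated with the values 3.17 and 3.25 used in this disclosure. The sketch below only compares trailing-zero counts and is not an XOR-based compressor itself; the helper names are hypothetical.

```python
import struct

def to_bits(v: float) -> int:
    return struct.unpack(">Q", struct.pack(">d", v))[0]

def trailing_zeros(x: int, width: int = 64) -> int:
    """Number of zero bits below the lowest set bit of x."""
    return width if x == 0 else (x & -x).bit_length() - 1

previous = 3.25                                    # previous value in the series
raw_xor = to_bits(3.17) ^ to_bits(previous)        # XOR without erasing
erased_xor = to_bits(3.1640625) ^ to_bits(previous)  # XOR with the mantissa prefix number

print(trailing_zeros(raw_xor), trailing_zeros(erased_xor))
```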
In an alternative implementation, the decimal place count of the floating-point value may be also stored, so that the XORed result and the decimal place count of the floating-point value form the lossless compressed data for recovering the floating-point value.
In an alternative implementation, a modified decimal significand count of the floating-point value may be calculated and then also be stored, so that the XORed result and the modified decimal significand count of the floating-point value form the lossless compressed data for recovering the floating-point value. The modified decimal significand count of the floating-point value may be used for later recovering the decimal place count of the floating-point value.
In an alternative implementation, the modified decimal significand count of the floating-point value may be calculated by:
$\beta^*=0$ if $v=10^{-i}$ with $i>0$, and $\beta^*=\beta$ otherwise,
where v denotes the floating-point value, β* denotes the modified decimal significand count of the floating-point value, and β denotes a decimal significand count of the floating-point value. Decimal significand count of a floating-point value refers to the count of significand place(s) in decimal format, e.g., the decimal significand count of 3.17 is 3, the decimal significand count of −0.0314 is 3, and the decimal significand count of 3.140 is 4. For example, for the floating point value 3.17, the modified decimal significand count thereof is equal to the decimal significand count thereof, which is 3. Since the decimal significand count β of a double value would not be greater than 17, it requires much fewer bits to store β.
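A minimal sketch of the modified decimal significand count is given below. It assumes that β* is zero for a value of the form 10^(−i) with i>0 and equals β otherwise, as implied by the decompression description, and it takes the decimal string of the value as input; the function names are hypothetical.

```python
from decimal import Decimal

def decimal_significand_count(text: str) -> int:
    """Count the significant decimal digits of a value given as a decimal string."""
    digits = Decimal(text).as_tuple().digits
    # Decimal does not store leading zeros; zero itself has no significant digits.
    return len(digits) if any(digits) else 0

def modified_significand_count(text: str) -> int:
    """Return 0 for 10**(-i) with i > 0, and the significand count otherwise."""
    _, digits, exponent = Decimal(text).as_tuple()
    if digits == (1,) and exponent < 0:
        return 0
    return decimal_significand_count(text)

for text in ("3.17", "-0.0314", "3.140", "0.01"):
    print(text, decimal_significand_count(text), modified_significand_count(text))
```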
According to the embodiment of the present disclosure, a reference mantissa bit in the mantissa bits is determined based on the decimal place count and the digits on the exponent bits, then bits following the reference mantissa bit are erased (i.e., corresponding digits on the bits following the reference mantissa bit are set to zero), and the erased floating-point value is input into the XOR-based compressor for the XOR operation. On the one hand, by erasing the mantissa bits following the reference mantissa bit, many trailing mantissa bits are set to zero, so that an XORed result having many trailing zeros is obtained when the XOR operation is performed on the erased floating-point value and its neighboring floating-point value. On the other hand, the reference mantissa bit (the mantissa bits after which are erased) is determined based on the decimal place count and the digits on the exponent bits, only the mantissa bits following the reference mantissa bit are erased, and none of the sign bit and the exponent bits is erased. Therefore, while ensuring that the XORed result has many trailing zeros, the compression-decompression precision is preserved, and the effect of compressing floating-point values is improved. A new idea for compressing floating-point values without any precision loss is thus provided.
In addition, embodiments of the present disclosure use the XORed result and the decimal place count of the floating-point value to form the lossless compressed data for recovering the floating-point value, or use the XORed result and the modified decimal significand count of the floating-point value to form the lossless compressed data for recovering the floating-point value. During the later decompression, the XORed result is decompressed to obtain the mantissa prefix number, and the original floating-point value is recovered based on the mantissa prefix number and the decimal place count (the decimal place count is obtained from storage or is recovered from the stored modified decimal significand count), so that the compression ratio and the decompression efficiency are further improved.
Step 1: acquiring the stored XORed result and the modified decimal significand count of a floating-point value; and Step 2: performing XOR operation on the XORed result and the mantissa prefix number of the previous floating-point value, to obtain the mantissa prefix number of the floating-point value.
In the embodiment, the stored XORed result and the modified decimal significand count of the floating-point value are obtained from a storage where they are stored. Then an XOR operation is performed on the XORed result and the mantissa prefix number of the previous floating-point value. For example, the stored XORed result Δ′ is “0 00000000000 0011010100000000000000000000000000000000000000000000”, and the mantissa prefix number of the previous floating-point value 3.25 is “0 10000000000 1010000000000000000000000000000000000000000000000000”. The XOR operation is then performed on “0 00000000000 0011010100000000000000000000000000000000000000000000” and “0 10000000000 1010000000000000000000000000000000000000000000000000”, to obtain “0 10000000000 1001010100000000000000000000000000000000000000000000”, which is the mantissa prefix number of the original floating-point value 3.17.
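The XOR step of the decompression example above may be sketched as follows, assuming that the XORed result is available as a 64-bit integer; the function name `recover_prefix` is hypothetical.

```python
import struct

def to_bits(v: float) -> int:
    return struct.unpack(">Q", struct.pack(">d", v))[0]

def from_bits(bits: int) -> float:
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def recover_prefix(xored: int, previous_prefix: float) -> float:
    """XOR the stored result with the previous mantissa prefix number."""
    return from_bits(xored ^ to_bits(previous_prefix))

# Stored XORed result from the example: mantissa bits 00110101 followed by zeros.
xored_result = 0b00110101 << 44
print(recover_prefix(xored_result, 3.25))
```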
Step 3: obtaining the decimal place count of the floating-point value.
In the embodiment, the decimal place count of the floating-point value is obtained, for recovering the original floating-point value.
In an alternative implementation of the embodiment, the decimal place count of the floating-point value may be obtained directly when the decimal place count was stored during the compression. Alternatively, the decimal place count of the floating-point value may be recovered from the modified decimal significand count which was stored during the compression.
In an alternative implementation of the embodiment, recovering the decimal place count based on the modified decimal significand count may comprise: in response to determining that β* equals zero, determining that $v=10^{-i}$ with $-i=SP(v')+1$, i.e., $v=10^{SP(v')+1}$; in response to determining that β* does not equal zero, assigning β=β* and recovering the decimal place count of the floating-point value by:
$\alpha=\beta-(SP(v')+1)$,
where α denotes the decimal place count of the floating-point value, v denotes the floating-point value, v′ denotes the mantissa prefix number of the floating-point value v, and SP(v′) is the start decimal significand position of the mantissa prefix number. In an alternative implementation, $SP(v')=\lfloor\log_{10}|v'|\rfloor$, where the operator $\lfloor x\rfloor$ denotes rounding x down. For example, for the original floating-point value 3.17, the stored modified decimal significand count is β=β*=3, and the mantissa prefix number v′ thereof is calculated as 3.1640625, so the decimal place count of the original floating-point value is calculated as $\alpha=\beta-(SP(v')+1)=3-(\lfloor\log_{10}|3.1640625|\rfloor+1)=2$. Thus, the decimal place count of the original floating-point value is recovered.
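The recovery of the decimal place count for the case β*≠0 may be sketched as follows. The sketch assumes SP(v′)=⌊log10|v′|⌋ evaluated in floating-point arithmetic, which may need care for values extremely close to a power of ten; the function names are hypothetical and the β*=0 case is deliberately left out.

```python
import math

def start_position(v_prime: float) -> int:
    """SP(v') = floor(log10(|v'|)), the position of the first significant digit."""
    return math.floor(math.log10(abs(v_prime)))

def recover_decimal_place_count(beta_star: int, v_prime: float) -> int:
    """alpha = beta - (SP(v') + 1), valid only when beta_star is non-zero."""
    if beta_star == 0:
        raise ValueError("beta_star == 0 marks a power of ten and is handled separately")
    return beta_star - (start_position(v_prime) + 1)

print(recover_decimal_place_count(3, 3.1640625))
```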
Step 4: recovering the floating-point value based on the mantissa prefix number of the floating-point value and the decimal place count of the floating-point value.
In the embodiment, the original floating-point value may be recovered based on the mantissa prefix number of the floating-point value and the decimal place count of the floating-point value.
In an alternative implementation of the embodiment, Step 4 further comprises: transforming the mantissa prefix number of the floating-point value into the decimal format; and recovering the floating-point value by:
$v=\mathrm{Leaveout}(v',\alpha)+10^{-\alpha}$,
where $\mathrm{Leaveout}(v',\alpha)=(d_{h'-1}d_{h'-2}\ldots d_0.d_{-1}d_{-2}\ldots d_{-\alpha})_{10}$ is the operation that leaves out the digits after $d_{-\alpha}$ in $DF(v')=(d_{h'-1}d_{h'-2}\ldots d_0.d_{-1}d_{-2}\ldots d_{-\alpha}d_{-(\alpha+1)}\ldots d_{l'})_{10}$, v denotes the floating-point value, v′ denotes the mantissa prefix number of the floating-point value, DF(v′) is the mantissa prefix number in the decimal format, and $d_i$ denotes a digit on the i-th place in the mantissa prefix number in the decimal format.
For example, the mantissa prefix number “0 10000000000 1001010100000000000000000000000000000000000000000000” of the original floating-point value 3.17 is transformed into the decimal format to obtain a value of 3.1640625, and the decimal place count of the original floating-point value is recovered as α=2; then $v=\mathrm{Leaveout}(v',\alpha)+10^{-\alpha}=3.16+10^{-2}=3.17$. The original floating-point value is recovered without loss of precision.
In an alternative implementation of the embodiment, the equation $v=\mathrm{Leaveout}(v',\alpha)+10^{-\alpha}$ may be implemented by v=Roundup(v′, α), where Roundup(v′, α) is the operation to round v′ up to α decimal places.
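The recovery by Leaveout(v′, α)+10^(−α) may be sketched with exact decimal arithmetic as follows. The sketch assumes a positive v′ that differs from v (i.e., bits were actually erased); the function name `recover_value` is hypothetical.

```python
from decimal import Decimal, ROUND_DOWN

def recover_value(v_prime: float, alpha: int) -> float:
    """Drop the digits of v' after the alpha-th decimal place, then add 10**-alpha."""
    step = Decimal(1).scaleb(-alpha)                      # 10**-alpha as an exact decimal
    # Decimal(v_prime) is the exact decimal expansion of the double v'.
    kept = Decimal(v_prime).quantize(step, rounding=ROUND_DOWN)  # Leaveout(v', alpha)
    return float(kept + step)

print(recover_value(3.1640625, 2))
```

Exact decimal arithmetic is used here so that the truncation step is not affected by binary rounding of intermediate results.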
According to the embodiment of the present disclosure, the stored XORed result is acquired and then the mantissa prefix number of the original floating-point value is recovered therefrom, and then the decimal place count of the original floating-point value is recovered, and the original floating-point value is recovered based on the mantissa prefix number and decimal place count of the original floating-point value, without any precision loss. The lossless decompression for the floating-point value is realized.
In the following, double-precision floating-point data occupying 64 bits are taken as an example to explain the embodiments of the present disclosure in detail. The processing method for single-precision floating-point data is similar.
For ease of explanation, the following definitions are provided.
Definition 1: Decimal Format and Binary Format. The decimal format of a double value v is $DF(v)=\pm(d_{h-1}d_{h-2}\ldots d_0.d_{-1}d_{-2}\ldots d_l)_{10}$, where $d_i\in\{0,1,\ldots,9\}$ for $l\le i\le h-1$, $d_{h-1}\ne 0$ unless h=1, and $d_l\ne 0$ unless l=−1. That is, DF(v) would not start with “0” except that h=1, and would not end with “0” except that l=−1. Similarly, the binary format of v is BF(v)=±(b
where the “±” (which means “+” or “−”) is the sign of v. If v≥0, “+” is usually omitted. For example, DF(0)=(0.0)10, DF(5.2)=(5.2)10, BF(−3.125)=−(11.001)2.
Definition 2: Decimal Place Count, Decimal Significand Count and Start Decimal Significand Position. Given v with its decimal format $DF(v)=\pm(d_{h-1}d_{h-2}\ldots d_0.d_{-1}d_{-2}\ldots d_l)_{10}$, DP(v)=|l| is called its decimal place count. If $d_i=0$ for all $l<n\le i\le h-1$ but $d_{n-1}\ne 0$ (i.e., $d_{n-1}$ is the first digit that is not equal to 0), SP(v)=n−1 is called the start decimal significand position, and DS(v)=n−l=SP(v)+1−l is called the decimal significand count. For the case of v=0, we let DS(v)=0 and leave SP(v) undefined.
For example, DP(3.14)=2, DS(3.14)=3, and SP(3.14)=0; DP(−0.0314)=4, DS(−0.0314)=3, and SP(−0.0314)=−2; DP(314.0)=1, DS(314.0)=4, and SP(314.0)=2.
As illustrated in
where v denotes the floating-point value, mi denotes a digit on the ith mantissa bit in the mantissa bits of the floating-point value, and ei denotes a digit on the ith exponent bit in the exponent bits of the floating-point value. If let m0=1 and BF(v)=±(b
As illustrated in
The main idea of the Erasing-based Lossless Floating-point (Elf) compression described herein is to erase some less significant mantissa bits (i.e., set them to zeros) of a double value v. As a result, v itself and the XORed result of v with its previous value are expected to have many trailing zeros. Note that v and its opposite number −v have the same double-precision floating-point format except for their sign bits. That is to say, the compression process for −v can be converted into the one for v by flipping only its sign bit, and vice versa. To this end, in the rest of the disclosure, unless otherwise specified, v is assumed to be positive for convenience of description. Before introducing the details of Elf compression, we first give the definition of the mantissa prefix number.
Definition 3: Mantissa Prefix Number. Given a double value v with $\vec{m}=\langle m_1,m_2,\ldots,m_{52}\rangle$, the double value v′ with $\vec{m'}=\langle m'_1,m'_2,\ldots,m'_{52}\rangle$ is called the mantissa prefix number of v if and only if there exists a number $n\in\{1,2,\ldots,51\}$ such that $m'_i=m_i$ for $1\le i\le n$ and $m'_j=0$ for $n+1\le j\le 52$, denoted as v′=MPN(v, n).
The definition of the Mantissa Prefix Number is first proposed in embodiments of the present disclosure.
For example, as shown in
The erasing-based lossless compression method for floating-point values described in embodiments of the present disclosure is based on the following observation: given a double value v with its decimal format $DF(v)=(d_{h-1}d_{h-2}\ldots d_0.d_{-1}d_{-2}\ldots d_l)_{10}$, we can find one of its mantissa prefix numbers v′ and a minor double value δ, $0\le\delta<10^{l}$, such that v′=v−δ. If the information of v′ and δ is retained, v could be recovered without losing any precision. The parameter δ is introduced herein for ease of understanding the compression and decompression methods, and its accurate value is not required to be calculated. During recovery of the original floating-point value v, we just need to leave out the digits of v′ after the α-th decimal place and then add $10^{-\alpha}$, i.e., round v′ up to α decimal places. For example, when α=DP(v)=DP(3.17)=2 and v′=3.1640625, $v=\mathrm{RoundUp}(v',\alpha)=(3.16)_{10}+10^{-2}=3.17$. In an example, v=RoundUp(v′, α) could also be implemented by $\mathrm{Leaveout}(v',\alpha)+10^{-\alpha}$, where $\mathrm{Leaveout}(v',\alpha)=(d'_{h-1}d'_{h-2}\ldots d_0.d_{-1}d_{-2}\ldots d_{-\alpha})_{10}$ leaves out the digits after $d_{-\alpha}$ in $DF(v')=(d'_{h-1}d'_{h-2}\ldots d_0.d_{-1}d_{-2}\ldots d_{-\alpha}d_{-(\alpha+1)}\ldots d_{l'})_{10}$.
There are two problems here. Problem I: how to find the best mantissa prefix number v′ of v with minimum effort? Problem II: how to store the decimal place count α with the minimum storage cost?
For Problem I: it is time-consuming to iteratively check the mantissa prefix numbers v′ until δ=v−v′ exceeds $10^{-\alpha}$, since up to 52 mantissa prefix numbers need to be verified in the worst case. A novel mantissa prefix number search method is therefore proposed herein.
Theorems are proposed herein for ease of explaining the mantissa prefix number search method.
Theorem 1: Given a double value v with its decimal place count DP(v)=a and binary format BF(v)=(b
Here, $f(\alpha)=\lceil|\log_2 10^{-\alpha}|\rceil$ means that the decimal value $10^{-\alpha}$ requires exactly $\lceil|\log_2 10^{-\alpha}|\rceil$ binary bits to represent. Suppose δ is obtained based on Theorem 1; then v−δ can be regarded as erasing the bits after $b_{-f(\alpha)}$ in the binary format of v. In accordance with the IEEE 754 Standard, and recalling that $b_{-i}=m_{i+e-1023}$ in BF(v) for i>0 as described above, a corresponding $m_{i+e-1023}$ can be found. Consequently, v−δ can be further deemed as erasing the mantissa bits after $m_{g(\alpha)}$ in the underlying floating-point format of v, in which g(α) is defined as:
$g(\alpha)=\lceil|\log_2 10^{-\alpha}|\rceil+e-1023=\lceil\alpha\times\log_2 10\rceil+e-1023$,
where α=DP(v) and $e=(e_1e_2\ldots e_{11})_2=\sum_{i=1}^{11}e_i\times 2^{11-i}$.
As a result, we can directly calculate the best mantissa prefix number v′ by simply erasing the mantissa bits after $m_{g(\alpha)}$ of v, which takes only O(1) time.
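The bound behind this direct calculation can be checked numerically. The following sketch, using exact rational arithmetic, verifies for a few sample values (chosen here for illustration only) that the erased amount δ=v−v′ satisfies 0≤δ<10^(−α); the helper names and the clamping of g(α) are assumptions of this sketch.

```python
import math
import struct
from fractions import Fraction

def erase_after_reference_bit(v: float, alpha: int) -> float:
    """Clear the mantissa bits after g(alpha) in a single step."""
    bits = struct.unpack(">Q", struct.pack(">d", v))[0]
    e = (bits >> 52) & 0x7FF
    g = max(0, min(52, math.ceil(alpha * math.log2(10)) + e - 1023))
    erased = 52 - g
    return struct.unpack(">d", struct.pack(">Q", (bits >> erased) << erased))[0]

def erased_amount(v: float, alpha: int) -> Fraction:
    """delta = v - v', computed exactly on the binary values."""
    return Fraction(v) - Fraction(erase_after_reference_bit(v, alpha))

samples = [(3.17, 2), (0.0314, 4), (12.5, 1), (1024.001, 3)]
for v, alpha in samples:
    delta = erased_amount(v, alpha)
    print(v, float(delta), 0 <= delta < Fraction(1, 10 ** alpha))
```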
For Problem II, if the decimal place count α is stored directly, it would require $\lceil\log_2\alpha_{max}\rceil$ bits, where $\alpha_{max}$ is the maximum possible value of a decimal place count. The minimum positive value of a double-precision floating-point number is about $4.9\times10^{-324}$, so $\alpha_{max}=324$ and $\lceil\log_2\alpha_{max}\rceil=9$, i.e., it would require as many as 9 bits to store α during the compression process for each double value. Thus, to further reduce the storage cost and improve the compression ratio, we may store the modified decimal significand count of the floating-point value instead.
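The storage costs involved can be checked with a short calculation. The 4-bit figure below assumes that only the sixteen values 0 to 15 of the modified decimal significand count are recorded, as in the alternative implementation described later in this disclosure; the variable names are illustrative.

```python
import math

alpha_max = 324                                   # largest decimal place count of a double
bits_for_alpha = math.ceil(math.log2(alpha_max))  # cost of storing alpha directly

recorded_values = 16                              # assumption: beta* restricted to 0..15
bits_for_beta_star = math.ceil(math.log2(recorded_values))

print(bits_for_alpha, bits_for_beta_star)
```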
Given v with its decimal format $DF(v)=(d_{h-1}d_{h-2}\ldots d_0.d_{-1}d_{-2}\ldots d_l)_{10}$, we notice that its decimal place count α=DP(v) can be calculated from the decimal significand count β=DS(v). Since the decimal significand count of a double value would not be greater than 17 under the IEEE 754 Standard, it requires much fewer bits to store β. According to Definition 2 above, we have α=DP(v)=|l|=−l and β=DS(v)=SP(v)+1−l, so we have:
$\alpha=\beta-(SP(v)+1)$.
Next, we discuss how to get SP(v) without even knowing v. Two additional Theorems are proposed. The additional theorems are proposed according to the structure of double floating-point value.
Theorem 2: Given a double value v and its best mantissa prefix number v′, if $v\ne 10^{-i}$, i>0, then SP(v)=SP(v′).
Theorem 3: Given a double value $v=10^{-i}$, i>0, and its best mantissa prefix number v′, we have SP(v)=SP(v′)+1.
According to Theorem 2 and Theorem 3, we have:
For any normal number v, its decimal significand count β will not be zero. Besides, if we know v=10−SP(v), we can easily get v from v′ by the following equation:
To this end, we can record a modified decimal significand count β* for the calculation of α.
where β* denotes the modified decimal significand count of the floating-point value, and β denotes a decimal significand count of the floating-point value, SP(v′) denotes the start decimal significand position of v′.
Although there are 18 possible values of β*, i.e., β*∈{0, 1, 2, . . . , 17}, we do not consider the situations when β*=16 or 17, because in these two situations we can erase only a small number of bits but need more bits to record β*. For example, given v=3.141592653589792 with β=16, we can erase one bit only. Thus, the erasing operation may be performed when it is determined that β*<16.
In an alternative implementation, since 4 bits are used to record β* for 0≤β*≤15, the erasing operation is performed only when 52−g(α)>4. When 52−g(α)≤4, which means that no more than 4 mantissa bits would be erased, the erasing operation may be skipped.
In an alternative implementation, δ=0 indicates that v itself already has long trailing zeros, in which case the erasing operation may be skipped. We may obtain δ by extracting the least significant 52−g(α) mantissa bits of v, to determine whether δ=0.
Implementations of the present disclosure store the modified decimal significand count β* instead of the decimal place count, which requires 4 bits rather than up to 9 bits per value.
Normal numbers account for most values in time series data, and the erasing operation in the above-described compression and decompression methods is applicable to normal numbers. For special numbers, however, the erasing operation described above needs to be tailored.
There are four types of special numbers:
The above erasing operation is tailored for the special numbers by:
According to yet another embodiment of the present disclosure, a method for storing the modified decimal significand count of the floating-point value is provided. The method includes: in response to determining that the condition C1 is satisfied, writing a first flag code (e.g., one bit of “1”) to indicate performing the erasing operation, and writing 4 bits of β* following the first flag code; in response to determining that the condition C1 is not satisfied, writing a second flag code (e.g., one bit of “0”) to indicate not performing the erasing operation. The condition C1 is satisfied when δ≠0 (i.e., a digit on a mantissa bit following the reference mantissa bit is not zero), and/or β*<16, and/or 52−g(α)>4. For example, the condition C1 is satisfied when it is determined that δ≠0. For example, the condition C1 is satisfied when it is determined that δ≠0 and β*<16. For example, the condition C1 is satisfied when it is determined that δ≠0 and 52−g(α)>4. For another example, the condition C1 is satisfied when it is determined that δ≠0 and β*<16 and 52−g(α)>4. An alternative implementation of storing the modified decimal significand count β* is described in
In an alternative implementation, given a floating-point value v, when it is determined that the above condition C1 is satisfied, the out stream writes a first flag code (e.g., one bit of “1”) to indicate that v should be transformed to v′ by erasing the least 52−g(α) significant mantissa bits of v, followed by 4 bits of β* for the recovery of v. Otherwise, the out stream writes a second flag code (e.g., one bit of “0”), and v′ is assigned v without any modification. Finally, the obtained v′ is passed to the XOR-based compressor (i.e., the XORcmp illustrated in
In an alternative implementation, when it is determined that δ≠0 and β*<16 and 52−g(α)>4 hold simultaneously, the out stream writes a first flag code (e.g., one bit of “1”) to indicate that v should be transformed to v′ by erasing the least 52−g(α) significant mantissa bits of v, followed by 4 bits of β* for the recovery of v. Otherwise, the out stream writes a second flag code (e.g., one bit of “0”), and v′ is assigned v without any modification. Finally, the obtained v′ is passed to the XOR-based compressor together with the first or second flag code for further compression.
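The flag-writing step above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `elf_erase` is hypothetical, the bit stream is modeled as a string of “0”/“1” characters, and g(α) and β* are assumed to be computed by the caller (with 0≤g(α)≤52). The inputs `g_alpha=8` and `beta_star=3` in the usage line are illustrative values only.

```python
import struct

def double_to_bits(v):
    """Return the 64-bit IEEE 754 pattern of a Python float as an int."""
    return struct.unpack(">Q", struct.pack(">d", v))[0]

def elf_erase(v, g_alpha, beta_star):
    """Return (header_bits, v_prime_bits) for one double value.

    g_alpha and beta_star are assumed inputs supplied by the caller.
    """
    bits = double_to_bits(v)
    erase_len = 52 - g_alpha  # number of least significant mantissa bits to erase
    # delta is the value held in the bits that would be erased
    delta = bits & ((1 << erase_len) - 1) if erase_len > 0 else 0
    if delta != 0 and beta_star < 16 and erase_len > 4:
        header = "1" + format(beta_star, "04b")  # first flag code, then 4 bits of beta*
        v_prime = (bits >> erase_len) << erase_len  # clear the erased bits
        return header, v_prime
    return "0", bits  # second flag code; v' is v without modification

header, v_prime = elf_erase(3.17, g_alpha=8, beta_star=3)
```

The returned `v_prime` is what would then be handed to the XOR-based compressor; the header bits precede it in the output stream.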
The values in a time series usually have similar significand counts, so their modified significand counts are also similar. In the method described above, if a value v is to be erased, we always use four bits to record its β*, which consumes storage space. An embodiment of the present disclosure proposes to make the most of the modified significand count of the previous value, β*pre, which is not only suitable for streaming scenarios and adaptive to dynamic significand counts, but also retains the characteristics of lossless compression. The intuition is that the modified significand count of each value in a time series is likely to be exactly the same as that of the previous value. An alternative implementation of storing β* by making the most of β*pre is described in
As illustrated in
We notice that the case of “C1 and β*=β*pre” has the largest proportion among the three cases illustrated in
As illustrated in
An example algorithm for realizing the Elf+ compression corresponding to
The above algorithm presents the Elf+ compression method, which is similar to the Elf compression method except in two aspects. (1) We further check whether β*=β*pre when v is to be erased (Lines 4-9). If β*=β*pre, we only write one bit of ‘0’. Otherwise, we write two bits of ‘11’ and four bits of β*. Moreover, we assign β* to β*pre for the compression of the next value (Line 8). (2) The flag codes are different from those in Elf compression. For example, in Elf compression, we use one bit of ‘0’ to indicate the case that v would not be erased, but in Elf+ compression we use two bits of ‘10’ for this case (Line 11).
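The three Elf+ flag cases described above can be sketched as a small function. This is an illustration under stated assumptions: the name `elf_plus_header` is hypothetical, the decision of whether v is to be erased is passed in as a boolean, and β*pre is assumed to stay unchanged when a value is not erased.

```python
def elf_plus_header(erase, beta_star, beta_star_pre):
    """Return (header_bits, new_beta_star_pre) for one value.

    erase: True when the erasing condition holds for the value.
    """
    if erase:
        if beta_star == beta_star_pre:
            return "0", beta_star_pre  # same count as the previous value: 1 bit
        # different count: flag '11', then 4 bits of beta*, and remember it
        return "11" + format(beta_star, "04b"), beta_star
    return "10", beta_star_pre  # value is not erased (assumed: beta*pre unchanged)

# Example: three consecutive values, starting from beta*pre = 3
h1, pre = elf_plus_header(True, 3, 3)
h2, pre = elf_plus_header(True, 4, pre)
h3, pre = elf_plus_header(False, 17, pre)
```

When consecutive values share the same β*, each erased value costs one header bit instead of the five bits used by the Elf encoding.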
Here, each of the first, second, third, and fourth flag codes may occupy one or two bits.
When β* is stored according to the encoding strategy illustrated in
When β* is stored according to the storing method illustrated in
When β* is stored according to the storing method illustrated in
An example algorithm for realizing recovering the original floating-point value corresponding to the Elf+ compression of
The naive method for calculating the significand count of a floating-point value is to first transform the value into a string, and then calculate its significand count by scanning the string. However, this method runs very slowly since the data type transformation is quite expensive. Other methods, such as BigDecimal in the Java language, perform even worse, because these high-level classes implement complex logic that is unnecessary for calculating significand counts.
In an alternative implementation, a trial-and-error approach is proposed herein to calculate the significand count. In particular, for any one of the above described compression methods, we iteratively check whether the condition “v×10^i=⌊v×10^i⌋” holds (the condition holds only when v×10^i has no fractional part), where i runs sequentially from sp* to at most sp*+17 (note that the maximum significand count of a double value is 17). Here, sp* is calculated by:
The first value of i (denoted as i*) that makes the equation “v×10^i=⌊v×10^i⌋” hold can be deemed the decimal place count α. Finally, we obtain the significand count β=i*+SP(v)+1 according to the equation α=β−(SP(v)+1).
The verification of the condition “v×10^i=⌊v×10^i⌋” is expected to take O(β) in terms of time complexity. To expedite this process, we may take full advantage of the fact that most values in a time series have the same significand count. We may start the verification at i=max(β*pre−SP(v)−1, 1). There are two cases. Case 1: β*≤β*pre. For this case, if “v×10^i=⌊v×10^i⌋” does not hold, we may repeatedly increase i by 1 until the condition is satisfied. Case 2: β*>β*pre. For this case, we should keep decreasing i until the condition “i>1 and v×10^(i−1)=⌊v×10^(i−1)⌋” no longer holds. Finally, the significand count is obtained and returned according to the equation α=β−(SP(v)+1).
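The search that starts from the previous count can be sketched as follows. This is a minimal illustration, not the disclosed algorithm: the function name is hypothetical, SP(v) is taken as ⌊log10|v|⌋, the upper bound on i is an assumption derived from the maximum significand count of 17, and the integrality test is done in plain double arithmetic, which can be affected by rounding for some inputs. The sketch simply branches on whether the condition holds at the starting i.

```python
import math

def significand_count(v, beta_pre):
    """Estimate the decimal significand count of a non-zero double v.

    beta_pre is the count obtained for the previous value in the stream.
    """
    a = abs(v)
    sp = math.floor(math.log10(a))  # start position SP(v)

    def is_integral(i):
        x = a * 10.0 ** i
        return x == math.floor(x)  # no fractional part

    i = max(beta_pre - sp - 1, 1)  # starting guess for the decimal place count
    alpha_max = 16 - sp  # assumed cap: beta = alpha + sp + 1 is at most 17
    if not is_integral(i):
        # the guess is too small: increase i until the condition holds
        while i < alpha_max and not is_integral(i):
            i += 1
    else:
        # the guess may be too large: decrease i while i - 1 still satisfies it
        while i > 1 and is_integral(i - 1):
            i -= 1
    return i + sp + 1  # beta = alpha + SP(v) + 1

count = significand_count(0.25, beta_pre=1)
```

When the previous count already matches, only the starting i and its neighbor are examined, which is the common case the paragraph above relies on.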
In an alternative implementation, we may leverage two sorted exponential arrays, i.e., LogArr1={10^0, 10^1, . . . , 10^i, . . . } and LogArr2={10^0, 10^−1, . . . , 10^−j, . . . }, to accelerate the process of finding SP(v). In particular, we first scan these two arrays sequentially. If v≥1 and 10^i≤v<10^(i+1), then SP(v)=i; if v<1 and 10^−j≤v<10^−(j−1), then SP(v)=−j. In an alternative implementation, we may set |LogArr1|=|LogArr2|=10. If v≥10^10 or v≤10^−10, we may finally call ⌊log10|v|⌋ to get SP(v) (i.e., SP(v)=⌊log10|v|⌋). This alternative implementation reduces the time consumed in calculating the start position SP(v).
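The array lookup can be sketched as follows, assuming two 10-element arrays as suggested above. The names `LOG_ARR1`, `LOG_ARR2` and `start_position` are hypothetical, the lookup is applied to |v|, and v is assumed to be non-zero (zero is handled as a special number).

```python
import math

LOG_ARR1 = [float(f"1e{i}") for i in range(10)]   # 10^0 .. 10^9
LOG_ARR2 = [float(f"1e-{j}") for j in range(10)]  # 10^0 .. 10^-9

def start_position(v):
    """Return SP(v) = floor(log10(|v|)) for a non-zero double v."""
    a = abs(v)
    if a >= 1e10 or a <= 1e-10:
        return math.floor(math.log10(a))  # outside the arrays: fall back to log10
    if a >= 1.0:
        i = 0
        # advance while the next power of ten is still not larger than a
        while i + 1 < len(LOG_ARR1) and a >= LOG_ARR1[i + 1]:
            i += 1
        return i
    j = 1
    # advance while a is still below 10^-j
    while j < len(LOG_ARR2) and a < LOG_ARR2[j]:
        j += 1
    return -j

sp = start_position(12.5)
```

For typical sensor readings the scan ends after one or two comparisons, so the logarithm is only evaluated for very large or very small magnitudes.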
Theoretically, any existing XOR-based compressor, such as Gorilla and Chimp mentioned above, can be utilized in Elf. Since the erased value v′ tends to contain long trailing zeros, to compress the time series compactly, in this section, we propose a novel XOR-based compressor and the corresponding decompressor. In an embodiment, both Elf and Elf+ use the same XORcmp and XORdcmp.
Elf XORcmp: existing XOR-based compressors store the first value v_1′ of a time series using 64 bits. However, after some insignificant mantissa bits are erased, v_1′ tends to have a large number of trailing zeros. As a result, we leverage ⌈log2 65⌉=7 bits to record the number of trailing zeros trail of v_1′ (note that trail can take a total of 65 values from 0 to 64), and store the non-trailing bits of v_1′ with 64−trail bits. In all, we may utilize 71−trail bits to record the first value, which is usually less than 64 bits. For each value v_t′ with t>1, we store xor_t=v_t′⊕v_{t−1}′.
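The encoding of the first value can be sketched as follows. This is a minimal illustration with hypothetical function names, in which the bit stream is modeled as a string of “0”/“1” characters: 7 bits record the trailing zeros count, followed by the 64−trail non-trailing bits.

```python
import struct

def double_to_bits(v):
    return struct.unpack(">Q", struct.pack(">d", v))[0]

def bits_to_double(b):
    return struct.unpack(">d", struct.pack(">Q", b))[0]

def trailing_zeros(x):
    """Count trailing zero bits of a 64-bit pattern (64 when x is 0)."""
    return (x & -x).bit_length() - 1 if x else 64

def encode_first(v):
    """Encode the first value with 71 - trail bits."""
    bits = double_to_bits(v)
    trail = trailing_zeros(bits)
    out = format(trail, "07b")  # 7 bits cover the 65 possible values 0..64
    if trail < 64:
        out += format(bits >> trail, f"0{64 - trail}b")  # non-trailing bits
    return out

def decode_first(s):
    """Return (value, number of bits consumed)."""
    trail = int(s[:7], 2)
    n = 64 - trail
    if n == 0:
        return bits_to_double(0), 7
    return bits_to_double(int(s[7:7 + n], 2) << trail), 7 + n

encoded = encode_first(1.0)  # 1.0 has 52 trailing zero bits
```

For 1.0 the encoding takes 71−52=19 bits instead of 64.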
Gorilla Compressor: the Gorilla compressor checks whether xor_t is equal to 0. If xor_t=0 (i.e., v_t′=v_{t−1}′), Gorilla writes one bit of “0”, and thus it can save many bits without actually storing v_t′. If xor_t≠0, Gorilla writes one bit of “1” and further checks whether the condition C1 is satisfied. Here C1 is “lead_t≥lead_{t−1} and trail_t≥trail_{t−1}”, meaning that the leading zeros count and trailing zeros count of xor_t are greater than or equal to those of xor_{t−1}, respectively. If C1 does not hold, after writing a bit of “1”, Gorilla stores the leading zeros count and center bits count with 5 bits and 6 bits respectively, followed by the actual center bits. Otherwise, xor_t shares the information of leading zeros count and center bits count with xor_{t−1}, which is expected to save some bits.
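The Gorilla encoding of one XORed value can be sketched as follows. The function names are hypothetical, and the sketch makes several assumptions not stated above: a control bit of “0” after the leading “1” marks the case where C1 holds (as in the published Gorilla scheme), the leading zeros count is capped at 31 so that it fits in 5 bits, and a center bits count of 64 is stored as 0 so that it fits in 6 bits.

```python
def leading_zeros(x):
    return 64 - x.bit_length()

def trailing_zeros(x):
    return (x & -x).bit_length() - 1 if x else 64

def gorilla_encode_xor(xor, prev_lead, prev_trail):
    """Return (bits, lead, trail); lead and trail are reused for the next value."""
    if xor == 0:
        return "0", prev_lead, prev_trail  # identical to the previous value
    lead = min(leading_zeros(xor), 31)  # assumed cap so the count fits in 5 bits
    trail = trailing_zeros(xor)
    if prev_lead is not None and lead >= prev_lead and trail >= prev_trail:
        # C1 holds: reuse the previous window and write only the center bits
        n = 64 - prev_lead - prev_trail
        return "10" + format(xor >> prev_trail, f"0{n}b"), prev_lead, prev_trail
    n = 64 - lead - trail  # center bits count
    header = "11" + format(lead, "05b") + format(n % 64, "06b")
    return header + format(xor >> trail, f"0{n}b"), lead, trail

b1, lead, trail = gorilla_encode_xor(0x00FF0000, None, None)
b2, lead, trail = gorilla_encode_xor(0x000F0000, lead, trail)
```

In this example the second value reuses the window of the first one and takes 19 bits rather than 30.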
Leading Code Optimization: Observing that the leading zeros count of an XORed value is rarely more than 30 or less than 8, only log2 8=3 bits may be used to represent up to 24 leading zeros. In particular, 8 exponentially decaying steps (i.e., 0, 8, 12, 16, 18, 20, 22, 24) may be used to approximately represent the leading zeros count. An actual leading zeros count between 0 and 7 is approximated to 0; between 8 and 11, to 8; between 12 and 15, to 12; between 16 and 17, to 16; between 18 and 19, to 18; between 20 and 21, to 20; between 22 and 23, to 22; and a count of 24 or more, to 24. The condition C1 is therefore converted into C2, i.e., “lead_t=lead_{t−1} and trail_t≥trail_{t−1}”. By applying this optimization to Gorilla compressor, we can get a compressor shown in
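The mapping from an actual leading zeros count to its 3-bit code can be sketched as follows. The names `STEPS` and `approx_leading` are hypothetical, and counts above 24 are assumed to map to the last step.

```python
STEPS = [0, 8, 12, 16, 18, 20, 22, 24]  # the 8 steps, indexed by a 3-bit code

def approx_leading(lead):
    """Return (3-bit code, approximated leading zeros count)."""
    code = 0
    for idx, step in enumerate(STEPS):
        if lead >= step:
            code = idx  # keep the largest step not exceeding lead
    return code, STEPS[code]

code, approx = approx_leading(19)
```

Because the approximation never exceeds the actual count, any leading zeros that are not covered are simply written as part of the center bits, so no information is lost.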
Center Code Optimization: both v_t′ and v_{t−1}′ are supposed to have many trailing zeros, which results in an XORed value with long trailing zeros. Besides, v_t′ would not differ much from v_{t−1}′ in most cases, contributing to long leading zeros in the XORed value. That is, the XORed value tends to have a small number of center bits (usually not more than 16). To this end, if the center bits count is less than or equal to 16, we use only log2 16=4 bits to encode it. Although we need one more flag bit, we can usually save one bit in comparison with the original solution. After optimizing the center code, an example compressor as shown in
Flag Code Reassignment:
As illustrated in
Experiments are performed to verify the performance of the above described erasing-based lossless compression method for floating-point values and the erasing-based lossless decompression method for floating-point values.
1. Datasets: 22 datasets, including 14 time series and 8 non-time series, which are further divided into three categories according to their average decimal significand counts, as described in Table 1 above.
Baselines: we compare the Elf compression method with 9 existing compression methods. The erasing-based lossless compression method for floating-point values as described in the embodiments above is denoted as Elf, and the one that further adopts the significand count optimization and start position optimization is denoted as Elf+.
Metrics: We verify the performance of various methods in terms of three metrics: compression ratio, compression time and decompression time. Note that the compression ratio is defined as the ratio of the compressed data size to the original one.
2. Settings: As Chimp did, we regard 1,000 records of each dataset as a block. Each compression method is executed on up to 100 blocks per dataset, and the average metrics of one block are finally reported. By default, we regard each value as a double value. All experiments are conducted on a personal computer equipped with Windows 11, 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60 GHz CPU and 16 GB memory. The JDK (Java Development Kit) version is 1.8.
Performance: the performance of Elf and Elf+ is listed in Table 2 below.
Compression ratio: as illustrated in Table 2 below, among all the floating-point compression methods, the erasing-based lossless compression method (i.e., Elf) described in embodiments of the present disclosure has the best compression ratio on almost all datasets. In particular, for the time series datasets, compared with Gorilla and EPC, Elf has an average relative improvement of (0.76-0.37)/0.76≈51%. Thanks to the erasing technique and the elaborate XORcmp, Elf can still achieve relative improvements of 47% and 12% over Chimp and Chimp128 respectively on the time series datasets. For the non-time series datasets, Elf is also relatively (0.63-0.55)/0.63≈12.7% better than the best competitor Chimp128. We notice that there are a few datasets on which Chimp128 is slightly better than Elf in terms of compression ratio. For the datasets of WS, SUSA and BT, we find that there are many duplicate values within 128 consecutive records. In this case, Chimp128 can use only 9 bits to represent the same value. For the datasets of AS, PLat and PLon, since they have large decimal significand counts, Elf does not perform erasing but still consumes some flag bits. As pointed out by Gorilla, real-world floating-point measurements often have a decimal place count of one or two, which usually results in small or medium β. To this end, Elf can achieve good performance in most real-world scenarios.
As illustrated in Table 2 below, for both time series and non-time series with small and medium β, Elf+ even outperforms the best competitor Chimp128 on the datasets WS and SUSA, on which Chimp128 has a slightly better compression ratio than Elf. This is because Elf+ takes full advantage of the fact that most values in a time series have the same significand count.
Table 2 (data missing or illegible when filed)
Compression time and decompression time: Elf takes a little more time than other floating-point compression algorithms during both compression and decompression processes. Compared with other floating-point compression algorithms, Elf adds an erasing step and a restoring step, which inevitably takes more time. However, the difference is not obvious, since they are all on the same order of magnitude. For almost all datasets, Elf+ takes even less time than Elf during both compression and decompression processes.
In summary, Elf can usually achieve remarkable compression ratio improvement for both time series datasets and non-time series datasets, at an affordable cost of more time. Elf+ performs even better than Elf in terms of both compression ratio and running time.
In the following, single-precision floating-point data occupying 32 bits are taken as an example to explain the embodiments of the present disclosure in detail.
A single-precision floating-point value (abbr. single value) has a similar underlying storage layout to that of a double value, but it takes up only 32 bits, where 1 bit is for the sign, 8 bits are for the exponent, and 23 bits are for the mantissa. To support single values, we should make the following modifications.
If let m0=1 and BF(v)=±(b
In an alternative implementation, for a single floating point value, the reference mantissa bit may be determined by:
where α denotes the decimal place count of the floating-point value, g(α) denotes the place of the reference mantissa bit in the mantissa bits of the floating-point value, and ei denotes a digit on the ith exponent bit in the exponent bits of the floating-point value.
In an alternative implementation, for a single floating point value, when δ=0, it indicates that v itself has long trailing zeros. Thus, the erasing operation may be performed in response to δ≠0, i.e., the erasing operation may be performed in response to determining that a digit on a mantissa bit following the reference mantissa bit is not zero.
In an alternative implementation, for a single floating point value, the erasing operation may be performed in response to δ≠0 and β*<8. When the erasing operation is performed in response to δ≠0 and β*<8, a positive gain on compression ratio may be ensured while ensuring the lossless compression on floating-point value.
In an alternative implementation, for a single floating point value, the erasing operation may be performed in response to δ≠0 and 23−g(α)>3. When the erasing operation is performed in response to δ≠0 and 23−g(α)>3, a positive gain on compression ratio may be ensured while ensuring the lossless compression on floating-point value.
In an alternative implementation, for a single floating point value, the erasing operation is performed in response to β*<8 and δ≠0 and 23−g(α)>3. The other processing operations such as compression, decompression, encoding strategies for β* and the XORed result are similar to those described for the double value, and would not be repeated herein.
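The single-precision variant of the erasing check can be sketched as follows, with the 23-bit mantissa and the thresholds β*<8 and 23−g(α)>3 given above. The function names are hypothetical, g(α) and β* are assumed to be computed by the caller, and the inputs in the usage line are illustrative values only.

```python
import struct

def float_to_bits(v):
    """Return the 32-bit IEEE 754 pattern of v rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", v))[0]

def erase_single(v, g_alpha, beta_star):
    """Return (erased, v_prime_bits) for one single-precision value."""
    bits = float_to_bits(v)
    erase_len = 23 - g_alpha  # number of least significant mantissa bits to erase
    delta = bits & ((1 << erase_len) - 1) if erase_len > 0 else 0
    if delta != 0 and beta_star < 8 and erase_len > 3:
        return True, (bits >> erase_len) << erase_len  # clear the erased bits
    return False, bits  # v' is v without modification

erased, v_prime = erase_single(3.17, g_alpha=8, beta_star=3)
```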
According to another embodiment of the present disclosure, an erasing-based lossless compression apparatus for floating-point values is provided.
As illustrated in
It should be noted that the apparatus shown in
According to another embodiment of the present disclosure, an erasing-based lossless decompression apparatus for floating-point values is provided.
As illustrated in
It should be noted that the apparatus shown in
According to yet another embodiment of the present disclosure, a non-transitory computer readable medium is provided. The non-transitory computer readable medium stores a computer program thereon, the program, when executed by a processor, causing the processor to implement any one of the methods described above.
According to yet another embodiment of the present disclosure, an electronic device is provided. The electronic device includes one or more processors and a storage apparatus storing one or more programs thereon, the one or more programs, when executed by the one or more processors, causing the one or more processors to implement any one of the methods described above.
It should be noted that in one or more of the above embodiments, the functions described in embodiments of the disclosure can be implemented by hardware, software, firmware, or any combination of them. When implemented by software, these functions can be stored in computer readable medium or transmitted as one or more instructions or codes on computer readable medium.
It should be understood that the various forms of processes shown above may be used to reorder, add, or delete steps. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in different orders. As long as the desired results of the technical solution disclosed in the present disclosure can be achieved, no limitation is made herein.
The above specific embodiments do not constitute limitation on the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement and improvement made within the spirit and principle of the present disclosure shall be included in the protection scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202310068186.3 | Jan 2023 | CN | national |
202310070527.0 | Jan 2023 | CN | national |