The disclosure relates to a field of data storage technologies, and particularly to a method, a system and an apparatus for data storage, and a storage medium.
With the development of science and technology, the volume of data is increasing rapidly, and storing massive data has become an important problem. To address this problem, research on data storage using deoxyribonucleic acid (DNA) has emerged. Information is stored in the form of a DNA chain, which theoretically enables it to be stored for a longer period of time without any loss of data. With current DNA storage technology, however, when data at a specific location needs to be acquired, all the data stored in the DNA must be read and screened; there is no way to read only the portion of data at a specific location, which makes reading inefficient.
The present disclosure is intended to solve, at least to a certain extent, one of the technical problems in the related art.
Therefore, one purpose of the embodiments of the disclosure is to provide a method, a system, an apparatus for data storage, a decoding method, and a storage medium.
To achieve the above technical purpose, the technical solutions in the embodiments of the disclosure include:
In a first aspect, the embodiments of the disclosure provide a method for data storage. The method includes:
Further, grouping the first data to obtain K packet sub-data, includes:
Further, inputting a preset primer into a random generator to obtain 4^T random number sequences, is specifically:
1≤j≤4^T.
Further, each of the random number sequences includes K random bits; determining the packet sub-data corresponding to the ith random number sequence, and performing XOR operation on the determined packet sub-data to obtain data information DATAi, includes:
Further, the storage method further includes randomization of the DNA molecular chain. The method includes:
In a second aspect, the embodiments of the disclosure provide a decoding method. The method includes:
In a third aspect, the embodiments of the disclosure provide a system for data storage. The system includes:
Further, each of the random number sequences includes K random bits. The packet determining module includes:
In a fourth aspect, the embodiments of the disclosure provide an apparatus for data storage. The apparatus includes:
In a fifth aspect, the embodiments of the disclosure provide a storage medium stored with programs executable by a processor, the programs executable by the processor being configured to implement the method for data storage when executed by the processor.
Advantages and beneficial effects of the present disclosure will be set forth in part in the following description, and in part will become obvious from the following description, or may be learned by practice of the disclosure.
In embodiments of the disclosure, in the process of coding the first data to obtain a DNA molecular chain, a random generator is added to greatly simplify the coding process and implement efficient and accurate coding on the first data, and a primer of a DNA molecular chain is configured as a seed of a random generator to maximize the function of the primer.
In order to explain the technical solutions in embodiments of the present disclosure or the related art more clearly, the drawings used in describing the embodiments or the related art are briefly introduced below. It should be understood that the drawings described below show only some embodiments of the present disclosure. Those skilled in the art may obtain other drawings from these drawings without creative work.
Embodiments of the present disclosure are described in detail below, and examples of embodiments are illustrated in the accompanying drawings, in which the same or similar labels represent the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are only configured to explain the present disclosure and are not to be construed as a limitation of the present disclosure. The block numbers in the following embodiments are set only for illustration, the sequence between blocks is not limited, and the execution sequence of the blocks in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
A method and a system for data storage according to embodiments of the present disclosure are described below with reference to the drawings. The method for data storage is described first.
Referring to
Specifically, in DNA storage, the target information to be stored, that is, the first data, is converted into a DNA base code and stored in a DNA chain. When the data needs to be read, the DNA chain is sequenced (PCR amplification of the DNA chain is sometimes required before sequencing) to obtain a corresponding base sequence, and the base sequence is converted, through a series of conversions, into information that a computer can recognize, so as to recover the data.
First, the first data is grouped to obtain K packet sub-data, that is, S1, S2, S3 . . . Sk, the data length of each packet sub-data being fixed.
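The grouping step can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: the function name `split_into_packets`, the representation of the first data as a string of "0"/"1" characters, and the zero-padding of the last packet are all assumptions made here so that every packet has the same fixed length.

```python
def split_into_packets(bits: str, packet_bits: int) -> list[str]:
    """Split a bit string into fixed-length packets S1..SK.

    Assumption: the last packet is right-padded with "0" so that all
    packets share the same length; the source does not specify padding.
    """
    if packet_bits <= 0:
        raise ValueError("packet_bits must be positive")
    packets = [bits[i:i + packet_bits] for i in range(0, len(bits), packet_bits)]
    if packets and len(packets[-1]) < packet_bits:
        packets[-1] = packets[-1].ljust(packet_bits, "0")
    return packets


# 4200 bits of data with 40-nt packets (2 bits per nt, so 80 bits per packet).
packets = split_into_packets("1" * 4200, 80)
print(len(packets))
```

With these example sizes the last packet holds 40 data bits followed by 40 padding bits.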
The preset primer is a DNA sequence specially designed for subsequent PCR amplification or sequencing with a specific base arrangement structure, which is predetermined and recorded before coding the first data.
The preset primer is input to the random generator as its seed to obtain a plurality of random numbers. The generation times capacity of the random generator is T, and 4^T is the number of generation times of the random generator; by controlling the cycle number of the random generator, the random generator may generate 4^T random numbers.
For example, the data length of the first data is S=4200 bits and the packet length is N=40 nt, where nt is an abbreviation of nucleotide and is the unit for the number of bases. Since 1 nt has a 2-bit information capacity, K=4200/(40*2)=52.5, which is rounded up to the integer 53.
With K=53, the first data is divided into 53 packet sub-data, and the number of generation times of the random generator must be greater than 53, so the capacity of the random generator is set to T=3 nt. The information storage capacity of 3 nt is 4^3=64 (1 nt can be any of 4 bases, so the information capacity of 1 nt is 4). It can equivalently be understood as 2^6=64 (1 nt corresponds to 2 bits and each bit has the two states 0 and 1, so 3 nt correspond to 3*2=6 bits, that is, 2 to the 6th power states).
By controlling the cycle number of the random generator, a plurality of random numbers may be output according to the input preset primer. Each random number is used to select a portion of the K packet sub-data, and an XOR operation is performed on the selected packet sub-data to obtain one piece of data information DATAi, where i is the controlled cycle number and 1≤i≤4^T.
Each piece of data information DATAi is spliced with the preset primer and the generation times capacity of the random generator to obtain a DNA molecular chain, and the 4^T DNA molecular chains are synthesized to obtain the target storage data.
It can be seen that, in the process of coding the first data to obtain a DNA molecular chain, a random generator is added to greatly simplify the coding process and implement efficient and accurate coding of the first data. The primer of the DNA molecular chain is configured as the seed of the random generator to maximize the function of the primer. In addition, in the prefix of each synthesized DNA molecular chain, the content of guanine and cytosine is kept at a preset ratio to the total content of guanine, cytosine, adenine and thymine contained in the primer, which enables sequencing with high accuracy when the coded data needs to be read.
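The arithmetic in this example can be checked with a short Python sketch. The function names `packet_count` and `counter_capacity` are illustrative and do not come from the disclosure; the only facts used are that 1 nt carries 2 bits and that a T-nt field can take 4^T values.

```python
import math

BITS_PER_NT = 2  # each nucleotide (A, C, G or T) carries 2 bits


def packet_count(data_bits: int, packet_nt: int) -> int:
    """Number of packets K for data_bits bits split into packet_nt-nt packets."""
    return math.ceil(data_bits / (packet_nt * BITS_PER_NT))


def counter_capacity(t_nt: int) -> int:
    """Number of distinct values of a T-nt field: 4^T, which equals 2^(2T)."""
    return 4 ** t_nt


K = packet_count(4200, 40)    # 4200 / 80 = 52.5, rounded up
capacity = counter_capacity(3)
print(K, capacity, capacity >= K)
```

A 3-nt field is therefore the smallest field, in whole nucleotides, whose capacity covers the 53 packets of this example, since a 2-nt field only offers 16 values.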
Further, as an optional implementation, block S2 includes blocks S21-S22:
Specifically, for example, the data length of the first data is S=4200 bit, the packet length is N=40 nt, the packet number K may be determined as:
Further, as an optional implementation, block S3 is specifically:
Specifically, the preset primer is converted to a decimal integer and input into the random generator as a seed. According to the input primer, the random generator outputs a decimal random integer in a range of [0, 2^K] and converts it into a random number sequence in binary form; the high bits of the random number sequence are padded with zeros so that the random number sequence has K bits. This binary sequence serves as the degree distribution sequence of the fountain code.
The cycle number j may be controlled through the generation times capacity of the random generator, so as to output 4^T random number sequences, 1≤j≤4^T.
Further, as an optional implementation, each random number sequence includes K random bits. Block S4 includes blocks S41-S42:
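The seeding step above can be sketched in Python under stated assumptions. The base-to-digit mapping (A=0, C=1, G=2, T=3), the use of Python's `random.Random` as the random generator, and the names `primer_to_seed` and `selection_sequences` are all choices made for this illustration; the disclosure does not name a particular generator or mapping.

```python
import random

# Assumed mapping from bases to base-4 digits; not specified in the source.
BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def primer_to_seed(primer: str) -> int:
    """Read the primer as a base-4 number and return it as an integer."""
    seed = 0
    for base in primer:
        seed = seed * 4 + BASE_TO_DIGIT[base]
    return seed


def selection_sequences(primer: str, k: int, count: int) -> list[str]:
    """Generate `count` K-bit binary strings from a generator seeded by the primer.

    randrange(2 ** k) keeps every value representable in K bits; the
    format spec left-pads the binary string with zeros to exactly K bits.
    """
    rng = random.Random(primer_to_seed(primer))
    return [format(rng.randrange(2 ** k), f"0{k}b") for _ in range(count)]


sequences = selection_sequences("ACGT", k=53, count=4 ** 3)
print(len(sequences), len(sequences[0]))
```

Because the generator is seeded only by the primer, calling `selection_sequences` again with the same primer reproduces the same sequences, which is the property a decoder relies on.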
at S42, XOR operation is performed on the selected packet sub-data to obtain the data information DATAi.
Specifically, referring to
In the above way, by controlling the cycle number of the random generator, the 4^T random number sequences correspond to 4^T pieces of data information. The preset primer, the generation times capacity of the random generator and the data information are assembled to form a set of fountain code drop data, that is, a DNA molecular chain.
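A minimal Python sketch of the XOR and assembly steps follows. It assumes that packets are held as integers, that the m-th bit of the random number sequence (counted from the left) selects packet Sm, and that integers are written as bases with the mapping A=0, C=1, G=2, T=3. The names `xor_selected`, `int_to_bases` and `assemble_chain` are hypothetical.

```python
DIGIT_TO_BASE = "ACGT"  # assumed mapping: 0->A, 1->C, 2->G, 3->T


def xor_selected(packets: list[int], selection: str) -> int:
    """XOR together the packets whose selection bit is '1'."""
    if len(selection) != len(packets):
        raise ValueError("selection must have one bit per packet")
    data = 0
    for bit, packet in zip(selection, packets):
        if bit == "1":
            data ^= packet
    return data


def int_to_bases(value: int, length_nt: int) -> str:
    """Write an integer as a fixed-length base string, 2 bits per base."""
    bases = []
    for _ in range(length_nt):
        bases.append(DIGIT_TO_BASE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def assemble_chain(primer: str, index: int, t_nt: int, data: int, n_nt: int) -> str:
    """Concatenate the primer, the T-nt generation count and the N-nt data."""
    return primer + int_to_bases(index, t_nt) + int_to_bases(data, n_nt)


packets = [0b1010, 0b0110, 0b1111]
data = xor_selected(packets, "101")          # S1 XOR S3
print(assemble_chain("ACGT", index=5, t_nt=3, data=data, n_nt=4))
```

The order of the three fields in the chain follows the order in which the text lists them; a different layout would work equally well as long as the decoder uses the same one.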
Further, as an optional implementation, the storage method further includes randomization of a DNA molecular chain at S6. Block S6 includes S61-S62:
Specifically, to ensure that the finally generated target storage data is sufficiently scrambled, randomization is performed again on the DNA molecular chains generated in the previous block (that is, the fountain code drop data). The preset primer is converted to a decimal integer and input into a random generator as a seed to generate a random integer in a range of [0, 4^(T+N)]. The random integer is converted into a corresponding base sequence (or a corresponding binary sequence), and an XOR operation is performed between it and the generation times capacity of the random generator together with the data information, so as to randomize the stored information.
Homopolymer imbalance or GC content imbalance in DNA storage may cause unpredictable errors in the DNA sequence generation, PCR amplification and sequencing phases. Therefore, when the DNA chain is synthesized, homopolymers are checked, and any case in which 4 consecutive bases are the same base is discarded. Then, homopolymers and GC content are checked over the full chain. If the chain does not satisfy the requirements (4 consecutive bases must not be the same base), the chain is deleted.
Finally, DNA sequence synthesis is performed on the screened DNA molecular chains to obtain and store the target storage data.
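The randomization can be illustrated with the Python sketch below. It assumes the payload (generation count plus data, T+N nt in total) is held as an integer of 2*(T+N) bits and uses Python's `random.Random` as a stand-in for the random generator; `whitening_mask` and `randomize` are names invented for this example.

```python
import random


def whitening_mask(seed: int, payload_nt: int) -> int:
    """Random integer below 4^payload_nt derived from the primer seed."""
    return random.Random(seed).randrange(4 ** payload_nt)


def randomize(payload: int, seed: int, payload_nt: int) -> int:
    """XOR the payload with the mask; applying it twice restores the payload."""
    return payload ^ whitening_mask(seed, payload_nt)


payload = 0b1011001110           # example 5-nt payload (10 bits)
scrambled = randomize(payload, seed=27, payload_nt=5)
restored = randomize(scrambled, seed=27, payload_nt=5)
print(restored == payload)
```

Because XOR is its own inverse, a decoder that knows the primer can regenerate the same mask and undo the randomization with the same function.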
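The screening rule can be sketched in Python as follows. The homopolymer limit (no run of 4 identical bases) comes from the text; the GC content window of 40% to 60% is purely an assumed placeholder, since the disclosure does not state numeric bounds. The function names are illustrative.

```python
def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest = run = 0
    prev = ""
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def gc_fraction(seq: str) -> float:
    """Fraction of bases in the chain that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def passes_screen(seq: str, max_run: int = 3,
                  gc_min: float = 0.4, gc_max: float = 0.6) -> bool:
    """Keep a chain only if it has no run longer than max_run and its
    GC fraction lies in [gc_min, gc_max] (the bounds are assumed values)."""
    return (max_homopolymer_run(seq) <= max_run
            and gc_min <= gc_fraction(seq) <= gc_max)


for chain in ("ACGTACGT", "ACGGGGTA", "AAATTTAT"):
    print(chain, passes_screen(chain))
```

Chains that fail the screen are simply dropped, which a fountain code tolerates because any sufficiently large set of surviving chains can be used for decoding.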
In addition, the disclosure further provides a decoding method applied to the target storage data obtained by the method for data storage. The method includes:
The specific decoding process is as follows:
When data coding and storage are performed, the preset primer information of the DNA storage data and the data length of the target storage data are known in advance, as is the DNA sequence of the primer. PCR amplification is performed according to the primer information, and the data is sequenced after amplification.
Block 1: the preset primer is converted to a decimal integer and input into a random generator as a seed to generate a random number in a range of [0, 4^(T+N)]. The random number is converted to a corresponding base sequence, and an XOR operation is performed between it and the part of the DNA chain (target storage data) other than the base sequence of the preset primer, so as to recover the original data.
Block 2: based on the recovered data, the preset primer is converted to a decimal integer and input into a random generator as a seed; according to the generation times information of the random generator, an integer in a range of [0, 2^K] is generated and converted to a random number sequence in K-bit binary form, and the resulting binary sequence D1 and the data sequence DATA1 are recorded. Sequencing sequences continue to be extracted until K different sequences have been extracted, and the K binary sequences D1, D2 . . . DK and the data sequences DATA1, DATA2 . . . DATAK are recorded.
Block 3: the K K-bit sequences D1, D2 . . . DK constitute a K-order matrix D.
Block 4: the matrix is solved by Gaussian elimination. The K-order matrix D (formed by D1, D2 . . . DK) and the K-row, 1-column DATA matrix (formed by DATA1, DATA2 . . . DATAK) are combined into an augmented matrix. The diagonal of the matrix is then traversed (i from 0 to K). If D[i][i]=0, the ith column is searched downward for a row j with D[j][i]=1, and the two rows are swapped. Once D[i][i]=1, the search continues down the column, and for every further row j with D[j][i]=1, all the data of the ith row is XORed into the jth row. This constructs an upper triangular matrix in which the area below the diagonal is all 0.
Block 5: the reverse operation is performed on the result of the previous block to eliminate all 1s above the diagonal to 0, so as to obtain the unique S1 . . . Sk, which completes the decoding of DATA1 . . . DATAK.
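Blocks 3 to 5 amount to solving a linear system over GF(2). The Python sketch below uses Gauss-Jordan elimination, which clears the entries below and above each pivot in a single pass rather than in the two separate passes described above; the result is the same. Rows of D and the DATA values are held as integers, with the leftmost bit of each row standing for S1. The function name `solve_gf2` and this representation are assumptions of the example.

```python
def solve_gf2(selections: list[int], data: list[int], k: int) -> list[int]:
    """Recover S1..SK from K selection rows and their XORed data values.

    selections[r] is a K-bit integer whose bit (k - 1 - c) is set when
    packet c took part in data[r]. Raises ValueError if the rows are
    linearly dependent, in which case more chains must be sequenced.
    """
    if len(selections) != k or len(data) != k:
        raise ValueError("exactly K rows are required")
    d, y = list(selections), list(data)
    for col in range(k):
        mask = 1 << (k - 1 - col)
        # Look down the column for a row with a 1 and swap it into place.
        pivot = next((r for r in range(col, k) if d[r] & mask), None)
        if pivot is None:
            raise ValueError("rows are linearly dependent")
        d[col], d[pivot] = d[pivot], d[col]
        y[col], y[pivot] = y[pivot], y[col]
        # XOR the pivot row into every other row that has a 1 in this column.
        for r in range(k):
            if r != col and d[r] & mask:
                d[r] ^= d[col]
                y[r] ^= y[col]
    return y


# Rows 110, 011, 111 applied to S = [0b1010, 0b0110, 0b1111].
print(solve_gf2([0b110, 0b011, 0b111], [0b1100, 0b1001, 0b0011], 3))
```

When the K collected rows are not independent, the function raises an error instead of returning a wrong answer, which corresponds to continuing to extract sequencing sequences until a solvable set is found.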
Next, a system for data storage provided in embodiments of the disclosure is described with reference to the drawings.
The system specifically includes:
Further, as an optional implementation, each of the random number sequences includes K random bits. The packet determining module 204 includes:
It can be seen that the contents of the above method embodiments are applied to the system embodiments. The functions embodied in the system embodiments are the same as the functions of the above method embodiments, and the beneficial effects achieved are the same as the beneficial effects achieved by the above method embodiments.
Referring to
The contents of the above method embodiments are applied to the apparatus embodiments. The functions embodied in the apparatus embodiments are the same as the functions of the above method embodiments, and the beneficial effects achieved are the same as the beneficial effects achieved by the above method embodiments.
In some optional embodiments, the functions/operations referred to in the block diagrams may occur out of the sequence shown in the diagrams. For example, two blocks shown in succession may be executed substantially concurrently or may sometimes be executed in the reverse sequence, depending on the functions/operations involved. In addition, the embodiments presented and described in the flowcharts of the present disclosure are provided by way of example and are intended to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the sequence of various operations is changed and in which sub-operations described as a part of a larger operation are executed independently.
In addition, even though the disclosure is described in the context of a functional module, it should be understood that one or more of the above functions and/or features may be integrated in a single physical unit and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module, unless otherwise indicated. It may be further understood that the detailed discussion regarding the actual implementation of each module is not necessary for understanding the disclosure. More specifically, in consideration of attributes, functions and internal relationships of various functional modules in the apparatus disclosed herein, the actual implementation of the module may be understood by those skilled in the art. Accordingly, the disclosure as set forth in the claims may be implemented without undue tests by those skilled in the art. It may be further understood that specific concepts disclosed are illustrative only and are not intended to limit the scope of the disclosure defined by the appended claims and their entire scope of equivalents.
The above functions may be stored in a computer readable memory if they are implemented in the form of a software function unit and sold or used as an independent product. On the basis of such an understanding, the technical solution of the present disclosure essentially, or the part contributing to the related art, or a part of the technical solution, may be embodied in the form of a software product. The software product, which includes several instructions, is stored in a storage medium so that a computer device (which may be a personal computer, a server, a network device, etc.) executes all or part of the blocks of the various embodiments of the present disclosure. The storage medium includes a USB disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media that may store program codes.
The logics and/or blocks represented in the flowchart or described in other ways herein, for example, may be considered as an ordered list of executable instructions configured to implement logic functions, which may be specifically implemented in any computer readable medium for use by instruction execution systems, apparatuses or devices (such as a computer-based system, a system including a processor, or other systems that may obtain and execute instructions from an instruction execution system, an apparatus or a device) or in combination with the instruction execution systems, apparatuses or devices. A “computer readable medium” in the specification may be an apparatus that may contain, store, communicate, propagate or transmit a program for use by instruction execution systems, apparatuses or devices or in combination with the instruction execution systems, apparatuses or devices.
More specific examples (a non-exhaustive list) of a computer readable medium include the following: an electronic connector (an electronic apparatus) with one or more cables, a portable computer disk box (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (an EPROM or a flash memory), an optical fiber device, and a portable optical disk read-only memory (CDROM). In addition, a computer readable medium may even be paper or another suitable medium on which a program may be printed, since the paper or other medium may be optically scanned and then edited, interpreted or processed in other suitable ways if necessary to obtain the program electronically and store it in a computer memory.
It should be understood that all parts of the present disclosure may be implemented with hardware, software, firmware or a combination thereof. In the above implementations, multiple blocks or methods may be stored in a memory and implemented by software or firmware executed by a suitable instruction execution system. For example, if implemented with hardware, as in another implementation, they may be implemented by any of the following technologies known in the art or a combination thereof: a discrete logic circuit with logic gate circuits configured to achieve logic functions on data signals, an application-specific integrated circuit with appropriate combinational logic gate circuits, a programmable gate array (PGA), a field programmable gate array (FPGA), etc.
In the above descriptions, descriptions with reference to terms “one implementation/embodiment”, “another implementation/embodiment” or “some implementations/embodiments”, etc. mean that specific features, structures, materials or characteristics described in combination with the implementation or example are included in at least one implementation or example of the present disclosure. The schematic representations of the above terms do not necessarily refer to the same implementation or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any one or more implementations or examples in a suitable manner.
Even though implementations of the disclosure have been illustrated and described, it may be understood by those skilled in the art that various changes, modifications, substitutions and alterations may be made for these implementations without departing from the principles and spirit of the disclosure, and the scope of the disclosure is defined by claims and their equivalents.
Although the preferred embodiments have been described in detail, the disclosure is not limited to these embodiments. Those skilled in the art know that various equivalent modifications or substitutions may be made without departing from the spirit of the disclosure, and all of these equivalent modifications or substitutions are intended to be included within the scope defined by the claims of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202110583430.0 | May 2021 | CN | national |
This application is a continuation of U.S. patent application Ser. No. 17/469,048, filed Sep. 8, 2021, the content of which application is hereby expressly incorporated herein by reference in its entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 17469048 | Sep 2021 | US |
Child | 17720641 | | US |