This application claims the benefit of Chinese Patent Application No. 201710611123.2, filed on Jul. 25, 2017, the entire content of which is incorporated herein by reference for all purposes.
The content of the following submission on ASCII text file is incorporated herein by reference in its entirety: a computer readable form (CRF) of the Sequence Listing (file name: 759892000340SEQLIST.TXT, date recorded: Jul. 3, 2018, size: 102 KB).
The present disclosure relates generally to data storage and retrieval, and more specifically, to techniques for achieving reliable and efficient DNA-based data storage and retrieval.
The concept of leveraging DNA as a vehicle for data storage and retrieval can be traced back to 1988, when Joe Davis and his collaborator created a synthetic DNA named “Microvenus” for encoding an icon and incorporated it into E. coli cells. Compared to traditional storage media such as magnetic tape and hard disk, DNA-based storage has the advantages of higher density (e.g., ~1 mm³ for storing 1 EB of data), longer-term storage (e.g., over 1 million years at −18° C.), and lower maintenance cost. DNA storage is a cutting-edge research field that relies on both oligonucleotide synthesis (especially high-throughput synthesis platforms such as CustomArray) for the generation of DNA storage media and sequencing (especially next-generation sequencing (NGS) platforms such as Illumina HiSeq 2500 and MiSeq) for information retrieval.
However, DNA-based data storage presently has a number of limitations. For example, the production cost of DNA synthesis is fairly high, while the speed of data retrieval can be low due to sequencing. As such, DNA-based storage has been considered to be more suitable for large-scale archival storage, which involves fewer reads and writes of the storage medium. Furthermore, many errors may be introduced in various stages of the process (e.g., encoding, writing, storing, decoding, reading, retrieval), thus compromising the input and output of the data stream. Exemplary errors include mutations, deletions, insertions, loss of DNA fragments induced during synthesis and sequencing, and degradation after long-term storage. Moreover, when a large amount of data is stored using DNA, it can be challenging to achieve random access to a portion of the data without retrieving the data in its entirety.
The present invention relates to techniques for achieving reliable and efficient DNA-based data storage and retrieval. Specifically, the present invention provides accurate, efficient, and reliable methods of storing input data on a nucleic acid, such as a deoxyribonucleic acid (“DNA”). In particular, the present invention utilizes a novel 5-bit transcoding framework to convert one or more data files into nucleic acid sequences (for example DNA sequences). The present invention also provides an integrated process that includes compression algorithm(s), error correction algorithm(s), and transcoding framework(s) for efficient and reliable data storage and retrieval. Further, the present invention allows for random data access, which is particularly beneficial when data on a large scale is stored together, but only partial information needs to be browsed at a given time. Data that can be stored in accordance with the methods disclosed herein includes any type of data that could be expressed in a digital manner (i.e., in binary data) including, for example, text files, high definition videos, images, and/or audio.
In some embodiments, there is provided a method for storing input data on nucleic acid, the method comprising: a) converting the input data into a set of nucleotide sequences, wherein the converting comprises i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain the set of nucleotide sequences; and b) synthesizing a set of nucleic acids comprising the set of nucleotide sequences.
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprising: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences.
In some embodiments, the data processing step comprises dividing the binary string into a sequence of non-overlapping 5-bit binary strings.
In some embodiments, the nucleotide encoding step comprises converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers.
In some embodiments, the nucleotide encoding step further comprises converting the string of integers using the 5-bit transcoding framework to obtain the set of nucleotide sequences.
In some embodiments, the nucleotide encoding step further comprises dividing the string of integers into a plurality of initial sub-sequences of integers having a predetermined length.
In some embodiments, the length of each of the plurality of initial sub-sequences of integers is determined based on an oligo length of a selected synthesis platform, a desired error tolerance, a size of the input data, a selected error correction code, or a combination thereof.
In some embodiments, the nucleotide encoding step further comprises adding index information to each of the plurality of the initial sub-sequences of integers to obtain a plurality of integer sub-sequences having index.
In some embodiments, the index information added to each of the plurality of the initial sub-sequences of integers comprises a sequence of integers, wherein the length of the sequence of integers is based on the size of the input data.
In some embodiments, the nucleotide encoding step comprises, after adding the index information, adding redundancy data to the plurality of integer sub-sequences having index, thereby obtaining a plurality of integer sub-sequences having redundancy.
In some embodiments, adding redundancy data to the plurality of integer sub-sequences having index comprises: creating an empty matrix, wherein the number of columns in the empty matrix is larger than the size of the plurality of integer sub-sequences having index, and wherein the number of rows of the empty matrix is larger than the number of integers in each of the plurality of integer sub-sequences having index; filling the empty matrix with the plurality of integer sub-sequences having index and data generated by applying an error correction coding; and obtaining the plurality of sub-sequences having redundancy based on the filled matrix.
In some embodiments, the number of columns of the empty matrix is determined based on an oligo length of a selected synthesis platform, the type of the error correction code, a predetermined error tolerance value, a size of the plurality of integer sub-sequences having index, or a combination thereof.
In some embodiments, the number of rows of the empty matrix is determined based on an oligo length of a selected synthesis platform, a type of the error correction code, a predetermined error tolerance value, a size of the plurality of integer sub-sequences having index, or a combination thereof.
In some embodiments, the error correction coding is Reed-Solomon (“RS”) coding.
In some embodiments, the data generated by applying an error correction coding is generated by applying string correction of the RS coding and/or block correction of the RS coding.
In some embodiments, the 5-bit transcoding framework is according to Table 2.
In some embodiments, R and Y are chosen based on: 1) being different from the nucleotide immediately in front of R or Y; and/or 2) the estimated GC content of the nucleotide sequence.
In some embodiments, the input data corresponds to a compressed file. In some embodiments, the input data corresponds to two or more files.
In some embodiments, the input data corresponds to a text file.
In some embodiments, the data processing step further comprises compressing the input data to obtain a compressed file and converting the compressed file into a binary string.
In some embodiments, the compressed file is compressed using the Lempel-Ziv-Markov chain algorithm (“LZMA”).
In some embodiments, the data processing step further comprises: grouping the two or more files into a TAR file.
In some embodiments, the TAR file is further compressed using the Lempel-Ziv-Markov chain algorithm (“LZMA”).
In some embodiments, the nucleotide encoding step further comprises appending a pair of primer sequences to the 5′ and 3′ ends of each nucleotide sequence of the set of nucleotide sequences.
In some embodiments, a pair of primers is attached to the set of synthesized nucleic acids.
In some embodiments, there is provided a method for storing two or more sets of input data on nucleic acid, the method comprising: a) separately converting the two or more sets of input data into two or more sets of corresponding nucleotide sequences according to any of the methods described herein; b) separately appending a pair of primer sequences to the 5′ and 3′ end of each set of the two or more sets of nucleotide sequences, wherein the pairs of primers for the two or more sets of corresponding nucleotide sequences are different from each other; and c) synthesizing two or more sets of nucleic acids comprising the two or more sets of corresponding nucleotide sequences, respectively.
In some embodiments, each pair of primers has a sequence that is different from any one of the two or more sets of corresponding nucleotide sequences or complementary sequences thereof.
In some embodiments, the set of synthesized nucleic acids has GC content ranging from 30% to 70%. In some embodiments, the set of synthesized nucleic acids has GC content of less than about 70%.
In some embodiments, the set of synthesized nucleic acids is stored. In some embodiments, the set of synthesized nucleic acids is stored by drying. In some embodiments, the set of synthesized nucleic acids is stored by lyophilization.
In some embodiments, the set of synthesized nucleic acids is immobilized on a carrier. In some embodiments, the carrier is a microarray.
In some embodiments, there is provided a method for retrieving output data stored on nucleic acid, the method comprising: a) obtaining a set of nucleotide sequences of a set of nucleic acids; and b) converting the set of nucleotide sequences into the output data, wherein the converting comprises: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting the binary string into the output data, thereby obtaining the output data.
In some embodiments, the set of nucleic acids is amplified prior to retrieving the output data.
In some embodiments, the set of nucleic acids is sequenced to generate a plurality of sequence reads.
In some embodiments, the plurality of sequence reads are paired, merged, and filtered to obtain the set of nucleotide sequences.
In some embodiments, there is provided a computer implemented method for converting a set of nucleotide sequences into output data, the method comprising: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting the binary string into the output data.
In some embodiments, the nucleotide decoding step comprises converting the set of nucleotide sequences into a plurality of integer sub-sequences comprising integers ranging from 0-31.
In some embodiments, the nucleotide decoding step further comprises applying error correction coding to the plurality of integer sub-sequences, thereby obtaining the plurality of integer sub-sequences having index.
In some embodiments, the step of applying error correction coding comprises: i) applying RS coding string correction to the plurality of integer sub-sequences to obtain a plurality of consensus integer sub-sequences; and ii) applying RS coding block correction to the plurality of consensus integer sub-sequences to obtain the plurality of integer sub-sequences having index.
In some embodiments, the nucleotide decoding step further comprises removing the index from the plurality of integer sub-sequences having index to obtain a plurality of core sub-sequences of integers.
In some embodiments, the nucleotide decoding step further comprises merging the core sub-sequences of integers into a string of integers.
In some embodiments, the nucleotide decoding step further comprises converting the string of integers into a binary string.
In some embodiments, the output data is stored in a compressed file. In some embodiments, the data processing step further comprises decompressing the compressed file. In some embodiments, the decompressing is carried out through the LZMA algorithm.
In some embodiments, the output data corresponds to a plurality of files. In some embodiments, the plurality of files is extracted from the output data through the TAR algorithm.
In some embodiments, the 5-bit transcoding framework is according to Table 2.
In some embodiments, the set of nucleic acids comprises primer sequences at the 3′ and 5′ ends and the method comprises removing the primer sequences before the nucleotide decoding step.
In some embodiments, there is provided a method for retrieving output data stored on a set of nucleic acids of interest, wherein the set of nucleic acids of interest is one of a plurality of sets of nucleotide sequences present in a mixture, each set encoding a different set of output data and having a different set of primer pairs at the 3′ and 5′ end, the method comprising: a) amplifying the set of nucleic acids of interest using the primer pair corresponding to the set of nucleic acids of interest; b) obtaining a set of nucleotide sequences of the amplified nucleic acids; and c) converting the set of nucleotide sequences into the output data according to the method of any one of claims 41-53; thereby obtaining the output data.
In some embodiments, there is provided a method for retrieving two or more sets of output data stored on corresponding two or more sets of nucleic acids of interest, wherein the two or more sets of nucleic acids of interest are among a plurality of nucleotide sequences present in a mixture, each set encoding a different set of output data and having a different set of primer pairs at the 3′ and 5′ end, the method comprising: a) amplifying (e.g., separately amplifying or amplifying together) the two or more sets of nucleic acids of interest using primer pairs corresponding to the two or more sets of nucleic acids of interest; b) obtaining two or more sets of nucleotide sequences of the amplified nucleic acids; and c) separately converting the two or more sets of nucleotide sequences into the two or more sets of output data according to any of the methods described herein; thereby obtaining the two or more sets of output data.
In some embodiments, there is provided a non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to carry out any of the methods described herein.
Also provided are systems for providing nucleic acid-based data storage or data retrieval from a nucleic acid, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for carrying out any of the methods described herein.
Also provided are electronic devices for providing nucleic acid-based data storage or data retrieval from a nucleic acid comprising means for carrying out any of the methods described herein.
The present invention provides accurate, efficient, and reliable methods of storing input data on a nucleic acid, such as a deoxyribonucleic acid (“DNA”). Specifically, the present invention utilizes a novel 5-bit transcoding framework to convert one or more data files into nucleic acid sequences (for example DNA sequences). This novel transcoding framework allows for effective nucleic acid sequence design that strikes the right GC content, avoids certain homopolymers (e.g., homopolymers that are 4 or more nucleotides long), and reduces the error rate in nucleic acid synthesis and amplification. The present invention also provides an integrated process that includes compression algorithm(s), error correction algorithm(s), and transcoding framework(s) for efficient and reliable data storage and retrieval. The methods provided herein can be used for storing data of any size, including large files. Further, the present invention allows for random data access, which is particularly beneficial when data on a large scale is stored together, but only partial information needs to be browsed at a given time. Data that can be stored in accordance with the methods disclosed herein includes any type of data that could be expressed in a digital manner (i.e., in binary data) including, for example, text files, high definition videos, images, and/or audio.
Thus, the present application in one aspect provides methods for storing input data on a set of nucleic acids as well as methods for converting input data into a set of nucleotide sequences. In another aspect, there are provided methods for retrieving output data stored on a nucleic acid as well as methods of converting a set of nucleotide sequences into output data. Also provided are systems and non-transitory computer-readable storage medium for storing one or more programs for carrying out any one or more steps of the methods described herein.
It is understood that embodiments of the invention described herein include “consisting” and/or “consisting essentially of” embodiments.
Reference to “about” a value or parameter herein includes (and describes) variations that are directed to that value or parameter per se. For example, description referring to “about X” includes description of “X”.
As used herein, reference to “not” a value or parameter generally means and describes “other than” a value or parameter. For example, the method is not used to treat cancer of type X means the method is used to treat cancer of types other than X.
As used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise.
As used herein and in the appended claims, “a set of” refers to one or a plurality of referents unless the context clearly dictates otherwise. A set of nucleic acids can be nucleic acids encoding data from the same file or same group of files compressed together. In some embodiments, nucleic acids in the same file can have the same set of primers attached to the 5′ and 3′ ends.
The present invention in one aspect provides methods (such as computer implemented methods) for converting input data into a set of nucleotide sequences. The method typically comprises a data processing step that converts the input data into a binary string and a nucleotide encoding step that converts the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The methods are useful for storing input data on a set of nucleic acids, which involves first converting the input data into a set of nucleotide sequences and then synthesizing a set of nucleic acids comprising the set of nucleotide sequences.
The input data can represent any number of files of any type, such as text files, image files, audio/video files (such as high-definition files), etc. The files can be non-compressed or compressed. When a file is non-compressed, it can first be compressed before being converted into a binary string. For example, the file can be compressed into a LZMA file (e.g., A.lzma) using the Lempel-Ziv-Markov Chain algorithm. In some embodiments, two or more files (such as three, four, five, six, and more files) are first grouped together, for example, into a TAR file (e.g., A.tar), and the TAR file is further compressed into a LZMA file (e.g., A.tar.lzma). As such, the method can allow storage of multiple files (e.g., 1-5, 5-10, 10-15, 15-25, 25-35, 35-50) in a single nucleic acid composition.
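By way of non-limiting illustration, the grouping and compression described above can be sketched in Python using only the standard library. The function names `pack_files` and `to_binary_string`, and the choice of the legacy `.lzma` container, are assumptions made for this sketch and are not taken from the disclosure.

```python
import io
import lzma
import tarfile


def pack_files(files: dict[str, bytes]) -> bytes:
    """Group named payloads into a TAR archive, then LZMA-compress the archive."""
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    # Assumption: FORMAT_ALONE yields the legacy .lzma container (cf. A.tar.lzma).
    return lzma.compress(tar_buffer.getvalue(), format=lzma.FORMAT_ALONE)


def to_binary_string(data: bytes) -> str:
    """Render bytes as a string of '0'/'1' characters, 8 bits per byte."""
    return "".join(f"{byte:08b}" for byte in data)


if __name__ == "__main__":
    packed = pack_files({"chapter1.txt": b"hello", "chapter2.txt": b"world"})
    print(len(to_binary_string(packed)), "bits")
```

The resulting binary string is the input to the nucleotide encoding step.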
In some embodiments, to allow random access to locations within a single file, the single file can be divided into multiple sets of data and the multiple sets of data are each compressed and processed as described below. For example, a digital file corresponding to a book having 10 chapters can be divided into 10 files, with each file corresponding to a single chapter. The 10 files are then separately compressed and processed to achieve random access of any chapter.
The data processing step converts the input data into a binary string. The binary string can be directly converted into a set of nucleotide sequences, for example by following a 5-bit transcoding framework described herein. Alternatively, the binary string can be further converted into a string of integers which is then converted into a set of nucleotide sequences, for example, by following a 5-bit transcoding framework. In some embodiments, the string of integers is further subjected to error correction coding and/or other processes to generate a plurality of integer sub-sequences having redundancy, and the plurality of integer sub-sequences having redundancy are then converted into a set of nucleotide sequences, for example by following a 5-bit transcoding framework.
Thus, for example, in some embodiments, there is provided a method (such as a computer implemented method) for converting input data into a set of nucleotide sequences, wherein the converting comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. In some embodiments, there is provided a method for storing input data on nucleic acid, the method comprises: a) converting the input data into a set of nucleotide sequences, wherein the converting comprises i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences; and b) synthesizing a set of nucleic acids comprising the set of nucleotide sequences.
In some embodiments, the data processing step comprises dividing the binary string into a sequence of non-overlapping 5-bit binary strings, each of which can be further converted into an integer ranging from 0 to 31 to obtain a string of integers. The string of integers can be directly converted into a set of nucleotide sequences, for example using the 5-bit transcoding framework. Alternatively, the string of integers is subjected to further manipulation as described below.
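As a minimal sketch of the division into 5-bit strings, the following Python function maps a binary string to integers from 0 to 31. The zero-padding of a trailing partial group is an assumption of this sketch; the disclosure does not state how a binary string whose length is not a multiple of 5 is handled.

```python
def bits_to_integers(bits: str) -> list[int]:
    """Split a binary string into non-overlapping 5-bit groups and
    convert each group into an integer in the range 0..31."""
    # Assumption: pad the tail with zeros up to a multiple of 5 bits.
    padding = (-len(bits)) % 5
    bits = bits + "0" * padding
    return [int(bits[i:i + 5], 2) for i in range(0, len(bits), 5)]


if __name__ == "__main__":
    print(bits_to_integers("0000011111"))  # two groups: 00000 and 11111
```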
Specifically, the string of integers can be divided into a plurality of initial sub-sequences of integers having a predetermined length. The predetermined length of the initial sub-sequences of integers is calculated based on a plurality of factors including the oligo length of the synthesis platform, the error correction code selected, the desired error tolerance, the synthesis error rate of the oligos, and/or the total encoded data size, as discussed in detail below. For example, the integer string can be sliced into a list of non-overlapping integer sub-sequences using a fixed-length (e.g., 22 integers) sliding window. An index can then be added to each of the plurality of the initial sub-sequences of integers to generate a plurality of integer sub-sequences with index. The index can itself consist of integers ranging from 0 to 31. The length of the index is flexible and depends on the throughput of the DNA synthesis and the data size.
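The slicing and indexing can be illustrated with the following hedged Python sketch. Placing the index at the front of each sub-sequence, writing it as big-endian base-32 digits, and zero-padding the last window are all assumptions made for illustration; `slice_with_index` and its parameters are hypothetical names.

```python
def slice_with_index(integers: list[int], window: int = 22,
                     index_length: int = 2, pad_value: int = 0) -> list[list[int]]:
    """Cut an integer string into fixed-length windows and prefix each
    window with its position written as base-32 digits (values 0..31)."""
    chunks = [integers[i:i + window] for i in range(0, len(integers), window)]
    if len(chunks) > 32 ** index_length:
        raise ValueError("index_length is too short for this amount of data")
    indexed = []
    for position, chunk in enumerate(chunks):
        # Assumption: the final, shorter window is padded to full length.
        chunk = chunk + [pad_value] * (window - len(chunk))
        index = [(position // 32 ** k) % 32
                 for k in reversed(range(index_length))]
        indexed.append(index + chunk)
    return indexed


if __name__ == "__main__":
    for row in slice_with_index(list(range(5)), window=2):
        print(row)
```

With an index of two integers, up to 32² = 1024 sub-sequences can be addressed, which is why the index length grows with the data size.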
In some embodiments, redundancy data is added to generate a plurality of integer sub-sequences having redundancy. For example, Reed-Solomon (RS) error correction coding is applied to the plurality of integer sub-sequences to generate a new list of integer sub-sequences having redundancy through string correction and block correction of RS coding. Redundancy refers to the excess of synthesized oligos that provides robustness to dropout. Redundancy in string correction is helpful for correcting transitions and transversions within an oligo. Redundancy in block correction enables correction of insertions, deletions, and complete loss of information.
In one exemplary embodiment, adding redundancy data to the plurality of integer sub-sequences having index comprises: creating an empty matrix, wherein the number of columns in the empty matrix is larger than the size of the plurality of integer sub-sequences having index, and wherein the number of rows of the empty matrix is larger than the number of integers in each of the plurality of integer sub-sequences having index; filling the empty matrix with the plurality of integer sub-sequences having index and data generated by applying an error correction coding; and obtaining the plurality of sub-sequences having redundancy based on the filled matrix. The number of columns and/or rows of the empty matrix can be determined based on the type of the error correction code, a predetermined error tolerance value, a size of the plurality of integer sub-sequences having index, or a combination thereof. In some embodiments, the error correction coding is Reed-Solomon (“RS”) coding. In some embodiments, the data generated by applying an error correction coding is generated by applying string correction of the RS coding and block correction of the RS coding.
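The matrix layout can be sketched as follows, purely for illustration. This sketch assumes that each indexed sub-sequence occupies one column, that the extra columns hold block-level check symbols, and that the extra rows hold string-level check symbols. The function `toy_parity` is a deliberately simple placeholder and is NOT Reed-Solomon coding; an actual embodiment would substitute an RS encoder over 5-bit symbols for `parity_fn`. All names and the fill order are hypothetical.

```python
from typing import Callable


def toy_parity(symbols: list[int], count: int) -> list[int]:
    """Placeholder check symbols in 0..31. NOT Reed-Solomon; it only
    stands in for an encoder so that the matrix layout can be shown."""
    return [sum((k + 1) ** j * s for k, s in enumerate(symbols)) % 32
            for j in range(count)]


def add_redundancy(indexed: list[list[int]], string_parity: int = 2,
                   block_parity: int = 1,
                   parity_fn: Callable[[list[int], int], list[int]] = toy_parity
                   ) -> list[list[int]]:
    """Fill a matrix with the data columns plus check symbols and return
    one sub-sequence having redundancy per column."""
    n_data_cols, n_data_rows = len(indexed), len(indexed[0])
    n_cols = n_data_cols + block_parity   # more columns than sub-sequences
    n_rows = n_data_rows + string_parity  # more rows than integers per sub-sequence
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for col, sequence in enumerate(indexed):
        for row, value in enumerate(sequence):
            matrix[row][col] = value
    # Block-level symbols: computed across sub-sequences, row by row.
    for row in range(n_data_rows):
        matrix[row][n_data_cols:] = parity_fn(matrix[row][:n_data_cols], block_parity)
    # String-level symbols: computed within each column, including the new ones.
    for col in range(n_cols):
        column = [matrix[row][col] for row in range(n_data_rows)]
        for offset, value in enumerate(parity_fn(column, string_parity)):
            matrix[n_data_rows + offset][col] = value
    return [[matrix[row][col] for row in range(n_rows)] for col in range(n_cols)]


if __name__ == "__main__":
    for sequence in add_redundancy([[1, 2, 3], [4, 5, 6]]):
        print(sequence)
```

Computing the block-level symbols before the string-level symbols mirrors the decoding order described herein, in which string correction is applied before block correction.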
In some embodiments, the nucleotide encoding step further comprises appending a pair of primer sequences to the 5′ and 3′ ends of a set of nucleotide sequences. The primers can be used for amplifying the set of nucleic acids, e.g. by PCR amplification methods. In some embodiments, the primer sequences are added to the set of nucleotide sequences before synthesis. Alternatively, primers can be attached to synthesized nucleic acids, for example through ligation.
The methods can be useful for storing two or more sets of input data on a nucleic acid. Specifically, the method comprises: a) separately converting the two or more sets of input data into two or more sets of corresponding nucleotide sequences; b) separately appending a pair of primer sequences to the 5′ and 3′ end of each of the two or more sets of nucleotide sequences, wherein the primers for each of the two or more sets of corresponding nucleotide sequences are different from each other; and c) synthesizing a plurality of sets of nucleic acids comprising the two or more sets of corresponding nucleotide sequences respectively. Each of the pair of primers can have a sequence that is different from any one of the two or more corresponding nucleotide sequences or complementary sequences thereof.
The synthesized nucleic acids can have GC content ranging from about 30% to about 70%. For example, the synthesized nucleic acids can have GC content ranging from any of about 40% to about 60%, about 30% to about 40%, about 40% to about 50%, about 50% to about 60%, or about 60% to about 70%. In some embodiments, the synthesized nucleic acids have no homopolymers of longer than 3 nucleotides (e.g., no homopolymers of 4, 5, 6, 7, 8, 9, or 10 nucleotides). The synthesized nucleic acids in some embodiments are oligonucleotides, for example, oligonucleotides of about any of 50, 150, 200, 300, or 400 nucleotides long. In some embodiments, a set of nucleic acids comprises about any of 1, 2, 3, 5, 10, 15, or more oligonucleotides.
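The GC content and homopolymer constraints above lend themselves to a simple check on each designed sequence. The following Python sketch is illustrative only; the function names and default thresholds (30% to 70% GC, runs of at most 3 identical nucleotides) are taken from the ranges stated above but are otherwise assumptions of this sketch.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C nucleotides in the sequence."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def longest_homopolymer(sequence: str) -> int:
    """Length of the longest run of one repeated nucleotide."""
    longest = run = 0
    previous = None
    for base in sequence:
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def passes_design_rules(sequence: str, gc_min: float = 0.30,
                        gc_max: float = 0.70, max_run: int = 3) -> bool:
    """True if the GC content is within range and no homopolymer exceeds max_run."""
    return (gc_min <= gc_content(sequence) <= gc_max
            and longest_homopolymer(sequence) <= max_run)


if __name__ == "__main__":
    print(passes_design_rules("ACGTACGT"), passes_design_rules("AAAACGCG"))
```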
In some embodiments, the method further comprises storing the set of synthesized nucleic acids. In some embodiments, the set of nucleic acids is stored by drying, such as lyophilization. The set of nucleic acids can be stored as dry compositions, including lyophilized compositions. In some embodiments, the set of nucleic acids is immobilized on a carrier, including a solid carrier such as a microarray. In some embodiments, the nucleic acids are stored on a microarray having a density of about 5 μg per an area of 1 inch*3 inch (e.g., in CustomArray 12K chip). In some embodiments, the size of the input data is at least about 50 MB.
The present invention in another aspect provides methods (such as computer implemented methods) for converting a set of nucleotide sequences into output data. The method is essentially the reverse of the encoding procedure, and typically comprises a nucleotide decoding step which converts the set of nucleotide sequences into a binary string, e.g., by using a 5-bit transcoding framework, and a data processing step which converts the binary string into the output data. The methods are useful for retrieving output data stored in a set of nucleic acids, which involves obtaining the nucleotide sequences of the set of nucleic acids and then converting the set of nucleotide sequences into the output data.
In some embodiments, the set of nucleic acids is first amplified, for example by using primers present at the 5′ and 3′ ends of the set of nucleic acids. The amplified nucleic acids can then be subjected to sequencing, for example next generation sequencing. Next generation sequencing technologies are generally known in the art. For example, the nucleic acids can be sequenced by using the Illumina sequencing methods. Sequences belonging to a specific file can be obtained by aligning the primer sequences. In some embodiments, the method comprises an NGS library preparation. When the set of nucleic acids is present in a mixture comprising different sets of nucleic acids encoding different sets of data, the set of nucleic acids of interest can be specifically amplified by using the primer pair unique to the set of nucleic acids of interest, thus allowing random access of data corresponding to the set of nucleic acids of interest. If several compressed files need to be read and decoded in a single run of next generation sequencing, all of their corresponding sets of nucleic acids are amplified through PCR and all corresponding primer pairs are used.
In some embodiments, the method comprises paired-end next generation sequencing followed by read pairing and merging, in which the forward and reverse reads from a single cluster are paired and merged into a single read, and merged reads of irregular length are filtered out. The reads can then be grouped by compressed file according to their primer sequences. The primers can then be removed, and the nucleotide sequences can either be converted into a plurality of integer sub-sequences comprising integers ranging from 0-31, or directly converted into a binary string which is subsequently converted into the output data.
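The length filtering, primer-based grouping, and primer removal can be sketched as below. This is a simplified illustration that assumes exact primer matches, a single expected read length, and a reverse primer supplied as it appears at the 3′ end of the merged read; a practical pipeline would likely tolerate mismatches through alignment. The function name and its arguments are hypothetical.

```python
def group_reads_by_primer(reads: list[str],
                          primer_pairs: dict[str, tuple[str, str]],
                          expected_length: int) -> dict[str, list[str]]:
    """Discard merged reads of irregular length, assign each remaining read
    to the file whose primer pair flanks it, and strip the primers."""
    grouped: dict[str, list[str]] = {name: [] for name in primer_pairs}
    for read in reads:
        if len(read) != expected_length:
            continue  # irregular length: filtered out
        for name, (forward, reverse) in primer_pairs.items():
            # Assumption: exact match at both ends of the merged read.
            if read.startswith(forward) and read.endswith(reverse):
                grouped[name].append(read[len(forward):len(read) - len(reverse)])
                break
    return grouped


if __name__ == "__main__":
    pairs = {"file_a": ("AAC", "TTT"), "file_b": ("GGC", "TCC")}
    print(group_reads_by_primer(["AACCGTGTTT", "GGCCGTGTCC", "AACCGT"], pairs, 10))
```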
In some embodiments, the method further comprises applying error correction coding to the plurality of integer sub-sequences to obtain a plurality of integer sub-sequences having index. In one exemplary embodiment, the step of applying error correction coding comprises: i) applying RS coding string correction to the plurality of integer sub-sequences to obtain a plurality of consensus integer sub-sequences; and ii) applying RS coding block correction to the plurality of consensus integer sub-sequences to obtain the plurality of integer sub-sequences having index. Because each designed nucleic acid is synthesized as many molecular copies and sequenced many times, many reads can represent one nucleic acid. Owing to errors introduced during both high-throughput synthesis and sequencing, these reads may contain variants, but reads that exactly match the originally designed nucleic acid are still expected to be the most numerous. Through highest-frequency-based correction at every position of the integer string, all integer strings sharing an identical index can be corrected and merged into a consensus integer string between the string correction and the block correction.
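The highest-frequency-based correction can be illustrated with a short Python sketch. It assumes that the index occupies the first `index_length` integers of each read, that all reads sharing an index have equal length, and that ties are broken by first occurrence; these are assumptions of the sketch, and `consensus_by_index` is a hypothetical name.

```python
from collections import Counter, defaultdict


def consensus_by_index(reads: list[list[int]],
                       index_length: int = 2) -> dict[tuple[int, ...], list[int]]:
    """Group integer strings by their index and keep, at every position,
    the value that occurs most often within the group."""
    groups: dict[tuple[int, ...], list[list[int]]] = defaultdict(list)
    for read in reads:
        groups[tuple(read[:index_length])].append(read)
    consensus = {}
    for index, members in groups.items():
        # zip(*members) iterates position by position across the group.
        consensus[index] = [Counter(column).most_common(1)[0][0]
                            for column in zip(*members)]
    return consensus


if __name__ == "__main__":
    reads = [[0, 1, 5, 6, 7], [0, 1, 5, 9, 7], [0, 1, 5, 6, 7]]
    print(consensus_by_index(reads))
```

In this example the single variant read (value 9 at the fourth position) is outvoted by the two reads carrying the designed value.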
The index from the plurality of integer sub-sequences having index can then be removed to obtain a plurality of core sub-sequences of integers. The integer strings can then be concatenated into a full integer string and then converted into a binary string. The binary string can then be written into a file, such as a compressed file. The compressed file can then be decompressed, for example by using the LZMA algorithm. If the decompressed file includes data corresponding to multiple files, the decompressed file is further processed (e.g., extracted) by the TAR algorithm to obtain the multiple files.
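As a non-limiting sketch of the final steps above, the following converts a full integer string back into bytes, decompresses it, and extracts the archived files. The function names and the `n_bytes` parameter (the known length of the compressed file, used here to discard padding bits) are assumptions of this sketch; Python's standard `lzma` and `tarfile` modules stand in for whichever LZMA and TAR implementations are actually used.

```python
import io
import lzma
import tarfile


def integers_to_bytes(ints, n_bytes):
    """Concatenate 5-bit integers into a bit string and cut it into bytes."""
    bits = "".join(format(i, "05b") for i in ints)[: n_bytes * 8]
    return bytes(int(bits[k:k + 8], 2) for k in range(0, len(bits), 8))


def recover_files(ints, n_bytes):
    """Decompress the recovered byte string and extract each archived file."""
    archive = lzma.decompress(integers_to_bytes(ints, n_bytes))
    files = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        for member in tar.getmembers():
            if member.isfile():
                files[member.name] = tar.extractfile(member).read()
    return files
```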
In some embodiments, the method is useful for retrieving output data stored on a set of nucleic acids of interest, wherein the set of nucleic acids of interest is one of a plurality of sets of nucleotide sequences present in a mixture, each set encoding a different set of output data and having a different set of primer pairs at the 3′ and 5′ ends. The method comprises: a) amplifying the set of nucleic acids of interest using the primer pair corresponding to the set of nucleic acids of interest; b) obtaining a set of nucleotide sequences of the set of amplified nucleic acids; and c) converting the set of nucleotide sequences into the output data according to the method of any one of claims 41-53; thereby obtaining the output data.
In some embodiments, there is provided a method for retrieving two or more sets of output data stored on corresponding two or more sets of nucleic acids of interest, wherein the two or more sets of nucleic acids of interest are among a plurality of sets of nucleotide sequences present in a mixture, each set encoding a different set of output data and having a different set of primer pairs at the 3′ and 5′ ends, the method comprising: a) amplifying (e.g., separately amplifying or amplifying together) the two or more sets of nucleic acids of interest using primer pairs corresponding to the two or more sets of nucleic acids of interest; b) obtaining two or more sets of nucleotide sequences of the two or more sets of amplified nucleic acids; and c) separately converting the two or more sets of nucleotide sequences into the two or more sets of output data; thereby obtaining the two or more sets of output data.
5-bit Transcoding Framework
The methods of the present invention utilize a novel 5-bit transcoding framework for converting a binary string or an integer string into a set of nucleotide sequences. “5-bit transcoding framework” refers to the conversion according to Table 1 below. Generally, every 5 consecutive bits of a binary string can be represented as an integer ranging from 0 to 31 and then as 3 nucleotides (i.e., a 3-mer). Because nucleic acids have four bases (e.g., A, T, G and C), there are 16 kinds of 2-mers (i.e., NN): AA, AT, AG, AC, TA, TT, TG, TC, GA, GT, GG, GC, CA, CT, CG and CC. If a degenerate base R or Y is concatenated after each 2-mer, the resulting 3-mers (NNR/NNY) consist of 32 kinds, which match the 32 integers ranging from 0 to 31 one-to-one and thus allow a binary string to be converted into a DNA sequence.
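The counting argument above (16 2-mers times 2 degenerate classes gives 32 = 2^5 3-mers) can be illustrated with a minimal sketch. The particular assignment of integers to 3-mers below is hypothetical and chosen only for illustration; it is not asserted to match the assignment in the tables of this disclosure.

```python
BASES = "ATGC"  # base order follows the 2-mer listing in the text


def int_to_degenerate_3mer(n):
    """Map an integer 0..31 to a 3-mer of the form NNR or NNY.

    Hypothetical assignment: the upper four bits select one of the 16
    2-mers and the lowest bit selects the degenerate class (0 -> R, 1 -> Y).
    """
    if not 0 <= n <= 31:
        raise ValueError("integer must be in the range 0..31")
    two_mer_index, degenerate_class = divmod(n, 2)
    first, second = divmod(two_mer_index, 4)
    return BASES[first] + BASES[second] + "RY"[degenerate_class]


# All 32 integers map to distinct degenerate 3-mers.
ALL_3MERS = [int_to_degenerate_3mer(n) for n in range(32)]
```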
In some embodiments, R is selected from any two of A, T, G, and C, while Y is selected from the corresponding other two of A, T, G, and C. For example, in some embodiments, R is selected from A and G while Y is selected from T and C. In some embodiments, R is selected from A and C while Y is selected from T and G. In some embodiments, R is selected from T and G while Y is selected from A and C. In some embodiments, R is selected from T and C while Y is selected from A and G.
The choice of the nucleotide corresponding to R and Y can depend on the preceding base, for example for the purposes of maintaining a desirable GC content and/or avoiding homopolymers. For example, in a scheme in which R is selected from A and G and Y is selected from C and T, whether A or G is chosen for R and whether C or T is chosen for Y depend on the preceding base (i.e., the 2nd base of the 3-mer). In some embodiments, R and Y are chosen so that the 2nd and 3rd bases are different. In some embodiments, R and Y are chosen to maintain a desirable GC balance. So long as these rules are followed, R and Y can be randomly chosen. The coding potential of this transcoding framework is 1.67 bits per nucleotide (i.e., 5 bits per 3 nt).
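One possible way to resolve R and Y into concrete bases under the rules above is sketched below for the scheme in which R is A or G and Y is C or T. The deterministic GC-balancing tie-break is an assumption of this sketch (the text permits a random choice among admissible bases), and the function name is hypothetical.

```python
CANDIDATES = {"R": "AG", "Y": "CT"}  # scheme: R from A/G, Y from C/T


def resolve_degenerate(three_mers):
    """Turn degenerate 3-mers (NNR/NNY) into a concrete nucleotide sequence.

    Rule 1: the 3rd base must differ from the 2nd base of the same 3-mer.
    Rule 2 (assumed tie-break): if both candidates remain, pick the one
    that moves the running GC fraction toward 50%.
    """
    seq = ""
    for three_mer in three_mers:
        seq += three_mer[:2]
        options = [b for b in CANDIDATES[three_mer[2]] if b != seq[-1]]
        gc_count = sum(b in "GC" for b in seq)
        want_gc = 2 * gc_count < len(seq)  # True when GC is below 50% so far
        preferred = [b for b in options if (b in "GC") == want_gc]
        seq += (preferred or options)[0]
    return seq
```

Because R and Y draw from disjoint base pairs, the class can be read back from the 3rd base alone, so decoding does not depend on how the choice was made. This sketch only separates the 2nd and 3rd bases of each 3-mer; it does not inspect the boundary between one 3-mer and the next.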
Table 2 provides an exemplary 5-bit transcoding framework. In the particular scheme depicted in Table 2, R is to be selected from A and G, while Y is to be selected from C and T. It is to be understood that other transcoding frameworks following the same principle can also be used.
The nucleic acids comprising the desirable nucleotide sequences can be synthesized using any nucleic acid synthesis methods. In some embodiments, the nucleic acids are synthesized by chemical synthesis. Methods of high throughput nucleic acid synthesis are described in International Application No. WO2002US40580, published as WO03052383, titled “COMBINATORIAL SYNTHESIS ON ARRAYS”, filed Feb. 17, 2002, and a publication titled “ELECTROCHEMICALLY GENERATED ACID AND ITS CONTAINMENT TO 100 MICRON REACTION AREAS FOR THE PRODUCTION OF DNA MICROARRAYS” by Maurer et al., published in December 2016, which are incorporated herein by reference in their entireties.
The nucleic acids, once synthesized, can be stored in various media. In some embodiments, the nucleic acids are dried (e.g., lyophilized) and stored in a vial. In some embodiments, the nucleic acids are immobilized on a carrier, for example a solid carrier such as a microarray.
Also provided herein are non-transitory computer-readable storage media storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to carry out one or more steps of any of the methods described herein.
In some embodiments, there is provided a system for providing nucleic acid-based data storage or data retrieval from a nucleic acid, the system comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for carrying out one or more steps of any one of the methods described herein.
In some embodiments, there is provided an electronic device for providing nucleic acid-based data storage or data retrieval from a nucleic acid, the device comprising means for carrying out any one of the methods described herein.
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The data processing step comprises dividing the binary string into a sequence of non-overlapping 5-bit binary strings. The nucleotide encoding step comprises converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers and converting the string of integers using the 5-bit transcoding framework to obtain the set of nucleotide sequences.
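A minimal sketch of the data processing step described above is given below: the input bytes are written as a binary string, divided into non-overlapping 5-bit groups, and each group is converted into an integer from 0 to 31. Padding the final group with zero bits is an assumption of this sketch, as is the function name.

```python
def bytes_to_5bit_integers(data):
    """Convert bytes into a list of integers in the range 0..31."""
    bits = "".join(format(byte, "08b") for byte in data)
    bits += "0" * (-len(bits) % 5)  # assumed zero padding to a multiple of 5
    return [int(bits[i:i + 5], 2) for i in range(0, len(bits), 5)]
```

For example, the single byte 0xFF gives the bit string 11111111, which is padded to 1111111100 and split into 11111 and 11100.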
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The data processing step comprises dividing the binary string into a sequence of non-overlapping 5-bit binary strings. The nucleotide encoding step comprises converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers and converting the string of integers using the 5-bit transcoding framework to obtain the set of nucleotide sequences. The nucleotide encoding step further comprises dividing the string of integers into a plurality of initial sub-sequence of integers having a predetermined length.
In some embodiments, the length of each of the plurality of initial sub-sequence of integers is determined based on an oligo length of a selected synthesis platform, a desired error tolerance, a size of the input data, a selected error correction code, or a combination thereof.
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The data processing step comprises dividing the binary string into a sequence of non-overlapping 5-bit binary strings. The nucleotide encoding step comprises converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers and converting the string of integers using the 5-bit transcoding framework to obtain the set of nucleotide sequences. The nucleotide encoding step further comprises dividing the string of integers into a plurality of initial sub-sequence of integers having a predetermined length. The nucleotide encoding step further comprises adding index information to each of the plurality of the initial sub-sequences of integers to obtain a plurality of integer sub-sequences having index.
In some embodiments, the index information added to each of the plurality of the initial sub-sequences of integers comprises a sequence of integers, wherein the length of the sequence of integers is based on the size of the input data.
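The splitting and indexing steps can be sketched as follows. The sketch assumes that the index is the position of each sub-sequence written as base-32 digits (most significant first) and that the number of digits is the smallest number able to address every sub-sequence, so that the index length grows with the size of the input data. The function name, the zero padding of the last sub-sequence, and the digit order are assumptions.

```python
def add_index(ints, chunk_len):
    """Split an integer string into fixed-length chunks and prepend an index."""
    chunks = [ints[i:i + chunk_len] for i in range(0, len(ints), chunk_len)]
    if chunks and len(chunks[-1]) < chunk_len:
        chunks[-1] = chunks[-1] + [0] * (chunk_len - len(chunks[-1]))

    # Smallest number of base-32 digits that can address every chunk.
    index_len = 1
    while 32 ** index_len < len(chunks):
        index_len += 1

    indexed = []
    for position, chunk in enumerate(chunks):
        digits, remaining = [], position
        for _ in range(index_len):
            remaining, digit = divmod(remaining, 32)
            digits.append(digit)
        indexed.append(digits[::-1] + chunk)
    return indexed
```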
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The data processing step comprises dividing the binary string into a sequence of non-overlapping 5-bit binary strings. The nucleotide encoding step comprises converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers and converting the string of integers using the 5-bit transcoding framework to obtain the set of nucleotide sequences. The nucleotide encoding step further comprises dividing the string of integers into a plurality of initial sub-sequence of integers having a predetermined length. The nucleotide encoding step further comprises adding index information to each of the plurality of the initial sub-sequences of integers to obtain a plurality of integer sub-sequences having index. The nucleotide encoding step further comprises, after adding the index information, adding redundancy data to the plurality of integer sub-sequences having index, thereby obtaining a plurality of integer sub-sequences having redundancy.
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The data processing step comprises dividing the binary string into a sequence of non-overlapping 5-bit binary strings. The nucleotide encoding step comprises converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers and converting the string of integers using the 5-bit transcoding framework to obtain the set of nucleotide sequences. The nucleotide encoding step further comprises dividing the string of integers into a plurality of initial sub-sequence of integers having a predetermined length. The nucleotide encoding step further comprises adding index information to each of the plurality of the initial sub-sequences of integers to obtain a plurality of integer sub-sequences having index. The nucleotide encoding step further comprises, after adding the index information, adding redundancy data to the plurality of integer sub-sequences having index, thereby obtaining a plurality of integer sub-sequences having redundancy. Adding redundancy data to the plurality of integer sub-sequences having index comprises: creating an empty matrix, wherein the number of columns in the empty matrix is larger than the size of the plurality of integer sub-sequences having index, and wherein the number of rows of the empty matrix is larger than the number of integers in each of the plurality integer sub-sequences having index; filling the empty matrix with the plurality of integer sub-sequences having index and data generated by applying an error correction coding; and obtaining the plurality of sub-sequences having redundancy based on the filled matrix.
In some embodiments, the number of columns of the empty matrix is determined based on an oligo length of a selected synthesis platform, the type of the error correction code, a predetermined error tolerance value, a size of the plurality of integer sub-sequences having index, or a combination thereof.
In some embodiments, the number of rows of the empty matrix is determined based on an oligo length of a selected synthesis platform, a type of the error correction code, a predetermined error tolerance value, a size of the plurality of integer sub-sequences having index, or a combination thereof.
In some embodiments, the error correction coding is Reed-Solomon (“RS”) coding.
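To make the matrix construction concrete, the following is a sketch of one possible layout using a systematic Reed-Solomon encoder over GF(32), which suits symbols in the range 0 to 31. Several choices here are assumptions rather than details taken from this disclosure: the primitive polynomial x^5 + x^2 + 1, generator roots starting at alpha^0, treating each matrix column as one oligo, and adding block parity (extra columns) before string parity (extra rows). All function names are hypothetical. A codeword over GF(32) is limited to 31 symbols, so larger data sets would have to be split into several such matrices.

```python
PRIM_POLY = 0b100101  # x^5 + x^2 + 1, primitive over GF(2)
EXP = [0] * 62        # EXP[i] = alpha^i, doubled to avoid a modulo in gf_mul
LOG = [0] * 32
_x = 1
for _i in range(31):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0b100000:
        _x ^= PRIM_POLY
for _i in range(31, 62):
    EXP[_i] = EXP[_i - 31]


def gf_mul(a, b):
    """Multiply two GF(32) elements."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def rs_generator(nsym):
    """Generator polynomial with roots alpha^0 .. alpha^(nsym-1)."""
    gen = [1]
    for i in range(nsym):
        nxt = [0] * (len(gen) + 1)
        for j, coef in enumerate(gen):
            nxt[j] ^= coef                       # coef * x
            nxt[j + 1] ^= gf_mul(coef, EXP[i])   # coef * alpha^i
        gen = nxt
    return gen


def rs_encode(msg, nsym):
    """Append nsym parity symbols to msg (systematic encoding)."""
    if len(msg) + nsym > 31:
        raise ValueError("a GF(32) RS codeword holds at most 31 symbols")
    gen = rs_generator(nsym)
    work = list(msg) + [0] * nsym
    for i in range(len(msg)):
        coef = work[i]
        if coef:
            for j in range(1, len(gen)):
                work[i + j] ^= gf_mul(gen[j], coef)
    return list(msg) + work[len(msg):]


def is_codeword(word, nsym):
    """True when all nsym syndromes of word are zero."""
    for i in range(nsym):
        acc = 0
        for coef in word:
            acc = gf_mul(acc, EXP[i]) ^ coef
        if acc:
            return False
    return True


def build_redundancy_matrix(subseqs, block_parity, string_parity):
    """Return the matrix columns, one per oligo, data plus redundancy."""
    n_rows = len(subseqs[0])
    # Block parity: encode across oligos, adding parity columns.
    wide = [rs_encode([s[r] for s in subseqs], block_parity) for r in range(n_rows)]
    # String parity: encode along each oligo, adding parity rows.
    return [rs_encode([row[c] for row in wide], string_parity)
            for c in range(len(wide[0]))]
```

With four indexed sub-sequences of three integers each, two block parity symbols and two string parity symbols, this yields a matrix of 6 columns by 5 rows, consistent with the requirement that both dimensions exceed those of the data. The sketch covers encoding only; error location and correction on the retrieval side are not shown.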
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The data processing step comprises dividing the binary string into a sequence of non-overlapping 5-bit binary strings. The nucleotide encoding step comprises converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers and converting the string of integers using the 5-bit transcoding framework to obtain the set of nucleotide sequences. The nucleotide encoding step further comprises dividing the string of integers into a plurality of initial sub-sequence of integers having a predetermined length. The nucleotide encoding step further comprises adding index information to each of the plurality of the initial sub-sequences of integers to obtain a plurality of integer sub-sequences having index. The nucleotide encoding step further comprises, after adding the index information, adding redundancy data to the plurality of integer sub-sequences having index, thereby obtaining a plurality of integer sub-sequences having redundancy. Adding redundancy data to the plurality of integer sub-sequences having index comprises: creating an empty matrix, wherein the number of columns in the empty matrix is larger than the size of the plurality of integer sub-sequences having index, and wherein the number of rows of the empty matrix is larger than the number of integers in each of the plurality integer sub-sequences having index; filling the empty matrix with the plurality of integer sub-sequences having index and data generated by applying an error correction coding; and obtaining the plurality of sub-sequences having redundancy based on the filled matrix. 
The data generated by applying an error correction coding is generated by applying string correction of the RS coding and/or block correction of the RS coding.
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) converting the input data into a binary string; ii) dividing the binary string into a sequence of non-overlapping 5-bit binary strings; iii) converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers and converting the string of integers using the 5-bit transcoding framework; iv) dividing the string of integers into a plurality of initial sub-sequence of integers having a predetermined length; v) adding index information to each of the plurality of the initial sub-sequences of integers to obtain a plurality of integer sub-sequences having index; vi) after adding the index information, adding redundancy data to the plurality of integer sub-sequences having index, thereby obtaining a plurality of integer sub-sequences having redundancy, thereby obtaining the set of nucleic acid sequences.
In some embodiments, there is provided a method for storing input data on nucleic acid, the method comprises: i) converting the input data into a binary string; ii) dividing the binary string into a sequence of non-overlapping 5-bit binary strings; iii) converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers and converting the string of integers using the 5-bit transcoding framework; iv) dividing the string of integers into a plurality of initial sub-sequence of integers having a predetermined length; v) adding index information to each of the plurality of the initial sub-sequences of integers to obtain a plurality of integer sub-sequences having index; vi) after adding the index information, adding redundancy data to the plurality of integer sub-sequences having index, thereby obtaining a plurality of integer sub-sequences having redundancy, thereby obtaining the set of nucleic acid sequences; and vii) synthesizing a set of nucleic acids comprising the set of nucleotide sequences.
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) converting the input data into a binary string; ii) dividing the binary string into a sequence of non-overlapping 5-bit binary strings; iii) converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers and converting the string of integers using the 5-bit transcoding framework; iv) dividing the string of integers into a plurality of initial sub-sequence of integers having a predetermined length; v) adding index information to each of the plurality of the initial sub-sequences of integers to obtain a plurality of integer sub-sequences having index; vi) creating an empty matrix, wherein the number of columns in the empty matrix is larger than the size of the plurality of integer sub-sequences having index, and wherein the number of rows of the empty matrix is larger than the number of integers in each of the plurality integer sub-sequences having index; vii) filling the empty matrix with the plurality of integer sub-sequences having index and data generated by applying an error correction coding (e.g., by applying string correction of the RS coding and/or block correction of the RS coding); and viii) obtaining the plurality of sub-sequences having redundancy based on the filled matrix, thereby obtaining the set of nucleic acid sequences.
In some embodiments, there is provided a method for storing input data on nucleic acid, the method comprises: i) converting the input data into a binary string; ii) dividing the binary string into a sequence of non-overlapping 5-bit binary strings; iii) converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers and converting the string of integers using the 5-bit transcoding framework; iv) dividing the string of integers into a plurality of initial sub-sequence of integers having a predetermined length; v) adding index information to each of the plurality of the initial sub-sequences of integers to obtain a plurality of integer sub-sequences having index; vi) creating an empty matrix, wherein the number of columns in the empty matrix is larger than the size of the plurality of integer sub-sequences having index, and wherein the number of rows of the empty matrix is larger than the number of integers in each of the plurality integer sub-sequences having index; vii) filling the empty matrix with the plurality of integer sub-sequences having index and data generated by applying an error correction coding (e.g., by applying string correction of the RS coding and/or block correction of the RS coding); viii) obtaining the plurality of sub-sequences having redundancy based on the filled matrix, thereby obtaining the set of nucleotide sequences; and ix) synthesizing a set of nucleic acids comprising the set of nucleotide sequences.
In some embodiments, there is provided a method for retrieving output data stored on nucleic acid, the method comprises: i) obtaining a set of nucleotide sequences of a set of nucleic acids; ii) converting the set of nucleotide sequences into a plurality of integer sub-sequences comprising integers ranging from 0-31; iii) converting the plurality of integer sub-sequences into a binary string; and iv) converting the binary string into the output data, thereby obtaining the output data.
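Steps ii) and iii) above can be sketched as follows for the scheme in which R is A or G and Y is C or T. The integer assignment used here is hypothetical (the first two bases select one of 16 2-mers in the order A, T, G, C, and the class of the third base supplies the lowest bit) and is not asserted to match the tables of this disclosure.

```python
BASES = "ATGC"


def three_mer_to_int(three_mer):
    """Map a concrete 3-mer to an integer 0..31 (hypothetical assignment)."""
    first, second, third = three_mer
    if third not in BASES:
        raise ValueError("unexpected base: " + third)
    degenerate_class = 0 if third in "AG" else 1  # R -> 0, Y -> 1
    return (BASES.index(first) * 4 + BASES.index(second)) * 2 + degenerate_class


def decode_sequence(seq):
    """Return the integer sub-sequence and the binary string for one sequence."""
    if len(seq) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    ints = [three_mer_to_int(seq[i:i + 3]) for i in range(0, len(seq), 3)]
    bits = "".join(format(n, "05b") for n in ints)
    return ints, bits
```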
In some embodiments, there is provided a method for retrieving output data stored on nucleic acid, the method comprises: i) sequencing a set of nucleic acids to generate a plurality of sequence reads; ii) pairing, merging, and/or filtering to obtain the set of nucleotide sequences; iii) converting the set of nucleotide sequences into a plurality of integer sub-sequences comprising integers ranging from 0-31; iv) applying error correction coding to the plurality of integer sub-sequences, thereby obtaining the plurality of integer sub-sequences having index; v) converting the plurality of integer sub-sequences having index into a binary string; and vi) converting binary string into the output data, thereby obtaining the output data.
In some embodiments, there is provided a method for retrieving output data stored on nucleic acid, the method comprises: i) sequencing a set of nucleic acids to generate a plurality of sequence reads; ii) pairing, merging, and/or filtering to obtain the set of nucleotide sequences; iii) converting the set of nucleotide sequences into a plurality of integer sub-sequences comprising integers ranging from 0-31; iv) applying RS coding string correction to the plurality of integer sub-sequences to obtain a plurality of consensus integer sub-sequences; v) applying RS coding block correction to the plurality of consensus integer sub-sequences to obtain the plurality of integer sub-sequences having index; vi) converting the plurality of integer sub-sequences having index into a binary string; and vii) converting binary string into the output data, thereby obtaining the output data.
In some embodiments, there is provided a method for retrieving output data stored on nucleic acid, the method comprises: i) sequencing a set of nucleic acids to generate a plurality of sequence reads; ii) pairing, merging, and/or filtering to obtain the set of nucleotide sequences; iii) converting the set of nucleotide sequences into a plurality of integer sub-sequences comprising integers ranging from 0-31; iv) applying RS coding string correction to the plurality of integer sub-sequences to obtain a plurality of consensus integer sub-sequences; v) applying RS coding block correction to the plurality of consensus integer sub-sequences to obtain the plurality of integer sub-sequences having index; vi) removing the index from the plurality of integer sub-sequences having index to obtain a plurality of core sub-sequences of integers; vii) merging the core sub-sequences of integers into a string of integers; viii) converting the string of integers into a binary string; and ix) converting binary string into the output data, thereby obtaining the output data.
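The index removal and merging steps can be illustrated with a short sketch. It assumes that the index is a base-32 number stored most significant digit first at the start of each sub-sequence; the function name and the `index_len` parameter are hypothetical.

```python
def merge_core_subsequences(indexed_subseqs, index_len):
    """Order sub-sequences by index, strip the index, and concatenate."""

    def index_value(subseq):
        value = 0
        for digit in subseq[:index_len]:
            value = value * 32 + digit
        return value

    merged = []
    for subseq in sorted(indexed_subseqs, key=index_value):
        merged.extend(subseq[index_len:])
    return merged
```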
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The 5-bit transcoding framework is according to Table 2.
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The 5-bit transcoding framework is according to Table 2. R and Y are chosen based on: 1) being different from the nucleotide immediately in front of R or Y; and/or 2) the estimated GC content of the nucleotide sequence.
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The input data corresponds to a compressed file. The compressed file is compressed using the Lempel-Ziv-Markov chain algorithm (“LZMA”).
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The input data corresponds to two or more files. The data processing step further comprises: grouping the two or more files into a TAR file. The TAR file is further compressed using the Lempel-Ziv-Markov chain algorithm (“LZMA”).
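The grouping and compression of two or more files can be sketched with Python's standard `tarfile` and `lzma` modules. The function name is hypothetical, and the use of Python's default `.xz` container is an assumption of this sketch; the disclosure does not specify a container format.

```python
import io
import lzma
import tarfile


def pack_files(files):
    """Group files (name -> bytes) into a TAR archive and compress it."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return lzma.compress(buffer.getvalue())
```

The returned byte string is the input data that would then be converted into a binary string.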
In some embodiments, there is provided a computer implemented method for converting input data into a set of nucleotide sequences, the method comprises: i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain a set of nucleotide sequences. The nucleotide encoding step further comprises appending a pair of primer sequences to the 5′ and 3′ ends of each nucleotide sequence of the set of nucleotide sequences.
In some embodiments, there is provided a method for storing input data on nucleic acid, the method comprising: a) converting the input data into a set of nucleotide sequences, wherein the converting comprises i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain the set of nucleotide sequences; and b) synthesizing a set of nucleic acids comprising the set of nucleotide sequences. The method further comprises attaching a pair of primers to the set of synthesized nucleic acids.
In some embodiments, there is provided a method for storing two or more sets of input data on nucleic acid, the method comprising: a) separately converting the two or more sets of input data into two or more sets of corresponding nucleotide sequences according to any of the methods described herein; b) separately appending a pair of primer sequences to the 5′ and 3′ ends of each set of the two or more sets of nucleotide sequences, wherein the pairs of primers for the two or more sets of corresponding nucleotide sequences are different from each other; and c) synthesizing two or more sets of nucleic acids comprising the two or more sets of corresponding nucleotide sequences, respectively.
In some embodiments, there is provided a method for storing two or more sets of input data on nucleic acid, the method comprising: a) separately converting the two or more sets of input data into two or more sets of corresponding nucleotide sequences according to any of the methods described herein; b) separately appending a pair of primer sequences to the 5′ and 3′ ends of each set of the two or more sets of nucleotide sequences, wherein the pairs of primers for the two or more sets of corresponding nucleotide sequences are different from each other; and c) synthesizing two or more sets of nucleic acids comprising the two or more sets of corresponding nucleotide sequences, respectively. Each pair of primers has a sequence that is different from any one of the two or more sets of corresponding nucleotide sequences or complementary sequences thereof.
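A simple check for the primer requirement above, together with the appending step, might look as follows. The sketch interprets "different from" as "does not occur as a substring of any payload sequence or of its reverse complement", which is an assumption; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_is_unique(primer, payloads):
    """True if the primer occurs in no payload and in no payload's reverse complement."""
    rc = reverse_complement(primer)
    return not any(primer in p or rc in p for p in payloads)


def add_primers(payloads, five_prime, three_prime):
    """Append the primer pair to the 5' and 3' ends of every payload sequence."""
    for primer in (five_prime, three_prime):
        if not primer_is_unique(primer, payloads):
            raise ValueError("primer also occurs inside a payload: " + primer)
    return [five_prime + p + three_prime for p in payloads]
```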
In some embodiments, the set of synthesized nucleic acids has GC content ranging from 30% to 70%.
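The GC content criterion can be checked per sequence with a short helper. The function names and the treatment of the bounds as inclusive are assumptions of this sketch.

```python
def gc_content(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def gc_in_range(seq, low=0.30, high=0.70):
    """True if the GC content lies within [low, high]."""
    return low <= gc_content(seq) <= high
```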
In some embodiments, there is provided a method for storing input data on nucleic acid, the method comprising: a) converting the input data into a set of nucleotide sequences, wherein the converting comprises i) a data processing step comprising converting the input data into a binary string; and ii) a nucleotide encoding step comprising converting the binary string using a 5-bit transcoding framework to obtain the set of nucleotide sequences; and b) synthesizing a set of nucleic acids comprising the set of nucleotide sequences. The method further comprises storing the set of synthesized nucleic acids.
In some embodiments, the set of synthesized nucleic acids is stored by drying. In some embodiments, the set of synthesized nucleic acids is stored by lyophilization.
In some embodiments, the set of synthesized nucleic acids is immobilized on a carrier, which can be a microarray.
In some embodiments, there is provided a method for retrieving output data stored on nucleic acid, the method comprising: a) obtaining a set of nucleotide sequences of a set of nucleic acids; and b) converting the set of nucleotide sequences into the output data, wherein the converting comprises: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting the binary string into the output data, thereby obtaining the output data. The method further comprises amplifying the set of nucleic acids prior to retrieving the output data.
In some embodiments, there is provided a method for retrieving output data stored on nucleic acid, the method comprising: a) obtaining a set of nucleotide sequences of a set of nucleic acids; and b) converting the set of nucleotide sequences into the output data, wherein the converting comprises: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting the binary string into the output data, thereby obtaining the output data. The method further comprises sequencing the set of nucleic acids to generate a plurality of sequence reads. The plurality of sequence reads are paired, merged, and filtered to obtain the set of nucleotide sequences.
In some embodiments, there is provided a computer implemented method for converting a set of nucleotide sequences into output data, the method comprising: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting the binary string into the output data. The nucleotide decoding step comprises converting the set of nucleotide sequences into a plurality of integer sub-sequences comprising integers ranging from 0-31.
In some embodiments, there is provided a computer implemented method for converting a set of nucleotide sequences into output data, the method comprising: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting the binary string into the output data. The nucleotide decoding step comprises converting the set of nucleotide sequences into a plurality of integer sub-sequences comprising integers ranging from 0-31. The nucleotide decoding step further comprises applying error correction coding to the plurality of integer sub-sequences, thereby obtaining the plurality of integer sub-sequences having index.
In some embodiments, there is provided a computer implemented method for converting a set of nucleotide sequences into output data, the method comprising: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting the binary string into the output data. The nucleotide decoding step comprises converting the set of nucleotide sequences into a plurality of integer sub-sequences comprising integers ranging from 0-31. The nucleotide decoding step further comprises applying error correction coding to the plurality of integer sub-sequences, thereby obtaining the plurality of integer sub-sequences having index. The step of applying error correction coding comprises: i) applying RS coding string correction to the plurality of integer sub-sequences to obtain a plurality of consensus integer sub-sequences; and ii) applying RS coding block correction to the plurality of consensus integer sub-sequences to obtain the plurality of integer sub-sequences having index.
In some embodiments, there is provided a computer implemented method for converting a set of nucleotide sequences into output data, the method comprising: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting the binary string into the output data. The nucleotide decoding step comprises converting the set of nucleotide sequences into a plurality of integer sub-sequences comprising integers ranging from 0-31. The nucleotide decoding step further comprises applying error correction coding to the plurality of integer sub-sequences, thereby obtaining the plurality of integer sub-sequences having index. The nucleotide decoding step further comprises removing the index from the plurality of integer sub-sequences having index to obtain a plurality of core sub-sequences of integers.
In some embodiments, there is provided a computer implemented method for converting a set of nucleotide sequences into an output data comprises: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting binary string into the output data. The output data is stored in a compressed file. The data processing step further comprises decompressing the compressed file, for example, by through the LZMA algorithm.
In some embodiments, there is provided a computer implemented method for converting a set of nucleotide sequences into an output data comprises: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting binary string into the output data. The output data corresponds to a plurality of files. The method further comprises extracting the plurality of files from the output data through the TAR algorithm.
In some embodiments, there is provided a computer implemented method for converting a set of nucleotide sequences into an output data comprises: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting binary string into the output data. The nucleotide decoding step comprises converting the set of nucleotide sequences into a plurality of integer sub-sequences comprising integers ranging from 0-31. The nucleotide decoding step further comprises applying error correction coding to the plurality of integer sub-sequences, thereby obtaining the plurality of integer sub-sequences having index. The nucleotide decoding step further comprises removing the index from the plurality of integer sub-sequences having index to obtain a plurality of core sub-sequences of integers. The nucleotide decoding step further comprises merging the core sub-sequences of integers into a string of integers and converting the string of integers into a binary string.
In some embodiments, there is provided a computer implemented method for converting a set of nucleotide sequences into an output data comprises: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting binary string into the output data. The 5-bit transcoding framework is according to Table 2.
In some embodiments, there is provided a computer implemented method for converting a set of nucleotide sequences into an output data comprises: i) a nucleotide decoding step comprising converting the set of nucleotide sequences into a binary string using a 5-bit transcoding framework; and ii) a data processing step comprising converting binary string into the output data. The set of nucleic acids comprises primer sequences at the 3′ and 5′ ends and the method comprises removing the primer sequences before the nucleotide decoding step.
In some embodiments, there is provided a computer-enabled method for providing DNA-based data storage, the method comprising: converting a digital file into a binary string; converting the binary string using a 5-bit transcoding framework to obtain a string of integers; obtaining, from the string of integers, a plurality of sub-sequences of integers; and converting the plurality of sub-sequences of integers into a plurality of representations of DNA oligoes for DNA synthesis.
In some embodiments, converting the binary string using a 5-bit transcoding framework to obtain a string of integers comprises: dividing the binary string into a sequence of non-overlapping 5-bit binary strings; and converting each 5-bit binary string into an integer ranging from 0 to 31 to obtain a string of integers. In some embodiments, the string of integers is further divided into an initial plurality of sub-sequences of integers having a predetermined length. In some embodiments, obtaining a plurality of sub-sequences of integers to be converted comprises: adding index information to each sub-sequence of the initial plurality of sub-sequences of integers; and, after adding the index information, adding redundancy data to the initial plurality of sub-sequences of integers to obtain the plurality of sub-sequences of integers. In some embodiments, the index information added to each sub-sequence of the initial plurality of sub-sequences comprises a string of integers, wherein a length of the string of integers corresponding to the index information is based on a size of the digital file.
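As a minimal illustrative sketch of the 5-bit conversion described above, the following Python code divides a binary string into non-overlapping 5-bit groups, maps each group to an integer from 0 to 31, and divides the result into sub-sequences of a predetermined length. The function names are hypothetical, and the zero-padding of a trailing partial group is an assumption, since the handling of a binary string whose length is not a multiple of 5 is not specified here.

```python
def bytes_to_bits(data: bytes) -> str:
    """Render a byte string as a binary string, 8 bits per byte."""
    return "".join(f"{byte:08b}" for byte in data)


def bits_to_integers(bits: str) -> list[int]:
    """Divide a binary string into non-overlapping 5-bit groups, each mapped to 0..31."""
    remainder = len(bits) % 5
    if remainder:
        # Assumption: right-pad the last group with zeros (not specified in the text).
        bits += "0" * (5 - remainder)
    return [int(bits[i:i + 5], 2) for i in range(0, len(bits), 5)]


def split_into_subsequences(integers: list[int], length: int) -> list[list[int]]:
    """Divide the string of integers into sub-sequences of a predetermined length."""
    return [integers[i:i + length] for i in range(0, len(integers), length)]


integers = bits_to_integers(bytes_to_bits(b"DNA"))
subsequences = split_into_subsequences(integers, 2)
```

Index information and redundancy data would be added to each sub-sequence afterwards; those steps are not shown in this sketch.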
In some embodiments, the method comprises adding redundancy data to the plurality of sub-sequences of integers, which can comprise, for example, obtaining a subset of the initial plurality of sub-sequences of integers; selecting an empty matrix, wherein the number of columns of the empty matrix is larger than the number of sub-sequences in the subset, and wherein the number of rows of the empty matrix is larger than the number of integers in each sub-sequence of the subset; filling the empty matrix with the subset of the initial plurality of sub-sequences of integers and data corresponding to an error correction code; and obtaining the plurality of sub-sequences of integers based on the filled matrix. In some embodiments, the number of columns of the empty matrix is selected based on a type of the error correction code, a predetermined error tolerance value, a size of the subset, or a combination thereof. In some embodiments, the number of rows of the empty matrix is selected based on a type of the error correction code, a predetermined error tolerance value, a size of the subset, or a combination thereof.
In some embodiments, the error correction code is Reed-Solomon (“RS”) code. In some embodiments, converting the plurality of sub-sequences of integers into a plurality of representations of DNA oligoes comprises converting an integer of the plurality of sub-sequences of integers into a representation of three nucleotides, wherein: a first of the three nucleotides is selected from A, T, G, and C, a second of the three nucleotides is selected from A, T, G, and C, and a third of the three nucleotides is selected from one of two options.
In some embodiments, the digital file is a compressed file corresponding to a group of one or more files or directories. In some embodiments, the digital file comprises a LZMA file corresponding to a group of one or more files or directories compressed using the Lempel-Ziv-Markov chain algorithm.
In some embodiments according to any one of the embodiments described above, the method further comprises: adding, to each oligo representation of the plurality of representations of DNA oligoes, data representing a pair of primers; and, after adding the data representing the pair of primers, causing performance of DNA synthesis based on the plurality of representations of DNA oligoes.
In some embodiments, the method further comprises: obtaining a second digital file; obtaining a second plurality of representations of DNA oligoes based on the second digital file; adding data representing a second pair of primers to each oligo representation of the second plurality of representations of DNA oligoes, wherein the second pair of primers is different from the first pair of primers; and performing DNA synthesis based on the plurality of representations of DNA oligoes and the second plurality of representations of DNA oligoes.
In some embodiments, there is provided a computer-enabled method for providing DNA-based data retrieval, the method comprising: obtaining a plurality of reads corresponding to a digital file; based on the plurality of reads, obtaining a plurality of sub-sequences of integers; converting the plurality of sub-sequences of integers into a string of integers; converting the string of integers into a binary string using a 5-bit framework; and obtaining, based on the binary string, a digital file. In some embodiments, obtaining a plurality of reads corresponding to a digital file comprises: identifying a primer pre-associated with the digital file. In some embodiments, obtaining a plurality of sub-sequences of integers comprises performing frequency-based error correction based on the plurality of reads. In some embodiments, converting the string of integers into a binary string using a 5-bit transcoding framework comprises: converting each integer of the string of integers into a 5-bit binary number.
In some embodiments, there is provided a non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to: convert a digital file into a binary string; convert the binary string using a 5-bit transcoding framework to obtain a string of integers; obtain, from the string of integers, a plurality of sub-sequences of integers; and convert the plurality of sub-sequences of integers into a plurality of representations of DNA oligoes for DNA synthesis.
In some embodiments, there is provided a system for providing DNA-based data storage, the system comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for: converting a digital file into a binary string; converting the binary string using a 5-bit transcoding framework to obtain a string of integers; obtaining, from the string of integers, a plurality of sub-sequences of integers; and converting the plurality of sub-sequences of integers into a plurality of representations of DNA oligoes.
In some embodiments, there is provided a non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to obtain a plurality of reads corresponding to a digital file; based on the plurality of reads, obtain a plurality of sub-sequences of integers; convert the plurality of sub-sequences of integers into a string of integers; convert the string of integers into a binary string using a 5-bit framework; and obtain, based on the binary string, a digital file.
In some embodiments, there is provided a system for providing DNA-based data storage, the system comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for: obtaining a plurality of reads corresponding to a digital file; based on the plurality of reads, obtaining a plurality of sub-sequences of integers; converting the plurality of sub-sequences of integers into a string of integers; converting the string of integers into a binary string using a 5-bit framework; and obtaining, based on the binary string, a digital file.
According to an exemplary implementation, the different steps of the method are implemented by a computer software program or programs, this software program comprising software instructions designed to be executed by a data processor of a relay module according to the disclosure and being designed to control the execution of the different steps of this method.
Consequently, an aspect of the disclosure also concerns a program liable to be executed by a computer or by a data processor, this program comprising instructions to command the execution of the steps of a method as mentioned here above.
This program can use any programming language whatsoever and be in the form of a source code, object code or code that is intermediate between source code and object code, such as in a partially compiled form or in any other desirable form.
The disclosure also concerns an information medium readable by a data processor and comprising instructions of a program as mentioned here above.
The information medium can be any entity or device capable of storing the program. For example, the medium can comprise a storage means such as a ROM (which stands for “Read Only Memory”), for example a CD-ROM (which stands for “Compact Disc-Read Only Memory”) or a microelectronic circuit ROM or again a magnetic recording means, for example a floppy disk or a hard disk drive.
Furthermore, the information medium may be a transmissible carrier such as an electrical or optical signal that can be conveyed through an electrical or optical cable, by radio or by other means. The program can be especially downloaded into an Internet-type network.
Alternately, the information medium can be an integrated circuit into which the program is incorporated, the circuit being adapted to executing or being used in the execution of the method in question.
According to one embodiment, an embodiment of the disclosure is implemented by means of software and/or hardware components. From this viewpoint, the term “module” can correspond in this document both to a software component and to a hardware component or to a set of hardware and software components.
A software component corresponds to one or more computer programs, one or more sub-programs of a program, or more generally to any element of a program or a software program capable of implementing a function or a set of functions according to what is described here below for the module concerned. One such software component is executed by a data processor of a physical entity (terminal, server, etc.) and is capable of accessing the hardware resources of this physical entity (memories, recording media, communications buses, input/output electronic boards, user interfaces, etc.).
Similarly, a hardware component corresponds to any element of a hardware unit capable of implementing a function or a set of functions according to what is described here below for the module concerned. It may be a programmable hardware component or a component with an integrated circuit for the execution of software, for example an integrated circuit, a smart card, a memory card, an electronic board for executing firmware etc. In a variant, the hardware component comprises a processor that is an integrated circuit such as a central processing unit, and/or a microprocessor, and/or an Application-specific integrated circuit (ASIC), and/or an Application-specific instruction-set processor (ASIP), and/or a graphics processing unit (GPU), and/or a physics processing unit (PPU), and/or a digital signal processor (DSP), and/or an image processor, and/or a coprocessor, and/or a floating-point unit, and/or a network processor, and/or an audio processor, and/or a multi-core processor. Moreover, the hardware component can also comprise a baseband processor (comprising for example memory units, and a firmware) and/or radio electronic circuits (that can comprise antennas) which receive or transmit radio signals. In one embodiment, the hardware component is compliant with one or more standards such as ISO/IEC 18092/ECMA-340, ISO/IEC 21481/ECMA-352, GSMA, StoLPaN, ETSI/SCP (Smart Card Platform), GlobalPlatform (i.e. a secure element). In a variant, the hardware component is a Radio-frequency identification (RFID) tag. In one embodiment, a hardware component comprises circuits that enable Bluetooth communications, and/or Wi-fi communications, and/or Zigbee communications, and/or USB communications and/or Firewire communications and/or NFC (for Near Field) communications.
It should be noted that a step of obtaining an element/value in the present disclosure can be viewed either as a step of reading such element/value in a memory unit of an electronic device or a step of receiving such element/value from another electronic device via communication means.
At step 102 (“Data Compression”), one or more files and/or directories are packed into a single file and then compressed into a compressed file. In some examples, the files and/or directories are packed into a TAR file (e.g., File.tar), which is then compressed into a LZMA file (e.g., File.tar.lzma) using the Lempel-Ziv-Markov Chain algorithm (i.e., LZMA algorithm). In some examples, one LZMA file operates as a single, undividable unit for data retrieval (e.g., during decoding). Thus, if multiple files and directories are intended to be stored together but retrieved randomly and independently, they should be grouped into multiple TAR files and compressed into multiple corresponding LZMA files at this step.
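The packing and compression of step 102 can be sketched with the Python standard library as follows. This is an illustration only: the function names are hypothetical, the files are held in memory as a name-to-bytes mapping for brevity, and the choice of the legacy `.lzma` container (`lzma.FORMAT_ALONE`) is an assumption about the intended file format.

```python
import io
import lzma
import tarfile


def pack_and_compress(files: dict[str, bytes]) -> bytes:
    """Pack named payloads into one TAR archive, then compress it with LZMA."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    # Assumption: FORMAT_ALONE yields the legacy .lzma container (File.tar.lzma).
    return lzma.compress(buffer.getvalue(), format=lzma.FORMAT_ALONE)


def decompress_and_unpack(blob: bytes) -> dict[str, bytes]:
    """Reverse of pack_and_compress: LZMA-decompress, then read the TAR members."""
    raw = lzma.decompress(blob)  # the default format auto-detects .lzma and .xz
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }


# Files intended for independent retrieval go into separate compressed units.
units = {
    "unit_a": pack_and_compress({"a.txt": b"first group"}),
    "unit_b": pack_and_compress({"b.txt": b"second group"}),
}
```

Because each compressed unit is retrieved as a whole, keeping independently needed files in separate units, as in `units` above, is what later permits random access.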
At step 104, a first round of data transcoding is carried out. First, each LZMA file is converted into a binary string. As an example, with reference to
As shown in
Turning back to
With reference to
Turning to
With reference to
Turning to
As shown in
In the depicted example in
Turning back to
Accordingly, each of the 29 integers in each integer sub-sequence (e.g., C1) can be mapped into 3 nucleotides. After all of [C1, C2, . . . , Cm] are converted, Y is replaced with C or T, while R is replaced with A or G, before DNA synthesis. This ensures that the 3rd base of each 3-mer differs from its 2nd base, which avoids 3 consecutive identical bases (e.g., AAA, GGG, TTT, CCC). Further, the GC percentage of each oligo should be kept between 30% and 70% through the choice of Y and R. The replacement step reduces the errors induced by oligo synthesis and thereby improves the ratio of correctly synthesized oligoes.
According to the principle of RS coding, the tolerable errors can include two mutations in each oligo (i.e., half of the string correction parity of 4) and 13 missing oligoes (i.e., half of the block correction parity of 26, including completely missing oligoes or oligoes having indels) of the 31 oligoes from the same matrix in the exemplary scenario illustrated in
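The two sequence constraints named above (no run of three identical bases, and a GC percentage between 30% and 70%) can be checked with a short sketch such as the following. The function names are hypothetical and the check is illustrative; it validates a candidate oligo but does not show how the choice of R and Y is made.

```python
def gc_fraction(oligo: str) -> float:
    """Fraction of bases in the oligo that are G or C."""
    return sum(base in "GC" for base in oligo) / len(oligo)


def has_identical_run(oligo: str, run_length: int = 3) -> bool:
    """True if the oligo contains run_length or more consecutive identical bases."""
    run = 1
    for previous, current in zip(oligo, oligo[1:]):
        run = run + 1 if previous == current else 1
        if run >= run_length:
            return True
    return False


def passes_constraints(oligo: str) -> bool:
    """GC content within 30%..70% and no three consecutive identical bases."""
    return 0.30 <= gc_fraction(oligo) <= 0.70 and not has_identical_run(oligo)
```

A candidate resolution of the degenerate bases that fails `passes_constraints` would be rejected in favor of another choice of A/G or C/T.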
With reference to
To select primer pairs, a plurality of criteria can be used. For example, a primer pair can be chosen to avoid homodimers, heterodimers, and hairpin structures and to have sufficient specificity (e.g., to have no binding site in the encoding nucleic acid sequences). In some examples, a multiplex PCR primer design standard is used.
The decoding procedure is essentially the reverse of the encoding procedure. With reference to
At step 114, paired-end next-generation sequencing and read pairing and merging are performed (e.g., by an Illumina sequencing system). Specifically, the forward and reverse reads from the same cluster are paired and merged into a single read, and all merged reads of irregular length (e.g., reads having indels) are filtered out. Further, according to the primer sequences, all reads can be grouped by compression file. In subsequent steps, the reads corresponding to the same compression file (i.e., reads sharing the same primers) are analyzed together.
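The length filtering and primer-based grouping of step 114 can be illustrated with the following sketch. It assumes that reads are already merged and oriented on the same strand, that both primers are given as they appear on that strand, and that primers must match exactly; practical pipelines typically tolerate some mismatches. The function and parameter names are hypothetical.

```python
from collections import defaultdict


def group_reads_by_primers(
    reads: list[str],
    primer_pairs: dict[str, tuple[str, str]],
    expected_length: int,
) -> dict[str, list[str]]:
    """Drop reads of irregular length, then group the rest by flanking primer pair."""
    groups = defaultdict(list)
    for read in reads:
        if len(read) != expected_length:
            continue  # an irregular length suggests an insertion or deletion
        for file_id, (forward, reverse) in primer_pairs.items():
            # Assumption: exact primer match at both ends of the read.
            if read.startswith(forward) and read.endswith(reverse):
                groups[file_id].append(read)
                break
    return dict(groups)
```

Reads are returned with their primers still attached, since primer removal is described as part of the following step.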
At step 116, reverse RS coding is performed. In some examples, a 29 by 31 zero matrix, rather than an empty matrix, is utilized. Specifically, each read from a single compression file has the PCR primers removed from both terminals and is then transformed into an integer sub-sequence through string correction of RS coding, with the aim of correcting mutations. Since each kind of oligo can be present as many molecular copies after synthesis and can be sequenced many times, many reads can originate from one oligo. Due to errors induced during both high-throughput synthesis and sequencing, these reads may contain variants, but the correct reads should dominate. Through highest frequency-based correction at every position of the integer sub-sequence, all integer sub-sequences sharing an identical index can be corrected and merged into a consensus integer sub-sequence. For instance, for a group of reads sharing the same index, each position of their consensus integer sub-sequence is determined by the integer occurring most frequently at that position.
At step 118, the list of integer strings can be completely decoded through block correction of RS coding to recover the missing oligoes and the oligoes with insertions and deletions. Since each kind of oligo can be present as many molecular copies after synthesis and can be sequenced many times, many reads can represent one oligo. Due to errors induced during both high-throughput synthesis and sequencing, these reads may contain variants, but the correct reads, which match the originally designed oligoes, still outnumber the others. Between string correction and block correction, all integer strings sharing an identical index can be corrected and merged into a consensus integer string through highest frequency-based correction at every position of the integer string. Because oligoes with insertions and deletions have irregular lengths and are discarded during error correction, their data is equivalent to missing information and needs to be recovered. Based on the index information, the columns of the matrix are filled after the highest frequency-based correction.
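The highest frequency-based correction can be sketched as a per-position majority vote over all integer sub-sequences that share an index. In this illustration the index is assumed to occupy the leading positions of each sub-sequence and all sub-sequences are assumed to have equal length; both are assumptions of the sketch, and the RS string correction that precedes this step is not shown.

```python
from collections import Counter, defaultdict


def consensus_by_index(
    subsequences: list[list[int]], index_length: int
) -> dict[tuple[int, ...], list[int]]:
    """Merge sub-sequences sharing an index into one consensus per index."""
    groups = defaultdict(list)
    for subsequence in subsequences:
        # Assumption: the first index_length integers are the index.
        groups[tuple(subsequence[:index_length])].append(subsequence)

    consensus = {}
    for index, members in groups.items():
        # zip(*members) walks the group column by column, i.e. position by position.
        consensus[index] = [
            Counter(column).most_common(1)[0][0] for column in zip(*members)
        ]
    return consensus
```

When two integers are equally frequent at a position, `Counter.most_common` returns the one encountered first, so ties are resolved by read order in this sketch.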
At step 120, transcoding is performed. The integer sub-sequences are sorted by index, and the index is then deleted from each integer sub-sequence. All integer sub-sequences can then be concatenated into a single integer string, which is converted into a binary string via the 5-bit transcoding framework.
At step 122, decompression is performed. Specifically, the system writes the binary string into a compression file, and then decompresses the compression file through the LZMA algorithm and the TAR algorithm, in that order. For random access to multiple compression files, steps 116 through 122 should be performed for each of the compression files independently. A pool can store multiple compression files. Each compression file has its own PCR primer. During decoding, it is not necessary to sequence the entire pool. Rather, the corresponding PCR primer is used to amplify the oligoes of a certain compression file, and the amplified oligoes are then sequenced to decode that compression file rather than the entire pool.
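The transcoding of step 120 can be illustrated as follows. The sketch assumes that the index occupies the leading positions of each sub-sequence and is written most-significant digit first, so that sorting the index lists orders the sub-sequences correctly; it also assumes that any bits left over after the last full byte are padding and can be dropped. The function names are hypothetical.

```python
def integers_to_bits(integers: list[int]) -> str:
    """Convert each integer (0..31) into a 5-bit binary number and concatenate."""
    return "".join(f"{value:05b}" for value in integers)


def merge_and_transcode(indexed: list[list[int]], index_length: int) -> str:
    """Sort by index, delete the index, concatenate, and convert to a binary string."""
    # Assumption: the first index_length integers of each sub-sequence are the index.
    ordered = sorted(indexed, key=lambda subsequence: subsequence[:index_length])
    merged = [value for subsequence in ordered for value in subsequence[index_length:]]
    return integers_to_bits(merged)


def bits_to_bytes(bits: str) -> bytes:
    """Pack a binary string into bytes, dropping trailing bits short of a full byte."""
    usable = len(bits) - len(bits) % 8  # assumption: leftover bits are padding
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
```

The bytes returned by `bits_to_bytes` would then be written out as the compression file for the decompression step.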
As discussed above, a 5-bit transcoding framework is leveraged. Specifically, every 5 consecutive bits of a binary string can be represented as an integer ranging from 0 to 31 and then as 3 nucleotides (nt) (i.e., a 3-mer). A DNA oligo consists of four bases (A, T, G and C), so there are 16 kinds of 2-mers (i.e., NN): AA, AT, AG, AC, TA, TT, TG, TC, GA, GT, GG, GC, CA, CT, CG and CC. If a degenerate base R or Y is appended to each 2-mer, there are 32 kinds of 3-mers (NNR/NNY), which match the 32 integers ranging from 0 to 31 and allow a binary string to be converted directly into a DNA sequence. During oligo synthesis, whether A or G is chosen to represent R, and whether C or T is chosen to represent Y, depends on the preceding base (i.e., the 2nd base of the 3-mer), so that the system can make the 2nd and 3rd bases different while keeping the GC content balanced. When both candidate bases satisfy these conditions, the actual base is selected randomly between them. In conclusion, the coding potential of this transcoding framework is 1.67 bits per nucleotide (i.e., 5 bits per 3 nt).
During encoding, the text file is compressed into a single compression file and then stored using 403 oligoes of 87 nt length through the DNA storage framework. Meanwhile, in order to simulate random access, 6 copies of this compression file are used and 6 pairs of primers are selected. Each pair of primers is added at the two terminals of each of the 403 oligoes. The 6 pairs of primers (20 nt each) were orthogonal, which means that any two of them have a sufficient Hamming distance and that each shares little similarity with any of the 403 oligoes. The Sequence Listing submitted herein in the ASCII text file includes SEQ ID NO.1-SEQ ID NO.403 and primer pairs PP NO.1-PP NO.6 as SEQ ID NOS. 404-415.
Synthesis of the oligo pool is then performed. In total, 2418 (i.e., 403 multiplied by 6) oligoes were synthesized using the CustomArray platform developed by CustomArray, Inc. Each oligo is 127 nt long, which includes 40 nt of primers in total (20 nt per terminal).
PCR amplification and NGS are then performed. 6 PCR reactions were performed for all copies of the compression file. After library preparation of the 6 samples using the TruSeq DNA PCR-free HT library preparation kit (96 indexes in plate format, 96 samples) and 6 library indexes, the pooled samples were sequenced together using MiSeq reagent kits V3 (150 cycle) due to the 127 nt length of the oligoes. The Q30 of the NGS data is 94% (official standard >85%) and the cluster density is 1,301 K/mm2 (official standard 1200-1400 K/mm2).
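The NNR/NNY scheme can be sketched as below. The specific assignment of integers to 3-mers used here (base order A, T, G, C; even integers ending in R, odd integers ending in Y) is purely an assumption made for illustration and is not the mapping of Table 2. GC balancing is omitted; the sketch only enforces that the 3rd base differs from the 2nd and otherwise picks randomly between the candidates.

```python
import random

BASES = "ATGC"                       # assumed ordering, for illustration only
DEGENERATE = {"R": "AG", "Y": "CT"}  # R is a purine (A/G), Y is a pyrimidine (C/T)


def integer_to_codon(value: int, rng: random.Random) -> str:
    """Map an integer 0..31 to a 3-mer NNR or NNY with the degenerate base resolved."""
    if not 0 <= value <= 31:
        raise ValueError("value must be in 0..31")
    pair = value // 2                          # which of the 16 2-mers
    first, second = BASES[pair // 4], BASES[pair % 4]
    degenerate = "R" if value % 2 == 0 else "Y"
    # The 3rd base must differ from the 2nd; choose randomly among what remains.
    candidates = [base for base in DEGENERATE[degenerate] if base != second]
    return first + second + rng.choice(candidates)


def codon_to_integer(codon: str) -> int:
    """Inverse mapping: A or G in the 3rd position reads as R, C or T as Y."""
    first, second, third = codon
    parity = 0 if third in DEGENERATE["R"] else 1
    return (BASES.index(first) * 4 + BASES.index(second)) * 2 + parity


rng = random.Random(0)
codons = [integer_to_codon(value, rng) for value in (0, 17, 31)]
```

Because only the purine or pyrimidine class of the 3rd base carries information, the free choice within each class is what leaves room to avoid repeated bases and to adjust GC content.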
Lastly, decoding is performed. After independent decoding of each copy of compression file, all copies could be randomly and successfully retrieved and decompressed without any error.
In an alternative embodiment, some or all of the steps of the method previously described, can be implemented in hardware in a programmable FPGA (“Field Programmable Gate Array”) component or ASIC (“Application-Specific Integrated Circuit”) component.
In an alternative embodiment, some or all of the steps of the method previously described, can be executed on an electronic device comprising memory units and processing units as the one disclosed in the
Although the disclosure and examples have been fully described with reference to the accompanying figures, it is to be noted that various changes and modifications will become apparent to those skilled in the art. Such changes and modifications are to be understood as being included within the scope of the disclosure and examples as defined by the claims.
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the techniques and their practical applications. Others skilled in the art are thereby enabled to best utilize the techniques and various embodiments with various modifications as are suited to the particular use contemplated.
Number | Date | Country | Kind |
---|---|---|---|
201710611123.2 | Jul 2017 | CN | national |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2018/097083 | 7/25/2018 | WO | 00 |