A paper copy of the Sequence Listing and a computer readable form of the Sequence Listing containing the file named “20UMC035.txt”, which is 20,985 bytes in size (as measured in MICROSOFT WINDOWS® EXPLORER), are provided herein and are herein incorporated by reference. This Sequence Listing consists of SEQ ID NOs:1-57.
The present disclosure relates generally to methods of writing data in nucleic acid chains, and methods of reading data written in nucleic acid chains. The present disclosure also relates to a kit for writing and reading data in nucleic acid chains.
In the era of explosive digital data growth, DNA is being explored as a next-generation molecular storage medium. One can encode data in vitro by employing the four natural nucleotides (A, T, G, and C), synthesizing data in a DNA strand, and retrieving data from the DNA by sequencing. Owing to manipulation at the molecular level, DNA data storage achieves extremely high data density, up to 10^18 bytes per cubic millimeter (1 μL), 6 orders of magnitude denser than the densest media available today. DNA is extremely stable both in liquid and on dry paper at relatively high temperature, offering higher durability (retention) than many current media materials. Data in DNA can easily be copied billions of times via a simple PCR reaction at low energy cost. DNA data retrieval greatly benefits from the revolution in sequencing technology, including Illumina Next Generation Sequencing (NGS) and Nanopore 3rd generation Sequencing, which can rapidly sequence human and other genomes at ever-decreasing prices.
However, current strategies for DNA data storage also face challenges. Storing any dataset requires template-free synthesis of specific long DNA molecules by either chemical or enzymatic methods, which remains highly expensive, time-consuming, labor-intensive, and error-prone. Synthesized DNA cannot be re-used to store other datasets, further increasing the storage cost. For data retrieval, synthesis-based Illumina sequencing has to cut long data DNA molecules into short fragments (<300 bases) to keep a low error rate (<0.1%), and requires complicated post-sequencing bioinformatic analysis to assemble the fragmented data.
For these reasons, efforts have been made to write data into a universal DNA sequence, including native DNA. Recently, Chen et al. constructed a string of protein (streptavidin)-labeled and unlabeled DNA nanostructures on template DNA to encode binary data, and used a nanopore-terminated glass nanopipette to identify the protein-labeled DNA nanostructure signal, which is distinct from the signals of unlabeled DNA nanostructures and background, thereby decoding the binary data. This method eliminates long DNA synthesis in data writing, but the storage of each bit still needs protein labeling, which increases the cost. Also, under current glass nanopore resolution, only binary data represented by the labeled and unlabeled states for each bit can be read, and neighboring protein labels have to be separated far enough to be distinguished from each other, which lowers data density. Tabatabaei et al. reported transformation of binary data into a series of positions along native DNA and utilized the nickase PfAgo with synthetic phosphorylated guide DNA to form nicks at these data positions for data writing. Sequencing all the fragmented DNA by NGS allowed determining whether the nick reaction occurred at each position, thereby retrieving the data. This data writing strategy also eliminates long DNA synthesis but requires parallel enzyme reactions for each bit, which may cause non-specific nicking. Additionally, enzymatic treatment is considerably costly and time-consuming. Experimentally, data reading still relies on sequencing-by-synthesis approaches (NGS) followed by complicated sequence alignment.
These shortcomings highlight the need for new methods for reading and storing data.
The present disclosure relates generally to methods of writing data in nucleic acid chains, and methods of reading data stored in nucleic acid chains. The present disclosure also relates to a kit for writing and reading data in nucleic acid chains.
In one aspect, the present disclosure is directed to a method for reading stored data, the method comprising:
In another aspect, the present disclosure is directed to a method for writing data, the method comprising:
In another aspect, the present disclosure is directed to another method for writing data, the method comprising:
In another aspect, the present disclosure is directed to a method for encoding data and reading the encoded data, the method comprising
In another aspect, the present disclosure is directed to a kit for writing and reading data, the kit comprising:
The disclosure will be better understood, and features, aspects and advantages other than those set forth above will become apparent when consideration is given to the following detailed description thereof. Such detailed description makes reference to the following drawings, wherein:
While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described below in detail. It should be understood, however, that the description of specific embodiments is not intended to limit the disclosure; on the contrary, the disclosure is intended to cover all modifications, equivalents and alternatives falling within the spirit and scope of the disclosure as defined by the appended claims.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the disclosure belongs. Although any methods and materials similar to or equivalent to those described herein can be used in the practice or testing of the present disclosure, the preferred methods and materials are described below.
The approach of the present disclosure is to store data in DNA by using blockers to write a codon sequence on a nucleic acid sequence. The codon sequence is read codon-by-codon on a nanopore unzipping-sequencing (NP Unzip-Seq) platform. This coupled encoding and NP Unzip-Seq method surprisingly writes, reads, and rewrites not only binary but also multinary data in a fast, enzyme-free, and label-free manner without the need for long DNA synthesis.
The strategy is summarized by
In this data storage/retrieval strategy, data is encoded into a nucleic acid chain by defining sequences of codons, and a nanopore is used to decode the data codon-by-codon.
An aspect of the present disclosure is directed to a method for reading stored data, the method comprising:
The nanopore unzipping-sequencing (NP Unzip-Seq) platform is illustrated in
Still referring to
In some embodiments, the nanopore is a protein nanopore. Suitable protein nanopores include but are not limited to MspA nanopores, CsgG nanopores, and α-hemolysin nanopores. In some embodiments, the MspA nanopore is an MspA mutant nanopore, for example MspA-M2. In some embodiments, the nanopore is a solid nanopore, a synthetic nanopore, a hybrid nanopore, or another nucleotide-sensitive nanochannel. In some embodiments, the hybrid nanopore incorporates a protein nanopore in a solid nanopore.
In some embodiments, dissociating the first blocker from its address comprises performing enzyme-free vectorial unzipping. In some embodiments, the electric potential is from about 50 mV to about 200 mV, about 50 mV to about 150 mV, about 50 mV to about 120 mV, about 100 mV to about 200 mV, about 100 mV to about 150 mV, or about 100 mV to about 120 mV. In some embodiments, the electric potential is about 100 mV, about 120 mV, or about 150 mV. In some embodiments, the electric potential is up to several volts.
As used herein, codons refer to sequences of the nucleic acid chain encoding data. The codons each include 1 or more unpaired nucleotides. In some embodiments, the codons each include 2 or more unpaired nucleotides. In some embodiments, the codons each include 1 to 5, 1 to 4, 1 to 3, 2 to 5, 2 to 4, 2 to 3, 3 to 5, or 3 to 4 unpaired nucleotides. In some embodiments, the codons each include 1, 2, 3, 4, or 5 nucleotides, or combinations thereof. In some embodiments, the codons each include up to 10 nucleotides. In some embodiments, the nanopore conductance was found to be sensitive to the first 4 unpaired nucleotides of the nucleic acid chain upstream of the blocker. Therefore, particularly suitable codons each include 4 unpaired nucleotides.
The conductance of the nanopore is highly sensitive to the identity and sequence of nucleotides in the codon. A change to a single nucleotide in the codon causes a characteristic change in the nanopore conductance. The identity of the codon can be read out from the nanopore current signature (see
The NP Unzip-Seq platform can discriminate the starts/stops of codons, including between codon repeats that are otherwise difficult to distinguish by current sequencing approaches (for example, nanopore sequencing methods struggle to identify repeating sequences because each sequence produces the same current read). In the present methods, distinctively lower conductance was briefly observed between codons (i.e., inter-codon markers as shown in
As used herein, blocker refers to an oligonucleotide that binds the nucleic acid chain to define codons. Preferably, the blocker binds the nucleic acid chain at or about an address on the nucleic acid chain. Suitable blockers each comprise about 5 to about 30 nucleotides. In some embodiments, the blockers each comprise about 5 to about 30, about 5 to about 25, about 5 to about 20, about 10 to about 30, about 10 to about 25, about 10 to about 20, about 15 to about 30, about 15 to about 25, or about 15 to about 20 nucleotides. In some embodiments, the blockers each comprise 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 nucleotides or combinations thereof.
As used herein, address refers to a short segment of the nucleic acid chain that binds a blocker to define a codon upstream of the address. Suitable addresses each comprise about 5 to about 30 nucleotides. In some embodiments, the addresses each comprise about 5 to about 30, about 5 to about 25, about 5 to about 20, about 10 to about 30, about 10 to about 25, about 10 to about 20, about 15 to about 30, about 15 to about 25, or about 15 to about 20 nucleotides. In some embodiments, the addresses each comprise 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 nucleotides, or combinations thereof. In some embodiments, the addresses of the nucleic acid chain are all identical. In other embodiments, the nucleotides of the addresses have an address sequence and the chain comprises two or more address sequences.
Nucleotides compatible with aspects of the invention may be any nucleotides, derivatives, or nucleotide-like compounds as are known in the art. In some embodiments, the nucleotides are natural nucleotides (A, T, G, C). In some embodiments, the nucleotides are artificial nucleotides such as LNA, BNA, and PNA. In some embodiments, the nucleotides are modified nucleotides such as methylated nucleotides. In some aspects, the nucleotides are DNA nucleotides. In some embodiments, the nucleotides are RNA nucleotides.
In this data writing/reading strategy, data is written into a nucleic acid chain by defining sequences of codons.
An aspect of the present disclosure is directed to a method for writing data, the method comprising:
As used herein, coding window refers to a short segment of the nucleic acid chain upstream of the address. The blocker defines a codon in the coding window when the blocker binds the address. In some embodiments, the coding windows each comprise about 1 to about 20, about 1 to about 10, about 1 to about 5, about 2 to about 20, about 2 to about 10, about 2 to about 5, about 3 to about 20, about 3 to about 10, about 3 to about 5, about 4 to about 20, about 4 to about 10, about 5 to about 20, or about 5 to about 10 nucleotides.
Frameshift encoding. Since the blocker binding to the nucleic acid chain controls the codon formation, extending or shortening the blocker length by n nucleotides relative to the address length enables shifting the codon frame backward or forward by the same number of bases. This frameshift encoding strategy enables defining different codons in the coding window without changing the address or coding window sequence.
Accordingly, in some embodiments, the blocker comprises an address match and an encoder. The address match complements the nucleotides of the addresses. The encoder complements nucleotides of the coding windows adjacent to the addresses. Encoders of different lengths are able to shift the codon frame by different numbers of bases, generating multiple codons within the coding window at each address.
The frameshift encoding strategy is demonstrated in
In some embodiments, the method further includes defining codons based on the address match and the encoder, wherein the defining comprises shifting the coding window along the chain by a predetermined number of bits. In some embodiments, the encoder comprises 0 or more nucleotides and a size of the encoder determines the number of bits by which the coding window is shifted. In some embodiments, the encoders each comprise about 0 to about 10, about 0 to about 5, about 1 to about 10, or about 1 to about 5 nucleotides. In some embodiments, the encoders each comprise 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 nucleotides, or combinations thereof.
By writing data in multinary format, data storage density is greatly increased. Frameshift encoding enables encoding data in multinary format on the nucleic acid chain. To encode an n-nary dataset (n is 2, 4, 8 and 16 for binary, quaternary, octal and hexadecimal data, respectively) each coding window forms n different codons by frameshift. In some embodiments, the output is multinary. In some embodiments, the output is quaternary, octal, or hexadecimal. For example, to encode data in quaternary format, in some embodiments, a set of four blockers each have an identical address match but each have a unique encoder including 0, 1, 2 or 3 nucleotides. Upon binding to the same address sequence, these staples shift the codon frame by 0, 1, 2 and 3 bases respectively to form four different codons with bit values of 0, 1, 2, and 3. In some embodiments, the output is binary.
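As a rough illustration of the multinary case, the sketch below splits each byte of a dataset into base-n digits; under the quaternary scheme described above, each digit would then select the blocker whose encoder has that many nucleotides. The function name `to_base_n_digits` and the most-significant-digit-first ordering are assumptions made for illustration and are not taken from the disclosure. Octal is omitted from the sketch only because 8 bits do not divide evenly into 3-bit digits.

```python
def to_base_n_digits(data: bytes, n: int) -> list[int]:
    """Split each byte into fixed-width base-n digits, most significant first.

    Hypothetical helper: handles binary (n=2), quaternary (n=4), and
    hexadecimal (n=16), where a byte splits evenly into digits.
    """
    digits_per_byte = {2: 8, 4: 4, 16: 2}
    if n not in digits_per_byte:
        raise ValueError("this sketch handles n = 2, 4, or 16 only")
    digits = []
    for byte in data:
        chunk = []
        for _ in range(digits_per_byte[n]):
            chunk.append(byte % n)   # peel off the least significant digit
            byte //= n
        digits.extend(reversed(chunk))  # restore most-significant-first order
    return digits


# In the quaternary example, a digit value d (0..3) maps to an encoder of d
# nucleotides, i.e. a frameshift of d bases at that address.
quaternary_digits = to_base_n_digits(b"M", 4)
```

With a 1-nt step, one quaternary digit is stored per address, so a byte occupies four addresses instead of the eight needed in binary.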
Another aspect of the present disclosure is directed to a method for encoding data and reading the encoded data, the method comprising
Universal template. In some embodiments, the nucleic acid chain is a universal template. The universal template is a 1D array of data units. Each data unit consists of a coding window followed by an address. In some embodiments, the nucleotides of the coding windows have a coding window sequence and the coding window sequence is the same for all the coding windows.
Frameshift encoding allows different codons to be defined (and hence different data to be written) on the universal template without changing its sequence. Thus, frameshift encoding enables a universal template to encode various datasets by shifting the codon reading frame without the need for DNA synthesis.
In some embodiments, the frameshift encoding strategy is applied in writing and reading multinary data in a native DNA that functions as a hard drive. In such embodiments, the universal template is native DNA or modified native DNA, wherein long single-stranded templates are extracted from the double-stranded DNA. For example, in some embodiments the nucleic acid chain is DNA from organisms (e.g., naturally-sourced DNA derived from recombining several fragments of viral DNA). Particularly suitable organisms are bacteria and viruses, due to the advantages of large-scale native DNA proliferation and extraction at low cost and high efficiency.
To use a native DNA as the universal template, in some embodiments the methods include identifying qualified coding window sequences in the native DNA sequences. In some embodiments, the criteria for identifying qualified coding windows sequences include capacity, step, and conductance difference between two codons formed in a coding window.
Capacity. To encode an n-nary dataset (n is 2, 4, 8 and 16 for binary, quaternary, octal and hexadecimal data), each coding window should be able to form n different (nanopore-discriminable) codons by frameshift; i.e. the capacity is n.
Step. The coding window sequences are also dependent on the number of bases for each shift, i.e. step. In some embodiments, step is 1. A search with multiple Step values may result in more qualified candidate coding window sequences. If the frameshift step is set to 1-nt, a coding window needs to contain at least n+3 nucleotides.
Conductance. Finally, the nanopore current difference produced between any two codons should be larger than a cut-off value ΔIshift, a measure of discriminability at the instrument resolution. A large ΔIshift increases codon discrimination accuracy, but also results in fewer qualified candidate coding window sequences. After coding windows are identified, blockers are synthesized with encoders for encoding multinary data in the coding windows by frameshift encoding.
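The three criteria (capacity, step, and conductance difference) can be combined into a simple scan over a native sequence. The following is a minimal sketch under stated assumptions and is not the Frame-Shift Coding Window Finder program itself: `current_of` is a hypothetical caller-supplied function returning a measured or predicted nanopore current level for a 4-nt codon, and `delta_i_shift` stands for the cut-off ΔIshift in the same units.

```python
from itertools import combinations
from typing import Callable

CODON_LEN = 4  # the pore is described as sensitive to 4 unpaired nucleotides


def window_length(n: int, step: int = 1) -> int:
    """Shortest window that yields n codons; equals n + 3 when step is 1."""
    return CODON_LEN + (n - 1) * step


def codons_in_window(window: str, n: int, step: int = 1) -> list[str]:
    """List the n codons obtained by shifting the frame `step` bases at a time."""
    return [window[i * step: i * step + CODON_LEN] for i in range(n)]


def find_coding_windows(
    seq: str,
    n: int,
    current_of: Callable[[str], float],  # hypothetical codon -> current level
    delta_i_shift: float,
    step: int = 1,
) -> list[tuple[int, str]]:
    """Return (start, window) pairs whose n codons are pairwise discriminable."""
    length = window_length(n, step)
    hits = []
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        levels = [current_of(c) for c in codons_in_window(window, n, step)]
        # Every pair of codons must differ by more than the cut-off.
        if all(abs(a - b) > delta_i_shift for a, b in combinations(levels, 2)):
            hits.append((start, window))
    return hits
```

Raising `delta_i_shift` in this sketch shrinks the list of hits, which mirrors the trade-off between discrimination accuracy and the number of candidate coding windows.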
Barcodes. In another aspect, the nucleic acid chain is used for one or more of labelling and identification of a biomarker. In some embodiments, the nucleic acid chain is a barcode. The barcode labels biomarkers such that NP Unzip-Seq recognizes the barcode to identify the biomarker. In some embodiments, the biomarker is a nucleic acid fragment (e.g. a native genomic DNA or RNA fragment containing driver mutations), a pathogenic DNA or RNA fragment (e.g. from a bacterium or virus), a panel of microRNAs, long non-coding RNAs, or other nucleic acid sequences of interest. In some embodiments, the biomarker is a protein, a peptide, or a small molecule such as a metabolite or a metal ion. In some embodiments, the barcode labels a probe or a receptor that binds the biomarker. For example, in some embodiments, the biomarker is a nucleic acid, the probe is a complementary strand that binds the nucleic acid biomarker, and the barcode labels the probe.
Native nucleic acid chains. In another aspect, the nucleic acid chain comprises a native nucleic acid chain. In some embodiments, the nucleic acid chain is a native nucleic acid chain. Suitable native nucleic acid chains are DNA and RNA. In some embodiments, the methods include converting double-stranded DNA into single-stranded DNA by any known means (e.g. heating). In some embodiments, the methods include identifying one or more of genetic alterations and epigenetic alterations in the nucleic acid chain. Genetic alterations include single nucleotide polymorphisms, insertions, deletions, frameshift mutations, duplications, and repeat expansions. Epigenetic alterations include oxidative stress and methylation. In some embodiments, the methods are used to study the functions of enzymes that produce these alterations and their relations to diseases.
For example, DNA methylation occurs in clusters (CpG islands) in promoter regions. RNA methylation also has consensus sequences. Some embodiments are directed to identifying methylation sites in the nucleic acid chain. Such methods include designing a set of blockers to bind the nucleic acid sequence near (but not over) a potential methylation site. The blockers define codons, and some of the codons include methylation sites. Nanopore Unzip-Seq reads these codons in sequence while unzipping the blockers one by one, such that the status of each codon (i.e. with or without methylation) is identified. In some embodiments, the methods include performing statistical analysis of the results from many nucleic acid chains. In some embodiments, the methods include quantifying the overall methylation percentage distribution. In some embodiments, the methods include quantifying the locus-specific methylation occurrence probability.
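The two statistics named above can be sketched as follows, assuming each chain has already been decoded into a list of 0/1 calls (1 meaning the codon at that locus was read as methylated). The data layout, the function names, and the rounding of percentages to whole numbers are assumptions made for illustration only.

```python
from collections import Counter


def locus_methylation_probability(reads: list[list[int]]) -> list[float]:
    """Fraction of chains called methylated at each locus (hypothetical layout)."""
    n_loci = len(reads[0])
    return [sum(read[i] for read in reads) / len(reads) for i in range(n_loci)]


def methylation_percentage_distribution(reads: list[list[int]]) -> Counter:
    """Histogram of per-chain methylation percentage, rounded to whole percent."""
    return Counter(round(100 * sum(read) / len(read)) for read in reads)


# Toy input: four chains, four candidate loci each (illustrative values only).
toy_reads = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
]
per_locus = locus_methylation_probability(toy_reads)
per_chain = methylation_percentage_distribution(toy_reads)
```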
RNA. In another aspect, the nucleic acid sequence is an RNA sequence. RNA includes double-stranded motifs that are interconnected via canonical and non-canonical interactions. In some embodiments, the methods include reading a sequence before a double-stranded motif. In some embodiments, the methods include sequentially unzipping all the double-stranded motifs along the RNA chain to read the sequence before each motif. In some embodiments, the methods include mapping the locations of all the motifs formed in the RNA. In some embodiments, the nucleic acid sequence is an RNA sequence and the method includes identifying secondary and tertiary structures in the RNA sequence.
Another aspect of the present disclosure is directed to a kit for writing and reading data, the kit comprising:
Examples 1-3 are directed to methods used in the rest of the examples. Examples 4-10 are directed to reading data encoded in a nucleic acid chain. Example 11 is directed to writing data on a nucleic acid chain via frameshift encoding. Examples 12-15 are directed to frameshift encoding and decoding multinary data in native DNA sequences. Example 16 is directed to the advantages of this invention.
Examples 1-3 are directed to methods used in the rest of the examples.
This example presents a method of preparing MspA protein that is used in other examples.
The mutated MspA porin (D90N/D91N/D93N/D118R/D134R/E139K) was prepared by previously reported methods (Yan et al., 2019; Wang et al., 2018; Heinz et al., 2003; Butler et al., 2008). The gene of mutated MspA with a poly-histidine tag (H6) on the C-terminus was synthesized and cloned into a pET-30a(+) plasmid by Genscript. Competent cells of E. coli BL21 (DE3) were transformed with the plasmid by heat shock, plated on LB agar supplemented with kanamycin, and incubated at 37° C. overnight. A single colony was picked and grown in LB medium with kanamycin. When the OD600 reached 0.7, the cells were induced with 1 mM isopropyl β-D-thiogalactoside (IPTG) and shaken overnight at 16° C. They were harvested by centrifugation at 4000 rpm for 30 min at 4° C. The pellets were lysed in lysis buffer (100 mM Na2HPO4/NaH2PO4, 0.1 mM EDTA, 150 mM NaCl, 0.5% (w/v) Genapol, pH 6.5) at 60° C. for 10 min. The centrifuge tubes were kept on ice for 10 min and centrifuged at 10,000 rpm for 30 min at 4° C. After syringe filtration with a 0.22 μm filter, the supernatant was transferred to a nickel affinity column (HisTrap™ HP). The column was washed with washing buffer (0.5 M NaCl, 20 mM HEPES, 5 mM imidazole, 0.5% (w/v) Genapol X-80, pH=8.0). The MspA proteins were eluted with eluting buffer (500 mM imidazole, 0.5 M NaCl, 20 mM HEPES, 0.5% (w/v) Genapol X-80, pH=8.0). The eluate, collected along a gradient of imidazole concentration, was distributed into separate EP tubes. The assembly of MspA in each tube was characterized by 12% SDS-PAGE.
This example presents a method of DNA hybridization that is used in other examples.
DNA fragments were synthesized by Integrated DNA Technologies, Inc. The DNAs were dissolved in deionized water to 200 μM and diluted with the same volume of salt solution (200 mM KCl, 20 mM Tris-Cl, pH 8.0) to 100 μM. Except in the experiments on discriminating codon sequences at single-base resolution (where the mixing ratio was 1:1), a ten-fold excess of the staple was added for each address in the medium DNA (the mixing ratio was 1:(number of addresses×10)). The DNA mixture was denatured at 95° C. for 2 min, then annealed by cooling gradually to room temperature overnight.
This example presents a method of nanopore single-channel recording that is used in other examples.
Nanopore single-channel recording was conducted according to previously reported methods (Wang et al., 2011; Tian et al., 2018). Briefly, a lipid bilayer membrane (1,2-diphytanoyl-sn-glycero-3-phosphocholine) was formed over a 100-150 μm orifice in the center of a Teflon film that partitioned the cis and trans recording solutions. Both solutions contained KCl (1 M) and were buffered with 10 mM Tris (pH 7.8). The MspA proteins were added to the cis solution, from which they inserted into the bilayer to form a single nanopore channel. The DNA chains were added to the cis solution. Voltage was applied to the trans solution while the cis solution was grounded. Ionic current through the pore was recorded using an Axopatch 200B amplifier, filtered with a built-in 4-pole low-pass Bessel filter at 5 kHz, and acquired with Clampex 9.0 software through a Digidata 1440 A/D converter at a sampling rate of 20 kHz. Single-channel event amplitudes were analyzed with Clampfit 9.0, Excel, and SigmaPlot (SPSS) software. All measurements were conducted at 22±2° C. The program Frame-Shift Coding Window Finder was written in Python.
Examples 4-10 are directed to reading data encoded in a nucleic acid chain.
This example presents the design of a coupled Frameshift encoding/Nanopore Unzip-Sequencing (NP Unzip-Seq) decoding workflow.
This example presents a method for reading data stored in nucleic acid chains as codons.
The nanopore was used to electrically screen a group of DNA coding segments with gradual nucleotide substitution. From D9T0C to D2T7C, one to seven thymines from the end of the duplex blocker were successively replaced by cytosines. These coding segments were extended with a common address, which bound a common staple (SEQ ID NO: 13) to form a double-stranded segment (
TTTTTTTTTCCAGCATGTACTTCTCGACC
TTTTTTTTCCCAGCATGTACTTCTCGACC
TTTTTTTCCCCAGCATGTACTTCTCGACC
TTTTTTCCCCCAGCATGTACTTCTCGACC
TTTTTCCCCCCAGCATGTACTTCTCGACC
TTTTCCCCCCCAGCATGTACTTCTCGACC
TTTCCCCCCCCAGCATGTACTTCTCGACC
TTCCCCCCCCCAGCATGTACTTCTCGACC
TCCCCCCCCCCAGCATGTACTTCTCGACC
CCCCCCCCCCCAGCATGTACTTCTCGACC
Driven by the voltage, each coding segment was pulled into the MspA nanopore from the cis opening, immobilized temporarily in the cavity by the double-stranded segment while characteristically regulating the nanopore ion current, and finally translocated through the pore once the blocker was unbound. This single-molecule procedure was recorded by the nanopore current signatures (
Without being bound by any particular theory, it is believed that, when the duplex blocker was trapped and anchored in the nanopore cavity, the first four bases connecting to the duplex exactly occupied the sharp end of the funnel-shaped MspA pore, i.e. the sequence-reading zone, whereas the remaining unpaired sequence was left out of the pore without influencing the nanopore conductance. Therefore, the first four bases directly connecting to (from the end of) the duplex blocker served as codons for data encoding. Codons were distinguished from each other by the signature pattern, including the blocking level, duration, noise, and other pattern characteristics.
This example shows design of codon sequences for studying reading sequential data codons and discriminating consecutive identical codons.
A group of DNA templates was designed, each with three quadromeric codons. These templates, from D000 through D111, encoded all the 3-bit binary numbers, i.e. 000, 001, 010, 011, 100, 101, 110, and 111 (
CCCCCCAGCATGTACTTCTCGACCC
CCCCCCAGCATGTACTTCTCGACC
CCCCCCAGCATGTACTTCTCGACCT
TTTTCCAGCATGTACTTCTCGACC
TTTTCCAGCATGTACTTCTCGACCC
CCCCCCAGCATGTACTTCTCGACC
TTTTCCAGCATGTACTTCTCGACCT
TTTTCCAGCATGTACTTCTCGACC
CCCCCCAGCATGTACTTCTCGACCC
CCCCCCAGCATGTACTTCTCGACC
CCCCCCAGCATGTACTTCTCGACCT
TTTTCCAGCATGTACTTCTCGACC
TTTTCCAGCATGTACTTCTCGACCC
CCCCCCAGCATGTACTTCTCGACC
TTTTCCAGCATGTACTTCTCGACCT
TTTTCCAGCATGTACTTCTCGACC
Two nanopore-discriminable codons, CCCC and TTTT (
This example shows reading the sequences of Example 6.
Since D010 and D101 did not contain codon repeats, their codons were directly read out from the blocking levels. Their nanopore signatures were separated into three stages with sequential blocking levels (I/I0) at 25.7±0.6% (high)/21.4±0.5% (low)/26.2±0.7% (high) for D010, and 21.6±0.2% (low)/25.3±0.4% (high)/22.5±0.4% (low) for D101. According to the fact that the blocking level for CCCC was higher than TTTT (
This example presents evidence of inter-codon markers for use in demarcating codon stops and starts, particularly between codon repeats.
When re-examining the D010 and D101's signatures, two downward current flicks were identified with distinctively lower conductance at Stage 1/2 and Stage 2/3 transitions (marked by triangles). These ‘inter-codon markers’ recognized the end of one codon signal and the beginning of the next codon signal, therefore becoming another codon identifier in addition to the blocking level. The signatures for all the six DNAs containing consecutive identical codons exhibited two inter-codon markers (
This example presents design of barcodes using the 3-bit system.
Because these stapled DNAs were easily and unmistakably read out by the nanopore, they could be linked as a barcode module to eight different probes/receptors (via synthesis or conjugation) to simultaneously detect eight biomarkers. The module uses the values of three binary bits to form eight barcodes: 000, 001, 010, 011, 100, 101, 110, and 111. The resulting nanopore electric signatures include two modular components, a barcode signal followed by a biomarker binding signal. Reading the barcode signal identifies which biomarker is being detected, and reading the biomarker signal identifies whether the probe/receptor is bound by its target biomarker.
This example shows how NP Unzip-Seq is used to read a long DNA that encodes the 8-bit binary ASCII code of the letter ‘M.’
To test the capability in reading long DNAs, a 200-nt template (D8) was designed that uses CCCC (0) and TTTT (1) to encode the eight-bit binary sequence 01001101 for the letter ‘M’ (
TTTTCCTGCTGCTCTGACCCCCCCCCTGCTGCTCTGACCC
CCCCCCTGCTGCTCTGACCTTTTTCCTGCTGCTCTGACCT
TTTTCCTGCTGCTCTGACCTCCCCCCTGCTGCTCTGACCT
TTTTCCTGCTGCTCTGACCCGATGCCTGCTGCTCTGACC
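The bit-to-codon mapping used in this example (CCCC for 0, TTTT for 1) can be illustrated with a short sketch. The helper names `char_to_codons` and `codons_to_char` are hypothetical, and the sketch deliberately ignores the order in which the codons are laid out along the template or pass through the pore.

```python
BIT_TO_CODON = {"0": "CCCC", "1": "TTTT"}
CODON_TO_BIT = {codon: bit for bit, codon in BIT_TO_CODON.items()}


def char_to_codons(ch: str) -> list[str]:
    """Encode one ASCII character as eight codons, one per bit."""
    bits = format(ord(ch), "08b")  # 'M' -> '01001101'
    return [BIT_TO_CODON[b] for b in bits]


def codons_to_char(codons: list[str]) -> str:
    """Decode eight codons back into the ASCII character they encode."""
    bits = "".join(CODON_TO_BIT[c] for c in codons)
    return chr(int(bits, 2))


codons_for_m = char_to_codons("M")
```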
The nanopore signatures (
This example presents a method for encoding data and reading the encoded data stored in codons in nucleic acid chains via blockers that bind to address sections, encoders that allow for frameshift encoding, and nanopore sequencing.
TTTTCCCCC was used as the coding window to exemplify the Frameshift Encoding process in a universal template (
TTTTTCCCCCCCAGCATGTACTTCTCGACC
TTTTTCCCCCGGATTTCAAGTTCTCCCTCC
TTTTTCCCCCGCTCTTCAAGGTGCACATGG
CCCC
TTTT
CCCC
TTTT
CCCC
TTTT
Upon binding to address i, Staplei1-FS's encoder was null, thus did not block the coding window, allowing its last four bases CCCC to form Codoni1; Staplei2-FS had a GGGGG encoder, which blocked the last five bases CCCCC of the coding window, therefore upstream shifting the codon frame by 5 bases (−5 frameshift) to form another codon TTTT. As a result, the staple encoders enabled the same coding window to form two nanopore-discriminable codons, CCCC and TTTT, which allowed for freely ‘writing’ either ‘0’ or ‘1’ at each address, and thereby storing various binary datasets in the universal template.
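The effect of the encoder length on the codon frame can be sketched in a few lines. The sketch assumes that the codon is the four unpaired bases immediately upstream of the blocked region and that the coding window is written with the address on its right; the names `frameshift_codon`, `write_bits`, and `read_codons` are hypothetical.

```python
CODON_LEN = 4
WINDOW = "TTTTCCCCC"  # coding window from this example, address to its right

# Encoder length used to write each bit: a null encoder leaves CCCC exposed,
# and a 5-nt encoder (GGGGG) blocks CCCCC so that TTTT becomes the codon.
ENCODER_LEN_FOR_BIT = {"0": 0, "1": 5}


def frameshift_codon(coding_window: str, encoder_len: int) -> str:
    """Return the 4 unpaired bases just upstream of the encoder-blocked bases."""
    end = len(coding_window) - encoder_len
    start = end - CODON_LEN
    if encoder_len < 0 or start < 0:
        raise ValueError("encoder does not leave a full codon in the window")
    return coding_window[start:end]


def write_bits(bits: str) -> list[int]:
    """Choose one encoder length per address for the given bit string."""
    return [ENCODER_LEN_FOR_BIT[b] for b in bits]


def read_codons(encoder_lens: list[int]) -> list[str]:
    """Codons exposed at each address for the chosen encoder lengths."""
    return [frameshift_codon(WINDOW, k) for k in encoder_lens]
```

Rewriting the data in this picture means choosing a different list of encoder lengths; the window and address sequences stay the same.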
To write 000, Staple11-FS/Staple21-FS/Staple31-FS (SEQ ID NOs: 12, 26, and 28) were selected and this staple panel was mixed with the template. To write 001, Staple31-FS (SEQ ID NO: 28) was replaced with Staple32-FS (SEQ ID NO: 29) and the panel of Staple11-FS/Staple21-FS/Staple32-FS (SEQ ID NOs: 12, 26, and 29) was mixed with the same template. Similarly, to write 010 and 011, the panels of Staple11-FS/Staple22-FS/Staple31-FS (SEQ ID NOs: 12, 27, and 28) and Staple11-FS/Staple22-FS/Staple32-FS (SEQ ID NOs: 12, 27, and 29), respectively, were used and again mixed with the same template. The nanopore signatures for all the four tandem codon-template-staple duplex complexes (
Thus, each stage in the signature was assigned to either codon CCCC for ‘0’ or TTTT for ‘1’. With these codon assignments, the four signatures accurately output the binary numbers 000, 001, 010, and 011, proving that the universal template was translated into different codon sequences for storing different datasets.
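The stage-by-stage assignment can be pictured as a threshold on the blocking level. The sketch below uses the I/I0 values reported above for D010 and D101; the 23.5% threshold is an assumed midpoint chosen for illustration and is not a value from the disclosure. It also assumes the signature has already been segmented into stages, which for codon repeats relies on the inter-codon markers.

```python
def decode_stages(levels_percent: list[float], threshold: float = 23.5) -> str:
    """Assign each stage to '0' (CCCC, higher level) or '1' (TTTT, lower level).

    `threshold` is a hypothetical cut-off in percent I/I0.
    """
    return "".join("0" if level > threshold else "1" for level in levels_percent)


d010 = decode_stages([25.7, 21.4, 26.2])
d101 = decode_stages([21.6, 25.3, 22.5])
```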
In summary, a frameshift strategy for encoding different data into a universal template was shown for the first time, establishing a model of a DNA hard drive capable of rapid, synthesis-free data writing, retrieval, and rewriting.
Examples 12-15 are directed to frameshift encoding and decoding multinary data in native DNA sequences.
This example presents a method for frameshift encoding data and reading the encoded data stored in native sequences via blockers that bind to address sections, encoders that allow for frameshift encoding, and nanopore sequencing.
Frameshift Encoding/NP Unzip-Seq decoding of multinary data using sequences from native DNAs was validated. Four different segments were randomly truncated from M13mp18 DNA, a popular template for DNA origami construction (
AAGCGGTGCcggaaagctggctgGGTTCGCAGaattgggaatcaactgt
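Coding window | Codon i1 | Codon i2 | Codon i3 | Codon i4 | Codon i5 | Codon i6 |
---|---|---|---|---|---|---|
1 | GTTT | TTTT | TTTA | TTAC | TACA | ACAA |
2 | ACGT | CGTT | GTTA | TTAC | TACC | ACCC |
3 | GTGC | GGTG | CGGT | GCGG | AGCG | AAGC |
4 | GGTT | GTTC | TTCG | TCGC | CGCA | GCAG |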
The goal was to identify four distinguishable codons among six by NP Unzip-Seq to realize quaternary data encoding. A total of 4×6 different codons formed by their staples were detected in six tests (Table 7).
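Coding window | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Test 6 |
---|---|---|---|---|---|---|
1 | GTTT | TTTT | TTTA | TTAC | TACA | ACAA |
2 | ACGT | CGTT | GTTA | TTAC | TACC | ACCC |
3 | GTGC | GGTG | CGGT | GCGG | AGCG | AAGC |
4 | GGTT | GTTC | TTCG | TCGC | CGCA | GCAG |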
In each test, four staples were selected, one from each panel, to sequentially ‘write’ four codons in the template, equivalent to encoding a four-digit quaternary dataset. For example, in Test 1, Staple11 (SEQ ID NO: 32), Staple21 (SEQ ID NO: 38), Staple31 (SEQ ID NO: 44) and Staple41 (SEQ ID NO: 50) were used to generate Codon11 (GTTT), Codon21 (ACGT), Codon31 (GTGC) and Codon41 (GGTT), and in Test 2, Staple12 (SEQ ID NO: 33), Staple22 (SEQ ID NO: 39), Staple32 (SEQ ID NO: 45) and Staple42 (SEQ ID NO: 51) were used to generate Codon12 (TTTT), Codon22 (CGTT), Codon32 (GGTG) and Codon42 (GTTC). For coding windows 1, 2, and 4, the six staples in the panel were tested in the order of +1 (5′→3′ 1-nt step) frameshift. For example, Staple11 used the encoder ATGTT to generate Codon11 (GTTT). From Staple12 to Staple16 (SEQ ID NOs: 33-37), the encoders were successively shortened by one base from the 3′ end to TGTT, GTT, TT, T, and null. As a result, the codon frame shifted base by base in the 5′→3′ direction to generate five new codons, Codon12 through Codon16 (TTTT, TTTA, TTAC, TACA, and ACAA). For coding window 3, the six staples in the panel were tested in the order of −1 (3′→5′ 1-nt step) frameshift.
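The 1-nt frameshift panels described above can be reproduced, by way of illustration only, with the short sketch below. The 9-nt window sequences for coding windows 3 and 4 appear in upper case in the segment listed above; the sequences for coding windows 1 and 2 are inferred here from their codon lists and are therefore an assumption. The names `frameshift_codons`, `WINDOWS`, and `codons_for_test` are hypothetical.

```python
CODON_LEN = 4


def frameshift_codons(window: str, direction: int = 1) -> list:
    """List every codon readable from `window` by 1-nt frameshifts.

    direction=+1 lists the codons in the 5'->3' order,
    direction=-1 lists them in the 3'->5' order.
    """
    codons = [window[i:i + CODON_LEN] for i in range(len(window) - CODON_LEN + 1)]
    return codons if direction > 0 else codons[::-1]


# (window sequence, frameshift order); windows 1 and 2 are inferred, see lead-in
WINDOWS = {
    1: ("GTTTTACAA", +1),
    2: ("ACGTTACCC", +1),
    3: ("AAGCGGTGC", -1),
    4: ("GGTTCGCAG", +1),
}

PANELS = {w: frameshift_codons(seq, d) for w, (seq, d) in WINDOWS.items()}


def codons_for_test(test_number: int) -> list:
    """Codons written in Test k: the k-th codon of each coding window."""
    return [PANELS[w][test_number - 1] for w in sorted(PANELS)]


for k in range(1, 7):
    print(f"Test {k}:", codons_for_test(k))
```

Each 9-nt window yields six codons, and selecting the k-th codon of every window gives the four sequential codons written in Test k.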
The nanopore current signatures for all the six template⋅staples complexes revealed four stages, as identified from their characteristic signature patterns, including blocking levels and noise, and/or inter-codon markers (
In Test 1, the nanopore signature showed four stages that were separated by the blocking levels, 10.1±1.7%, 15.5±1.1%, 19.1±0.3%, and 16.7±0.7%, suggesting that the nanopore sequentially read the four different codons, Codon11 (GTTT), Codon21 (ACGT), Codon31 (GTGC), and Codon41 (GGTT) formed in the template. Similar findings were also obtained from Test 2 and Tests 4-6. The Test 3 signature revealed only three stages based on the blocking levels, with the first and last stages at 16.5±0.8% and 22.6±0.1%. However, the middle stage was split by an inter-strand marker (marked by triangle) into two separate stages with nearly identical blocking levels at 19.8±0.2% and 19.6±0.4%. Therefore, again, the universal inter-strand marker was proven to be a powerful codon identifier, which was jointly used with the blocking levels to accurately identify the four stages. This result demonstrated the writing of the four sequential codons, Codon13 (TTTA), Codon23 (GTTA), Codon33 (CGGT), and Codon43 (TTCG) in the template. Overall, the four stages in all the six signatures were assigned to the four sequential codons, confirming the capability of NP-Unzip-Seq in sequentially reading various codons in the template by unzipping each template⋅staple duplex. In conclusion, all the four 9-nt coding windows in the universal template were proven to be able to generate a panel of six different codons by Frameshift Encoding.
This example presents statistics to show how codons are discriminable.
The blocking levels (I/I0) of all the six codons that were written in a coding window were presented in the order of blocking level from low to high (
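Coding window | Lowest | 2 | 3 | 4 | 5 | Highest |
---|---|---|---|---|---|---|
1 | GTTT | TTTA | TTTT | ACAA | TTAC | TACA |
2 | ACGT | TACC | GTTA | TTAC | ACCC | CGTT |
3 | GCGG | GTGC | CGGT | AAGC | GGTG | AGCG |
4 | GGTT | GTTC | TTCG | CGCA | TCGC | GCAG |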
All pairs of codons were analyzed by Tukey's multiple comparison test to determine their discrimination capability and were ranked as highly discriminable (p<0.001), discriminable (p<0.05), and indiscriminate (NS, not significant) (
Note that all the comparisons not shown in
This example presents a method for identifying and selecting coding window sequences for frameshift encoding of multinary data.
Frameshift encoding was applied in writing, re-writing, and retrieving multinary data in a long native DNA that functions as a hard drive. To use a native DNA as the universal template, it was necessary to identify Coding Window sequences according to the criteria, including capacity, step and conductance difference between two codons formed in a Coding Window (
To encode an n-nary dataset (n is 2, 4, 8, and 16 for binary, quaternary, octal, and hexadecimal data, respectively), each coding window must be able to form n different (nanopore-discriminable) codons by frameshift, i.e., the Capacity is n. The resulting coding window sequences were also dependent on the number of bases for each shift, i.e., the Step. The Step can be set to 1 but is not limited to this value, and a search with multiple Step values resulted in more qualified candidate sequences. If the frameshift step is set to 1-nt, a coding window needs to contain at least n+3 nucleotides. Most importantly, the nanopore current difference produced between any two codons should be larger than a cut-off value ΔIshift, a measure of discriminability at the instrument resolution. A larger ΔIshift increased codon discrimination accuracy but also yielded fewer qualified candidates.
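The relationship between Capacity, Step, and coding window length can be illustrated with the short sketch below. The 1-nt case (n+3 nucleotides) is stated above; the generalization to other Step values, codon length + (Capacity − 1) × Step, is an assumption made here for illustration, and the function name `min_window_length` is hypothetical.

```python
def min_window_length(capacity: int, step: int = 1, codon_len: int = 4) -> int:
    """Shortest coding window that can form `capacity` codons by frameshift.

    For step=1 and 4-base codons this reduces to capacity + 3, as stated
    in the text; other step values are an illustrative generalization.
    """
    if capacity < 1 or step < 1 or codon_len < 1:
        raise ValueError("capacity, step and codon_len must be positive")
    return codon_len + (capacity - 1) * step


for name, n in (("binary", 2), ("quaternary", 4), ("octal", 8), ("hexadecimal", 16)):
    print(f"{name}: {min_window_length(n)} nt at a 1-nt step")
```

Under the same assumption, a binary window written with a 5-base shift would need 4 + 5 = 9 nucleotides, and a six-codon window at a 1-nt step would also need 9 nucleotides.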
First, the selection of qualified coding window sequences for encoding quaternary data was shown. The sequence of M13mp18 (7249 nucleotides) was obtained from New England Biolabs' website (SEQ ID NO: 56). The search criteria for Coding Window sequences were: (i) Codon length=4 bases, (ii) Capacity=4, (iii) Step=1-nt, and (iv) ΔIshift=6 pA. All the coding window sequences satisfying these criteria were 7-nt long and formed 4 codons by frameshift, with a current difference greater than ΔIshift=6 pA between any two of the four codons. Current differences were calculated based on the current levels of all 256 quadromers obtained from previous work (Laszlo et al., 2014), which used the enzyme phi29 DNAP to control stepwise ssDNA translocation and measured currents at 180 mV in 0.3 M KCl on both sides at pH 8.0, a different condition from that in the current work. The screening based on the above conditions identified 78 7-nt coding window sequences, which were highlighted in the M13 DNA sequence (
The program was further used to simulate the coding window screening in M13 DNA for octal encoding. The search criteria for Coding Window sequences were: (i) Codon length=4 bases, (ii) Capacity=8, (iii) Step=1-nt, and (iv) ΔIshift=1 pA. Each coding window should contain 8+3=11 nucleotides. Under these conditions, 201 coding windows were identified and highlighted in the M13 DNA sequence (
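A search of this kind may be sketched as follows. This is not the program used in the present work; it is a minimal illustration under stated assumptions. The function name `find_coding_windows` is hypothetical, and `TOY_CURRENT` is a placeholder lookup table invented purely so that the sketch runs; it does not contain measured quadromer current levels, which would have to be supplied from experimental data.

```python
from itertools import combinations, product


def find_coding_windows(seq, current_pa, capacity, delta_i_shift, step=1, codon_len=4):
    """Return (start, window) pairs whose frameshift codons all differ
    pairwise by more than `delta_i_shift` in the supplied current table."""
    length = codon_len + (capacity - 1) * step
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        codons = [window[i:i + codon_len] for i in range(0, capacity * step, step)]
        if any(codon not in current_pa for codon in codons):
            continue  # skip windows containing bases outside the table
        levels = [current_pa[codon] for codon in codons]
        if all(abs(a - b) > delta_i_shift for a, b in combinations(levels, 2)):
            hits.append((start, window))
    return hits


# Placeholder table (NOT measured data): each quadromer is given the sum of
# arbitrary per-base weights so that the example is deterministic.
WEIGHT = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}
TOY_CURRENT = {
    "".join(q): sum(WEIGHT[b] for b in q) for q in product("ACGT", repeat=4)
}

hits = find_coding_windows("AAAAAT", TOY_CURRENT, capacity=2, delta_i_shift=2.5)
print(hits)
```

With a table of measured current levels in place of `TOY_CURRENT`, the same call with capacity=4 and delta_i_shift=6, or capacity=8 and delta_i_shift=1, would correspond to the quaternary and octal criteria listed above.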
This example presents an example of multinary data encoding and decoding by NP Unzip-Seq.
A data DNA was designed with eight different codons (D-Octal) to simulate octal encoding (
In summary, the above findings verified that NP Unzip-Seq can discriminate sequential multinary codons, making the selected codon panel or its sub-panels a potential multinary encoding/decoding system. For example, all eight codons were used to represent eight states for octal encoding, while selected codons, such as GATG, CGAA, CGTC, and TTTT, formed a sub-panel to encode quaternary information, with broad applications such as image storage, where each codon encodes a grey level or a color. As such, the nanopore functions as a multi-pixel image decoder (
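The image-storage idea mentioned above can be illustrated with a minimal sketch in which each pixel of a four-level greyscale image is mapped to one codon of the quaternary sub-panel. The four codons are those named above; the assignment of a particular grey level to a particular codon is arbitrary and is an assumption made here, as are the function names `encode_pixels` and `decode_codons`.

```python
# Codons of the quaternary sub-panel; the level-to-codon order is arbitrary here.
SUB_PANEL = ["GATG", "CGAA", "CGTC", "TTTT"]
LEVEL_TO_CODON = dict(enumerate(SUB_PANEL))
CODON_TO_LEVEL = {codon: level for level, codon in LEVEL_TO_CODON.items()}


def encode_pixels(levels):
    """Map grey levels (0-3) to the codons that would be written."""
    return [LEVEL_TO_CODON[level] for level in levels]


def decode_codons(codons):
    """Map sequentially read codons back to grey levels."""
    return [CODON_TO_LEVEL[codon] for codon in codons]


row = [0, 3, 1, 2]
written = encode_pixels(row)
print(written)
print(decode_codons(written))
```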
This example demonstrates advantages of this invention as a synthesis-free DNA data writing/reading strategy.
This method utilized a set of unmodified staples to selectively recode (or translate) the template sequence into different codon sequences for writing various binary and multinary target datasets. It offered high precision and capacity in data writing by single-base manipulation. Shifting a single base was sufficient to generate completely different codons, allowing a short coding window to generate the desired number of codons for multinary data writing. For example, a short 5-, 7-, 11-, or 19-nt (n) coding window in the template sequence could generate 2, 4, 8, or 16 (n-3) quadromeric codons by 1-nt step frameshifting, to represent all bit values in binary, quaternary, octal, or hexadecimal data. Conceptually different from other data writing approaches, Frameshift Encoding does not need an enzymatic or chemical synthesis of long DNA. This hybridization-based writing strategy only needs a universal DNA template and a staple pool, which can then be used for any data writing, and is thus both rapid and cost-saving. First, since this data storage method encoded data by exposing the information hidden in the universal DNA, there was no need to introduce protein tags or other labels to produce the coding signal. Second, it did not need to read the data by next-generation sequencing, which is still a sequencing-by-synthesis technology. Third, the capacity of the data writing was highly enhanced by frameshifting multinary codons and the small size of the staples (20-30 nt). Fourth, it did not involve any enzyme, thus eliminating the concern about enzyme specificity and efficiency in both the writing and reading processes. Lastly, the simple mix-then-read mode without any chemical or enzymatic reaction further significantly decreased the time and cost. Overall, Frameshift Encoding represents a new model of a DNA hard drive that can use native long genome DNAs as templates for synthesis-free, label-free, rapid, low-cost, rewritable, high-density, multinary information storage.
This method was used to develop native DNAs as universal templates. The single-stranded M13 DNA remains a preferred model for early-stage exploration because data can be written directly by hybridization with staples. In other systems, long single-stranded templates are extracted from double-stranded DNAs by approaches such as asymmetric PCR, denaturing electrophoresis, and enzymatic digestion. The most important issue is identifying coding windows in the native DNA sequence for Frameshift Encoding. To write a multinary (n=2, 4, 8, and 16) dataset by frameshifting, each coding window needs to generate n different codons, and these codons need to be nanopore-discriminable. Therefore, it is a priority to screen all the 4^4=256 quadromeric codons in the nanopore, characterizing their signatures and evaluating their nanopore discriminability, which requires high-throughput nanopore devices because of the large number of parallel tests (over 1,000). The outcome from such a screening test would be a 256×256 discriminability chart, in which each codon pair is ranked as discriminable or indiscriminate, useful for coding window design and search. The result would also vary with different detection methods and conditions, such as the salt concentration, pH, and the voltage applied. Simulation work envisions a process for an automatic large-scale search of qualified coding windows in native DNA (
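For illustration, the size of the proposed screening and one possible way of consulting a discriminability chart are sketched below. The chart representation (a mapping from an unordered codon pair to True or False) and the function name `window_qualifies` are assumptions made here; the single chart entry shown reflects only the CCCC/TTTT pair described above as nanopore-discriminable, and any untested pair is conservatively treated as not discriminable.

```python
from itertools import combinations, product

# All 4^4 quadromeric codons and all unordered codon pairs.
CODONS = ["".join(p) for p in product("ACGT", repeat=4)]
PAIRS = list(combinations(CODONS, 2))


def window_qualifies(codons, chart):
    """True if every pair of codons formed in a window is marked
    discriminable in `chart` (frozenset of two codons -> bool).
    Pairs missing from the chart are treated as not discriminable."""
    if len(set(codons)) != len(codons):
        return False  # a repeated codon cannot be told apart from itself
    return all(chart.get(frozenset(pair), False) for pair in combinations(codons, 2))


toy_chart = {frozenset(("CCCC", "TTTT")): True}

print(len(CODONS), len(PAIRS))
print(window_qualifies(["CCCC", "TTTT"], toy_chart))
```

A full chart would hold one entry for each of the 32,640 unordered pairs among the 256 codons and would be specific to the detection method and conditions used.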
When introducing elements of the present disclosure or the preferred embodiment(s) thereof, the articles “a”, “an”, “the” and “said” are intended to mean that there are one or more of the elements. The terms “comprising”, “including” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.
In view of the above, it will be seen that the several objects of the disclosure are achieved and other advantageous results attained.
As various changes could be made in the above methods, processes, and compositions without departing from the scope of the disclosure, it is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.
This application claims priority from U.S. Provisional Patent Application No. 63/025,402, filed May 15, 2020, which is incorporated herein by reference in its entirety.
This invention was made with Government support under NIH BM114204 and NIH HG009338 awarded by the National Institutes of Health. The Government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US21/32538 | 5/14/2021 | WO |
Number | Date | Country | |
---|---|---|---|
63025402 | May 2020 | US |