The instant application contains a Sequence Listing which has been submitted electronically as a text file in ASCII format and is hereby incorporated by reference in its entirety. The Sequence Listing was created on Feb. 20, 2023, is named “22-0232-US_SequenceListing.xml” and is 4 kilobytes in size.
The present disclosure relates to DNA-based data storage systems, and methods of preparing, using and reading the same.
DNA is emerging as a robust data storage medium that offers ultrahigh storage densities greatly exceeding conventional magnetic and optical recorders. Information stored in DNA can be copied in a massively parallel manner and selectively retrieved via polymerase chain reaction (PCR). However, existing DNA storage systems suffer from high latency caused by the inherently sequential writing process. Despite recent progress, a typical cycle time of solid-phase DNA synthesis is on the order of minutes, which limits the practical applications of this molecular storage platform. Using current technologies, writing 100 bits of information (or, roughly two words) requires nearly two hours and costs more than US$1, assuming that each nucleotide stores its theoretical maximum of two bits. To overcome these challenges, new synthesis methods and information encoding approaches are required to accelerate the speed of writing large-volume data sets (Fan J, Han F, Liu H. Challenges of Big Data analysis. National Science Review. 2014 Jun. 1; 1(2):293-314).
Expanding the alphabet of a DNA storage medium by including chemically modified DNA nucleotides can increase both the storage density and the writing speed because more than two bits are recorded during each synthesis cycle. However, designing chemically modified nucleotides as new letters for the DNA storage alphabet must be tightly coupled to the process of reading the encoded information via DNA sequencing, because current DNA sequencing methods, including single-molecule nanopore sequencing, have been developed and optimized to read natural nucleotides. Prior work reported an expanded nucleic acid alphabet of synthetic DNA and RNA nucleotides that can be replicated and transcribed using biological enzymes (Hoshika S, Leal N A, Kim M-J, Kim M-S, Karalkar N B, Kim H-J, et al. Hachimoji DNA and RNA: A genetic system with eight building blocks. Science. 2019 Feb. 22; 363(6429):884-7), but this alphabet was not designed for molecular storage applications and was not accurately read using a nucleic acid sequencing method. Aerolysin nanopores were used to detect synthetic polymers flanked by adenosines, where each monomer of the polymer carries one bit of information (Cao C, Krapp L F, Al Ouahabi A, Konig N F, Cirauqui N, Radenovic A, et al. Aerolysin nanopores decode digital information stored in tailored macromolecular analytes. Sci Adv. 2020 December; 6(50): eabc2661). Recently, it was reported that a base pair containing a single chemically modified nucleotide can be detected using biological nanopores (Ledbetter M P, Craig J M, Karadeema R J, Noakes M T, Kim H C, Abell S J, et al. Nanopore Sequencing of an Expanded Genetic Alphabet Reveals High-Fidelity Replication of a Predominantly Hydrophobic Unnatural Base Pair. J Am Chem Soc. 2020 Feb. 5; 142(5):2110-4). Despite recent advances, single-molecule detection and sequencing of an expanded molecular alphabet based on a library of chemically diverse modified nucleotides has not yet been demonstrated.
Accordingly, there remains a need to develop new DNA-based storage systems along with efficient, high-fidelity methods of decoding.
The present disclosure concerns DNA-based storage systems incorporating synthetic DNA nucleotides. This approach allows high-density information storage. Further, methods of accurately reading novel sequences composed of mixtures of synthetic and natural DNA nucleotides are demonstrated.
Accordingly, one aspect of the present disclosure is DNA data storage systems comprising a covalently linked sequence of nucleotides, wherein the sequence of nucleotides comprises a modification region, wherein the nucleotides comprise synthetic nucleotides.
In another aspect, the present disclosure provides for methods of reading a DNA sequence, the method comprising:
In another aspect, the present disclosure provides for methods of training a neural network comprising:
Other aspects of the disclosure will be apparent to those skilled in the art in view of the description that follows.
Here, an expanded molecular alphabet for DNA data storage comprising four natural and seven chemically modified nucleotides is disclosed that is readily detected and distinguished using nanopore sequencers (
Accordingly, one aspect of the present disclosure is DNA data storage systems comprising a covalently linked sequence of nucleotides, wherein the sequence of nucleotides comprises a modification region, wherein the nucleotides comprise synthetic nucleotides. For example, in various embodiments as otherwise described herein, the synthetic nucleotides are each independently of the formula:
wherein R is H, or is a heterocycle. For example, in particular embodiments wherein R is not H, R is capable of making at least one hydrogen bond to a natural nucleotide. Motifs that are suitable for hydrogen bonding to natural nucleotides are known in the art, and the skilled person would be able to ascertain, in light of the present disclosure, whether a particular group is capable of hydrogen bonding to a natural nucleotide. For example, suitable hydrogen-bond forming groups include heterocycles comprising an electronegative element, such as N, O, or S.
As otherwise described herein, in various embodiments, synthetic nucleotides are those that are structurally distinct from natural nucleotides, such that they give a distinct signal when read according to methods as described herein.
In certain embodiments as otherwise described herein, R is H or a nitrogen-containing heterocycle, wherein the heterocycle is monocyclic or fused bicyclic (e.g., an optionally substituted heterocycle). For example, in particular embodiments, R is H,
Unless otherwise indicated herein, the disclosed structures contemplate any suitable salts thereof. In various embodiments as otherwise described herein, the heterocycles as disclosed herein may be optionally substituted, e.g., substituted with 0-3 R groups. For example, in some embodiments, each R is halogen, —NO2, —CN, C1-C10 alkyl, C1-C10 haloalkyl, —NH2, —NH(C1-C10 alkyl), —N(C1-C10 alkyl)2, —OH, C1-C10 alkoxy, C1-C10 haloalkoxy, —SH, hydroxy(C1-C10 alkyl), alkoxy(C1-C10 alkyl), amino(C1-C10 alkyl), —CONH2, —CONH(C1-C10 alkyl), —CON(C1-C10 alkyl)2, —OC(O)NH2, —OC(O)NH(C1-C10 alkyl), —OC(O)N(C1-C10 alkyl)2, —CO2H, —CO2(C1-C10 alkyl), —CHO, —CO(C1-C10 alkyl), or —OC(O)(C1-C10 alkyl). As used herein, each alkyl group is optionally substituted with 1-5 RA groups, wherein each RA is halogen, —NO2, —CN, —NH2, —OH, —CO2H, or —CONH2.
Heterocycles as described herein may be heteroaromatic cycles or heterocycloalkyl moieties. The term “heteroaryl” refers to an aromatic ring system containing at least one aromatic heteroatom selected from nitrogen, oxygen and sulfur in an aromatic ring. Most commonly, the heteroaryl groups will have 1, 2, 3, or 4 heteroatoms. The heteroaryl may be fused to one or more non-aromatic rings, for example, cycloalkyl or heterocycloalkyl rings, wherein the cycloalkyl and heterocycloalkyl rings are described herein. In one embodiment of the present compounds the heteroaryl group is bonded to the remainder of the structure through an atom in a heteroaryl group aromatic ring. In another embodiment, the heteroaryl group is bonded to the remainder of the structure through a non-aromatic ring atom. Examples of heteroaryl groups include pyridyl, pyrimidinyl, quinolinyl, benzothienyl, indolyl, indolinyl, pyridazinyl, pyrazinyl, isoindolyl, isoquinolyl, quinazolinyl, quinoxalinyl, phthalazinyl, imidazolyl, isoxazolyl, pyrazolyl, oxazolyl, thiazolyl, indolizinyl, indazolyl, benzothiazolyl, benzimidazolyl, benzofuranyl, furanyl, thienyl, pyrrolyl, oxadiazolyl, thiadiazolyl, benzo[1,4]oxazinyl, triazolyl, tetrazolyl, isothiazolyl, naphthyridinyl, isochromanyl, chromanyl, isoindolinyl, isobenzothienyl, benzoxazolyl, pyridopyridinyl, purinyl, benzodioxolyl, triazinyl, pteridinyl, benzothiazolyl, imidazopyridinyl, imidazothiazolyl, benzisoxazinyl, benzoxazinyl, benzopyranyl, benzothiopyranyl, chromonyl, chromanonyl, pyridinyl-N-oxide, isoindolinonyl, benzodioxanyl, benzoxazolinonyl, pyrrolyl N-oxide, pyrimidinyl N-oxide, pyridazinyl N-oxide, pyrazinyl N-oxide, quinolinyl N-oxide, indolyl N-oxide, indolinyl N-oxide, isoquinolyl N-oxide, quinazolinyl N-oxide, quinoxalinyl N-oxide, phthalazinyl N-oxide, imidazolyl N-oxide, isoxazolyl N-oxide, oxazolyl N-oxide, thiazolyl N-oxide, indolizinyl N-oxide, indazolyl N-oxide, benzothiazolyl N-oxide, benzimidazolyl N-oxide, pyrrolyl N-oxide, oxadiazolyl N-oxide, thiadiazolyl N-oxide, triazolyl N-oxide, tetrazolyl N-oxide, benzothiopyranyl S-oxide, and benzothiopyranyl S,S-dioxide. Preferred heteroaryl groups include pyridyl, pyrimidyl, quinolinyl, indolyl, pyrrolyl, furanyl, thienyl, imidazolyl, pyrazolyl, indazolyl, thiazolyl and benzothiazolyl. In certain embodiments, each heteroaryl is selected from pyridyl, pyrimidinyl, pyridazinyl, pyrazinyl, imidazolyl, isoxazolyl, pyrazolyl, oxazolyl, thiazolyl, furanyl, thienyl, pyrrolyl, oxadiazolyl, thiadiazolyl, triazolyl, tetrazolyl, isothiazolyl, pyridinyl-N-oxide, pyrrolyl N-oxide, pyrimidinyl N-oxide, pyridazinyl N-oxide, pyrazinyl N-oxide, imidazolyl N-oxide, isoxazolyl N-oxide, oxazolyl N-oxide, thiazolyl N-oxide, oxadiazolyl N-oxide, thiadiazolyl N-oxide, triazolyl N-oxide, and tetrazolyl N-oxide. The heteroaryl groups herein are unsubstituted or, when specified as “optionally substituted”, can unless stated otherwise be substituted in one or more substitutable positions with various groups, as indicated.
The term “heterocycloalkyl” refers to a non-aromatic ring or ring system containing at least one heteroatom that is preferably selected from nitrogen, oxygen and sulfur, wherein said heteroatom is in a non-aromatic ring. The heterocycloalkyl may have 1, 2, 3 or 4 heteroatoms. The heterocycloalkyl may be saturated (i.e., a heterocycloalkyl) or partially unsaturated (i.e., a heterocycloalkenyl). Heterocycloalkyl includes monocyclic groups of three to eight annular atoms as well as bicyclic and polycyclic ring systems, including bridged and fused systems, wherein each ring includes three to eight annular atoms. The heterocycloalkyl ring is optionally fused to other heterocycloalkyl rings and/or non-aromatic hydrocarbon rings. In certain embodiments, the heterocycloalkyl groups have from 3 to 7 members in a single ring. In other embodiments, heterocycloalkyl groups have 5 or 6 members in a single ring. In some embodiments, the heterocycloalkyl groups have 3, 4, 5, 6 or 7 members in a single ring. Examples of heterocycloalkyl groups include, for example, azabicyclo[2.2.2]octyl (in each case also “quinuclidinyl” or a quinuclidine derivative), azabicyclo[3.2.1]octyl, 2,5-diazabicyclo[2.2.1]heptyl, morpholinyl, thiomorpholinyl, thiomorpholinyl S-oxide, thiomorpholinyl S,S-dioxide, 2-oxazolidonyl, piperazinyl, homopiperazinyl, piperazinonyl, pyrrolidinyl, azepanyl, azetidinyl, pyrrolinyl, tetrahydropyranyl, piperidinyl, tetrahydrofuranyl, tetrahydrothienyl, 3,4-dihydroisoquinolin-2(1H)-yl, isoindolindionyl, homopiperidinyl, homomorpholinyl, homothiomorpholinyl, homothiomorpholinyl S,S-dioxide, oxazolidinonyl, dihydropyrazolyl, dihydropyrrolyl, dihydropyrazinyl, dihydropyridinyl, dihydropyrimidinyl, dihydrofuryl, dihydropyranyl, imidazolidonyl, tetrahydrothienyl S-oxide, tetrahydrothienyl S,S-dioxide and homothiomorpholinyl S-oxide. 
Especially desirable heterocycloalkyl groups include morpholinyl, 3,4-dihydroisoquinolin-2(1H)-yl, tetrahydropyranyl, piperidinyl, aza-bicyclo[2.2.2]octyl, γ-butyrolactonyl (i.e., an oxo-substituted tetrahydrofuranyl), γ-butyrolactamyl (i.e., an oxo-substituted pyrrolidine), pyrrolidinyl, piperazinyl, azepanyl, azetidinyl, thiomorpholinyl, thiomorpholinyl S,S-dioxide, 2-oxazolidonyl, imidazolidonyl, isoindolindionyl, and piperazinonyl. The heterocycloalkyl groups herein are unsubstituted or, when specified as “optionally substituted”, can unless stated otherwise be substituted in one or more substitutable positions with various groups, as indicated.
Terms used herein may be preceded and/or followed by a single dash, “-”, or a double dash, “=”, to indicate the bond order of the bond between the named substituent and its parent moiety; a single dash indicates a single bond and a double dash indicates a double bond. In the absence of a single or double dash it is understood that a single bond is formed between the substituent and its parent moiety; further, substituents are intended to be read “left to right” (i.e., the attachment is via the last portion of the name) unless a dash indicates otherwise. For example, C1-C6alkoxycarbonyloxy and —OC(O)OC1-C6alkyl indicate the same functionality; similarly arylalkyl and -alkylaryl indicate the same functionality.
The term “alkenyl” as used herein, means a straight or branched chain hydrocarbon containing from 2 to 10 carbons, unless otherwise specified, and containing at least one carbon-carbon double bond. Representative examples of alkenyl include, but are not limited to, ethenyl, 2-propenyl, 2-methyl-2-propenyl, 3-butenyl, 4-pentenyl, 5-hexenyl, 2-heptenyl, 2-methyl-1-heptenyl, 3-decenyl, and 3,7-dimethylocta-2,6-dienyl.
The term “alkoxy” as used herein, means an alkyl group, as defined herein, appended to the parent molecular moiety through an oxygen atom. Representative examples of alkoxy include, but are not limited to, methoxy, ethoxy, propoxy, 2-propoxy, butoxy, tert-butoxy, pentyloxy, and hexyloxy.
The term “alkyl” as used herein, means a straight or branched chain hydrocarbon containing from 1 to 10 carbon atoms unless otherwise specified. Representative examples of alkyl include, but are not limited to, methyl, ethyl, n-propyl, iso-propyl, n-butyl, sec-butyl, iso-butyl, tert-butyl, n-pentyl, isopentyl, neopentyl, n-hexyl, 3-methylhexyl, 2,2-dimethylpentyl, 2,3-dimethylpentyl, n-heptyl, n-octyl, n-nonyl, and n-decyl. When an “alkyl” group is a linking group between two other moieties, then it may also be a straight or branched chain; examples include, but are not limited to —CH2—, —CH2CH2—, —CH2CH2CHC(CH3)—, and —CH2CH(CH2CH3)CH2—.
The term “halo” or “halogen” as used herein, means —Cl, —Br, —I or —F. For example, in certain embodiments, halogen is —F.
In certain applications, the addition of bulky groups to the sequence of nucleotides may aid in their application, for example by preventing complete translocation through nanopores. Accordingly, in various embodiments as otherwise described herein, the sequence of nucleotides further comprises biotin, for example, a 5′-bound biotin. In particular embodiments, the sequence of nucleotides further comprises streptavidin bound to a 5′-bound biotin.
As described herein, calibration of the DNA sequence can be used to assist in data storage and recovery. Accordingly, in certain embodiments as otherwise described herein, the covalently linked sequence of nucleotides comprises a calibration region. For example, the calibration region may be a known sequence so that a known signal will be read in order to standardize or otherwise calibrate signal output. For example, in particular embodiments, the calibration region comprises a poly-A region.
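By way of non-limiting illustration, the use of a calibration region to standardize signal output can be sketched as follows. The sketch assumes that a read is available as a list of current samples, that the samples belonging to the calibration region (e.g., a poly-A region) have already been located, and that a simple additive offset is sufficient; the function name `calibrate`, the reference level, and the offset model are hypothetical and are not taken from the disclosure.

```python
from statistics import mean

def calibrate(trace, calib_slice, reference_level):
    """Shift a current trace so its calibration region sits at a known level.

    trace: list of current samples (arbitrary units).
    calib_slice: slice selecting the samples of the calibration region
                 (assumed to have been located beforehand).
    reference_level: the level the known calibration sequence is expected
                     to produce (a hypothetical, user-supplied value).
    """
    observed = mean(trace[calib_slice])
    offset = reference_level - observed  # additive drift correction (assumption)
    return [sample + offset for sample in trace]

# Toy read: the first two samples come from the calibration region.
raw_trace = [10.0, 10.0, 14.0, 16.0]
calibrated = calibrate(raw_trace, slice(0, 2), reference_level=8.0)
```

In this toy read the calibration samples are moved to the reference level and the remaining samples are shifted by the same amount, so that reads with different baselines become comparable.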
As described herein, the DNA sequence may contain a plurality of synthetic nucleotides. In various embodiments, the synthetic nucleotides are of a variety of structures, and each structure may or may not be repeated, for example to encode information. Accordingly, in certain embodiments as otherwise described herein, the DNA data storage system comprises at least 2 and no more than 10 distinct synthetic nucleotides. For example, in some embodiments, the DNA data storage system comprises 2-8 distinct synthetic nucleotides, or 3-8 distinct synthetic nucleotides, or 4, 5, 6, or 7 distinct synthetic nucleotides. In various embodiments, the synthetic nucleotides may be provided in sequence with natural nucleotides, for example, wherein the modification region comprises both synthetic nucleotides and natural nucleotides.
In another aspect, the present disclosure provides for methods of reading a DNA sequence, the method comprising:
As described herein, a neural network is a type of machine learning algorithm that can be modeled after the structure of the human brain. In such scenarios, the neural network may include a plurality of interconnected nodes or neurons that process information and communicate with each other. The neural network may include three main types of layers: input, hidden, and output. The input layer is where the data is initially fed into the network, the output layer produces the final output or prediction, and the hidden layer(s) are where the majority of the computation takes place. Each neuron in the network takes in inputs from other neurons, applies a mathematical function to these inputs, and produces an output that is sent to other neurons in the network.
Each neuron is associated with a set of weights, which are parameters that determine the strength and direction of the connections between neurons. When an input signal is received by a neuron, it is multiplied by the weights associated with that neuron, and the resulting value is passed through an activation function to produce the output of the neuron.
During a training process, the neural network adjusts the weights and biases of its neurons in order to minimize the difference between its predictions and the actual output. The training process may include a process called backpropagation, which involves propagating errors backwards through the network and adjusting the weights and biases accordingly.
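By way of non-limiting illustration, the weight and bias adjustment described above can be sketched for a single neuron. The sketch assumes a sigmoid activation function, a squared-error loss, a fixed learning rate, and a toy data set (logical OR); none of these choices is taken from the disclosure, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    """Weighted sum of the inputs passed through the activation function."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_neuron(samples, learning_rate=0.5, epochs=2000):
    """Fit one sigmoid neuron to (inputs, target) pairs by gradient descent."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            out = predict(weights, bias, inputs)
            # Loss L = 0.5 * (out - target)^2. The chain rule gives
            # dL/dz = (out - target) * out * (1 - out); this is the error
            # term that is propagated backwards to the weights and bias.
            delta = (out - target) * out * (1.0 - out)
            weights = [w - learning_rate * delta * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * delta
    return weights, bias

# Toy labeled examples (logical OR), chosen only for illustration.
OR_DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(OR_DATA)
```

In a multi-layer network the same error term is passed backwards through each layer in turn, which is the backpropagation step referred to above.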
In some examples, the neural network may be trained with training data. In such scenarios, training data could include a set of labeled examples that may teach a neural network how to make predictions or classifications. In some example embodiments, the data could include inputs and corresponding outputs, where the inputs represent the features or attributes of the data, and the outputs represent the desired outcome or label for each input. Additionally or alternatively, the neural network may be trained using unsupervised learning, where the training data consists of only the inputs, and the network learns to identify patterns and features in the data without explicit output labels.
In some embodiments, the neural network may include one or more convolutional layers. In such scenarios, the convolutional layer is a type of layer in a neural network that is designed to analyze data that has a grid-like structure, such as an image. The convolutional layer applies a set of filters, or kernels, to different parts of the input data, allowing the network to identify patterns and features in the data. In some embodiments, the filters in a convolutional layer are small matrices of weights that slide over the input data, performing element-wise multiplication and addition to produce a single output value for each location the filter is applied to. This process is known as a convolution operation. The resulting output of convolutional operation is called a feature map, which may contain information about the presence or absence of certain patterns or features in the input data.
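By way of non-limiting illustration, the convolution operation can be sketched in one dimension, which is the natural shape for a time series such as an ionic current trace. The sketch assumes a single filter, no padding, and a stride of one; the function name `conv1d` and the example values are hypothetical.

```python
def conv1d(signal, kernel):
    """Slide a filter over a 1-D signal (no padding, stride 1).

    At each position the filter weights are multiplied element-wise with
    the overlapping samples and summed, giving one feature-map value.
    """
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# Toy trace with a single step from level 1.0 to level 3.0.
current_trace = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
step_detector = [-1.0, 1.0]  # responds to a change between neighbours
feature_map = conv1d(current_trace, step_detector)
# feature_map == [0.0, 0.0, 2.0, 0.0, 0.0]
```

The single non-zero entry of the feature map marks the location of the step, illustrating how a filter reports the presence of a pattern at a given position.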
Convolutional layers may be followed by pooling layers, which downsample the feature maps by taking the maximum or average value of a small region of the feature map, allowing the network to focus on the most important features while reducing the dimensionality of the data.
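By way of non-limiting illustration, maximum and average pooling over a one-dimensional feature map can be sketched as follows. The sketch assumes non-overlapping windows and discards any incomplete trailing window; the function name `pool1d` is hypothetical.

```python
def pool1d(feature_map, size, mode="max"):
    """Downsample a 1-D feature map using non-overlapping windows."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    windows = [feature_map[i:i + size]
               for i in range(0, len(feature_map) - size + 1, size)]
    if mode == "max":
        return [max(window) for window in windows]
    return [sum(window) / len(window) for window in windows]

values = [0.0, 2.0, 1.0, 5.0, 4.0, 3.0]
max_pooled = pool1d(values, 2)             # [2.0, 5.0, 4.0]
avg_pooled = pool1d(values, 2, "average")  # [1.0, 3.0, 3.5]
```

Either mode halves the length of this feature map, which is the reduction in dimensionality described above.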
In some examples, the neural network may include one or more fully connected layers, also known as dense layers. A fully connected layer is a type of layer in a neural network where every neuron in the layer is connected to every neuron in the previous layer. In other words, the neurons in a fully connected layer receive input from all of the neurons in the previous layer.
The output of each neuron in a fully connected layer is calculated by taking a weighted sum of the inputs from the previous layer, and passing this sum through an activation function. The weights and biases associated with each neuron are learned during the training process, allowing the network to learn complex nonlinear relationships between the input and output.
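By way of non-limiting illustration, the computation performed by a fully connected layer can be sketched as follows. The sketch assumes a rectified linear (ReLU) activation function and hand-picked weights and biases; in practice these parameters would be learned during training, and the names used here are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def dense_layer(inputs, weights, biases, activation=relu):
    """Compute the outputs of a fully connected layer.

    weights[n] holds one weight per input for neuron n, so every neuron
    receives every output of the previous layer.
    """
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

inputs = [1.0, 2.0]
weights = [[0.5, -1.0],   # neuron 0
           [1.0, 1.0]]    # neuron 1
biases = [0.0, -1.0]
outputs = dense_layer(inputs, weights, biases)
# neuron 0: 0.5*1 - 1*2 + 0 = -1.5, ReLU gives 0.0
# neuron 1: 1*1 + 1*2 - 1 = 2.0,  ReLU gives 2.0
```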
In another aspect, the present disclosure provides for methods of training a neural network comprising:
In some embodiments, the neural network comprises a 1-dimensional residual neural network. In such scenarios, the 1-dimensional residual neural network could include:
The Examples that follow are illustrative of specific embodiments of the disclosure, and various uses thereof. They are set forth for explanatory purposes only, and should not be construed as limiting the scope of the disclosure in any way.
Results and Discussion
To determine whether natural and chemically modified DNA nucleotides can be distinguished using the biological nanopore MspA, a series of single-stranded DNA (ssDNA) molecules with the general sequence 5′-biotin-(dT)12-XXXX-(dT)24-3′, where X={A, T, C, G, B1-B7} was designed (
Following molecular design and synthesis of ssDNA oligos, MspA nanopore experiments were performed where ssDNA oligos containing streptavidin at the 5′ terminus were electrophoretically attracted inside MspA nanopores. The bulky streptavidin protein prevents the oligos from fully translocating through the pore without appreciably affecting the measured ionic currents. Consequently, ssDNA molecules are effectively immobilized within MspA nanopores, exposing the four nucleotides at positions 13-16 from the tethering point to the constriction of the MspA pore (
MspA nanopores were used to determine residual currents for homotetrameric sequences of all natural and chemically modified monomers (FIG. 2B3). Our results show that MspA accurately discriminates all four natural (A, G, C, T) and nearly all chemically modified nucleotides (B1-B7) at an applied bias of 150 mV. The abasic nucleotide B7 shows the largest residual current, which likely arises due to its small molecular size and reduced ability to interact with the reading head of MspA. The residual current levels are sensitive to the chemical identity of the nucleotides but do not directly correlate with their molecular size (
MspA was further used to detect and identify heterotetrameric sequences with compositions 2X+2Y, where X, Y={B2, B3, B4, B5} (
The ability of MspA pores to resolve different tetramers containing both natural and chemically modified nucleotides is also described (
In theory, sequence context allows for high-resolution readout of arbitrary combinations and arrangements of natural and modified nucleotides (A, C, G, T, B1-B7). Although specific sets of tetramers might be confused during MspA reading, the method of shift reconciliation allows for such sequences to be fully resolved using the information provided by different shifts of the tetramers within the constriction of the nanopore (
Moving beyond tetramer detection via MspA, the present disclosure demonstrates that commercially available nanopore-based sequencing technology (ONT GridION) can be used to classify/sequence oligos containing the proposed molecular alphabet. For GridION experiments, the same ssDNA oligos used in MspA experiments were extended at the 3′ terminus with a polyA tail of random length >100 nts, which is used to increase the length of the oligos and guide them inside the pore (
Analysis of raw current signals is challenging because nanopore current signals exhibit extreme variations known as level drifts (
Results from neural network-guided identification tasks pertaining to five independent experimental runs are shown in
Stable bonding of chemically modified nucleotides within a DNA double helix is important for DNA-based storage because it enables durable preservation of recorded information, as well as random access to the stored data by means of PCR reactions. To better understand the interactions between chemically modified and natural nucleotides, the stability of modified DNA duplexes was investigated by carrying out all-atom molecular dynamics (MD) simulations of Dickerson dodecamers containing a pair of chemically modified nucleotides (Drew H R, Wing R M, Takano T, Broka C, Tanaka S, Itakura K, et al. Structure of a B-DNA dodecamer: conformation and dynamics. Proceedings of the National Academy of Sciences. 1981 Apr. 1; 78(4):2179-83). Out of many possible variants, the stability of B1-T, B2-G, B3-A, and B5-A base pairs was investigated, as suggested by Integrated DNA Technologies (IDT), as well as the pairing of B4 and B6 with all four types of natural nucleotides. Each modified dodecamer was solvated in electrolyte solution and simulated for approximately 350 ns. Five modified-natural base pairs (B2-G, B3-A, B5-A, B6-A, and B6-C) were found to form stable hydrogen bond patterns within the duplex, forming either two or three hydrogen bonds per base pair (
Thus, the enclosed results demonstrate an expanded alphabet for DNA data storage compatible with nanopore sequencing technology. A unique feature of this approach is coupled, iterative selection and testing that involves determining suitability for forming stable duplex structures and nanopore sequencing. Overall, the described system enables the recording of digital data with increased storage density and more bits per synthesis cycle. In particular, the disclosed storage system, when utilizing 11 unique nucleotides, enables a maximum recording density of log2(11) ≈ 3.46 bits in each cycle, compared to log2(4) = 2 bits for natural DNA. This strategy also theoretically increases the rate (speed) of the recorder by log2(11)/log2(4) = 1.73 fold. Our extensive nanopore experiments provide strong evidence that many more chemically modified nucleotides can be used for molecular storage because many ionic current levels remain available, i.e., the ionic current spectrum is sparsely populated. In addition, our system allows for high-fidelity readouts and PCR-based random-access features for encodings restricted to duplex-formation-competent monomers. Although not all pairings of chemical modifications may be suitable for amplification using natural enzymes, and some duplex formations may be unstable, the proposed system provides the first example of a coupled coding alphabet and channel selection and optimization paradigm. In conclusion, this work demonstrates fundamentally new directions in molecular storage that hold the potential to advance the field of DNA-based data storage.
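By way of non-limiting illustration, the recording density figures stated above can be reproduced with a short calculation. The helper `cycles_needed` is a hypothetical addition that gives an idealized lower bound on the number of synthesis cycles; it assumes every symbol carries its theoretical maximum information and ignores any coding overhead.

```python
import math

def bits_per_cycle(alphabet_size):
    """Maximum information per synthesis cycle for a given alphabet size."""
    return math.log2(alphabet_size)

def cycles_needed(n_bits, alphabet_size):
    """Idealized lower bound on synthesis cycles needed to write n_bits."""
    return math.ceil(n_bits / bits_per_cycle(alphabet_size))

natural = bits_per_cycle(4)    # 2.0 bits for A, C, G, T
expanded = bits_per_cycle(11)  # about 3.46 bits for A, C, G, T, B1-B7
speedup = expanded / natural   # about 1.73

cycles_natural = cycles_needed(100, 4)    # 50 cycles for 100 bits
cycles_expanded = cycles_needed(100, 11)  # 29 cycles for 100 bits
```

Under these idealized assumptions, writing 100 bits takes 50 cycles with the natural alphabet and 29 cycles with the 11-letter alphabet.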
Materials and Methods
Oligo design and synthesis. All oligos tested are of fixed length 40 nt and synthesized by Integrated DNA Technologies (IDT). For MspA experiments, the content of the oligos was chosen to include two polyT sequences at locations 1-12 and 17-40, and a chemically modified tetramer at positions 13-16. All oligos were biotinylated at the 5′ end.
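By way of non-limiting illustration, the oligo layout described above can be sketched as follows. The sketch represents each nucleotide as a string symbol and omits the 5′ biotin, which is not a nucleotide; the names `ALPHABET` and `design_oligo` are hypothetical.

```python
# Four natural nucleotides plus the seven modified nucleotides B1-B7.
ALPHABET = ["A", "T", "C", "G"] + [f"B{i}" for i in range(1, 8)]

def design_oligo(tetramer):
    """Return the 40-nt layout (dT)12 - XXXX - (dT)24 as a list of symbols."""
    if len(tetramer) != 4:
        raise ValueError("the variable region must contain exactly 4 nucleotides")
    for symbol in tetramer:
        if symbol not in ALPHABET:
            raise ValueError(f"unknown nucleotide symbol: {symbol}")
    return ["T"] * 12 + list(tetramer) + ["T"] * 24

# Example heterotetramer of the 2X+2Y type.
oligo = design_oligo(["B2", "B2", "B3", "B3"])
```

With zero-based indexing, `oligo[12:16]` corresponds to positions 13-16 of the oligo, i.e., the tetramer placed between the two polyT sequences.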
PCR Amplification. DNA amplification was performed via PCR using Q5 DNA polymerase, 5×Q5 buffer and pUC19 plasmid as template (New England Biolabs) in 50 μl. The 1.4 kb sequence is:
All primers were purchased from Integrated DNA Technologies (IDT). Both B1 and B2 were purchased from TriLink Biotechnologies in the form of triphosphates (https://www.trilinkbiotech.com/2-amino-2-deoxyadenosine-5-triphosphate-n-2003.html and https://www.trilinkbiotech.com/5-hydroxymethyl-2-deoxycytidine-5-triphosphate.html). All natural and chemically modified nucleotides were added in equimolar ratios in all PCR reactions.
MD Simulations. The molecular mechanics models of modified nucleotides B1, B3, B4, B5 and B6, including their topology and force field parameter files, were generated using the CHARMM General Force Field (CGenFF) (Vanommeslaeghe, et al. CHARMM general force field: A force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields. J Comput Chem. 2009). The charge of the atom connecting to the sugar was adjusted so that the total charge of the base is zero, which is the case for all the natural nucleotides in CHARMM36. The parameters for B2 were adopted from a previous study (Frauer, et al. Recognition of 5-Hydroxymethylcytosine by the Uhrfl SRA Domain. Xu S, editor. PLoS ONE. 2011 Jun. 22; 6 (6): e21306). Eight systems each containing a modified Dickerson dodecamers (CGCGAATTCGCG) (SEQ ID NO:02) (Drew H R, et al. Structure of a B-DNA dodecamer: conformation and dynamics. Proceedings of the National Academy of Sciences. 1981 Apr. 1; 78(4):2179-83.) were created starting from a B-DNA conformation to contain two different pairs of modified and natural bases while all other bases remained as in the original sequence. Each DNA duplex was immersed in a 75 Å×75 Å×75 Avolume of 1M KCl solution. After 2000 steps of energy minimization, the systems were equilibrated with the DNA backbone phosphate atoms restrained (ks=1 kcal/mol/Å2) for the first 10 ns. Each system contains approximately 39,000 atoms. Additional restrains were applied to enforce the expected hydrogen bonds between the modified and natural nucleotides for the first 20 ns. The systems were simulated for 350 ns in the absence of any restrains in the constant number of particles, pressure (1 atm) and temperature (295 K) ensemble using NAMD2 (Phillips J C, Hardy D J, Maia J D C, Stone J E, Ribeiro J V, Bernardi R C, et al. Scalable molecular dynamics on CPU and GPU architectures with NAMD. J Chem Phys. 2020 Jul. 28; 153(4):044130). 
If prominent structural disruptions had developed in both base pairs surrounding the modified nucleotide base pair, the simulation was terminated. Specifically, the simulation of the systems containing the B4 nucleotide lasted only 250 ns. Simulations of all the systems were performed using periodic boundary conditions. The simulations employed the particle mesh Ewald (PME) algorithm (Darden T, York D, Pedersen L. Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems. The Journal of Chemical Physics, 1993, 98(12):10089-92) to calculate long-range electrostatic interaction over a 1 Å-spaced grid. RATTLE (Andersen H C. Rattle: A “velocity” version of the shake algorithm for molecular dynamics calculations. Journal of Computational Physics. 1983 October; 52(1):24-34) and SETTLE (Miyamoto S, Kollman P A. Settle: An analytical version of the SHAKE and RATTLE algorithm for rigid water models. J Comput Chem. 1992 October; 13(8):952-62) algorithms were adopted to constrain all covalent bonds involving hydrogen atoms, allowing 2-fs time step integration used in the simulations. van der Waals interactions were calculated using a smooth 10-12 Å cutoff. The NPT ensembles used the Nose-Hoover Langevin piston pressure control (Martyna G J, Tobias D J, Klein M L. Constant pressure molecular dynamics algorithms. The Journal of Chemical Physics. 1994 September; 101(5):4177-89), which maintained a constant pressure by adjusting system's dimension. Simultaneously, Langevin thermostat was adopted for temperature control, with damping coefficient of 0.5 ps applied to all heavy atoms in the systems. CHARMM36 (Hart K, et al., Optimization of the CHARMM Additive Force Field for DNA: Improved Treatment of the BI/BII Conformational Equilibrium. J Chem Theory Comput. 2012 Jan. 10; 8(1):348-62), output of CGenFF, TIP3P water model as long as custom NBFIX corrections to nonbonded interactions were employed as the parameter set of the simulation. 
The hydrogen bond occupancy, the distances between hydrogen bond donors and acceptors, as well as the short/long axis lengths of the bases were calculated from the well-equilibrated last 100 ns fragment of the trajectory using VMD (Humphrey W, Dalke A, Schulten K. VMD: Visual molecular dynamics. Journal of Molecular Graphics. 1996 February; 14(1):33-8). The hydrogen bonds were defined to have a donor-acceptor interaction distance of less than 3 Å and a cutoff angle of 20°. Given the largely planar shape of the bases, their short/long axis lengths were determined by first computing the three principal axes of the bases and then choosing the largest two values. Simulations and analysis of the B4 pairing with natural bases in longer DNA strands were conducted using the same methodology, but with only one modified base contained in the dodecamer. In addition, extra bonds were applied to the donor (N1) and acceptor (N3) atoms of the terminal pairs to prevent the ends from fraying in these simulations, mimicking the situation in long DNA strands. These simulations ran for 550 ns unless unstable configurations were observed.
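The geometric analysis described above can be illustrated with a minimal NumPy sketch. It is not the VMD implementation: the function names are hypothetical, the angle criterion is assumed to be the deviation of the donor-hydrogen-acceptor angle from 180°, and the axis length is assumed to be the extent of the atom positions along each principal axis.

```python
import numpy as np


def hbond_present(donor, hydrogen, acceptor, max_dist=3.0, max_angle_deg=20.0):
    """Geometric hydrogen-bond test for one frame (coordinates in Angstrom).

    Assumption: the bond counts as present when the donor-acceptor distance
    is below max_dist and the D-H...A angle deviates from linearity (180 deg)
    by less than max_angle_deg.
    """
    d, h, a = (np.asarray(v, dtype=float) for v in (donor, hydrogen, acceptor))
    if np.linalg.norm(a - d) >= max_dist:
        return False
    hd, ha = d - h, a - h
    cos_dha = np.dot(hd, ha) / (np.linalg.norm(hd) * np.linalg.norm(ha))
    dha_deg = np.degrees(np.arccos(np.clip(cos_dha, -1.0, 1.0)))
    return bool(180.0 - dha_deg < max_angle_deg)


def occupancy(frames, **criteria):
    """Fraction of frames, given as (donor, hydrogen, acceptor) triples, with the bond present."""
    flags = [hbond_present(d, h, a, **criteria) for d, h, a in frames]
    return sum(flags) / len(flags)


def long_short_axes(coords):
    """Return the two largest extents of a set of atoms along their principal axes."""
    xyz = np.asarray(coords, dtype=float)
    centered = xyz - xyz.mean(axis=0)
    # Principal axes are the eigenvectors of the coordinate covariance matrix.
    _, axes = np.linalg.eigh(np.cov(centered.T))
    projected = centered @ axes
    extents = projected.max(axis=0) - projected.min(axis=0)
    long_axis, short_axis = np.sort(extents)[::-1][:2]
    return float(long_axis), float(short_axis)
```

In practice the per-frame coordinates would come from the last 100 ns of each trajectory; here they are simply passed in as arrays, so the sketch is independent of any trajectory reader.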
MspA nanopores and purification of M2-NNN MspA. All chemicals were purchased from Fisher Scientific unless stated otherwise. Streptavidin was ordered from EMD Millipore (Burlington, MA) (Catalog #189730). Phenylmethylsulfonyl fluoride (PMSF) was ordered from GoldBio (St. Louis, MO) (Catalog #P-470). DNA of the M2-NNN MspA construct was a gift from Dr. Giovanni Maglia (University of Groningen, Netherlands). The pT7-M2-NNN-MspA plasmid was transformed into BL21 (DE3) pLysS cells, which were grown in LB medium at 37° C. until the OD600 reached 0.5-0.6. The cells were then induced with 0.5 mM isopropyl β-D-1-thiogalactopyranoside (IPTG) and grown further at 16° C. for 16 hours. Cells were harvested by centrifugation at 19,000×g for 30 min at 4° C. Cells were resuspended in the lysis buffer containing 100 mM Na2HPO4/NaH2PO4, 1 mM ethylenediaminetetraacetic acid (EDTA), 150 mM NaCl, 1 mM PMSF, pH 6.5, before heating at 60° C. for 10 minutes. The cells were sonicated using a VWR Scientific Branson 450 sonicator (duty cycle of 20% and output control of 2) for 8 minutes. The lysate was centrifuged at 19,000×g for 30 min and the supernatant was discarded. The pellet was resuspended in the solubilization buffer containing 100 mM Na2HPO4/NaH2PO4, 1 mM EDTA, 150 mM NaCl, 0.5% (v/v) Genapol X-80, pH 6.5. After the pellet was completely resuspended, the suspension was centrifuged at 19,000×g for 30 min. The supernatant, containing solubilized membrane extract, was collected for Ni-NTA purification. MspA was further purified using a 5 mL HisPur™ Ni-NTA resin (GE Healthcare) and eluted in a buffer of 0.5 M NaCl, 20 mM HEPES, 0.5% (v/v) Genapol X-80, pH 8.0 by applying an imidazole gradient. MspA oligomers were further purified by SDS-PAGE gel extraction. The purified MspA protein was run in a 7.5% SDS-PAGE gel. The band of MspA oligomer was cut from the gel and extracted in the extraction buffer containing 50 mM Tris-HCl, 150 mM NaCl, 0.5% Genapol X-80, pH 7.5.
The protein was extracted at room temperature (23° C.) for 6 hours and then centrifuged at 9,000×g for 30 min to collect the protein solution. The purified MspA oligomer was flash-frozen and stored at −80° C. for further use.
Single-channel recording using MspA. The experiments were performed in a device containing two chambers separated by a 25 μm thick polytetrafluoroethylene film (Goodfellow) with an aperture of approximately 100 μm diameter located at the center. A hexadecane/pentane (10% v/v) solution was first added to cover both sides of the aperture. After the pentane evaporated, each chamber was filled with buffer containing 1 M KCl, 10 mM HEPES, pH 8.0. 1,2-diphytanoyl-sn-glycero-3-phosphocholine (DPhPC) dissolved in pentane (10 mg/mL) was dropped on the surface of the buffer in both chambers. After the pentane evaporated, the lipid bilayer was formed by pipetting the solution in both chambers below the aperture several times. An Ag/AgCl electrode was immersed in each chamber with the cis side grounded. M2-NNN MspA proteins (around 1 nM, final concentration) were added to the cis chamber. To promote MspA insertion, a voltage of ≥+200 mV was applied. After a single MspA was inserted into the planar lipid bilayer, the applied voltage was decreased to 150 mV (or 180 mV) for recording. The current was amplified with an Axopatch 200B integrating patch-clamp amplifier (Axon Instruments, Foster City, CA). Signals were filtered with a Bessel filter at 2 kHz and then acquired by a computer (sampling at 100 μs) after digitization with a Digidata 1440A A/D board (Axon Instruments).
DNA immobilized in MspA. After recording a single MspA pore for 5-10 minutes at positive voltages to check its stability, a 5′-biotinylated DNA sample (final concentration of 0.25 μM) was added to the cis chamber. Streptavidin (0.1 μM), added to the solution in the cis chamber, binds to biotin and prevents the full translocation of the DNA strand through the nanopore. To collect the signal generated by each DNA sample, a sweep protocol was applied. The amplifier applied either 150 mV or 180 mV for 10 s and then applied −150 mV to force the DNA out of the pore back into the cis compartment. The voltage was then returned to the original value, and the sweep protocol was repeated at least 40 times at each voltage.
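The sweep protocol above can be written down as a simple list of voltage segments, sketched here in Python. The function name is hypothetical, and the duration of the −150 mV ejection step (`eject_s`) is an assumed placeholder, since the text does not state how long the reverse voltage was held.

```python
def sweep_protocol(hold_mv=150.0, hold_s=10.0, eject_mv=-150.0, eject_s=1.0, repeats=40):
    """Return the sweep as a list of (voltage_mV, duration_s) segments.

    hold_mv/hold_s: capture-and-read step (150 or 180 mV for 10 s in the text).
    eject_mv: reverse voltage that drives the DNA back into the cis chamber.
    eject_s: ASSUMED duration of the reverse step; not given in the text.
    """
    segments = []
    for _ in range(repeats):
        segments.append((hold_mv, hold_s))
        segments.append((eject_mv, eject_s))
    return segments


if __name__ == "__main__":
    for voltage in (150.0, 180.0):
        sweep = sweep_protocol(hold_mv=voltage)
        read_time = sum(t for v, t in sweep if v > 0)
        print(f"{voltage:.0f} mV: {len(sweep) // 2} sweeps, {read_time:.0f} s of recording")
```

With the minimum of 40 repeats, each voltage yields 400 s of recording at the holding potential, independent of the assumed ejection time.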
ONT sequencing protocol. NEB terminal transferase was used for A-tailing the 3′ end of the 40-mer control oligos. The reaction mixture consisted of 5 μL 10×TdT buffer, 5 μL 2.5 mM CoCl2, 5 pmol DNA, 0.5 μL 10 mM dATP, 0.5 μL terminal transferase, and 38 μL H2O. The reaction was incubated at 37° C. for 30 min, followed by inactivation at 70° C. for 10 min. The DNA was then purified using the Zymo DNA clean-up kit (ssDNA Buffer:sample=7:1) and eluted in 10 μL warm H2O. The Oxford Nanopore SQK-RNA002 kit was used for library preparation.
The RT adaptor was ligated for 10 min at room temperature, and the product was then mixed with the reverse transcription master mix. 2 μL of Superscript IV was added, and the mixture was incubated at 50° C. for 50 min, followed by 70° C. for 0 min, and cooled down to 4° C. Bead clean-up was performed on the 40 μL sample with 72 μL of RNAClean XP beads, which were rotated for 5 min, washed with 70% EtOH and eluted with 20 μL H2O. The RMX adaptor was ligated for 10 min at room temperature, followed by clean-up with 40 μL of RNAClean XP beads, and the product was washed twice with 150 μL of the wash buffer. It was then eluted in 21 μL of the elution buffer. The reaction was loaded onto an R9.4.1 flowcell and sequenced on a GridION X5 (Oxford Nanopore) for 24 h.
Two-Step Event Identification Scheme for ONT Readouts with NN Processing
The main challenges faced when analyzing nanopore current signals are illustrated in
Summary of Results from Model-Based Classification Procedure
ResNet models were trained on 12 permutation classes in which the composition is fixed, but the orderings of the modified nucleotides are different. What is referred to as a “superclass” combines different choices and orderings of the modified nucleotides (the superclass contains 66 out of 77 tetramers, as for 11 tetramers an insufficient number of training samples was available). The number of valid sequenced reads (i.e., reads containing modified nucleotides) for each class is shown in Table 5. To perform unbiased training, the sizes of the classes were balanced by setting a lower bound for subsampling of reads in different classes. An upper bound was also set on the number of training samples used for each class, in order to prevent one or several classes from dominating the training set. For finer classification involving permutations of monomers within a class, the lower bound was set to 1000, and the upper bound to 5000. For the classification task on all 66 classes, the lower bound was set to 2000, and the upper bound to 3500. These choices are necessitated by two conflicting requirements: balancing out the class sizes and retaining a training set that is as large as possible. The classification results are shown in
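The class-balancing step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used: the function name is hypothetical, classes with fewer reads than the lower bound are assumed to be excluded from training, and classes with more reads than the upper bound are assumed to be randomly subsampled down to it.

```python
import random


def balance_classes(reads_by_class, lower, upper, seed=0):
    """Bound the number of training reads per class.

    reads_by_class: dict mapping a class label to a list of reads.
    lower: ASSUMED exclusion threshold; classes with fewer reads are dropped.
    upper: cap on reads per class; larger classes are randomly subsampled.
    """
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    balanced = {}
    for label, reads in reads_by_class.items():
        if len(reads) < lower:
            continue  # too few reads to train on this class
        if len(reads) > upper:
            reads = rng.sample(reads, upper)
        balanced[label] = list(reads)
    return balanced


if __name__ == "__main__":
    # Toy read counts; real classes would hold current traces, not integers.
    toy = {"classA": list(range(6000)), "classB": list(range(2500)), "classC": list(range(800))}
    for name, (lo, hi) in {"permutations": (1000, 5000), "all classes": (2000, 3500)}.items():
        sizes = {k: len(v) for k, v in balance_classes(toy, lo, hi).items()}
        print(name, sizes)
```

Raising the lower bound or lowering the upper bound makes the retained classes more uniform in size, at the cost of discarding classes or reads, which is the trade-off the two bounds are meant to manage.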
While particular aspects and embodiments are disclosed herein, other aspects and embodiments will be apparent to those skilled in the art in view of the foregoing teaching. The various aspects and embodiments disclosed herein are for illustration purposes only and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
This application claims the benefit of priority of U.S. Provisional Application No. 63/312,334, filed Feb. 21, 2022, and incorporated herein by reference in its entirety.
This invention was made with government support under 1618366, 1807526, and 200815 awarded by NSF. The government has certain rights in the invention.
Number | Date | Country |
---|---|---|
63312334 | Feb 2022 | US |