The invention relates to technologies for storing data in sequences of nucleic acids.
DNA has been considered an attractive medium for data storage for several decades. Advantages include its minimal physical footprint and energy consumption required for maintenance, the availability of molecular biology techniques to copy existing DNA sequences, and the ability to perform selective isolations from a complex and disordered mixture. There has been a massive increase in the amount of digitally stored data over the past 20 years, as well as substantial advances in DNA sequencing technologies. These developments have spurred a renewed interest in DNA data storage schemes, particularly for large-scale archival purposes. It is now widely held that the speed and cost of the ‘writing’ step are the primary bottleneck limiting further adoption of this technology. These slow writing speeds have arisen largely due to the emphasis on developing DNA synthesis for life science applications, which has led to a mismatch between current synthesis chemistries, coding schemes, and the engineering needed to realize data storage at the exabyte scale.
Storage of even 10 GB (10^10 bytes) of data in chemically unmodified DNA requires 40 billion data-encoding nucleotides (nt) (see
The longest DNA molecules that are chemically synthesized using phosphoramidite chemistry are generally between 100 and 200 nt in length. Accordingly, between 200 and 400 million distinct oligonucleotide sequences are needed to store the exemplary 10 GB of data without accounting for the use of index elements or error correction techniques. This poses a problem of scale for current generations of oligonucleotide synthesizers, all of which generate sequences in a stepwise cycle of base-by-base addition to a solid support.
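By way of illustration only, the arithmetic behind these figures can be sketched as follows. The sketch assumes 2 bits per unmodified nucleotide and ignores index and error-correction overhead, as the text does; the function names are hypothetical.

```python
def nucleotides_needed(n_bytes: int, bits_per_nt: int = 2) -> int:
    """Data-encoding nucleotides needed to store n_bytes (assumes 2 bits per nt)."""
    total_bits = n_bytes * 8
    # Ceiling division so that a partially filled final nucleotide is counted.
    return -(-total_bits // bits_per_nt)


def oligos_needed(n_nt: int, oligo_len: int) -> int:
    """Distinct oligonucleotides of length oligo_len needed to hold n_nt nucleotides."""
    return -(-n_nt // oligo_len)


payload_nt = nucleotides_needed(10**10)          # 10 GB
print(payload_nt)                                # 40000000000
print(oligos_needed(payload_nt, 200))            # 200000000
print(oligos_needed(payload_nt, 100))            # 400000000
```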
Some techniques make multiple sequences in parallel by performing spatially selective reactions at different sites on the surface. This is typically accomplished by mechanical delivery of individual monomers to the reaction site or through controlled deprotection reactions using light exposure or electrochemical techniques. There are few currently available piezo or inkjet delivery solutions capable of performing 40 billion deliveries with the accuracy needed for data recording within a practical time period or device footprint. Instrumentation which directs the synthesis by controlling deprotection steps needs to flood the entirety of the reaction cell with reagents that react only at a subset of the locations on the solid support. This means that the number of instrument cycles is proportional to both the length of the data sequences and the number of nucleotide analogs used for the encoding. The effect is that compressing the data through expanded encoding with unnatural nucleotide analogs does not enable significantly faster writing speeds.
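The proportionality described above can be illustrated with a simplified model. The sketch assumes that each sequence position stores log2(N) bits for an alphabet of N monomers and that each position costs one flood cycle per monomer type; both are assumptions made for illustration, and the function name is hypothetical.

```python
import math


def flood_cycles(payload_bits: int, alphabet_size: int) -> int:
    """Instrument cycles for a deprotection-directed synthesizer (simplified model)."""
    bits_per_position = math.log2(alphabet_size)
    positions = math.ceil(payload_bits / bits_per_position)
    # Every position requires one flood per monomer type in the alphabet.
    return positions * alphabet_size


for n in (4, 8, 16):
    print(n, flood_cycles(96, n))
# 4 192
# 8 256
# 16 384
```

Under these assumptions a larger alphabet shortens each sequence but does not reduce the number of instrument cycles, consistent with the observation above.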
The invention provides methods for enhancing the write speed of DNA-based data storage systems with a layered coding approach. The strategy allows the generation of materials, such as planar wafer surfaces, that are patterned with nucleic acid molecules independently of performing data writing operations on the material. The patterned sequence specifies a physical address within the material and acts as an index for the data written at that site. Each location can be identified by a unique index sequence and may contain one or more indexed initiators upon which data-encoding nucleic acid sequences can be synthesized. The number of indexed initiators per discrete location may be dictated by the error rates of the synthesis (writing) and sequencing (reading) methods used and the associated storage density as discussed below.
Due to the redundancy of multiple indexed data strands at any given location and the encoding schemes described herein, it is not necessary to alter the entirety of the molecules at that location. Accordingly, faster synthesis/writing techniques can be used without the need for 100% incorporation at a given site. When the nucleic acid molecules are cleaved from the material and sequenced, the data writing operations can be associated with the location at which the writing operation, such as an enzymatic extension, occurred. The library of nucleic acid molecules acts as a disordered yet compact record of a physical storage medium, such as an optical disk drive or holography plate. Aspects of the invention may include techniques for reorganizing and storing the barcoded DNA, concatenation or size selection of DNA oligonucleotides to reduce the number of sequencing reads required to interpret the data, surface preparation techniques, as well as techniques to copy an existing indexed recording medium in a molecular analogy to an imprinting process. While preferred embodiments utilize enzymatic techniques for DNA synthesis, many concepts, coding strategies, and writing approaches can also be applied to oligonucleotides synthesized with phosphoramidite chemistry. The combined effect of these coding improvements and enzymatic processes is such that the write speed is increased by several orders of magnitude at the cost of a less than 10-fold reduction in storage density.
The core of the invention is to enhance the write speed by minimizing dependence on the high-fidelity synthesis steps currently required for all existing ‘storage by synthesis’ techniques. The approach is to use the layered coding architecture shown in
The data recording steps are performed by using a series of spatially selective reactions upon a suitable substrate or recording media. These may be of any type that can be induced by a write head capable of accessing the location of each unique index site within the media, preferably by physically translating over a larger surface. Suitable write heads include high-frequency deposition or printing systems, electrode arrays, optical components such as lasers, digital micromirror devices (DMDs), liquid crystal masking systems, or any suitable combination thereof as employed for photolithography, interference lithography or holography. Each spatially selective operation results in the localized addition of one or more nucleotides to a subset of the oligonucleotides immobilized on a two- or three-dimensional solid support. Preferred embodiments conduct this addition enzymatically with a template-independent polymerase, such as terminal deoxynucleotidyl transferase (TdT), polymerase theta (pol Θ), or a closely related variant, to append the nucleotides to the 3′-termini of the surface features. Exemplary template-independent polymerases and template-independent synthesis techniques that may be used with systems and methods of the invention are described, for example, in U.S. Pat. Pub. Nos. 2020/0190491, 2018/0274001, 2018/0305746, and 2019/02755492, the content of each of which is incorporated herein by reference.
The writing approaches compatible with this coding approach can be divided into four classes depending upon the combination of chemistry, enzymology, and engineering employed. These are summarized in
In class I embodiments, the write cycle may be conducted by flooding the oligonucleotide-patterned recording media with a mixture of enzyme, buffer, and protected dNTP. The protecting group may be of any type which prevents enzymatic extension and may be removed by any spatially selective techniques known to those skilled in the art such as photolysis, a pH change, or change in oxidative or reductive potential. Nucleotide analogs and protecting groups compatible with the systems and methods described herein are described, for example, in U.S. Pat. Pub. Nos. 2020/0190491, 2018/0274001, 2018/0305746, and 2019/02755492. Spatially selective reactions are then performed to remove the protecting group from the dNTP, so that the decaged molecules become substrates for enzymatic addition to the local oligonucleotide sequences. The occurrence of a spatially selective reaction, or its absence (a ‘null reaction’ hereafter), encodes a bit of data depending on the precise coding approach that is used. The enzymatic extension reaction may be stopped by quenching or heat killing, or may instead proceed until the vast majority of substrate molecules have been consumed. The media is then flushed to remove traces of reactive material, completing a single write cycle. These steps are then repeated by utilizing a different dNTP for each subsequent write cycle without recycling the type of dNTP. The number of consecutive write cycles may be limited by the number of dNTPs which can be uniquely identified by a given sequencing technology. For a set of N distinguishable dNTPs, up to N−1 consecutive write cycles may be employed. After N−1 cycles, the remaining dNTP is used in a global addition reaction, where it is added to every molecule in the recording media to denote a boundary of a data layer, thus refreshing the ability to use dNTPs employed in previous write cycles. The order in which the dNTPs are added may be known and fixed (e.g. A, C, G, with T defining the layer boundary) for each layer to aid in decoding/reading. In some instances, it may instead be useful to designate two dNTPs for use as layer boundaries, particularly in embodiments where the length of the layer boundary may not be limited to precisely one nucleotide. This enables reads containing adjacent layer boundary nucleotides to be interpreted correctly.
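As a minimal sketch of the class I layer structure, the following maps per-cycle bits to the sequence of an idealized strand that reacted in every selected cycle. It assumes the fixed A, C, G order with T as a single-nucleotide layer boundary, and exactly one nucleotide added per reaction; real strands would only partially incorporate. The names are hypothetical.

```python
DATA_NTS = "ACG"   # fixed write-cycle order within a layer (assumed)
BOUNDARY = "T"     # dNTP reserved for the global layer-boundary addition (assumed)


def class1_ideal_strand(layers):
    """Sequence of a strand that was extended in every selected write cycle.

    `layers` is a list of 3-bit tuples, one bit per write cycle (A, C, G).
    A 1 means the spatially selective reaction was performed at the site;
    a 0 is a null reaction.
    """
    strand = []
    for bits in layers:
        if len(bits) != len(DATA_NTS):
            raise ValueError("each layer needs one bit per data nucleotide")
        strand.extend(nt for nt, bit in zip(DATA_NTS, bits) if bit)
        strand.append(BOUNDARY)  # global addition closes the layer
    return "".join(strand)


print(class1_ideal_strand([(1, 0, 1), (0, 1, 0)]))  # AGTCT
```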
The class II write cycle is conducted upon a recording media where the oligonucleotides contain a protecting group that prevents or significantly limits extension. In preferred embodiments, this protecting group has been installed by performing a global addition reaction to add a protected dNTP to every site in the recording media. The protecting group may be of any type which modulates enzymatic extension on the subsequent cycle and may be removed by any spatially selective techniques known to those skilled in the art such as photolysis, a pH change, or change in oxidative or reductive potential. Localized exposure to the removal conditions renders the DNA at an intended site reactive to subsequent extensions. Removal conditions may be selected so that they remove a precise fraction of the protecting groups at a given site. The recording media is then flushed with a mixture of enzyme, buffer, and triphosphate to append a nucleotide or short series of nucleotides to these reactive sites to complete a write cycle. This process may then be repeated by using a different dNTP for each subsequent write cycle without recycling the type of dNTP. When the entirety of the protecting groups at the reaction sites are depleted or there are no remaining unique dNTPs, the reactivity of the surface can be refreshed by another global addition of the protected triphosphate to complete the layer and allow the process to be repeated.
The first write cycle is conducted by performing a partial photolysis reaction only at the second index region, thereby writing a 0 bit at the first index and a 1 bit at the second index [01]. The reaction chamber is then flooded with A nucleotides and enzyme so that the strands lacking a protecting group undergo extension. The reagents are then flushed away to complete the write cycle. In the second write cycle, both index regions undergo a partial deprotection reaction, writing [11], before the reaction chamber is flooded with C nucleotides and enzyme. This adds nucleotides to the newly deprotected strands at both sites and may also produce extensions on strands which reacted in the first write cycle. The chamber is then flushed prior to the third write cycle. The data written in the third cycle is [00], so no deprotection reactions are conducted. When the chamber is flooded with G nucleotides and enzyme, only the strands which had previously undergone an addition will react. The reagents are again flushed from the chamber, completing the third write cycle. A global deprotection reaction is then conducted to remove all residual protecting groups from the strands before the media is flushed with a T nucleotide that contains a photolabile protecting group and an enzyme to catalyze addition. Enzymatic addition of this T to all the sequences installs the layer boundary and refreshes the photosensitivity of the surface for writing the next layer.
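The two-site example above can be traced with a deterministic toy model. The sketch assumes 100 strands per site, 20% partial deprotection of the remaining protected strands per selected cycle, and exactly one nucleotide added per flood; these numbers and the function names are illustrative assumptions, not values from the disclosure.

```python
from collections import Counter


def class2_layer(bits, n_strands=100, percent=20, nts="ACG", boundary="T"):
    """Toy model of one class II data layer at a single index site.

    bits: one bit per write cycle; a 1 triggers partial deprotection.
    Returns a Counter mapping layer sequence -> number of strands.
    """
    protected = n_strands
    growing = Counter()                      # strands already deprotected
    for nt, bit in zip(nts, bits):
        if bit:
            newly = protected * percent // 100   # partial photolysis
            protected -= newly
            growing[""] += newly
        # Flooding with nt extends every deprotected strand by one nucleotide.
        growing = Counter({seq + nt: n for seq, n in growing.items()})
    growing[""] += protected                 # global deprotection of the rest
    # The protected boundary nucleotide is then added to every strand.
    return Counter({seq + boundary: n for seq, n in growing.items() if n > 0})


def class2_decode(strands, nts="ACG"):
    """A write is indicated by the nucleotide adjacent to the preceding boundary."""
    first = {seq[0] for seq in strands}
    return tuple(int(nt in first) for nt in nts)


site1 = class2_layer((0, 1, 0))   # first index:  cycles [0], [1], [0]
site2 = class2_layer((1, 1, 0))   # second index: cycles [1], [1], [0]
print(dict(site1), class2_decode(site1))
print(dict(site2), class2_decode(site2))
```

In this model a strand deprotected in the first cycle reads ACGT, one deprotected in the second reads CGT, and a strand never selected reads only T, so the leading nucleotide identifies the cycle in which a write occurred.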
As with class I, the number of nucleotides or nucleotide analogs that can be used within each data layer is limited in part by the ability of the sequencing technology to distinguish the various analogs from one another.
A write cycle for both class III and class IV embodiments requires only the selective placement of the necessary reagents at a reaction site in the recording media. Any mixture of reagents capable of inducing a nucleotide addition to immobilized strands may be used. Unlike approaches that repurpose gene synthesis instrumentation for data writing, feature sizes may be larger than the delivery zone of the instrumentation (
In the first write cycle of the exemplary class III embodiment shown in
In the exemplary class IV embodiment shown in
The coding architectures described herein provide several advantages over nearly all existing storage-by-synthesis technologies. First, the approach enables a greater quantity of data to be written with each instrument cycle. In a light-directed array synthesizer utilizing four nucleotides and operating in a synchronous mode (see A. Kahng, I. Mandoiu, S. Reda, X. Xu, A. Zelikovsky, “Design Flow Enhancement for DNA Arrays,” Proceedings of the 21st International Conference on Computer Design (ICCD'03), 2003, incorporated herein by reference), each payload nucleotide type will encode the equivalent of 2 binary bits of data and a single addition will occur once at each site in 4 cycles, thus writing at a rate of ½ bit/cycle/site. By contrast, a layered coding approach utilizing one nucleotide type (e.g. T) as a layer boundary and 3 nucleotides (e.g. A, C, and G) within the data layers may write at a rate of ¾ bit/cycle/site. Part of this gain results from the capability to use the absence of a reaction, or a ‘null reaction’, to encode information (a 0 bit for example). Second, the approach reduces write time within each synthesis cycle by eliminating the requirement that each addition reaction progress to completion, reducing the dwell time that a write head such as a laser, micromirror grid, liquid-crystal masking system, or mechanical deposition apparatus must spend at each location while traversing a recording media. Many protecting groups employed for DNA synthesis exhibit first-order photolysis kinetics and up to seven half-lives may be required to achieve near complete photolysis (Agbavwe, C., Kim, C., Hong, D. et al. Efficiency, error and yield in light-directed maskless synthesis of DNA microarrays. J Nanobiotechnol 9, 57 (2011), incorporated herein by reference). Embodiments which utilize partial (e.g. 20%) photolysis for each reaction may therefore easily realize order of magnitude improvements to write speed. Third, the sub-stoichiometric additions may either be imprecise, wherein the extent of reaction is not crucial to decoding the data and each write cycle encodes information in binary, or precise, where the exact extent of the reaction encodes information at bitrates above binary. Both scenarios are described in further detail below. Fourth, the reduced dependence on high-fidelity synthesis may enable drastic cost reductions in reagents, such as enzymes, triphosphates, or phosphoramidites, particularly at the unprecedented scale required for nucleic acid memory systems. When coupled with the low write-head dwell time, this may in turn enable data storage devices with physical footprints and write speeds closer to that of modern optical drives than existing DNA synthesizers. Fifth, separating the index and payload writing reduces dependence on in situ synthesis techniques, enabling new routes for generating, copying, or recycling indexed recording media, each of which is discussed further below.
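The write-rate comparison and the dwell-time argument can be checked with a short calculation. The sketch assumes ideal first-order photolysis, so that the exposure needed to reach a given extent is −log2(1 − extent) half-lives, and compares it against the seven half-lives cited for near complete photolysis; the function names are hypothetical.

```python
import math
from fractions import Fraction


def synchronous_rate(bits_per_addition: int = 2, cycles_per_addition: int = 4) -> Fraction:
    """Bits per cycle per site for a synchronous four-nucleotide array synthesizer."""
    return Fraction(bits_per_addition, cycles_per_addition)


def layered_rate(n_nucleotides: int = 4) -> Fraction:
    """One binary write per data nucleotide, plus one boundary cycle per layer."""
    return Fraction(n_nucleotides - 1, n_nucleotides)


def half_lives_needed(extent: float) -> float:
    """Exposure, in half-lives, for first-order photolysis to reach `extent`."""
    return -math.log2(1.0 - extent)


print(synchronous_rate(), layered_rate())   # 1/2 3/4
# Dwell-time ratio: seven half-lives versus 20% photolysis (roughly 22-fold).
print(7 / half_lives_needed(0.20))
```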
In some instances, the DNA molecules may remain bound to the support for long term archival. In other embodiments, the DNA may be removed from the support by the aminolysis of an alkaline labile linker, sequence specific enzymatic digestion with uracil specific excision reagent (USER) or a restriction enzyme, or other techniques known to a skilled artisan. The method and technique for the DNA removal from the solid phase is dependent solely on the covalent linkage chemistry and the subsequent processing steps required for archival.
The DNA may be made more amenable to reading by first converting the single stranded data strands to double stranded DNA (dsDNA) and concatemerizing the short duplexes into larger constructs. The conversion to dsDNA can be accomplished with random priming strategies or by installing a common 3′-primer binding site to the data strands after their cleavage from the surface. Preferred embodiments utilize 5′-phosphorylated primers so that the short dsDNA products can be used directly in a blunt-end concatemerization reaction. The average length of the concatemers can be tailored by reaction time or by varying the ratio of data strands to duplexed adapters like those used in commercial sequencing library preparation kits, which act as end groups to stop chain growth. The adapters may be either entirely double-stranded to facilitate amplification of the concatemers or partially single-stranded to enable sequence-specific immobilization on an archival surface. The effect is that many short molecules which would otherwise each comprise a single sequencing read are assembled into larger constructs, reducing the total reads necessary for sequencing. Appending a barcode to each dataset allows them to be archived together in a smaller physical footprint than the original recording media. Attachment of sequence-specific immobilization handles allows data to be reorganized by new physical addresses upon archival wafer arrays, in turn facilitating selective access to subsets of data within a larger archive. Such archives may comprise surfaces patterned with oligonucleotide sequences to selectively capture the dataset barcodes at a defined site. The surface capture sequences may be designed to be orthogonal to one another by any known technique so that only the intended sequences are localized at the intended site.
The datasets may then be accessed by any technique that permits selective access to surface features, such as photolysis of the material from the surface when the archival addresses contain a suitable linkage, sequence specific cleavage, toehold-mediated release, or approaches for physical separation that rely on precise mechanical handling.
The sequencing of a given dataset may be conducted by any number of current techniques, though long-read, single molecule sequencing approaches such as single-molecule real time sequencing or nanopore technologies are preferred. The only restriction on sequencing modality is that the read length be sufficient to encompass the data payload and index region. The data is recovered by first identifying sections of reads that correspond to index sequences then grouping the payload elements by their affiliated index. The payload sections for each index are then aligned by the nucleotides which denote the layer boundaries. Within each layer, the identity of a specific nucleotide is associated with a specific write cycle. These coordinates of the index, layer, and write cycle uniquely identify the placement of each bit within the full dataset. The bit values themselves are assigned by examination of the nucleotides within each layer to detect whether a spatially selective reaction was conducted. There are distinct ways of making this determination which depend on the writing strategy used. In both class I and class III writing strategies, the occurrence of a writing operation is indicated by the presence of a given nucleotide at any position within the layer. In class II and class IV writing strategies, a writing operation is indicated only by the presence of a nucleotide adjacent to the preceding layer boundary or index. Analog embodiments are also possible, where the frequency of a given nucleotide occurring at a position is assessed relative to the number of reads.
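A minimal sketch of the presence-based decoding described for class I and class III strategies follows. It assumes that index sequences have already been identified upstream, that each read ends with a layer boundary, and that boundaries are exactly one T long with A, C, G as the data nucleotides; all names are hypothetical.

```python
from collections import defaultdict


def decode_presence(reads, data_nts="ACG", boundary="T"):
    """Decode (index, payload) reads into per-layer bit tuples.

    A bit is 1 if its nucleotide appears anywhere within the layer in at
    least one read for that index, and 0 otherwise (a null reaction).
    """
    seen = defaultdict(lambda: defaultdict(set))  # index -> layer -> nucleotides
    for index, payload in reads:
        layers = payload.split(boundary)
        if layers and layers[-1] == "":
            layers.pop()                 # empty piece after the final boundary
        for layer_no, layer in enumerate(layers):
            seen[index][layer_no].update(layer)
    return {
        index: [tuple(int(nt in layers[i]) for nt in data_nts)
                for i in sorted(layers)]
        for index, layers in seen.items()
    }


reads = [("idx1", "AGTCT"), ("idx1", "GTT"), ("idx1", "ATCT"), ("idx2", "TT")]
print(decode_presence(reads))
# {'idx1': [(1, 0, 1), (0, 1, 0)], 'idx2': [(0, 0, 0), (0, 0, 0)]}
```

For class II and class IV strategies the same grouping applies, but only the nucleotide adjacent to the preceding boundary or index would be examined.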
Some embodiments may include size selection strategies to increase the storage density, particularly when lower numbers of write cycles are utilized. These may include capillary electrophoresis, gel purifications, ultrafiltration techniques, selective binding to solid resins, or other chromatographic approaches used by those skilled in the art. The operating principle is that not all strands in a dataset are equally informative. After multiple addition reactions are conducted, there is a distribution of sizes which depends upon the characteristics of the individual addition.
Selection for sequences at various points in this size distribution can alter the apparent fractional addition relative to that of the individual addition as shown in
This lowers the depth penalty required for confident identification of the null reaction steps. The resultant gain in storage density may depend on the distribution of the individual extension reactions and the relative size of the data coding regions compared to the index and layer boundaries.
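One way to picture the depth penalty is a simple independence model, which is an assumption made here for illustration and not a statement of the disclosed method: if a performed reaction leaves its nucleotide in a fraction f of strands, the chance that none of n reads shows it is (1 − f)^n, so the depth needed to call a null reaction with confidence falls as the apparent fractional addition rises.

```python
import math


def reads_for_null_call(fractional_addition: float, max_miss_prob: float = 1e-3) -> int:
    """Reads per index needed so a performed reaction is missed with
    probability at most max_miss_prob (independent reads assumed)."""
    if not 0.0 < fractional_addition < 1.0:
        raise ValueError("fractional_addition must lie strictly between 0 and 1")
    return math.ceil(math.log(max_miss_prob) / math.log(1.0 - fractional_addition))


print(reads_for_null_call(0.2))  # 31
print(reads_for_null_call(0.5))  # 10
```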
In some embodiments, the order of the write cycles is selected to modulate the properties of the extension reaction. In some embodiments where the fractional addition approaches 100%, it may also be effective to select instead for the smallest oligonucleotides in the distribution.
More complex information can be embedded into each write step of embodiments which utilize photochemically controlled nucleotide addition. Sequencing of a suitable depth can reveal the extent of protecting group removal at each step, so that the intensity of light received at a site can be reconstructed from the sequencing reads and the known kinetics of protecting group removal. This allows complex illumination patterns, such as holograms and other forms of optical interference patterns, to be encoded and recovered upon sequencing. In class I embodiments, the relative amount of sublayer nucleotide added to each layer encodes the illumination intensity at a given site. In class II embodiments where addition is controlled with a photocleavable group on the surface material, the extent of deprotection at each cycle of addition encodes information about the light intensity utilized.
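As a sketch of how an illumination pattern might be recovered from read counts, the following assumes ideal first-order kinetics, so that the fraction deprotected is f = 1 − 2^(−dose) with dose measured in half-lives, and assumes that read counts reflect the true fraction of extended strands. The function names and the per-site count format are hypothetical.

```python
import math


def relative_dose(n_extended: int, n_total: int) -> float:
    """Light dose at a site, in photolysis half-lives, inferred from read counts."""
    fraction = n_extended / n_total
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction deprotected must be in [0, 1)")
    return -math.log2(1.0 - fraction)


def reconstruct_pattern(counts):
    """Map {site: (extended_reads, total_reads)} to {site: relative dose}."""
    return {site: relative_dose(ext, tot) for site, (ext, tot) in counts.items()}


pattern = reconstruct_pattern({"site_a": (0, 100), "site_b": (50, 100), "site_c": (75, 100)})
print(pattern)
```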
It is extremely difficult to decipher the encoded data without knowledge of the indexing scheme used during the write steps. Some embodiments may also randomly alter the order of the nucleotide delivery between different layers to further obscure the encoded data. Other embodiments may use additional chemical security features to prevent unintended access or other manipulation of the material. In such embodiments, it is desirable to utilize modified nucleotide analogs which cannot be amplified or copied enzymatically. The material is instead retained as single stranded DNA in an archival system. In contrast to traditional archival systems which emphasize long-term stability, these encrypted data strands may utilize highly labile modifications so that the message will not survive repeated manipulations, attempts at copying, or extended storage.
A wide variety of surface preparation techniques are appropriate for generation of DNA indexed recording media. The primary requirement is that the material can be engineered to the precision and homogeneity required to achieve a given feature density. In preferred embodiments, feature size is of the same order as the laser spot size (350-700 nm) required to write 10 GB into a surface of comparable size to a compact disc. Accordingly, surfaces need to be defect free at this scale to avoid optical defects or disruption to uniform fluid delivery. Suitable substrates include functionalized silicon wafers or silanized glass, polymer sheets, or combinations of surfaces and polymers achieved through spin coating or other deposition techniques known to those skilled in the art. In certain instances, phosphoramidite chemistry may be used to synthesize the index sequences directly from functional groups on the substrate.
In preferred usages, the index DNA is instead generated enzymatically from a universal initiator sequence common to all sites which is covalently linked to the substrate through its 5′-terminus. The initiator sequence may contain designed elements to facilitate removal from the surface such as sites for restriction enzyme cleavage, internal deoxyuracil residues, or linkers that are cleaved only under certain conditions such as a specific pH, presence of oxidizing or reducing agents, or exposure to a specific wavelength of light. The selection of covalent linker coupling chemistry is limited only by the orthogonality with the downstream processing conditions so that material is not inadvertently removed from the surface during writing steps. Initiators may be immobilized either so that they are separated by regions of hydrophobicity or so that they uniformly coat the substrate without detectable gaps between features. Any technique which enables the spatially selective extension of the initiators can be used to generate the index strands. Though instrumentation such as mechanical spotting, inkjet-delivery mechanisms, or microelectrode arrays may be used, preferred embodiments utilize lithographic systems such as physical masks, liquid crystal masks, DMDs, or any other form of spatial light modulator, or techniques such as dip pen lithography, so that the extension reactions can be controlled at a suitable resolution. Any coding approach may be used for the indices provided that the sequencing modality can distinguish the indices from one another. Some embodiments may utilize a precisely defined nucleotide sequence at each location, while others may utilize a homopolymer-encoded sequence, where the sequence of homopolymer tracts defines the index (e.g. the string ‘AAAACCTGGAA’ codes as ‘ACTGA’). Suitable embodiments include those where a triphosphate containing a photocleavable chain terminator is enzymatically added to the 3′-OH of the initiator.
Other embodiments utilize caged triphosphates that are prevented from incorporation until a photocleavable group is removed. Suitable molecules may include 3′-ortho-nitrobenzyl protected derivatives, 3′-NPPOC derivatives, nitrobenzyl protected species, or 3′-BODIPY protected species.
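The homopolymer-encoded index described above can be read by collapsing each run of identical bases to a single symbol. A minimal sketch, with a hypothetical function name:

```python
from itertools import groupby


def collapse_homopolymers(seq: str) -> str:
    """Collapse each homopolymer tract to one base, e.g. 'AAAACC' -> 'AC'."""
    return "".join(base for base, _run in groupby(seq))


print(collapse_homopolymers("AAAACCTGGAA"))  # ACTGA
```

A consequence of this coding is that adjacent index symbols must differ, since two consecutive identical symbols would merge into one tract.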
Other embodiments may generate indexed surfaces by the random immobilization of complex sequence libraries. Embodiments where index sites are separated from one another and are discontinuous are preferred so that libraries can be hybridized to the surface strands and amplified using primer walking or bridge amplification techniques. Synthesis and hybridization conditions may be selected so that both 1) the probability of 2 identical sequences being present in the library is negligible, and 2) that the probability of two sequences residing at a feature is small. Random template libraries can be synthesized by extending initiator sequences using template independent polymerases and mixtures of dNTPs tailored to reflect the desired base composition. The 3′-end of diverse sequence libraries can be homogenized either through the attachment of a common adaptor sequence or through the enzymatic synthesis of any sequence that allows for primer attachment and/or surface amplification chemistries. Sequencing-by-synthesis approaches can then be used to determine the identity of the indices after immobilization. In other embodiments, composition of the DNA library is known prior to random immobilization so that hybridization-based approaches may be used to decode the location of the strands after attachment. These DNA libraries may also be derived from biological sources of known composition and fragmented, either enzymatically or mechanically, to generate sequences of smaller size and predictable composition. Further approaches may instead utilize the random immobilization and spatial decoding of DNA-coated beads. It may be desirable in some embodiments to employ beads or particles which exhibit distinctive spectral signatures to aid in the decoding. 
The method of surface preparation does not limit the scope of the invention, in that any approach where the sequence of a polymer or population of polymers can be associated with an address on or within a solid rigid support may be suitable.
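The two probability conditions for randomly immobilized libraries can be estimated with standard approximations. The sketch below assumes uniformly random bases (a birthday-bound estimate) and Poisson loading of features; the disclosure allows tailored base compositions, so these are illustrative assumptions only, and the function names are hypothetical.

```python
import math


def p_duplicate_in_library(n_sequences: int, length: int) -> float:
    """Approximate probability that any two library members are identical,
    assuming each of the 4**length sequences is equally likely."""
    space = 4 ** length
    return -math.expm1(-n_sequences * (n_sequences - 1) / (2 * space))


def p_multiple_at_feature(mean_per_feature: float) -> float:
    """P(two or more sequences | feature occupied) under Poisson loading."""
    lam = mean_per_feature
    p_occupied = -math.expm1(-lam)
    p_exactly_one = lam * math.exp(-lam)
    return (p_occupied - p_exactly_one) / p_occupied


print(p_duplicate_in_library(10**9, 40))   # below one in a million
print(p_multiple_at_feature(0.1))          # about 5% of occupied features
```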
The preparation of such highly indexed materials containing billions of features is the time-consuming step in the writing process. Accordingly, aspects of the invention may include methods to transfer the pattern of indexes from one material to another. Preferred embodiments for such copying reactions utilize surfaces wherein the features are separated from one another with hydrophobic patches, so that aqueous solutions preferentially form droplets on the oligonucleotide functionalized sites. The features on the master ‘template’ wafer are then registered with those on the ‘blank’ wafer, which are both then pressed into close alignment with one another to form a column of aqueous solution bridging the two features (
No prior description of the writing strategies is intended to restrict the invention to scenarios where indexing is performed prior to the data writing operations. There are some instances where it instead may be preferable to perform the writing operations before installation of the index elements. In embodiments utilizing enzymatic synthesis techniques, a common initiator sequence may be installed at all locations on the surface so that there is a suitable substrate for a template-independent polymerase. Surfaces may be patterned as previously described, where the initiators form a continuous uninterrupted lawn of oligonucleotides or are localized between regions of hydrophobicity or otherwise passivated regions. The data writing operations are performed using any suitable approach as previously described. The index elements may then be installed using any aforementioned approach suitable for appending new sequence information to the 3′-end of the data-encoding strands. In situ enzymatic synthesis, using instrumentation similar to that used for data recording, may be preferred for the subsequent installation of the indices, though it is also conceivable to mechanically deposit and ligate existing indices to the data strands. An advantage of post-write indexing is that surfaces which are relatively easy to prepare can act as a rapid, high-density data recording material that can be indexed and interpreted later at a core archival facility.
References and citations to other documents, such as patents, patent applications, patent publications, journals, books, papers, web contents, have been made throughout this disclosure. All such documents are hereby incorporated herein by reference in their entirety for all purposes.
Various modifications of the invention and many further embodiments thereof, in addition to those shown and described herein, will become apparent to those skilled in the art from the full contents of this document, including references to the scientific and patent literature cited herein. The subject matter herein contains important information, exemplification and guidance that can be adapted to the practice of this invention in its various embodiments and equivalents thereof.
This application claims the benefit of U.S. Provisional Application No. 63/069,975, the content of which is incorporated by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US21/47508 | 8/25/2021 | WO | |

Number | Date | Country |
---|---|---|
63069975 | Aug 2020 | US |