The present invention relates to DNA synthesis and sequencing and, more particularly, to DNA nanoarrays.
DNA Nanotechnology
Deoxyribonucleic Acid (DNA) is composed of chemical building blocks called nucleotides. Each nucleotide has a sugar group, a phosphate group, and one of four nitrogen nucleobases: Cytosine [C], Guanine [G], Adenine [A] or Thymine [T]. Nucleotides are linked into chains (strands), with the phosphate and sugar groups alternating. Two DNA strands bind together with hydrogen bonds according to specific base-pairing rules: A binds with T, and C binds with G. This unique feature enables protein molecular machines, known as enzymes, to replicate long chains of DNA with extremely high precision. This fault-tolerant replication process is central to all known life forms, in which DNA is the information molecule.
In 1982, inspired by the symmetry and spatial arrangements of a school of fish in M. C. Escher's wood engraving Depth (see
In 2006, a new DNA assembly approach, now called “DNA Origami”, was invented by Paul Rothemund with Erik Winfree, using a single-stranded 7.3-kb-long genome (extracted from a naturally occurring M13 bacteriophage) as a scaffold, linked together at ~200 points using short synthetic DNA strands (“staples”), to fold planar, arbitrarily shaped, 2D objects with length scales around 100 nm [5]. Since then, DNA origami has been developed to form complex nanoscale 2D and 3D structures (see
DNA Synthesis and Sequencing
The core of all DNA-based technologies is our ability to engineer and manipulate nucleic acids. The process of “reading” DNA fragments into a digital file (encoded as a string of A/C/G/T's) is known as sequencing. The method of “writing” physical DNA given a digital input is known as synthesis. DNA sequencing has been revolutionized over the past two decades, mostly thanks to the invention of next-generation sequencing (NGS) technology, leading to a drop in the price per base by six orders of magnitude [7]. Nature, through evolution, has granted us extremely efficient and high-fidelity enzymes to synthesize DNA given a template strand. However, we are currently unaware of any efficient natural system that can reliably synthesize long DNA strands in a template-independent (de novo) manner. Therefore, the cost of de novo synthesis has only slightly declined over the last few decades and continues to be significantly higher than the cost of sequencing. (
The gold standard method for de novo DNA synthesis is phosphoramidite chemistry, a powerful technique that has matured over several decades [8]. For each base in the desired sequence, a cycle of four chemical steps (detritylation, coupling, capping, and oxidation) is executed to extend an existing DNA strand on solid support. The chemical synthesis process is sequential per base yet is simultaneous for quadrillions or more DNA strands on the surface. The method produces accurate DNA sequences up to 200 bases, after which the yield drops exponentially. Interestingly, despite the high cost of synthesizing long DNA sequences, the phosphoramidite method allows synthesis of an enormous library of unique short strands at a fixed cost, using mixed bases (also known as wobbles or randomized bases). In each coupling step, instead of adding a single nucleoside (a nucleotide without the phosphate group), a mixture of nucleosides with different bases is added. By a random chemical process, the specific location on the synthesized strand could now contain any of the mixed bases. Since this process occurs for a vast number of strands concurrently, a sequence library is generated. The synthesis cost per mixed base is equal to the cost per single base. According to degenerate bases nomenclature [9], the letter ‘N’ represents “any base”. Synthesizing ten consecutive Ns would result in a library that has 4^10 ≈ 10^6 unique, random sequences. In 2020, Meiser et al. demonstrated how this strategy can be employed as a true random number generator [10].
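As a minimal sketch of the mixed-base idea described above, the following Python snippet enumerates the concrete sequences encoded by a template in which ‘N’ stands for any base, and counts the library size. The function names and the example template are illustrative assumptions, not part of the disclosure.

```python
from itertools import product

BASES = "ACGT"  # the four nucleobases; 'N' in a template means "any of these"

def expand_mixed(template):
    """Yield every concrete sequence encoded by a template containing 'N' wobbles."""
    # Each position contributes either all four bases (for 'N') or the fixed base.
    choices = [BASES if ch == "N" else ch for ch in template]
    for combo in product(*choices):
        yield "".join(combo)

def library_size(template):
    """Number of distinct sequences in the library: 4 per 'N' position."""
    return 4 ** template.count("N")

if __name__ == "__main__":
    print(sorted(expand_mixed("ATNCG")))   # four sequences, one per base at 'N'
    print(library_size("N" * 10))          # ten consecutive Ns
```

Enumeration is only practical for a handful of mixed positions; for longer barcodes the count alone (4 to the power of the number of Ns) is the useful quantity.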
DNA Microarrays
DNA microarrays (also known as DNA chips) are an invaluable tool in many modern genomic studies. Essentially, microarrays are a group of DNA technologies where nucleic acid sequences are either deposited or synthesized in a 2-D (or sometimes 3-D) array, such that the sequences are immobilized to a surface (chip). Typically, each spot on the array (also known as a feature) is composed of >picomoles (>10^12 copies) of oligonucleotides sharing the same sequence. Each spot acts as a probe and the microarray serves as a sensing platform. See
A single probe can detect whether a single sequence (typically short, 25-100 bases long) exists in a target solution. By massively multiplexing to a large number of probes (some overlapping by sequence), the target solution can be tested for arbitrarily larger sequences. Therefore, there is a constant demand to maximize the number of unique probes on the surface, typically by minimizing the feature size. Fabrication techniques originally developed for microelectronics are now used in combination with chemical synthesis to create microscale features [15]. Currently, the smallest feature size for DNA microarrays is 5 μm (Affymetrix GeneChip® Human Gene 2.0ST). While micro- and nano-fabrication methods exist to generate <1 μm features, nanoscale probes face other significant detection challenges, such as a binding limit due to the law of mass action [16] and, importantly, the diffraction limit, which restricts the ability of any optical instrument to distinguish between two features separated by a lateral distance less than approximately half the wavelength of light used for imaging.
DNA Nanoarrays
Over the last few decades, the field of nanofabrication has been revolutionized by technologies such as electron beam lithography and nanoimprint lithography, as previously reviewed in [17]. Nonetheless, it is still exceptionally expensive to dynamically pattern at the nanometer regime, where the lowest cost of a deep sub-micron lithography tool is on the order of >$100K. Furthermore, nanoscale placement of multiple uniquely addressable features, such as a variety of proteins, DNA probes or small-molecule ligands using current top-down approaches remains a challenging task.
Artificial DNA nanostructures, such as DNA origami, have great potential as nanoarray platforms. DNA can be folded from a long single-stranded viral DNA to complex nanostructures with the help of more than 200 helper strands called “staples”. Specifically, DNA can be folded to a 100 nm wide flat surface. Each of the ~200 staples has a unique address with ~6 nm resolution on the assembled structure, constituting a “pixel” on a uniquely addressable DNA nanoarray. Each array can carry up to 200 elements, including organic dyes, metal nanoparticles, quantum dots, carbon nanotubes, and proteins. For a comprehensive review on conjugating DNA with nanomaterials see [18]. DNA origami-based addressable nanoarrays have been demonstrated for a wide range of applications, from single nucleotide polymorphism (SNP) detection [19] to studying protein interactions [20]. For an in-depth review on applications see [21].
Large-Scale 2D DNA Nanoarrays: State of the Art
Direct Origami Placement (DOP) [22, 23] combines top-down fabrication techniques, such as electron beam lithography, with bottom-up self-assembly to accurately place and orient an array of ~100 nm-wide DNA origami nanostructures on a micro- or macro-scale surface, as shown in
In 2017, Tikhomirov et al. [26] fabricated the largest 2D uniquely addressable array to date, by a pure self-assembly approach titled fractal assembly. DNA nanoarrays spanning a total area of 0.5 μm² while containing 8,704 unique addresses/pixels were synthesized using fractal assembly. To demonstrate how each pixel is uniquely addressable, Tikhomirov et al. patterned the arrays to form nanomolecular paintings of the Mona Lisa and various other patterns (
In biological systems, DNA serves as a carrier of hereditary information, facilitated by predictable and programmable base-pairing rules. The field of DNA nanotechnology takes the DNA molecule out of its original context and uses the same set of rules to construct complex structures and molecular machines at the nanoscale. At the nanoscale, precise organization of biological and non-biological materials in 2D or 3D space holds great promise for a vast range of applications in areas such as biophysics, point-of-care diagnostics, biomolecule structure determination, drug delivery, and more. Nucleic acid scaffolds, especially DNA origami, have emerged as a promising approach, by enabling <10 nm assembly of nanomaterials such as gold particles, carbon nanotubes, and quantum dots. Two-dimensional DNA nanostructures with a plurality of uniquely addressable linkage sites (“nanopixels”) are known as DNA nanoarrays.
Expanding the size of DNA nanoarrays is desired for a variety of applications, from whole-genome sequencing at a fraction of the cost to sustainable digital information storage. Yet, due to the stochastic nature of self-assembly, DNA origami-based approaches suffer from an inherent scale limit. Top-down fabrication techniques enable nanometrically precise patterning, yet single-molecule placement remains a daunting challenge. Currently, no method enables independent nanoscale manipulation of more than 10K diverse single molecules.
The synthesis cost per base remains a significant hurdle for the widespread adoption of DNA storage. The current cost of standard synthesis methods is approximately $800 per Megabyte [107]. Recently, Lee et al. demonstrated a proof-of-principle enzymatic synthesis method [108] that reduces the cost to the order of $1 per Megabyte. Still, the prices of standard storage devices are multiple orders of magnitude below this threshold, with magnetic tape costing $16 per Terabyte [109]. Even if we take into account the maintenance costs of magnetic tape, which are three orders of magnitude higher than those of DNA storage [102], DNA storage expenses must still decrease by four to five orders of magnitude to serve as a financially competitive alternative.
As can be seen, there is a need for a low cost method of independent nanoscale manipulation of more than 10K diverse single molecules to produce uniquely addressable DNA nanoarrays.
In one aspect of the present invention, a DNA nanoarray comprises a milliscale chip substrate; a binder bound to the milliscale chip substrate as a microscale spot having a uniform surface; and immobilized oligonucleotide sequences, each having a linker linked to the binder such that the immobilized oligonucleotide sequences form a monolayer and each of the immobilized oligonucleotide sequences having a length of at least N. N is a minimum length operative to guarantee within a statistical certainty that the immobilized oligonucleotide sequences are each unique.
In another aspect of the present invention, a method of producing a DNA nanoarray comprises providing a streptavidin-coated substrate; patterning the streptavidin-coated substrate by photolithography to produce a patterned surface having an array of microscale spots with active streptavidin binding sites; and immobilizing biotin-tagged oligonucleotides on the patterned surface by applying a solution containing the biotin-tagged oligonucleotides to the array of microscale spots; applying a buffer over the patterned surface; and washing the patterned surface in buffered saline solution. The biotin-tagged oligonucleotides each have a string of bases with a length operative to guarantee within a statistical certainty that the string of bases of the immobilized biotin-tagged oligonucleotide are each unique.
In another aspect of the present invention, a method of storing information on and retrieving information from the DNA nanoarray comprises providing the DNA nanoarray; writing bits to and/or storing spatial patterns to a subset of the nanopixels on the DNA nanoarray; and reading and/or visualizing the bits and/or the spatial patterns.
These and other features, aspects and advantages of the present invention will become better understood with reference to the following drawings, description, and claims.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The following detailed description is of the best currently contemplated modes of carrying out exemplary embodiments of the invention. The description is not to be taken in a limiting sense but is made merely for the purpose of illustrating the general principles of the invention, since the scope of the invention is best defined by the appended claims.
As used herein, the term “nanopixels” refers to an ultra-dense nanoarray of uniquely addressable oligonucleotides in solid phase.
Unless indicated otherwise by context, the term “unique” indicates that a string of bases is generally not duplicated within a set of oligonucleotides.
The term “room temperature” refers to a well-known temperature range of about 15° C. to about 25° C., such as about 20-22° C.
The term “oligo” is used herein as an abbreviation of the word “oligonucleotide”.
The term “ligation events” is used herein to describe DNA copies, made by a set of co-localized enzymatic reactions, that indicate proximity information for each oligo.
Except where otherwise noted, the term “sequencing depth” is defined as the average number of times a particular oligo is represented in a collection of random raw ligation events.
The terms “primer site” and “priming site” are used interchangeably herein, and may include the term “primer binding site”.
Broadly, one embodiment of the present invention is a new type of DNA nanoarray and a method of producing the nanoarray referred to herein as DNA Canvas.
DNA CANVAS refers to a low-cost process that relies on pairwise barcode association. At the end of the process, each unique barcode is spatially located within 10 nm resolution. The immobilized barcode can later be addressed by complementary strands carrying varied molecules of interest. The association mechanism is based on co-localized enzymatic reactions. Thus, the yield error is invariant to the array size. Notably, array size is governed by sequencing cost. Overall, pairwise association enables a new generation of affordable, scalable, uniquely addressable DNA nanoarrays.
DNA Canvas, the result of the fabrication process, is a milliscale chip that contains a microscale spot composed of millions of uniquely addressable DNA nanopixels. Alongside a physical chip is a digital file that maps the sequence of each nanopixel to a 2D coordinate with nanoscale precision. DNA Canvas can be used for a wide range of applications that require precise nanometric positioning of molecular elements in 2D space, at a low cost.
The present subject matter blurs the lines between engineering and biology to enable accessible prototyping for a wide range of nanoscale applications using nature's machinery. DNA Canvas offers a few significant gains over state-of-the-art methods in DNA nanoarray fabrication. First, the sheer number of pixels and addressability of the array are potentially far greater than with any current method. Self-assembly approaches such as fractal assembly suffer from an exponential relationship between the number of pixels and yield [26], due to inherent errors propagating globally. That is, any self-assembly mistake between two tiles prevents larger tiles in the assembly tree from being assembled correctly. Like any biological process, DNA Canvas is subject to some error. However, since the enzymatic reactions are co-localized on a surface, errors strictly affect a fixed-size neighborhood and do not propagate to other parts of the array. This simple notion implies there is no assembly scale limit on DNA Canvas size. As will be described infra, the scale laws governing DNA Canvas are directly tied to the costs of DNA sequencing, which have decreased by six orders of magnitude in the last two decades [7]. Hybrid methods such as DOP do not suffer from exponential decays in yield. However, while these techniques can manipulate relative placement, orientation, and quantity of DNA nanostructures, direct control over a diverse single-molecule library is still not in reach. DNA Canvas allows for specific molecules to be tagged with unique barcodes. Assuming the number of hybridized molecules is large enough (by the law of mass action), we can directly place molecules of interest in complex, nanometrically precise patterns. DNA origami-based nanoarrays have been used as nanofabrication assembly platforms [6, 4, 21]. However, the nanostructures are synthesized in solution and are typically deposited on a surface, leading to spatial stochasticity.
DNA Canvas builds upon the bottom-up nanofabrication techniques developed for DNA origami, while enabling single-structure assembly at a fixed location on-chip, surrounded by visible alignment markers, enabling a wide range of applications that require external interface with the assembled nanostructure.
Previous work has been done to reconstruct spatial organization strictly based on local neighborhood data, within the context of DNA. Glaser et al. [29] presented puzzle imaging, where spatially disordered samples are combined into a high-resolution image based on local properties. Specifically, local information can be generated by a chemical process, e.g., DNA polymerase encoding a unique string of DNA. Recently, Hoffecker et al. [30] introduced Polony Adjacency Reconstruction for Spatial Inference and Topology (PARSIFT), a computational framework designed for decoding pairwise associations between adjacent DNA polonies into images. Notably, these methods operate without a priori knowledge to reconstruct an unfamiliar topological map. DNA Canvas is distinct since the topology is known a priori: a monolayer grid.
Barcode Uniqueness
Each oligonucleotide (oligo) contains a barcode, a unique string of bases (e.g., ATGTACCA . . . ) of predetermined length. We assume each barcode is unique on the DNA Canvas. Experimentally, this assumption is enforced by statistical means. Traditionally, oligonucleotides are synthesized base by base, in a linear series of chemical reactions. A barcode is created when at a specific step, multiple bases are used in the chemical reaction. The random choice is represented by the letter N. For example, synthesizing the sequence ATNCG will result in four different oligo sequences: ATACG, ATTCG, ATCCG, ATGCG. Similarly, synthesizing the sequence ATNNCG will result in 16 (= 4^2) different sequences. A barcode of length k has 4^k distinct possibilities. Denote d as the diameter of a circular DNA Canvas (in nanometers). It follows that there are
binding sites. The probability, by counting, of having a non-unique barcode is:
P (non-unique barcode, k, d,
For example, a circular DNA Canvas of diameter d = 5 μm contains m = π·10^6 binding sites. Given a barcode length of k = 14, we approximate an upper bound on the probability of having a non-unique barcode, by calculating the product of the largest 200,000 factors:
P (non-unique barcode, k=14, d=5000,
Essentially, the relationship between k and m is logarithmic (see Proof “Barcode Length vs. Number of Binding Sites”, below), which implies that a larger DNA Canvas requires synthesis of oligos containing only slightly longer barcodes to validate the uniqueness assumption. The Barcode Length vs. Number of Binding Sites Proof gives an exact formula for choosing k given m.
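The logarithmic scaling of barcode length with the number of binding sites can be illustrated with a short Python sketch. It rests on two assumptions that are not the disclosure's own formulas: (i) binding sites tile a circular spot at a 2.5 nm pitch, which reproduces the m = π·10^6 figure for d = 5 μm, and (ii) uniqueness is judged by the standard birthday-problem union bound over the whole array (no two barcodes coincide), which may be a stricter criterion than the probability defined in the text. All function names are hypothetical.

```python
import math

def num_binding_sites(diameter_nm, spacing_nm=2.5):
    """Sites on a circular spot, m = pi * (d / (2 * s))**2 (assumed packing model)."""
    return math.pi * (diameter_nm / (2.0 * spacing_nm)) ** 2

def collision_bound(m, k):
    """Union bound on P(any two of m random length-k barcodes coincide)."""
    return min(1.0, m * (m - 1) / 2.0 / 4.0 ** k)

def min_barcode_length(m, eps):
    """Smallest k with collision_bound(m, k) <= eps; grows like 2 * log4(m)."""
    k = max(1, math.ceil(math.log(m * (m - 1) / (2.0 * eps), 4)))
    while collision_bound(m, k) > eps:        # guard against rounding in log
        k += 1
    while k > 1 and collision_bound(m, k - 1) <= eps:
        k -= 1
    return k

if __name__ == "__main__":
    m = num_binding_sites(5000)               # d = 5 um expressed in nm
    print(m, min_barcode_length(m, 0.01))
    print(min_barcode_length(100 * m, 0.01))  # 100x more sites, a few more bases
```

Under this bound, multiplying the number of sites by 100 adds only about log4(10^4) ≈ 6.6 bases to the required barcode length, which is the sense in which the relationship is logarithmic.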
Global Mapping from Pairwise Measurements
In essence, fabricating a DNA Canvas is fairly simple: oligos are synthesized with a unique barcode and tagged with a ligand molecule (e.g., Biotin). A highly concentrated solution of tagged oligos is suspended over a coated surface that binds the ligand molecule (e.g., Streptavidin-coated slide). After a short time, the surface is washed, such that only oligos that have physically bound to the surface remain. Thus, we now have a surface with a plethora of oligos, each with a unique address and location. However, both the barcode and the spatial location of each oligo are unknown. Algorithms that can recover these unknowns with high accuracy using available information are described herein. We aim to obtain a mapping such that each barcode (a string of bases of a predetermined length, e.g., ATGAAT . . . ) is mapped to a spatial two-dimensional coordinate (e.g., 55,100). The input to the problem is local, pairwise measurements.
As will be described experimentally, we perform a set of carefully designed enzymatic reactions: ligation, amplification, and restriction. In the ligation step, each pair of adjacent oligos is connected. Next, copies of connected oligos are synthesized. Each copy contains two barcodes, one for each oligo in the pair. Then, the two oligos are disconnected. This last step is critical for keeping the iterations independent. In each cycle, a different pair of oligos is ligated (each oligo can have multiple neighbors). The output of this process is a pool of DNA copies, each containing two barcodes that must be adjacent on the surface. Last, we perform next-generation sequencing. Sequencing transforms physical DNA copies into digital information. In effect, we get a long list of DNA sequences, each with two unique barcodes.
This computational problem of global mapping from local proximities is also known as the Molecule Problem [33] or the Graph Realization Problem [34] and has been applied in a variety of settings from wireless sensor networks to structural biology. The present disclosure enables one to leverage an important topological assumption—the oligos are uniformly distributed in 2D space. This assumption gives rise to a set of algorithms that can solve the problem with high accuracy, short time, and limited computational resources.
Graph Representation
The DNA Canvas computational problem can be treated as a graph. Each barcode is a node. An edge exists between two nodes if a pair of barcodes appears in the list of measurements. Edges and nodes are unweighted. The goal is to find a coordinate for each node, that is, an embedding of the graph in 2D space. In our setting, each edge represents a ligation and amplification event between a 5′ oligo and a 3′ oligo. These enzymatic reactions can only happen if the two oligos can reach each other. Furthermore, we assume that the oligos are uniformly distributed on the surface, hence the embedding of nodes should be uniformly distributed in 2D space. The number of edges is dictated by the number of ligation rounds and sequencing depth.
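The graph representation above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library rather than the graph-tool library used for the experiments; the function names and the toy barcodes are assumptions. Duplicate measurements collapse into a single unweighted edge, and a pair with identical barcodes is skipped because it carries no proximity information.

```python
from collections import defaultdict

def build_graph(barcode_pairs):
    """Undirected, unweighted adjacency sets from (barcode_a, barcode_b) measurements."""
    adjacency = defaultdict(set)
    for a, b in barcode_pairs:
        if a == b:
            continue  # a self-pair says nothing about neighbors
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)

def edge_count(adjacency):
    """Each undirected edge is stored twice, once per endpoint."""
    return sum(len(nbrs) for nbrs in adjacency.values()) // 2

if __name__ == "__main__":
    pairs = [("AAT", "CGA"), ("CGA", "AAT"), ("CGA", "TTG")]
    graph = build_graph(pairs)
    print(graph, edge_count(graph))
```

The resulting adjacency structure is the input to any of the layout algorithms discussed in the text, which assign a 2D coordinate to each node.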
Parameter Space
There are a few key parameters in DNA Canvas design. Each of the parameters acts as a “knob” that can be fine-tuned to optimize for resolution.
Methods
All experiments in this disclosure were written in Python 3. Graph data structures and algorithms (Fruchterman and Reingold [37], Scalable Force Directed Placement, SFDP [40]) were provided by the graph-tool [46] library, a highly efficient Python module for manipulation and statistical analysis of graphs. The high-dimensional embedding method [42] was implemented by the author, with the support of the NumPy [47] and scikit-learn [48] libraries. Figures were generated using the Matplotlib [49] and seaborn [50] libraries. All experiments were executed on a single- or multi-core Intel Core i7-8550U CPU @ 1.80 GHz.
Fruchterman and Reingold was executed for 100 iterations in grid mode, where repulsive forces only act on a local neighborhood. High-dimensional embedding was implemented with a random sample of 1% of the nodes acting as pivots. SFDP was run using an adaptive cooling scheme.
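To make the role of pivots concrete, the following Python sketch covers only the first stage commonly associated with pivot-based high-dimensional embedding: each node is described by its hop distances to a random sample of pivot nodes. It is not the author's implementation; the subsequent projection of these coordinates to 2D (for example by principal component analysis) is omitted, and the function names, the seed, and the placeholder value for unreachable nodes are assumptions.

```python
import random
from collections import deque

def bfs_hops(adjacency, source):
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adjacency[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def pivot_coordinates(adjacency, pivot_fraction=0.01, seed=0):
    """Describe each node by its hop distances to randomly sampled pivots."""
    nodes = sorted(adjacency)
    n_pivots = min(len(nodes), max(1, round(pivot_fraction * len(nodes))))
    pivots = random.Random(seed).sample(nodes, n_pivots)
    tables = [bfs_hops(adjacency, p) for p in pivots]
    unreachable = len(nodes)  # placeholder larger than any possible hop count
    coords = {v: [t.get(v, unreachable) for t in tables] for v in nodes}
    return pivots, coords

if __name__ == "__main__":
    path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
    print(pivot_coordinates(path, pivot_fraction=0.5))
```

With 1% of nodes as pivots, a graph of one million nodes yields a 10,000-dimensional description per node, which is why a dimensionality-reduction step follows in practice.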
Since 2007, Next-Generation Sequencing (NGS) costs have plummeted by six orders of magnitude, with a current price tag of approximately $0.01 per 10^6 bases [7]. DNA Canvas relies on NGS of ligation events between immobilized oligonucleotides (oligos). Each oligo serves as a nanopixel, with a 2D coordinate and address. Importantly, to achieve medium-to-high resolution we must sequence an ample portion of the entire DNA Canvas surface. Briefly, we need to sequence at least four times (4×) more ligation events than the number of pixels to approach 10 nm accuracy. The size of each ligation event is twice the length of the oligos. That is, to fabricate a DNA Canvas composed of N nanopixels (each k bases long), we have to sequence at least 8kN bases, which currently costs $8·10^-8·kN. For example, a DNA Canvas with 1M nanopixels (k = 50) could cost merely $4.
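The cost estimate above can be reproduced with a small Python sketch. The function and parameter names are illustrative assumptions; the defaults encode the figures quoted in the text (4× sequencing depth, ligation events twice the oligo length, and $0.01 per 10^6 bases).

```python
def sequencing_cost_usd(n_pixels, oligo_len, depth=4, usd_per_base=0.01 / 1e6):
    """Estimated NGS cost for a Canvas of n_pixels oligos, each oligo_len bases long."""
    events = depth * n_pixels          # ligation events that must be sequenced
    bases = events * 2 * oligo_len     # each event spans two oligos
    return bases * usd_per_base

if __name__ == "__main__":
    # 1M nanopixels with k = 50 bases per oligo
    print(round(sequencing_cost_usd(1_000_000, 50), 2))
```

Because the cost is linear in both N and k, it scales directly with array size, and any future drop in the price per base carries over proportionally.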
The enzymatic reactions that generate the pool of ligation events all happen in a liquid volume on the surface where the oligos are immobilized. Assuming the 2.5 nm S-B model described herein, a 1M nanopixels circular Canvas would span a spot of diameter
For reference, a commercially available lab pipette can handle droplets of volume 0.1 microliter, while state-of-the-art inkjet printers can dispense droplets of 1-10 picoliter. Due to the cubic relationship between the diameter and the volume of a droplet, the volume of half a droplet of diameter 2.8 μm is 16 fL—two orders of magnitude below the capabilities of advanced inkjet printers.
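The 2.8 μm spot diameter used above can be checked with a short Python sketch. It assumes, as an interpretation of the 2.5 nm model referred to in the text, that each nanopixel occupies an area of 2.5 nm × 2.5 nm on a circular spot; the function name is hypothetical.

```python
import math

def spot_diameter_nm(n_pixels, spacing_nm=2.5):
    """Diameter of a circular spot holding n_pixels sites at the given pitch.

    Inverts m = pi * (d / (2 * s))**2, an assumed packing model.
    """
    return 2.0 * spacing_nm * math.sqrt(n_pixels / math.pi)

if __name__ == "__main__":
    d_nm = spot_diameter_nm(1_000_000)
    print(f"{d_nm / 1000:.1f} um")  # diameter for a 1M-nanopixel Canvas
```

The square-root dependence means that a thousandfold increase in the number of nanopixels enlarges the spot diameter only about 32-fold.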
Thus, we are left with a few alternatives. One, we could aim to fabricate a Terascale DNA Canvas that spans >1 mm, which would imply an exorbitant sequencing cost (>$1M). Nonetheless, as part of the roadmap, we will chart an alternative path to a Terascale Canvas at a reasonable cost. Alternatively, advanced methods exist for direct femtoliter [51] or even attoliter [52] droplet dispensing, yet these require complex setups. Recently, advancements in nanofabrication and photolithography have led to their wide adoption across academia and industry. This set of tools and techniques serves as an ideal solution for the problem of fabricating microscale DNA spots, in a manner that is affordable, scalable, and reproducible.
DNA Immobilization
Glass slides have been widely used for DNA immobilization for decades to fabricate DNA chips for a wide range of applications [53]. Glass serves as an ideal substrate due to excellent optical properties and minimal compatibility issues with most organic solvents [54]. Numerous techniques have been applied to effectively immobilize DNA onto glass surfaces [55]. Overall, methods usually require some chemical modification of the surface as well as tagging the oligonucleotide (oligo) with a linker upon synthesis. Roughly, methods can be divided by the linking chemistry: covalent and non-covalent bonds.
Covalent linking methods include Epoxy or Aldehyde modified surfaces combined with Amine-tagged oligos, or alternatively gold (Au) modified surfaces with thiol-tagged oligos. These techniques offer excellent stability, high binding strength, and long-term usage, yet can suffer from problems such as crowding effect or island forming [55].
Streptavidin-Biotin binding is the most common non-covalent approach and is used in a wide variety of scenarios [56]. Streptavidin is a tetrameric protein that exhibits an extremely high, nearly covalent affinity (Ka = 10^15 M^-1) for its ligand, Biotin [31]. Generally, the surface is first modified with a layer of Biotin, and then Streptavidin is added (
Streptavidin-Biotin-based DNA immobilization is an attractive option for DNA Canvas due to the simple and quick protocol as well as the predicted surface density that is governed by the size of the Streptavidin molecule (5 nm [32]). Determining the area of a single nanopixel is important for two main reasons: first, it is a key assumption for the computational model in charge of localization; second, it allows for efficient hybridization. In contrast to prior art DNA microarrays, where every spot is a colony of oligos with the same sequence, each barcode in our DNA nanoarray appears on a single molecule. Therefore, we need to assert that each oligo has adequate spacing for a complementary probe to bind. Nonetheless, covalent methods still serve as an excellent alternative, especially for production-grade devices, thanks to their long-term stability. The present disclosure also envisions use of Epoxy-Amine linking.
Chiplet Form Factor
Generally, coated glass surfaces are manufactured either in a form factor of a microscope slide (25×75 mm) or as a wafer (4 or 6 inches). The fabrication process of a DNA Canvas entails three steps: immobilization, enzymatic reactions, and computational localization. The first step can be done by suspending a nonreactive droplet over a coated slide or wafer for a short period of time (15-60 minutes). However, the second step requires iterative cycles where three different reaction solutions are suspended over the surface.
Enzymatic reactions on a surface (“chip”) are a wide and prominent research area, also known as Lab-On-Chip [57]. A technique relevant to our setting is Solid Phase PCR (SP-PCR) [58], where DNA primers are immobilized to a surface and the amplification happens on-chip. Every reaction has an aqueous medium, a carefully designed mixture of salts, buffer, enzymes, and nucleotides aimed at optimizing the reaction efficiency. Therefore, the solution cannot be allowed to dehydrate over time. Commonly, SP-PCR and other on-chip reactions use either sealed chambers [59], oil encapsulation [60] or microfluidic devices [61] to guarantee the reaction conditions stay stable. While sealed chambers and encapsulation in mineral oil are both simple and accessible methods, our protocol requires frequent changes to the reaction solution. Furthermore, we need to store the solution for downstream sequencing. A custom microfluidic device could serve as an excellent solution, yet we opted for a simpler approach.
Instead of suspending an aqueous medium on the surface of a slide, we dice the slide into small pieces (chiplets) that can be suspended inside an aqueous volume. Specifically, we dice the coated slide into chiplets that can fit exactly into the bottom of a commercially available PCR tube (3×6 mm), a common lab instrument that can host reactions in volumes of 10-50 μl. Three 50 μl tubes are prepared containing the needed buffers and enzymes, each tube having a specific reaction condition, and the chiplet is cycled between the tubes, as shown in
The present disclosure envisions upscaling the fabrication process per chiplet, in terms of time and costs. Electron beam lithography (EBL) may, in some cases, be used for nanoscale writing onto DNA Canvas (e.g., DNA Information Storage), although for fabrication purposes, the EBL time of 15-60 minutes per 1 cm² does not scale well. The direct-write photolithography tool, MLA-150, proved to be especially useful for prototyping, with an exposure time of ~2 minutes per 0.5×0.5 inch². The evident next step beyond prototyping is mask aligner-based fabrication. Mask aligners use a predetermined optical mask to transfer a pattern onto a large area of resist in a fixed amount of time. Thus, using a commercially available mask aligner, a 6-inch wafer that contains ~1000 chiplets can be fabricated in 30 seconds.
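The ~1000 chiplets per wafer figure can be sanity-checked with a rough Python estimate. The sketch assumes a nominal 6-inch (152.4 mm) wafer, a 3×6 mm chiplet footprint, and a simple area ratio that ignores edge loss and dicing kerf, so it is an upper bound rather than a layout; all names are hypothetical.

```python
import math

def chiplets_per_wafer(wafer_diameter_mm=152.4, chiplet_w_mm=3.0, chiplet_h_mm=6.0):
    """Upper bound on chiplet count from the wafer-to-chiplet area ratio."""
    wafer_area = math.pi * (wafer_diameter_mm / 2.0) ** 2
    return math.floor(wafer_area / (chiplet_w_mm * chiplet_h_mm))

def seconds_per_chiplet(exposure_s=30.0):
    """Exposure time amortized over every chiplet on one wafer."""
    return exposure_s / chiplets_per_wafer()

if __name__ == "__main__":
    print(chiplets_per_wafer())
    print(f"{seconds_per_chiplet() * 1000:.0f} ms per chiplet")
```

Under these assumptions a single 30-second exposure amounts to roughly 30 ms per chiplet, compared with minutes per chiplet-sized area for the direct-write and EBL options mentioned above.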
Last, while we chose to focus on Streptavidin-Biotin chemistry, the present disclosure further envisions various linking methods, specifically covalent ones.
At present, DNA microscopy has been experimentally applied only a handful of times. In 2017, Schaus et al. [67] were the first to validate this approach, by immobilizing seven synthetic probes on a DNA Origami substrate with predetermined distances between each probe. Using the Auto-cycling Proximity Recording (APR) method, immobilized hairpin probes designed with palindromic domains are interchangeably hybridized and amplified. The main limitation of APR is the probe design, which requires synthesis of a unique barcode and its complement on a single strand. Mixed-base synthesis therefore cannot be used, making synthesis cost linear in the number of probes. In 2019, Weinstein et al. [28] applied DNA microscopy at scale for the first time to visualize the spatial organization of mRNA molecules (10^5-10^6) inside a cell, strictly based on DNA interactions. Each ligation and amplification event with a Unique-End-Identifier (UEI) can diffuse through the cell, such that the output of the computational model is a diffusion map for each unique probe identifier or unique molecular identifier (UMI). In 2020, Gopalkrishnan et al. [70] showed how hundreds of probes immobilized on a DNA Origami surface can be precisely localized at 2 nm precision, using a molecular ruler approach. Using a molecular ruler, each ligation event holds information on the physical distance between the two immobilized probes. Also in 2020, Ambrosetti et al. [66] experimentally demonstrated how DNA microscopy can map membrane protein nanoenvironments, using a technique called NanoDeep where a synthetic comb made of oligos is built to scan a membrane. Amplification and sequencing can reveal adjacency information of the membrane proteins and therefore a global map.
We build upon the work described above by applying DNA microscopy in a novel scenario: on-chip. We fabricate a chiplet able to store vast amounts of compact unique oligos (>10⁶). This particular setting allows us to scale beyond previous attempts. Our goal is not topological imaging, as we already know the surface is flat and packed with binding molecules. Instead, our aim is precision mapping between a barcode and a spatial location, in order to generate a device useful for downstream applications. Here, we present Chip-based Iterative Proximity Ligation and Extension (ChIPLEx), an on-chip enzymatic reaction method to generate proximity data encoded into DNA.
Sensitivity to Elevated Temperatures
The bond between Streptavidin and Biotin has long been regarded as the strongest non-covalent biological interaction known in nature (Ka ≈ 10¹⁵ M⁻¹) [31]. The bond forms very rapidly and is stable over wide ranges of pH and temperature [74, 75]. Nonetheless, Holmberg et al. [71] rigorously studied the conditions under which Streptavidin-Biotin bonds can be reversibly broken (
The results demonstrate that the Streptavidin-Biotin bond is sensitive to elevated temperatures in the relevant buffers. Consequently, we design the reaction and select the enzymes to be optimal at room temperature.
Steric Hindrance
A design constraint when running enzymatic reactions on-chip is steric hindrance due to surface proximity. Therefore, it is a common practice to add a “spacer” between the attachment molecule and the enzyme's recognition site for on-chip reactions [78, 79]. Spacers can be either nucleic acid sequences or molecules such as Polyethyleneglycol (PEG) that are synthesized within the oligonucleotide. Spacers have been shown to increase hybridization yield [80] and enzymatic efficiency [81].
We devise an assay to validate the optimal spacer length for DNA Canvas. We synthesize a Biotin-tagged oligo composed of a spacer, a priming site, and a ~20-base sequence that simulates the barcode region but is limited to the nucleotides A/C/T. At the 3′ end of the oligo, there is either a Guanine (G) or a Cytosine (C). The protocol (on-chip enzymatic fluorescent labeling) starts by immobilizing a pool of oligos to the surface. Afterward, an enzymatic extension reaction is carried out on-chip by a DNA Polymerase. The building blocks for every extension reaction are nucleoside triphosphates: dATP, dCTP, dGTP and dTTP. DNA Polymerase synthesizes these blocks onto the coding strand using base-pairing rules. If the template sequence contains the nucleic acid Guanine (G), dCTP will be attached, and so forth. The assay works by supplying ample amounts of dATP, dGTP and dTTP, yet no dCTP. Instead, dCTP-Cy3 (Jena Biosciences) is used, which is a modified dCTP carrying a fluorescent dye (Cy3). Thus, if the surface has oligos that have Guanine (G) at their 3′ end, we expect to observe fluorescence. If there is no Guanine in the template strand, no significant fluorescent signal is expected after the surface is washed.
Using this assay, a spacer strategy for 3′ and 5′ Biotin-tagged oligos has been validated. For 3′ Biotin-tagged oligos, a spacer composed of a single hexaethylene glycol (HEG, sometimes referred to as PEGS) along with nine Thymines (polyT9) was sufficient to observe a strong fluorescent signal strictly for oligos containing Guanine. Moreover, for 5′ Biotin-tagged oligos, a spacer composed of two HEGs and three Thymines was sufficient.
Computational Sequence Design
NUPack [82] is a comprehensive computational framework for analysis of nucleic acid sequences over complex scenarios. Using the recently released NUPack Python module, we computationally designed the sequence to minimize the probability for unwanted interactions between the oligonucleotides and the primers. Table 1, below, presents an example DNA sequence with barcode length 14 for ChIPLEx. In bold are endonuclease recognition sites. N represents a random (mixed) base.
GGGCCC
Notably, the endonuclease recognition sites are 6+ bases long. To prevent the endonucleases from cleaving or nicking the random barcode region, the barcode sequence was interlaced with two Thymines so that a 6+ base recognition sequence (none of which contains TT) can never appear. The spacer regions are designed accordingly. The reverse priming site is 15 bases long, with an estimated melting temperature (Tm) of 53.9 °C. The forward priming site is 25 bases long and contains a nicking site for Nb.BbvCI. Post nicking, the resulting priming site is 15 bases long with a melting temperature (Tm) of 53.4 °C.
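The design rule above can be checked mechanically. The following is a minimal sketch, not the procedure used for the sequence in Table 1: it tests whether any realization of a mixed-base template could contain a given recognition site on either strand. The function names and the two example templates are hypothetical, and the check covers only the region passed in, not its junctions with flanking sequence.

```python
# IUPAC degenerate-base codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site):
    """Reverse complement of a plain A/C/G/T sequence."""
    return site.translate(COMPLEMENT)[::-1]


def site_can_occur(template, site):
    """True if some realization of the mixed-base template contains the site,
    or its reverse complement (the site read on the opposite strand)."""
    for target in {site, reverse_complement(site)}:
        for start in range(len(template) - len(target) + 1):
            window = template[start:start + len(target)]
            if all(base in IUPAC[code] for code, base in zip(window, target)):
                return True
    return False


# Hypothetical barcode layouts, checked against the 6-base site GGGCCC.
print(site_can_occur("NNNTTNNNTTNNN", "GGGCCC"))  # random runs broken up by TT
print(site_can_occur("NNNNNNTT", "GGGCCC"))       # six consecutive random bases
```

In the first layout every 6-base window contains at least one fixed Thymine, so a site without T cannot appear; in the second, the six consecutive random bases can spell the site.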
Real-Time Quantitative NEAR Using Molecular Beacons
Molecular Beacons (MB) are specifically designed DNA hairpin structures that are widely used as sensitive fluorescent probes [83, 84]. As can be seen in
MB have been widely employed as a quantitative assay for PCR [85] and LAMP [86]. Here, we leverage MB to quantify NEAR in real time. We synthesize a DNA probe using the same sequence design mentioned above, yet fixing the barcode region to a predetermined sequence. The complementary sequence to the fixed barcode acts as the MB's target. For all MB experiments, we follow the strategy of Vet and Marras [85] for applying MB in real-time PCR assays. In particular, we use the signal-to-background ratio (S/B) as the metric. S/B is calculated from three measured quantities: the fluorescence intensity of the buffer (without MB), denoted F_buffer; the fluorescence intensity of a solution containing MB without the target, denoted F_closed; and the fluorescence intensity of the sample of interest, containing the target and MB, denoted F_open.
S/B is defined as:
\[ S/B = \frac{F_{open} - F_{buffer}}{F_{closed} - F_{buffer}} \]
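As a minimal sketch, the ratio can be computed from the three readings as follows, assuming the usual buffer-corrected definition (F_open − F_buffer)/(F_closed − F_buffer). The function name and the example readings are hypothetical and are not measured values from this disclosure.

```python
def signal_to_background(f_open, f_closed, f_buffer):
    """Buffer-corrected signal-to-background ratio of a molecular beacon."""
    background = f_closed - f_buffer
    if background <= 0:
        raise ValueError("closed-beacon signal must exceed the buffer signal")
    return (f_open - f_buffer) / background


# Hypothetical readings in arbitrary fluorescence units.
print(signal_to_background(f_open=5200.0, f_closed=400.0, f_buffer=100.0))
```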
First, we generate a calibration curve for different known concentrations of the target oligo. As can be seen in
Next, we perform NEAR starting with a template oligo concentration of 0.1 μM. The template oligo does not contain the target sequence, but a nicking priming site and the fixed barcode region. As a positive control, we set up a reaction containing the target oligo at 1 μM. For negative control, we set up two reactions: one lacks the enzymes (DNA Polymerase and nicking endonuclease) and the second lacks the primer. All reactions are incubated at room temperature and sampled every 10 minutes using a fluorescent microplate reader. The results are shown in
The present disclosure envisions performing NEAR on-chip. To validate the results of such a reaction, we can use either the Molecular Beacon strategy (Real-time Quantitative NEAR using Molecular Beacons, supra) or simply gel electrophoresis. The present disclosure also envisions attaining a deeper understanding of the system kinetics by analyzing NEAR amplification rates across the parameter space. Last, combining the three reactions (ligation, extension, and cleavage) into a single temperature-controlled reaction, similarly to PCR, would allow us to truly scale the number of ligation events per oligo, thus minimizing the RMSD error.
DNA Sequencing and Fluorescence Sampling Methods
All DNA sequences, including the Molecular Beacon, were analyzed and optimized using the NUPack® [82] computational framework. All protocols for the biological experiments are described in Appendix B. Real-time fluorescence sampling was done using Synergy H1 Microplate Reader (Biotek®), using fluorescence mode with excitation/emission of 488/510 nm, corresponding to fluorophore 6-FAM (6-carboxyfluorescein).
Patterning/Decoration
The organization of nanomaterials such as gold nanoparticles, quantum dots, and carbon nanotubes with nanoscale precision is one of the central challenges in nanotechnology. A wide variety of nanomaterials can interface with DNA through a range of conjugation techniques, which have been well summarized in a previous review by Samanta and Medintz [18]. Currently, the leading DNA-based self-assembly method for nanoparticle manipulation is DNA Origami, as previously reviewed in [21, 4, 87]. DNA Canvas provides an attractive, complementary approach for nanomaterials placement. DNA Canvas can leverage techniques originally developed for DNA Origami, where carbon nanotubes or gold particles are modified with nucleic acids to enable hybridization at predetermined locations on the DNA-based scaffold. DNA Canvas acts as a different kind of DNA-based scaffold, that differs in phase (solid vs. liquid) and scale (>1M vs. 10K) compared to state-of-the-art DNA Origami.
Prior to patterning nanomaterials, the first step would be a purely DNA-based proof-of-concept. We will design and synthesize DNA strands that form secondary structures such as hairpins that can later be imaged by Atomic Force Microscopy (AFM), a key tool for nanoscale imaging of DNA structures [88]. Specifically, given the list of spatial locations and barcodes, we can digitally transform any pattern to a list of complementary sequences (
Combinatorial Approach to Low-Cost Decoration
Decorating a DNA Canvas using complementary strands requires nucleic acid synthesis. Explicitly, to create a pattern composed of 100K pixels on top of a 1M DNA Canvas, it seems we would have to synthesize 100K unique oligonucleotides. Current DNA synthesis costs for short strands are on the order of $0.15 per base (Genewiz), not including complex modifications such as linker molecules. Thus, trivial patterning of 100K pixels with barcode length 14 per pixel pattern would cost at least $200K.
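As a worked check of the estimate above, using only the figures quoted in this paragraph (unmodified oligos, one oligo per pixel):

\[ 10^{5}\ \text{oligos} \times 14\ \tfrac{\text{bases}}{\text{oligo}} \times \$0.15\ \text{per base} = \$210{,}000 \]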
For this reason, we introduce an approach to dramatically reduce decoration costs. Oligonucleotides can be synthesized with mixed bases, also known as wobbles or randomized bases, without incurring extra costs. Using degenerate base notation, we can specify any subset of {A,C,G,T}; for example, R is A/G, B is C/G/T, and N is any base. For the full nomenclature see [9], the disclosure of which is incorporated by reference in its entirety. For any library of ACGT sequences, we can reduce the number of synthesized oligos using degenerate oligo synthesis, such that the tube containing the synthesis product contains the same sequence library. Compression from ACGT to mixed bases is akin to Karnaugh maps [89], a method for simplifying a Boolean expression to a minimum number of logic gates in electrical circuit design. Similar to Karnaugh maps, defining “Don't Cares” (pixels for which it is irrelevant whether they are hybridized or not, or barcodes that do not appear on the Canvas) can greatly reduce the library size. Furthermore, for many applications the absolute location and orientation on the chiplet are insignificant. As a result, there is a vast conformation space in which to encode the same pattern, up to translations, rotations, or symmetrical transformations. Thus, the algorithm for computing a low-cost decoration synthesis library will search over the conformation space for a global minimal cost, dictated by the number of mixed-base sequences. This problem can be reformulated as the NP-complete set cover problem, although approximation methods exist, especially since our input domain (four-letter options, barcode length k) is relatively limited compared to classic set cover.
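The compression idea can be illustrated with a short sketch that expands mixed-base oligos and checks whether a candidate set reproduces a target library exactly. It illustrates only the verification step, not the search algorithm envisioned above; the function names and the 3-base library are hypothetical.

```python
from itertools import product

# IUPAC degenerate-base codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(oligo):
    """All plain A/C/G/T sequences produced by one mixed-base oligo."""
    return {"".join(bases) for bases in product(*(IUPAC[code] for code in oligo))}


def same_library(mixed_oligos, library):
    """True if the mixed-base oligos together yield exactly the target library."""
    produced = set()
    for oligo in mixed_oligos:
        produced |= expand(oligo)
    return produced == set(library)


# Hypothetical library: four explicit sequences collapse into one mixed oligo.
library = ["ACA", "ACG", "GCA", "GCG"]
print(same_library(["RCR"], library))  # RCR yields exactly these four
print(same_library(["RCN"], library))  # RCN also yields ACC, ACT, GCC, GCT
```

Here one synthesized oligo replaces four. Allowing the extra sequences of the second candidate would correspond to declaring them “Don't Cares”.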
From DNA Canvas to DNA Blackboard
Reusability is a crucial step on the path to accessible nanoscale patterning. The melting temperature (Tm) of double-stranded DNA is defined as the temperature at which half of the strands are in the random coil or single-stranded state. Therefore, a pattern can be removed from the surface post hybridization simply by controlled heating. For barcodes of length 14, the melting temperature can vary between 37 °C and 80 °C, depending on the ratio of C/G vs. A/T. A computational tool can be built such that, for any given conformation, the Tm needed to melt the pattern off the surface is calculated. This approach also opens an opportunity for nanometamaterial patterning, that is, patterns that transform in response to temperature.
Streptavidin-Biotin bonds are sensitive to elevated temperatures. This property could be mitigated by a well-designed mixture of buffer and salts. Moreover, switching to a different linking chemistry, especially one based on covalent bonds to the surface, would function as a sustainable solution allowing for reusability with minimal degradation.
Low Cost Replication
The cost per pixel of a DNA Canvas chip is dictated by next-generation sequencing costs, as elaborated supra. Next-generation sequencing (NGS) costs have been dropping exponentially since 2007, with a current price of $0.01 per 10⁶ bases [7]. Each pixel on a DNA Canvas requires sequencing of at least four distinct ligation events, where each event contains two barcodes of length k (along with some fixed-length components such as primers and enzyme recognition sites). In brief, the sequencing cost is linear in the number of nanopixels. For example, a 1M DNA Canvas with barcode length 14 would cost merely $4, while a 1G DNA Canvas with barcode length 20 would cost ~$4,000 to fabricate.
Why do we need to scale beyond 1G nanopixels? Some applications require milliscale or larger chips that support accurate nanoscale immobilization by DNA. Since every nanopixel must have at least four neighbors within its reach distance for the computational step to succeed, milliscale chips are bound to have >10¹² nanopixels. Notice that even if the application requires only a limited number of nanopixels across the milliscale chip, the fabrication process requires sequencing of the entire surface.
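The linear cost model can be sketched as follows. The number of fixed bases per read is not stated in the text; the value of 72 used here is an assumption chosen so that a read with two 14-base barcodes is 100 bases long, which reproduces the quoted $4 figure. All names are hypothetical.

```python
COST_PER_BASE = 0.01 / 1e6  # $0.01 per 10^6 bases, as quoted above
EVENTS_PER_PIXEL = 4        # at least four ligation events per pixel
FIXED_BASES = 72            # assumption: primers and recognition sites per read


def sequencing_cost(num_pixels, barcode_length, fixed_bases=FIXED_BASES):
    """Sequencing cost in dollars, linear in the number of nanopixels."""
    read_length = 2 * barcode_length + fixed_bases  # two barcodes per event
    return num_pixels * EVENTS_PER_PIXEL * read_length * COST_PER_BASE


print(round(sequencing_cost(10**6, 14), 2))  # 1M pixels, barcode length 14
print(round(sequencing_cost(10**9, 20), 2))  # 1G pixels, barcode length 20
```

Under this assumption the 1G case comes to roughly $4,500, the same order as the ~$4,000 quoted above.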
How do we scale beyond 1G nanopixels, at a reasonable cost? First, given the trends of NGS costs over the last few decades, perhaps we need not worry, as we will continue to see significant price drops per base. Alternatively, an independent approach would be to produce an expensive, main copy of a DNA Canvas, along with low-cost replicates. Ideally, each of the replicates would contain the same barcodes at the exact same locations, thus sparing the need for the costly sequencing step. In 2019, Krämer et al. [90] developed a copy-paste technique for DNA microarrays, using polydimethylsiloxane (PDMS) cavity chips that hybridize and extend off the surface of an existing DNA microarray. The PDMS chip can later be transformed to a blank slide to produce a replicate. This disclosure envisions scaling down the copy technique to the single-molecule level while minimizing the copy error rate.
Applications
DNA Canvas is a versatile platform. By exploring the design space and focusing on a low-cost fabrication scheme, we hope to enable a plethora of applications and use-cases across academia and industry. Nonetheless, on the roadmap to wide adoption lies the “killer app” milestone, (at least) one impactful application enabled by the novel DNA nanotechnology described here.
Whole Genome Resequencing
Now that reference genome sequences for many organisms (Homo sapiens included) are available, research of genomic variation and its biological consequence with regards to a reference genome, known as genome resequencing, is an active field within academia and industry. Commonly, DNA microarrays, such as the Affymetrix GeneChip® [91], are fabricated with a predetermined selection of probes, for example, single nucleotide polymorphisms (SNPs) in genes of interest in humans.
The whole human genome is 6.4 billion bases. We can fabricate DNA chips with billions of unique nanoscale features, allowing for whole human genome coverage, at a fraction of the cost of a GeneChip® (currently priced at ˜$450 per chip). However, there is a catch. DNA arrays with microscale features can be read using optical means, such as a fluorescence microscope. Nanoscale features lie beyond the diffraction limit and cannot be optically observed. For a successful whole genome resequencing application, the challenge of reliable readout must be solved. One approach would employ Atomic Force Microscopy (AFM), a method of nanoscale topology imaging. Alternatively, we could attempt to fabricate a DNA Canvas on a gel substrate, that could be uniformly expanded upon imaging, in a method known as Expansion Microscopy [92].
Complex Environments for Molecular Machines
Molecular machines that perform a specific task in response to a stimulus are a key component in all forms of life. Artificial molecular machines are human-engineered nanoscale objects designed to mimic or eventually surpass the roles of existing molecular machines in a controlled manner. The 2016 Nobel Prize in Chemistry was awarded for the design and synthesis of molecular machines that used chemical synthesis to bridge functional chemical groups together. Every task holds information regarding its execution. Traditional electro-mechanical machines, also known as robots, often store an internal representation of the goals, environment, and actions to be executed. When designing robots at a single-molecule level, our current ability to precisely control dynamic and mechanical properties is largely limited. To overcome this difficulty, one strategy is to create simple molecular machines that operate in complex environments [93]. Inspired by the behavior of social insects, such as ants and termites, studies have shown that surprisingly complex tasks can be performed by agents with limited capabilities [94].
DNA is a useful programmable material to build molecular machines. Over the last decade, DNA-based walking motors, also known as DNA walkers, have been designed to traverse multi-step tracks, often embedded on a DNA Origami grid [93, 95]. Furthermore, DNA walkers have been constructed to perform tasks such as cargo-sorting [96] or maze-solving [97]. DNA robots are not the only means by which DNA is used to solve complex tasks. Molecular computing is an active research field where pure DNA is designed to form digital logic gates [98] and even neural networks [99]. Recently, a spatial architecture in which DNA circuit elements are co-localized on a flat DNA Origami surface has shown promise as a new approach for fast and modular molecular circuit design [100].
DNA Canvas holds a few key advantages over DNA Origami as a molecular computing platform. The sheer number of pixels available on a DNA Canvas is beyond the current limits of DNA Origami, allowing more information to be embedded into the environment of the robot/circuit than ever before. Furthermore, DNA Origami is self-assembled in solution and can be difficult to localize. DNA Canvas exists in solid phase at a predetermined location surrounded by visible alignment markers, thus unlocking a range of experiments where an external stimulus or process interacts with the system.
DNA Digital Data Storage
Humanity is generating data at exponential rates, leading to daunting challenges for traditional information storage methods. According to current forecasts [101], global data storage demands by 2025 will exceed the maximal density of any currently available storage method. Traditional digital data storage usually works by changing the properties of materials: electrical properties in flash and phase-change memories, or magnetic properties in hard disk drives and tape. Despite the high impact these technologies have made over the last few decades, they are all approaching their density limit. Furthermore, technologies such as magnetic tape carry high maintenance costs. For example, a data center storing an Exabyte (10¹⁸ bytes) of data on tape will require as much as $1 billion and hundreds of millions of kilowatts of electricity to build and maintain for 10 years [102].
Molecular data storage, prominently DNA, has been proposed as an attractive alternative for archival storage, thanks to its extreme density, stable nature, and low energy cost. Storing information into DNA was first demonstrated by artist Joe Davis in 1988, by encoding a 5×7 pixel image to a 28 bases DNA strand [103]. Since then, hundreds of Megabytes [104] have been encoded into DNA, and recent advancements have been extensively reviewed in [101, 105, 106].
DNA Canvas could serve as an attractive platform for DNA information storage.
By rethinking nanopixels as bits, the cost per bit is shifted to depend on sequencing cost, which is both lower and dropping faster compared to DNA synthesis cost. Specifically, a nanopixel fabrication cost is linear in the cost of sequencing ($10⁻⁸ per base [7]) multiplied by a factor (2×oligo length×sequencing depth 100), meaning a few dollars per “Megabyte” of nanopixels, similar to the state of the art. However, following the path to low-cost replication disclosed herein, the cost per nanopixel is disentangled from sequencing, such that the price per chip drops down to the negligible cost of materials and enzymatic replication. Yet, a significant challenge remains. Writing bits to nanopixels by complementary strand hybridization requires nucleic acid synthesis, which brings the cost back to a few dollars per Megabyte. Rather, we could employ nanofabrication tools to write information as spatial patterns. For example, using an electron beam lithography tool that can write 10 nm features, we encode data to a pattern by disabling a plurality of nanopixels that form the encoded pattern. Then, the reference DNA Canvas can be amplified and stored along with a copy that contains the erased nanopixels. Sequencing the two copies enables information retrieval, at the cost of sequencing. Overall, this method implies the cost per written bit is the cost per nanopixel multiplied by the number of nanopixels per feature. For 10 nm features, that factor is just ~16, yet for 100 nm features it is ~1600. Furthermore, the cost of the machine, as well as the writing speed (Mb/sec), has to be taken into account, as tools that dynamically write 10 nm features are currently in the price range of >$100K.
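The factors quoted above are consistent with counting nanopixels inside a square feature, assuming a nanopixel pitch of 2.5 nm as in the 2.5 nm Streptavidin-Biotin model used for the simulations:

\[ \left(\frac{10\ \text{nm}}{2.5\ \text{nm}}\right)^{2} = 16, \qquad \left(\frac{100\ \text{nm}}{2.5\ \text{nm}}\right)^{2} = 1600 \]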
Proof of Concept
Inspired by the 65,536-pixel Starry Night of Gopinath et al. [24] and the 8,704-pixel Mona Lisa of Tikhomirov et al. [26] (
Barcode Length Versus Number of Binding Sites
A barcode is a string composed strictly of the letters A/C/G/T. A barcode of length k has 4^k unique possibilities. We define m as the number of available binding sites. Assuming that each binding site is assigned a barcode independently and uniformly at random (i.i.d.), the probability p_{m,k} of having a non-unique barcode is:
\[ p_{m,k} = 1 - \prod_{i=0}^{m-1}\left(1 - \frac{i}{4^{k}}\right) \]
Proof by counting.
Now we prove that for any choice of m>1 and a probability P, there exists k ≪ m such that p_{m,k} ≤ P.
Proof:
Let A>0. Choose k = A log_4 m, so that 4^k = m^A.
We can reorganize the products, and find an upper bound:
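As a numerical illustration, the sketch below computes the exact collision probability for i.i.d. uniform barcodes and compares it with the union bound m(m−1)/(2·4^k), which is one standard upper bound and not necessarily the one derived in the original proof. The function names are hypothetical.

```python
import math


def collision_probability(m, k):
    """Exact probability that m i.i.d. uniform barcodes of length k are not all distinct."""
    space = 4 ** k
    if m > space:
        return 1.0  # pigeonhole: more barcodes than possible sequences
    # Sum logs of (1 - i/4^k) to avoid underflow for large m.
    log_all_distinct = sum(math.log1p(-i / space) for i in range(m))
    return -math.expm1(log_all_distinct)


def union_bound(m, k):
    """Upper bound: expected number of colliding pairs, m(m-1)/2 divided by 4^k."""
    return m * (m - 1) / (2 * 4 ** k)


def min_barcode_length(m, p_max):
    """Smallest k for which the union bound is at most p_max."""
    k = 1
    while union_bound(m, k) > p_max:
        k += 1
    return k


print(collision_probability(3, 1), union_bound(3, 1))
print(min_barcode_length(1000, 0.01))
```

Because the bound falls by a factor of four with every added base, the required k grows only logarithmically in m.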
Biotin-Tagged DNA Immobilization on Streptavidin Coated Surfaces
Mix Biotin-tagged oligonucleotides (2-100 μM) with Micro Spotting Solution (2×, Arrayit Corp.) to get a 1-50 μM 1× oligo-spotting solution. Print 1 μL of the solution onto the surface of the Streptavidin coated slide, for a droplet of diameter ~1.5 mm. Take care that the tip of the pipette does not touch the surface of the slide at any time. This can be achieved by slowly lowering the droplet from the tip of the pipette until the droplet touches the surface and sticks to it thanks to surface tension. Incubate for 15-30 minutes at room temperature. To prevent droplet dehydration, the incubation is done on top of a wet tissue inside a clean petri dish sealed with film. Next, suspend SuperStreptavidin Microarray Blocking Buffer (1×, Arrayit Corp.) over the entire surface of the slide/chiplet. Incubate for 30-60 minutes at room temperature. Now wash three times in 1× phosphate buffered saline (PBS) by soaking the chiplet/slide in 1× PBS for five minutes per step, replacing the wash buffer between steps. Last, wash for 10 seconds in 0.1× PBS and blow-dry with Nitrogen. Store at 4 °C in a sealed dry chamber.
On-Chip Enzymatic Fluorescent Labeling
Immobilize G-oligos on one chiplet (Positive) and C-oligos on another chiplet (Negative) according to the “Biotin-Tagged DNA Immobilization on Streptavidin Coated Surfaces” protocol, supra. Next, in two PCR tubes, prepare on ice the following reaction for a total volume of 50 μl: 1× NEBuffer 2.1, 1 mM dATP, 1 mM dGTP, 1 mM dTTP, 5U DNA Polymerase I Large Klenow Fragment (New England Biolabs), 5 μM dCTP-Cy3 (Jena Biosciences), 10 μM Primer (synthesized by IDT) and distilled water (dH2O). Suspend each chiplet in a tube and move to room temperature for 15-60 minutes. Last, wash three times in 1× PBS and 10 seconds in 0.1× PBS, followed by nitrogen dry, as in the “Biotin-Tagged DNA Immobilization on Streptavidin Coated Surfaces” protocol, supra. Observe the chiplet under a fluorescence microscope with Ex/Em ~550/570 nm.
Nicking Enzyme Amplification Reaction
Prepare on ice the following reaction for a total volume of 50 μl: 1× NEBuffer 2.1, 0.5 mM dNTP, 5U Nb.BbvCI, 2.5U Large Klenow Fragment (New England Biolabs), 2 μM Primer, 0.1 μM Template Oligo (synthesized by IDT). Incubate at room temperature.
Quantitative Assay Using Molecular Beacons
To perform a quantitative fluorescent assay, simply add 0.2 μM Molecular Beacon (synthesized by IDT) to the above reaction. For positive control, add 1 μM of the target oligo (synthesized by Genewiz). For negative control, remove either the enzymes or the primer from the reaction.
In an embodiment of the present invention, a DNA nanoarray is disclosed, comprising: a first random oligonucleotide sequence of length N, the sequence comprising a linker molecule; a uniform surface comprising a binder that binds the linker molecule, whereby the sequence is randomly immobilized to a monolayer on the surface; and second through M random oligonucleotide sequences, the second through M sequences also each comprising a linker molecule; wherein N is large enough to guarantee within a statistical certainty that every sequence on the surface is unique, and whereby each sequence is bound to a different location on the surface; wherein the uniform surface is a microscale spot; and further comprising a physical chip larger than the uniform surface, wherein the physical chip holds the uniform surface and is milliscale; and whereby enzymatic processes may convert local information concerning each sequence and sequence location into a global mapping at high nanoscale precision, to result in a digital file that maps each sequence to a 2D coordinate. The linker may comprise biotin.
Referring to
Simulation Model
The following base assumptions were made. The core idea behind a DNA Canvas is iterative amplification of adjacent barcoded oligonucleotides, such that each barcode can be precisely mapped to a spatial location. The oligos are immobilized to the surface. For the purpose of in-silico simulations, we define the 2.5 nm Streptavidin-Biotin (S-B) model, shown in
Furthermore, we assume two populations of oligos, named 5′ and 3′. These correspond to the location of the Biotin tag in relation to the DNA strand. When performing a ligation step, we assume 5′ strands can strictly bind to 3′ strands and vice versa. Moreover, we assume enzymatic reactions between neighboring pairs have no bias: every oligo is free to ligate to any adjacent strand within reach.
Moreover, we expect a uniform distribution of oligos on the surface. That is, in any given area of a specific size we expect to see a similar number of oligos. There should be no clusters and no gaps. This assumption holds for both populations of oligos: there should not be 5′ clusters or 3′ clusters.
Evaluation
With the aim of evaluating various algorithms for this setting, we have constructed a computational evaluation framework. First, we build a grid. The grid is composed of 2.5 nm cells according to the 2.5 nm S-B model. Next, oligos are randomly generated by classic dart-throwing. Graph drawing algorithms are prone to inaccuracies around corners. Therefore, we choose the shape of DNA Canvas to be a circle (the same circular shape would later be fabricated). Each oligo is randomly assigned to be either 3′ or 5′. See
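A minimal sketch of the graph generation step follows. It samples grid cells without replacement inside a circle, which is distributionally equivalent to dart-throwing with rejection of occupied cells, and it lists all candidate 5′/3′ pairs within reach; which of those pairs are actually observed would depend on the ligation rounds and sequencing depth. The function name, the default seed, and the example radius are assumptions, and the reach distance of 21 nm is the value used in the evaluation.

```python
import math
import random

CELL_NM = 2.5    # grid cell size, from the 2.5 nm S-B model
REACH_NM = 21.0  # reach distance used in the evaluation


def simulate_canvas(radius_nm, compactness, seed=0):
    """Place oligos on a circular grid and list candidate 5'/3' ligation pairs."""
    rng = random.Random(seed)
    n = int(radius_nm // CELL_NM)
    # All grid cells whose centers fall inside the circular canvas.
    cells = [(i * CELL_NM, j * CELL_NM)
             for i in range(-n, n + 1) for j in range(-n, n + 1)
             if math.hypot(i * CELL_NM, j * CELL_NM) <= radius_nm]
    # Occupy a fraction of the cells, at most one oligo per cell.
    nodes = rng.sample(cells, round(compactness * len(cells)))
    # Each oligo is randomly assigned to the 5' or 3' population.
    ends = [rng.choice(("5'", "3'")) for _ in nodes]
    # Candidate edges: opposite populations within reach (brute force, O(n^2)).
    edges = [(a, b)
             for a in range(len(nodes)) for b in range(a + 1, len(nodes))
             if ends[a] != ends[b] and math.dist(nodes[a], nodes[b]) <= REACH_NM]
    return nodes, ends, edges


nodes, ends, edges = simulate_canvas(radius_nm=25.0, compactness=0.95)
print(len(nodes), len(edges))
```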
Once the random nodes and edges are generated, the graph is given as input to the evaluated algorithm. Each oligo/node has a known location, the ground truth, that is not revealed to the algorithm. The output of the algorithm is rotated using the Kabsch algorithm [43]. Next, the root-mean-squared-deviation (RMSD) between the ground truth and the aligned prediction is calculated. Numerous random graphs are generated to calculate a confidence interval. As a means to evaluate the accuracy of different methods under a controlled setting, the following parameters are fixed such that local adjacency information is not scarce: sequencing depth: 10×, reach distance: 21 nm (=50 bases long oligos), compactness: 95%.
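The alignment-then-RMSD step can be sketched as below. In two dimensions the optimal proper rotation has a closed form, which is used here in place of the SVD of the general Kabsch algorithm. The sketch removes translation and rotation only; it does not handle scale or mirror images, which a full pipeline may need to address. The function names are hypothetical.

```python
import math


def centered(points):
    """Translate a point set so that its centroid sits at the origin."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [(x - cx, y - cy) for x, y in points]


def aligned_rmsd(truth, predicted):
    """RMSD between two 2D point sets after optimal translation and rotation."""
    t, p = centered(truth), centered(predicted)
    # Rotating p by theta gives overlap cos(theta)*dot + sin(theta)*cross with t,
    # which is maximized at theta = atan2(cross, dot).
    dot = sum(px * tx + py * ty for (px, py), (tx, ty) in zip(p, t))
    cross = sum(px * ty - py * tx for (px, py), (tx, ty) in zip(p, t))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    squared = sum((c * px - s * py - tx) ** 2 + (s * px + c * py - ty) ** 2
                  for (px, py), (tx, ty) in zip(p, t))
    return math.sqrt(squared / len(t))


# A prediction that is the ground truth rotated by 90 degrees and shifted.
truth = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
predicted = [(5.0, 5.0), (5.0, 6.0), (3.0, 5.0)]
print(aligned_rmsd(truth, predicted))
```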
Graph Connectivity
Connectivity is an important property in graph theory. For DNA Canvas, when the graph is disconnected, the subsequent realization problem becomes fundamentally harder, since there is limited information on how to place the components with respect to each other. According to random graph theory, connectivity can be estimated by the average node degree ⟨d⟩, as first demonstrated in the seminal work by Erdős and Rényi [35]. In our scenario, we can approximate ⟨d⟩ in two ways. On one hand, we define N_LR as the number of ligation rounds. In each round, a single edge might be generated per node (up to a constant factor determined by the fidelity of the enzymes involved). Therefore, ⟨d⟩ ≤ N_LR. On the other hand, the sequencing depth parameter, denoted N_SD, defines how many reads are sequenced as part of the next-generation sequencing process. Specifically, the total number of reads is N_SD·|V|, where |V| is the number of nodes in the graph. Thus, the average node degree is bounded by
\[ \langle d \rangle \le \frac{N_{SD}\,|V|}{|V|} = N_{SD} \]
By combining these two bounds we get ⟨d⟩ = min(N_LR, N_SD).
Practically, performing more ligation rounds incurs no extra cost, except for the time of the lab technician (approx. one hour per round). Conversely, raising the sequencing depth leads to a linear rise in sequencing cost, which is the major portion of the total fabrication cost. Therefore, we can safely assume N_LR > N_SD, so that ⟨d⟩ = min(N_LR, N_SD) = N_SD. Henceforth, we will explore sequencing depth as the major “knob” influencing graph connectivity (and later, reconstruction accuracy), by strictly setting the number of ligation rounds higher than the sequencing depth.
Graph Drawing
Graph drawing is a class of algorithms, where the input is a graph and the output is a 2D embedding that is visually aesthetic. While there is no formal definition for graph aesthetics, it is generally agreed that an “aesthetic” graph is one where there is minimal edge crossing and vertices are uniformly distributed.
A prevalent class of graph drawing algorithms is force-directed graph drawing, where the graph is modeled as a physical system of bodies and forces that act between them, usually as a spring model. The algorithm proceeds to find a placement of the bodies by minimizing the energy of the system.
The method of Fruchterman and Reingold [37] models the graph as a system of springs, such that a spring applies an attractive force between every two neighboring nodes while, at the same time, a repulsive (“electrical”) force exists between all nodes. The method of Kamada and Kawai [38] assumes springs between every pair of nodes in the graph, where the length of a spring is associated with the graph distance between its endpoints. In force-directed methods, vertices are slowly moved by the forces acting on them, usually with a decreasing step size. Unfortunately, these methods suffer from high computational complexity for large graphs. Where |V| is the number of nodes and |E| is the number of edges in the graph, Fruchterman and Reingold requires O(|V|²) calculations per iteration, while the complexity of Kamada and Kawai is O(|V||E|). Walshaw [39] proposed a multilevel algorithm, where vertices are hierarchically grouped to form clusters, which define a coarser graph. Each graph is drawn, starting at the coarsest and ending at the original. While Walshaw's method is able to run on large (>250K nodes) graphs in a matter of minutes, it ignores long-range forces between the original vertices. Hu [40] offers an efficient, high-quality improvement on Walshaw's method: Scalable Force Directed Placement (SFDP). Briefly, SFDP applies a similar multilevel approach to overcome local minima, while utilizing a Barnes and Hut [41] octree data structure to efficiently approximate short and long-range forces. Moreover, SFDP includes an adaptive cooling scheme to further improve the quality of results over alternative force-directed methods.
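To make the spring model concrete, the following is a simplified sketch in the style of Fruchterman and Reingold, with all-pairs repulsion and attraction along edges. It is not SFDP: there is no multilevel coarsening, no Barnes-Hut approximation, and no adaptive cooling. The function name, the force constants, and the cooling factor are illustrative assumptions.

```python
import math
import random


def spring_layout(num_nodes, edges, iterations=200, seed=0):
    """Naive force-directed 2D layout; returns one [x, y] position per node."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(num_nodes)]
    k = 1.0 / math.sqrt(num_nodes)  # ideal edge length
    step = 0.1                      # maximum displacement per iteration
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(num_nodes)]
        # Repulsion between every pair of nodes: O(|V|^2) per iteration.
        for a in range(num_nodes):
            for b in range(a + 1, num_nodes):
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                force = k * k / d
                disp[a][0] += dx / d * force
                disp[a][1] += dy / d * force
                disp[b][0] -= dx / d * force
                disp[b][1] -= dy / d * force
        # Attraction along edges, pulling neighbors together.
        for a, b in edges:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            force = d * d / k
            disp[a][0] -= dx / d * force
            disp[a][1] -= dy / d * force
            disp[b][0] += dx / d * force
            disp[b][1] += dy / d * force
        # Move each node along its net force, capped by the current step size.
        for v in range(num_nodes):
            d = math.hypot(disp[v][0], disp[v][1]) or 1e-9
            move = min(d, step)
            pos[v][0] += disp[v][0] / d * move
            pos[v][1] += disp[v][1] / d * move
        step *= 0.98  # simple geometric cooling
    return pos


# Hypothetical input: a path graph on four nodes.
layout = spring_layout(4, [(0, 1), (1, 2), (2, 3)])
print(layout)
```

The nested repulsion loop is the quadratic cost noted above, which is what approximate schemes such as the one in SFDP are designed to avoid.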
Another approach to graph drawing is high-dimensional embedding [42]. First, m pivot nodes are chosen. Next, the shortest-path distance from each of the m pivot nodes to all nodes in the graph is calculated. Thus, every node now has an m-dimensional embedding. Then, principal component analysis (PCA) is applied to reduce the dimensionality (in our case, to a 2D embedding) while preserving as much variation as possible. The high-dimensional embedding approach offers very fast running times and a straightforward parallelized implementation.
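The three steps above (pivots, graph distances, PCA) can be sketched as follows. This is an illustrative sketch only: it assumes a connected, unweighted graph given as adjacency lists, picks pivots uniformly at random (the pivot selection rule is not specified above), and performs PCA by power iteration with deflation so that only the standard library is needed. All function names are hypothetical.

```python
import math
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop counts from `source` to every node of a connected, unweighted graph."""
    dist = [-1] * len(adj)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def leading_axis(rows, rng, iterations=200):
    """Leading principal axis of centered `rows`, found by power iteration."""
    dim = len(rows[0])
    axis = [rng.random() + 0.1 for _ in range(dim)]
    for _ in range(iterations):
        # Apply X^T X to the current axis without forming the covariance matrix.
        proj = [sum(r[d] * axis[d] for d in range(dim)) for r in rows]
        nxt = [sum(p * r[d] for p, r in zip(proj, rows)) for d in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt)) or 1.0
        axis = [x / norm for x in nxt]
    return axis


def high_dimensional_embedding(adj, num_pivots=4, seed=0):
    """Return one (x, y) pair per node."""
    rng = random.Random(seed)
    n = len(adj)
    pivots = rng.sample(range(n), num_pivots)  # assumed: random pivot choice

    # One column of graph distances per pivot gives each node m coordinates.
    columns = [bfs_distances(adj, p) for p in pivots]
    rows = [[float(col[i]) for col in columns] for i in range(n)]

    # Center each coordinate before PCA.
    means = [sum(r[d] for r in rows) / n for d in range(num_pivots)]
    rows = [[r[d] - means[d] for d in range(num_pivots)] for r in rows]

    coords = []
    for _ in range(2):  # two principal components give the 2D layout
        axis = leading_axis(rows, rng)
        proj = [sum(r[d] * axis[d] for d in range(num_pivots)) for r in rows]
        coords.append(proj)
        # Deflate: remove the component just found before the next pass.
        rows = [[r[d] - p * axis[d] for d in range(num_pivots)]
                for r, p in zip(rows, proj)]
    return list(zip(coords[0], coords[1]))
```

Each pivot's breadth-first search is independent of the others, which is why the distance step parallelizes readily.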
Henceforth, we apply SFDP to reconstruct global locations from pairwise measurements.
Sequencing Depth
The term sequencing depth is taken from Next-Generation Sequencing (NGS) technology. It is defined as the average number of times a particular nucleotide is represented in a collection of random raw sequences [44]. Sequencing depth acts as a means to control the inherent errors in the sequencing process. Consequently, the deeper the sequencing, the better the accuracy and completeness of the genomic analysis. However, deeper sequencing directly implies higher sequencing costs. A common choice of sequencing depth for NGS applications is 20×.
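As a small illustration of the NGS definition, the sketch below computes mean depth from a list of aligned reads. The representation of a read as a half-open (start, end) interval on the reference, and the function name, are assumptions made for the example.

```python
def mean_sequencing_depth(reads, reference_length):
    """Average number of reads covering each position of the reference.

    `reads` is a list of (start, end) half-open intervals (assumed format).
    """
    coverage = [0] * reference_length
    for start, end in reads:
        # Clip each read to the reference before counting it.
        for position in range(max(start, 0), min(end, reference_length)):
            coverage[position] += 1
    return sum(coverage) / reference_length
```

For example, three reads covering 10, 10, and 20 positions of a 20-base reference give a mean depth of 40/20 = 2×.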
For DNA Canvas, we borrow the term and define it as the average number of times a particular oligo is represented in the collection of random raw ligation events. As seen in
Compactness
One of our key assumptions is that Streptavidin molecules are tightly packed across the surface. In addition, we have so far assumed that Biotin-tagged oligos occupy all available binding sites (100% compactness). In practice, we can adjust the compactness of oligos by introducing a competitive ligand (e.g., Biotin without oligo). For the purposes of simulation, the compactness is adjusted between 1% and 99%.
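One simple way to model compactness in a simulation is to keep each binding site independently with probability equal to the compactness. The sketch below assumes exactly that model (independent, uniform occupancy); the actual simulation procedure is not specified here, and the function name is hypothetical.

```python
import random


def occupied_sites(sites, compactness, seed=0):
    """Return the subset of binding sites that carry an oligo.

    Each site is kept independently with probability `compactness`
    (assumed occupancy model); the rest are treated as blocked by the
    competitive ligand.
    """
    if not 0.0 <= compactness <= 1.0:
        raise ValueError("compactness must be between 0 and 1")
    rng = random.Random(seed)
    return [site for site in sites if rng.random() < compactness]
```

Sweeping `compactness` from 0.01 to 0.99 on a fixed grid of sites then yields the family of surfaces over which localization error can be compared.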
As expected, when the compactness is low (<25%), the accuracy error increases, since local adjacency information is sparse. Notably, after a certain point (>75% for reach distance of 21 nm, 25% for 42 nm), the compactness has no significant effect on the RMSD. Similar to sequencing depth, this insight has important implications both for hybridization efficiency and fabrication costs. Sufficient spacing between oligos has been known to increase hybridization and enzymatic efficiency, which are both important steps in fabricating as well as using a DNA Canvas device. Moreover, the denser the array of oligos, the more edges we need to sequence, thus entailing higher sequencing costs. Therefore, DNA Canvas applications can choose a compactness setting as a second trade-off between price and precision, along with sequencing depth. Additionally,
Also envisioned is a computational workflow that explicitly abides by the uniformity of DNA Canvas, along with other assumptions, for example by binning.
Reach Distance
Each edge in the graph originates from a ligation and amplification event of two oligos immobilized to the surface. Therefore, two oligos can ligate only if their ends can reach each other. Reach distance is defined as the maximal distance across which two immobilized oligos can ligate. Roughly, an oligo composed of k bases is k×0.34 nm long. Thus, the reach distance between two oligos of length k is:
Reach Distance (nm)=2×0.34×k=0.68k
We can adjust the reach distance by adding or removing bases in our sequence design. Alternatively, it is possible to insert special molecules called spacers (e.g., polyethylene glycol) to increase the oligo's length without adding new bases. As a lower bound, there must be enough bases such that two oligos can reach each other over the gap defined by the Streptavidin molecule structure: at least 6 bases. Furthermore, the oligo sequence must have enough bases to accommodate components that are integral to the iterative amplification process: a barcode, a priming site, a nicking site, and a restriction site. For barcode length k, the minimal oligo design is approximately 23+k bases, which corresponds to a reach distance of 15.64+0.68k nanometers. For an upper bound, while it is theoretically possible to synthesize oligos of various lengths, vendors that currently offer oligo synthesis services (IDT, Genewiz) can synthesize oligos up to 100 bases long (for oligos that include complex modifications, such as a 3′ Biotin tag). Therefore, we will examine various reach distances as a function of oligo length in the range of 6-100 bases, which corresponds to reach distances of 4-68 nm.
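The arithmetic above can be checked with a short sketch. The constants 0.34 nm per base and 23 fixed bases are taken from the text; the function and constant names are hypothetical, and spacers are not modeled.

```python
RISE_PER_BASE_NM = 0.34  # approximate length contributed by one base
FIXED_BASES = 23         # non-barcode bases in the minimal oligo design


def reach_distance_nm(oligo_length):
    """Reach distance for two immobilized oligos of the same length."""
    return 2 * RISE_PER_BASE_NM * oligo_length


def minimal_oligo_length(barcode_length):
    """Approximate minimal oligo length for a given barcode length."""
    return FIXED_BASES + barcode_length
```

For instance, `reach_distance_nm(6)` is about 4 nm and `reach_distance_nm(100)` is 68 nm, the two ends of the range examined.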
The fabrication process of a DNA chiplet, the milliscale device containing a microscale spot composed of DNA nanopixels, is shown (at multiple scales) in
We present a set of assays to validate the quality of the fabrication process, as well as to experimentally adjust some of the “knobs”. Moreover, we present challenges (and solutions) that arise at the intersection between traditional microfabrication and Streptavidin-coated substrates. Our contributions extend beyond a supporting device for DNA Canvas and may be of interest to anyone working on nano- or micro-scale patterning of surfaces coated with proteins and/or nucleotides.
Precision Dicing
The alignment markers have an additional benefit. The last step of the fabrication process is dicing the slide into chiplets, such that each spot lies approximately at the center of each chiplet. The alignment markers allow for visual alignment in a dicing saw. The result can be seen in
It should be noted that the dicing procedure could potentially introduce unwanted particles on the surface of our slide. Therefore, wash and dry steps were performed after dicing. The fluorescence-based immobilization assay was then applied to ensure the surface does not exhibit unwanted artifacts.
Fluorescence-Based Immobilization Assay
Uniform density is a key feature of the inventive product. Briefly, if there are islands or holes on the surface, the computational model would fail to localize nanopixels with high precision. Therefore, we use a fluorescence-based assay to test the quality and uniformity of coated slides.
The Fluorescence-based Immobilization Assay (Streptavidin-Biotin Linking) has two versions. In the basic version (
Ideally, a uniform fluorescence signal is present wherever the droplet touched the surface, usually in the form of a circle. The edge of the circle is a convenient place to observe the difference between the signal and the background noise.
Next, we applied the second assay version, a hybridization-based assay (
Using these assays, we have evaluated Streptavidin-coated surfaces from various vendors as well as in-house coated glass slides. Further, we have optimized our immobilization protocol under various conditions (“Biotin-tagged DNA Immobilization on Streptavidin Coated Surfaces” protocol, supra). As a result, we chose to use slides from a specific vendor (MicroSurfaces Inc.), as the fluorescence signal and uniformity exhibited on their slides were superior to those of other vendors as well as our own in-house attempts. The cost per slide is $50 (each slide is diced into 100 chiplets, thus $0.50 per chiplet), compared to ~$10 fabrication cost in-house.
A) Oligos tagged with Biotin (green triangle) and a fluorophore (yellow star) are directly immobilized to the surface. B) Two-step hybridization assay. First, Biotin-tagged oligos with a predetermined sequence are immobilized. Next, oligos with the complementary sequence, tagged with a fluorophore are hybridized to the surface.
Microfabrication
In order to attain reasonable fabrication costs, the active area of a DNA Canvas must be less than 10 μm in diameter. A custom microfabrication process using photolithography is disclosed herein. The starting point of the process is a 25×75 mm Streptavidin-coated microscope slide. The output is a set of 3×6 mm chiplets, each with a single microscale spot with active Streptavidin binding sites. Importantly, the rest of the chiplet is nonreactive to DNA immobilization.
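A rough count of how many chiplets fit on one slide follows from the dimensions above. The sketch assumes an ideal rectangular grid with no material lost to the dicing saw and no edge exclusion, neither of which is specified here; the function name is hypothetical.

```python
def chiplets_per_slide(slide_w_mm, slide_h_mm, chip_w_mm, chip_h_mm):
    """Largest grid of identical chiplets in either orientation.

    Assumes zero kerf loss and no unusable border (idealized).
    """
    upright = (slide_w_mm // chip_w_mm) * (slide_h_mm // chip_h_mm)
    rotated = (slide_w_mm // chip_h_mm) * (slide_h_mm // chip_w_mm)
    return max(upright, rotated)
```

With a 25×75 mm slide and 3×6 mm chiplets, one orientation gives 8×12 = 96 chiplets and the other 4×25 = 100, so the idealized maximum is 100 chiplets per slide.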
Photolithography is a process where a pattern is transferred to a photosensitive polymer (a photoresist) by exposure to a light source (e.g., ultraviolet [UV]) either through an optical mask (such as shown in
Preliminary Results Using Electron Beam Lithography
Experiments, in collaboration with Junichi Ogawa at the Massachusetts Institute of Technology (MIT) Media Lab, were performed with a microfabrication process using a negative photoresist (e.g., SU-8) and an Electron Beam Lithography (EBL) tool to pattern Fluorophore-tagged DNA at various micro-scales. The process involved spin-coating a Streptavidin-coated glass slide with SU-8 negative resist and patterning using EBL (in environmental mode) to direct-write microscale patterns. The resist was developed such that the unexposed areas were removed. Then, the fluorescence-based immobilization assay was applied. Biotin-tagged oligos with a fluorophore (Cy3) were suspended and immobilized. Streptavidin binding sites were available only in areas not exposed to the electron beam (as remaining binding sites were still covered by polymerized SU-8). Thus, a microscale crisp fluorescent pattern was achieved as shown in
The EBL+SU-8 approach was faced with a few significant challenges. First, the maximum exposure area in the EBL tool was too small to expose the entire chiplet at once. Thus, multiple exposures had to be manually stitched. Furthermore, the thick SU-8 film remaining on the chiplet raised compatibility issues with the downstream biological processes. Specifically, the resist can be “sticky”, especially around the edges of the pattern, such that DNA would unintentionally be immobilized to the resist. See
New Process Using Photolithography
For this reason, we developed, at the Baldo lab at MIT, a novel microfabrication process for patterning microscale spots on Streptavidin-coated surfaces that is optimized for compatibility with subsequent enzymatic reactions. As shown in
Importantly, at the end of the process, Streptavidin-coating remains strictly within the active microscale spot, and no resist is left on the surface. Furthermore, we introduce a compatible etching process to generate features that help to find the “invisible” spot under an optical microscope for downstream imaging (e.g., AFM).
To ensure the Streptavidin coating remains reactive to subsequent enzymatic reactions, exposure to ultraviolet (UV) light and/or plasma is minimized and heating steps are limited and shortened as proteins undergo denaturation under elevated temperatures over extended periods of time. Known solvents that are typically used in the art with inorganic substrates proved to be incompatible with Streptavidin coating.
Spots
Applicants have discovered that due to the comparatively high surface energy of glass, an adhesion promoter is required for resist application.
Typically, according to common wisdom in the art, the preferred stripping agent would be N-Methylpyrrolidone (NMP) or Dimethyl Sulfoxide (DMSO). However, applicants discovered that Streptavidin is incompatible with both by applying the Fluorescence-based immobilization assay and getting no fluorescent signal after using the process with either of those solvents. Therefore, we chose to use Acetone as it exhibited no compatibility issues. Generally, there are concerns with Acetone leaving residue. However, upon imaging, we did not observe any irregularities.
Markers
The spots are invisible under an optical microscope. To observe a DNA Canvas patterning/decoration, Atomic Force Microscopy (AFM), a technique that allows nanoscale topographical imaging, may be used. AFM requires spatial alignment under an optical microscope to the region of interest followed by a scanning step that takes several minutes per 1 μm². Thus, alignment markers visible under an optical microscope are needed to locate the spot. For that purpose, we have developed a second process to generate alignment markers around each spot. Here, we tried a resist-based approach, a lift-off approach, and a subtractive approach. The resist-based approach left resist on the chiplet, which raised issues similar to those of the SU-8 approach described above. The lift-off procedure allowed deposition of a silver film only 50 nm thick (as opposed to 1 μm of resist), but it introduced a metallic surface in proximity to the spots that gave rise to problems both in fluorescence imaging and in downstream enzymatic processes. Therefore, we chose the subtractive approach, which introduces no additional materials at the end of the procedure.
The inventive microfabrication process for patterning optically visible alignment markers, illustrated in
DNA Microscopy
DNA Microscopy [27, 28, 66, 67] is an emerging class of techniques that utilize DNA-based enzymatic reactions to perform optics-free imaging. Prior art microscopy generally relies on photons (e.g., in optical microscopy), electrons (e.g., in electron microscopy), or scanning probes (e.g., in atomic force microscopy) to decipher the spatial arrangement of a given sample. However, these techniques suffer from various disadvantages such as the diffraction-limited resolution in optical imaging, expensive instrumentation in electron beam imaging, and low throughput in atomic force imaging [68]. DNA microscopy relies on stochastic binding of proximal oligos followed by next-generation sequencing as a medium for molecular-scale imaging. DNA microscopy schemes, reviewed in and adapted from [27], are shown in
Scheme Overview
In each round, two oligos of opposite polarity are ligated and extension of one of the primers leads to a mobile copy containing two barcodes. Last, a restriction enzyme recognizes the now-full site and cleaves the oligos apart. The ligation and extension step described above repeats, such that in every round, a different pair of oligos may be ligated.
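The round structure described above can be sketched as a simple stochastic simulation that turns oligo positions into a weighted edge list. This is a simplified model under stated assumptions, not the disclosed chemistry: oligo polarity is ignored, every oligo with a free neighbor within the reach distance ligates in each round, each oligo takes part in at most one ligation per round, and extension and cleavage are treated as perfectly efficient. The function name is hypothetical.

```python
import math
import random


def simulate_ligation_rounds(positions, reach_nm, rounds=8, seed=0):
    """Return {(i, j): count} of ligation events between oligos i < j."""
    rng = random.Random(seed)
    n = len(positions)

    # Oligos can ligate only if their ends can reach each other.
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(positions[i], positions[j]) <= reach_nm]
        for i in range(n)
    ]

    counts = {}
    for _ in range(rounds):
        order = list(range(n))
        rng.shuffle(order)
        paired = set()  # oligos already ligated in this round
        for i in order:
            if i in paired:
                continue
            free = [j for j in neighbors[i] if j not in paired]
            if not free:
                continue
            j = rng.choice(free)
            paired.update((i, j))
            edge = (min(i, j), max(i, j))
            # The mobile copy carries both barcodes; record it as an edge.
            counts[edge] = counts.get(edge, 0) + 1
        # Cleavage frees all oligos, so a new pairing is drawn next round.
    return counts
```

Because pairs are redrawn every round, repeated rounds sample different neighbors of the same oligo, which is what populates the adjacency graph used for reconstruction.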
Boulgakov et al. [69] presented a simple approach to DNA microscopy, called Iterative Proximity Ligation (IPL), illustrated in
We opt for at least eight rounds of ligation and extension to make up for any inefficiencies introduced by the underlying enzymes. One shortcoming of IPL is the number of extensions per cycle. After two oligos are ligated, DNA polymerase and a primer oligo extend the pair to produce a dsDNA copy immobilized to the surface. In order to produce multiple copies per round, the new copy must be melted off, typically in a Polymerase Chain Reaction (PCR). This process requires precise temperature control over the surface of the chip. Additionally, the Streptavidin-Biotin bond is reversed at elevated temperatures over time [71], thus rendering PCR unsuitable for on-chip reactions where the oligos are assumed to be immobilized throughout the process. Here, we build upon IPL and present a novel scheme for DNA microscopy: Chip-based Iterative Proximity Ligation and Extension (ChIPLEx), which is optimized for scale and stability on-chip, using purely isothermal reactions.
Oligonucleotide Library Preparation
First,
It should be understood, of course, that the foregoing relates to exemplary embodiments of the invention and that modifications may be made without departing from the spirit and scope of the invention as set forth in the following claims.
This application claims the benefit of priority of U.S. provisional application No. 63/289,516, filed Dec. 14, 2021, the contents of which are herein incorporated by reference.
Number | Date | Country
---|---|---
63289516 | Dec 2021 | US