Advanced DNA sequencing technologies have brought great optimism to the future of public health. These technologies provide vital information to support research across disease diagnosis, prevention, and treatment. To sequence a complete human genome, which contains about 3 billion base pairs (bp), current sequencing technologies such as Next Generation or Third Generation DNA sequencing require the DNA sample to be chopped into short segments. Through massively parallel processing, human genome sequencing can be accomplished in weeks. However, the high cost of the facility and the long lead time of sequencing persist because of the short read length of each segment.
The world's first nanopore DNA sequencer, the MinION from ONT (Oxford Nanopore Technologies), is based upon the technology of blockage current. Theoretically, when DNA is translocating through the nanopore, ionic current through the nanopore is blocked by the presence of the DNA. The amplitude of the blockage current depends on the interaction between the DNA bases and the nanopore. However, existing nanopore configurations are relatively thick (a few nanometers) and measure the blockage current induced by multiple DNA bases at the same instant. Raw data performance assessments show that the ONT MinION initially achieved only 60-70% sequencing accuracy because of the thickness of the nanopores.
A method of fabricating nanochannel systems for DNA sequencing and nanoparticle characterization is disclosed in U.S. Pat. No. 9,718,668 (Steve Tung et al.). While the patented method made important strides in the field of DNA sequencing, it fails to address one critical challenge for the application of DNA sequencing: existing technologies (like the ONT MinION, for example) are not suitable for analyzing the tunneling current measured by nanoelectrodes with a width wider than a single DNA base (about 0.3 nm). Because of this issue, existing methods do not allow for the direct reading of the DNA sequence from the tunneling measurement based on its amplitude.
Furthermore, while ONT software has been developed further to increase sequencing accuracy, such software cannot be used to analyze data generated using devices such as the one described in U.S. Pat. No. 9,718,668 because (a) the measurement mechanism is fundamentally different; (b) the ONT software is based on deep learning algorithms, which cannot be adapted to current uses for the sequencing data training step; and (c) using nanoelectrodes to measure transverse current involves DNA orientation considerations that were not considered in the ONT algorithms. While the basic concept of data processing is common (i.e., to reveal the DNA sequence information based on its context), a novel DNA base-calling method for tunneling current analysis is necessary to address these challenges.
The present invention is directed to a method for DNA base-calling from a nanochannel DNA sequencer. Base-calling is a process that converts raw signals into readable DNA sequences. The process consists of two major tasks (building a reference map and preparing experimental data) prior to the final step of data matching. In the present invention, the reference map refers to a series of numbers built from a standard DNA sequence to describe the change of its corresponding tunneling current. Experimental data is prepared so that the change of the electrical measurement can be described numerically. A section of match between the prepared experimental data and the reference map is used for DNA base-calling. The present invention utilizes seven sequential steps to execute these two major tasks, with mathematical models developed to accomplish the goal of each of the sequential steps. The novel DNA translocation protocol of the present invention utilizes AFM (atomic force microscope) based nanomanipulation to select and pick a single DNA molecule from a substrate surface. By moving the AFM tip in an aqueous environment, the DNA is stretched into a linear form during the process of DNA tunneling current measurement. This process is essential for allowing the DNA sequence to be output as the final result.
These and other objects, features, and advantages of the present invention will become better understood from a consideration of the following detailed description of the preferred embodiments and appended claims in conjunction with the drawings as described following:
With reference to
Experimental Background and Method Development: Before describing the method of DNA base-calling embodied by the present invention, it should be noted that the improved DNA base-calling process of the present invention is based on experimental analysis that used quantum simulation to investigate the effect that various DNA base pairs have on tunneling current measurement. The experimental process used for deriving the method of the present invention is described here for background purposes. For best results, the quantum simulation was first performed using a single chain-like Pt nanoelectrode. For simulation purposes, DNA having a sequence of GCAT (top strand reading from right to left) was used as a model. This DNA model is shown in
A further simulation with the goal of analyzing tunneling current without single base resolution was performed. The same DNA model as the first simulation was used and the width of the nanoelectrodes was increased from single chain to triple chain as shown in
DNA Base-Calling Process: The process begins with a single-stranded DNA (ssDNA) sequence. At step 1, the ssDNA is converted into a double-stranded DNA (dsDNA) sequence based on the ssDNA sequence. A dedicated notation is used to describe the base pair information along the DNA strand. For example, step 1 may use the basic DNA base pairing principle (A to T and C to G) to complement the double stranded DNA (dsDNA). It may be seen then, that an ssDNA sequence having the sequence shown in the forward translocation direction
When a translocating DNA strand hits a nanoelectrode gap, each DNA base pair will interact with the nanoelectrodes in a particular orientation. When a dsDNA is translocating through a pair of patterned nanoelectrodes, polarization of the DNA base pair and the direction of tunneling current will vary depending on the base pair's particular orientation. Theoretically, during DNA translocation, the orientation of each base pair is determined by its position along the DNA double helix structure. dsDNA has a helix structure and each base pair twists at an angle of 36 degrees. Thus, a complete 360 degree turn is achieved every 10 DNA base pairs. The effect of this orientation change can be represented by a sine wave as shown in
For experimental purposes in developing the invention, and based on the fact that every 10 base pairs completes a 360 degree turn, experimental base pairs were described by one of three orientations (0°, 36°, and 72°). To simplify the analysis even further, orientations of 0°, 36°, and 72° were approximated to 0°, 45°, and 90° to accommodate the theory of using equivalent circuits. These approximated orientations are shown in
Using this process, the orientation of each of the DNA base pairs can be described numerically to capture its periodic property. It should be noted, of course, that in actual practice of the invention described herein the orientation of the first DNA base pair determines the orientation of all of the base pairs that follow. It may be noted, then, that the simplified experimental view of base pair orientations of 0°, 36°, and 72° (approximated to 0°, 45°, and 90°) may no longer apply. Instead, in practice, the orientation of each DNA base pair along the strand is determined using the orientation of the leading pair, and the orientation of the leading pair can fall anywhere in the range of 0° to 360°. If the leading pair is at 0°, for example, the second pair will be at 36°. If, however, the leading pair is at 1.5°, the second pair will be at 37.5°. At step 2, a matrix is built that contains the dsDNA sequence in the first row and the orientation of each corresponding base pair in the second row. Using the dsDNA sequence of base pairs and the orientation of each base pair, matrices can be established. For example, for a short piece of ssDNA with a sequence of GCGTA, a dsDNA sequence of SSSWW may be determined based on the basic base pairing principle (as described above). Assuming the orientation of the first base pair is one of 0°, 36°, and 72° (which are approximated to 0°, 45°, and 90°), three rows of a matrix or three individual matrices may be generated (An example is shown in
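Steps 1 and 2 can be illustrated with a minimal Python sketch. It assumes the S/W notation used in this description (S for a G-C pair, W for an A-T pair) and the in-practice rule that each base pair is rotated 36° from the previous one, starting from the leading pair. The function names and the wrap-around at 360° are hypothetical choices made for illustration and are not taken from the disclosure.

```python
# Illustrative sketch only; names are hypothetical, not part of the disclosure.

# S denotes a G-C base pair, W denotes an A-T base pair.
SW_NOTATION = {"G": "S", "C": "S", "A": "W", "T": "W"}

# Twist between neighbouring base pairs along the double helix.
TWIST_PER_BASE_PAIR_DEG = 36.0


def to_sw_notation(ssdna):
    """Step 1: describe each base pair of the dsDNA with the S/W notation."""
    return "".join(SW_NOTATION[base] for base in ssdna.upper())


def orientations(n_pairs, leading_deg=0.0):
    """Orientation of each base pair, given the orientation of the leading pair.

    Assumption: angles are wrapped into the range [0, 360).
    """
    return [(leading_deg + i * TWIST_PER_BASE_PAIR_DEG) % 360.0
            for i in range(n_pairs)]


def build_matrix(ssdna, leading_deg=0.0):
    """Step 2: first row holds the base pairs, second row their orientations."""
    pairs = to_sw_notation(ssdna)
    return [list(pairs), orientations(len(pairs), leading_deg)]


if __name__ == "__main__":
    for row in build_matrix("GCGTA", leading_deg=1.5):
        print(row)
```

Calling `build_matrix` once per candidate leading orientation would give one matrix per assumed orientation of the first base pair.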
At step 3, the DNA base pair information and its orientation are combined and an equivalent conductance for each base pair is generated. To calculate the equivalent conductance for each base pair, equivalent circuits are used to represent the orientation of each base pair relative to the nanoelectrodes, as shown in
Based on the conductance equations of the parallel and series circuit, the equivalent conductance of each base pair may be calculated given the base pair's orientation to the nanoelectrodes using the following equations:
The equations provided above (where Gp refers to the conductance of an equivalent parallel circuit and Gs refers to the conductance of an equivalent series circuit) are used to calculate the equivalent conductance of each base pair based on the relationships shown in
To better understand the relationship between the position of DNA base pairs and their orientation, consider the following example using a dsDNA with the sequence of GCGTAC, where the first base pair is assumed to be in the 90 degree orientation. As noted above, the notations S and W may be used to refer to particular base pairs (G-C and A-T, respectively). Subscripts may be used to indicate the position of the base pair along the double helix structure. When this DNA section is translocated through the nanoelectrodes, the double helix structure twists the orientation of each successive DNA base pair by 36° (rounded to 45°) as previously described, and as shown in
At step 4, the system conductance is produced using the conductance of each DNA base pair. The system conductance is defined as the conductance that should be theoretically detected by the nanoelectrodes. Due to the width of the nanoelectrode detection range, the system conductance may be calculated by combining multiple DNA base pairs simultaneously. The number of DNA base pairs that should be included in this calculation is determined by the ‘window’ size as described previously. The equivalent system conductance is determined by combining the conductance of each DNA base pair based on the physical properties of the experimental setup to simulate the measured tunneling current. The conductance arrays generated through step 3 consist of the conductance of each individual DNA base pair.
In practice, each instantaneous tunneling measurement is composed of the tunneling effect of multiple DNA base pairs due to the large width of the nanoelectrodes.
where, C is the background baseline shift, σx is the conductance of each DNA base pair, Tx is the transmission probability based on the location of the DNA base pair
where, h is the reduced Planck constant, v is the applied potential bias, and U is an evaluating number in the range from 0 to 1 for describing the alignment position. U=0 when the DNA base pair is in the middle of the nanoelectrodes, where the transmission probability Tx=1. After this step, the conductance of each DNA base pair, stored in arrays of σt(x), is converted to measurement conductance, stored in arrays of Δσt(x).
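The windowed combination of step 4 can be sketched in Python as follows. This is only one possible reading of the symbol definitions above: it assumes that each measurement conductance is the baseline shift C plus the sum, over the base pairs inside the detection window, of each pair's conductance weighted by its transmission probability. The transmission weights are supplied by the caller because the transmission formula itself is not reproduced here. The function name and the example values are hypothetical.

```python
# Illustrative sketch only. ASSUMPTION: measurement conductance is modeled as
#   baseline + sum(sigma_x * T_x) over the base pairs inside the window.
# This form is inferred from the symbol definitions, not quoted from the source.


def system_conductance(pair_conductance, transmission, baseline=0.0):
    """Slide a window over the per-base-pair conductance array.

    pair_conductance: conductance of each base pair along the strand.
    transmission:     one weight per window position (assumed to be 1 at the
                      middle of the nanoelectrodes and smaller toward the edges).
    baseline:         background baseline shift C.
    """
    window = len(transmission)
    if window == 0 or window > len(pair_conductance):
        raise ValueError("window must fit inside the conductance array")

    result = []
    for start in range(len(pair_conductance) - window + 1):
        segment = pair_conductance[start:start + window]
        weighted = sum(s * t for s, t in zip(segment, transmission))
        result.append(baseline + weighted)
    return result


if __name__ == "__main__":
    # Made-up numbers purely to show the call shape.
    print(system_conductance([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 0.5], baseline=1.0))
```

Each time the window advances by one position, the strand is treated as having moved forward by one base pair, which matches the step size used when the reference map is encoded.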
At step 5, dedicated numbers are used to describe the system conductance change numerically. After this step, the reference map is ready to be used. In this final step of reference map construction, the theoretically established measurement conductance arrays Δσt(x) are used. In order to find a match between experimental data and theoretical data, the change of amplitude rather than the absolute value of the amplitude must be used. To accommodate the computer processing requirement, the change of the theoretical data must be described numerically using the following equations:
where, the array Δσ(x) is the measured conductance based on a group of DNA base pairs appearing in the nanoelectrode detection range. After this process, in the reference map, the change of the system conductance due to the translocating DNA is expressed numerically without a physical vector. In this way, each time the DNA moves forward by one base pair distance, an increase, a decrease, or no change in the measurement conductance is represented by the number 4, 0, or 2, respectively. For a DNA translocating through the gap of sensing nanoelectrodes, a series of numbers is generated to describe the change of measured conductance due to this translocation event. Once this process is repeated on all conductance arrays, the reference maps are prepared and ready to be used. An example reference map is shown in
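The numerical coding of step 5 can be sketched in Python as follows. The codes 4, 0, and 2 for increase, decrease, and flat are taken from the description above. The `tolerance` parameter, which decides how small a change still counts as flat, is an assumption added for illustration, as is the function name.

```python
# Illustrative sketch only; the tolerance parameter is an assumption.

INCREASE, FLAT, DECREASE = 4, 2, 0


def encode_changes(conductance, tolerance=0.0):
    """Encode the change between consecutive conductance values.

    Returns one code per step: 4 for an increase, 0 for a decrease,
    and 2 when the change is within +/- tolerance (flat).
    """
    codes = []
    for previous, current in zip(conductance, conductance[1:]):
        difference = current - previous
        if difference > tolerance:
            codes.append(INCREASE)
        elif difference < -tolerance:
            codes.append(DECREASE)
        else:
            codes.append(FLAT)
    return codes


if __name__ == "__main__":
    # Made-up conductance values purely to show the call shape.
    print(encode_changes([1.0, 2.0, 2.0, 1.5]))
```

Because only the direction of change is kept, an array of n conductance values yields n - 1 codes.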
Step 6 is the process of preparing the experimental data so that the change in experimental electrical current is interpreted numerically in the same way as in the reference map. That is, experimental data processing follows the same principle by converting data to a series of numbers that represent the change of tunneling current. To do this, a section of experimental data where the DNA is believed to be stretched is selected for analysis. In one embodiment, it may be necessary to process the raw data to reduce the noise level. For example, noise level may be reduced using a 3rd-order Butterworth LPF with a cutoff frequency of 45 Hz or a Keithley 6485 with a sampling frequency of 1000 Hz. It is contemplated that various equipment or working conditions may be used as known in the art for reducing the noise level of the data, and that the particular frequency and other parameters should be modified according to the particular equipment used. In any event, the noise reduction used in this step is only for the purpose of finding the stretched DNA sections. Data before and after processing is shown, for example, in
After data processing, the DNA tunneling current is plotted. An example is shown in
In the seventh step, the processed experimental data is used to find a match on the reference map. The obtained result indicates the position of that matching, which is used to retrieve the sequence information from the standard DNA sequence database. Thus, the developed DNA base calling method identifies the sequenced DNA by conducting a match study between the experimental data and theoretical reference maps (as shown in
In order to carry out the method, it may be seen that the following must be known: (a) the target DNA, (b) the DNA translocation speed, and (c) the width of the nanoelectrodes. Instead of directly ‘reading’ the DNA sequence, which would require pushing past current fabrication limits to produce sensing nanoelectrodes narrower than 0.3 nm, the method of the present invention significantly reduces the cost of sequencing for applications where DNA identification is desired. It may be seen that the method of the present invention may be directly used by or embedded in a deep learning algorithm to work with sophisticated mathematical models for further analysis.
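The match study of the seventh step can be sketched in Python as a sliding comparison between the coded experimental data and a coded reference map. The sketch assumes both are sequences of the codes 4, 0, and 2, and it ranks candidate positions by the number of mismatching codes. That scoring rule, the function name, and the example values are assumptions made for illustration; the disclosure does not specify the matching criterion in this passage.

```python
# Illustrative sketch only. ASSUMPTION: the best match is the offset with the
# fewest mismatching codes (a Hamming-style comparison).


def best_match(reference_map, experimental_codes):
    """Return (offset, mismatches) of the best alignment on the reference map.

    If several offsets tie, the first one is returned.
    """
    n, m = len(reference_map), len(experimental_codes)
    if m == 0 or m > n:
        raise ValueError("experimental data must be non-empty and fit the map")

    best_offset, best_mismatches = 0, m + 1
    for offset in range(n - m + 1):
        window = reference_map[offset:offset + m]
        mismatches = sum(1 for r, e in zip(window, experimental_codes) if r != e)
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


if __name__ == "__main__":
    # Made-up code sequences purely to show the call shape.
    reference = [4, 4, 0, 2, 0, 4, 2, 0]
    print(best_match(reference, [0, 2, 0]))
```

The returned offset is the position on the reference map; how that position maps back to base pairs of the standard sequence depends on the window size used when the map was built.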
Using Experimental Data to Estimate the Accuracy of the DNA Base-Calling Process: As described below, additional experimental data was employed to describe how the accuracy of the DNA base-calling method is determined. Raw data for a long DNA sequence from a piece of λ-DNA is partially coded and plotted in
To determine the gene information for the sequenced section of the λ-DNA starting with the sequence [CCACGCGGGATGA], DNA mapping was carried out using the VISTA tool. The DNA sequence information of section A was organized into a text file in the format shown in
The rest of the data shown in
In order to successfully code sections E and G, a time interval of 1 ms has to be employed. For this particular group of experimental data, the 1 ms time interval reaches the limit of the DAQ system used for data collection, which has a maximum sampling frequency of only 1000 Hz, and this makes the raw data unsuitable for further analysis. Therefore, the sequence data in sections E and G will not be counted for determining the base-calling accuracy.
In summary, in this particular group of DNA sequencing data, a total length of 40 base pair DNA was successfully processed using the disclosed base-calling method with 4 errors, which suggests a 90.47% local accuracy using the equation:
where, ε is simply the accuracy, δ is the count of errors, and N is the total number of DNA base pairs embedded in the raw data. The DNA sections E and G were not counted as a successful processing result due to the limited sampling frequency. In section E, there were 21 base pairs within a duration of 21 ms. Similarly, in section G, there were 14 base pairs packed into a duration of 13 ms.
On a macro scale, the success rate of the DNA sequencing result was low, with a total of 36 base pairs correctly read out from a 75 bp DNA segment. This gives a global accuracy of roughly 48%. Though 48% is not a high figure, it still shows the potential of the method when compared with the 65% raw accuracy of the Oxford Nanopore MinION, which has been under development for a decade.
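The global figure above can be reproduced with a short Python sketch. It assumes that accuracy is the share of base pairs read out correctly, expressed as a percentage; this is an assumed form consistent with the symbol definitions, not a quotation of the equation used in this disclosure. The function name is hypothetical.

```python
# Illustrative sketch only. ASSUMPTION: accuracy = correct / total, in percent.


def accuracy_percent(total_base_pairs, errors):
    """Percentage of base pairs read without error."""
    if total_base_pairs <= 0:
        raise ValueError("total_base_pairs must be positive")
    if not 0 <= errors <= total_base_pairs:
        raise ValueError("errors must be between 0 and total_base_pairs")
    correct = total_base_pairs - errors
    return 100.0 * correct / total_base_pairs


if __name__ == "__main__":
    # Global case from the text: 36 of 75 base pairs read out correctly,
    # so 75 - 36 = 39 base pairs are counted as not correctly read.
    print(accuracy_percent(75, 75 - 36))
```

For the global case this evaluates to 48%, in line with the rough figure quoted above.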
The accuracy of the disclosed DNA base-calling method can be improved from two major perspectives. The most obvious way to achieve higher base-calling accuracy is to use an advanced DAQ system with a higher sampling frequency. In this particular example, the main limitation on the global accuracy is the limit of the DAQ system. The other improvement can be realized by reducing the dimensions of the sensing element to improve the signal to noise ratio. The data used as an example in this study was measured using 100 nm wide nanoelectrodes. The change of the conductance caused by the translocating DNA was described above using the equation of:
With a 1 nm width reduction of the nanoelectrodes, the Δσ is reduced by ~1% on average. The change of the signal to noise ratio can be described using the following equation:
where Δα is the improvement of the signal to noise ratio and Δσ′(x) is the overall conductance measured by nanoelectrodes with a reduced width. Based on the equations, reducing the width of the nanoelectrodes from the current 100 nm to 50 nm will double the signal to noise ratio. The connection between the improved signal to noise ratio and the overall DNA base-calling accuracy enhancement is still under investigation.
The present invention has been described with reference to certain preferred and alternative embodiments that are intended to be exemplary only and not limiting to the full scope of the present invention.
This application claims the benefit of U.S. Provisional Application No. 62/819,783, entitled “Method for DNA Base-Calling from a Nanochannel DNA Sequencer” and filed on Mar. 18, 2019. The complete disclosure of said provisional application is hereby incorporated by reference.
This invention was made with government support from grant no. 1128660 awarded by the National Science Foundation and grant no. 1R21HG010055-01 awarded by the National Institutes of Health. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2020/023283 | 3/18/2020 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62819783 | Mar 2019 | US |