The present disclosure relates generally to systems, devices, and methods for basecalling, and more specifically to systems, devices, and methods for basecalling using deep machine learning in Sanger sequencing analysis.
Sanger Sequencing with capillary electrophoresis (CE) genetic analyzers is the gold-standard DNA sequencing technology, which provides a high degree of accuracy, long-read capabilities, and the flexibility to support a diverse range of applications in many research areas. The accuracies of basecalls and quality values (QVs) for Sanger Sequencing on CE genetic analyzers are essential for successful sequencing projects. A legacy basecaller was developed to provide a complete and integrated basecalling solution to support sequencing platforms and applications. It was originally engineered to basecall long plasmid clones (pure bases) and then extended later to basecall mixed base data to support variant identification.
However, obvious mixed bases are occasionally called as pure bases even with high predicted QVs, and false positives in which pure bases are incorrectly called as mixed bases also occur relatively frequently due to sequencing artefacts such as dye blobs, n-1 peaks caused by polymerase slippage and primer impurities, mobility shifts, etc. Clearly, the basecalling and QV accuracy for mixed bases need to be improved to support sequencing applications for identifying variants such as Single Nucleotide Polymorphisms (SNPs) and heterozygous insertion deletion variants (het indels). The basecalling accuracy of legacy basecallers at the 5′ and 3′ ends is also relatively low due to mobility shifts and low resolution in those regions. The legacy basecaller also struggles to basecall amplicons shorter than 150 base pairs (bps) in length, particularly those shorter than 100 bps, because it fails to estimate the average peak spacing, average peak width, spacing curve, and/or width curve, sometimes resulting in an increased error rate.
Therefore, improved basecalling accuracy for mixed bases and at the 5′ and 3′ ends is highly desirable, so that basecalling algorithms can deliver higher-fidelity Sanger Sequencing data, improve variant identification, increase read length, and reduce sequencing costs for sequencing applications.
Denaturing capillary electrophoresis is well known to those of ordinary skill in the art. In overview, a nucleic acid sample is injected at the inlet end of the capillary, into a denaturing separation medium in the capillary, and an electric field is applied to the capillary ends. The different nucleic acid components in a sample, e.g., a polymerase chain reaction (PCR) mixture or other sample, migrate to the detector point with different velocities due to differences in their electrophoretic properties. Consequently, they reach the detector (usually an ultraviolet (UV) or fluorescence detector) at different times. Results present as a series of detected peaks, where each peak represents ideally one nucleic acid component or species of the sample. Peak area and/or peak height indicate the initial concentration of the component in the mixture.
The magnitude of any given peak, including an artifact peak, is most often determined optically on the basis of either UV absorption by nucleic acids, e.g., DNA, or by fluorescence emission from one or more labels associated with the nucleic acid. UV and fluorescence detectors applicable to nucleic acid CE detection are well known in the art.
CE capillaries themselves are frequently quartz, although other materials known to those of skill in the art can be used. There are a number of CE systems available commercially, having both single and multiple-capillary capabilities. The methods described herein are applicable to any device or system for denaturing CE of nucleic acid samples.
Because the charge-to-frictional drag ratio is the same for different sized polynucleotides in free solution, electrophoretic separation requires the presence of a sieving (i.e., separation) medium. Applicable CE separation matrices are compatible with the presence of denaturing agents necessary for denaturing nucleic acid CE, a common example of which is 8M urea.
Systems and methods are described for use in basecalling applications, for example in basecalling systems based on microfluidic separations (in which separation is performed through micro-channels etched into or onto glass, silicon or other substrate), or separation through capillary electrophoresis using single or multiple cylindrical capillary tubes.
To easily identify the discussion of any particular element or act, the most significant digit or digits in a reference number refer to the figure number in which that element is first introduced.
Terminology used herein should be accorded its ordinary meaning in the relevant arts unless otherwise indicated expressly or by context.
“Quality values” in this context refers to an estimate (or prediction) of the likelihood that a given basecall is in error. Typically, the quality value is scaled following the convention established by the Phred program: QV = −10 log10(Pe), where Pe stands for the estimated probability that the call is in error. Quality values are a measure of the certainty of the base-calling and consensus-calling algorithms. Higher values correspond to lower chance of algorithm error. Sample quality values refer to the per-base quality values for a sample, and consensus quality values are per-consensus quality values.
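For illustration only, the Phred-scale conversion described above can be sketched in a few lines of Python; the function names are illustrative and not part of the disclosure.

```python
import math

def phred_qv(p_error: float) -> float:
    """Convert an estimated basecall error probability to a Phred-scaled quality value."""
    return -10.0 * math.log10(p_error)

def error_probability(qv: float) -> float:
    """Invert the Phred scale: recover the error probability from a quality value."""
    return 10.0 ** (-qv / 10.0)

# Example: a 1-in-1000 chance of a miscall corresponds to QV 30,
# and QV 20 corresponds to a 1% chance of error.
assert round(phred_qv(0.001)) == 30
assert abs(error_probability(20.0) - 0.01) < 1e-12
```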
“Sigmoid function” in this context refers to a function of the form f(x)=1/(1+exp(−x)). The sigmoid function is used as an activation function in artificial neural networks. It has the property of mapping a wide range of input values to the range 0 to 1, or sometimes −1 to 1.
“Capillary electrophoresis genetic analyzer” in this context refers to an instrument that applies an electrical field to a capillary loaded with a sample so that the negatively charged DNA fragments move toward the positive electrode. The speed at which a DNA fragment moves through the medium is inversely proportional to its molecular weight. This process of electrophoresis can separate the extension products by size at a resolution of one base.
“Image signal” in this context refers to an intensity reading of fluorescence from one of the dyes used to identify bases during a data run. Signal strength numbers are shown in the Annotation view of the sample file.
“Exemplary commercial CE devices” in this context refers to devices including the Applied Biosystems, Inc. (ABI) genetic analyzer models 310 (single capillary), 3130 (4 capillary), 3130xL (16 capillary), 3500 (8 capillary), 3500xL (24 capillary), 3730 (48 capillary), and 3730xL (96 capillary), the Agilent 7100 device, Prince Technologies, Inc.'s PrinCE™ Capillary Electrophoresis System, Lumex, Inc.'s Capel-105™ CE system, and Beckman Coulter's P/ACE™ MDQ systems, among others.
“Base pair” in this context refers to complementary nucleotides in a DNA sequence. Thymine (T) is complementary to adenine (A), and guanine (G) is complementary to cytosine (C).
“ReLU” in this context refers to a rectifier function, an activation function defined as the positive part of its input. It is also known as a ramp function and is analogous to half-wave rectification in electrical signal theory. ReLU is a popular activation function in deep neural networks.
“Heterozygous insertion deletion variant” in this context refers to a variant in which one allele contains an insertion or deletion of one or more bases relative to the other allele; see also “single nucleotide polymorphism.”
“Mobility shift” in this context refers to electrophoretic mobility changes imposed by the presence of different fluorescent dye molecules associated with differently labeled reaction extension products.
“Variant” in this context refers to bases where the consensus sequence differs from the reference sequence that is provided.
“Polymerase slippage” in this context refers to a form of mutation that leads to either a trinucleotide or dinucleotide expansion or contraction during DNA replication. A slippage event normally occurs when a sequence of repetitive nucleotides (tandem repeats) is found at the site of replication. Tandem repeats are unstable regions of the genome where frequent insertions and deletions of nucleotides can take place.
“Amplicon” in this context refers to the product of a PCR reaction. Typically, an amplicon is a short piece of DNA.
“Basecall” in this context refers to assigning a nucleotide base to each peak (A, C, G, T, or N) of the fluorescence signal.
“Raw data” in this context refers to a multicolor graph displaying the fluorescence intensity (signal) collected for each of the four fluorescent dyes.
“Base spacing” in this context refers to the number of data points from one peak to the next. A negative spacing value or a spacing value shown in red indicates a problem with your samples, and/or the analysis parameters.
“Separation or sieving media” in this context refers to media including gels, as well as non-gel liquid polymers such as linear polyacrylamide, hydroxyalkyl cellulose (HEC), agarose, cellulose acetate, and the like. Other separation media that can be used for capillary electrophoresis include, but are not limited to, water-soluble polymers such as poly(N,N′-dimethyl acrylamide) (PDMA), polyethylene glycol (PEG), poly(vinylpyrrolidone) (PVP), polyethylene oxide, polysaccharides and pluronic polyols; various polyvinyl alcohol (PVAL)-related polymers; polyether-water mixtures; and lyotropic polymer liquid crystals, among others.
“Adam optimizer” in this context refers to an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on training data. Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates, and the learning rate does not change during training. With Adam, a learning rate is maintained for each network weight (parameter) and separately adapted as learning unfolds. Adam can be seen as combining the advantages of two other extensions of stochastic gradient descent: the Adaptive Gradient Algorithm (AdaGrad), which maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g., natural language and computer vision problems), and Root Mean Square Propagation (RMSProp), which also maintains per-parameter learning rates that are adapted based on the average of recent magnitudes of the gradients for the weight (i.e., how quickly it is changing), and which therefore does well on online and non-stationary (e.g., noisy) problems. Adam realizes the benefits of both AdaGrad and RMSProp: rather than adapting the parameter learning rates based only on the average of the second moments of the gradients (the uncentered variance) as in RMSProp, Adam also makes use of the average of the first moments (the mean). Specifically, the algorithm calculates an exponential moving average of the gradient and of the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages. Because the moving averages are initialized at zero and beta1 and beta2 are close to 1.0 (as recommended), the moment estimates are biased towards zero; this bias is overcome by first calculating the biased estimates and then calculating bias-corrected estimates.
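The update rule summarized above can be written compactly. The following numpy sketch shows a single Adam step; the hyperparameter values are the commonly recommended defaults, and the toy usage is illustrative only.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (first moment)
    and squared gradient (second moment), with bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad          # biased first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # biased second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return w, m, v

# Toy usage: minimize f(w) = w**2 starting from w = 5.
w, m, v = np.array(5.0), np.array(0.0), np.array(0.0)
for t in range(1, 1001):
    grad = 2.0 * w                                # gradient of w**2
    w, m, v = adam_step(w, grad, m, v, t)
print(float(w))                                   # approaches the minimum at 0
```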
“Hyperbolic tangent function” in this context refers to a function of the form tanh(x)=sinh(x)/cosh(x). The tanh function is a popular activation function in artificial neural networks. Like the sigmoid, the tanh function is also sigmoidal (“s”-shaped), but instead outputs values in the range (−1, 1). Thus, strongly negative inputs to the tanh will map to negative outputs. Additionally, only zero-valued inputs are mapped to near-zero outputs. These properties make the network less likely to get “stuck” during training.
“Relative fluorescence unit” in this context refers to a unit of measurement used in analyses that employ fluorescence detection, such as electrophoresis methods for DNA analysis.
“CTC loss function” in this context refers to connectionist temporal classification, a type of neural network output and associated scoring function, for training recurrent neural networks (RNNs) such as LSTM networks to tackle sequence problems where the timing is variable. A CTC network has a continuous output (e.g. Softmax), which is fitted through training to model the probability of a label. CTC does not attempt to learn boundaries and timings: Label sequences are considered equivalent if they differ only in alignment, ignoring blanks. Equivalent label sequences can occur in many ways—which makes scoring a non-trivial task. Fortunately, there is an efficient forward-backward algorithm for that. CTC scores can then be used with the back-propagation algorithm to update the neural network weights. Alternative approaches to a CTC-fitted neural network include a hidden Markov model (HMM).
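The disclosure does not prescribe a particular framework. The following sketch merely illustrates how a CTC loss is typically invoked, here using PyTorch's nn.CTCLoss as one readily available implementation; the tensor shapes, class indices, and placement of the blank label are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Minimal CTC setup: T time steps (scans), N samples in the batch, C = 5 classes
# (4 bases plus the mandatory CTC "blank" label, here at index 0).
T, N, C = 50, 1, 5
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)  # per-scan output
targets = torch.tensor([[1, 3, 2, 4, 1]])              # e.g. a 5-base label sequence
input_lengths = torch.full((N,), T, dtype=torch.long)  # scans per sample
target_lengths = torch.tensor([5])                     # labels per sample

ctc = nn.CTCLoss(blank=0)                              # forward-backward scoring over alignments
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                        # gradients for weight updates
```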
“Polymerase” in this context refers to an enzyme that catalyzes polymerization. DNA and RNA polymerases build single-stranded DNA or RNA (respectively) from free nucleotides, using another single-stranded DNA or RNA as the template.
“Sample data” in this context refers to the output of a single lane or capillary on a sequencing instrument. Sample data is entered into Sequencing Analysis, SeqScape, and other sequencing analysis software.
“Plasmid” in this context refers to a genetic structure in a cell that can replicate independently of the chromosomes, typically a small circular DNA strand in the cytoplasm of a bacterium or protozoan. Plasmids are much used in the laboratory manipulation of genes.
“Beam search” in this context refers to a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set. Beam search is an optimization of best-first search that reduces its memory requirements. Best-first search is a graph search which orders all partial solutions (states) according to some heuristic. But in beam search, only a predetermined number of best partial solutions are kept as candidates. It is thus a greedy algorithm. Beam search uses breadth-first search to build its search tree. At each level of the tree, it generates all successors of the states at the current level, sorting them in increasing order of heuristic cost. However, it only stores a predetermined number, β, of best states at each level (called the beam width). Only those states are expanded next. The greater the beam width, the fewer states are pruned. With an infinite beam width, no states are pruned and beam search is identical to breadth-first search. The beam width bounds the memory required to perform the search. Since a goal state could potentially be pruned, beam search sacrifices completeness (the guarantee that an algorithm will terminate with a solution, if one exists). Beam search is not optimal (that is, there is no guarantee that it will find the best solution). In general, beam search returns the first solution found. Beam search for machine translation is a different case: once reaching the configured maximum search depth (i.e. translation length), the algorithm will evaluate the solutions found during search at various depths and return the best one (the one with the highest probability). The beam width can either be fixed or variable. One approach that uses a variable beam width starts with the width at a minimum. If no solution is found, the beam is widened and the procedure is repeated.
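As a concrete illustration, the following Python sketch shows plain beam search over per-step label scores. It demonstrates only the prune-to-beam-width idea; a prefix beam search used for CTC decoding would additionally merge label sequences that differ only by blanks and repeated labels, which is omitted here.

```python
import math
from typing import List, Tuple

def beam_search(step_log_probs: List[List[float]], beam_width: int) -> Tuple[List[int], float]:
    """At each step, expand every kept partial sequence by every label,
    then keep only the `beam_width` highest-scoring candidates (the beam)."""
    beams = [([], 0.0)]                       # (label sequence, cumulative log-probability)
    for log_probs in step_log_probs:          # one score vector per step
        candidates = []
        for labels, score in beams:
            for label, lp in enumerate(log_probs):
                candidates.append((labels + [label], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]       # prune everything outside the beam
    return beams[0]

# Toy example: 3 steps over 4 labels, given as log-probabilities.
steps = [[math.log(p) for p in row] for row in
         [[0.7, 0.1, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1], [0.1, 0.1, 0.7, 0.1]]]
print(beam_search(steps, beam_width=3))       # best sequence [0, 1, 2] and its score
```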
“Sanger Sequencer” in this context refers to a DNA sequencing process that takes advantage of the ability of DNA polymerase to incorporate 2′,3′-dideoxynucleotides, nucleotide base analogs that lack the 3′-hydroxyl group essential in phosphodiester bond formation. Sanger dideoxy sequencing requires a DNA template, a sequencing primer, DNA polymerase, deoxynucleotides (dNTPs), dideoxynucleotides (ddNTPs), and reaction buffer. Four separate reactions are set up, each containing radioactively labeled nucleotides and either ddA, ddC, ddG, or ddT. The annealing, labeling, and termination steps are performed on separate heat blocks. DNA synthesis is performed at 37° C., the temperature at which DNA polymerase has the optimal enzyme activity. DNA polymerase adds a deoxynucleotide or the corresponding 2′,3′-dideoxynucleotide at each step of chain extension. Whether a deoxynucleotide or a dideoxynucleotide is added depends on the relative concentration of both molecules. When a deoxynucleotide (A, C, G, or T) is added to the 3′ end, chain extension can continue. However, when a dideoxynucleotide (ddA, ddC, ddG, or ddT) is added to the 3′ end, chain extension terminates. Sanger dideoxy sequencing results in the formation of extension products of various lengths terminated with dideoxynucleotides at the 3′ end.
“Single nucleotide polymorphism” in this context refers to a variation in a single base pair in a DNA sequence.
“Mixed base” in this context refers to one-base positions that contain 2, 3, or 4 bases. These bases are assigned the appropriate IUB code.
“Softmax function” in this context refers to a function of the form f(xi)=exp(xi)/sum(exp(x)), where the sum is taken over a set of x. Softmax is used at different layers (often at the output layer) of artificial neural networks to predict classifications for inputs to those layers. The softmax function calculates the probability distribution of the event xi over ‘n’ different events; in a general sense, it calculates the probability of each target class over all possible target classes, which is helpful for predicting which target class is represented in the inputs. The main advantage of softmax is its output range: each probability lies between 0 and 1, and the probabilities sum to one. When the softmax function is used in a multi-class classification model, it returns the probability of each class, and the target class has the highest probability. The formula computes the exponential (e-power) of the given input value and the sum of the exponential values of all the inputs; the ratio of the two is the output of the softmax function.
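For illustration, a numerically stable softmax is commonly computed by subtracting the maximum input before exponentiating, which does not change the result because the constant cancels in the ratio; the following numpy sketch is illustrative only.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a vector of scores."""
    e = np.exp(x - np.max(x))      # subtracting max(x) avoids overflow
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])   # e.g. per-class scores
probs = softmax(scores)
print(probs, probs.sum())                   # probabilities in (0, 1) summing to 1
```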
“Noise” in this context refers to average background fluorescent intensity for each dye.
“Backpropagation” in this context refers to an algorithm used in artificial neural networks to calculate a gradient that is needed in the calculation of the weights to be used in the network. It is commonly used to train deep neural networks, a term referring to neural networks with more than one hidden layer. For backpropagation, the loss function calculates the difference between the network output and its expected output, after a case propagates through the network.
“Deque max finder” in this context refers to an algorithm utilizing a double-ended queue (deque) to determine a maximum value.
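The disclosure does not detail this algorithm beyond its use of a double-ended queue. The following Python sketch shows the classic deque-based sliding-window maximum as one plausible form of such a max finder; the sliding-window formulation itself is an assumption.

```python
from collections import deque

def sliding_window_max(values, window):
    """Maximum of each length-`window` span in O(n) time using a double-ended queue.
    The deque stores indices whose values are kept in decreasing order, so the
    front of the deque is always the index of the current window's maximum."""
    dq, out = deque(), []
    for i, v in enumerate(values):
        while dq and values[dq[-1]] <= v:   # drop smaller values; they can never be a max
            dq.pop()
        dq.append(i)
        if dq[0] <= i - window:             # front index fell out of the window
            dq.popleft()
        if i >= window - 1:
            out.append(values[dq[0]])
    return out

print(sliding_window_max([1, 3, 2, 5, 4, 1, 2], 3))   # [3, 5, 5, 5, 4]
```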
“Gated Recurrent Unit (GRU)” in this context refers to a gating mechanism in recurrent neural networks. GRUs may exhibit better performance on smaller datasets than do LSTMs. They have fewer parameters than LSTMs, as they lack an output gate. See https://en.wikipedia.org/wiki/Gated_recurrent_unit
“Pure base” in this context refers to an assignment mode for a basecaller, in which the basecaller assigns an A, C, G, or T to each position rather than a variant such as a mixed base.
“Primer” in this context refers to a short single strand of DNA that serves as the priming site for DNA polymerase in a PCR reaction.
“Loss function” in this context refers to a function, also referred to as the cost function or error function (not to be confused with the Gauss error function), that maps values of one or more variables onto a real number intuitively representing some “cost” associated with those values.
A basic deep neural network 500 is based on a collection of connected units or nodes called artificial neurons which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it.
In common implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function (the activation function) of the sum of its inputs. The connections between artificial neurons are called ‘edges’ or axons. Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold (trigger threshold) such that the signal is only sent if the aggregate signal crosses that threshold. Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer 502), to the last layer (the output layer 506), possibly after traversing one or more intermediate layers, called hidden layers 504.
An input neuron has no predecessor but serves as input interface for the whole network. Similarly, an output neuron has no successor and thus serves as output interface of the whole network.
The network includes connections, each connection transferring the output of a neuron in one layer to the input of a neuron in a next layer. Each connection carries an input x and is assigned a weight w.
The activation function 602 often has the form of a sum of products of the weighted values of the inputs of the predecessor neurons.
The learning rule is a rule or an algorithm which modifies the parameters of the neural network, in order for a given input to the network to produce a favored output. This learning process typically involves modifying the weights and thresholds of the neurons and connections within the network.
All RNNs have the form of a chain of repeating nodes, each node being a neural network. In standard RNNs, this repeating node will have a simple structure, such as a single layer with a tanh activation function. This is shown in the upper diagram. An LSTM also has this chain-like design, but the repeating node A has a different structure than in regular RNNs. Instead of having a single neural network layer, there are typically four, and the layers interact in a particular way.
In an LSTM, each path carries an entire vector, from the output of one node to the inputs of others. The circled functions outside the dotted box represent pointwise operations, like vector addition, while the sigmoid and tanh boxes inside the dotted box are learned neural network layers. Lines merging denote concatenation, while a line forking denotes values being copied and the copies going to different locations.
An important feature of LSTMs is the cell state Ct, the horizontal line running through the top of the long short-term memory 900 (lower diagram). The cell state is like a conveyor belt. It runs across the entire chain, with only some minor linear interactions. It's entirely possible for signals to flow along it unchanged. The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through a cell. They are typically formed using a sigmoid neural net layer and a pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through”. An LSTM has three of these sigmoid gates, to protect and control the cell state.
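The gate equations described above may be sketched directly. The following numpy example runs a single LSTM cell over a short sequence of scans; the four-channel input is meant to evoke the four dye traces, and all shapes, sizes, and weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H + D) and b has shape (4*H,), packing the
    forget, input, candidate, and output transforms of the concatenated [h_prev, x]."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:H])            # forget gate: how much old cell state to keep
    i = sigmoid(z[H:2 * H])        # input gate: how much new candidate to write
    g = np.tanh(z[2 * H:3 * H])    # candidate values for the cell state
    o = sigmoid(z[3 * H:4 * H])    # output gate: how much cell state to expose
    c = f * c_prev + i * g         # the "conveyor belt" cell state update
    h = o * np.tanh(c)             # new hidden state
    return h, c

D, H = 4, 8                        # e.g. 4 dye channels per scan, 8 hidden units
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4 * H, H + D)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for scan in rng.normal(size=(10, D)):   # run 10 scans through the cell
    h, c = lstm_step(scan, h, c, W, b)
```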
The input segmenter 1002 receives an input trace sequence, a window size, and a stride length. The input trace sequence may be a sequence of dye relative fluoresce units (RFUs) collected from a capillary electrophoresis (CE) instrument or raw spectrum data collected in the CE instruments directly. The input trace sequence comprises a number of scans. The window size determines the number of scans per input to the scan label model 1004. The stride length determines the number of windows, or inputs, to the scan label model 1004. The input segmenter 1002 utilizes the input trace sequence, a window size, and a stride length to generate the input scan windows to be sent to the scan label model 1004.
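By way of illustration, the windowing performed by the input segmenter 1002 might be sketched as follows. The zero-padding of the final window and the specific sizes are assumptions; the disclosure states only that the window size sets the scans per input and the stride length determines the number of windows.

```python
import numpy as np

def segment_trace(trace: np.ndarray, window_size: int, stride: int) -> np.ndarray:
    """Cut a (num_scans, 4) trace of dye RFUs into overlapping scan windows.
    The last partial window is zero-padded so every window has the same length."""
    windows = []
    for start in range(0, len(trace), stride):
        chunk = trace[start:start + window_size]
        if len(chunk) < window_size:
            pad = np.zeros((window_size - len(chunk), trace.shape[1]))
            chunk = np.vstack([chunk, pad])
        windows.append(chunk)
        if start + window_size >= len(trace):
            break
    return np.stack(windows)

trace = np.random.rand(5000, 4)                                   # 5000 scans, 4 dye channels
print(segment_trace(trace, window_size=500, stride=250).shape)    # (19, 500, 4)
```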
The scan label model 1004 receives the input scan windows and generates scan label probabilities for all scan windows. The scan label model 1004 may comprise one or more trained models. One or more of the models may be selected and utilized to generate the scan label probabilities. The models may be BRNNs with one or more layers of LSTM or similar units, such as a GRU (Gated Recurrent Unit). The model may have a structure similar to that depicted in the accompanying figures.
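One plausible realization of such a scan label model is sketched below in PyTorch; the layer count, hidden size, five-label output (four pure bases plus a CTC blank), and choice of framework are illustrative assumptions rather than requirements of the disclosure.

```python
import torch
import torch.nn as nn

class ScanLabelModel(nn.Module):
    """Bidirectional LSTM over scan windows, emitting per-scan label log-probabilities."""
    def __init__(self, num_channels=4, hidden=128, num_layers=2, num_labels=5):
        super().__init__()
        self.rnn = nn.LSTM(num_channels, hidden, num_layers,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_labels)    # 2x for the two directions

    def forward(self, windows):                  # windows: (batch, scans, channels)
        features, _ = self.rnn(windows)
        return self.head(features).log_softmax(dim=-1)   # (batch, scans, labels)

model = ScanLabelModel()
out = model(torch.randn(8, 500, 4))              # 8 windows of 500 scans, 4 dyes
print(out.shape)                                 # torch.Size([8, 500, 5])
```

Outputs of this form are what the assembler 1006 would stitch together and the CTC-style decoder would consume.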
The assembler 1006 receives the scan label probabilities and assembles the label probabilities for all scan windows together to construct the label probabilities for the entire trace of the sequencing sample. The scan label probabilities for the assembled scan windows are then sent to the decoder 1008 and the quality value model 1010.
The decoder 1008 receives the scan label probabilities for the assembled scan windows. The decoder 1008 then decodes the scan label probabilities into basecalls for the input trace sequence. The decoder 1008 may utilize a prefix beam search or other decoders on the assembled label probabilities to find the basecalls for the sequencing sample. The basecalls for the input trace sequence and the assembled scan windows are then sent to the sequencer 1012.
The quality value model 1010 receives the scan label probabilities for the assembled scan windows. The quality value model 1010 then generates an estimated basecalling error probability. The estimated basecalling error probability may be translated to Phred-style quality scores by the following equation: QV = −10 × log10(Probability of Error). The quality value model 1010 may be a convolutional neural network. The quality value model 1010 may have several hidden layers followed by a logistic regression layer. A hypothesis function, such as the sigmoid function, may be utilized in the logistic regression layer to predict the estimated error probability based on the input scan probabilities. The quality value model 1010 may comprise one or more trained models that may be selected to be utilized. The selection may be based on minimum evaluation loss or error rate. The quality value model 1010 may be trained in accordance with the QV model training method 1200 described below.
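A quality value model of the kind described above might be sketched as follows; the layer sizes, the 11-scan context window around each basecall, and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QualityValueModel(nn.Module):
    """Small 1-D convolutional network over a window of scan label probabilities
    centered on a basecall, ending in a logistic (sigmoid) output that estimates
    the probability that the call is an error."""
    def __init__(self, num_labels=5, window=11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(num_labels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * window, 1), nn.Sigmoid())

    def forward(self, probs):                     # probs: (batch, num_labels, window)
        return self.classifier(self.features(probs)).squeeze(-1)

qv_model = QualityValueModel()
p_err = qv_model(torch.rand(32, 5, 11))           # 32 basecalls, 11-scan context each
qv = -10.0 * torch.log10(p_err)                   # Phred-style quality values
```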
The sequencer 1012 receives the basecalls for the input trace sequence, the assembled scan windows, and the estimated basecalling error probabilities. The sequencer 1012 then finds the scan positions for the basecalls based on the output label probabilities from CTC networks and basecalls from decoders. The sequencer 1012 may utilize a deque max finder algorithm. The sequencer 1012 thus generates the output basecall sequence and estimated error probability.
In some embodiments, data augmentation techniques such as adding noise, spikes, dye blobs or other data artefacts or simulated sequencing trace may be utilized. These techniques may improve the robustness of the basecaller system 1000. Generative Adversarial Nets (GANs) may be utilized to implement these techniques.
Referring to
In some embodiments, data augmentation techniques such as adding noise, spikes, dye blobs or other data artefacts or simulated sequencing trace by Generative Adversarial Nets (GANs) may be utilized to improve the robustness of the models. Also, during training, other techniques, such as drop-out or weight decay, may be used to improve the generality of the models.
Referring to
A mini-batch of training samples is then selected (block 1212). The mini-batch may be selected randomly from the training dataset at each training step. The weights of the networks are updated to minimize the logistic loss against the mini-batch of training samples (block 1214). An Adam optimizer or other gradient descent optimizer may be utilized to update the weights. The networks are then saved as a model (block 1216). In some embodiments, the model is saved during specific training steps. The QV model training method 1200 then determines whether a predetermined number of training steps has been reached (decision block 1218). If not, the QV model training method 1200 is re-performed from block 1206 utilizing the network with the updated weights (i.e., the next iteration of the network). Once the pre-determined number of training steps are performed, the saved models are evaluated (block 1220). The models may be evaluated by an independent subset of samples in the validation dataset, which are not included in the training process. The selected trained models may be those with minimum evaluation loss or error rate.
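The training loop described above (random mini-batch selection, logistic loss minimization with an Adam optimizer, and periodic saving of the model) might be sketched as follows; the stand-in network, synthetic data, batch size, learning rate, and checkpoint interval are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stand-in QV network: any model mapping a window of scan label probabilities
# to a single error probability suffices for illustrating the loop.
model = nn.Sequential(nn.Flatten(), nn.Linear(5 * 11, 32), nn.ReLU(),
                      nn.Linear(32, 1), nn.Sigmoid())
features = torch.rand(10000, 5, 11)               # illustrative training inputs
labels = (torch.rand(10000) < 0.05).float()       # 1.0 = miscalled base, 0.0 = correct

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()                            # logistic (binary cross-entropy) loss

for step in range(1, 5001):
    idx = torch.randint(0, len(features), (256,))     # randomly selected mini-batch
    optimizer.zero_grad()
    loss = loss_fn(model(features[idx]).squeeze(-1), labels[idx])
    loss.backward()
    optimizer.step()                                  # Adam weight update
    if step % 1000 == 0:
        torch.save(model.state_dict(), f"qv_step{step}.pt")   # periodic checkpoint
```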
The volatile memory 1310 and/or the nonvolatile memory 1314 may store computer-executable instructions, thus forming logic 1322 that, when applied to and executed by the processor(s) 1304, implements embodiments of the processes disclosed herein.
The input device(s) 1308 include devices and mechanisms for inputting information to the data processing system 1320. These may include a keyboard, a keypad, a touch screen incorporated into the monitor or graphical user interface 1302, audio input devices such as voice recognition systems, microphones, and other types of input devices. In various embodiments, the input device(s) 1308 may be embodied as a computer mouse, a trackball, a track pad, a joystick, wireless remote, drawing tablet, voice command system, eye tracking system, and the like. The input device(s) 1308 typically allow a user to select objects, icons, control areas, text and the like that appear on the monitor or graphical user interface 1302 via a command such as a click of a button or the like.
The output device(s) 1306 include devices and mechanisms for outputting information from the data processing system 1320. These may include the monitor or graphical user interface 1302, speakers, printers, infrared LEDs, and so on as well understood in the art.
The communication network interface 1312 provides an interface to communication networks (e.g., communication network 1316) and devices external to the data processing system 1320. The communication network interface 1312 may serve as an interface for receiving data from and transmitting data to other systems. Embodiments of the communication network interface 1312 may include an Ethernet interface, a modem (telephone, satellite, cable, ISDN), (asynchronous) digital subscriber line (DSL), FireWire, USB, a wireless communication interface such as Bluetooth or WiFi, a near field communication wireless interface, a cellular interface, and the like.
The communication network interface 1312 may be coupled to the communication network 1316 via an antenna, a cable, or the like. In some embodiments, the communication network interface 1312 may be physically integrated on a circuit board of the data processing system 1320, or in some cases may be implemented in software or firmware, such as “soft modems”, or the like.
The computing device 1300 may include logic that enables communications over a network using protocols such as HTTP, TCP/IP, RTP/RTSP, IPX, UDP and the like.
The volatile memory 1310 and the nonvolatile memory 1314 are examples of tangible media configured to store computer readable data and instructions forming logic to implement aspects of the processes described herein. Other types of tangible media include removable memory (e.g., pluggable USB memory devices, mobile device SIM cards), optical storage media such as CD-ROMS, DVDs, semiconductor memories such as flash memories, non-transitory read-only-memories (ROMS), battery-backed volatile memories, networked storage devices, and the like. The volatile memory 1310 and the nonvolatile memory 1314 may be configured to store the basic programming and data constructs that provide the functionality of the disclosed processes and other embodiments thereof that fall within the scope of the present invention.
Logic 1322 that implements embodiments of the present invention may be formed by the volatile memory 1310 and/or the nonvolatile memory 1314 storing computer readable instructions. Said instructions may be read from the volatile memory 1310 and/or nonvolatile memory 1314 and executed by the processor(s) 1304. The volatile memory 1310 and the nonvolatile memory 1314 may also provide a repository for storing data used by the logic 1322.
The volatile memory 1310 and the nonvolatile memory 1314 may include a number of memories including a main random access memory (RAM) for storage of instructions and data during program execution and a read only memory (ROM) in which read-only non-transitory instructions are stored. The volatile memory 1310 and the nonvolatile memory 1314 may include a file storage subsystem providing persistent (non-volatile) storage for program and data files. The volatile memory 1310 and the nonvolatile memory 1314 may include removable storage systems, such as removable flash memory.
The bus subsystem 1318 provides a mechanism for enabling the various components and subsystems of the data processing system 1320 to communicate with each other as intended. Although the bus subsystem 1318 is depicted schematically as a single bus, some embodiments of the bus subsystem 1318 may utilize multiple distinct busses.
It will be readily apparent to one of ordinary skill in the art that the computing device 1300 may be a device such as a smartphone, a desktop computer, a laptop computer, a rack-mounted computer system, a computer server, or a tablet computer device. As commonly known in the art, the computing device 1300 may be implemented as a collection of multiple networked computing devices. Further, the computing device 1300 will typically include operating system logic (not illustrated) the types and nature of which are well known in the art.
A new deep learning-based basecaller, Deep Basecaller, was developed to improve mixed basecalling accuracy and pure basecalling accuracy especially at 5′ and 3′ ends, and to increase read length for Sanger sequencing data in capillary electrophoresis instruments.
Bidirectional Recurrent Neural Networks (BRNNs) with Long Short-Term Memory (LSTM) units have been successfully engineered to basecall Sanger sequencing data by translating the input sequence of dye RFUs (relative fluorescence units) collected from CE instruments to the output sequence of basecalls. Large annotated Sanger Sequencing datasets, which include ˜49M basecalls for the pure base data set and ˜13.4M basecalls for the mixed base data set, were used to train and test the new deep learning-based basecaller.
Below is an exemplary workflow of algorithms used for Deep Basecaller:
Below are exemplary details about quality value (QV) algorithms for Deep Basecaller:
QV = −10 × log10(Probability of Error).
Deep Basecaller may use deep learning approaches described above to generate the scan probabilities, basecalls with their scan positions and quality values.
LSTM BRNN or similar networks such as GRU BRNN with sequence-to-sequence architecture such as the encoder-decoder model with or without attention mechanism may also be used for basecalling Sanger sequencing data.
Segmental recurrent neural networks (SRNNs) can be also used for Deep Basecaller. In this approach, bidirectional recurrent neural nets are used to compute the “segment embeddings” for the contiguous subsequences of the input trace or input trace segments, which can be used to define compatibility scores with the output basecalls. The compatibility scores are then integrated to output a joint probability distribution over segmentations of the input and basecalls of the segments.
The frequency data of overlapped scan segments similar to Mel-frequency cepstral coefficients (MFCCs) in speech recognition can be used as the input for Deep Basecaller. Simple convolutional neural networks or other simple networks can be used on the overlapped scan segments to learn local features, which are then used as the input for LSTM BRNN or similar networks to train Deep Basecaller.
When the scans and basecalls are aligned or the scan boundaries for basecalls are known for the training data set, loss functions other than CTC loss such as Softmax cross entropy loss functions can be used with LSTM BRNN or similar networks, and such networks can be trained to classify the scans into basecalls. Alternatively, convolutional neural networks such as R-CNN (Region-based Convolutional Neural Networks) can be trained to segment the scans and then basecall each scan segment.
Terms used herein should be accorded their ordinary meaning in the relevant arts, or the meaning indicated by their use in context, but if an express definition is provided, that meaning controls.
“Circuitry” in this context refers to electrical circuitry having at least one discrete electrical circuit, electrical circuitry having at least one integrated circuit, electrical circuitry having at least one application specific integrated circuit, circuitry forming a general purpose computing device configured by a computer program (e.g., a general purpose computer configured by a computer program which at least partially carries out processes or devices described herein, or a microprocessor configured by a computer program which at least partially carries out processes or devices described herein), circuitry forming a memory device (e.g., forms of random access memory), or circuitry forming a communications device (e.g., a modem, communications switch, or optical-electrical equipment).
“Firmware” in this context refers to software logic embodied as processor-executable instructions stored in read-only memories or media.
“Hardware” in this context refers to logic embodied as analog or digital circuitry.
“Logic” in this context refers to machine memory circuits, non transitory machine readable media, and/or circuitry which by way of its material and/or material-energy configuration comprises control and/or procedural signals, and/or settings and values (such as resistance, impedance, capacitance, inductance, current/voltage ratings, etc.), that may be applied to influence the operation of a device. Magnetic media, electronic circuits, electrical and optical memory (both volatile and nonvolatile), and firmware are examples of logic. Logic specifically excludes pure signals or software per se (however does not exclude machine memories comprising software and thereby forming configurations of matter).
“Software” in this context refers to logic implemented as processor-executable instructions in a machine memory (e.g. read/write volatile or nonvolatile memory or media).
Herein, references to “one embodiment” or “an embodiment” do not necessarily refer to the same embodiment, although they may. Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number respectively, unless expressly limited to a single one or multiple ones. Additionally, the words “herein,” “above,” “below” and words of similar import, when used in this application, refer to this application as a whole and not to any particular portions of this application. When the claims use the word “or” in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list and any combination of the items in the list, unless expressly limited to one or the other. Any terms not expressly defined herein have their conventional meaning as commonly understood by those having skill in the relevant art(s).
Various logic functional operations described herein may be implemented in logic that is referred to using a noun or noun phrase reflecting said operation or function. For example, an association operation may be carried out by an “associator” or “correlator”. Likewise, switching may be carried out by a “switch”, selection by a “selector”, and so on.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2019/065540 | 12/10/2019 | WO | 00

Number | Date | Country
---|---|---
62777429 | Dec 2018 | US