Sequencing methods

Description

BACKGROUND

The understanding of human genetics is rapidly expanding, fueled in part by developments in large-scale sequencing technologies. The results obtained from the sequencing of a genome, however, still present numerous production and bioinformatics challenges. The sheer volume of data obtained in a single sequencing experiment poses a remarkable data analysis challenge. The quality of the data, which often comprise millions to hundreds of millions of small sequence reads, poses yet another challenge. The development of streamlined, highly automated pipelines for data analysis is critical for transition from technology adoption to accelerated research and publication. Many obstacles remain in developing methods, algorithms, and computer program products for the analysis of sequencing data.

SUMMARY OF THE INVENTION

In some embodiments, the invention provides a method of analyzing sequencing data associated with a sample, the method comprising: a) receiving by a computer system a first sequencing read associated with a sequencing assay while the sequencing assay is in progress, wherein the computer system comprises a processor; and b) comparing by the processor the first sequencing read with another sequence to provide a comparison before the sequencing assay is complete.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is an overall schematic of a process wherein genomic data is streamed during sequencing and analyzed in real-time.

FIG. 2 is an overview of a process wherein genomic data can be shared within a global network.

FIG. 3 illustrates example alignments of sequence reads to a reference genome.

FIG. 4 is a visual representation of example modules of a variant phasing algorithm that separates polymorphisms onto separate haplotypes.

FIG. 5 is a graphic representation of error frequency based on a fragment length of input DNA.

FIG. 6 illustrates a process that distinguishes a polymorphism from a sequencing error.

FIG. 7 is a graphic representation of an algorithm and example modules of a computer-program product that can identify a translocation.

FIG. 8 is a visual representation of example modules of a computer program product that can identify regions of a genome that have gained or lost copies.

FIG. 9 is a visual representation of example modules of a computer program product that can provide a digital representation of a sequence of a sample based on an assembly of short sequencing-reads into long sequencing reads.

FIG. 10 is a visual representation of example modules of a computer program product that identifies sequencing reads originated from the same starting piece of DNA.

FIG. 11 is a diagram illustrating multiple possible approaches to an assembly of sequencing reads.

FIG. 12 illustrates a method of filtering continuous sequences with barcodes.

FIG. 13 is a graphical representation of a plot of alignment scores.

FIG. 14 is an overview of a method utilizing barcodes to improve alignments.

FIG. 15 is an illustration of a barcode-assisted assembly of RNA transcripts.

FIG. 16 is an illustration of a barcode-assisted assembly of RNA transcripts applied to resolving ambiguities in assembling sequencing-reads of distinct RNA isoforms.

FIG. 17 illustrates a representative visual display of a data visualization/interface.

FIG. 18 is a block diagram illustrating a first example architecture of a computer system that can be used in connection with example embodiments of the present invention.

FIG. 19 is a diagram illustrating a computer network that can be used in connection with example embodiments of the present invention.

FIG. 20 is a block diagram illustrating a second example architecture of a computer system that can be used in connection with example embodiments of the present invention.

FIG. 21 illustrates a process for creating records in which variants identified in data analysis can be co-recorded with the barcodes used to identify the variant.

DETAILED DESCRIPTION

The disclosure provides methods and computer program products with applications in medical diagnosis and in the study of disease, health, population, and evolutionary genetics. The invention provides streaming, real-time processing of haplotype phasing, structural variant identification, variant correction, phasing of structural variants, de novo assembly, and barcoded alignment of sequencing reads. The invention can determine to which haplotype in a pair of haplotypes a sequencing read corresponds. A variant phasing algorithm of the invention can distinguish between alleles of maternal and of paternal origin. The invention presented herein can, for example, utilize a barcode present in a sequencing read to separate identified polymorphisms onto separate haplotypes and reduce error rates.

Rapid advances in DNA and RNA sequencing technologies have generated a wealth of genomic data. Such data can provide a powerful new tool permitting the identification and annotation of the genome of an organism. Sequencing data can identify and characterize: a) genes responsible for giving an organism unique characteristics; b) genes conserved among species; c) genes that illustrate evolutionary changes among organisms; d) genes associated with health/disease; e) genes that confer disease resistance to an organism; f) genes associated with agricultural and nutritional properties; and/or g) genes associated with natural physiological processes, such as aging.

Despite the growing availability of sequencing data, the ability to process the information found within the data rapidly, effectively, and accurately remains a challenge. Sequencing diploid genomes presents the challenge of distinguishing nearly identical sequencing reads based on paternal or maternal origin. The ambiguity associated with distinguishing highly similar sequencing reads derived from distinct chromosomes introduces the issue of “phasing,” or determining within which of the two chromosomes a sequencing read variant is associated. Questions of phasing are more complex than questions of distinguishing non-homologous chromosomes.

Single gene and/or genomic duplications provide another obstacle in the analysis of genomic data. Multiple copies of a variant/mutation can be present in one or more chromosomes, and the identification of the correct copy number can improve the value of genomic sequencing. Small insertions and/or deletions (“INDELs”) can be found throughout a genome, the characterization of which is a challenge. The ability to characterize translocations and inversions, and corresponding breakpoints in a genome, can be valuable.

The invention provides an efficient solution to the greatest challenges in sequencing data analysis, and is capable of analyzing sequencing data that are presented in distinct sequencing formats. The methods herein can distinguish variants on separate homologous chromosomes using phasing information. Phased sequencing can capture unique chromosomal content, including mutations that differ across chromosome copies, such as distinct mutations in distinct haplotypes, and variations within the same chromosome copy, while preserving haplotype phasing information. The computer program products and methods provided herein also distinguish between sequencing errors and real variations/mutations within a sequencing read with a low error rate, and identify and characterize, for example, INDELs, translocations, inversions, and multiple copies of a gene or chromosome.

Barcodes.

The term “barcode,” as used herein, refers to a label, or identifier, that can be attached to an analyte to convey information about the analyte. Barcodes can have a variety of different formats, for example, barcodes can include: polynucleotide barcodes; random nucleic acid and/or amino acid sequences; and synthetic nucleic acid and/or amino acid sequences. A barcode can be attached to an analyte in a reversible or irreversible manner. A barcode can be added to, for example, a fragment of a DNA/RNA sample before, during, and/or after sequencing of the sample. Barcodes can allow for identification and/or quantification of individual sequencing-reads in real time.

Each partial-read, longer-partial, or full-read fragment generated during sequencing is associated with a barcode. In some embodiments, a sample can be associated with a plurality of distinct barcodes. As more data streams from a sequencer to one or a plurality of servers, the sequencing reads can be processed based on barcodes, thereby grouping partial-reads, longer-partial, and full-reads associated with a common barcode. Barcode analysis resolves ambiguity and provides haplotype phasing information.

Some embodiments, for example polynucleotide sequencing, can utilize unique barcodes to identify a sequencing read and, for example, to assemble a larger digital representation of a sample from shorter sequencing reads. Depending upon the specific application, barcodes can be added to a sample before or during a sequencing of the sample. For example, a sample can be fragmented and physically separated prior to sequencing. A barcode can be added to a fragmented sample, for example, by dividing the fragmented sample into partitions so that one or more barcodes can be introduced into a particular partition. Each partition can contain a different set of barcodes. The presence of the same barcode on multiple sequences can provide information about the origin of the sequence. For example, a barcode can indicate that the sequence came from a particular partition and/or a specific region of a genome. In some embodiments, a first sequencing read and another sequence share a common barcode. This feature can be particularly useful for an assembly of a larger sequencing read. Depending upon the specific application, barcodes can be attached to analyte fragments in a reversible or irreversible manner. In some embodiments, an algorithm and a computer program product can detect, process, distinguish, group, separate, filter out, and assemble in real-time a plurality of streaming sequencing reads associated with a plurality of barcodes.
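As a non-limiting illustration, the grouping of streamed partial-reads, longer-partial reads, and full-reads by a common barcode can be sketched in Python as follows. The class name `BarcodeGrouper`, the method `add`, and the example barcodes and fragments are hypothetical and are used only for illustration; the sketch is not the implementation of the invention.

```python
from collections import defaultdict


class BarcodeGrouper:
    """Incrementally groups streamed read fragments by barcode (illustrative only)."""

    def __init__(self):
        # Maps each barcode to the fragments received so far for that barcode.
        self.groups = defaultdict(list)

    def add(self, barcode, fragment):
        """Record one streamed fragment and return the current group for its barcode."""
        self.groups[barcode].append(fragment)
        return self.groups[barcode]


# Hypothetical stream: a partial read, a read with another barcode,
# then a longer-partial read sharing the first barcode.
grouper = BarcodeGrouper()
for barcode, fragment in [("BC1", "ACG"), ("BC2", "TTG"), ("BC1", "ACGTAC")]:
    grouper.add(barcode, fragment)
```

Because each fragment is filed as it arrives, the group for a given barcode is available for downstream processing at any time during the sequencing run.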

Streaming Sequencing Data.

The streaming sequencing algorithms and computer program products herein allow for the analysis of sequencing data to start even before a sequencing experiment is complete. The systems and methods of the invention can process partial information as the partial information is received. A sequencing read can comprise a barcode. In some embodiments, the invention described herein comprises a method of analyzing sequencing data associated with a sample, the method comprising: a) receiving by a computer system a first sequencing read associated with a sequencing assay while the sequencing assay is in progress, wherein the computer system comprises a processor; and b) comparing by the processor the first sequencing read with another sequence to provide a comparison before the sequencing assay is complete. The other sequence can be another sequencing read or a reference sequence, such as a partial or complete reference genome. The other sequence can be obtained, for example, from an experiment or a database.
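As a non-limiting illustration, steps a) and b) can be sketched in Python with generators, so that each comparison is produced as soon as the corresponding read is received rather than after the run ends. The functions `simulated_sequencer` and `stream_compare`, the example reads, and the example reference are hypothetical. Exact substring search with `str.find` is used here only as a stand-in for a real comparison or alignment step.

```python
def simulated_sequencer():
    """Hypothetical stand-in for a sequencer that emits reads while the assay runs."""
    for read in ("GATTA", "TTACA", "CCCCC"):
        yield read


def stream_compare(read_stream, reference):
    """Compare each read with the reference as soon as the read is received.

    Yields (read, position), where position is the 0-based offset of the first
    exact match in the reference, or -1 if the read does not match exactly.
    """
    for read in read_stream:
        yield read, reference.find(read)


reference = "AGATTACAGG"
comparisons = stream_compare(simulated_sequencer(), reference)

# The first comparison is available before the remaining reads are produced.
first = next(comparisons)
remaining = list(comparisons)
```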

Any of the alignments, comparisons, analysis, and processes described herein can occur in real-time while a sequencing assay is still running and inputting data into the system. Data acquired initially or later in the processes are integrated seamlessly as if all the data had been received together. However, the rate at which the streaming methods can process data is much faster than that of a system that cannot stream and process data in real-time.

FIG. 1 illustrates an overall schematic of a process wherein genomic data are streamed during sequencing and analyzed in real-time with methods and computer-program products of the invention. Sequencing reads are generated with a Sequencer 101. Data associated with the sequencing reads are streamed in real-time and uploaded on Servers 102, for example, Amazon EC2™ cloud computing servers. Real-time streaming can allow the transfer of data from partial sequencing reads into a server. As the data are streamed into servers, data analysis by one or more exemplary algorithms starts 103. A computer-program product of the invention permits the process of data analysis to commence even as the sequencing is ongoing. In some embodiments, data from a sequencer of partial, full, and/or complete sequencing reads is streamed from a sequencer to a server as the sequencer runs.

The data collected in the process of streaming data from a sequencer to a server as the sequencer is running can be analyzed by streaming sequencing algorithms of the invention. In some embodiments, a computer program product can process partial information as streamed in real-time. In some embodiments, a computer program product comprises specific modules to analyze sequencing read data comprising barcodes. A module of a computer program product can, for example, align 104 a plurality of partial and/or full sequencing reads to a reference genome. The computer program products and methods of the invention contemplate the analysis of a variety of file formats for data types.

A module of a computer program product of the invention can process a plurality of different file formats for distinct data types. Non-limiting examples of file formats and associated sources of data are listed in TABLE 2.

TABLE 2

File Format          Source of Data
SAM (.sam)           Sequence alignment data
BAM (.bam)           Sequence alignment data
VCF (.vcf)           Variant call format
SEG (.seg)           Segmented data
CBS (.cbs)           Segmented data
MUT (.mut)           Mutation data
LOH (.loh)           Loss of Heterozygosity data
GFF (.gff)           Genome annotation data
GFF3 (.gff3)         Genome annotation data
BED (.bed)           Genome annotation data
GCT (.gct)           Gene expression data
RES (.res)           Gene expression data
GCT (.rnai.gct)      RNAi data
IGV (.igv)           Numeric data
TAB (.tab)           Numeric data
WIG (.wig)           Numeric data
CN (.cn)             Copy number data
SNP (.snp)           Single nucleotide polymorphism data
TDF (.tdf)           ChIP-Seq, RNA-Seq data
GISTIC (.gistic)     Amplification/deletion data
FASTA                Reference sequence (nucleic acid or protein)

The invention can process barcodes and BAM files 105 to perform a method described herein. A BAM file (.bam) is the binary version of a SAM file. A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data. A module can process barcodes and BAM files 105, thereby associating a plurality of partial-read sequences with a genomic region.
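As a non-limiting illustration, one tab-delimited SAM alignment line can be parsed in Python to associate a read and a barcode with a genomic region. The function `parse_sam_line` and the example record are hypothetical. Carrying the barcode in an optional field named `BX` is an assumption made for this sketch; the disclosure does not specify how a barcode is stored in the file.

```python
def parse_sam_line(line, barcode_tag="BX"):
    """Parse one SAM alignment line into a dict (illustrative, not a full parser).

    The first eleven tab-separated fields are the mandatory SAM columns;
    any further fields are optional TAG:TYPE:VALUE entries.
    """
    fields = line.rstrip("\n").split("\t")
    record = {
        "qname": fields[0],      # read name
        "flag": int(fields[1]),  # bitwise flag
        "rname": fields[2],      # reference sequence name
        "pos": int(fields[3]),   # 1-based leftmost mapping position
        "mapq": int(fields[4]),  # mapping quality
        "cigar": fields[5],
        "seq": fields[9],
        "barcode": None,         # filled in if the assumed tag is present
    }
    for optional in fields[11:]:
        tag, _type, value = optional.split(":", 2)
        if tag == barcode_tag:
            record["barcode"] = value
    return record


# Hypothetical alignment record carrying an assumed barcode tag.
line = "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tBX:Z:BC1"
rec = parse_sam_line(line)
```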

One or a plurality of modules can be specifically programmed to provide, for example, an accurate solution to problems of haplotype phasing, Single Nucleotide Polymorphism (SNPs) identification, small insertion and deletion (INDELs) analysis, translocation analysis, inversion analysis, and/or copy number variant analysis (CNV) 106. The data generated in 106 can be processed, for example, in Variant Call Format (VCF) File(s) format 107. A module of a computer-program product of the invention processes data generated in 106 into a VCF File or another useful format 107. The VCF specified format of a text file can be processed by an algorithm comprised within a computer program product of the invention to store gene sequence variations of a sequencing read. Other formats, for example, the General Feature Format (GFF), can be processed by an algorithm and a computer program product of the invention to store all of the genetic data, including redundant sequences across the genome. In some embodiments, the selection of a file format that differentiates between redundant and non-redundant regions of a genome contributes to the efficiency of an analysis of sequencing reads.

A module of a computer program product can be programmed to produce files 108 for front-end analysis. A module 108 can, for example, produce and archive files corresponding to a summary and statistical analysis of the data associated with 101-107. A module 108 can provide downloadable files, visualization files, and database files. All data associated with 101-108 can be visually displayed. A computer program product module 109 can be programmed to output the data on a visual display, to run a summary of the data, and to upload, transmit, save, store, and/or copy the results. Data can be drawn, for example, from a genome, from an exome, or from a cancer panel 110.

FIG. 2 is an overall representation of the real-time transmission of a stream of data from partial-read sequences. 201 represents an illustrative sequencer generating sequencing reads from a sample. Data 202 from one or several sequencing cycles, for example, 203a, 203b, and 203c, are transmitted to a server 204. The server 204 can transmit data from 203a, 203b, and 203c to a network 205. The network 205 can be a wired network or a wireless network, such as the cloud. All data described in 102-109 and 202-205 can be transmitted from one location to a different location. In some embodiments, the data can be transmitted from one geographic boundary to a second geographic boundary 206. In some embodiments, the methods and computer program products of the invention can provide a tangible product, such as a report characterizing the sequencing data of a subject.

Alignment.

A sequence alignment can be a way of arranging nucleic acid or amino acid sequences to identify regions of similarity or overlap. An alignment of sequencing reads to a reference sequence can arrange the sequencing reads for construction of a candidate sequence. An alignment of a sequencing read to a reference sequence can be a global alignment or a local alignment. A global alignment can align a sequencing read to an entire length of a reference sequence. A local alignment can align a sequencing read or portions of a sequencing read to a window of a reference sequence. A local alignment can identify regions of similarity within sequences that can be widely divergent overall. An alignment can use an individual sequencing read or a candidate sequence derived from multiple sequencing reads.
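As a non-limiting illustration, the difference between a global alignment and a local alignment can be sketched in Python with a single dynamic-programming routine. The global mode follows the Needleman-Wunsch recurrence and the local mode follows the Smith-Waterman recurrence. These are standard textbook techniques chosen for this sketch and are not stated in the disclosure to be the alignment methods of the invention. The function name and the scoring values (match 2, mismatch -1, gap -2) are assumptions.

```python
def alignment_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Return the best alignment score of sequences a and b.

    local=False: global alignment score (Needleman-Wunsch recurrence).
    local=True:  best local alignment score (Smith-Waterman recurrence).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the ends of either sequence.
        for i in range(rows):
            score[i][0] = i * gap
        for j in range(cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


read = "ACGT"
window = "GGGGACGTGGGG"
local_score = alignment_score(read, window, local=True)
global_score = alignment_score(read, window, local=False)
```

In this example the short read matches a window of the longer sequence exactly, so the local score is high, while the global score is penalized for the unmatched flanking residues.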

FIG. 3 illustrates alignment of sequencing reads to a reference sequence. 301 illustrates an alignment of a full-read sequence to a reference genome. 302 illustrates reducing an ambiguity of an alignment (302a) of a partial-read sequence to a reference sequence as more data associated with longer-partial or full-read sequences with the same barcode are streamed and aligned to the reference genome (302b).

Variant Phasing and Non-Variant Phasing Algorithms.

A method herein can sort sequencing reads into separate haplotypes based on mutations, such as single nucleotide polymorphisms (SNPs), or lack thereof, present in the sequencing reads. A mutation can be identified based on a comparison among sequencing reads or against a reference sequence. FIG. 4 is a visual representation of illustrative modules that can be programmed within a variant phasing algorithm to separate sequencing reads into separate haplotypes based on identified mutations. 401 illustrates sequencing Read 1, which is associated with Barcode 1, and sequencing Read 2, which is associated with Barcode 2. 401 also illustrates that an input received can be associated with a quantity of digital data, for example, 10-200 gigabases of data. A first module 402 can identify single nucleotide polymorphisms by comparing sequencing reads with a reference sequence. The module 402 can scan a sequencing-read for previously-characterized polymorphisms and uncharacterized polymorphisms. Each dot in 402 represents a SNP on a sequencing read, represented by a bar.

A module of a computer program product 403 can group sequencing reads sharing common barcodes for alignment against a reference sequence. A grouping of identical or similar barcodes can be aligned against a common region of the reference sequence to identify sequencing reads that originate from physically proximate regions of a sample. 403 illustrates two groups of sequencing reads represented by bars. The reads in the left triplet share a common barcode, and the reads in the right triplet likewise share a common barcode. Module 404 can filter out misidentified barcodes, for example, barcodes that are similar but not identical to those that had been previously grouped and barcodes associated with sequencing reads that do not contain an initially-identified single nucleotide polymorphism. 404 illustrates a sequencing read in the left triplet that mismatches those of the group, and is filtered out. Filtering out sequencing reads that do not comprise an initially-identified single nucleotide polymorphism can filter out a sequencing error.

Module 405 can associate and assemble sequencing reads comprising identical polymorphisms at the same position into a partial digital representation of a candidate sequence. 405 can associate sequencing reads with chromosome numbers, chromosomal position of the polymorphism, name of the polymorphism, and a p-value for the association of the sequencing reads. 405 can additionally provide a graphic representation of the association of sequencing reads comprising identical polymorphisms, for example, within a Manhattan plot, a Q-Q plot, or any other suitable graphical representation. A module 406 can provide a local alignment of 405 to a reference sequence.

Module 407 can combine local alignments into larger candidate sequences, including complete candidate chromosomes, as illustrated as the long horizontal bars in 407. A statistical analysis can be used to determine which local alignments to combine. The analysis can involve determining the pattern of SNPs on each local alignment, and determining a statistical likelihood that different groups of SNPs exist on a common candidate sequence or chromosome. Non-limiting examples of sources of such statistical information include the number of barcodes shared across local alignments, the probabilities of sequencing errors on these barcodes, and the probabilities of sequencing errors on the reads used to infer SNPs. Module 408 can use similar statistical information or alignment data to suggest corrections to possible alignment errors made earlier in the process. 408 illustrates a correction being made by exchanging a sequencing read from one chromosome to the other based on the SNPs or lack thereof associated with those sequencing reads.

Module 409 can then identify that one of the candidate sequences corresponds to one haplotype of a diplotype, for example, the maternal haplotype, and the other candidate sequence corresponds to the other haplotype, such as the paternal haplotype. As illustrated in 409, all SNPs or lack thereof represented by dark dots are associated with one haplotype, and all SNPs or lack thereof represented by white dots are associated with the other haplotype. The system of FIG. 4 can distinguish one haplotype from the other based on the unique SNP patterns of the haplotypes, notwithstanding that the sequencing reads of one haplotype are otherwise identical to the sequencing reads of the other haplotype.

FIG. 5 is a graphic representation of improvements in the error rate of switching sequencing reads between haplotypes as DNA fragment length increases. Coverage is the average number of reads representing a given nucleotide in an assembled digital sequence. High coverage can reduce the error rate in assigning nucleotides and sequencing reads to haplotypes.
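As a non-limiting illustration, coverage as defined above can be computed in Python from the aligned positions of the reads. The function `mean_coverage`, the 0-based half-open interval convention, and the example intervals are assumptions made for this sketch.

```python
def mean_coverage(read_intervals, reference_length):
    """Average number of reads covering each position of the reference.

    read_intervals: iterable of (start, end) pairs, 0-based and half-open.
    """
    depth = [0] * reference_length
    for start, end in read_intervals:
        # Clip each read to the reference and count it at every covered position.
        for position in range(max(start, 0), min(end, reference_length)):
            depth[position] += 1
    return sum(depth) / reference_length


# Three hypothetical 5-residue reads aligned to a 10-residue reference.
coverage = mean_coverage([(0, 5), (3, 8), (5, 10)], reference_length=10)
```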

Polymorphisms and Sequencing Errors.

A method or a computer program product can distinguish single nucleotide polymorphisms from sequencing errors. FIG. 6 illustrates a process of variant correction utilizing haplotype phasing to distinguish a real mutation from sequencing errors. 601 represents a diploid sample wherein the haplotypes have already been distinguished by the process of FIG. 4. Haplotype 1 does not contain a mutation, but haplotype 2 does contain a SNP. Each haplotype is represented by a triplet of sequencing reads that overlap at the site of the SNP. 602 illustrates that as the sequencing reads are processed into candidate sequences, the SNP associated with haplotype 2 sequencing reads is unambiguously assigned to haplotype 2, and haplotype 1 does not show a SNP at that position.

The diploid sample of 603 does not possess a SNP on either haplotype, but does suffer from an error in sequencing. Each haplotype is represented by a triplet of sequencing reads, none of which shows a SNP. 604 illustrates the outcome obtained in the event of a sequencing error. The error appears among the sequencing reads at random, and is not unambiguously associated with one of the haplotypes. This random distribution of the erroneous nucleic acid is clearly distinguishable from the scenario of 602, wherein the SNP is not distributed randomly among the sequencing reads of both haplotypes. Thus, when a putative SNP is identified in a sequencing assay, haplotype phasing can verify the existence of a SNP instead of an error.
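As a non-limiting illustration, the distinction drawn in FIG. 6 can be sketched in Python: a non-reference base that is carried consistently by the reads of a haplotype is treated as a SNP, whereas a non-reference base that appears in only a minority of the reads of a haplotype is treated as a likely sequencing error. The function `classify_site`, the 0.8 consistency threshold, and the example observations are assumptions; the disclosure does not specify a threshold.

```python
def classify_site(observations, ref_base, min_fraction=0.8):
    """Classify one genomic position from haplotype-phased read observations.

    observations: list of (haplotype_id, base) pairs, one per overlapping read.
    Returns "no variant", "SNP", or "likely sequencing error".
    """
    counts = {}  # haplotype_id -> (total reads, reads with a non-reference base)
    for haplotype, base in observations:
        total, alt = counts.get(haplotype, (0, 0))
        counts[haplotype] = (total + 1, alt + (base != ref_base))
    fractions = {hap: alt / total for hap, (total, alt) in counts.items()}
    if not any(fractions.values()):
        return "no variant"
    # A real SNP is carried by (nearly) all reads of a haplotype or by none of them.
    if all(f == 0 or f >= min_fraction for f in fractions.values()):
        return "SNP"
    return "likely sequencing error"


# Scenario of 601/602: every read of haplotype 2 carries the variant base.
snp_case = classify_site(
    [(1, "A"), (1, "A"), (1, "A"), (2, "G"), (2, "G"), (2, "G")], ref_base="A"
)
# Scenario of 603/604: a single stray non-reference base among the reads.
error_case = classify_site(
    [(1, "A"), (1, "A"), (1, "G"), (2, "A"), (2, "A"), (2, "A")], ref_base="A"
)
```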

Translocations and Inversions.

Translocations and inversions can lead to variability in the DNA, RNA, and protein constitution of a subject. Translocations have been identified as causes of cancers, infertility, and Down's syndrome. Translocations can occur naturally as reciprocal translocations between nonhomologous chromosomes. The invention provides computer program products and methods that identify previously known or unknown translocations and inversions within a set of sequencing reads of a sample.

In an analysis of a wild type genome, an individual barcode is ordinarily associated most prominently with one local region of the genome. Sequencing reads derived from physically-distant regions of a genome are less likely to share a common barcode. When barcoded sequencing reads are aligned to a reference genome, if remote portions of the reference genome are represented by sequencing reads that share a common barcode, then the possibility of a mutation, such as a translocation, can be investigated. A translocation in a sample would result in distant portions of a reference genome being proximate in a sample.

FIG. 7 illustrates an algorithm and modules of a computer program product that can determine a translocation, thereby identifying distant regions of a genome that had been attached prior to translocation. 701 illustrates sequencing Read 1, which is associated with Barcode 1, and sequencing Read 2, which is associated with Barcode 2. 701 also illustrates the quantity of digital data associated with the sequencing reads, for example, 10-200 gigabases of data, which can be analyzed in real-time.

Module 702 aligns sequencing reads to a reference sequence, such as a reference chromosome or a reference genome, which can comprise multiple chromosomes. Each sequencing read independently comprises a barcode.

Module 703 represents barcodes distributed across three reference chromosomes (chr1, chr2, and chr3) in a reference genome. Each column of each matrix represents a region of each chromosome, and each row represents a specific barcode. The value at a point in a matrix indicates the number of times that the barcode was identified in the corresponding region of the chromosome. If regions have high numbers of barcodes in common, then those regions have a higher probability of being located close together in the sample, even if the regions appear far apart in the reference genome.

Module 704 processes matrices to identify candidate breakpoints. Matrices are multiplied, and the outcome can be scored to suggest a likelihood that a breakpoint exists on a certain reference chromosome in the reference genome. Module 705 then analyzes candidate breakpoints by p-value to determine which candidate breakpoints are most likely to be real.
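As a non-limiting illustration, the matrix processing of modules 703 and 704 can be sketched in Python. Each matrix has one row per barcode and one column per chromosomal region, as described above. The disclosure states only that matrices are multiplied; multiplying the transpose of one barcode-by-region matrix by another, which yields a region-by-region score of shared barcodes, is an assumption made for this sketch, as are the function names and the example counts.

```python
def shared_barcode_scores(matrix_a, matrix_b):
    """Multiply the transpose of matrix_a by matrix_b.

    matrix_a: barcodes x regions of one chromosome (barcode counts).
    matrix_b: barcodes x regions of another chromosome, same barcode order.
    Returns a regions_a x regions_b matrix; a large entry means the two
    regions have many barcode observations in common.
    """
    regions_a, regions_b = len(matrix_a[0]), len(matrix_b[0])
    return [
        [
            sum(row_a[i] * row_b[j] for row_a, row_b in zip(matrix_a, matrix_b))
            for j in range(regions_b)
        ]
        for i in range(regions_a)
    ]


def top_candidate(scores):
    """Return the (region_a, region_b) pair with the highest score."""
    return max(
        ((i, j) for i in range(len(scores)) for j in range(len(scores[0]))),
        key=lambda pair: scores[pair[0]][pair[1]],
    )


# Hypothetical counts for three barcodes (rows).
chr1 = [[5, 0, 0], [0, 4, 0], [0, 0, 6]]  # three regions of chr1
chr2 = [[1, 0], [0, 1], [3, 0]]           # two regions of chr2
scores = shared_barcode_scores(chr1, chr2)
candidate = top_candidate(scores)
```

In this example the third barcode is observed frequently in region 2 of chr1 and in region 0 of chr2, so that pair of regions receives the highest score and would be passed on as a candidate breakpoint for statistical filtering.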

Module 706 can identify the precise breakpoint in the reference genome leading to the translocation. Module 706 aligns sequencing reads (dots) sharing a common barcode against a reference chromosome (horizontal bar) until a point is reached (vertical hash), beyond which sequencing reads possessing the common barcode no longer align with the reference chromosome. That point is the breakpoint of translocation. The breakpoint can be, for example, about 1, about 2, about 3, about 4, about 5, about 6, about 7, about 8, about 9, about 10, about 11, about 12, about 13, about 14, about 15, about 16, about 17, about 18, about 19, about 20, about 21, about 22, about 23, about 24, about 25, about 26, about 27, about 28, about 29, about 30, about 31, about 32, about 33, about 34, about 35, about 36, about 37, about 38, about 39, about 40, about 41, about 42, about 43, about 44, about 45, about 46, about 47, about 48, about 49, or about 50 residues from a sequencing read that does align with the reference chromosome.

Module 707 can determine which haplotype in a pair of haplotypes is more likely to possess the translocation. Once the sequencing reads have been sorted by haplotype using the methods described herein, module 707 can determine which haplotype is more prominently associated with the translocation by determining which sequencing reads are associated with both the translocation and a particular haplotype.

Identifying Gained or Lost Copies.

The methods disclosed herein can identify gained or lost copies of a residue, sequence, or fragment, for example, a nucleotide residue or an amino acid residue. FIG. 8 illustrates modules of computer program products that can identify a copy number variant within a sample. 801 illustrates example sequencing Read 1, which is associated with Barcode 1, and example sequencing Read 2, which is associated with Barcode 2. 801 also illustrates that an input received from a first sequencing-read and a second sequencing-read can be associated with a quantity of digital data, for example, 10-200 gigabases of data, which can be analyzed in real-time. 802 and 803 illustrate alternative approaches for identifying copy number variants within a sample.

A system 802 can receive a first sequencing read and a second sequencing read. The first sequencing read and the second sequencing read can each be independently associated with barcodes, which can be identical or distinct. Module 802a can receive a plurality of sequencing reads, which can total in the hundreds, thousands, tens of thousands, or greater. Module 802b can align sequencing reads to a reference sequence and can further identify neighboring sequencing reads in the alignment, and determine whether the neighboring sequencing reads overlap at a non-barcode subsequence.

Module 802c can associate a set of sequencing reads comprising a common barcode to a region (a “window”) of a reference sequence. The system can receive an additional sequencing read that does not comprise the common barcode, which can be filtered out by module 802c. Module 802d can count a number of distinct barcodes that have been associated with a region of a reference sequence. Module 802e can then identify candidate regions for a copy number variant based on the number of counted barcodes in a local region compared to global counts of the total number of barcodes in the sample or in a region of the sample. If the number of barcode counts associated with a particular window is significantly greater than expected from a statistical distribution, then the window can be designated as a candidate for a copy number mutation. The number of copies can be estimated based on the numbers of barcodes counted locally versus the numbers counted globally. Module 802f can filter the identified regions based, for example, on a p-value to determine which candidate windows are most likely to contain a real mutation. Module 802g can determine to which haplotype in a pair of haplotypes the copy number variant corresponds using a haplotype phasing algorithm provided herein.
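As a non-limiting illustration, the comparison of local barcode counts with global barcode counts performed by modules 802d and 802e can be sketched in Python. The function `candidate_cnv_windows`, the use of the median window count as the global baseline, the fixed ratio threshold of 1.4 used in place of a p-value filter, and the example counts are all assumptions made for this sketch.

```python
from statistics import median


def candidate_cnv_windows(barcode_counts, ploidy=2, min_ratio=1.4):
    """Flag windows whose distinct-barcode count departs from the global baseline.

    barcode_counts: number of distinct barcodes counted in each window.
    Returns a list of (window_index, estimated_copy_number) pairs.
    """
    baseline = median(barcode_counts)  # global expectation for a normal window
    candidates = []
    for window, count in enumerate(barcode_counts):
        ratio = count / baseline
        # Flag windows with a clear excess (gain) or a clear deficit (loss).
        if ratio >= min_ratio or ratio <= 1 / min_ratio:
            candidates.append((window, round(ploidy * ratio)))
    return candidates


# Hypothetical counts: window 3 has about 1.5x the baseline, window 5 about 0.5x.
counts = [100, 98, 102, 151, 99, 50, 101]
cnv_candidates = candidate_cnv_windows(counts)
```

With a diploid baseline of two copies, a window with about one and a half times the baseline count is estimated at three copies, and a window with about half the baseline count is estimated at one copy.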

In alternative 803, the system begins the analysis using phasing information associated with received sequencing reads. Module 803a phases a set of sequencing reads comprising barcodes by haplotype using methods described herein to provide two sets of sequencing reads: one associated with the first haplotype and the other associated with the second haplotype in a haplotype pair. Module 803b determines or approximates a ratio of received sequencing reads associated with the first haplotype and received sequencing reads associated with the second haplotype within a window of a reference sequence. Module 803c can identify candidate copy number variant regions based on a comparison of the number of barcodes within the window associated with one haplotype versus the other. For example, a ratio of barcodes associated with each haplotype that significantly deviates from a statistical expectation can suggest the presence of a copy number variant on a haplotype, and can provide an estimation of the number of copies. Module 803d can filter the identified regions based, for example, on a p-value to determine which candidate windows are most likely to contain a real mutation.
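As a non-limiting illustration, the haplotype ratio test of modules 803b through 803d can be sketched in Python with an exact two-sided binomial test. The expectation that barcodes in a window divide equally between the two haplotypes, the choice of a binomial test, the significance level of 0.01, and the function names are assumptions made for this sketch.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for observing k successes in n trials."""
    probabilities = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probabilities[k]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    tail = sum(q for q in probabilities if q <= observed * (1 + 1e-9))
    return min(1.0, tail)


def haplotype_imbalance(hap1_barcodes, hap2_barcodes, alpha=0.01):
    """Return (is_candidate, p_value) for one window of the reference sequence."""
    total = hap1_barcodes + hap2_barcodes
    p_value = binomial_two_sided_p(hap1_barcodes, total)
    return p_value < alpha, p_value


# Hypothetical windows: a 30:10 split between haplotypes and an even 20:20 split.
skewed, skewed_p = haplotype_imbalance(30, 10)
balanced, balanced_p = haplotype_imbalance(20, 20)
```

A window with a skewed split is retained as a candidate copy number variant on the over-represented haplotype, while a window with an even split is not.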

A module can be programmed to execute any useful statistical calculation to determine, for example, the statistical significance of data suggesting a polymorphism, a copy number variant, a translocation, or an inversion, or the statistical likelihood of the existence of any of the foregoing. Non-limiting examples of statistical analyses that can be performed include: a) analysis of variance (ANOVA); b) chi-squared test; c) factor analysis; d) Mann-Whitney U analysis; e) mean square weighted deviation (MSWD); f) Pearson product-moment correlation coefficient; g) regression analysis; h) Spearman's rank correlation coefficient; i) Student's t-test; j) time series analysis; and k) p-value analysis.

Assembling Sequencing Reads.

A challenge in the analysis of sequencing data is to assemble short sequencing reads obtained experimentally into a longer digital representation of the sample unambiguously. Longer sequencing reads can provide faster and more reliable processing, a lower error rate, and a greater confidence level in alignments and construction of candidate sequences. FIG. 9 illustrates modules that can be coded within an assembly algorithm that assembles short sequencing reads into long sequencing reads. 901 illustrates sequencing Read 1, which is associated with Barcode 1, and sequencing Read 2, which is associated with Barcode 2. 901 also illustrates that an input received can be associated with a quantity of digital data, for example, 10-200 gigabases of data, which can be analyzed in real-time.

Module 902 identifies barcodes in received sequencing reads and searches for errors in the barcode reads. An error in reading a barcode ordinarily produces a sequence that deviates from a barcode only slightly, or has a high degree of homology with a real barcode, such as about 80%, about 85%, about 90%, about 95%, or about 97% homology with a real barcode. Errors in reading barcodes can also be identified by a count of the number of identical barcodes received. A particular barcode error should be received at a much lower frequency than a real barcode. Thus, a barcode that is observed with a statistically low frequency and that possesses high homology with another barcode received in much higher quantity is most likely an error in reading the other barcode.

Module 902 corrects the error in the erroneous barcode so the corresponding sequencing read can be processed correctly. 902 illustrates two triplets of sequencing reads. The right triplet shares a common barcode represented by a white dot. The left triplet shares a common barcode represented by a black dot; however, one sequencing read possesses the wrong barcode. 902 finds and corrects this error.
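As a non-limiting illustration, the barcode correction performed by module 902 can be sketched in Python as follows. The function names, the use of Hamming distance as the homology measure, the mismatch limit, and the abundance ratio are assumptions made for this sketch.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_barcodes(barcodes, max_mismatches=1, min_ratio=10):
    """Replace rare barcodes with a much more abundant near-identical barcode."""
    counts = Counter(barcodes)
    corrections = {}
    for rare, rare_n in counts.items():
        best = None
        for common, common_n in counts.items():
            if (
                common != rare
                and len(common) == len(rare)
                and common_n >= min_ratio * rare_n  # far more abundant
                and hamming(rare, common) <= max_mismatches  # high homology
            ):
                if best is None or common_n > counts[best]:
                    best = common
        if best is not None:
            corrections[rare] = best
    return [corrections.get(b, b) for b in barcodes]
```

For example, a barcode seen once that differs in a single residue from a barcode seen twenty times would be rewritten to the abundant barcode, while unrelated barcodes are left unchanged.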

Module 903 then groups sequencing reads based on barcodes associated with the sequencing reads. 903 illustrates three groups, A, B, and C, of sequencing reads obtained from a source. All sequencing reads in a group share a barcode in common with the other sequencing reads in the same group, but do not share a barcode in common with any other group. For example, all sequencing reads in Group A are associated with barcode A exclusively, all sequencing reads in Group B are associated with barcode B exclusively, and all sequencing reads in Group C are associated with barcode C exclusively.

Once the sequencing reads have been grouped, module 904 identifies sequencing errors in the sequencing reads. Errors can be identified based on the frequency of certain subsequences occurring among sequencing reads of the same group. The confidence that a subsequence is correct increases as the number of sequencing reads containing the subsequence increases. However, a small number of sequencing reads can possess a subsequence that differs in a small number of residues from another subsequence that has been identified with a higher confidence level. The deviant subsequence can be identified as a sequence error and corrected to match the subsequence obtained at a higher confidence level.
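As a non-limiting illustration, the frequency-based error correction of module 904 can be sketched in Python by counting fixed-length subsequences (k-mers) within a single barcode group. The function name, the k-mer length, and the count at which a subsequence is treated as trusted are assumptions made for this sketch.

```python
from collections import Counter


def correct_reads(reads, k=5, trusted=3):
    """Correct rare k-mers that differ by one residue from a single trusted k-mer.

    reads: sequencing reads that share one barcode group.
    """
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    solid = {kmer for kmer, n in counts.items() if n >= trusted}

    corrected = []
    for read in reads:
        bases = list(read)
        for i in range(len(bases) - k + 1):
            kmer = "".join(bases[i:i + k])
            if kmer in solid:
                continue  # subsequence already observed with high confidence
            matches = [
                s for s in solid
                if sum(a != b for a, b in zip(s, kmer)) == 1
            ]
            if len(matches) == 1:
                # Unambiguous neighbor: adopt the higher-confidence subsequence.
                bases[i:i + k] = matches[0]
        corrected.append("".join(bases))
    return corrected
```

The sketch corrects a deviant subsequence only when exactly one trusted subsequence lies within a single residue of it, so ambiguous cases are left untouched.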

Module 905 can compare short sequencing reads for overlapping subsequences. Multiple overlaps can be found, and the greatest statistical significance and confidence is assigned to long overlaps that are observed over the greatest number of sequencing reads. As high-confidence overlaps are found, the smaller sequencing reads can be assembled into increasingly larger sequencing reads.

As the process continues, non-overlapping sequencing reads are eventually joined. For example, if a first sequencing read overlaps with a second sequencing read, but not with a third, and the second sequencing read does overlap with the third, module 905 can determine a candidate sequence organizing the first, second, and third sequencing reads into a single longer sequence.
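As a non-limiting illustration, the overlap-driven joining performed by module 905 can be sketched in Python as a greedy merge that always joins the pair of sequences with the longest suffix-to-prefix overlap. The function names and the minimum overlap length are assumptions made for this sketch, and the sketch ranks overlaps by length only rather than by the number of supporting sequencing reads.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # no remaining overlaps of sufficient length
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]
```

With three reads in which the first overlaps the second and the second overlaps the third, the first and third reads end up in a single longer sequence even though they share no overlap with each other.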

Once longer sequencing reads have been generated, module 906 can improve data obtained from shorter sequencing reads by comparison of short sequencing reads with the longer sequencing reads. A short sequencing read can be aligned against a longer sequencing read that spans greater than the length of the short sequencing read. A high degree of homology in the alignment suggests with high confidence that both sequencing reads correspond to the same region of the sample. If the sequencing reads differ only slightly, such as in a single residue, module 906 can correct the short sequencing read to match the longer sequencing read because the longer sequencing read, being assembled from several barcoded sequencing reads, has a higher likelihood of being accurate. Module 907 can then output the long sequencing reads for use in development of candidate sequences.
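As a non-limiting illustration, the correction step of module 906 can be sketched in Python as an ungapped search for the best placement of a short sequencing read on a longer sequencing read. The function name and the mismatch limit are assumptions made for this sketch, and insertions and deletions are not handled.

```python
def correct_against_long(short, long_read, max_mismatches=1):
    """Return the short read corrected to the long read when they nearly match.

    The short read is slid along the long read without gaps. If the best
    placement has at most max_mismatches differences, the matching segment
    of the long read is returned; otherwise the short read is unchanged.
    """
    best_pos, best_mm = None, max_mismatches + 1
    for pos in range(len(long_read) - len(short) + 1):
        window = long_read[pos:pos + len(short)]
        mm = sum(a != b for a, b in zip(short, window))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    if best_pos is None:
        return short  # no sufficiently similar placement was found
    return long_read[best_pos:best_pos + len(short)]
```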

Identifying Sequencing Reads Originating From a Common Source.

FIG. 10 is a visual representation of modules that can be encoded in an algorithm that identifies sequencing reads associated with a common biomolecule, such as a nucleic acid. 1001 illustrates sequencing Read 1, which is associated with Barcode 1, and sequencing Read 2, which is associated with Barcode 2. 1001 also illustrates that a sequencing read input received can be associated with a quantity of digital data, for example, 10-200 gigabases of data, which can be analyzed in real-time.

As sequencing reads are received and barcodes are recognized, module 1002 can correct errors in barcodes using methods described herein. The left triplet shown in 1002 illustrates a set of sequencing reads that have a mismatched barcode (white center). The right triplet shown in 1002 illustrates a set of sequencing reads that have properly matched barcodes. Module 1003 can then align sequencing reads to a reference, such as a reference genome. Module 1004 then groups sequencing reads within the alignment based on barcodes associated with each sequencing read, to provide groups of sequencing reads wherein each member of a group shares a common barcode. 1004 illustrates two such groups of barcoded sequencing reads aligned against the same reference genome, wherein both groups share the same common barcode.

Module 1005 can analyze the alignments to identify local regions of a sample. Overlapping sequencing reads sharing a common barcode that together provide a consensus region of a biomolecule represent a local region, and can be distinguished from another group of barcoded sequencing reads, which together provide a separate consensus region. As illustrated in 1004, both groups of sequencing reads share the same common barcode, and this commonality can suggest that both groups of sequencing reads represent local regions of a biomolecule that are spatially close. Very close regions can be physically contiguous in the sample.
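As a non-limiting illustration, the identification of local regions by module 1005 can be sketched in Python by merging overlapping alignments that share a barcode. The function name, the tuple layout, and the optional gap tolerance are assumptions made for this sketch.

```python
from collections import defaultdict


def local_regions(alignments, max_gap=0):
    """Merge overlapping same-barcode alignments into local regions.

    alignments: iterable of (barcode, start, end) on one reference sequence.
    Returns {barcode: [(region_start, region_end), ...]}.
    """
    by_barcode = defaultdict(list)
    for barcode, start, end in alignments:
        by_barcode[barcode].append((start, end))

    regions = {}
    for barcode, spans in by_barcode.items():
        spans.sort()
        merged = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= merged[-1][1] + max_gap:
                # Overlaps (or nearly touches) the current region: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        regions[barcode] = [tuple(m) for m in merged]
    return regions
```

Two regions returned for the same barcode correspond to the situation illustrated in 1004, in which separate groups of sequencing reads share a common barcode.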

Once close, or contiguous, regions have been identified, module 1006 can realign the sequencing reads to improve the quality of the alignment. When the local regions are contiguous, an individual sequencing read can potentially be aligned with one subregion that provides an alignment less accurate than that possible with another subregion. Dividing the barcoded group into distinguishable contiguous local regions provides a second opportunity to reexamine the alignments, and when necessary, transfer a sequencing read from one subregion to another contiguous subregion to provide a more optimal alignment, as illustrated in 1006.

Assembling Short Sequencing Reads into a Digital Representation of a Sample.

Assembling short sequencing reads into longer contiguous and overlapping sequences suitable for generating candidate sequences is an expensive and time-consuming aspect of sequencing data interpretation. FIG. 11 provides an overview of multiple approaches to a high-level assembly of short sequencing reads into long sequencing reads, made more rapid and efficient by the processes provided herein. Sequencing data 1101 can comprise short sequencing reads associated with barcodes. One approach 1102 is to assemble short sequencing reads comprising identical barcodes into contiguous sequences, and to infer a consensus sequence of overlapping residues based on sequence overlaps. A second approach 1103 is to form a scaffold of contiguous sequencing reads based on an overlap of the sequencing reads and partial/global alignment to a reference genome. The short sequencing reads assembled in 1103 can comprise barcodes, and a barcode can be used as described herein to resolve alignment ambiguities. A third approach is to assemble short barcoded sequencing reads based on alignment 1104 of the short sequencing reads, filtering 1105 of mismatched sequencing reads based on the barcodes, and assembly 1106 of candidate sequences using both long reads and the filtered short reads. A fourth approach 1107 is identification of a group of short sequencing reads based on an understanding of the origin of the sequencing reads within a sequencer apparatus. Such an approach can efficiently assemble a set of overlapping short sequencing reads into longer sequences, and the short and long sequencing reads can be used together. Alternatively, process 1106 can be used directly without first aligning and filtering the short sequencing reads. All approaches are accessible via real-time processing.

FIG. 12 illustrates advantages of barcode-assisted assembly of a sequencing read. Pane 1201 illustrates the assembly of a candidate sequence from sequencing reads, and a selection of the best candidate sequence based on barcode analysis. 1201a illustrates a candidate sequence derived from five sequencing reads, 1, 2, 3, 4, and 5, each of which corresponds to either barcode 1 (dashed) or barcode 2 (dotted). The pattern of barcodes 1 and 2 on the candidate sequence suggests that the sequencing reads associated with both barcodes originate from the same local region of the sample because both barcodes appear in overlapping series. Thus, the data obtained from the sequencing reads associated with barcode 1 and barcode 2 mutually reinforce one another and add confidence to the candidate sequence of 1201a.

1201b illustrates a candidate sequence derived from four sequencing reads, 1, 2, 3, and 4, each of which corresponds to either barcode 1 or barcode 2. The alignment of 1201b indicates that sequencing reads 1 and 2 are likely to correspond to proximate regions in a sample, and sequencing reads 3 and 4 are likely to correspond to proximate regions in a sample. However, since no overlap is observed between sequencing reads of differing barcodes, the data obtained from barcodes 1 and 2 do not reinforce one another. The analysis cannot ascertain unambiguously whether the alignment illustrated in 1201b corresponds to a contiguous set of overlapping sequences or to remote regions of the sample.

1201c illustrates a candidate sequence derived from four sequencing reads, 1, 2, 3, and 4, each of which corresponds to either barcode 1 or barcode 2. The alignment of 1201c indicates that sequencing reads 3 and 4 are likely to correspond to proximate regions in a sample, but cannot provide an unambiguous relationship between sequencing reads 1 and 4. Since no overlap is observed in 1201c between sequencing reads of differing barcodes, the data obtained from barcodes 1 and 2 do not reinforce one another. The analysis cannot ascertain unambiguously whether the alignment illustrated in 1201c corresponds to a contiguous set of overlapping sequences or to remote regions of the sample. Of the three alignments, 1201a offers the highest confidence option for constructing an accurate candidate sequence.

Panels 1202 and 1203 provide graphic representations of outputs of sequencing alignments against a reference sequence. Both graphs indicate the numbering of residues in the reference sequence on the x-axis, and, on the y-axis, the number of sequencing reads detected that correspond to the value of the x-axis.

1202 illustrates a negative spike around x=50, reaching down to y=0, over a span of residues of the reference sequence. This spike indicates a gap in the alignment between the sequencing reads and the reference sequence, a region of the reference sequence to which none of the sequencing reads correspond. The alignment producing this graph is unlikely to represent the sample accurately.

1203 illustrates a plot of a more accurate alignment. Each residue of the reference sequence is represented in multiple sequencing reads. The alignment producing this graph is more likely to represent the sample accurately.
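As a non-limiting illustration, the per-residue read counts plotted in 1202 and 1203, and the detection of a gap such as that shown in 1202, can be sketched in Python as follows. The function names and the half-open coordinate convention are assumptions made for this sketch.

```python
def coverage_depth(alignments, ref_length):
    """Count how many sequencing reads cover each residue of the reference.

    alignments: iterable of (start, end) half-open intervals, 0-based.
    """
    depth = [0] * ref_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    return depth


def zero_coverage_gaps(depth):
    """Return half-open (start, end) spans where no read covers the reference."""
    gaps, gap_start = [], None
    for pos, d in enumerate(depth):
        if d == 0 and gap_start is None:
            gap_start = pos  # a gap opens here
        elif d > 0 and gap_start is not None:
            gaps.append((gap_start, pos))  # the gap closes here
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, len(depth)))
    return gaps
```

An alignment that yields one or more gaps resembles 1202, whereas an alignment with no gaps and multiple reads per residue resembles 1203.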

FIG. 13 illustrates a plot scoring alignments generated by the methods described herein. A quality control module can generate a score for an alignment based on numerous factors such as percentage of reference residues represented and number of times represented, degree of sequencing read overlap, percentage of duplicates, number of errors, and numbers and ratios of barcodes detected. In FIG. 13, an alignment producing the results of 1202 would be scored low, and would be grouped “Unlikely Accurate”. An alignment producing the results of 1203 would be scored high, and would be grouped “Likely Accurate”.

FIG. 14 illustrates simulated results of barcode-assisted alignments as a function of sequencing-read length and mapping error. Simulations were performed for both known sequencing analysis algorithms and for algorithms disclosed herein, using single barcode per read and paired-end barcodes per read systems. The plots indicate that in both cases of single or paired barcodes, the algorithms of the invention produce lower error rates at every read length investigated.

RNA Sequencing Algorithms.

Alternative splicing events, which can be common in eukaryotic RNA, pose challenges to the identification of distinct splicing variants with sequencing technologies. The invention provides an effective and accurate method to resolve ambiguities present in sequencing reads generated with RNA sequencing technologies. Non-limiting examples of classes of RNAs that can be analyzed with a method herein include: messenger RNAs (mRNAs), transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), small nuclear RNAs (snRNAs), small interfering RNAs (siRNAs), piwi-interacting RNAs (piRNAs), non-coding RNAs (ncRNAs), long non-coding RNAs (lncRNAs), and fragments of any of the foregoing.

The methods herein of analyzing sequencing data solve challenges in analyzing RNA sequencing data associated with genes that are alternatively spliced. FIG. 15 illustrates a process wherein an association of a specific barcode with individual transcripts unambiguously identifies each transcript, whether the transcript contains alternatively spliced exons or not.

In 1501, each barcode is associated with a distinct transcript. For example, a translocation frequently identified in prostate cancer can include fusions of the TMPRSS2 gene, and the ETS transcription factor gene ERG to generate TMPRSS2-ERG translocations. The presence of an identical barcode in a transcript comprising a fragment of the TMPRSS2 gene and a fragment of the ERG gene can designate the presence of a TMPRSS2-ERG translocation in a sample.
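As a non-limiting illustration, the barcode-based designation of a fusion transcript described for 1501 can be sketched in Python as follows. The function name, the input layout, and the minimum read support are assumptions made for this sketch, and the sketch presumes that each barcode labels a single transcript.

```python
from collections import defaultdict


def fusion_candidates(read_hits, min_reads_per_gene=1):
    """Report gene pairs whose fragments carry an identical barcode.

    read_hits: iterable of (barcode, gene) pairs, one per aligned read.
    Returns {(gene_a, gene_b): [supporting barcodes]}, genes sorted by name.
    """
    genes_by_barcode = defaultdict(lambda: defaultdict(int))
    for barcode, gene in read_hits:
        genes_by_barcode[barcode][gene] += 1

    fusions = defaultdict(list)
    for barcode, genes in genes_by_barcode.items():
        supported = sorted(g for g, n in genes.items() if n >= min_reads_per_gene)
        # Every pair of genes seen under one barcode is a fusion candidate.
        for i in range(len(supported)):
            for j in range(i + 1, len(supported)):
                fusions[(supported[i], supported[j])].append(barcode)
    return dict(fusions)
```

A barcode whose reads map to both TMPRSS2 and ERG would be reported as supporting a candidate TMPRSS2-ERG fusion, whereas barcodes seen with only one of the two genes would not.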

1502 illustrates a method wherein an association of a barcode with a particular locus can indicate the locus of origin of a transcript. For example, a mutation within the first exon of the Adenomatous polyposis coli (APC) gene of a human subject can be associated with a 95% likelihood of the subject developing familial adenomatous polyposis (FAP). A method of the invention can associate an RNA transcript with a haplotype, thereby unambiguously identifying the entirety of the transcript comprising the mutation.

Another challenge presented by RNA sequencing analysis is the reconstruction of short sequencing reads into correct isoforms. FIG. 16 illustrates a barcode-assisted assembly of RNA transcripts applied to resolving ambiguities in assembling sequencing reads of distinct RNA isoforms. The association of identical barcodes to different RNA isoforms provides an effective method to resolve ambiguities in assembling short sequencing-reads generated in RNA sequencing.

Visual Output.

The methods and computer program products provided herein can output information associated with a sequencing assay in real-time as an assay runs. Output information can include sequencing parameters, data, sequencing reads, candidate sequences, alignments, comparisons, mutations, errors, reference sequences used, and information associated with the assay instrument, user, or client. A system can track assays occurring on multiple instruments simultaneously and identify the instruments being used by make, model, serial number, owner, or user. In doing so, the output module can provide information about the end users and their preferences and parameters. A visual output can track the construction of multiple candidate sequences simultaneously, and display multiple candidate chromosomes simultaneously for comparison.

A user of the output system can monitor progress of an analysis, and can interact with information during the analysis, either in static or real time, and upon conclusion of an analysis. The user can make notations in the interface, such as posting comments on an analysis and flagging points of interest, such as potential errors. The user can obtain information from the interface, including data, data processing results, and information describing an instrument, computer, client, facility, sample, or subject associated with the sequencing data.

FIG. 17 illustrates visual outputs. 1701 illustrates a visual display outputting data in real-time. 1701a illustrates a screen that displays sequencing assay parameters processed in real-time. Example output parameters include: a) median depth of sequencing; b) standard deviation of depth of sequencing; c) total number of sequencing reads received; d) percentage of pair aligned reads; e) percentage of single aligned reads; f) percentage of pair unaligned reads; g) percentage of zero coverage; h) percentage of low coverage; i) mean insert size; and j) standard deviation of the insert size.
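As a non-limiting illustration, several of the depth-related parameters displayed in 1701a can be computed from a per-residue depth list with the Python standard library. The function name, the dictionary keys, and the low-coverage threshold are assumptions made for this sketch.

```python
from statistics import median, pstdev


def depth_metrics(depth, low_threshold=10):
    """Summarize a per-residue depth list for display.

    Low coverage is taken here to mean a nonzero depth below low_threshold,
    so that zero coverage and low coverage are reported separately.
    """
    n = len(depth)
    return {
        "median_depth": median(depth),
        "stdev_depth": pstdev(depth),
        "pct_zero_coverage": 100.0 * sum(d == 0 for d in depth) / n,
        "pct_low_coverage": 100.0 * sum(0 < d < low_threshold for d in depth) / n,
    }
```

In a real-time setting, such a summary could be recomputed as additional sequencing reads are received and the depth list is updated.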

Display module 1701b displays parameters associated with barcode and sequencing read assembly in real-time. Examples of display information include: a) a total number of unique barcodes; b) percentage of barcodes with errors; c) final number of unique barcodes; d) mean reads per barcode; e) percentage of reads that are realigned; f) mean inferred contiguous sequence length; g) percentage of inferred contiguous sequences smaller than a threshold size; h) number of inferred contiguous sequences; i) mean or maximum contiguous sequence length per barcode; and j) standard deviation of contiguous sequence length per barcode. Display module 1701c displays parameters associated with polymorphisms in real-time. Example parameters include: a) number of single nucleotide polymorphisms identified; b) number of INDELs identified; c) rate of identifying homozygous polymorphisms; d) number of translocations identified; e) number of translocations associated with a haplotype; f) percentage of single nucleotide polymorphisms associated with a haplotype; and g) rate of INDELs associated with a haplotype.

The output modules can display graphical representations of the parameters above. 1701d illustrates a graphical representation of a coverage depth of sequencing. 1701e illustrates a graphical representation of an inferred assembly of a set of sequencing reads. 1701f illustrates a graphical representation of the length of a phase block.

Data Storage.

Output data from a sequencing experiment, and output results from data analysis, can be stored electronically in data archives. Any information described herein can be stored in a data archive, and specific information can be stored in files configured to contain or reference any barcodes associated with the information. A record can contain, for example, information associated with a variant that was identified in data analysis. Such information can include the identifying features of the variant, such as the chromosome bearing the variant, the position on the chromosome, and the identities of the nucleobases varied from a reference, for example, replacing adenosine with cytidine.

The record can contain the identities or sequences of any sequencing read associated with the variant. The record can further contain any barcode associated with the variant or associated sequencing reads. The inclusion of the barcodes in the records provides access to any information derivable from the barcodes, such as the error probability associated with the variant, and haplotype phasing.

The incorporation or association of barcodes with the records describing the variants allows for significant improvements in the efficiency of data analysis and manipulation. A user has access to the relevant barcodes without the need to search the raw data to find the barcodes associated with the variant. If further data analysis is done at a later time, the analysis can begin directly from the record rather than having to redo the initial data analysis. The avoidance of reanalyzing barcode/variant correlations established previously saves time and cost, and facilitates the performance of multiple successive analyses from the same data batch. Also, a user can focus subsequent data analysis on precisely the barcodes, or sequencing reads associated therewith, relevant to the variant of interest without the need to process all other barcodes associated with the original sample material. The data storage system promotes multiple bioinformatic applications, such as variant phasing, single-cell analysis, and metagenomics by using the barcodes to identify the place of origin of a variant.

FIG. 21 illustrates a non-limiting example of the record system. 2101 illustrates obtaining the raw sequencing data in the form of a FASTQ file. The FASTQ file can contain many barcodes associated with a sequencing experiment. Alignment of the sequencing reads provides aligned data, which can be stored in a BAM file (2102). The barcodes can be associated with variants in the aligned data, and can identify the variant by location. 2102 shows that barcode 1 is associated with a variant at position 3,955,443 on chromosome 20, and barcode 2 is associated with a variant at position 41,405,930 on chromosome 3. In 2103, the variants have been called and recorded in a VCF file. The VCF file identifies the differences between the sequencing data and a reference. Position 3,955,443 on chromosome 20 of the sample has a cytidine residue, whereas the reference has an adenosine residue, and position 41,405,930 on chromosome 3 of the sample has an adenosine residue, whereas the reference has a guanosine residue. Once the variants are identified, the record can be elaborated (through use of the BAM file (2102)) with the identities of all barcodes relevant to the variant call, including barcodes supporting the called variant and barcodes supporting the reference allele (2104).
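As a non-limiting illustration, a record of the kind shown in 2104 can be sketched in Python as a small data structure that stores, for each called variant, the barcodes supporting the variant allele and the barcodes supporting the reference allele. The class name, the field names, and the layout of the barcoded observations are assumptions made for this sketch and do not correspond to any particular file format.

```python
from dataclasses import dataclass, field


@dataclass
class VariantRecord:
    chrom: str
    pos: int
    ref: str
    alt: str
    alt_barcodes: set = field(default_factory=set)
    ref_barcodes: set = field(default_factory=set)


def annotate(records, observations):
    """Attach barcodes to variant records.

    observations: iterable of (chrom, pos, base, barcode) tuples, for example
    as could be derived from aligned, barcoded sequencing reads.
    """
    index = {(r.chrom, r.pos): r for r in records}
    for chrom, pos, base, barcode in observations:
        rec = index.get((chrom, pos))
        if rec is None:
            continue  # the read does not cover a called variant
        if base == rec.alt:
            rec.alt_barcodes.add(barcode)
        elif base == rec.ref:
            rec.ref_barcodes.add(barcode)
    return records


# The two variants described for 2103, with hypothetical barcode labels.
records = annotate(
    [VariantRecord("20", 3955443, "A", "C"), VariantRecord("3", 41405930, "G", "A")],
    [
        ("20", 3955443, "C", "barcode1"),
        ("20", 3955443, "A", "barcode7"),
        ("3", 41405930, "A", "barcode2"),
    ],
)
```

Because the barcodes travel with the record, a later analysis such as phasing can begin from the record without returning to the raw data.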

Generating and Analyzing DNA/RNA Sequencing Reads.

Sequencing experiments can be done individually or in parallel to provide tens, hundreds, or thousands of sequences simultaneously. In addition to identifying standard nucleotides and amino acids, methods herein can identify natural and non-natural modifications within a sample, for example, methylation of the cytosine base to produce 5-methylcytosine. In some embodiments, a sample comprises modifications within traditional base pairs, such as hypoxanthine or xanthine, which can be purine derivatives. In some embodiments, a sample can include natural and non-natural RNA derivatives, such as inosine (I), and modifications such as 2′-O-methylribose modifications.

Computer program products and methods herein can analyze data generated by any sequencing instrument. Non-limiting examples of sequencers include: a) DNA sequencers produced by Illumina™, for example, HiSeq™, HiScanSQ™, Genome Analyzer GAIIX™, and MiSeq™ models; b) DNA sequencers produced by Life Technologies™, for example, DNA sequencers under the AB Applied Biosystems™ and/or Ion Torrent™ brands; c) DNA sequencers manufactured by Beckman Coulter™; d) DNA sequencers manufactured by 454 Life Sciences™; and e) DNA sequencers manufactured by Pacific Biosciences™.

Methods of the invention can analyze data obtained in a variety of sequencing experiments. For example, sequencing-by-synthesis technology uses a unique fluorescent-label for each of adenine, cytidine, guanine, and thymidine. A nucleotide chain is synthesized based on a sample sequence, and during each sequencing cycle, a single labeled deoxynucleoside triphosphate (dNTP) is added to the nucleic acid chain. The nucleotide label terminates polymerization. After each dNTP incorporation, the fluorescent dye is irradiated to identify the newly-added residue and then the label is enzymatically cleaved to allow incorporation of the next nucleotide.

Semiconductor sequencing technology is based on the premise that when a nucleotide is incorporated into a strand of DNA by a polymerase, a hydrogen ion is released as a byproduct. A sequence is synthesized based on a template sample sequence. Each of adenine, cytidine, guanine, and thymidine is added sequentially. If one of the nucleotides is incorporated into the growing chain, the charge from the released hydrogen ion changes the pH of the reaction solution. A solid-state pH meter detects the pH change, thereby identifying the chain elongation. If a nucleotide is not incorporated, no voltage change will be recorded by the solid-state pH meter and no information will be incorporated into a sequencing read. A pattern of released hydrogen ions can provide the information needed to construct a sequencing read.
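As a non-limiting illustration, the construction of a sequencing read from a pattern of released hydrogen ions can be sketched in Python as follows. The function name, the nucleotide flow order, and the assumption that the normalized signal of a flow is approximately proportional to the number of identical nucleotides incorporated in that flow are made for this sketch only.

```python
def decode_flows(flow_order, signals):
    """Convert per-flow signal intensities into a base sequence.

    flow_order: the repeating order in which nucleotides are added, e.g. "TACG".
    signals: one normalized intensity per flow; about 0 means no incorporation,
             about 1 means one base, about 2 means two identical bases, etc.
    """
    read = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        n = max(0, int(signal + 0.5))  # round to the nearest whole count
        read.append(base * n)  # no signal contributes nothing to the read
    return "".join(read)


# Flows T, A, C, G, T, A, C, G with hypothetical intensities.
example_read = decode_flows("TACG", [1.0, 0.0, 0.1, 2.1, 0.0, 0.9, 0.0, 0.0])
```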

Samples.

A sequencing read can be generated from various samples. A sample can be derived from a variety of biological sources, such as blood, saliva, tissue, or a biopsy. A sequence of a sample can comprise information associated with a variety of biomolecules. Non-limiting examples of biomolecules include amino acids, peptides, peptide-mimetics, proteins, recombinant proteins, antibodies (monoclonal or polyclonal), antibody fragments, antigens, epitopes, carbohydrates, lipids, fatty acids, enzymes, natural products, nucleic acids (including DNA, mRNA, microRNA, tRNA, rRNA, long non-coding RNAs, snoRNA, nucleosides, nucleotides, structure analogues or combinations thereof), nutrients, receptors, and vitamins.

In some embodiments, the amount of sample required to provide sufficient data for analysis, or analysis with the confidence rates or error rates described herein, is no more than about 1 ng, no more than about 5 ngs, no more than about 10 ngs, no more than about 15 ngs, no more than about 20 ngs, no more than about 25 ngs, no more than about 30 ngs, no more than about 35 ngs, no more than about 40 ngs, no more than about 45 ngs, no more than about 50 ngs, no more than about 55 ngs, no more than about 60 ngs, no more than about 65 ngs, no more than about 70 ngs, no more than about 75 ngs, no more than about 80 ngs, no more than about 85 ngs, no more than about 90 ngs, no more than about 95 ngs, no more than about 100 ngs, no more than about 200 ngs, no more than about 300 ngs, no more than about 400 ngs, no more than about 500 ngs, no more than about 600 ngs, no more than about 700 ngs, no more than about 800 ngs, no more than about 900 ngs, or no more than about 1 μg. In some embodiments, the amount of sample required to provide sufficient data for analysis, or analysis with the confidence rates or error rates described herein, is no more than about 1 pg, no more than about 5 pgs, no more than about 10 pgs, no more than about 15 pgs, no more than about 20 pgs, no more than about 25 pgs, no more than about 30 pgs, no more than about 35 pgs, no more than about 40 pgs, no more than about 45 pgs, no more than about 50 pgs, no more than about 55 pgs, no more than about 60 pgs, no more than about 65 pgs, no more than about 70 pgs, no more than about 75 pgs, no more than about 80 pgs, no more than about 85 pgs, no more than about 90 pgs, no more than about 95 pgs, no more than about 100 pgs, no more than about 200 pgs, no more than about 300 pgs, no more than about 400 pgs, no more than about 500 pgs, no more than about 600 pgs, no more than about 700 pgs, no more than about 800 pgs, no more than about 900 pgs, or no more than about 1 ng.

DNA Sequencing Sample Preparation.

A non-limiting example of a comprehensive DNA sequencing protocol that generates sequencing-reads is as follows.

Sample preparation can include DNA shearing. DNA shearing can comprise: (1) Transferring 1-3 μg of DNA to a 1.5 mL tube; (2) Bringing the DNA up to a volume of up to 130 μL with nuclease-free water; and (3) Transferring DNA to a microtube in preparation for sonication at 4° C. for 1-10 minutes.

The sheared sample can be immobilized on a surface. Solid Phase Reversible Immobilization (SPRI™) beads can be used to immobilize a DNA or an RNA sample to a surface. The beads can be paramagnetic beads coated with nucleic acid binding moieties, such as carboxyl groups. To immobilize DNA to a set of beads, a user can: (1) Mix by vortex a quantity of beads to ensure a homogeneous solution; (2) Transfer 180 μL of bead solution to a 1.5 mL tube; (3) Add 100 μL of end-repaired DNA; (4) Mix by vortex for 5 minutes; (5) Place the tube on a magnet for 3-5 minutes, until the solution is clear; (6) Remove the cleared supernatant, while the tube is on the magnet; (7) Add approximately 500 μL of 70% ethanol, while the tube is on the magnet; (8) Allow the beads to settle for approximately 1 minute; (9) Remove the ethanol and wash once with 500 μL of 70% ethanol; (10) Remove the ethanol; (11) Dry the sample on a heating block at 37° C. for 5 minutes to remove residual ethanol; (12) Add 10 μL of nuclease-free water to the beads; (13) Mix by vortex and incubate at room temperature for 2 minutes; (14) Place the tube on a magnet for 2-3 minutes until the solution is clear; and (15) Transfer 15 μL of the solution to a fresh PCR tube, while being careful not to disturb the beads.

Sample preparation can include steps to add supplementary adenosine bases to immobilized/fragmented DNA. To add adenosine bases to immobilized/fragmented DNA, a user can: (1) Prepare a reaction (10 μL eluted DNA; 2 μL 10× Blue Buffer; 5 μL 1 mM dATP; and 3 μL Klenow exo- (50,000 U/mL)) in a 1.5 mL tube; (2) Incubate the tube in a heat block at 37° C. for 30 minutes; and (3) Heat-inactivate the enzyme by incubating at 75° C. for 20 minutes.

Sample preparation can comprise ligating adapters to DNA fragments. An adapter can comprise a barcode. An adapter can be a structure used to attach a barcode to a target polynucleotide. A sample of DNA or RNA (total or fractionated, such as poly(A) enriched RNA) can be converted to a library of fragments with adaptors attached to one or both ends.

A non-limiting example of a protocol that can ligate an adapter to DNA fragments includes: (1) Prepare a reaction (20 μL DNA; 25 μL 2× rapid ligase buffer; 100 μM adapter mix; and 1 μL T4 DNA ligase) in a 1.5 mL tube; (2) Incubate on a thermal cycler at 20° C. for 30 minutes; and (3) Purify with AMPure™ XP beads.

The adapter-ligated DNA fragments can be purified prior to sequencing. A non-limiting example of a method of purification includes AMPure™ DNA Purification methods. (1) SPRI™ beads are allowed to equilibrate at room temperature. (2) Beads are mixed by vortex to ensure a homogeneous solution. (3) 60 μL of bead solution is transferred to a 1.5 mL tube. (4) 50 μL of the adapter-ligated library is transferred to the tube. (5) The sample is mixed by vortex for 5 minutes. (6) The tube is placed on a magnet for 3-5 minutes, until the solution is clear. (7) The supernatant is removed, while the tube is on the magnet. (8) 500 μL of 70% ethanol is added to the sample while the tube is on the magnet. (9) Beads are allowed to settle for 1 minute. (10) Ethanol is removed and the beads are washed again with 500 μL of 70% ethanol. (11) Ethanol is removed. (12) The sample is dried on a 37° C. heat block for 5 minutes to remove residual ethanol. (13) 30 μL of nuclease-free water is added to the beads. (14) The sample is mixed by vortex and incubated at room temperature for about 2 minutes. (15) The tube is placed on a magnet for about 2-3 minutes until the solution is clear. (16) 30 μL of solution is transferred to fresh PCR tubes. To confirm the addition of adapters to DNA samples, 1 μL of sample can be evaluated on, for example, an Agilent 2100 Bioanalyzer™.

A non-limiting example of a protocol for sample library amplification includes: (1) Prepare a reaction (15 μL purified DNA template; 1.5 μL 25 μM forward primer; 1.5 μL 25 μM reverse primer; 25 μL 2× Mastermix; and 7 μL water) for each sample to be amplified. (2) Run a thermal cycler program (A: 98° C. for 2 minutes; B: 11 cycles of: i) 98° C. for 20 seconds; ii) 64° C. for 30 seconds; and iii) 72° C. for 5 minutes; C: 72° C. for 5 minutes; and D: hold at 4° C.). (3) Purify sample.
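As a quick check on the amplification protocol above, the following minimal Python sketch sums the listed reaction components and the programmed hold times of the thermal cycler program. The variable and function names are illustrative assumptions, instrument ramp times are ignored, and the final hold at 4° C. is excluded because it is open-ended.

```python
# Reaction components from step (1), in microliters.
reaction_ul = {
    "purified DNA template": 15.0,
    "forward primer, 25 uM": 1.5,
    "reverse primer, 25 uM": 1.5,
    "2x Mastermix": 25.0,
    "water": 7.0,
}

# Thermal cycler program from step (2) as (label, seconds, repeats).
# Segment D (hold at 4 C) is omitted because it has no fixed duration.
program = [
    ("A: 98 C", 120, 1),
    ("B-i: 98 C", 20, 11),
    ("B-ii: 64 C", 30, 11),
    ("B-iii: 72 C", 300, 11),
    ("C: 72 C", 300, 1),
]


def total_volume_ul(components):
    """Sum the component volumes of a reaction, in microliters."""
    return sum(components.values())


def programmed_seconds(steps):
    """Sum the programmed hold times, ignoring instrument ramp times."""
    return sum(seconds * repeats for _, seconds, repeats in steps)


if __name__ == "__main__":
    print("reaction volume (uL):", total_volume_ul(reaction_ul))
    print("programmed time (min):", programmed_seconds(program) / 60)
```

Under these assumptions the components sum to a 50 μL reaction, and the programmed holds total 4,270 seconds, or roughly 71 minutes before ramping.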

RNA Sequencing Sample Preparation.

A method and a computer program product of the invention can analyze RNA sequencing data. A non-limiting example of a protocol generating RNA sequencing reads is the following: Whole transcriptome RNA can be isolated and a library can be prepared for deep sequencing analysis. Library Construction: (1) mRNA is purified from Total RNA using magnetic beads: 5-100 ng of total RNA is diluted with nuclease-free water and heated to disrupt the secondary structures and then added to pre-prepared Magnetic Oligo(dT) Beads. After washing with washing buffer, 10 mM Tris-HCl is added to the beads and mRNA is eluted from the beads. (2) Fragmentation of the mRNA: The mRNA is fragmented into small pieces using divalent cations under elevated temperature. (3) Synthesis of the first and second strands of cDNA: The cleaved RNA fragments are copied into a first strand of cDNA using reverse transcriptase and random primers. Then the primary strand of mRNA is removed by RNaseH and a replacement strand is synthesized to generate double-stranded cDNA. (4) Overhang end repair into blunt ends: The overhangs resulting from fragmentation are converted into blunt ends using T4 DNA polymerase and E. coli DNA polymerase I Klenow fragment. The 3′ to 5′ exonuclease activity of these enzymes removes 3′ overhangs, and the polymerase activity fills in the 5′ overhangs. (5) Addition of ‘A’ Bases to the 3′ End of the DNA Fragments: An ‘A’ base is added to the 3′ end of the blunt phosphorylated DNA fragments, using the polymerase activity of Klenow fragment (3′ to 5′ exo minus). This step prepares the DNA fragments for ligation to the adapters, which have a single ‘T’ base overhang at the 3′ end. (6) Ligation of Adapters to DNA Fragments: The adapters are ligated to the ends of the DNA fragments, preparing DNA fragments to be hybridized to a flow cell. 
(7) Purification of Ligation Products: The products of the ligation reaction are purified on a 2% TAE-agarose gel to remove all unligated adapters and any adapters that have ligated to one another, and selected in a size-range of templates (for example 200±25 bp) for downstream enrichment. (8) Enrichment of the Adapter-cDNA Fragments by PCR: The cDNA fragments with adapters on both ends are amplified for 15 cycles of PCR, with two primers complementary to the ends of the adapters. (9) Validate the Library: Check the size, purity, and concentration of the library constructs using standard methods.

Template hybridization, cluster amplification, linearization, blocking, and primer hybridization: (1) An example of a sequencing apparatus is the cBot™ fluidics device, which hybridizes samples onto a flow cell and amplifies them for later sequencing on the HiSeq 2000/1000™. The instrument uses solid support amplification to create an ultra-high density sequencing flow cell with millions of clusters, each containing 1,000 copies of template. Prior to sequencing, an initial in vitro clonal amplification step is used. The bridge PCR for cluster generation is achieved via the following steps: (2) Denaturation of the library construct and hybridization of the template DNA: The original dsDNA construct, which already has adapters ligated thereto, is denatured and converted to ssDNA, and is ready to hybridize via the homology of the adapters to a complementary adapter primer on a grafted flow cell. (3) Amplification of template DNA: The template DNA is isothermally amplified to form a surface-bound colony (the clonal DNA cluster). (4) Linearization: dsDNA clusters are linearized as the first step of converting dsDNA to ssDNA that is suitable for sequencing. (5) Blocking: The free 3′ OH ends of the linearized dsDNA clusters are blocked. This step prevents nonspecific sites from being sequenced. (6) Denaturation and hybridization of sequencing primers: The dsDNA clusters are denatured, and a sequencing primer, or multiple sequencing primers, is hybridized onto the linearized and blocked clusters. After this step, the flow cell is ready for sequencing.

Reference Sequences, Genomes, and Databases.

A computer program product and a method of the invention can analyze sequencing reads drawn from a database. A database can provide DNA, RNA, or protein sequencing data. A database comprising sequencing information can be a public database or a private database. A database can be specialized in a particular type of digital sequencing read data, or can be a central database comprising a plurality of distinct data formats. Non-limiting examples of databases include the NCBI completely sequenced genomes database, the Saccharomyces Genome Database, FlyBase, WormBase, the Arabidopsis Information Resource, the Zebrafish Information Network, Swiss-Prot, TrEMBL, the Protein Information Resource, UniProt, the 1000 Genomes Project database, and a database of the International HapMap Project.

A reference genome can be assembled as a representative reference sequence of a gene, a set of genes, or of the genome of a creature. A creature can belong to any of the six kingdoms of biological taxonomy: Animalia, Plantae, Fungi, Protista, Archaea, and Bacteria. A creature can be, for example, a Human (H. sapiens), a Mouse (M. musculus), a Rat (R. norvegicus), a Fly (D. melanogaster), a Fly (D. pseudoobscura), a Fly (D. melanogaster, multiple strains), a Honey bee (A. mellifera), a Sea urchin (S. purpuratus), Bovine (B. taurus), a Rhesus monkey (M. mulatta), an Orangutan (P. pygmaeus), a Marmoset (C. jacchus), a Pea aphid (A. pisum), a Wasp (Nasonia spp.), a Beetle (T. castaneum), a Wallaby (M. eugenii), an Acorn worm (S. kowalevskii), a Hyrax (P. capensis), a Megabat (P. vampyrus), an Armadillo (D. novemcinctus), a Baboon (P. anubis), a Bumble bee (B. terrestris), a California mouse (P. californicus), a Centipede (S. maritima), a Cotton bollworm (H. armigera), a Deer mouse (P. maniculatus), a Gibbon, a Bottlenose dolphin (T. truncatus), a Dwarf honey bee (A. florea), Maize (Z. mays), or any other creature.

A digital sequence of a reference sequence can be contained in a database of reference sequences. A database of reference sequences can comprise a plurality of genomes and/or proteomes, sequenced with a plurality of methods. A reference genome can be a haploid, a diploid, or a polyploid genome. In some embodiments, the reference genome is a human genome. In some embodiments, the reference genome is an annotated human genome. In some embodiments, a system, a method, an algorithm, and/or a computer program product of the invention can identify a mutation and/or a variant in a genomic sample based on a comparison of an experimentally-generated sequencing read with a reference genome.
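The comparison of an experimentally-generated sequencing read with a reference genome can be illustrated with a minimal Python sketch. The sketch assumes a gapless alignment in which the read has already been placed at a known 0-based offset on the reference; the function name and the toy sequences are hypothetical and are not part of the described computer program product.

```python
def find_mismatches(reference, read, offset):
    """List positions where an aligned read differs from the reference.

    Assumes a gapless alignment with the read placed at a 0-based
    `offset` on the reference. Returns a list of tuples of the form
    (reference_position, reference_base, read_base).
    """
    if offset < 0 or offset + len(read) > len(reference):
        raise ValueError("read does not fit within the reference at this offset")
    mismatches = []
    for i, read_base in enumerate(read):
        ref_base = reference[offset + i]
        if read_base != ref_base:
            mismatches.append((offset + i, ref_base, read_base))
    return mismatches


if __name__ == "__main__":
    # Toy sequences for illustration only.
    print(find_mismatches("ACGTACGTAC", "TACCTA", 3))
```

Each reported mismatch is only a candidate variant; distinguishing a polymorphism from a sequencing error requires additional evidence, such as further reads covering the same position.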

A reference sequence can be a digital nucleic acid or amino acid sequence. A reference genome can be a reference sequence. A reference genome can comprise digital annotations of a known allele, even if the known allele is not present in either the physical or the digital version of the nucleic acid sequence. For example, humans have the A, B, AB, or O blood type as designated in the ABO blood system. A reference sequence containing only an O allele can be annotated with alleles corresponding to the A, B, and AB gene sequences.

Polymorphisms.

A large number of variants have been identified in sequenced genomes, including the human genome. Variants can be of functional significance, for example, leading to a missense mutation or a nonsense mutation in a protein coding gene. Variants can be of no functional significance, such as silent mutations. Single nucleotide polymorphisms (SNPs) can be of functional significance.

In some embodiments, the invention provided herein can identify polymorphisms present in a genome. In some embodiments, the invention can associate an identified polymorphism with a condition. In some embodiments, polymorphisms identified in a genome as compared to a reference genome represent from about 0.1% to about 0.05%, from about 0.1% to about 0.01%, from about 0.05% to about 0.01%, from about 0.05% to about 0.005%, from about 0.01% to about 0.005%, or from about 0.005% to about 0.0001% of a sequenced genome. In some embodiments, the invention identifies a mutation in a sample based on a comparison with a reference sequence.

TABLE 3 illustrates the physical organization and the number of polymorphisms reported within a human genome by chromosome.

TABLE 3

Chromosome | Length (mm) | Base Pairs | No. of Reported Polymorphisms | Protein coding genes | Pseudogenes | Micro RNA | Ribosomal RNA | Small nuclear RNA
1 | 85 | 249,250,621 | 4,401,091 | 2,012 | 1,130 | 134 | 66 | 221
2 | 83 | 243,199,373 | 4,607,702 | 1,203 | 948 | 115 | 40 | 161
3 | 67 | 198,022,430 | 3,894,345 | 1,040 | 719 | 99 | 29 | 138
4 | 65 | 191,154,276 | 3,673,892 | 718 | 698 | 92 | 24 | 120
5 | 62 | 180,915,260 | 3,436,667 | 849 | 676 | 83 | 25 | 106
6 | 58 | 171,115,067 | 3,360,890 | 1,002 | 731 | 81 | 26 | 111
7 | 54 | 159,138,663 | 3,045,992 | 866 | 803 | 90 | 24 | 90
8 | 50 | 146,364,022 | 2,890,692 | 659 | 568 | 80 | 28 | 86
9 | 48 | 141,213,431 | 2,581,827 | 785 | 714 | 69 | 19 | 66
10 | 46 | 135,534,747 | 2,609,802 | 745 | 500 | 64 | 32 | 87
11 | 46 | 135,006,516 | 2,607,254 | 1,258 | 775 | 63 | 24 | 74
12 | 45 | 133,851,895 | 2,482,194 | 1,003 | 582 | 72 | 27 | 106
13 | 39 | 115,169,878 | 1,814,242 | 318 | 323 | 42 | 16 | 45
14 | 36 | 107,349,540 | 1,712,799 | 601 | 472 | 92 | 10 | 65
15 | 35 | 102,531,392 | 1,577,346 | 562 | 473 | 78 | 13 | 63
16 | 31 | 90,354,753 | 1,747,136 | 805 | 429 | 52 | 32 | 53
17 | 28 | 81,195,210 | 1,491,841 | 1,158 | 300 | 61 | 15 | 80
18 | 27 | 78,077,248 | 1,448,602 | 268 | 59 | 32 | 13 | 51
19 | 20 | 59,128,983 | 1,171,356 | 1,399 | 181 | 110 | 13 | 29
20 | 21 | 63,025,520 | 1,206,753 | 533 | 213 | 57 | 15 | 46
21 | 16 | 48,129,895 | 787,784 | 225 | 150 | 16 | 5 | 21
22 | 17 | 51,304,566 | 745,778 | 431 | 308 | 31 | 5 | 23
X | 53 | 155,270,560 | 2,174,952 | 815 | 780 | 128 | 22 | 85
Y | 20 | 59,373,566 | 286,812 | 45 | 327 | 15 | 7 | 17
Mt. DNA | 0.0054 | 16,569 | 929 | 13 | 0 | 0 | 2 | 0
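The counts in TABLE 3 can be normalized by chromosome size to compare chromosomes of different lengths. The following minimal Python sketch computes reported polymorphisms per kilobase for three rows copied from TABLE 3; the function and variable names are illustrative assumptions, and the per-kilobase density is a derived quantity that TABLE 3 itself does not report.

```python
# (chromosome, base_pairs, reported_polymorphisms), copied from TABLE 3.
TABLE_3_ROWS = [
    ("1", 249_250_621, 4_401_091),
    ("21", 48_129_895, 787_784),
    ("Y", 59_373_566, 286_812),
]


def polymorphisms_per_kb(base_pairs, polymorphisms):
    """Reported polymorphisms per 1,000 base pairs of chromosome sequence."""
    if base_pairs <= 0:
        raise ValueError("base_pairs must be positive")
    return 1000.0 * polymorphisms / base_pairs


if __name__ == "__main__":
    for name, base_pairs, polymorphisms in TABLE_3_ROWS:
        density = polymorphisms_per_kb(base_pairs, polymorphisms)
        print(f"chromosome {name}: {density:.2f} reported polymorphisms per kb")
```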

Therapies.

Integrating inherited genetic risk information into the clinical decision-making process early in life can have an important effect in ameliorating or even preventing disease symptoms or conditions. The computer program products and methods of the invention address pressing challenges in personalized medicine. By providing a streamlined and rapid method for effective interpretation and characterization of data generated in sequencing experiments, the invention can solve challenges in conveying meaningful interpretations of a genomic or proteomic sequence to physicians. The invention can facilitate characterization of various genetic conditions based on sequencing data.

Subjects can be afflicted by one or a plurality of conditions, or subjects can be healthy. Subjects can be of any age, including, for example, elderly adults, adults, adolescents, pre-adolescents, children, toddlers, and infants. Non-limiting examples of subjects include humans, other primates, and mammals such as dogs, cats, horses, pigs, rabbits, rats, and mice.

In some embodiments, the invention can identify a condition based on an analysis of sequencing data associated with a sample.

A computer-program product and a method of the invention can identify genetic signatures associated with various conditions. Non-limiting examples of conditions include cystic fibrosis, Duchenne muscular dystrophy, Haemochromatosis, Tay-Sachs disease, Prader-Willi syndrome, Angelman syndrome, neurofibromatosis, phenylketonuria, Canavan disease, Coeliac disease, Acid beta-glucosidase deficiency, Gaucher, Charcot-Marie-Tooth disease, color blindness, Cri du chat, polycystic kidney disease, acrocephaly, familial adenomatous polyposis, adrenal gland disorders, amyotrophic lateral sclerosis (ALS), Alzheimer's disease, Parkinson's disease, anemia, ataxia, ataxia telangiectasia, autism, bone marrow diseases, Bonnevie-Ullrich syndrome, brain diseases, von Hippel-Lindau disease, congenital heart disease, Crohn's disease, dementia, myotonic dystrophy, Fabry disease, fragile X syndrome, galactosemia, genetic emphysema, retinoblastoma, Pendred syndrome, Usher syndrome, Wilson disease, neuropathies, Huntington's disease, immune system disorders, gout, X-linked spinal-bulbar muscle atrophy, learning disabilities, Li-Fraumeni syndrome, lipase D deficiency, Lou Gehrig disease, Marfan syndrome, metabolic disorders, Niemann-Pick, Noonan syndrome, Osteopsathyrosis, Peutz-Jeghers syndrome, Pfeiffer syndrome, porphyria, progeria, Rett syndrome, tuberous sclerosis, speech and communication disorders, spinal muscular atrophy, Treacher Collins syndrome, trisomies, and monosomies.

A computer-program product, a method, an algorithm, or a system herein can identify variants present in various cancers. Non-limiting examples of cancers include: acute lymphoblastic leukemia, acute myeloid leukemia, adrenocortical carcinoma, AIDS-related cancers, AIDS-related lymphoma, anal cancer, appendix cancer, astrocytomas, basal cell carcinoma, bile duct cancer, bladder cancer, bone cancers, brain tumors, such as cerebellar astrocytoma, cerebral astrocytoma/malignant glioma, ependymoma, medulloblastoma, supratentorial primitive neuroectodermal tumors, visual pathway and hypothalamic glioma, breast cancer, bronchial adenomas, Burkitt lymphoma, carcinoma of unknown primary origin, central nervous system lymphoma, cerebellar astrocytoma, cervical cancer, childhood cancers, chronic lymphocytic leukemia, chronic myelogenous leukemia, chronic myeloproliferative disorders, colon cancer, cutaneous T-cell lymphoma, desmoplastic small round cell tumor, endometrial cancer, ependymoma, esophageal cancer, Ewing's sarcoma, germ cell tumors, gallbladder cancer, gastric cancer, gastrointestinal carcinoid tumor, gastrointestinal stromal tumor, gliomas, hairy cell leukemia, head and neck cancer, heart cancer, hepatocellular (liver) cancer, Hodgkin lymphoma, hypopharyngeal cancer, intraocular melanoma, islet cell carcinoma, Kaposi sarcoma, kidney cancer, laryngeal cancer, lip and oral cavity cancer, liposarcoma, liver cancer, lung cancers, such as non-small cell and small cell lung cancer, lymphomas, leukemias, macroglobulinemia, malignant fibrous histiocytoma of bone/osteosarcoma, medulloblastoma, melanomas, mesothelioma, metastatic squamous neck cancer with occult primary, mouth cancer, multiple endocrine neoplasia syndrome, myelodysplastic syndromes, myeloid leukemia, nasal cavity and paranasal sinus cancer, nasopharyngeal carcinoma, neuroblastoma, non-Hodgkin lymphoma, non-small cell lung cancer, oral cancer, oropharyngeal cancer, osteosarcoma/malignant fibrous histiocytoma of bone, ovarian cancer, ovarian epithelial cancer, ovarian germ cell tumor, pancreatic cancer, pancreatic cancer islet cell, paranasal sinus and nasal cavity cancer, parathyroid cancer, penile cancer, pharyngeal cancer, pheochromocytoma, pineal astrocytoma, pineal germinoma, pituitary adenoma, pleuropulmonary blastoma, plasma cell neoplasia, primary central nervous system lymphoma, prostate cancer, rectal cancer, renal cell carcinoma, renal pelvis and ureter transitional cell cancer, retinoblastoma, rhabdomyosarcoma, salivary gland cancer, sarcomas, skin cancers, skin carcinoma merkel cell, small intestine cancer, soft tissue sarcoma, squamous cell carcinoma, stomach cancer, T-cell lymphoma, throat cancer, thymoma, thymic carcinoma, thyroid cancer, trophoblastic tumor (gestational), cancers of unknown primary site, urethral cancer, uterine sarcoma, vaginal cancer, vulvar cancer, Waldenström macroglobulinemia, and Wilms tumor.

The invention can further comprise recommending or providing a therapy to a subject based on an analysis of sequencing data. A therapy can be a medicinal intervention, such as a chemotherapy, or a medical procedure, such as a surgery. For example, an analysis of sequencing data can identify a polymorphism known to contribute to cystic fibrosis in a sample. Non-limiting examples of therapeutic interventions that can be provided to a subject afflicted by cystic fibrosis include antibiotics, anti-inflammatories, enzyme replacement, mucolytics, airway clearance, postural drainage, wearing an inflatable therapy vest, using a flutter device, diet regimens, exercise regimens, and organ transplants.

The therapy provided to the subject can be directed towards a genetic component of an identified condition. For example, mutations in the Abelson tyrosine kinase (ABL) gene can lead to Chronic Myeloid Leukemia (CML), and single nucleotide polymorphisms can render subjects resistant to Gleevec. An analysis of sequencing data associated with a sample can identify a polymorphism associated with CML, and a physician can use an output of the analysis to determine that the subject is potentially non-responsive to Gleevec.

Data Processing and Error Rates.

Non-limiting examples of real-time analysis time-frames for a comparison of sequencing reads include: less than 1 second, less than 2 seconds, less than 3 seconds, less than 4 seconds, less than 5 seconds, less than 10 seconds, less than 15 seconds, less than 30 seconds, less than 45 seconds, less than 60 seconds; less than 2 minutes; less than 3 minutes; less than 4 minutes; less than 5 minutes; less than 6 minutes; less than 7 minutes; less than 8 minutes; less than 9 minutes; less than 10 minutes; less than 11 minutes; less than 12 minutes; less than 13 minutes; less than 14 minutes; less than 15 minutes; less than 16 minutes; less than 17 minutes; less than 18 minutes; less than 19 minutes; less than 20 minutes; less than 21 minutes; less than 22 minutes; less than 23 minutes; less than 24 minutes; less than 25 minutes; less than 26 minutes; less than 27 minutes; less than 28 minutes; less than 29 minutes; and less than 30 minutes.

Non-limiting examples of real-time analysis time-frames of complete sequencing experiments include less than 15 minutes; from about 10 minutes to about 30 minutes; from about 10 minutes to about 1 hour; from about 10 minutes to about 2 hours; from about 10 minutes to about 3 hours; from about 10 minutes to about 4 hours; from about 10 minutes to about 5 hours; from about 10 minutes to about 6 hours; from about 10 minutes to about 7 hours; from about 10 minutes to about 8 hours; from about 10 minutes to about 9 hours; from about 10 minutes to about 10 hours; from about 1 hour to about 2 hours; from about 1 hour to about 3 hours; from about 1 hour to about 4 hours; from about 1 hour to about 5 hours; from about 1 hour to about 6 hours; from about 1 hour to about 7 hours; from about 1 hour to about 8 hours; from about 1 hour to about 9 hours; from about 1 hour to about 10 hours; from about 2 hours to about 3 hours; from about 2 hours to about 4 hours; from about 2 hours to about 5 hours; from about 2 hours to about 6 hours; from about 2 hours to about 7 hours; from about 2 hours to about 8 hours; from about 2 hours to about 9 hours; from about 2 hours to about 10 hours; from about 3 hours to about 4 hours; from about 3 hours to about 5 hours; from about 3 hours to about 6 hours; from about 3 hours to about 7 hours; from about 3 hours to about 8 hours; from about 3 hours to about 9 hours; from about 3 hours to about 10 hours; from about 4 hours to about 5 hours; from about 4 hours to about 6 hours; from about 4 hours to about 7 hours; from about 4 hours to about 8 hours; from about 4 hours to about 9 hours; from about 4 hours to about 10 hours; from about 5 hours to about 6 hours; from about 5 hours to about 7 hours; from about 5 hours to about 8 hours; from about 5 hours to about 9 hours; from about 5 hours to about 10 hours; from about 6 hours to about 7 hours; from about 6 hours to about 8 hours; from about 6 hours to about 9 hours; from about 6 hours to about 10 
hours; from about 7 hours to about 8 hours; from about 7 hours to about 9 hours; from about 7 hours to about 10 hours; from about 8 hours to about 9 hours; from about 8 hours to about 10 hours; from about 9 hours to about 10 hours; and about 10 hours.

A method of identifying sequencing errors can comprise comparing a first sequencing read with a number of additional sequencing reads. The additional sequencing reads can comprise a common barcode or share a level of homology with the first sequencing read or with one another. In some embodiments, a confidence level of determining that a variant in a sequencing assay is an error increases as the number of additional sequencing reads compared to the first sequencing read increases.
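One way to picture the comparison of a first sequencing read with additional reads that share a common barcode is a per-position majority vote, sketched below in Python. This is an illustrative assumption rather than the algorithm of the invention: the function names, the min_reads threshold, and the requirement that reads within a barcode group have equal length and are already aligned are all hypothetical simplifications.

```python
from collections import Counter, defaultdict


def group_by_barcode(reads):
    """Group (barcode, sequence) pairs into {barcode: [sequence, ...]}."""
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)


def flag_candidate_errors(sequences, min_reads=3):
    """Flag bases that disagree with the majority base at each position.

    Assumes the reads share a barcode, are already aligned, and have
    equal length. Groups with fewer than `min_reads` reads are skipped
    because a majority vote over so few reads carries little confidence.
    Ties are resolved in favor of the base seen first at that position.
    Returns tuples of (read_index, position, observed_base, majority_base).
    """
    if len(sequences) < min_reads:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("reads in a barcode group must have equal length")
    flagged = []
    for pos in range(length):
        counts = Counter(s[pos] for s in sequences)
        majority_base, _ = counts.most_common(1)[0]
        for idx, s in enumerate(sequences):
            if s[pos] != majority_base:
                flagged.append((idx, pos, s[pos], majority_base))
    return flagged


if __name__ == "__main__":
    # Toy reads for illustration only.
    reads = [("BC1", "ACGT"), ("BC1", "ACGT"), ("BC1", "ACTT"), ("BC2", "GGGG")]
    for barcode, sequences in group_by_barcode(reads).items():
        print(barcode, flag_candidate_errors(sequences))
```

Raising min_reads reflects the statement above that confidence in calling an error increases with the number of additional reads compared.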

In some embodiments, a computer program product and an algorithm of the invention can detect, process, distinguish, and group at least 100 barcodes per sample; at least 500 barcodes per sample; at least 1,000 barcodes per sample; at least 2,000 barcodes per sample; at least 3,000 barcodes per sample; at least 4,000 barcodes per sample; at least 5,000 barcodes per sample; at least 6,000 barcodes per sample; at least 7,000 barcodes per sample; at least 8,000 barcodes per sample; at least 9,000 barcodes per sample; at least 10,000 barcodes per sample; at least 11,000 barcodes per sample; at least 12,000 barcodes per sample; at least 13,000 barcodes per sample; at least 14,000 barcodes per sample; at least 15,000 barcodes per sample; at least 16,000 barcodes per sample; at least 17,000 barcodes per sample; at least 18,000 barcodes per sample; at least 19,000 barcodes per sample; at least 20,000 barcodes per sample; at least 21,000 barcodes per sample; at least 22,000 barcodes per sample; at least 23,000 barcodes per sample; at least 24,000 barcodes per sample; at least 25,000 barcodes per sample; at least 26,000 barcodes per sample; at least 27,000 barcodes per sample; at least 28,000 barcodes per sample; at least 29,000 barcodes per sample; at least 30,000 barcodes per sample; at least 31,000 barcodes per sample; at least 32,000 barcodes per sample; at least 33,000 barcodes per sample; at least 34,000 barcodes per sample; at least 35,000 barcodes per sample; at least 36,000 barcodes per sample; at least 37,000 barcodes per sample; at least 38,000 barcodes per sample; at least 39,000 barcodes per sample; at least 40,000 barcodes per sample; at least 41,000 barcodes per sample; at least 42,000 barcodes per sample; at least 43,000 barcodes per sample; at least 44,000 barcodes per sample; at least 45,000 barcodes per sample; at least 46,000 barcodes per sample; at least 47,000 barcodes per sample; at least 48,000 barcodes per sample; at least 49,000 barcodes per sample; 
at least 50,000 barcodes per sample; at least 51,000 barcodes per sample; at least 52,000 barcodes per sample; at least 53,000 barcodes per sample; at least 54,000 barcodes per sample; at least 55,000 barcodes per sample; at least 56,000 barcodes per sample; at least 57,000 barcodes per sample; at least 58,000 barcodes per sample; at least 59,000 barcodes per sample; at least 60,000 barcodes per sample; at least 61,000 barcodes per sample; at least 62,000 barcodes per sample; at least 63,000 barcodes per sample; at least 64,000 barcodes per sample; at least 65,000 barcodes per sample; at least 66,000 barcodes per sample; at least 67,000 barcodes per sample; at least 68,000 barcodes per sample; at least 69,000 barcodes per sample; at least 70,000 barcodes per sample; at least 71,000 barcodes per sample; at least 72,000 barcodes per sample; at least 73,000 barcodes per sample; at least 74,000 barcodes per sample; at least 75,000 barcodes per sample; at least 76,000 barcodes per sample; at least 77,000 barcodes per sample; at least 78,000 barcodes per sample; at least 79,000 barcodes per sample; at least 80,000 barcodes per sample; at least 81,000 barcodes per sample; at least 82,000 barcodes per sample; at least 83,000 barcodes per sample; at least 84,000 barcodes per sample; at least 85,000 barcodes per sample; at least 86,000 barcodes per sample; at least 87,000 barcodes per sample; at least 88,000 barcodes per sample; at least 89,000 barcodes per sample; at least 90,000 barcodes per sample; at least 91,000 barcodes per sample; at least 92,000 barcodes per sample; at least 93,000 barcodes per sample; at least 94,000 barcodes per sample; at least 95,000 barcodes per sample; at least 96,000 barcodes per sample; at least 97,000 barcodes per sample; at least 98,000 barcodes per sample; at least 99,000 barcodes per sample; or at least 100,000 barcodes per sample.

In some embodiments, a computer program product and an algorithm of the invention can detect, process, distinguish, group, and/or separate no greater than 100 barcodes per sample; no greater than 500 barcodes per sample; no greater than 1,000 barcodes per sample; no greater than 2,000 barcodes per sample; no greater than 3,000 barcodes per sample; no greater than 4,000 barcodes per sample; no greater than 5,000 barcodes per sample; no greater than 6,000 barcodes per sample; no greater than 7,000 barcodes per sample; no greater than 8,000 barcodes per sample; no greater than 9,000 barcodes per sample; no greater than 10,000 barcodes per sample; no greater than 11,000 barcodes per sample; no greater than 12,000 barcodes per sample; no greater than 13,000 barcodes per sample; no greater than 14,000 barcodes per sample; no greater than 15,000 barcodes per sample; no greater than 16,000 barcodes per sample; no greater than 17,000 barcodes per sample; no greater than 18,000 barcodes per sample; no greater than 19,000 barcodes per sample; no greater than 20,000 barcodes per sample; no greater than 21,000 barcodes per sample; no greater than 22,000 barcodes per sample; no greater than 23,000 barcodes per sample; no greater than 24,000 barcodes per sample; no greater than 25,000 barcodes per sample; no greater than 26,000 barcodes per sample; no greater than 27,000 barcodes per sample; no greater than 28,000 barcodes per sample; no greater than 29,000 barcodes per sample; no greater than 30,000 barcodes per sample; no greater than 31,000 barcodes per sample; no greater than 32,000 barcodes per sample; no greater than 33,000 barcodes per sample; no greater than 34,000 barcodes per sample; no greater than 35,000 barcodes per sample; no greater than 36,000 barcodes per sample; no greater than 37,000 barcodes per sample; no greater than 38,000 barcodes per sample; no greater than 39,000 barcodes per sample; no greater than 40,000 barcodes per sample; no greater than 41,000 
barcodes per sample; no greater than 42,000 barcodes per sample; no greater than 43,000 barcodes per sample; no greater than 44,000 barcodes per sample; no greater than 45,000 barcodes per sample; no greater than 46,000 barcodes per sample; no greater than 47,000 barcodes per sample; no greater than 48,000 barcodes per sample; no greater than 49,000 barcodes per sample; no greater than 50,000 barcodes per sample; no greater than 51,000 barcodes per sample; no greater than 52,000 barcodes per sample; no greater than 53,000 barcodes per sample; no greater than 54,000 barcodes per sample; no greater than 55,000 barcodes per sample; no greater than 56,000 barcodes per sample; no greater than 57,000 barcodes per sample; no greater than 58,000 barcodes per sample; no greater than 59,000 barcodes per sample; no greater than 60,000 barcodes per sample; no greater than 61,000 barcodes per sample; no greater than 62,000 barcodes per sample; no greater than 63,000 barcodes per sample; no greater than 64,000 barcodes per sample; no greater than 65,000 barcodes per sample; no greater than 66,000 barcodes per sample; no greater than 67,000 barcodes per sample; no greater than 68,000 barcodes per sample; no greater than 69,000 barcodes per sample; no greater than 70,000 barcodes per sample; no greater than 71,000 barcodes per sample; no greater than 72,000 barcodes per sample; no greater than 73,000 barcodes per sample; no greater than 74,000 barcodes per sample; no greater than 75,000 barcodes per sample; no greater than 76,000 barcodes per sample; no greater than 77,000 barcodes per sample; no greater than 78,000 barcodes per sample; no greater than 79,000 barcodes per sample; no greater than 80,000 barcodes per sample; no greater than 81,000 barcodes per sample; no greater than 82,000 barcodes per sample; no greater than 83,000 barcodes per sample; no greater than 84,000 barcodes per sample; no greater than 85,000 barcodes per sample; no greater than 86,000 barcodes per 
sample; no greater than 87,000 barcodes per sample; no greater than 88,000 barcodes per sample; no greater than 89,000 barcodes per sample; no greater than 90,000 barcodes per sample; no greater than 91,000 barcodes per sample; no greater than 92,000 barcodes per sample; no greater than 93,000 barcodes per sample; no greater than 94,000 barcodes per sample; no greater than 95,000 barcodes per sample; no greater than 96,000 barcodes per sample; no greater than 97,000 barcodes per sample; no greater than 98,000 barcodes per sample; no greater than 99,000 barcodes per sample; or no greater than 100,000 barcodes per sample.

An identified error rate within a sequence read can be of about 1 nucleotide per sequencing read, about 2 nucleotides per sequencing read, about 3 nucleotides per sequencing read, about 4 nucleotides per sequencing read, about 5 nucleotides per sequencing read, about 6 nucleotides per sequencing read, about 7 nucleotides per sequencing read, about 8 nucleotides per sequencing read, about 9 nucleotides per sequencing read, about 10 nucleotides per sequencing read, about 11 nucleotides per sequencing read, about 12 nucleotides per sequencing read, about 13 nucleotides per sequencing read, about 14 nucleotides per sequencing read, about 15 nucleotides per sequencing read, about 16 nucleotides per sequencing read, about 17 nucleotides per sequencing read, about 18 nucleotides per sequencing read, about 19 nucleotides per sequencing read, about 20 nucleotides per sequencing read, about 21 nucleotides per sequencing read, about 22 nucleotides per sequencing read, about 23 nucleotides per sequencing read, about 24 nucleotides per sequencing read, about 25 nucleotides per sequencing read, about 26 nucleotides per sequencing read, about 27 nucleotides per sequencing read, about 28 nucleotides per sequencing read, about 29 nucleotides per sequencing read, about 30 nucleotides per sequencing read, about 31 nucleotides per sequencing read, about 32 nucleotides per sequencing read, about 33 nucleotides per sequencing read, about 34 nucleotides per sequencing read, about 35 nucleotides per sequencing read, about 36 nucleotides per sequencing read, about 37 nucleotides per sequencing read, about 38 nucleotides per sequencing read, about 39 nucleotides per sequencing read, or about 40 nucleotides per sequencing read.

An identified error rate within a sample can be about 0.00000000001%, about 0.00000000002%, about 0.00000000003%, about 0.00000000004%, about 0.00000000005%, about 0.00000000006%, about 0.00000000007%, about 0.00000000008%, about 0.00000000009%, about 0.0000000001%, about 0.0000000002%, about 0.0000000003%, about 0.0000000004%, about 0.0000000005%, about 0.0000000006%, about 0.0000000007%, about 0.0000000008%, about 0.0000000009%, about 0.000000001%, about 0.000000002%, about 0.000000003%, about 0.000000004%, about 0.000000005%, about 0.000000006%, about 0.000000007%, about 0.000000008%, about 0.000000009%, about 0.00000001%, about 0.00000002%, about 0.00000003%, about 0.00000004%, about 0.00000005%, about 0.00000006%, about 0.00000007%, about 0.00000008%, about 0.00000009%, about 0.0000001%, about 0.0000002%, about 0.0000003%, about 0.0000004%, about 0.0000005%, about 0.0000006%, about 0.0000007%, about 0.0000008%, about 0.0000009%, about 0.000001%, about 0.000002%, about 0.000003%, about 0.000004%, about 0.000005%, about 0.000006%, about 0.000007%, about 0.000008%, about 0.000009%, about 0.00001%, about 0.00002%, about 0.00003%, about 0.00004%, about 0.00005%, about 0.00006%, about 0.00007%, about 0.00008%, about 0.00009%, about 0.0001%, about 0.0002%, about 0.0003%, about 0.0004%, about 0.0005%, about 0.0006%, about 0.0007%, about 0.0008%, about 0.0009%, about 0.001%, about 0.002%, about 0.003%, about 0.004%, about 0.005%, about 0.006%, about 0.007%, about 0.008%, about 0.009%, about 0.01%, about 0.02%, about 0.03%, about 0.04%, about 0.05%, about 0.06%, about 0.07%, about 0.08%, about 0.09%, or about 0.1%.

Homology.

A sequencing read can be aligned to a reference sequence with a degree of homology. A first sequencing read can be aligned to a second sequencing read with a degree of homology. A portion (“window”) of a sequencing read can share a percent homology with a reference sequence or a second sequencing read. The entirety of a sequencing read can share a percent homology with a reference sequence or a second sequencing read. Modules of the invention can determine a statistical significance of an alignment based on a percent of shared homology between two sequences. A sequencing read can share structural and/or base-pair (“pairwise”) homology with a reference sequence or a second sequencing read. Homology can be measured over an entire sequence or a fragment of a sequence. Homology can be measured by a computer-executable algorithm, for example, the Basic Local Alignment Search Tool (BLAST).
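
By way of non-limiting illustration, the following Python sketch shows one way a percent pairwise match could be measured over a window of a sequencing read. It is not BLAST and is not asserted to be the method of any embodiment: it uses ungapped percent identity as a simple stand-in for percent homology, and the function names `percent_identity` and `best_window_identity` are hypothetical names chosen for this example.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of positions at which two equal-length sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def best_window_identity(read: str, reference: str, window: int):
    """Slide the first `window` residues of `read` along `reference`.

    Returns (offset, identity) for the ungapped placement with the highest
    percent identity. Assumption: no gaps are considered in this sketch.
    """
    if window <= 0 or window > len(read) or window > len(reference):
        raise ValueError("window must fit inside both sequences")
    probe = read[:window]
    best_offset, best_score = 0, -1.0
    for offset in range(len(reference) - window + 1):
        score = percent_identity(probe, reference[offset:offset + window])
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score
```

For example, `best_window_identity("GATTACA", "TTGATTACATT", 7)` places the read at offset 2 of the reference with 100% identity.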

A sequencing read can share about 1% pairwise homology, about 2% pairwise homology, about 3% pairwise homology, about 4% pairwise homology, about 5% pairwise homology, about 6% pairwise homology, about 7% pairwise homology, about 8% pairwise homology, about 9% pairwise homology, about 10% pairwise homology, about 11% pairwise homology, about 12% pairwise homology, about 13% pairwise homology, about 14% pairwise homology, about 15% pairwise homology, about 16% pairwise homology, about 17% pairwise homology, about 18% pairwise homology, about 19% pairwise homology, about 20% pairwise homology, about 21% pairwise homology, about 22% pairwise homology, about 23% pairwise homology, about 24% pairwise homology, about 25% pairwise homology, about 26% pairwise homology, about 27% pairwise homology, about 28% pairwise homology, about 29% pairwise homology, about 30% pairwise homology, about 31% pairwise homology, about 32% pairwise homology, about 33% pairwise homology, about 34% pairwise homology, about 35% pairwise homology, about 36% pairwise homology, about 37% pairwise homology, about 38% pairwise homology, about 39% pairwise homology, about 40% pairwise homology, about 41% pairwise homology, about 42% pairwise homology, about 43% pairwise homology, about 44% pairwise homology, about 45% pairwise homology, about 46% pairwise homology, about 47% pairwise homology, about 48% pairwise homology, about 49% pairwise homology, about 50% pairwise homology, about 51% pairwise homology, about 52% pairwise homology, about 53% pairwise homology, about 54% pairwise homology, about 55% pairwise homology, about 56% pairwise homology, about 57% pairwise homology, about 58% pairwise homology, about 59% pairwise homology, about 60% pairwise homology, about 61% pairwise homology, about 62% pairwise homology, about 63% pairwise homology, about 64% pairwise homology, about 65% pairwise homology, about 66% pairwise homology, about 67% pairwise homology, about 68% pairwise homology, about 69% pairwise homology, about 70% pairwise homology, about 71% pairwise homology, about 72% pairwise homology, about 73% pairwise homology, about 74% pairwise homology, about 75% pairwise homology, about 76% pairwise homology, about 77% pairwise homology, about 78% pairwise homology, about 79% pairwise homology, about 80% pairwise homology, about 81% pairwise homology, about 82% pairwise homology, about 83% pairwise homology, about 84% pairwise homology, about 85% pairwise homology, about 86% pairwise homology, about 87% pairwise homology, about 88% pairwise homology, about 89% pairwise homology, about 90% pairwise homology, about 91% pairwise homology, about 92% pairwise homology, about 93% pairwise homology, about 94% pairwise homology, about 95% pairwise homology, about 96% pairwise homology, about 97% pairwise homology, about 98% pairwise homology, about 99% pairwise homology, or about 100% pairwise homology with a reference sequence or a second sequencing read.

An algorithm can allow for gaps in an alignment. An alignment gap can be of about 1 residue, about 2 residues, about 3 residues, about 4 residues, about 5 residues, about 6 residues, about 7 residues, about 8 residues, about 9 residues, about 10 residues, about 11 residues, about 12 residues, about 13 residues, about 14 residues, about 15 residues, about 16 residues, about 17 residues, about 18 residues, about 19 residues, about 20 residues, about 21 residues, about 22 residues, about 23 residues, about 24 residues, about 25 residues, about 26 residues, about 27 residues, about 28 residues, about 29 residues, about 30 residues, about 31 residues, about 32 residues, about 33 residues, about 34 residues, about 35 residues, about 36 residues, about 37 residues, about 38 residues, about 39 residues, about 40 residues, about 41 residues, about 42 residues, about 43 residues, about 44 residues, about 45 residues, about 46 residues, about 47 residues, about 48 residues, about 49 residues, or about 50 residues. In some embodiments, an alignment gap can be more than 50 residues.
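
By way of non-limiting illustration, the following Python sketch shows one well-known way an algorithm can allow for gaps in an alignment: Needleman-Wunsch global alignment with a linear gap penalty. The scoring values (`match`, `mismatch`, `gap`) and the helper `longest_gap` are assumptions made for this example only and are not taken from any embodiment.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b); '-' marks a gap residue.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming table.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align two residues
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


def longest_gap(aligned: str) -> int:
    """Length, in residues, of the longest run of gap characters in an aligned sequence."""
    longest = run = 0
    for ch in aligned:
        run = run + 1 if ch == "-" else 0
        longest = max(longest, run)
    return longest
```

In this sketch a caller could reject an alignment whose `longest_gap` exceeds a chosen limit, for example 50 residues.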

Computer Architectures.

Sequencing data can be analyzed by a plurality of computers, with various computer architectures. Various computer architectures are suitable for use with the invention. FIG. 18 is a block diagram illustrating a first example architecture of a computer system 1800 that can be used in connection with example embodiments of the present invention. As depicted in FIG. 18, the example computer system can include a processor 1802 for processing instructions. Non-limiting examples of processors include: Intel Core i7™ processor, Intel Core i5™ processor, Intel Core i3™ processor, Intel Xeon™ processor, AMD Opteron™ processor, Samsung 32-bit RISC ARM 1176JZ(F)-S v1.0™ processor, ARM Cortex-A8 Samsung S5PC100™ processor, ARM Cortex-A8 Apple A4™ processor, Marvell PXA 930™ processor, or a functionally-equivalent processor. Multiple threads of execution can be used for parallel processing. In some embodiments, multiple processors or processors with multiple cores can be used, whether in a single computer system, in a cluster, or distributed across systems over a network comprising a plurality of computers, cell phones, and/or personal data assistant devices.
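
By way of non-limiting illustration, the following Python sketch shows multiple threads of execution being used to compare several sequencing reads against a reference sequence. The function names are hypothetical, and the mismatch count is only a placeholder for whatever per-read analysis an embodiment performs. In the standard CPython interpreter, threads do not speed up CPU-bound work of this kind; `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface across multiple processes, provided the worker is a module-level function rather than a lambda.

```python
from concurrent.futures import ThreadPoolExecutor


def count_mismatches(read: str, reference: str) -> int:
    """Count positions at which the read differs from the start of the reference."""
    return sum(1 for x, y in zip(read, reference) if x != y)


def compare_reads_in_parallel(reads, reference, workers: int = 4):
    """Compare each read against the reference on a pool of worker threads.

    Results are returned in the same order as the input reads.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda read: count_mismatches(read, reference), reads))
```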

As illustrated in FIG. 18, a high speed cache 1801 can be connected to, or incorporated in, the processor 1802 to provide a high speed memory for instructions or data that have been recently, or are frequently, used by processor 1802. The processor 1802 is connected to a north bridge 1806 by a processor bus 1805. The north bridge 1806 is connected to random access memory (RAM) 1803 by a memory bus 1804 and manages access to the RAM 1803 by the processor 1802. The north bridge 1806 is also connected to a south bridge 1808 by a chipset bus 1807. The south bridge 1808 is, in turn, connected to a peripheral bus 1809. The peripheral bus can be, for example, PCI, PCI-X, PCI Express, or other peripheral bus. The north bridge and south bridge are often referred to as a processor chipset and manage data transfer between the processor, RAM, and peripheral components on the peripheral bus 1809. In some architectures, the functionality of the north bridge can be incorporated into the processor instead of using a separate north bridge chip.

In some embodiments, system 1800 can include an accelerator card 1812 attached to the peripheral bus 1809. The accelerator can include field programmable gate arrays (FPGAs) or other hardware for accelerating certain processing.

Software and data are stored in external storage 1813 and can be loaded into RAM 1803 and/or cache 1801 for use by the processor. The system 1800 includes an operating system for managing system resources; non-limiting examples of operating systems include: Linux, Windows™, MACOS™, BlackBerry OS™, iOS™, and other functionally-equivalent operating systems, as well as application software running on top of the operating system.

In this example, system 1800 also includes network interface cards (NICs) 1810 and 1811 connected to the peripheral bus for providing network interfaces to external storage, such as Network Attached Storage (NAS) and other computer systems that can be used for distributed parallel processing.

FIG. 19 is a diagram showing a network 1900 with a plurality of computer systems 1902a and 1902b, a plurality of cell phones and personal data assistants 1902c, and Network Attached Storage (NAS) 1901a and 1901b. In some embodiments, systems 1902a, 1902b, and 1902c can manage data storage and optimize data access for data stored in Network Attached Storage (NAS) 1901a and 1901b. A mathematical model can be used for the data and be evaluated using distributed parallel processing across computer systems 1902a and 1902b, and cell phone and personal data assistant systems 1902c. Computer systems 1902a and 1902b, and cell phone and personal data assistant systems 1902c can also provide parallel processing for adaptive data restructuring of the data stored in Network Attached Storage (NAS) 1901a and 1901b. FIG. 19 illustrates an example only, and a wide variety of other computer architectures and systems can be used in conjunction with the various embodiments of the present invention. For example, a blade server can be used to provide parallel processing. Processor blades can be connected through a back plane to provide parallel processing. Storage can also be connected to the back plane or as Network Attached Storage (NAS) through a separate network interface. In some embodiments, processors can maintain separate memory spaces and transmit data through network interfaces, back plane, or other connectors for parallel processing by other processors. In some embodiments, some or all of the processors can use a shared virtual address memory space.

FIG. 20 is a block diagram of a multiprocessor computer system using a shared virtual address memory space. The system includes a plurality of processors 2001a-f that can access a shared memory subsystem 2002. The system incorporates a plurality of programmable hardware memory algorithm processors (MAPs) 2003a-f in the memory subsystem 2002. Each MAP 2003a-f can comprise a memory 2004a-f and one or more field programmable gate arrays (FPGAs) 2005a-f. The MAP provides a configurable functional unit and particular algorithms or portions of algorithms can be provided to the FPGAs 2005a-f for processing in close coordination with a respective processor. In this example, each MAP is globally accessible by all of the processors for these purposes. In one configuration, each MAP can use Direct Memory Access (DMA) to access an associated memory 2004a-f, allowing it to execute tasks independently of, and asynchronously from, the respective microprocessor 2001a-f. In this configuration, a MAP can feed results directly to another MAP for pipelining and parallel execution of algorithms.

The above computer architectures and systems are examples only, and a wide variety of other computer, cell phone, and personal data assistant architectures and systems can be used in connection with example embodiments, including systems using any combination of general processors, co-processors, FPGAs and other programmable logic devices, system on chips (SOCs), application specific integrated circuits (ASICs), and other processing and logic elements. Any variety of data storage media can be used in connection with example embodiments, including random access memory, hard drives, flash memory, tape drives, disk arrays, Network Attached Storage (NAS) and other local or distributed data storage devices and systems.

In example embodiments, the computer system can be implemented using software modules executing on any of the above or other computer architectures and systems. In other embodiments, the functions of the system can be implemented partially or completely in firmware, programmable logic devices such as field programmable gate arrays (FPGAs) as referenced in FIG. 20, system on chips (SOCs), application specific integrated circuits (ASICs), or other processing and logic elements. For example, the Set Processor and Optimizer can be implemented with hardware acceleration through the use of a hardware accelerator card, such as accelerator card 1812 illustrated in FIG. 18.

EMBODIMENTS

In some embodiments, the invention provides a method of analyzing sequencing data associated with a sample, the method comprising: a) receiving by a computer system a first sequencing read associated with a sequencing assay while the sequencing assay is in progress, wherein the computer system comprises a processor; and b) comparing by the processor the first sequencing read with another sequence to provide a comparison before the sequencing assay is complete. In some embodiments, the other sequence is a second sequencing read. Some embodiments further comprise determining whether the first sequencing read and the other sequence share a common subsequence. Some embodiments further comprise determining whether the first sequencing read and the other sequence correspond to a common biomolecule. Some embodiments further comprise aligning the first sequencing read and the other sequence. Some embodiments further comprise determining whether the first sequencing read and the other sequence share a common barcode. In some embodiments, the other sequence is a reference sequence. Some embodiments further comprise identifying a mutation in the sample based on the comparison. Some embodiments further comprise identifying a single nucleotide polymorphism in the sample based on the comparison. Some embodiments further comprise identifying a translocation in the sample based on the comparison. Some embodiments further comprise determining a candidate sequence, wherein the candidate sequence incorporates a portion of the first sequencing read. In some embodiments, determining the candidate sequence is performed at a rate, wherein the computer system receives a number of sequencing reads, wherein the rate increases as the number of sequencing reads increases. Some embodiments further comprise determining to which haplotype in a pair of haplotypes the first sequencing read corresponds. In some embodiments, the sample is associated with a plurality of distinct barcodes. 
In some embodiments, the sequencing data is DNA sequencing data. In some embodiments, the sequencing data is RNA sequencing data. In some embodiments, the sequencing data is genomic sequencing data. In some embodiments, the sequencing data is peptide sequencing data. Some embodiments further comprise outputting a result of the comparing. Some embodiments further comprise providing a therapeutic intervention to a subject based on a result of analyzing the sequencing data. In some embodiments, the first sequencing read is a partial sequencing read. In some embodiments, the first sequencing read is a complete sequencing read.
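
By way of non-limiting illustration, the following Python sketch shows sequencing reads being compared with a reference sequence as each read is received, so that a comparison is available before the stream of reads ends. The generator `simulated_instrument` is a hypothetical stand-in for a sequencing assay in progress, and the ungapped mismatch search is an assumption made for brevity rather than the comparison used by any embodiment.

```python
from typing import Iterable, Iterator, NamedTuple


class Comparison(NamedTuple):
    read: str
    offset: int       # best ungapped placement on the reference, or -1 if none fits
    mismatches: int   # mismatches at that placement


def compare_stream(reads: Iterable[str], reference: str) -> Iterator[Comparison]:
    """Yield one comparison per read as soon as that read arrives."""
    for read in reads:
        best_offset, best_mismatches = -1, len(read) + 1
        for offset in range(len(reference) - len(read) + 1):
            window = reference[offset:offset + len(read)]
            mismatches = sum(1 for x, y in zip(read, window) if x != y)
            if mismatches < best_mismatches:
                best_offset, best_mismatches = offset, mismatches
        yield Comparison(read, best_offset, best_mismatches)


# Hypothetical demonstration: the event log records that each read is compared
# before the next read is emitted, that is, while the "assay" is still running.
events = []


def simulated_instrument():
    for read in ["ACGT", "GTTA"]:
        events.append(f"emitted {read}")
        yield read


results = []
for result in compare_stream(simulated_instrument(), "TTACGTTAGG"):
    events.append(f"compared {result.read}")
    results.append(result)
```

Because `compare_stream` is a generator, it never needs the complete set of reads in memory, which is the property the streaming embodiments rely on.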

In some embodiments, the invention provides a method of analyzing sequencing data associated with a sample, the method comprising: a) receiving by a computer system a set of sequencing reads, each sequencing read independently comprising a common barcode, wherein the computer system comprises a processor; and b) aligning by the processor each sequencing read of the set of sequencing reads against a reference sequence, wherein the sample is associated with a plurality of distinct barcodes. Some embodiments further comprise receiving an additional sequencing read that does not comprise the common barcode, and filtering out the additional sequencing read that does not comprise the common barcode. Some embodiments further comprise determining a candidate sequence based on the alignment of the sequencing reads against the reference sequence. Some embodiments further comprise identifying neighboring sequencing reads in the alignment, and determining whether the neighboring sequencing reads overlap at a non-barcode subsequence. Some embodiments further comprise determining whether the sequencing reads that comprise the common barcode correspond to a common haplotype. In some embodiments, the determining whether the sequencing reads that comprise the common barcode correspond to the common haplotype comprises identifying a non-barcode subsequence common to a plurality of the sequencing reads that comprise the common barcode, and determining whether the sequencing reads of the plurality share a common single nucleotide polymorphism at the non-barcode common subsequence. Some embodiments further comprise determining a candidate sequence corresponding to one haplotype of a pair of haplotypes, wherein the candidate sequence is based on a plurality of sequencing reads, each of which comprise the common barcode, each of which share a common single nucleotide polymorphism at a common position. 
Some embodiments further comprise receiving a plurality of additional sequencing reads, wherein each additional sequencing read independently comprises a distinct barcode. Some embodiments further comprise aligning each additional sequencing read with the reference sequence. Some embodiments further comprise: c) receiving a plurality of additional sequencing reads, wherein each additional sequencing read independently comprises a distinct barcode; d) aligning each additional sequencing read with the reference sequence; e) selecting a window of the reference sequence; and f) determining a total of distinct barcodes associated with the additional sequencing reads that correspond to the window of the reference sequence. Some embodiments further comprise estimating a likelihood that the sample comprises a mutation. Some embodiments further comprise determining which haplotype in a pair of haplotypes possesses the mutation. In some embodiments, the mutation is a copy number variant. Some embodiments further comprise estimating the number of copies in the copy number variant. In some embodiments, the sequencing data is DNA sequencing data. In some embodiments, the sequencing data is RNA sequencing data. In some embodiments, the sequencing data is genomic sequencing data. Some embodiments further comprise outputting a result of the aligning. Some embodiments further comprise providing a therapeutic intervention to a subject based on a result of analyzing the sequencing data.
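
By way of non-limiting illustration, the following Python sketch shows two of the barcode operations described above: filtering out sequencing reads that do not comprise a common barcode, and determining a total of distinct barcodes associated with reads that correspond to a window of the reference sequence. The `AlignedRead` record and its half-open coordinate convention are assumptions made for this example; the reads are taken to be already aligned.

```python
from typing import NamedTuple


class AlignedRead(NamedTuple):
    barcode: str
    start: int  # 0-based position on the reference (inclusive)
    end: int    # position on the reference (exclusive)


def filter_by_barcode(reads, barcode: str):
    """Keep only the reads that carry the given barcode."""
    return [read for read in reads if read.barcode == barcode]


def distinct_barcodes_in_window(reads, window_start: int, window_end: int) -> int:
    """Count distinct barcodes among reads overlapping [window_start, window_end)."""
    return len({
        read.barcode
        for read in reads
        if read.start < window_end and read.end > window_start
    })
```

A total of distinct barcodes per window obtained in this way could then serve as an input to a separate estimate of copy number; that estimate is not shown here.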

In some embodiments, the invention provides a method of distinguishing a first haplotype and a second haplotype in a pair of haplotypes based on sequencing data, the method comprising: a) receiving by a computer system a first set of sequencing reads, wherein the computer system comprises a processor; b) determining by the processor that each sequencing read in the first set of sequencing reads possesses a common nucleic acid residue at a common position; and c) associating each sequencing read in the first set of sequencing reads with the first haplotype based on the possession of the common nucleic acid residue at the common position. In some embodiments, the common nucleic acid residue at the common position is a single nucleotide polymorphism. In some embodiments, the single nucleotide polymorphism is identified based on a comparison of the common nucleic acid residue at the common position with a corresponding position in a reference sequence. Some embodiments further comprise estimating a likelihood that one of the haplotypes possesses a mutation. In some embodiments, the mutation is a copy number variant. Some embodiments further comprise estimating the number of copies in the copy number variant. Some embodiments further comprise: d) receiving a second set of sequencing reads; e) determining that each sequencing read in the second set of sequencing reads possesses the common position and does not possess the common nucleic acid residue at the common position; and f) associating each sequencing read in the second set of sequencing reads with the second haplotype based on the non-possession of the common nucleic acid residue at the common position. In some embodiments, each sequencing read in the first set of sequencing reads comprises a common barcode. In some embodiments, each sequencing read in the second set of sequencing reads comprises the common barcode. 
Some embodiments further comprise: g) identifying a window of a reference sequence that corresponds to the sequencing reads in the first set of sequencing reads and the second set of sequencing reads; and h) comparing the total number of sequencing reads in the first set of sequencing reads that occur within the window and the total number of sequencing reads in the second set of sequencing reads that occur within the window. In some embodiments, the associating each sequencing read in the first set of sequencing reads with the first haplotype has an error rate of no greater than 1%. In some embodiments, the sequencing data is DNA sequencing data. In some embodiments, the sequencing data is RNA sequencing data. In some embodiments, the sequencing data is genomic sequencing data. Some embodiments further comprise outputting a result of the associating. Some embodiments further comprise providing a therapeutic intervention to a subject based on a result of the method.
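
By way of non-limiting illustration, the following Python sketch shows sequencing reads being associated with a first haplotype or a second haplotype according to whether each read possesses a common nucleic acid residue at a common position. The representation of a read as a `(start, sequence)` pair on reference coordinates, and the function name `split_by_allele`, are assumptions made for this example.

```python
def split_by_allele(reads, position: int, allele: str):
    """Partition aligned reads by the residue they carry at a reference position.

    reads: iterable of (start, sequence) pairs, where `start` is the 0-based
    reference coordinate of the first residue (ungapped placement assumed).
    Returns (first_haplotype, second_haplotype, uninformative):
      first_haplotype  - reads carrying `allele` at `position`
      second_haplotype - reads covering `position` with a different residue
      uninformative    - reads that do not cover `position`
    """
    first, second, uninformative = [], [], []
    for start, sequence in reads:
        index = position - start
        if not 0 <= index < len(sequence):
            uninformative.append((start, sequence))
        elif sequence[index] == allele:
            first.append((start, sequence))
        else:
            second.append((start, sequence))
    return first, second, uninformative
```

The lengths of the first two lists give the per-haplotype read totals that step h) compares within a window.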

In some embodiments, the invention provides a method of analyzing sequencing data associated with a sample, the method comprising: a) receiving by a computer system a first sequencing read, wherein the computer system comprises a processor; b) determining by the processor that the first sequencing read has a single nucleotide polymorphism at a position; c) receiving a second sequencing read, wherein the second sequencing read comprises the position; d) determining that the second sequencing read has the single nucleotide polymorphism at the position; and e) selecting the first sequencing read and the second sequencing read for alignment based on the presence of the single nucleotide polymorphism at the position in both the first sequencing read and the second sequencing read. Some embodiments further comprise generating a candidate sequence by aligning the first sequencing read and the second sequencing read. In some embodiments, the presence of the single nucleotide polymorphism in the first sequencing read is determined by comparison of the first sequencing read with another sequence. In some embodiments, the presence of the single nucleotide polymorphism in the first sequencing read is determined by comparison of the first sequencing read with a reference sequence. In some embodiments, the sample is associated with a plurality of distinct barcodes. In some embodiments, the first sequencing read and the second sequencing read share a common barcode. Some embodiments further comprise receiving a third sequencing read, wherein the third sequencing read does not comprise the common barcode, and filtering out the third sequencing read. 
Some embodiments further comprise: f) receiving a third sequencing read, wherein the third sequencing read comprises the position; g) determining that the third sequencing read does not have the single nucleotide polymorphism at the position; and h) excluding the third sequencing read from alignment with the first sequencing read based on the absence of the single nucleotide polymorphism at the position of the third sequencing read. Some embodiments further comprise estimating a likelihood that the third sequencing read and the first sequencing read correspond to regions of a common chromosome. In some embodiments, the determining whether the third sequencing read and the first sequencing read correspond to regions of a common chromosome has an error rate of no greater than 1%. Some embodiments further comprise determining that the first sequencing read corresponds to a first haploid in a haploid pair, and that the third sequencing read corresponds to the second haploid in the haploid pair. In some embodiments, the determination that the first sequencing read corresponds to the first haploid in the haploid pair and that the third sequencing read corresponds to the second haploid in the haploid pair comprises determining a level of homology between the first sequencing read and the third sequencing read. In some embodiments, the sequencing data is DNA sequencing data. In some embodiments, the sequencing data is RNA sequencing data. In some embodiments, the sequencing data is genomic sequencing data. Some embodiments further comprise outputting the selection of the first sequencing read and the second sequencing read for alignment. Some embodiments further comprise providing a therapeutic intervention to a subject based on a result of the method.

In some embodiments, the invention provides a method of analyzing sequencing data associated with a sample, the method comprising: a) receiving by a computer system a first candidate sequence and a second candidate sequence, wherein the computer system comprises a processor; b) determining by the processor a pattern of single nucleotide polymorphisms or lack thereof within the first candidate sequence; c) determining a pattern of single nucleotide polymorphisms or lack thereof within the second candidate sequence; and d) estimating a likelihood that the first candidate sequence and the second candidate sequence correspond to a common chromosome based on the pattern of single nucleotide polymorphisms or lack thereof within the first candidate sequence and the pattern of single nucleotide polymorphisms or lack thereof within the second candidate sequence. In some embodiments, the estimation of the likelihood that the first candidate sequence and the second candidate sequence correspond to a common chromosome is based on a statistical analysis of the pattern of single nucleotide polymorphisms or lack thereof within the first candidate sequence and the pattern of single nucleotide polymorphisms or lack thereof within the second candidate sequence. In some embodiments, the estimating the likelihood that the first candidate sequence and the second candidate sequence correspond to a common chromosome has an error rate of no greater than 1%. In some embodiments, the sequencing data is DNA sequencing data. In some embodiments, the sequencing data is RNA sequencing data. In some embodiments, the sequencing data is genomic sequencing data. Some embodiments further comprise outputting the selection of the first sequencing read and the second sequencing read for alignment. Some embodiments further comprise providing a therapeutic intervention to a subject based on a result of the method.
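
By way of non-limiting illustration, the following Python sketch shows one simple statistical treatment of two patterns of single nucleotide polymorphisms. The model is an assumption made for this example and is not asserted to be the statistical analysis of any embodiment: each pattern is a mapping from a heterozygous position to the observed residue, each observation is taken to be wrong independently with probability `error_rate`, a wrong observation is taken to show the other allele, and the two hypotheses (same copy, opposite copy) are given equal prior weight.

```python
def same_copy_probability(pattern_a: dict, pattern_b: dict, error_rate: float = 0.01) -> float:
    """Posterior probability that two candidate sequences derive from the same copy.

    pattern_a, pattern_b: {position: residue} at heterozygous positions.
    Only positions present in both patterns are informative. With no shared
    positions the function returns the prior, 0.5.
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must be between 0 and 0.5 (exclusive)")
    shared = pattern_a.keys() & pattern_b.keys()
    matches = sum(1 for position in shared if pattern_a[position] == pattern_b[position])
    mismatches = len(shared) - matches
    # Same copy: agreement is expected at each site, disagreement implies an error.
    like_same = (1 - error_rate) ** matches * error_rate ** mismatches
    # Opposite copy: disagreement is expected, agreement implies an error.
    like_opposite = error_rate ** matches * (1 - error_rate) ** mismatches
    return like_same / (like_same + like_opposite)
```

Under this model, three agreeing shared positions at an assumed 1% error rate already yield a posterior above 0.99, and each additional shared position moves the estimate further from 0.5.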

In some embodiments, the invention provides a method of analyzing sequencing data associated with a sample, the method comprising: a) receiving by a computer system a first sequencing read and a second sequencing read, wherein the first sequencing read and the second sequencing read each comprise a common barcode, wherein the computer system comprises a processor; b) aligning the first sequencing read to a reference genome, wherein the reference genome comprises at least two reference chromosomes, and determining that the first sequencing read corresponds to a first reference chromosome in the reference genome; and c) aligning the second sequencing read to the reference genome and determining to which reference chromosome in the reference genome the second sequencing read corresponds, wherein the sample is associated with a plurality of distinct barcodes. In some embodiments, the aligning the second sequencing read to the reference genome determines that the second sequencing read corresponds to the first reference chromosome in the reference genome. In some embodiments, the aligning the second sequencing read to the reference genome determines that the second sequencing read corresponds to a second reference chromosome in the reference genome. Some embodiments further comprise estimating a likelihood that the sample contains a mutation from the reference genome. In some embodiments, the estimating the likelihood that the sample contains the mutation from the reference genome is based on the alignment of the first sequencing read to the reference genome and the alignment of the second sequencing read to the reference genome. Some embodiments further comprise determining a candidate site of the mutation in the reference genome. In some embodiments, the mutation is a translocation. 
Some embodiments further comprise: d) receiving by the computer system a plurality of further sequencing reads, wherein each further sequencing read comprises the common barcode; e) aligning each further sequencing read against the first reference chromosome in the reference genome; f) identifying a portion of the first reference chromosome that does not align to any of the further sequencing reads; and g) identifying the portion of the first reference chromosome that does not align to any of the further sequencing reads as the candidate breakpoint. Some embodiments further comprise determining a candidate breakpoint in the reference genome. In some embodiments, the candidate breakpoint is within 100 nucleotide residues of a sequencing read that aligns with the first reference chromosome. In some embodiments, the candidate breakpoint is within 50 nucleotide residues of a sequencing read that aligns with the first reference chromosome. In some embodiments, the candidate breakpoint is within 10 nucleotide residues of a sequencing read that aligns with the first reference chromosome. Some embodiments further comprise determining which haplotype in a pair of haplotypes possesses the mutation. In some embodiments, the determining which haplotype in the pair of haplotypes possesses the mutation comprises identifying a common position among a subset of the further sequencing reads and identifying a common single nucleotide polymorphism at the common position of the further sequencing reads. In some embodiments, the sequencing data is DNA sequencing data. In some embodiments, the sequencing data is RNA sequencing data. In some embodiments, the sequencing data is genomic sequencing data. Some embodiments further comprise outputting the determination of the reference chromosome in the reference genome to which the second sequencing read corresponds. Some embodiments further comprise providing a therapeutic intervention to a subject based on a result of the method.
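
By way of non-limiting illustration, the following Python sketch shows how a portion of a reference chromosome that does not align to any of the further sequencing reads could be identified. Alignments are represented as half-open `(start, end)` intervals on the reference chromosome; this representation and the function name `uncovered_intervals` are assumptions made for this example.

```python
def uncovered_intervals(reference_length: int, alignments):
    """Return half-open intervals of the reference not covered by any alignment.

    alignments: iterable of (start, end) pairs with 0 <= start < end.
    Each returned interval is bounded by aligned reads, or by an end of the
    reference, and is a candidate region in which to look for a breakpoint.
    """
    gaps = []
    cursor = 0  # first reference position not yet known to be covered
    for start, end in sorted(alignments):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < reference_length:
        gaps.append((cursor, reference_length))
    return gaps
```

Because every interior interval returned begins where one aligned read ends and stops where the next begins, each edge of such an interval lies directly adjacent to an aligned read.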

In some embodiments, the invention provides a method of identifying sequencing errors in sequencing data associated with a sample, the method comprising: a) receiving by a computer system a first sequencing read and a second sequencing read, wherein the first sequencing read and the second sequencing read comprise a common barcode, wherein the first sequencing read and the second sequencing read share at least 90% homology, and wherein the computer system comprises a processor; b) comparing by the processor the first sequencing read and the second sequencing read to find at least one site at which the first sequencing read and the second sequencing read differ; and c) determining that the site in the second sequencing read is an error based on the comparing, wherein the sample is associated with a plurality of distinct barcodes. Some embodiments further comprise comparing the second sequencing read with a number of additional sequencing reads, wherein each of the additional sequencing reads comprises the common barcode and shares at least 90% homology with the second sequencing read. In some embodiments, a confidence level of the determining that the site in the second sequencing read is an error increases as the number of additional sequencing reads compared to the second sequencing read increases. Some embodiments further comprise fixing the error by changing a nucleic acid identity at the site in the second sequencing read. In some embodiments, the first sequencing read is longer than the second sequencing read. In some embodiments, the first sequencing read and the second sequencing read share at least 95% homology. In some embodiments, the first sequencing read and the second sequencing read share at least 99% homology. In some embodiments, the first sequencing read and the second sequencing read share at least 99.9% homology. In some embodiments, the sequencing data is DNA sequencing data. In some embodiments, the sequencing data is RNA sequencing data. 
In some embodiments, the sequencing data is genomic sequencing data. Some embodiments further comprise outputting that the site in the second sequencing read is an error.
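As a non-limiting illustration, the comparison of same-barcode reads described above can be sketched with a per-site majority vote. The majority-vote rule, the assumption that the reads are already in register (position i of every read refers to the same site), and the name `flag_and_fix_errors` are assumptions made for this sketch; the disclosure itself only requires that the reads be compared and that sites at which they differ be evaluated.

```python
from collections import Counter
from typing import List, Tuple


def flag_and_fix_errors(target: str, others: List[str]) -> Tuple[str, List[int]]:
    """Compare `target` with reads sharing its barcode.

    Returns the corrected read and the list of sites judged to be errors.
    All reads are assumed to start at the same position and to be gap-free.
    """
    corrected = list(target)
    error_sites: List[int] = []
    for i, base in enumerate(target):
        # Bases observed at site i in the other reads that reach this site.
        column = [read[i] for read in others if i < len(read)]
        if not column:
            continue
        consensus, count = Counter(column).most_common(1)[0]
        # Flag the site only when a strict majority of the other reads disagree.
        if consensus != base and count * 2 > len(column):
            corrected[i] = consensus  # fix the error by changing the base identity
            error_sites.append(i)
    return "".join(corrected), error_sites


if __name__ == "__main__":
    same_barcode = ["ACGTACGA", "ACGTACGA", "ACGTACGA"]
    print(flag_and_fix_errors("ACGTTCGA", same_barcode))
```

Consistent with the text above, the more additional reads contribute to each column, the more weight the majority carries; a fuller implementation might report `count / len(column)` as a per-site confidence.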

In some embodiments, the invention provides a method of analyzing sequencing data, the method comprising: a) receiving by a computer system a candidate sequence, wherein the candidate sequence is based on a set of sequencing reads, wherein each sequencing read is independently associated with one of at least two distinct barcodes, wherein the computer system comprises a processor; and b) analyzing by the processor the candidate sequence to identify a pattern of the barcodes associated with the candidate sequence. Some embodiments further comprise: c) selecting a sequencing read from the set of sequencing reads; and d) comparing the pattern of the barcodes associated with the candidate sequence and the barcode associated with the selected sequencing read. Some embodiments further comprise: e) relocating the selected sequencing read based on the comparison of the pattern of the barcodes associated with the candidate sequence and the barcode associated with the selected sequencing read. In some embodiments, the candidate sequence corresponds to a reference chromosome, and the method further comprises relocating one of the sequencing reads to another position on the reference chromosome. In some embodiments, the candidate sequence corresponds to a reference chromosome, and the method further comprises relocating one of the sequencing reads to another reference chromosome. In some embodiments, the sequencing data is DNA sequencing data. In some embodiments, the sequencing data is RNA sequencing data. In some embodiments, the sequencing data is genomic sequencing data. Some embodiments further comprise providing an output based on the analyzing.
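As a non-limiting illustration, the barcode pattern of a candidate sequence and its comparison with the barcode of a selected read can be sketched as follows. Representing the pattern as the fraction of reads carrying each barcode, and treating a read whose barcode falls below a threshold as a candidate for relocation, are assumptions of this sketch; the names `PlacedRead`, `barcode_pattern`, `relocation_candidates`, and `min_fraction` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class PlacedRead(NamedTuple):
    read_id: str
    barcode: str
    position: int  # current placement on the candidate sequence


def barcode_pattern(reads: List[PlacedRead]) -> Dict[str, float]:
    """Fraction of the reads on the candidate sequence that carry each barcode."""
    counts = Counter(read.barcode for read in reads)
    total = sum(counts.values())
    return {barcode: n / total for barcode, n in counts.items()}


def relocation_candidates(reads: List[PlacedRead], min_fraction: float = 0.2) -> List[str]:
    """IDs of reads whose barcode is under-represented in the pattern.

    Such reads disagree with the barcode pattern of their neighbors and may
    belong at another position or on another reference chromosome.
    """
    pattern = barcode_pattern(reads)
    return [read.read_id for read in reads if pattern[read.barcode] < min_fraction]


if __name__ == "__main__":
    placed = [PlacedRead(f"r{i}", "BC1", 100 * i) for i in range(1, 6)]
    placed.append(PlacedRead("r6", "BC9", 250))
    print(relocation_candidates(placed))
```

Here five of the six reads carry barcode BC1, so the single BC9 read is reported as the one to reconsider. The sketch only flags reads; choosing the new location would require comparing the read against the barcode patterns of other candidate sequences.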

In some embodiments, the invention provides a method of analyzing sequencing data, the method comprising: a) receiving by a computer system a first candidate sequence and a second candidate sequence, wherein the computer system comprises a processor; b) processing by the processor the first candidate sequence and the second candidate sequence to provide display data for both the first candidate sequence and the second candidate sequence; and c) displaying contemporaneously the first candidate sequence and the second candidate sequence. In some embodiments, the first candidate sequence is based on data obtained from a sequencing instrument. In some embodiments, the first candidate sequence is received before the sequencing assay is complete. In some embodiments, the display of the first candidate sequence is updated before the sequencing assay is complete. In some embodiments, the display of the second candidate sequence is updated contemporaneously with the update of the display of the first candidate sequence. In some embodiments, the second candidate sequence is based on data obtained from the sequencing instrument. In some embodiments, the second candidate sequence is based on data obtained from a different sequencing instrument. In some embodiments, the second candidate sequence is based on data obtained from a database. In some embodiments, the first candidate sequence and the second candidate sequence correspond to different chromosomes. In some embodiments, the different chromosomes are associated with a common sample. In some embodiments, the different chromosomes are each independently associated with a different sample. In some embodiments, the sequencing data is DNA sequencing data. In some embodiments, the sequencing data is RNA sequencing data. In some embodiments, the sequencing data is genomic sequencing data. In some embodiments, the sequencing data is peptide sequencing data.

In some embodiments, the invention provides a computer program product comprising a computer-readable medium having computer-executable code encoded therein, the computer-executable code adapted to be executed to implement a method of analyzing sequencing data, the method comprising: a) providing a sequencing data analysis system, wherein the sequencing data analysis system comprises: i) a receiving module configured to receive sequencing data associated with a sequencing assay while the sequencing assay is in progress; and ii) an alignment module configured to align sequencing data while the sequencing assay is in progress; b) receiving by the receiving module a set of sequencing reads associated with the sequencing assay while the sequencing assay is in progress; and c) performing an alignment by the alignment module of at least one received sequencing read with another sequence while the sequencing assay is in progress. In some embodiments, the sequencing data analysis system further comprises a database access module, wherein the method further comprises accessing a database to obtain the other sequence. In some embodiments, the sequencing data analysis system further comprises a reference sequence, wherein the other sequence aligned with the sequencing read is the reference sequence. In some embodiments, the sequencing data analysis system further comprises a barcode module, wherein the method further comprises identifying by the barcode module at least two sequencing reads in the set of sequencing reads that share a common barcode. In some embodiments, the sequencing data analysis system further comprises a candidate module, wherein the method further comprises building by the candidate module a candidate sequence based on the alignment. 
In some embodiments, the sequencing data analysis system further comprises a mutation module, wherein the method further comprises estimating by the mutation module a likelihood that the candidate sequence comprises a mutation. In some embodiments, the mutation is a single nucleotide polymorphism. In some embodiments, the mutation is a copy number variant. In some embodiments, the mutation is a translocation. In some embodiments, the sequencing data analysis system further comprises a haplotype module, wherein the method further comprises determining by the haplotype module to which haplotype in a pair of haplotypes one of the sequencing reads of the set of sequencing reads corresponds. In some embodiments, the sequencing reads associated with the sequencing assay are received from an instrument conducting the sequencing assay. In some embodiments, the sequencing reads associated with the sequencing assay are received from a database. In some embodiments, the computer program product is in communication with a sequencing instrument. In some embodiments, the sequencing data analysis system further comprises a user information module, wherein the method further comprises obtaining by the user information module user information from the sequencing instrument. In some embodiments, the sequencing data analysis system further comprises a user information module, wherein the method further comprises obtaining by the user information module user information associated with a configuration of the sequencing instrument. In some embodiments, the computer program product is in communication with a plurality of sequencing instruments. In some embodiments, the sequencing data analysis system analyzes DNA sequencing data. In some embodiments, the sequencing data analysis system analyzes RNA sequencing data. In some embodiments, the sequencing data analysis system analyzes genomic sequencing data.
In some embodiments, the sequencing data analysis system analyzes peptide sequencing data. In some embodiments, the sequencing data analysis system further comprises an output module, wherein the method further comprises outputting by the output module a result of the alignment.
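As a non-limiting illustration, the receiving module and alignment module described above can be sketched as a generator pipeline, in which each sequencing read is aligned as soon as it is received rather than after the assay has finished. Exact substring matching stands in for the alignment module purely to keep the sketch short; the names `align_exact` and `align_stream` are hypothetical, and a practical alignment module would tolerate mismatches and gaps.

```python
from typing import Iterable, Iterator, Optional, Tuple


def align_exact(read: str, reference: str) -> Optional[int]:
    """Toy alignment module: offset of the first exact match, or None."""
    offset = reference.find(read)
    return offset if offset >= 0 else None


def align_stream(
    reads: Iterable[str], reference: str
) -> Iterator[Tuple[str, Optional[int]]]:
    """Yield (read, offset) for each read as soon as it is received.

    `reads` may be any iterable, including one fed by an instrument that is
    still sequencing, so results become available before the input ends.
    """
    for read in reads:
        yield read, align_exact(read, reference)


if __name__ == "__main__":
    reference_sequence = "ACGATTACA"
    incoming = iter(["GATT", "CCCC"])  # stands in for a live data stream
    for read, offset in align_stream(incoming, reference_sequence):
        print(read, offset)
```

Because `align_stream` is a generator, the consumer receives the alignment of the first read before the second read has been requested from the stream, which mirrors the requirement that alignment occur while the sequencing assay is in progress.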

In some embodiments, the invention provides a computer program product comprising a computer-readable medium having computer-executable code encoded therein, the computer-executable code adapted to be executed to implement a method of analyzing sequencing data, the method comprising: a) providing a sequencing data analysis system, wherein the sequencing data analysis system comprises: i) a receiving module; ii) an alignment module; iii) a database access module; and iv) a sorting module; b) receiving by the receiving module a set of sequencing reads, wherein each sequencing read shares a common position; c) accessing by the database access module a database to obtain a reference sequence stored in the database; d) performing an alignment by the alignment module of each sequencing read of the set of sequencing reads with the reference sequence; and e) sorting by the sorting module the sequencing reads of the set of sequencing reads into two subsets of sequencing reads, wherein a first subset of sequencing reads contains sequencing reads that match the reference sequence at the common position, and a second subset of sequencing reads contains sequencing reads that do not match the reference sequence at the common position. In some embodiments, each sequencing read of the second subset of sequencing reads possesses a single nucleotide polymorphism at the common position. In some embodiments, the sequencing data analysis system further comprises a haplotype module, wherein the method further comprises associating by the haplotype module the first subset of sequencing reads with a first haplotype of a pair of haplotypes, and the second subset of sequencing reads with the second haplotype of the pair of haplotypes. 
In some embodiments, the sequencing data analysis system further comprises a mutation module, wherein the method further comprises estimating by the mutation module a likelihood that either haplotype of the pair of haplotypes comprises a mutation based on a total number of sequencing reads in the first subset of sequencing reads and a total number of sequencing reads in the second subset of sequencing reads. In some embodiments, the sequencing data analysis system further comprises a barcode module, wherein the method further comprises identifying a common barcode within the set of sequencing reads and excluding from the first subset of sequencing reads and the second subset of sequencing reads any sequencing read that does not comprise the common barcode. In some embodiments, the sequencing data analysis system analyzes DNA sequencing data. In some embodiments, the sequencing data analysis system analyzes RNA sequencing data. In some embodiments, the sequencing data analysis system analyzes genomic sequencing data. In some embodiments, the sequencing data analysis system further comprises an output module, wherein the method further comprises outputting by the output module a result of the sorting.
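As a non-limiting illustration, the sorting step e) above can be sketched as a partition of aligned reads by the base each read carries at the common position. The `AlignedRead` record, the 0-based coordinates, and the assumption that reads are gap-free relative to the reference sequence are simplifications made for this sketch and are not taken from this disclosure.

```python
from typing import List, NamedTuple, Tuple


class AlignedRead(NamedTuple):
    start: int      # 0-based reference coordinate of the first base
    sequence: str   # read bases, assumed gap-free relative to the reference


def sort_by_common_position(
    reads: List[AlignedRead], reference: str, position: int
) -> Tuple[List[AlignedRead], List[AlignedRead]]:
    """Split reads into (matching, non_matching) at `position`.

    Reads that do not cover the position are left out of both subsets.
    """
    matching: List[AlignedRead] = []
    non_matching: List[AlignedRead] = []
    for read in reads:
        offset = position - read.start
        if not 0 <= offset < len(read.sequence):
            continue  # read does not cover the common position
        if read.sequence[offset] == reference[position]:
            matching.append(read)
        else:
            non_matching.append(read)
    return matching, non_matching


if __name__ == "__main__":
    reference_sequence = "ACGTACGT"
    aligned = [AlignedRead(2, "GTAC"), AlignedRead(3, "TGCG"), AlignedRead(0, "ACG")]
    first_subset, second_subset = sort_by_common_position(aligned, reference_sequence, 4)
    print(len(first_subset), len(second_subset))
```

The sizes of the two subsets are the totals on which the mutation module described above could base its estimate, and the subsets themselves correspond to the groups that the haplotype module could associate with the first and second haplotypes.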

In some embodiments, the invention provides a computer program product comprising a computer-readable medium having computer-executable code encoded therein, the computer-executable code adapted to be executed to implement a method of analyzing sequencing data, the method comprising: a) providing a sequencing data analysis system, wherein the sequencing data analysis system comprises: i) a receiving module; ii) an alignment module; iii) a database access module; and iv) a selection module; b) receiving by the receiving module a first sequencing read and a second sequencing read, wherein the first sequencing read and the second sequencing read overlap at a common position; c) accessing by the database access module a database to obtain a reference sequence stored in the database; d) performing an alignment by the alignment module of the first sequencing read with the reference sequence, thereby determining a presence of a single nucleotide polymorphism at the common position in the first sequencing read; e) performing an alignment by the alignment module of the second sequencing read with the reference sequence, thereby determining a presence of the single nucleotide polymorphism at the common position in the second sequencing read; and f) selecting by the selection module the first sequencing read and the second sequencing read for mutual alignment based on the presence of the single nucleotide polymorphism at the common position in the first sequencing read and the second sequencing read. In some embodiments, the method further comprises aligning by the alignment module the first sequencing read and the second sequencing read. In some embodiments, the sequencing data analysis system further comprises a candidate module, wherein the method further comprises building by the candidate module a candidate sequence based on the alignment of the first sequencing read and the second sequencing read. 
In some embodiments, the sequencing data analysis system further comprises a barcode module, wherein the method further comprises determining by the barcode module that the first sequencing read and the second sequencing read share a common barcode. In some embodiments, the selecting by the selection module the first sequencing read and the second sequencing read for mutual alignment is further based on the first sequencing read and the second sequencing read sharing a common barcode. In some embodiments, the sequencing data analysis system further comprises a chromosome module, wherein the receiving module receives a third sequencing read, and wherein the method further comprises estimating by the chromosome module a likelihood that the third sequencing read and the first sequencing read correspond to regions of a common chromosome. In some embodiments, the sequencing data analysis system analyzes DNA sequencing data. In some embodiments, the sequencing data analysis system analyzes RNA sequencing data. In some embodiments, the sequencing data analysis system analyzes genomic sequencing data. In some embodiments, the sequencing data analysis system further comprises an output module, wherein the method further comprises outputting by the output module a result based on the selecting the first sequencing read and the second sequencing read for alignment.
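As a non-limiting illustration, the selection step f) above can be sketched as a predicate that accepts two reads for mutual alignment only when both carry the same non-reference base at the common position, optionally also requiring a common barcode. The `BarcodedRead` record, the function names, and the `require_common_barcode` flag are hypothetical, and reads are assumed to be gap-free relative to the reference sequence.

```python
from typing import NamedTuple, Optional


class BarcodedRead(NamedTuple):
    start: int      # 0-based reference coordinate of the first base
    sequence: str
    barcode: str


def base_at(read: BarcodedRead, position: int) -> Optional[str]:
    """Base the read carries at a reference position, or None if not covered."""
    offset = position - read.start
    if 0 <= offset < len(read.sequence):
        return read.sequence[offset]
    return None


def select_for_mutual_alignment(
    first: BarcodedRead,
    second: BarcodedRead,
    reference: str,
    position: int,
    require_common_barcode: bool = False,
) -> bool:
    """True when both reads show the same SNP at the common position."""
    first_base = base_at(first, position)
    second_base = base_at(second, position)
    if first_base is None or second_base is None:
        return False  # the reads do not overlap at the common position
    shares_snp = first_base == second_base and first_base != reference[position]
    if require_common_barcode and first.barcode != second.barcode:
        return False
    return shares_snp


if __name__ == "__main__":
    reference_sequence = "ACGTACGT"
    read_a = BarcodedRead(2, "GTGC", "BC1")
    read_b = BarcodedRead(4, "GCGT", "BC1")
    print(select_for_mutual_alignment(read_a, read_b, reference_sequence, 4))
```

In this example both reads carry G where the reference sequence carries A, so the pair is selected; a read carrying the reference base at that position would not be.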

In some embodiments, the invention provides a computer program product comprising a computer-readable medium having computer-executable code encoded therein, the computer-executable code adapted to be executed to implement a method of analyzing sequencing data, the method comprising: a) providing a sequencing data analysis system, wherein the sequencing data analysis system comprises: i) a receiving module; ii) an alignment module; iii) a database access module; and iv) a mutation module; b) receiving by the receiving module a first sequencing read and a second sequencing read; c) accessing by the database access module a database to obtain a reference genome stored in the database, wherein the reference genome comprises at least two reference chromosomes; d) performing an alignment by the alignment module of the first sequencing read with the reference genome, thereby determining that the first sequencing read corresponds to a first reference chromosome in the reference genome; e) performing an alignment by the alignment module of the second sequencing read with the reference genome, thereby determining that the second sequencing read corresponds to a second reference chromosome in the reference genome; and f) estimating by the mutation module a likelihood that a mutation exists in a sample associated with the first sequencing read and the second sequencing read based on the alignments of the first sequencing read and the second sequencing read with the reference genome. In some embodiments, the method further comprises determining by the mutation module a candidate site of the mutation based on the alignments of the first sequencing read and the second sequencing read with the reference genome. 
In some embodiments, the mutation module is configured to identify a translocation, and wherein the estimating by the mutation module a likelihood that a mutation exists in the sample associated with the first sequencing read and the second sequencing read comprises estimating a likelihood that a translocation exists in the sample associated with the first sequencing read and the second sequencing read. In some embodiments, the mutation module is configured to identify a candidate breakpoint associated with the translocation, wherein the method further comprises identifying by the mutation module a candidate breakpoint associated with the translocation based on the alignments of the first sequencing read and the second sequencing read with the reference genome. In some embodiments, the sequencing data analysis system further comprises a barcode module, and the method further comprises: g) receiving by the receiving module a plurality of further sequencing reads; h) determining by the barcode module that the first sequencing read and the further sequencing reads comprise a common barcode; i) aligning by the alignment module each further sequencing read against the first reference chromosome in the reference genome; and j) identifying by the alignment module a portion of the first reference chromosome that does not align to any of the further sequencing reads. In some embodiments, the sequencing data analysis system further comprises a haplotype module, wherein the method further comprises determining by the haplotype module which haplotype in a pair of haplotypes has the mutation. In some embodiments, the sequencing data is DNA sequencing data. In some embodiments, the sequencing data is RNA sequencing data. In some embodiments, the sequencing data is genomic sequencing data. 
In some embodiments, the sequencing data analysis system further comprises an output module, wherein the method further comprises outputting by the output module an estimated likelihood that the mutation exists in the sample associated with the first sequencing read and the second sequencing read.
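As a non-limiting illustration, one crude way a mutation module might score evidence for a translocation is to ask how often reads sharing a barcode align to both of two reference chromosomes. The score below is a simple fraction and is not a calibrated likelihood; the scoring rule and the name `translocation_support` are assumptions of this sketch, and the input is assumed to be (barcode, reference chromosome) pairs produced by an upstream alignment step.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def translocation_support(
    assignments: Iterable[Tuple[str, str]], chrom_a: str, chrom_b: str
) -> float:
    """Fraction of barcodes seen on chrom_a or chrom_b that are seen on both.

    `assignments` holds one (barcode, chromosome) pair per aligned read.
    """
    chroms_by_barcode: Dict[str, Set[str]] = defaultdict(set)
    for barcode, chrom in assignments:
        chroms_by_barcode[barcode].add(chrom)
    # Only barcodes that touch at least one of the two chromosomes are informative.
    relevant = [c for c in chroms_by_barcode.values() if chrom_a in c or chrom_b in c]
    if not relevant:
        return 0.0
    both = sum(1 for c in relevant if chrom_a in c and chrom_b in c)
    return both / len(relevant)


if __name__ == "__main__":
    aligned = [
        ("BC1", "chr9"), ("BC1", "chr22"),
        ("BC2", "chr9"), ("BC3", "chr22"), ("BC4", "chr1"),
    ]
    print(translocation_support(aligned, "chr9", "chr22"))
```

A higher fraction means that more barcodes link the two chromosomes. A practical estimate would need to account for barcodes that legitimately tag fragments from more than one chromosome.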

Claims

1. A sequencing analysis system, comprising: one or more computer processors in communication with a sequencer, said sequencer configured to sequence a plurality of nucleic acid molecules and generate a set of partial sequencing reads comprising a partial sequencing read comprising a subsequence, wherein said subsequence comprises a sequencing error, wherein said one or more computer processors are individually or collectively configured to: (i) while said sequencer is sequencing said plurality of nucleic acid molecules, process said partial sequencing read based at least in part on a frequency of said subsequence among at least a portion of said set of partial sequencing reads to thereby correct said sequencing error, and (ii) while said sequencer is sequencing said plurality of nucleic acid molecules, generate, for display, an indication of said sequencing error in said subsequence.
2. The system of claim 1, wherein said one or more computer processors are individually or collectively configured to, while said sequencer is sequencing said plurality of nucleic acid molecules, receive user input through an interface in communication with said one or more computer processors.
3. The system of claim 1, wherein said subsequence comprises at least a portion of a barcode sequence.
4. The system of claim 1, wherein said one or more computer processors are individually or collectively configured to process said partial sequencing read and an additional partial sequencing read comprising said subsequence to identify said sequencing error or generate a consensus sequence.
5. The system of claim 1, wherein said one or more computer processors are in communication with a database, and wherein said one or more computer processors are individually or collectively configured to store said set of partial sequencing reads in said database.
6. The system of claim 1, wherein said one or more computer processors are individually or collectively configured to receive said set of partial sequencing reads from a data stream while said sequencer is sequencing said plurality of nucleic acid molecules.
7. The system of claim 1, wherein said one or more computer processors are in communication with said sequencer over a cloud infrastructure.
8. A sequencing analysis system, comprising: one or more computer processors in communication with a sequencer, said sequencer configured to sequence a plurality of nucleic acid molecules and generate a set of partial sequencing reads, wherein a partial sequencing read of said set of partial sequencing reads comprises at least a portion of a barcode sequence, wherein said one or more computer processors are individually or collectively configured to: while said sequencer is sequencing said plurality of nucleic acid molecules, receive, from said sequencer, said set of partial sequencing reads; and while said sequencer is sequencing said plurality of nucleic acid molecules, analyze said partial sequencing read of said set of partial sequencing reads based at least in part on said at least said portion of said barcode sequence.
9. The system of claim 8, wherein said one or more computer processors are individually or collectively configured to process (i) said partial sequencing read comprising said portion of said barcode sequence and (ii) an additional partial sequencing read of said set of partial sequencing reads comprising an additional portion of said barcode sequence to assemble said barcode sequence.
10. The system of claim 8, wherein said one or more computer processors are individually or collectively configured to assign a subset of said set of partial sequencing reads to a group of a plurality of groups based on said at least said portion of said barcode sequence.
11. The system of claim 8, wherein said one or more computer processors are individually or collectively configured to assemble said partial sequencing read with at least one other read of said set of partial sequencing reads.
12. The system of claim 8, wherein said set of partial sequencing reads comprises a second partial sequencing read, and wherein said one or more computer processors are individually or collectively configured to process said partial sequencing read and said second partial sequencing read to determine whether said partial sequencing read and said second partial sequencing read share a common subsequence.
13. The system of claim 12, wherein said one or more computer processors are individually or collectively configured to identify a genetic variant in said partial sequencing read or said second partial sequencing read.
14. The system of claim 12, wherein said one or more computer processors are individually or collectively configured to determine whether said partial sequencing read and said second partial sequencing read correspond to a common biomolecule.
15. The system of claim 8, wherein said one or more computer processors are individually or collectively configured to process said partial sequencing read with a reference sequence.
16. The system of claim 15, wherein said one or more computer processors are individually or collectively configured to process said partial sequencing read with said reference sequence to identify a mutation in said partial sequencing read.
17. The system of claim 8, wherein said one or more computer processors are individually or collectively configured to determine a candidate sequence corresponding to a chromosome, wherein said candidate sequence incorporates a portion of said partial sequencing read.
18. The system of claim 17, wherein said one or more computer processors are individually or collectively configured to determine said candidate sequence at a rate that increases as a number of partial sequencing reads received by said one or more computer processors increases.
19. The system of claim 8, wherein said one or more computer processors are individually or collectively configured to determine a haplotype in a pair of haplotypes to which said partial sequencing read corresponds.
20. The system of claim 8, wherein said one or more computer processors are individually or collectively configured to receive said set of partial sequencing reads from a data stream while said sequencer is sequencing said plurality of nucleic acid molecules.
21. A sequencing analysis system, comprising: one or more computer processors in communication with a sequencer over a cloud, said sequencer configured to sequence a nucleic acid molecule to generate a partial sequencing read, wherein said one or more computer processors are individually or collectively configured to, while said sequencer is sequencing said nucleic acid molecule, (i) receive said partial sequencing read from said sequencer over said cloud, and (ii) process said partial sequencing read to yield a result.
22. The system of claim 21, wherein said one or more computer processors are in communication with a database, and wherein said one or more computer processors are individually or collectively configured to store said partial sequencing read in said database.
23. The system of claim 21, wherein said one or more computer processors are individually or collectively configured to receive said partial sequencing read from a data stream while said sequencer is sequencing said nucleic acid molecule.

CROSS REFERENCE

This application is a continuation of U.S. application Ser. No. 14/470,746, filed Aug. 27, 2014, now U.S. Pat. No. 10,395,758, which claims priority to U.S. Provisional Application No. 61/979,973, filed Apr. 15, 2014, U.S. Provisional Application No. 61/916,566, filed Dec. 16, 2013, and U.S. Provisional Application No. 61/872,597, filed Aug. 30, 2013, each of which is entirely incorporated herein by reference for all purposes.

US Referenced Citations (470)

Number	Name	Date	Kind
4124638	Hansen	Nov 1978	A
5149625	Church et al.	Sep 1992	A
5185099	Delpuech et al.	Feb 1993	A
5202231	Drmanac et al.	Apr 1993	A
5270183	Corbett et al.	Dec 1993	A
5413924	Kosak et al.	May 1995	A
5436130	Mathies et al.	Jul 1995	A
5478893	Ghosh et al.	Dec 1995	A
5512131	Kumar et al.	Apr 1996	A
5587128	Wilding et al.	Dec 1996	A
5605793	Stemmer et al.	Feb 1997	A
5618711	Gelfand et al.	Apr 1997	A
5695940	Drmanac et al.	Dec 1997	A
5736330	Fulton	Apr 1998	A
5756334	Perler et al.	May 1998	A
5834197	Parton	Nov 1998	A
5846719	Brenner et al.	Dec 1998	A
5851769	Gray et al.	Dec 1998	A
5856174	Lipshutz et al.	Jan 1999	A
5900481	Lough et al.	May 1999	A
5942609	Hunkapiller et al.	Aug 1999	A
5958703	Dower et al.	Sep 1999	A
5994056	Higuchi	Nov 1999	A
6033880	Haff et al.	Mar 2000	A
6046003	Mandecki	Apr 2000	A
6051377	Mandecki	Apr 2000	A
6057107	Fulton	May 2000	A
6057149	Burns et al.	May 2000	A
6103537	Ullman et al.	Aug 2000	A
6123798	Gandhi et al.	Sep 2000	A
6143496	Brown et al.	Nov 2000	A
6171850	Nagle et al.	Jan 2001	B1
6172218	Brenner	Jan 2001	B1
6176962	Soane et al.	Jan 2001	B1
6297006	Drmanac et al.	Oct 2001	B1
6297017	Schmidt et al.	Oct 2001	B1
6306590	Mehta et al.	Oct 2001	B1
6327410	Walt et al.	Dec 2001	B1
6355198	Kim et al.	Mar 2002	B1
6361950	Mandecki	Mar 2002	B1
6372813	Johnson et al.	Apr 2002	B1
6379929	Burns et al.	Apr 2002	B1
6406848	Bridgham et al.	Jun 2002	B1
6409832	Weigl et al.	Jun 2002	B2
6432360	Church	Aug 2002	B1
6485944	Church et al.	Nov 2002	B1
6492118	Abrams et al.	Dec 2002	B1
6511803	Church et al.	Jan 2003	B1
6524456	Ramsey et al.	Feb 2003	B1
6586176	Trnovsky et al.	Jul 2003	B1
6632606	Ullman et al.	Oct 2003	B1
6632655	Mehta et al.	Oct 2003	B1
6670133	Knapp et al.	Dec 2003	B2
6767731	Hannah	Jul 2004	B2
6800298	Burdick et al.	Oct 2004	B1
6806052	Bridgham et al.	Oct 2004	B2
6806058	Jesperson et al.	Oct 2004	B2
6859570	Walt et al.	Feb 2005	B2
6913935	Thomas	Jul 2005	B1
6915679	Chien et al.	Jul 2005	B2
6929859	Chandler et al.	Aug 2005	B2
6969488	Bridgham et al.	Nov 2005	B2
6974669	Mirkin et al.	Dec 2005	B2
7041481	Anderson et al.	May 2006	B2
7115400	Adessi et al.	Oct 2006	B1
7129091	Ismagilov et al.	Oct 2006	B2
7268167	Higuchi et al.	Sep 2007	B2
7282370	Bridgham et al.	Oct 2007	B2
7294503	Quake et al.	Nov 2007	B2
7323305	Leamon et al.	Jan 2008	B2
7425431	Church et al.	Sep 2008	B2
7536928	Kazuno	May 2009	B2
7544473	Brenner	Jun 2009	B2
7604938	Takahashi et al.	Oct 2009	B2
7622076	Davies et al.	Nov 2009	B2
7622280	Holliger et al.	Nov 2009	B2
7638276	Griffiths et al.	Dec 2009	B2
7645596	Williams et al.	Jan 2010	B2
7666664	Sarofim et al.	Feb 2010	B2
7708949	Stone et al.	May 2010	B2
7709197	Drmanac	May 2010	B2
7745178	Dong	Jun 2010	B2
7772287	Higuchi et al.	Aug 2010	B2
7776927	Chu et al.	Aug 2010	B2
RE41780	Anderson et al.	Sep 2010	E
7799553	Mathies et al.	Sep 2010	B2
7842457	Berka et al.	Nov 2010	B2
7901891	Drmanac	Mar 2011	B2
7910354	Drmanac et al.	Mar 2011	B2
7927797	Nobile et al.	Apr 2011	B2
7960104	Drmanac et al.	Jun 2011	B2
7968287	Griffiths et al.	Jun 2011	B2
7972778	Brown et al.	Jul 2011	B2
8003312	Krutzik et al.	Aug 2011	B2
8053192	Bignell et al.	Nov 2011	B2
8067159	Brown et al.	Nov 2011	B2
8133719	Drmanac et al.	Mar 2012	B2
8168385	Brenner	May 2012	B2
8252539	Quake et al.	Aug 2012	B2
8268564	Roth et al.	Sep 2012	B2
8273573	Ismagilov et al.	Sep 2012	B2
8278071	Brown et al.	Oct 2012	B2
8298767	Brenner et al.	Oct 2012	B2
8304193	Ismagilov et al.	Nov 2012	B2
8318433	Brenner	Nov 2012	B2
8329407	Ismagilov et al.	Dec 2012	B2
8337778	Stone et al.	Dec 2012	B2
8592150	Drmanac et al.	Nov 2013	B2
8603749	Gillevet	Dec 2013	B2
8658430	Miller et al.	Feb 2014	B2
8748094	Weitz et al.	Jun 2014	B2
8748102	Berka et al.	Jun 2014	B2
8765380	Berka et al.	Jul 2014	B2
8822148	Ismagilov et al.	Sep 2014	B2
8835358	Fodor et al.	Sep 2014	B2
8871444	Griffiths et al.	Oct 2014	B2
8889083	Ismagilov et al.	Nov 2014	B2
8975302	Light et al.	Mar 2015	B2
9012370	Hong	Apr 2015	B2
9012390	Holtze et al.	Apr 2015	B2
9017948	Agresti et al.	Apr 2015	B2
9029083	Griffiths et al.	May 2015	B2
9029085	Agresti et al.	May 2015	B2
9085798	Chee	Jul 2015	B2
9089844	Hiddessen et al.	Jul 2015	B2
9126160	Ness et al.	Sep 2015	B2
9156010	Colston et al.	Oct 2015	B2
9194861	Hindson et al.	Nov 2015	B2
9216392	Hindson et al.	Dec 2015	B2
9238206	Rotem et al.	Jan 2016	B2
9266104	Link	Feb 2016	B2
9290808	Fodor et al.	Mar 2016	B2
9328382	Drmanac et al.	May 2016	B2
9347059	Saxonov	May 2016	B2
9371598	Chee	Jun 2016	B2
9388465	Hindson et al.	Jul 2016	B2
9410201	Hindson et al.	Aug 2016	B2
9417190	Hindson et al.	Aug 2016	B2
9486757	Romanowsky et al.	Nov 2016	B2
9498761	Holtze et al.	Nov 2016	B2
9500664	Ness et al.	Nov 2016	B2
9567631	Hindson et al.	Feb 2017	B2
9593365	Frisen et al.	Mar 2017	B2
9623384	Hindson et al.	Apr 2017	B2
9637799	Fan et al.	May 2017	B2
9644204	Hindson et al.	May 2017	B2
9689024	Hindson et al.	Jun 2017	B2
9694361	Bharadwaj et al.	Jul 2017	B2
9695468	Hindson et al.	Jul 2017	B2
9701998	Hindson et al.	Jul 2017	B2
9764322	Hiddessen et al.	Sep 2017	B2
9824068	Wong et al.	Nov 2017	B2
9868979	Chee et al.	Jan 2018	B2
9879313	Chee et al.	Jan 2018	B2
9946577	Stafford et al.	Apr 2018	B1
9951386	Hindson et al.	Apr 2018	B2
9957558	Leamon et al.	May 2018	B2
9975122	Masquelier et al.	May 2018	B2
10011872	Belgrader et al.	Jul 2018	B1
10017759	Kaper et al.	Jul 2018	B2
10030261	Frisen et al.	Jul 2018	B2
10059989	Giresi et al.	Aug 2018	B2
10119167	Srinivasan et al.	Nov 2018	B2
10221436	Hardenbol et al.	Mar 2019	B2
10221442	Hindson et al.	Mar 2019	B2
10253364	Hindson et al.	Apr 2019	B2
10273541	Hindson et al.	Apr 2019	B2
10323279	Hindson et al.	Jun 2019	B2
10347365	Wong et al.	Jul 2019	B2
10357771	Bharadwaj et al.	Jul 2019	B2
10395758	Schnall-Levin	Aug 2019	B2
10400280	Hindson et al.	Sep 2019	B2
10428326	Belhocine et al.	Oct 2019	B2
10450607	Hindson et al.	Oct 2019	B2
10533221	Hindson et al.	Jan 2020	B2
10544413	Bharadwaj et al.	Jan 2020	B2
10549279	Bharadwaj et al.	Feb 2020	B2
10557158	Hardenbol et al.	Feb 2020	B2
10590244	Delaney et al.	Mar 2020	B2
10745742	Bent et al.	Aug 2020	B2
10752949	Hindson et al.	Aug 2020	B2
10774374	Frisen et al.	Sep 2020	B2
10815525	Lucero et al.	Oct 2020	B2
10829815	Bharadwaj et al.	Nov 2020	B2
10837047	Delaney et al.	Nov 2020	B2
10874997	Weitz et al.	Dec 2020	B2
10995333	Pfeiffer	May 2021	B2
11030276	Wong	Jun 2021	B2
11371094	Ryvkin et al.	Jun 2022	B2
11459607	Terry et al.	Oct 2022	B1
11467153	Belhocine et al.	Oct 2022	B2
11655499	Pfeiffer	May 2023	B1
11845983	Belhocine et al.	Dec 2023	B1
11851683	Maheshwari et al.	Dec 2023	B1
11851700	Bava et al.	Dec 2023	B1
11952626	Pfeiffer et al.	Apr 2024	B2
20010020588	Adourian et al.	Sep 2001	A1
20010044109	Mandecki	Nov 2001	A1
20020005354	Spence et al.	Jan 2002	A1
20020034737	Drmanac	Mar 2002	A1
20020051971	Stuelpnagel et al.	May 2002	A1
20020051992	Bridgham et al.	May 2002	A1
20020058332	Quake et al.	May 2002	A1
20020089100	Kawasaki	Jul 2002	A1
20020092767	Bjornson et al.	Jul 2002	A1
20020119455	Chan	Aug 2002	A1
20020127736	Chou et al.	Sep 2002	A1
20020179849	Maher et al.	Dec 2002	A1
20030008285	Fischer	Jan 2003	A1
20030008323	Ravkin et al.	Jan 2003	A1
20030027221	Scott et al.	Feb 2003	A1
20030028981	Chandler et al.	Feb 2003	A1
20030036206	Chien et al.	Feb 2003	A1
20030044777	Beattie	Mar 2003	A1
20030044836	Levine et al.	Mar 2003	A1
20030075446	Culbertson et al.	Apr 2003	A1
20030104466	Knapp et al.	Jun 2003	A1
20030108897	Drmanac	Jun 2003	A1
20030124509	Kenis et al.	Jul 2003	A1
20030149307	Hai et al.	Aug 2003	A1
20030170698	Gascoyne et al.	Sep 2003	A1
20030182068	Battersby et al.	Sep 2003	A1
20030207260	Trnovsky et al.	Nov 2003	A1
20030215862	Parce et al.	Nov 2003	A1
20040019005	Van Ness et al.	Jan 2004	A1
20040063138	McGinnis et al.	Apr 2004	A1
20040067493	Matsuzaki et al.	Apr 2004	A1
20040068019	Higuchi et al.	Apr 2004	A1
20040132122	Banerjee et al.	Jul 2004	A1
20040258701	Dominowski et al.	Dec 2004	A1
20050019839	Jespersen et al.	Jan 2005	A1
20050042625	Schmidt et al.	Feb 2005	A1
20050130188	Walt et al.	Jun 2005	A1
20050181379	Su et al.	Aug 2005	A1
20050202429	Trau et al.	Sep 2005	A1
20050202489	Cho et al.	Sep 2005	A1
20050221339	Griffiths et al.	Oct 2005	A1
20050244850	Huang et al.	Nov 2005	A1
20050250147	Macevicz	Nov 2005	A1
20050266582	Modlin et al.	Dec 2005	A1
20050287572	Mathies et al.	Dec 2005	A1
20060020371	Ham et al.	Jan 2006	A1
20060073487	Oliver et al.	Apr 2006	A1
20060153924	Griffiths et al.	Jul 2006	A1
20060163385	Link et al.	Jul 2006	A1
20060177832	Brenner	Aug 2006	A1
20060199193	Koo et al.	Sep 2006	A1
20060240506	Kushmaro et al.	Oct 2006	A1
20060263888	Fritz et al.	Nov 2006	A1
20060275782	Gunderson et al.	Dec 2006	A1
20060292583	Schneider et al.	Dec 2006	A1
20070003442	Link et al.	Jan 2007	A1
20070020617	Trnovsky et al.	Jan 2007	A1
20070020640	McCloskey et al.	Jan 2007	A1
20070042419	Barany et al.	Feb 2007	A1
20070054119	Garstecki et al.	Mar 2007	A1
20070077572	Tawfik et al.	Apr 2007	A1
20070092914	Griffiths et al.	Apr 2007	A1
20070099208	Drmanac et al.	May 2007	A1
20070111241	Cereb et al.	May 2007	A1
20070154903	Marla et al.	Jul 2007	A1
20070172873	Brenner et al.	Jul 2007	A1
20070190543	Livak	Aug 2007	A1
20070195127	Ahn et al.	Aug 2007	A1
20070196397	Torii et al.	Aug 2007	A1
20070207060	Zou et al.	Sep 2007	A1
20070228588	Noritomi et al.	Oct 2007	A1
20070264320	Lee et al.	Nov 2007	A1
20080003142	Link et al.	Jan 2008	A1
20080004436	Tawfik et al.	Jan 2008	A1
20080014589	Link et al.	Jan 2008	A1
20080056948	Dale et al.	Mar 2008	A1
20080166720	Hsieh et al.	Jul 2008	A1
20080213766	Brown et al.	Sep 2008	A1
20080241820	Krutzik et al.	Oct 2008	A1
20080242560	Gunderson et al.	Oct 2008	A1
20080268431	Choy et al.	Oct 2008	A1
20090005252	Drmanac et al.	Jan 2009	A1
20090011943	Drmanac et al.	Jan 2009	A1
20090012187	Chu et al.	Jan 2009	A1
20090025277	Takanashi	Jan 2009	A1
20090035770	Mathies et al.	Feb 2009	A1
20090047713	Handique	Feb 2009	A1
20090048124	Leamon et al.	Feb 2009	A1
20090053169	Castillo et al.	Feb 2009	A1
20090068170	Weitz et al.	Mar 2009	A1
20090099041	Church et al.	Apr 2009	A1
20090118488	Drmanac et al.	May 2009	A1
20090131543	Weitz et al.	May 2009	A1
20090137404	Drmanac et al.	May 2009	A1
20090137414	Drmanac et al.	May 2009	A1
20090143244	Bridgham et al.	Jun 2009	A1
20090148961	Luchini et al.	Jun 2009	A1
20090155563	Petsev et al.	Jun 2009	A1
20090155781	Drmanac et al.	Jun 2009	A1
20090197772	Griffiths et al.	Aug 2009	A1
20090202984	Cantor	Aug 2009	A1
20090203531	Kurn	Aug 2009	A1
20090235990	Beer	Sep 2009	A1
20090264299	Drmanac et al.	Oct 2009	A1
20090269248	Falb et al.	Oct 2009	A1
20090286687	Dressman et al.	Nov 2009	A1
20100021973	Makarov et al.	Jan 2010	A1
20100021984	Edd et al.	Jan 2010	A1
20100022414	Link et al.	Jan 2010	A1
20100035254	Williams	Feb 2010	A1
20100069263	Shendure et al.	Mar 2010	A1
20100086914	Bentley et al.	Apr 2010	A1
20100105866	Fraden et al.	Apr 2010	A1
20100130369	Shenderov et al.	May 2010	A1
20100137163	Link et al.	Jun 2010	A1
20100184928	Kumacheva	Jul 2010	A1
20100210479	Griffiths et al.	Aug 2010	A1
20100216153	Lapidus et al.	Aug 2010	A1
20100248991	Roesler et al.	Sep 2010	A1
20100304982	Hinz et al.	Dec 2010	A1
20110033854	Drmanac et al.	Feb 2011	A1
20110071053	Drmanac et al.	Mar 2011	A1
20110091366	Kendall et al.	Apr 2011	A1
20110092376	Colston, Jr. et al.	Apr 2011	A1
20110160078	Fodor et al.	Jun 2011	A1
20110195496	Muraguchi et al.	Aug 2011	A1
20110217736	Hindson	Sep 2011	A1
20110218123	Weitz et al.	Sep 2011	A1
20110257889	Klammer et al.	Oct 2011	A1
20110263457	Krutzik et al.	Oct 2011	A1
20110267457	Weitz et al.	Nov 2011	A1
20110281738	Drmanac et al.	Nov 2011	A1
20110305761	Shum et al.	Dec 2011	A1
20110319281	Drmanac	Dec 2011	A1
20120000777	Garrell et al.	Jan 2012	A1
20120010098	Griffiths et al.	Jan 2012	A1
20120010107	Griffiths et al.	Jan 2012	A1
20120015382	Weitz et al.	Jan 2012	A1
20120015822	Weitz et al.	Jan 2012	A1
20120041727	Mishra et al.	Feb 2012	A1
20120071331	Casbon et al.	Mar 2012	A1
20120121481	Romanowsky et al.	May 2012	A1
20120132288	Weitz et al.	May 2012	A1
20120135893	Drmanac et al.	May 2012	A1
20120172259	Rigatti et al.	Jul 2012	A1
20120184449	Hixson et al.	Jul 2012	A1
20120190032	Ness et al.	Jul 2012	A1
20120196288	Beer	Aug 2012	A1
20120211084	Weitz et al.	Aug 2012	A1
20120219947	Yurkovetsky et al.	Aug 2012	A1
20120220494	Samuels et al.	Aug 2012	A1
20120220497	Jacobson et al.	Aug 2012	A1
20120222748	Weitz et al.	Sep 2012	A1
20120230338	Ganeshalingam et al.	Sep 2012	A1
20120309002	Link	Dec 2012	A1
20130028812	Prieto et al.	Jan 2013	A1
20130079231	Pushkarev et al.	Mar 2013	A1
20130109575	Kleinschmidt et al.	May 2013	A1
20130130919	Chen et al.	May 2013	A1
20130157870	Pushkarev et al.	Jun 2013	A1
20130157899	Adler, Jr. et al.	Jun 2013	A1
20130178368	Griffiths et al.	Jul 2013	A1
20130185096	Giusti et al.	Jul 2013	A1
20130189700	So et al.	Jul 2013	A1
20130203605	Shendure et al.	Aug 2013	A1
20130210639	Link et al.	Aug 2013	A1
20130225418	Watson	Aug 2013	A1
20130268206	Porreca et al.	Oct 2013	A1
20130274117	Church et al.	Oct 2013	A1
20130311106	White et al.	Nov 2013	A1
20130317755	Mishra et al.	Nov 2013	A1
20140037514	Stone et al.	Feb 2014	A1
20140057799	Johnson et al.	Feb 2014	A1
20140065234	Shum et al.	Mar 2014	A1
20140155295	Hindson et al.	Jun 2014	A1
20140194323	Gillevet et al.	Jul 2014	A1
20140199731	Agresti et al.	Jul 2014	A1
20140200166	Van Rooyen et al.	Jul 2014	A1
20140214334	Plattner et al.	Jul 2014	A1
20140221239	Carman et al.	Aug 2014	A1
20140227706	Kato et al.	Aug 2014	A1
20140235506	Hindson et al.	Aug 2014	A1
20140272996	Bemis	Sep 2014	A1
20140274740	Srinivasan et al.	Sep 2014	A1
20140287963	Hindson et al.	Sep 2014	A1
20140302503	Lowe et al.	Oct 2014	A1
20140323316	Drmanac et al.	Oct 2014	A1
20140338753	Sperling et al.	Nov 2014	A1
20140378322	Hindson et al.	Dec 2014	A1
20140378345	Hindson et al.	Dec 2014	A1
20140378349	Hindson et al.	Dec 2014	A1
20150005199	Hindson et al.	Jan 2015	A1
20150005200	Hindson et al.	Jan 2015	A1
20150011430	Saxonov	Jan 2015	A1
20150011432	Saxonov et al.	Jan 2015	A1
20150066385	Schnall-Levin et al.	Mar 2015	A1
20150111256	Church et al.	Apr 2015	A1
20150133344	Shendure et al.	May 2015	A1
20150218633	Hindson et al.	Aug 2015	A1
20150220532	Wong et al.	Aug 2015	A1
20150224466	Hindson et al.	Aug 2015	A1
20150267191	Steelman et al.	Sep 2015	A1
20150275289	Otwinowski et al.	Oct 2015	A1
20150298091	Weitz et al.	Oct 2015	A1
20150299772	Zhang	Oct 2015	A1
20150361418	Reed	Dec 2015	A1
20150376605	Jarosz et al.	Dec 2015	A1
20150376608	Kaper et al.	Dec 2015	A1
20150376609	Hindson et al.	Dec 2015	A1
20150376700	Schnall-Levin et al.	Dec 2015	A1
20150379196	Schnall-Levin et al.	Dec 2015	A1
20160008778	Weitz et al.	Jan 2016	A1
20160024558	Hardenbol et al.	Jan 2016	A1
20160024572	Shishkin et al.	Jan 2016	A1
20160053253	Salathia et al.	Feb 2016	A1
20160059204	Hindson et al.	Mar 2016	A1
20160060621	Agresti et al.	Mar 2016	A1
20160122753	Mikkelsen et al.	May 2016	A1
20160122817	Jarosz et al.	May 2016	A1
20160203196	Schnall-Levin et al.	Jul 2016	A1
20160232291	Kyriazopoulou-Panagiotopoulou et al.	Aug 2016	A1
20160244809	Belgrader et al.	Aug 2016	A1
20160281160	Jarosz et al.	Sep 2016	A1
20160289769	Schwartz et al.	Oct 2016	A1
20160304860	Hindson et al.	Oct 2016	A1
20160314242	Schnall-Levin et al.	Oct 2016	A1
20160348093	Price et al.	Dec 2016	A1
20160350478	Chin et al.	Dec 2016	A1
20170016041	Greenfield et al.	Jan 2017	A1
20170128937	Hung et al.	May 2017	A1
20170144161	Hindson et al.	May 2017	A1
20170145476	Ryvkin et al.	May 2017	A1
20170159109	Zheng et al.	Jun 2017	A1
20170235876	Jaffe et al.	Aug 2017	A1
20170260584	Zheng et al.	Sep 2017	A1
20180030515	Regev et al.	Feb 2018	A1
20180071695	Weitz et al.	Mar 2018	A1
20180080075	Brenner et al.	Mar 2018	A1
20180105808	Mikkelsen et al.	Apr 2018	A1
20180196781	Wong et al.	Jul 2018	A1
20180216162	Belhocine et al.	Aug 2018	A1
20180265928	Schnall-Levin et al.	Sep 2018	A1
20180312822	Lee et al.	Nov 2018	A1
20180312873	Zheng	Nov 2018	A1
20180334670	Bharadwaj et al.	Nov 2018	A1
20180340169	Belhocine et al.	Nov 2018	A1
20180371545	Wong et al.	Dec 2018	A1
20190060890	Bharadwaj et al.	Feb 2019	A1
20190060904	Bharadwaj et al.	Feb 2019	A1
20190060905	Bharadwaj et al.	Feb 2019	A1
20190064173	Bharadwaj et al.	Feb 2019	A1
20190071656	Chang et al.	Mar 2019	A1
20190085391	Hindson et al.	Mar 2019	A1
20190100632	Delaney et al.	Apr 2019	A1
20190127731	McDermott	May 2019	A1
20190134633	Bharadwaj et al.	May 2019	A1
20190136316	Hindson et al.	May 2019	A1
20190153532	Bharadwaj et al.	May 2019	A1
20190176152	Bharadwaj et al.	Jun 2019	A1
20190177800	Boutet et al.	Jun 2019	A1
20190249226	Bent et al.	Aug 2019	A1
20190323088	Boutet et al.	Oct 2019	A1
20190345636	McDermott et al.	Nov 2019	A1
20190352717	Schnall-Levin	Nov 2019	A1
20190367997	Bent et al.	Dec 2019	A1
20190376118	Belhocine et al.	Dec 2019	A1
20200002763	Belgrader et al.	Jan 2020	A1
20200005902	Mellen et al.	Jan 2020	A1
20200032335	Alvarado Martinez	Jan 2020	A1
20200033237	Hindson et al.	Jan 2020	A1
20200033366	Alvarado Martinez	Jan 2020	A1
20210190770	Delaney et al.	Jun 2021	A1
20210270703	Abousoud	Sep 2021	A1
20240002914	Pfeiffer et al.	Jan 2024	A1

Foreign Referenced Citations (186)

Number	Date	Country
0249007	Dec 1987	EP
0637996	Jul 1997	EP
1019496	Sep 2004	EP
1482036	Oct 2007	EP
1841879	Oct 2007	EP
1594980	Nov 2009	EP
1967592	Apr 2010	EP
2258846	Dec 2010	EP
2145955	Feb 2012	EP
1905828	Aug 2012	EP
2136786	Oct 2012	EP
1908832	Dec 2012	EP
2540389	Jan 2013	EP
2635679	Apr 2017	EP
2097692	Nov 1982	GB
2097692	May 1985	GB
2485850	May 2012	GB
S5949832	Mar 1984	JP
2006507921	Mar 2006	JP
2006289250	Oct 2006	JP
2007193708	Aug 2007	JP
2007268350	Oct 2007	JP
2009208074	Sep 2009	JP
2012525147	Oct 2012	JP
2321638	Apr 2008	RU
WO-8402000	May 1984	WO
WO-9530782	Nov 1995	WO
WO-9629629	Sep 1996	WO
WO-9641011	Dec 1996	WO
WO-9909217	Feb 1999	WO
WO-9952708	Oct 1999	WO
WO-0008212	Feb 2000	WO
WO-2000008212	Feb 2000	WO
WO-0026412	May 2000	WO
WO-2001002850	Jan 2001	WO
WO-0114589	Mar 2001	WO
WO-0189787	Nov 2001	WO
WO-0190418	Nov 2001	WO
WO-0231203	Apr 2002	WO
WO-02086148	Oct 2002	WO
WO-03096223	Nov 2003	WO
WO-2004002627	Jan 2004	WO
WO-2004010106	Jan 2004	WO
WO-2004065617	Aug 2004	WO
WO-2004069849	Aug 2004	WO
WO-2004091763	Oct 2004	WO
WO-2004102204	Nov 2004	WO
WO-2004103565	Dec 2004	WO
WO-2004105734	Dec 2004	WO
WO-2005002730	Jan 2005	WO
WO-2005021151	Mar 2005	WO
WO-2005023331	Mar 2005	WO
WO-2005040406	May 2005	WO
WO-2005049787	Jun 2005	WO
WO-2005082098	Sep 2005	WO
WO-2006030993	Mar 2006	WO
WO-2006040551	Apr 2006	WO
WO-2006078841	Jul 2006	WO
WO-2006096571	Sep 2006	WO
WO-2007001448	Jan 2007	WO
WO-2007002490	Jan 2007	WO
WO-2007024840	Mar 2007	WO
WO-2007081385	Jul 2007	WO
WO-2007081387	Jul 2007	WO
WO-2007089541	Aug 2007	WO
WO-2007114794	Oct 2007	WO
WO-2007121489	Oct 2007	WO
WO-2007133710	Nov 2007	WO
WO-2007138178	Dec 2007	WO
WO-2007139766	Dec 2007	WO
WO-2007140015	Dec 2007	WO
WO-2007147079	Dec 2007	WO
WO-2007149432	Dec 2007	WO
WO-2008021123	Feb 2008	WO
WO-2008091792	Jul 2008	WO
WO-2008102057	Aug 2008	WO
WO-2008109176	Sep 2008	WO
WO-2008121342	Oct 2008	WO
WO-2008134153	Nov 2008	WO
WO-2008150432	Dec 2008	WO
WO-2009005680	Jan 2009	WO
WO-2009011808	Jan 2009	WO
WO-2009015296	Jan 2009	WO
WO-2009023821	Feb 2009	WO
WO-2009061372	May 2009	WO
WO-2009085215	Jul 2009	WO
WO-2009152928	Dec 2009	WO
WO-2010004018	Jan 2010	WO
WO-2010033200	Mar 2010	WO
WO-2010104604	Sep 2010	WO
WO-2010115154	Oct 2010	WO
WO-2010117620	Oct 2010	WO
WO-2010127304	Nov 2010	WO
WO-2010148039	Dec 2010	WO
WO-2010151776	Dec 2010	WO
WO-2011028539	Mar 2011	WO
WO-2011047870	Apr 2011	WO
WO-2011056546	May 2011	WO
WO-2011066476	Jun 2011	WO
WO-2011074960	Jun 2011	WO
WO-2012012037	Jan 2012	WO
WO-2012048341	Apr 2012	WO
WO-2012055929	May 2012	WO
WO-2012061832	May 2012	WO
WO-2012083225	Jun 2012	WO
WO-2012100216	Jul 2012	WO
WO-2012106546	Aug 2012	WO
WO-2012112804	Aug 2012	WO
WO-2012112970	Aug 2012	WO
WO-2012116331	Aug 2012	WO
WO-2012142531	Oct 2012	WO
WO-2012142611	Oct 2012	WO
WO-2012149042	Nov 2012	WO
WO-2012166425	Dec 2012	WO
WO-2012167142	Dec 2012	WO
WO-2013019751	Feb 2013	WO
WO-2013035114	Mar 2013	WO
WO-2013036929	Mar 2013	WO
WO-2013055955	Apr 2013	WO
WO-2013096643	Jun 2013	WO
WO-2013123125	Aug 2013	WO
WO-2013126741	Aug 2013	WO
WO-2013134261	Sep 2013	WO
WO-2013177220	Nov 2013	WO
WO-2014028378	Feb 2014	WO
WO-2014028537	Feb 2014	WO
WO-2014093676	Jun 2014	WO
WO-2014108810	Jul 2014	WO
WO-2014132497	Sep 2014	WO
WO-2014165559	Oct 2014	WO
WO-2015015199	Feb 2015	WO
WO-2015044428	Apr 2015	WO
WO-2015157567	Oct 2015	WO
WO-2015164212	Oct 2015	WO
WO-2015200891	Dec 2015	WO
WO-2016040476	Mar 2016	WO
WO-2016061517	Apr 2016	WO
WO-2016126871	Aug 2016	WO
WO-2016130578	Aug 2016	WO
WO-2016168584	Oct 2016	WO
WO-2017015075	Jan 2017	WO
WO-2017066231	Apr 2017	WO
WO-2017180949	Oct 2017	WO
WO-2017184707	Oct 2017	WO
WO-2017197343	Nov 2017	WO
WO-2018039338	Mar 2018	WO
WO-2018091676	May 2018	WO
WO-2018119301	Jun 2018	WO
WO-2018119447	Jun 2018	WO
WO-2018172726	Sep 2018	WO
WO-2018191701	Oct 2018	WO
WO-2018213643	Nov 2018	WO
WO-2018226546	Dec 2018	WO
WO-2018236615	Dec 2018	WO
WO-2019028166	Feb 2019	WO
WO-2019040637	Feb 2019	WO
WO-2019083852	May 2019	WO
WO-2019084043	May 2019	WO
WO-2019084165	May 2019	WO
WO-2019108851	Jun 2019	WO
WO-2019113235	Jun 2019	WO
WO-2019118355	Jun 2019	WO
WO-2019126789	Jun 2019	WO
WO-2019148042	Aug 2019	WO
WO-2019152108	Aug 2019	WO
WO-2019157529	Aug 2019	WO
WO-2019165318	Aug 2019	WO
WO-2019169028	Sep 2019	WO
WO-2019169347	Sep 2019	WO
WO-2019191321	Oct 2019	WO
WO-2019217758	Nov 2019	WO
WO-2020028882	Feb 2020	WO
WO-2020142779	Jul 2020	WO
WO-2020168013	Aug 2020	WO
WO-2020198532	Oct 2020	WO
WO-2021046475	Mar 2021	WO
WO-2021133845	Jul 2021	WO
WO-2021207610	Oct 2021	WO
WO-2021212042	Oct 2021	WO
WO-2021222302	Nov 2021	WO
WO-2021222301	Nov 2021	WO
WO-2022103712	May 2022	WO
WO-2022182682	Sep 2022	WO
WO-2022182785	Sep 2022	WO
WO-2022271908	Dec 2022	WO
WO-2023076528	May 2023	WO

Non-Patent Literature Citations (278)

Entry
Levene et al. Zero-Mode Waveguides for Single-Molecule Analysis at High Concentrations Jan. 2003 vol. 299 Science.
Au et al. Improving PacBio Long Read Accuracy by Short Read Alignment PLOS ONE Oct. 2012 | vol. 7 | Issue 10 | e46679; Published Oct. 4, 2012.
10X Genomics, Inc. CG000153 Rev A. Chromium Single Cell DNA Reagent Kits User Guide. 2018.
10X Genomics, Inc. CG000184 Rev A. Chromium Single Cell 3′ Reagent Kits v3 User Guide with Feature Barcoding Technology for CRISPR Screening. 2018.
10X Genomics, Inc. CG000185 Rev B. Chromium Single Cell 3′ Reagent Kits User Guide with Feature Barcoding Technology for Cell Surface Protein. 2018.
10X Genomics, Inc. CG000208 Rev E. Chromium Next GEM Single Cell V(D)J reagent Kits v1.1 User Guide with Feature Barcode Technology for Cell Surface Protein. 2020.
10X Genomics, Inc. CG000209 Rev D. Chromium Next GEM Single Cell ATAC Reagent Kits v1.1 User Guide. 2020.
10X Genomics, Inc. CG000239 Rev B. Visium Spatial Gene Expression Reagent Kits User Guide. 2020.
10X Genomics, Inc. CG00026. Chromium Single Cell 3' Reagent Kit User Guide. 2016.
10X Genomics, Inc. LIT00003 Rev B Chromium Genome Solution Application Note. 2017.
Co-pending U.S. Appl. No. 16/708,214, filed Dec. 9, 2019.
Co-pending U.S. Appl. No. 16/737,762, filed Jan. 8, 2020.
Co-pending U.S. Appl. No. 16/737,770, filed Jan. 8, 2020.
Co-pending U.S. Appl. No. 16/789,273, filed Feb. 12, 2020.
Co-pending U.S. Appl. No. 16/789,287, filed Feb. 12, 2020.
Co-pending U.S. Appl. No. 16/800,450, filed Feb. 25, 2020.
Co-pending U.S. Appl. No. 16/814,908, filed Mar. 10, 2020.
PCT/US2020/017785 Application filed on Feb. 11, 2020 by Ziraldo, Solongo B. et al.
PCT/US2020/017789 Application filed on Feb. 11, 2020 by Belhocine, Zahara Kamila et al.
Aitman, et al. Copy Number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature. Feb. 16, 2006;439(7078):851-5.
Balikova, et al. Autosomal-dominant microtia linked to five tandem copies of a copy-number-variable region at chromosome 4p16. Am J Hum Genet. Jan. 2008;82(1):181-7. doi: 10.1016/j.ajhg.2007.08.001.
Bansal et al. “An MCMC algorithm for haplotype assembly from whole-genome sequence data,” (2008) Genome Res 18:1336-1346.
Bansal et al. “HapCUT: an efficient and accurate algorithm for the haplotype assembly problem,” Bioinformatics (2008) 24:i153-i159.
“Bedtools: General Usage,” http://bedtools.readthedocs.io/en/latest/content/generalusage.html; Retrieved from the Internet Jul. 8, 2016.
Braeckmans et al., Scanning the Code. Modern Drug Discovery. 2003:28-32.
Bray, “The JavaScript Object Notation (JSON) Data Interchange Format,” Mar. 2014, retrieved from the Internet Feb. 15, 2015; https://tools.ietf.org/html/rfc7159.
Cappuzzo, et al. Increased HER2 gene copy number is associated with response to gefitinib therapy in epidermal growth factor receptor-positive non-small-cell lung cancer patients. J Clin Oncol. Aug. 1, 2005;23(22):5007-18.
Chen et al. “BreakDancer: an algorithm for high-resolution mapping of genomic structural variation,” Nature Methods (2009) 6(9):677-681.
Choi, et al. Identification of novel isoforms of the EML4-ALK transforming gene in non-small cell lung cancer. Cancer Res. Jul. 1, 2008;68(13):4971-6. doi: 10.1158/0008-5472.CAN-07-6158.
Chokkalingam, et al. Probing cellular heterogeneity in cytokine-secreting immune cells using droplet-based microfluidics. Lab Chip. Dec. 21, 2013;13(24):4740-4. doi: 10.1039/c3lc50945a.
Cleary et al. “Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data,” J Comput Biol (2014) 21:405-419.
Cook, et al. Copy-Number variations associated with neuropsychiatric conditions. Nature. Oct. 16, 2008;455(7215):919-23. doi: 10.1038/nature07458.
Ekblom, R. et al. “A field guide to whole-genome sequencing, assembly and annotation” Evolutionary Apps (Jun. 24, 2014) 7(9):1026-1042.
Fabi, et al. Correlation of efficacy between EGFR gene copy number and lapatinib/capecitabine therapy in HER2-positive metastatic breast cancer. J. Clin. Oncol. 2010; 28:15S. 2010 ASCO Meeting abstract Jun. 14, 2010:1059.
Fisher, et al. A scalable, fully automated process for construction of sequence-ready human exome targeted capture libraries. Genome Biol. 2011;12(1):R1. doi: 10.1186/GB-2011-12-1-r1. Epub Jan. 4, 2011.
Fulton, et al. Advanced multiplexed analysis with the FlowMetrix system. Clin Chem. Sep. 1997;43(9):1749-56.
Ghadessy, et al. Directed evolution of polymerase function by compartmentalized self-replication. Proc Natl Acad Sci USA. 2001;98:4552-4557.
Gonzalez, et al. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science. Mar. 4, 2005;307(5714):1434-40. Epub Jan. 6, 2005.
Gordon et al. “Consed: A Graphical Tool for Sequence Finishing,” Genome Research (1998) 8:198-202.
Heng et al. “Fast and accurate long-read alignment with Burrows-Wheeler transform,” Bioinformatics (2010) 25(14): 1754-1760.
Huang et al. “EagleView: A genome assembly viewer for next-generation sequencing technologies,” Genome Research (2008) 18:1538-1543.
Huebner, “Quantitative detection of protein expression in single cells using droplet microfluidics”, Chem. Commun. 1218-1220 (2007).
Hug, et al. Measurement of the number of molecules of a single mRNA species in a complex mRNA preparation. J Theor Biol. Apr. 21, 2003;221(4):615-24.
Jarosz, M. et al. “Using 1ng of DNA to detect haplotype phasing and gene fusions from whole exome sequencing of cancer cell lines” Cancer Res (2015) 75(supp15):4742.
Kanehisa et al. “KEGG: Kyoto Encyclopedia of Genes and Genomes,” Nucleic Acids Research (2000) 28:27-30.
Khomiakova et al., Analysis of perfect and mismatched DNA duplexes by a generic hexanucleotide microchip. Mol Biol(Mosk). Jul.-Aug. 2003;37(4):726-41. Russian. Abstract only.
Kim et al. “HapEdit: an accuracy assessment viewer for haplotype assembly using massively parallel DNA-sequencing technologies,” Nucleic Acids Research (2011) pp. 1-5.
Kirkness et al. “Sequencing of isolated sperm cells for direct haplotyping of a human genome,” Genome Res (2013) 23:826-832.
Kitzman et al. “Haplotype-resolved genome sequencing of a Gujarati Indian individual.” Nat Biotechnol (2011) 29:59-63.
Knight, et al. Subtle chromosomal rearrangements in children with unexplained mental retardation. Lancet. Nov. 13, 1999;354(9191):1676-81.
Layer et al. “LUMPY: A probabilistic framework for structural variant discovery,” Genome Biology (2014) 15(6):R84.
Lippert et al. “Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem,” Brief. Bioinform (2002) 3:23-31.
Lo, et al. On the design of clone-based haplotyping. Genome Biol. 2013;14(9):R100.
Lupski. Genomic rearrangements and sporadic disease. Nat Genet. Jul. 2007;39(7 Suppl):S43-7.
Macosko, et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell. May 21, 2015; 161(5): 1202-14. doi: 10.1016/j.cell.2015.05.002.
Marcus. Gene method offers diagnostic hope. The Wall Street Journal. Jul. 11, 2012.
Margulies 2005 Supplementary methods (Year: 2005).
Margulies et al. “Genome sequencing in microfabricated high-density picoliter reactors”, Nature (2005) 437:376-380.
McCoy, R. et al. “Illumina TruSeq Synthetic Long-Reads Empower De Novo Assembly and Resolve Complex, Highly-Repetitive Transposable Elements” PLOS (2014) 9(9):e1016689.
McKenna, Aaron et al. “The Genome Analysis Toolkit: A MapReduce Framework for Analyzing next-Generation DNA Sequencing Data.” Genome Research 20.9 (2010): 1297-1303. PMC. Web. Feb. 2, 2017.
Miller et al. “Assembly Algorithms for next-generation sequencing data,” Genomics, 95 (2010), pp. 315-327.
Mirzabekov, “DNA Sequencing by Hybridization—a Megasequencing Method and a Diagnostic Tool?” Trends in Biotechnology 12(1): 27-32 (1994).
Mouritzen et al., Single nucleotide polymorphism genotyping using locked nucleic acid (LNA). Expert Rev Mol Diagn. Jan. 2003;3(1):27-38.
Myllykangas et al. “Efficient targeted resequencing of human germline and cancer genomes by oligonucleotide-selective sequencing,” Nat Biotechnol, (2011) 29:1024-1027.
Navin. The first five years of single-cell cancer genomics and beyond. Genome Res. Oct. 2015;25(10):1499-507. doi: 10.1101/gr.191098.115.
Peters, B.A. et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature, 487(7406):190-195 (Jul. 11, 2012).
Pinto, et al. Functional impact of global rare copy number variation in autism spectrum disorders. Nature. Jul. 15, 2010;466(7304):368-72. doi: 10.1038/nature09146. Epub Jun. 9, 2010.
Pushkarev et al. “Single-molecule sequencing of an individual human genome,” Nature Biotech (2009) 17:847-850.
Ritz, A. et al. “Characterization of structural variants with single molecule and hybrid sequencing approaches” Bioinformatics (2014) 30(24):3458-3466.
Ropers. New perspectives for the elucidation of genetic disorders. Am J Hum Genet. Aug. 2007;81(2):199-207. Epub Jun. 29, 2007.
Schmitt, “Bead-based multiplex genotyping of human papillomaviruses”, J. Clinical Microbiol., 44:2 504-512 (2006).
Sebat, et al. Strong association of de novo copy number mutations with autism. Science. Apr. 20, 2007;316(5823):445-9. Epub Mar. 15, 2007.
Shendure et al. Accurate Multiplex Polony Sequencing of an Evolved bacterial Genome. Science (2005) 309:1728-1732.
Shlien, et al. Copy number variations and cancer. Genome Med. Jun. 16, 2009;1(6):62. doi: 10.1186/gm62.
Shlien, et al. Excessive genomic DNA copy number variation in the Li-Fraumeni cancer predisposition syndrome. Proc Natl Acad Sci U S A. Aug. 12, 2008;105(32):11264-9. doi: 10.1073/pnas.0802970105. Epub Aug. 6, 2008.
Simeonov et al., Single nucleotide polymorphism genotyping using short, fluorescently labeled locked nucleic acid (LNA) probes and fluorescence polarization detection. Nucleic Acids Res. Sep. 1, 2002;30(17):e91.
Sorokin et al., Discrimination between perfect and mismatched duplexes with oligonucleotide gel microchips: role of thermodynamic and kinetic effects during hybridization. J Biomol Struct Dyn. Jun. 2005;22(6):725-34.
“SSH Tunnel—Local and Remote Port Forwarding Explained With Examples,” Trackets Blog, http://blog.trackets.com/2014/05/17/ssh-tunnel-local-and-remote-port-forwarding-explained with-examples.html; Retrieved from the Internet Jul. 7, 2016.
Tewhey, et al. Microdroplet-based PCR amplification for large-scale targeted sequencing. Nat Biotechnol. Nov. 2009;27(11):1025-31. doi: 10.1038/nbt.1583. Epub Nov. 1, 2009.
The SAM/BAM Format Specification Working Group, “Sequence Alignment/Map Format Specification,” Dec. 28, 2014.
Voskoboynik, A. et al. The genome sequence of the colonial chordate, Botryllus schlosseri. eLife, 2:e00569 (2013). doi: 10.7554/eLife.00569. Epub Jul. 2, 2013.
Wang, et al. Digital karyotyping. Proc Natl Acad Sci U S A. Dec. 10, 2002;99(25): 16156-61. Epub Dec. 2, 2002.
Wang et al., Single nucleotide polymorphism discrimination assisted by improved base stacking hybridization using oligonucleotide microarrays. Biotechniques. 2003;35:300-08.
Weaver, “Rapid clonal growth measurements at the single-cell level: gel microdroplets and flow cytometry”, Biotechnology, 9:873-877 (1991).
Wheeler et al., “Database resources of the National Center for Biotechnology Information,” Nucleic Acids Res. (2007) 35 (Database issue): D5-12.
Zerbino, Daniel, “Velvet Manual—version 1.1,” Aug. 15, 2008, pp. 1-22.
Zerbino, D.R. “Using the Velvet de novo assembler for short-read sequencing technologies” Curr Protoc Bioinformatics. Sep. 2010;Chapter 11:Unit 11.5. doi: 10.1002/0471250953.bi1105s31.
Zerbino et al. “Velvet: Algorithms for de novo short read assembly using de Bruijn graphs,” Genome Research (2008) 18:821-829.
Zheng, X.Y. et al. “Haplotyping germline and cancer genomes with high-throughput linked-read sequencing” Nature Biotech (Feb. 1, 2016) 34(3):303-311.
Zong et al. Genome-Wide Detection of Single Nucleotide and Copy Number Variations of a Single Human Cell. Science 338(6114):1622-1626 (2012).
Co-pending U.S. Appl. No. 17/014,909, inventor Giresi; Paul, filed Sep. 8, 2020.
Co-pending U.S. Appl. No. 17/148,942, inventors McDermott; Geoffrey et al., filed Jan. 14, 2021.
Co-pending U.S. Appl. No. 17/166,982, inventors McDermott; Geoffrey et al., filed Feb. 3, 2021.
Co-pending U.S. Appl. No. 17/175,542, inventors Maheshwari; Arundhati Shamoni et al., filed Feb. 12, 2021.
Co-pending U.S. Appl. No. 17/220,303, inventor Walter; Dagmar, filed Apr. 1, 2021.
Co-pending U.S. Appl. No. 17/318,364, inventors Bava; Felice Alessio et al., filed May 12, 2021.
Co-pending U.S. Appl. No. 17/381,612, inventor Martinez; Luigi Jhon Alvarado, filed Jul. 21, 2021.
Co-pending U.S. Appl. No. 17/499,039, inventors Pfeiffer; Katherine et al., filed Oct. 12, 2021.
Co-pending U.S. Appl. No. 17/512,241, inventors Hill; Andrew John et al., filed Oct. 27, 2021.
Co-pending U.S. Appl. No. 17/517,408, inventors Salmanzadeh; Alireza et al., filed Nov. 2, 2021.
Co-pending U.S. Appl. No. 17/518,213, inventor Lund; Paul Eugene, filed Nov. 3, 2021.
Co-pending U.S. Appl. No. 17/522,741, inventors Zheng; Xinying et al., filed Nov. 9, 2021.
Co-pending U.S. Appl. No. 17/545,862, inventor Katherine; Pfeiffer, filed Dec. 8, 2021.
Co-pending U.S. Appl. No. 17/573,350, inventor Corey; M. Nemec, filed Jan. 11, 2022.
Co-pending U.S. Appl. No. 17/580,947, inventor Gibbons; Michael, filed Jan. 21, 2022.
Co-pending U.S. Appl. No. 17/831,835, inventor Martinez; Luigi Jhon Alvarado, filed Jun. 3, 2022.
Co-pending U.S. Appl. No. 17/957,781, inventor Bava; Felice Alessio, filed Sep. 30, 2022.
Co-pending U.S. Appl. No. 18/046,843, inventor Toh; Mckenzi, filed Oct. 14, 2022.
Droplet Based Sequencing (slides) dated (Mar. 12, 2008).
Abate, et al. Beating Poisson encapsulation statistics using close-packed ordering. Lab Chip. Sep. 21, 2009;9(18):2628-31. doi: 10.1039/b909386a. Epub Jul. 28, 2009.
Abate, et al. High-throughput injection with microfluidics using picoinjectors. Proc Natl Acad Sci U S A. Nov. 9, 2010;107(45):19163-6. doi: 10.1073/pnas.1006888107. Epub Oct. 20, 2010.
Abate et al., Valve-based flow focusing for drop formation. Appl Phys Lett. 2009;94. 3 pages.
Agresti, et al. Selection of ribozymes that catalyse multiple-turnover Diels-Alder cycloadditions by using in vitro compartmentalization. Proc Natl Acad Sci U S A. Nov. 8, 2005;102(45):16170-5. Epub Oct. 31, 2005.
Akselband, “Enrichment of slow-growing marine microorganisms from mixed cultures using gel microdrop (GMD) growth assay and fluorescence-activated cell sorting”, J. Exp. Marine Biol., 329: 196-205 (2006).
Akselband, “Rapid mycobacteria drug susceptibility testing using gel microdrop (GMD) growth assay and flow cytometry”, J. Microbiol. Methods, 62:181-197 (2005).
Attia, et al. Micro-injection moulding of polymer microfluidic devices. Microfluidics and nanofluidics. 2009; 7(1):1-28.
Baret, et al. Fluorescence-activated droplet sorting (FADS): efficient microfluidic cell sorting based on enzymatic activity. Lab Chip. Jul. 7, 2009;9(13):1850-8. doi: 10.1039/b902504a. Epub Apr. 23, 2009.
Bentley et al. “Accurate whole human genome sequencing using reversible terminator chemistry,” (2008) Nature 456:53-59.
Boone, et al. Plastic advances microfluidic devices. The devices debuted in silicon and glass, but plastic fabrication may make them hugely successful in biotechnology application. Analytical Chemistry. Feb. 2002; 78A-86A.
Bransky, et al. A microfluidic droplet generator based on a piezoelectric actuator. Lab Chip. Feb. 21, 2009;9(4):516-20. doi: 10.1039/b814810d. Epub Nov. 20, 2008.
Brouzes, et al. Droplet microfluidic technology for single-cell high-throughput screening. Proc Natl Acad Sci U S A. Aug. 25, 2009;106(34):14195-200. doi: 10.1073/pnas.0903542106. Epub Jul. 15, 2009.
Browning, S.R. et al. “Haplotype Phasing: Existing Methods and New Developments” Nat Rev Genet (Sep. 16, 2011) 12(10):703-714.
Carroll, “The selection of high-producing cell lines using flow cytometry and cell sorting”, Exp. Op. Biol. Ther., 4:11 1821-1829 (2004).
Chaudhary “A rapid method of cloning functional variable-region antibody genes in Escherichia coli as single-chain immunotoxins” Proc. Natl. Acad. Sci USA 87: 1066-1070 (Feb. 1990).
Chechetkin et al., Sequencing by hybridization with the generic 6-mer oligonucleotide microarray: an advanced scheme for data processing. J Biomol Struct Dyn. Aug. 2000;18(1):83-101.
Chou, et al. Disposable Microdevices for DNA Analysis and Cell Sorting. Proc. Solid-State Sensor and Actuator Workshop, Hilton Head, SC. Jun. 8-11, 1998; 11-14.
De Bruin et al., UBS Investment Research. Q-Series®: DNA Sequencing. UBS Securities LLC. Jul. 12, 2007. 15 pages.
Demirci, et al. Single cell epitaxy by acoustic picolitre droplets. Lab Chip. Sep. 2007;7(9):1139-45. Epub Jul. 10, 2007.
Doerr, “The smallest bioreactor”, Nature Methods, 2:5 326 (2005).
Dowding, et al. Oil core/polymer shell microcapsules by internal phase separation from emulsion droplets. II: controlling the release profile of active molecules. Langmuir. Jun. 7, 2005;21(12):5278-84.
Draper, et al. Compartmentalization of electrophoretically separated analytes in a multiphase microfluidic platform. Anal Chem. Jul. 3, 2012;84(13):5801-8. doi: 10.1021/ac301141x. Epub Jun. 13, 2012.
Dressler, et al. Droplet-based microfluidics enabling impact on drug discovery. J Biomol Screen. Apr. 2014;19(4):483-96. doi: 10.1177/1087057113510401. Epub Nov. 15, 2013.
Eid, et al. Real-time DNA sequencing from single polymerase molecules. Science. Jan. 2, 2009;323(5910):133-8. doi: 10.1126/science.1162986. Epub Nov. 20, 2008.
Fredrickson, et al. Macro-to-micro interfaces for microfluidic devices. Lab Chip. Dec. 2004;4(6):526-33. Epub Nov. 10, 2004.
Freiberg, et al. Polymer microspheres for controlled drug release. Int J Pharm. Sep. 10, 2004;282(1-2):1-18.
Fu, “A micro fabricated fluorescence-activated cell sorter”, Nature Biotech., 17:1109-1111 (1997).
Garstecki, et al. Formation of monodisperse bubbles in a microfluidic flow-focusing device. Applied Physics Letters. 2004; 85(13):2649-2651. DOI: 10.1063/1.1796526.
Gartner, et al. The Microfluidic Toolbox—examples for fluidic interfaces and standardization concepts. Proc. SPIE 4982, Microfluidics, BioMEMS, and Medical Microsystems, (Jan. 17, 2003); doi: 10.1117/12.479566.
Granieri, Lucia. Droplet-based microfluidics and engineering of tissue plasminogen activator for biomedical applications. Ph.D. Thesis, Nov. 13, 2009 (131 pages).
Grasland-Mongrain, et al. Droplet coalescence in microfluidic devices. Jan.-Jul. 2003. 31 pages. http://www.eleves.ens.fr/home/grasland/rapports/stage4.pdf.
He, “Selective Encapsulation of Single Cells and Subcellular Organelles into Picoliter- and Femtoliter-Volume Droplets” Anal. Chem 77: 1539-1544 (2005).
Illumina, Inc. An Introduction to Next-Generation Sequencing Technology. Feb. 28, 2012.
Jena, et al. Cyclic olefin copolymer based microfluidic devices for biochip applications: Ultraviolet surface grafting using 2-methacryloyloxyethyl phosphorylcholine. Biomicrofluidics. Mar. 2012;6(1):12822-1 to 12822-12. doi: 10.1063/1.3682098. Epub Mar. 15, 2012.
Jung, et al. Micro machining of injection mold inserts for fluidic channel of polymeric biochips. Sensors. 2007; 7(8):1643-1654.
Kim et al., Albumin loaded microsphere of amphiphilic poly(ethylene glycol)/poly(a-ester) multiblock copolymer. Eu. J. Pharm. Sci. 2004;23:245-51. Available online Sep. 27, 2004.
Kim, et al. Fabrication of monodisperse gel shells and functional microgels in microfluidic devices. Angew Chem Int Ed Engl. 2007;46(11):1819-22.
Kim, et al. Rapid prototyping of microfluidic systems using a PDMS/polymer tape composite. Lab Chip. May 7, 2009;9(9):1290-3. doi: 10.1039/b818389a. Epub Feb. 10, 2009.
Kitzman, et al. Noninvasive whole-genome sequencing of a human fetus. Sci Transl Med. Jun. 6, 2012;4(137):137ra76. doi: 10.1126/scitranslmed.3004323.
Klein, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. May 21, 2015;161(5):1187-201. doi: 10.1016/j.cell.2015.04.044.
Kutyavin, et al. Oligonucleotides containing 2-aminoadenine and 2-thiothymine act as selectively binding complementary agents. Biochemistry. Aug. 27, 1996;35(34):11170-6.
Lagus, et al. A review of the theory, methods and recent applications of high-throughput single-cell droplet microfluidics. J. Phys. D: Appl. Phys. (2013) 46:114005. (21 pages).
Li, Y., et al., “PEGylated PLGA Nanoparticles as protein carriers: synthesis, preparation and biodistribution in rats,” Journal of Controlled Release, vol. 71, pp. 203-211 (2001).
Liu, et al. Preparation of uniform-sized PLA microcapsules by combining Shirasu porous glass membrane emulsification technique and multiple emulsion-solvent evaporation method. J Control Release. Mar. 2, 2005;103(1):31-43. Epub Dec. 21, 2004.
Liu, et al. Smart thermo-triggered squirting capsules for nanoparticle delivery. Soft Matter. 2010; 6(16):3759-3763.
Loscertales, I.G., et al., “Micro/Nano Encapsulation via Electrified Coaxial Liquid Jets,” Science, vol. 295, pp. 1695-1698 (2002).
Love, “A microengraving method for rapid selection of single cells producing antigen-specific antibodies”, Nature Biotech, 24:6 703 (Jun. 2006).
Lowe, Adam J. Norbornenes and [n]polynorbornanes as molecular scaffolds for anion recognition. Ph.D. Thesis (May 2010). (361 pages).
Makino, et al. Preparation of hydrogel microcapsules: Effects of preparation conditions upon membrane properties. Colloids and Surfaces B: Biointerfaces. Nov. 1998; 12(2), 97-104.
Matochko, et al. Uniform amplification of phage display libraries in monodisperse emulsions. Methods. Sep. 2012;58(1): 18-27. doi: 10.1016/j.ymeth.2012.07.012. Epub Jul. 20, 2012.
Mazutis, et al. Selective droplet coalescence using microfluidic systems. Lab Chip. Apr. 24, 2012;12(10):1800-6. doi: 10.1039/c2lc40121e. Epub Mar. 27, 2012.
Merriman, et al. Progress in ion torrent semiconductor chip based sequencing. Electrophoresis. Dec. 2012;33(23):3397-417. doi: 10.1002/elps.201200424.
Moore, et al. Behavior of capillary valves in centrifugal microfluidic devices prepared by three-dimensional printing. Microfluidics and Nanofluidics. 2011; 10(4):877-888.
Nagashima, et al. Preparation of monodisperse poly (acrylamide-co-acrylic acid) hydrogel microspheres by a membrane emulsification technique and their size-dependent surface properties. Colloids and Surfaces B: Biointerfaces. Jun. 15, 1998; 11(1-2), 47-56.
Nguyen, et al. In situ hybridization to chromosomes stabilized in gel microdrops. Cytometry. 1995; 21:111-119.
Novak, et al. Single cell multiplex gene detection and sequencing using microfluidically generated agarose emulsions. Angew Chem Int Ed Engl. Jan. 10, 2011;50(2):390-5. doi: 10.1002/anie.201006089.
Oberholzer, et al. Polymerase chain reaction in liposomes. Chem Biol. Oct. 1995;2(10):677-82.
Ogawa, et al. Production and characterization of O/W emulsions containing cationic droplets stabilized by lecithin-chitosan membranes. J Agric Food Chem. Apr. 23, 2003;51(9):2806-12.
Okushima, S., et al., “Controlled Production of Monodisperse Double Emulsions by Two-Step Droplet Breakup in Microfluidic Devices,” Langmuir, vol. 20, pp. 9905-9908 (2004).
Perez, C., et al., “Poly(lactic acid)-poly(ethylene glycol) Nanoparticles as new carriers for the delivery of plasmid DNA,” Journal of Controlled Release, vol. 75, pp. 211-224 (2001).
Rotem, et al. High-Throughput Single-Cell Labeling (Hi-SCL) for RNA-Seq Using Drop-Based Microfluidics. PLoS One. May 22, 2015;10(5):e0116328. doi: 10.1371/journal.pone.0116328. eCollection 2015.
Rotem, et al. Single Cell Chip-Seq Using Drop-Based Microfluidics. Abstract #50. Frontiers of Single Cell Analysis, Stanford University Sep. 5-7, 2013.
Ryan, Rapid assay for mycobacterial growth and antibiotic susceptibility using gel microdrop and encapsulation, J. Clinical Microbiol., 33:7 1720-1726 (1995).
Schirinzi et al., Combinatorial sequencing-by-hybridization: Analysis of the NF1 gene. Genet Test. 2006 Spring;10(1):8-17.
Seiffert, et al. Smart microgel capsules from macromolecular precursors. J Am Chem Soc. May 12, 2010;132(18):6606-9. doi: 10.1021/ja102156h.
Shimkus, et al. A chemically cleavable biotinylated nucleotide: usefulness in the recovery of protein-DNA complexes from avidin affinity columns. Proc Natl Acad Sci U S A. May 1985;82(9):2593-7.
Su, et al., Microfluidics-Based Biochips: Technology Issues, Implementation Platforms, and Design-Automation Challenges. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2006;25(2):211-23. (Feb. 2006).
Sun et al., Progress in research and application of liquid-phase chip technology. Chinese Journal Experimental Surgery. May 2005;22(5):639-40.
Tawfik, D.S., et al., “Man-made cell-like compartments for molecular evolution,” Nature Biotechnology, vol. 16, pp. 652-656 (1998).
Tubeleviciute, et al. Compartmentalized self-replication (CSR) selection of Thermococcus litoralis Sh1B DNA polymerase for diminished uracil binding. Protein Eng Des Sel. Aug. 2010;23(8):589-97. doi: 10.1093/protein/gzq032. Epub May 31, 2010.
Turner, et al. Methods for genomic partitioning. Annu Rev Genomics Hum Genet. 2009;10:263-84. doi: 10.1146/annurev-genom-082908-150112. Review.
Wang, et al. A novel thermo-induced self-bursting microcapsule with magnetic-targeting property. Chemphyschem. Oct. 5, 2009;10(14):2405-9.
Whitesides, “Soft lithography in biology and biochemistry”, Annual Review of Biomedical Engineering, 3:335-373 (2001).
Woo, et al. G/C-modified oligodeoxynucleotides with selective complementarity: synthesis and hybridization properties. Nucleic Acids Res. Jul. 1, 1996;24(13):2470-5.
Xia and Whitesides, Soft Lithography, Ann. Rev. Mat. Sci. 28:153-184 (1998).
Yamamoto, et al. Chemical modification of Ce(IV)/EDTA-base artificial restriction DNA cutter for versatile manipulation of double-stranded DNA. Nucleic Acids Research. 2007; 35(7):e53.
Zhang, “Combinatorial marking of cells and organelles with reconstituted fluorescent proteins”, Cell, 119:137-144 (Oct. 1, 2004).
Zhang, et al. Degradable disulfide core-cross-linked micelles as a drug delivery system prepared from vinyl functionalized nucleosides via the RAFT process. Biomacromolecules. Nov. 2008;9(11):3321-31. doi: 10.1021/bm800867n. Epub Oct. 9, 2008.
Zhao, J., et al., “Preparation of hemoglobin-loaded Nano-sized particles with porous structure as oxygen carriers,” Biomaterials, vol. 28, pp. 1414-1422 (2007).
Zhu, et al. Synthesis and self-assembly of highly incompatible polybutadiene-poly(hexafluoropropylene oxide) diblock copolymers. Journal of Polymer Science Part B: Polymer Physics. 2005; 43(24):3685-3694.
Zimmermann et al., Microscale production of hybridomas by hypo-osmolar electrofusion. Hum Antibodies Hybridomas. Jan. 1992;3(1):14-8.
Abate, A.R. et al. “Beating Poisson encapsulation statistics using close-packed ordering” Lab on a Chip (Sep. 21, 2009) 9(18):2628-2631.
Adamson et al., “Production of arrays of chemically distinct nanolitre plugs via repeated splitting in microfluidic devices”, Lab Chip 6(9): 1178-1186 (Sep. 2006).
Agasti, S.S. et al. “Photocleavable DNA barcode-antibody conjugates allow sensitive and multiplexed protein analysis in single cell” J Am Chem Soc (2012) 134(45):18499-18502.
Altemose et al., “Genomic Characterization of Large Heterochromatic Gaps in the Human Genome Assembly,” PLOS Computational Biology, May 15, 2014, vol. 10, Issue 5, 14 pages.
Amini, S. et al. “Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing” Nature Genetics (2014) 46:1343-1349 doi:10.1038/ng.3119.
Anna et al.: Formation of dispersions using “flow focusing” in microchannels: Applied Physics Letters, vol. 82, No. 3, pp. 364-366 (2003).
Baret, “Surfactants in droplet-based microfluidics” Lab Chip 12(3):422-433 (2012).
Beer et al. On-Chip, Real-Time, Single-Copy Polymerase Chain Reaction in Picoliter Droplets. Anal Chem 79:8471-8475 (2007).
Brenner, et al. “In vitro cloning of complex mixtures of DNA on microbeads: physical separation of differentially expressed cDNAs.” Proc Natl Acad Sci U S A. Feb. 15, 2000;97(4):1665-70.
Buchman GW, et al. Selective RNA amplification: a novel method using dUMP-containing primers and uracil DNA glycosylase. PCR Methods Appl. Aug. 1993; 3(1):28-31.
Burns, et al. An Integrated Nanoliter DNA Analysis Device. Science. Oct. 16, 1998;282(5388):484-7.
Burns, et al. Microfabricated structures for integrated DNA analysis. Proc Natl Acad Sci U S A. May 28, 1996; 93(11): 5556-5561.
Burns, et al. The intensification of rapid reactions in multiphase systems using slug flow in capillaries. Lab Chip. Sep. 2001;1(1):10-15. Epub Aug. 9, 2001.
Chen, et al. Chemical transfection of cells in picoliter aqueous droplets in fluorocarbon oil. Anal Chem. Nov. 15, 2011;83(22):8816-20. doi: 10.1021/ac2022794. Epub Oct. 17, 2011.
Chien et al. “Multiport flow-control system for lab-on-a-chip microfluidic devices”, Fresenius J. Anal Chem, 371:106-111 (Jul. 27, 2001).
Chu, et al. Controllable monodisperse multiple emulsions. Angew Chem Int Ed Engl. 2007;46(47):8970-4.
Clausell-Tormos et al., “Droplet-based microfluidic platforms for the encapsulation and screening of mammalian cells and multicellular organisms”, Chem. Biol. 15:427-437 (2008).
Co-pending PCT/US2019/046940, filed Aug. 16, 2019.
Co-pending U.S. Appl. No. 16/575,280, filed Sep. 18, 2019.
Co-pending U.S. Appl. No. 16/434,076, filed Jun. 6, 2019.
Co-pending U.S. Appl. No. 16/434,084, filed Jun. 6, 2019.
Co-pending U.S. Appl. No. 16/434,095, filed Jun. 6, 2019.
Co-pending U.S. Appl. No. 16/434,102, filed Jun. 6, 2019.
Co-pending U.S. Appl. No. 16/530,930, filed Aug. 2, 2019.
Damean, et al. Simultaneous measurement of reactions in microdroplets filled by concentration gradients. Lab Chip. Jun. 21, 2009;9(12):1707-13. doi: 10.1039/b821021g. Epub Mar. 19, 2009.
Depristo et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genet 43:491-498 (2011).
Drmanac et al., Sequencing by hybridization (SBH): advantages, achievements, and opportunities. Adv Biochem Eng Biotechnol. 2002;77:75-101.
Duffy et al., Rapid Prototyping of Microfluidic Systems in Polydimethylsiloxane, Anal Chem 70:4974-4984 (1998).
Eastburn, et al. Ultrahigh-throughput mammalian single-cell reverse-transcriptase polymerase chain reaction in microfluidic droplets. Anal Chem. Aug. 20, 2013;85(16):8016-21. doi: 10.1021/ac402057q. Epub Aug. 8, 2013.
Esser-Kahn, et al. Triggered release from polymer capsules. Macromolecules. 2011; 44:5539-5553.
Gericke, et al. Functional cellulose beads: preparation, characterization, and applications. Chemical reviews 113.7 (2013): 4812-4836.
Guo, et al. Droplet microfluidics for high-throughput biological assays. Lab Chip. Jun. 21, 2012;12(12):2146-55. doi: 10.1039/c2lc21147e. Epub Feb. 9, 2012.
Gyarmati, et al. Reversible disulphide formation in polymer networks: a versatile functional group from synthesis to applications. European Polymer Journal. 2013; 49:1268-1286.
Hashimshony, et al. CEL-Seq: Single-Cell RNA-Seq by Multiplexed Linear Amplification. Cell Rep. Sep. 27, 2012;2(3):666-73. doi: 10.1016/j.celrep.2012.08.003. Epub Aug. 30, 2012.
Hiatt, et al. Parallel, tag-directed assembly of locally derived short sequence reads. Nat Methods. Feb. 2010; 7(2):119-22. Epub Jan. 17, 2010.
Holtze, et al. Biocompatible surfactants for water-in-fluorocarbon emulsions. Lab Chip. Oct. 2008;8(10):1632-9. doi: 10.1039/b806706f. Epub Sep. 2, 2008.
Islam, et al. Highly multiplexed and strand-specific single-cell RNA 5′ end sequencing. Nat Protoc. Apr. 5, 2012;7(5):813-28. doi: 10.1038/nprot.2012.022.
Jaitin, et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science. Feb. 14, 2014;343(6172):776-9. doi: 10.1126/science.1247651.
Kaper, et al. Supporting Information for “Whole-genome haplotyping by dilution, amplification, and sequencing.” Proc Natl Acad Sci U S A. Apr. 2, 2013;110(14):5552-7. doi: 10.1073/pnas.1218696110. Epub Mar. 18, 2013.
Kaper, et al. Whole-genome haplotyping by dilution, amplification, and sequencing. Proc Natl Acad Sci U S A. Apr. 2, 2013;110(14):5552-7. doi: 10.1073/pnas.1218696110. Epub Mar. 18, 2013.
Kenis, et al. Microfabrication Inside Capillaries Using Multiphase Laminar Flow Patterning. Science. 1999; 285:83-85.
Kivioja, et al. Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods. Nov. 20, 2011;9(1):72-4.
Klein, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. May 21, 2015; 161:1187-1201.
Korlach et al., Methods in Enzymology, Real-Time DNA Sequencing from Single Polymerase Molecules, (2010) 472:431-455.
Koster et al., “Drop-based microfluidic devices for encapsulation of single cells”, Lab on a Chip The Royal Soc. of Chem. 8: 1110-1115 (2008).
Lagally, et al. Single-Molecular DNA Amplification and Analysis in an Integrated Microfluidic Device. Anal Chem. Feb. 1, 2001;73(3):565-70.
Lennon et al. A scalable, fully automated process for construction of sequence-ready barcoded libraries for 454. Genome Biology 11:R15 (2010).
Madl, et al. “Bioorthogonal Strategies for Engineering Extracellular Matrices”, Madl, Christopher, Adv. Funct. Mater. Jan. 19, 2018, vol. 28, 1706046, pp. 1-21.
Mair, et al. Injection molded microfluidic chips featuring integrated interconnects. Lab Chip. Oct. 2006;6(10):1346-54. Epub Jul. 31, 2006.
Microfluidic ChipShop. Microfluidic product catalogue. Oct. 2009.
Nisisako, et al. Droplet formation in a microchannel network. Lab Chip. Feb. 2002;2(1):24-6. Epub Jan. 18, 2002.
Nisisako, T. et al. Droplet Formation in a Microchannel on PMMA Plate. Micro Total Analysis Systems. 2001. Kluwer Academic Publishers. pp. 137-138.
Nisisako, T et al., Microfluidics large-scale integration on a chip for mass production of monodisperse droplets and particles, The Royal Society of Chemistry: Lab Chip, (Nov. 23, 2007) 8:287-293.
Novak, R. et al., “Single cell multiplex gene detection and sequencing using microfluidically generated agarose emulsions” Angew. Chem. Int. Ed. Engl. (2011) 50(2):390-395.
Orakdogen, N. “Novel responsive poly(N,N-dimethylaminoethyl methacrylate) gel beads: preparation, mechanical properties and pH-dependent swelling behavior” J Polym Res (2012) 19:9914.
Perrott, Jimmy. Optimization and Improvement of Emulsion PCR for the Ion Torrent Next-Generation Sequencing Platform. (2011) Thesis.
Plunkett, et al. Chymotrypsin responsive hydrogel: application of a disulfide exchange protocol for the preparation of methacrylamide containing peptides. Biomacromolecules. Mar.-Apr. 2005;6(2):632-7.
Priest, et al. Generation of Monodisperse Gel Emulsions in a Microfluidic Device, Applied Physics Letters, 88:024106 (2006).
Ramsey, J.M. “The burgeoning power of the shrinking laboratory” Nature Biotech (1999) 17:1061-1062.
Ramskold et al. (2012) “Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells” Nature Biotechnology 30(8):777-782.
Roche. Using Multiplex Identifier (MID) Adaptors for the GS FLX Titanium Chemistry Basic MID Set Genome Sequencer FLX System, Technical Bulletin 004-2009, (Apr. 1, 2009) pp. 1-7. URL:http://454.com/downloads/my454/documentation/technical-bulletins/TCB-09004 UsingMultiplexIdentifierAdaptorsForTheGSFLXTitaniumSeriesChemistry-BasicMIDSet.pdf.
Rotem, A. et al., “High-throughput single-cell labeling (Hi-SCL) for RNA-Seq using drop-based microfluidics” PLOS One (May 22, 2015) 0116328 (14 pages).
Saikia, et al. Simultaneous multiplexed amplicon sequencing and transcriptome profiling in single cells. Nat Methods. Jan. 2019;16(1):59-62. doi: 10.1038/s41592-018-0259-9. Epub Dec. 17, 2018.
Schubert, et al. Microemulsifying fluorinated oils with mixtures of fluorinated and hydrogenated surfactants. Colloids and Surfaces A; Physicochemical and Engineering Aspects, 84(1994) 97-106.
Seiffert, et al. Microfluidic fabrication of smart microgels from macromolecular precursors. 2010. Polymer.
Seiffert, S. et al., “Smart microgel capsules from macromolecular precursors” J. Am. Chem. Soc. (2010) 132:6606-6609.
Shah, “Fabrication of monodisperse thermosensitive microgels and gel capsules in microfluidic devices”, Soft Matter, 4:2303-2309 (2008).
Smith, et al. Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples. Nucleic Acids Research, 38(13): e142 (2010).
Song, et al. Reactions in droplets in microfluidic channels. Angew Chem Int Ed Engl. Nov. 13, 2006;45(44):7336-56.
Tewhey et al. The importance of phase information for human genomics, Nat Rev Genet (2011) 12:215-223.
Thaxton, C.S. et al. “A Bio-Bar-Code Assay Based Upon Dithiothreitol Oligonucleotide Release” Anal Chem (2005) 77:8174-8178.
Theberge, et al. Microdroplets in microfluidics: an evolving platform for discoveries in chemistry and biology. Angew Chem Int Ed Engl. Aug. 9, 2010;49(34):5846-68. doi: 10.1002/anie.200906653.
Thorsen, et al. Dynamic pattern formation in a vesicle-generating microfluidic device. Physical Review Letters. American Physical Society. 2001; 86(18):4163-4166.
Tonelli, et al. Perfluoropolyether functional oligomers: unusual reactivity in organic chemistry. Journal of fluorine chemistry. 2002; 118(1):107-121.
Turchinovich, et al. “Capture and Amplification by Tailing and Switching (CATS): An Ultrasensitive Ligation-Independent Method for Generation of DNA Libraries for Deep Sequencing from Picogram Amounts of DNA and RNA.” RNA Biology 11.7 (2014): 817-828. PMC. Web. Nov. 13, 2017.
Uttamapinant, et al. Fast, cell-compatible click chemistry with copper-chelating azides for biomolecular labeling. Angew. Chem. Int. Ed. Engl., Jun. 11, 2012; 51(24) pp. 5852-5856.
Wagner, et al. Biocompatible fluorinated polyglycerols for droplet microfluidics as an alternative to PEG-based copolymer surfactants. Lab Chip. Jan. 7, 2016;16(1):65-9. doi: 10.1039/c5lc00823a. Epub Dec. 2, 2015.
Weigl, et al. Microfluidic Diffusion-Based Separation and Detection. Science. 1999; pp. 346-347.
Williams, et al. Amplification of complex gene libraries by emulsion PCR. Nature Methods. 2006;3(7):545-50.
Zhang, et al. One-step fabrication of supramolecular microcapsules from microfluidic droplets. Science. Feb. 10, 2012;335(6069):690-4. doi: 10.1126/science.1215416.
Zhang, et al. Reconstruction of DNA sequencing by hybridization. Bioinformatics. Jan. 2003;19(1):14-21.
Zheng, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. Jan. 16, 2017;8:14049. doi: 10.1038/ncomms14049.
Zhu, et al. Reverse transcriptase template switching: a SMART approach for full-length cDNA library construction. Biotechniques. Apr. 2001;30(4):892-7.
Zheng, X. SeqArray: an R/Bioconductor Package for Big Data Management of Genome-Wide Sequencing Variants, Department of Bios . . . .
Co-pending U.S. Appl. No. 18/152,650, inventor Shastry; Shankar, filed Jan. 10, 2023.
Co-pending U.S. Appl. No. 18/198,763, inventors Schnall-Levin; Michael et al., filed May 17, 2023.
Co-pending U.S. Appl. No. 18/392,684, inventors Fernandes; Sunjay Jude et al., filed Dec. 21, 2023.
Co-pending U.S. Appl. No. 18/643,684, inventor Bava; Felice Alessio, filed Apr. 23, 2024.

Related Publications (1)

	Number	Date	Country
	20200020417 A1	Jan 2020	US

Provisional Applications (3)

Number	Date	Country
61979973	Apr 2014	US
61916566	Dec 2013	US
61872597	Aug 2013	US

Continuations (1)

	Number	Date	Country
Parent	14470746	Aug 2014	US
Child	16428656		US

Abstract