Single-cell genomics has become a mainstay technology used to dissect multicellular organisms and tissues that are composed of cells with diverse functions1-4. The power of this approach has been demonstrated in several cell atlas studies: novel cell types have been discovered, leading to the elucidation of new mechanisms; complex cellular interactions and transitions associated with disease initiation or progression have been revealed; cross-species analyses have shed light on evolutionary processes5-7. The use of single-cell technology in studying cancer is especially important. Regulatory mechanisms underlying drug resistance or immune evasion are elusive and complex, and tumor cell heterogeneity is a major contributing factor to this complexity, making it particularly challenging to dissect these mechanisms with bulk techniques8-10. Single-cell technologies have greatly enhanced our understanding of tumor heterogeneity and accelerated mechanistic discovery. At the phenotype level, single-cell RNA-seq (scRNA-seq) has been used to uncover drug-resistant melanoma subpopulations and to characterize cancer stem cell subpopulations in glioblastoma7,11-14. scRNA-seq has also enabled a more comprehensive phenotypic understanding of the tumor microenvironment (TME) in many cancers including glioma and colorectal cancer15-19. At the genotype level, genomic instability contributes to cancer initiation, progression, relapse, and metastasis8. With single-cell whole genome sequencing (scWGS), the clonal structure of the tumor can be resolved, and evolutionary analysis based on copy number aberrations (CNAs) can reveal tumor progression20,21.
Evidently, both the genomic and transcriptomic heterogeneity of tumors contribute to the disease, and understanding the importance of both in cancer studies is crucial. Several single-cell methods that interrogate DNA and RNA simultaneously in the same cell (scWGS-RNAseq) have been developed22-26. However, these first-generation scWGS-RNAseq methods have not been widely applied since their invention, largely because they require physical separation of DNA and RNA, often by separating the nucleus from the cytoplasm, and sometimes by separating polyadenylated RNA from the rest of the cell by polyT-bead based fishing. These separation techniques are labor-intensive, time-consuming, and technically demanding, requiring skilled experimental technique or special microfluidic devices. Furthermore, they are not applicable to frozen samples, where it is impossible to obtain intact single-cell suspensions. As such, existing scWGS-RNAseq methods cannot be applied to the vast majority of primary biobanked tumor samples. All in all, these limitations make first-generation scWGS-RNAseq methods not easily accessible.
Therefore, there remains a need for methods of co-profiling of DNA-encoded and RNA-encoded information from a single cell.
The present invention relates to a novel single-cell DNA and RNA co-amplification method, scONE-seq, which enables co-profiling of the transcriptome and genome from the same single cell or nucleus in a one-container reaction. In certain embodiments, the invention pertains to a barcoding strategy that introduces 6-base long DNA-specific and RNA-specific barcodes to each type of nucleic acid during the single-cell DNA/RNA amplification process, while also incorporating unique molecular identifiers (UMIs)27-29. Thus, DNA and RNA reads can be amplified together by a shared primer region, but later distinguished in silico by their respective specific barcode information after sequencing. Compared to the first-generation scWGS-RNAseq methods, scONE-seq has several advantages: it has a simplified library construction workflow; it is compatible with standard biology workflows such as fluorescence-activated cell sorting (FACS); being a one-pot (i.e., one container or one tube) reaction, its throughput can be easily scaled up using liquid-handling robots; most importantly, scONE-seq does not require physically separating the DNA and RNA, and is therefore applicable to a variety of sample types including single nuclei. In certain embodiments, frozen clinical samples and cell types that are difficult to dissociate into single-cell suspensions, which are intractable with scDR-seq methods, can be profiled using scONE-seq.
The method is a DNA/RNA barcoding strategy, which tags the DNA and RNA with different nucleic acid barcodes respectively prior to single cell DNA/RNA co-amplification. Amplification adapters can also be added and used to co-amplify the DNA and reverse transcribed cDNA with the same primer sets, generating the sequencing library. After sequencing, DNA and RNA reads can be computationally distinguished by demultiplexing of the barcode.
The methods of the invention can be useful for any type of cell. Methods of the invention can be applied to the co-profiling of whole genomes and total transcriptomes from a single cell. The methods are particularly useful for studying diseases such as cancer, in which the genome and transcriptome reflect different facets of the disease progression. The methods can also be used to study viral activity within cells, as infected cells harbor viral DNA and RNA in addition to the endogenous genome and transcriptome. The subject methods can further be used to identify bacteria and their interactions with phage. In certain embodiments, the methods can also be used to screen for drugs and discover new drugs or drug functions. In certain embodiments, the methods can be used with microfluidic devices.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication, with color drawing(s), will be provided by the Office upon request and payment of the necessary fee.
SEQ ID NO: 1: Exemplary Adapter for RNA sequence
SEQ ID NO: 2: Exemplary Amplification primer
SEQ ID NO: 3: Annealing sequence
SEQ ID NO: 4: One-Tn5 Exemplary Adapter for DNA sequence
SEQ ID NO: 5: Exemplary Adapter for RNA sequence
SEQ ID NO: 6: Exemplary Adapter for RNA sequence
SEQ ID NO: 7: Exemplary amplification primer
SEQ ID NO: 8: Mosaic Sequence
SEQ ID NO: 9: Read1-Tn5 sequence/Read 1 primer
SEQ ID NO: 10: I7 index primer
SEQ ID NO: 11: I5 index primer
SEQ ID NO: 12: Read 2 primer
As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Further, to the extent that the terms “including,” “includes,” “having,” “has,” “with,” or variants thereof are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.” The transitional terms/phrases (and any grammatical variations thereof) “comprising,” “comprises,” “comprise,” include the phrases “consisting essentially of,” “consists essentially of,” “consisting,” and “consists.”
The phrases “consisting essentially of” or “consists essentially of” indicate that the claim encompasses embodiments containing the specified materials or steps and those that do not materially affect the basic and novel characteristic(s) of the claim.
The term “about” means within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, i.e., the limitations of the measurement system. Where particular values are described in the application and claims, unless otherwise stated the term “about” meaning within an acceptable error range for the particular value should be assumed.
In the present disclosure, ranges are stated in shorthand, to avoid having to set out at length and describe each and every value within the range. Any appropriate value within the range can be selected, where appropriate, as the upper value, lower value, or the terminus of the range. For example, a range of 1-10 represents the terminal values of 1 and 10, as well as the intermediate values of 2, 3, 4, 5, 6, 7, 8, 9, and all intermediate ranges encompassed within 1-10, such as 2-5, 2-8, and 7-10. Also, when ranges are used herein, combinations and sub-combinations of ranges (e.g., subranges within the disclosed range) and specific embodiments therein are intended to be explicitly included.
The terms “label,” “detectable label,” “detectable moiety,” and like terms refer to a composition detectable by spectroscopic, photochemical, biochemical, immunochemical, chemical, or other physical means. For example, useful labels include fluorescent dyes (fluorophores), luminescent agents, electron-dense reagents, enzymes (e.g., as commonly used in an ELISA), biotin, enzymes acting on a substrate (e.g., horseradish peroxidase), digoxigenin, 32P and other isotopes, haptens, and proteins which can be made detectable, e.g., by incorporating a radiolabel into the peptide or used to detect antibodies specifically reactive with the peptide. The term includes combinations of single labeling agents, e.g., a combination of fluorophores that provides a unique detectable signature, e.g., a barcode. A barcode is a sequence of about 4 to about 10 nucleotides or about 5 to about 8 nucleotides that is used to distinguish between different samples during sequence analysis.
As used herein, the term “positive,” when referring to a result or signal, indicates the presence of an analyte or item that is being detected in a sample. The term “negative,” when referring to a result or signal, indicates the absence of an analyte or item that is being detected in a sample. Positive and negative are typically determined by comparison to at least one control, e.g., a threshold level that is required for a sample to be determined positive, or a negative control (e.g., a known blank). A “control” sample or value refers to a sample that serves as a reference, usually a known reference, for comparison to a test sample. For example, a test sample can be taken from a test condition, e.g., in the presence of a test compound, and compared to samples from known conditions, e.g., in the absence of the test compound (negative control), or in the presence of a known compound (positive control). A control can also represent an average value gathered from a number of tests or results. One of skill in the art will recognize that controls can be designed for assessment of any number of parameters, and will understand which controls are valuable in a given situation and be able to analyze data based on comparisons to control values. Controls are also valuable for determining the significance of data. For example, if values for a given parameter are variable in controls, variation in test samples will not be considered as significant.
As used herein, a “calibration control” is similar to a positive control, in that it includes a known amount of a known analyte. In the case of a PCR assay, the calibration control can be designed to include known amounts of multiple known analytes. The amount of analyte(s) in the calibration control can be set at a minimum cut-off amount, e.g., so that a higher amount will be considered “positive” for the analyte(s), while a lower amount will be considered “negative” for the analyte(s). In some cases, multilevel calibration controls can be used, so that a range of analyte amounts can be more accurately determined. For example, an assay can include calibration controls at known low and high amounts, or known minimal, intermediate, and maximal amounts.
As used herein, “subject,” “patient,” “individual” and grammatical equivalents thereof are used interchangeably and refer to, except where indicated, mammals, such as humans and non-human primates, as well as rabbits, felines, canines, rats, mice, squirrels, goats, pigs, deer, and other mammalian species. The term does not necessarily indicate that the subject has been diagnosed with a particular disease, but typically refers to an individual under medical or veterinary supervision. A patient can be an individual that is seeking treatment, monitoring, adjustment or modification of an existing therapeutic regimen, etc.
The term “biological sample” or “sample from a subject” encompasses a variety of sample types obtained from an organism. The term encompasses bodily fluids such as blood, blood components, saliva, nasal mucous, serum, plasma, cerebrospinal fluid (CSF), urine and other liquid samples of biological origin, solid tissue biopsy, tumor, tissue cultures, or supernatant taken from cultured patient cells. In the context of the present disclosure, the biological sample is typically a cell or nucleus sample with detectable amounts of nucleic acids. The biological sample can be processed prior to assay, e.g., to lyse cells. The term encompasses samples that have been manipulated after their procurement, such as by treatment with reagents, solubilization, sedimentation, or enrichment for certain components.
As used herein, the term “nucleic acid” or “polynucleotide” refers to deoxyribonucleic acids (DNA) or ribonucleic acids (RNA) and polymers thereof in either single- or double-stranded form. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, single nucleotide polymorphisms (SNPs), and complementary sequences as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al., Nucleic Acid Res. 19:5081 (1991); Ohtsuka et al., J. Biol. Chem. 260:2605-2608 (1985); and Rossolini et al., Mol. Cell. Probes 8:91-98 (1994)). The term nucleic acid is used interchangeably with gene, cDNA, and mRNA encoded by a gene.
As used herein, the term “gene” means the segment of DNA involved in producing a polypeptide chain; it includes regions preceding and following the coding region (leader and trailer) involved in the transcription/translation of the gene product and the regulation of the transcription/translation, as well as intervening sequences (introns) between individual coding segments (exons).
As used herein, the terms “identical” or percent “identity”, in the context of describing two or more polynucleotide or amino acid sequences, refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same (for example, a nucleotide probe used in the method of this invention has at least 70% sequence identity, preferably 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identity, to a target sequence or complementary sequence thereof), when compared and aligned for maximum correspondence over a comparison window, or designated region as measured using one of the following sequence comparison algorithms or by manual alignment and visual inspection. Such sequences are then said to be “substantially identical”. With regard to polynucleotide sequences, this definition also refers to the complement of a test sequence.
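As an illustration of the percent identity comparison described above, the following is a minimal sketch that assumes the two sequences have already been aligned (with "-" marking gaps) by one of the sequence comparison algorithms or by manual alignment. The function names, and the choice of the alignment length as the denominator, are assumptions made for illustration and are not prescribed by this disclosure.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns of a pre-computed pairwise alignment.

    Assumption: identity is counted over all alignment columns, so a gap
    opposite a residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Drop columns where both sequences carry a gap; they hold no information.
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("no alignment columns to compare")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def is_substantially_identical(aligned_a: str, aligned_b: str,
                               threshold: float = 70.0) -> bool:
    """True when identity meets the threshold (70% is the lowest value recited above)."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

For instance, two aligned 10-mers that differ at one position would be reported as 90% identical under this convention.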
As used herein, the term “multiplexing” refers to a process in which multiple samples or multiple types of biomolecules are pooled together for signal readout and processing, such as, for example, mixing sequences from multiple single cells into one pool for sequence amplification or sequencing together; or, in another example, generating a mixture of sequences derived from genomic DNA and RNA for amplification or sequencing together.
As used herein, the term “demultiplexing” refers to a process of converting the signal/readout from multiple sample origins into separate signals/readouts, which can be performed after a multiplexing experiment is conducted in order to recover the sample-specific information from the pooled/multiplexed readout. For example, demultiplexing can convert the sequencing information containing sequences from multiple single cells into sequences derived from each original single cell, possibly based on barcode/adapter/index tags on these sequences that identify their cell-of-origin. In another example, demultiplexing can convert the sequence information from the co-amplification of DNA and RNA from a single cell into sequences derived from DNA of that cell separate from the sequences derived from RNA of that cell, possibly based on barcode/adapter/index tags on these sequences that identify their original biomolecule type.
As used herein, the term “adapter” refers to a nucleic acid component, generally DNA, which provides a means of addressing a nucleic acid fragment to which it is subsequently joined. For example, in certain embodiments, an adapter comprises a nucleotide sequence that permits identification, recognition, and/or molecular or biochemical manipulation of the DNA to which the adapter is attached (e.g., by providing a site for annealing an oligonucleotide, such as a primer for extension by a DNA polymerase, or an oligonucleotide for capture or for a ligation reaction). Adapters may be or include a region that is an indexing/barcoding sequence used to identify the sample source (e.g., cell or tissue) from which each nucleic acid originated to allow multiplexing of molecules from different sample sources for high-throughput amplification and/or sequencing. Alternatively or additionally, indexing/barcoding sequences can be used to distinguish those nucleic acids derived from DNA from nucleic acids derived from RNA (e.g., cDNA) to allow pooling of DNA and RNA from the same sample for high-throughput amplification and/or sequencing. For example, a “DNA-specific barcode” can be used to identify sequences originating from genomic DNA molecules and an “RNA-specific barcode” can be used to identify sequences originating from RNA molecules. Adapters can be added to a nucleic acid, for example, by various enzymatic methods including but not limited to reverse transcription, ligation, tagmentation, PCR, or any combination thereof.
In certain embodiments, the subject invention provides an isolated synthetic nucleic acid adapter in which the adapter can be recognized by the transposase. Transposases can act in complex with specific DNA sequences or adapters, which can form stable complexes with transposases and thus render them active. The adapters can comprise transposase recognition sequences found in nature, or they also can be modified native sequences.
In certain embodiments, the adapter can comprise one or more double-stranded DNA (dsDNA) or single-stranded DNA (ssDNA) sequences. The sequences can be included to allow attachment of generated DNA fragments to sequencing chips, such as Illumina chips, and allow identification of the source of the target DNA and RNA. The adapter can be designed for other types of sequencing, including for example Ion Torrent and DNBSEQ. The adapter can comprise at least one of the following: an amplification primer sequence, a DNA-specific or RNA-specific barcode, a Seq-1 primer, an annealing sequence, and a mosaic. In certain embodiments, the DNA-specific barcode and RNA-specific barcode can be used to differentiate between DNA sequences and RNA sequences in a single sample. In certain embodiments, the adapter can comprise the following exemplary sequence: GTCTCGTGGGCTCGG ATCGT NNNNN TTTTTTTTTTTTTTTTTTTTVN (SEQ ID NO: 1). In certain embodiments, a plurality of adapters, each with a distinct sequence, can be added to a reaction mixture. In preferred embodiments, the adapter of SEQ ID NO: 1 can be added into a reaction mixture with additional adapters to, for example, achieve capture of non-polyadenylated RNA from the sample, and such additional adapters can comprise the following two exemplary classes of sequences: GTCTCGTGGGCTCGG ATCGT NNNNN GGG HN (SEQ ID NO: 5), and GTCTCGTGGGCTCGG ATCGT NNNNN TTT VN (SEQ ID NO: 6). In certain embodiments, an exemplary amplification primer in the adapter is GTCTCGTGGGCTCGG (SEQ ID NO: 2) or GATGTGTGGAGGTCTCGTGGGCTCGG (SEQ ID NO: 7), which is complementary to the Illumina sequencing primer Seq-1. In preferred embodiments, the PCR primer sequence is a sequence within the adapter that is shared between a plurality of adapters and permits simultaneous amplification of nucleotide sequences derived from a sample, including, for example, both DNA and RNA sequences. In certain embodiments, an exemplary RNA barcode in the adapter is ATCGT.
In certain embodiments, an exemplary DNA barcode in the adapter is TCATG. In certain embodiments, the UMI in the adapter can be any sequence of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more nucleotides, preferably 5 nucleotides in length (i.e., 25), and can be used to uniquely tag each DNA and RNA molecule. In certain embodiments, an exemplary annealing sequence in the adapter is TTTTTTTTTTTTTTTTTTTTVN (SEQ ID NO: 3). In certain embodiments, the Tn5 transposase can recognize specific sequences to form a complex. The mosaic sequence can be the recognition sequence for Tn5, such as, for example, the exemplary sequence [phos]CTGTCTCTTATACACATCT (SEQ ID NO: 8). The Illumina platform uses a designated sequence (the Read 1 sequencing primer) to perform sequencing. The Read 1 sequencing primer can include two parts, the mosaic and the Seq-1 sequence. In certain embodiments, an adapter or plurality of adapters can be used to tag DNA with a DNA recognition barcode. Additionally, the adapter or plurality of adapters used to tag DNA can be used for assembly with Tn5 during the initial tagmentation step. In certain embodiments, the adapter has the sequence: GTCTCGTGGGCTCGG TCATG NNNNN AGATGTGTATAAGAGACAG (One-Tn5) (SEQ ID NO: 4). In preferred embodiments, the co-amplification of DNA and RNA is achieved by performing PCR using a common primer that is shared between the tagged DNA and RNA molecules.
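The modular layout of the exemplary adapters (shared amplification primer, molecule-type barcode, UMI, then an annealing or mosaic tail) can be sketched as follows. This is an illustrative model only: the class and field names are assumptions, and the sequences are copied from SEQ ID NOs: 2, 4, 6, and 8 above. The sketch also shows that the 3' tail of the One-Tn5 adapter is the reverse complement of the exemplary mosaic sequence, which can be checked directly from the two sequences.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a sequence written with A, C, G, T, and N only."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class Adapter:
    """Hypothetical container for the adapter segments described above."""
    primer: str      # shared amplification primer region
    barcode: str     # DNA-specific or RNA-specific barcode
    umi_length: int  # number of random (N) positions in the UMI
    tail: str        # annealing sequence or mosaic-derived sequence

    def sequence(self) -> str:
        """Full adapter, 5' to 3', with the UMI written as a run of N."""
        return self.primer + self.barcode + "N" * self.umi_length + self.tail

    def umi_space(self) -> int:
        """Number of distinct UMIs available with four bases per position."""
        return 4 ** self.umi_length


SHARED_PRIMER = "GTCTCGTGGGCTCGG"   # SEQ ID NO: 2
MOSAIC = "CTGTCTCTTATACACATCT"      # SEQ ID NO: 8, 5' phosphate omitted

# DNA-tagging adapter (One-Tn5, SEQ ID NO: 4) and one RNA-tagging adapter (SEQ ID NO: 6).
one_tn5 = Adapter(SHARED_PRIMER, "TCATG", 5, reverse_complement(MOSAIC))
rna_short = Adapter(SHARED_PRIMER, "ATCGT", 5, "TTTVN")
```

With a 5-nucleotide UMI, `one_tn5.umi_space()` evaluates to 1024 possible tags per adapter type.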
In certain embodiments, an adapter that is complementary to the Tn5 adapter used for tagging DNA, and that also contains a Read1 sequencing primer region, can be added to the cDNA and gDNA library created by amplifying the tagged DNA and RNA molecules. During a second Tn5 library construction step, this adapter can be assembled with the second round of Tn5. The mosaic sequence can be shared for the two Tn5 steps. In certain embodiments, the sequence of the adapter can be TCGTCGGCAGCGTC AGATGTGTATAAGAGACAG (Read1-Tn5) (SEQ ID NO: 9).
In certain embodiments, the subject invention provides one or more adapters that contain primer sequences that amplify a nucleic acid region (or amplicon) of at least 200 bp, about 200 bp to about 6000 bp, about 200 bp to about 4000 bp, about 200 bp to about 3000 bp, about 200 bp to about 2000 bp, about 200 bp to about 1000 bp, about 200 bp to about 750 bp, about 200 bp to about 500 bp, or about 300 bp to about 500 bp. The primer for the amplification reactions can be designed according to known algorithms or by a skilled artisan. For example, algorithms implemented in commercially available or custom software can be used to design primers for amplifying the target sequences based on the complementarity and stringency of said primers to the target region. Stringency refers to hybridization conditions chosen to optimize binding of polynucleotide sequences with different degrees of complementarity. Stringency is affected by factors such as temperature, salt conditions, the presence of organic solvents in the hybridization mixtures, the lengths and base compositions of the sequences to be hybridized, and the extent of base mismatching, and the combination of parameters is more important than the absolute measure of any one factor.
Typically, the primer sequences can be at least 12 bases, more often about 15, about 18, about 20, about 21, about 22, about 23, about 24, about 25, about 26, about 27, about 28, about 29, about 30, or more bases in length. In preferred embodiments, the primer sequence is about 26 bases in length. Primers are typically designed so that all primers participating in a particular reaction have melting temperatures that are within 5° C., and most preferably within 2° C. of each other. Primers are further designed to avoid priming on themselves or each other. Primer and/or adapter concentration should be sufficient to bind to the amount of target sequences that are amplified so as to provide an accurate assessment of the quantity of amplified sequence. Those of skill in the art will recognize that the amount or concentration of primer and/or adapter will vary according to the binding affinity of the primers as well as the quantity of sequence to be bound.
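The melting-temperature criterion above can be checked with a simple calculation. The sketch below uses a widely cited GC-content approximation, Tm = 64.9 + 41 × (G + C − 16.4) / N, which is an assumption made here for illustration; this disclosure does not prescribe a particular Tm formula, and nearest-neighbor models that account for salt and primer concentration are generally more accurate. The function names are likewise hypothetical.

```python
from typing import Iterable


def approx_tm_celsius(primer: str) -> float:
    """Rough melting temperature (degrees C) from GC content and length.

    Assumption: basic GC-content formula, with no salt or concentration correction.
    """
    seq = primer.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("primer must be a non-empty A/C/G/T sequence")
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def tm_spread(primers: Iterable[str]) -> float:
    """Difference between the highest and lowest estimated Tm in a primer set."""
    tms = [approx_tm_celsius(p) for p in primers]
    if not tms:
        raise ValueError("at least one primer is required")
    return max(tms) - min(tms)


def within_tm_window(primers: Iterable[str], window: float = 5.0) -> bool:
    """True when all primers fall within the given Tm window of each other."""
    return tm_spread(primers) <= window
```

Passing `window=2.0` applies the more preferred 2° C. criterion to the same primer set.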
In certain embodiments, adapters can be designed to hybridize to a nucleic acid sequence, or portions thereof. In certain embodiments, the complementary nucleotide segment of the primer is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 40, 50, or 100 base pairs long, or longer. In preferred embodiments, the complementary nucleotide segment of the adapters is about 15 to about 60 base pairs, preferably about 16 to about 50 base pairs, more preferably about 17 to about 40 base pairs, more preferably about 17 to about 35 base pairs, more preferably about 18 to about 25 base pairs. In certain embodiments, the primers can be 100% complementary to a target sequence or have at least 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence complementarity. In certain embodiments, the sequence of the primer can also have multiple possible alternative nucleotides represented by the IUPAC notation of, for example, R, Y, S, W, K, M, B, D, H, V, N, or a gap (“-” or “.”) nucleotide. In certain embodiments, adapters can be designed to ligate to a nucleic acid sequence, or portions thereof.
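To illustrate how degenerate IUPAC positions such as the H, V, and N in the exemplary adapter tails can be interpreted, the following sketch expands an IUPAC-coded sequence into a regular expression and tests concrete sequences against it. The helper names are assumptions for illustration, and gap symbols are not handled.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases each one represents.
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(pattern: str) -> str:
    """Translate an IUPAC-coded sequence into a regular expression string."""
    try:
        return "".join(IUPAC_CLASSES[base] for base in pattern.upper())
    except KeyError as err:
        raise ValueError(f"unsupported IUPAC symbol: {err.args[0]}") from None


def matches_degenerate(pattern: str, sequence: str) -> bool:
    """True when the concrete sequence is one of those encoded by the pattern."""
    return re.fullmatch(iupac_to_regex(pattern), sequence.upper()) is not None
```

For example, `matches_degenerate("TTTVN", "TTTCA")` is true, whereas `matches_degenerate("TTTVN", "TTTTA")` is false because V excludes T.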
The invention provides methods for co-profiling of DNA-encoded and RNA-encoded information from a single cell. The method can use a series of molecular biology reactions that act preferentially on DNA and RNA sequentially to tag the DNA and RNA with different nucleic acid barcodes respectively prior to single cell DNA/RNA co-amplification.
To achieve single cell genome and transcriptome parallel sequencing, we devised the method to co-amplify RNA with DNA in a reaction (
In certain embodiments, following co-amplification, PCR amplified fDNA and cDNA can be either shortened for sequencing or the fragments can be used directly for sequencing. These fragments can be about 2000 to about 6000 base pairs in length (fDNA) and about 200 to about 2000 base pairs in length (cDNA). In certain embodiments, next-generation sequencing (NGS) requires shorter nucleotide lengths than other types of sequencing. Therefore, to make the sequences appropriate lengths for short read sequencing in platforms, such as, for example, the Illumina platform, a second Tn5 tagmentation can be used to fragment the co-amplified fDNA and cDNA. This second tagmentation can insert adapters that contain a sequencing primer, such as, for example, the Read1 primer (SEQ ID NO: 9) (Illumina-specific), so that the final products can be dsDNA fragments of about 400 to about 800 base pairs in length and contain a sequencing primer, such as, for example, Read1 (SEQ ID NO: 9) and Read2 (SEQ ID NO: 12). In certain embodiments, any further specific sequencing primers can be added in more PCR steps after this step.
In certain embodiments, a “sample index” sequence is included in the DNA barcode adapter and RNA barcode adapter, where the same sample has the same sample index for both the DNA and RNA molecules. This allows the co-amplified library from multiple different single cells to be pooled and sequenced simultaneously in a multiplexed fashion.
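A minimal sketch of how pooled reads might be assigned back to their cell of origin using the sample index is given below. It assumes that each read arrives paired with its already-extracted sample index sequence; the index sequences, cell names, and function name in the example are hypothetical and are not taken from this disclosure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def demultiplex_by_sample_index(
    records: Iterable[Tuple[str, str]],
    index_to_cell: Dict[str, str],
) -> Dict[str, List[str]]:
    """Group (sample_index, read) pairs by the cell each index was assigned to.

    Reads whose index is not in the lookup table are kept under "undetermined"
    rather than discarded, so that index errors remain visible.
    """
    per_cell: Dict[str, List[str]] = defaultdict(list)
    for index_seq, read_seq in records:
        cell = index_to_cell.get(index_seq.upper(), "undetermined")
        per_cell[cell].append(read_seq)
    return dict(per_cell)


# Hypothetical index assignments and pooled reads, for illustration only.
example_indexes = {"AACCGGTT": "cell_1", "TTGGCCAA": "cell_2"}
example_pool = [
    ("AACCGGTT", "TCATGACGTAGGGCCC"),
    ("TTGGCCAA", "ATCGTTTTTTAAAA"),
    ("AACCGGTT", "ATCGTCCCCCGGGG"),
    ("GGGGGGGG", "TCATGAAAAATTTT"),
]
example_result = demultiplex_by_sample_index(example_pool, example_indexes)
```

Because DNA-derived and RNA-derived molecules from one cell share the same sample index, both kinds of reads land in the same per-cell group and can then be separated by their molecule-type barcode.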
Any high-throughput method for sequencing can be used in the practice of the invention. DNA sequencing approaches include, but are not limited to: dideoxy sequencing reactions using labeled terminators (the Sanger method) in various formats, sequencing by synthesis, pyrosequencing, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, and high-throughput single-molecule sequencing.
After the co-amplified library is sequenced, data processing can be used to filter and separate reads from DNA/RNA. In certain embodiments, the sequenced library of reads can be aligned to the known “DNA barcode” and “RNA barcode” to determine whether any given read is from DNA or RNA, thus separating the two. In certain embodiments, the “fastp”, “seqkit”, and/or “seqtk” programs can be used to perform the separation. After separating the DNA from the RNA, the known adapter sequences can be trimmed/removed using programs such as Cutadapt, the FASTX toolkit, and others. In certain embodiments, the separated DNA/RNA reads can be aligned to a reference sequence dataset, either computationally or manually; for example, the DNA reads can be aligned to an organism's genome reference and the RNA reads aligned to an organism's transcriptome reference. Following read preprocessing, the DNA and RNA data can be analyzed separately using standard single-cell analysis computational pipelines, such as, for example, Seurat, Ginkgo, or CHISEL to perform the data analysis and visualization. In other embodiments, the separated DNA/RNA reads can be assembled from short reads into longer reads or contigs representing a longer contiguous nucleic acid sequence, for the purpose of de novo sequence or genome/transcriptome assembly. The assembly can be done using a pipeline composed of several different programs such as SPAdes (short reads), Canu (long reads), Velvet, followed by BUSCO, and the like.
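The barcode-based separation of DNA and RNA reads can be sketched as follows. This is not the fastp, seqkit, or seqtk workflow itself; it is a minimal stand-alone illustration that assumes the read begins with the 5-base molecule-type barcode (TCATG for DNA, ATCGT for RNA, as in the exemplary adapters) followed immediately by a 5-base UMI. The read layout, the one-mismatch tolerance, and all names are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, NamedTuple

BARCODES = {"DNA": "TCATG", "RNA": "ATCGT"}  # exemplary barcodes from the adapters
BARCODE_LENGTH = 5
UMI_LENGTH = 5  # assumed to follow the barcode directly


class TaggedRead(NamedTuple):
    molecule: str  # "DNA", "RNA", or "unassigned"
    umi: str       # UMI bases, empty when unassigned
    insert: str    # remaining sequence after barcode and UMI


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def classify_read(seq: str, max_mismatches: int = 1) -> TaggedRead:
    """Assign a read to DNA or RNA by its leading barcode and extract the UMI."""
    seq = seq.upper()
    if len(seq) < BARCODE_LENGTH + UMI_LENGTH:
        return TaggedRead("unassigned", "", seq)
    observed = seq[:BARCODE_LENGTH]
    hits = [name for name, barcode in BARCODES.items()
            if hamming(observed, barcode) <= max_mismatches]
    if len(hits) != 1:  # no match, or ambiguous between barcodes
        return TaggedRead("unassigned", "", seq)
    umi_end = BARCODE_LENGTH + UMI_LENGTH
    return TaggedRead(hits[0], seq[BARCODE_LENGTH:umi_end], seq[umi_end:])


def count_by_molecule(reads: Iterable[str]) -> Counter:
    """Tally how many reads fall into each molecule class."""
    return Counter(classify_read(read).molecule for read in reads)
```

The two exemplary barcodes differ at all five positions, so tolerating a single sequencing error in the barcode does not create ambiguity between the DNA and RNA classes.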
In certain embodiments, fragmentation of DNA can be achieved by enzymatic digestion or physical methods such as sonication, nebulization or hydrodynamic shearing. The fragmentation of the DNA can be achieved using a Tn5 transposase, as described in Zahn, H., Steif, A., Laks, E. et al. Scalable whole-genome single-cell library preparation without preamplification. Nat Methods 14, 167-173 (2017). See worldwide website: doi.org/10.1038/nmeth.4140.
The method provides a DNA/RNA barcoding strategy, which tags the DNA and RNA with different nucleic acid barcodes prior to single cell DNA/cDNA co-amplification. Amplification adapters are also added and used to co-amplify the DNA and reverse transcribed cDNA with the same primer sets, generating the sequencing library. In certain embodiments, the Tn5 transposase can also ligate an adapter to a cDNA nucleic acid sequence and/or DNA nucleic acid sequence. After sequencing, DNA and RNA reads can be computationally distinguished by demultiplexing of the barcode.
In certain embodiments, the cell from a sample can be added to a lysis buffer that contains components such as, for example, SDS, Triton X-100, or Tween-20 before amplification and/or reverse transcription of the nucleic acid targets; optionally, the lysis buffer contains nuclease inhibitors such as an RNase inhibitor. The cell from the sample can also be added to a buffer containing a protease, such as, for example, proteinase K or Thermolabile Proteinase K (New England Biolabs, Ipswich MA), before amplification and/or reverse transcription of the nucleic acid targets.
In certain embodiments, the detection of the at least one single-stranded or double stranded nucleic acid is carried out in an enzyme-based nucleic acid amplification method.
The expression “enzyme-based nucleic acid amplification method” relates to any method wherein enzyme-catalyzed nucleic acid synthesis occurs.
Such an enzyme-based nucleic acid amplification method can be preferentially selected from the group constituted of Polymerase Chain Reaction (PCR), notably encompassing all PCR based methods known in the art, such as reverse transcriptase PCR (RT-PCR), simplex and multiplex PCR, real time PCR, end-point PCR, quantitative or qualitative PCR and combinations thereof. These enzyme-based nucleic acid amplification methods are well known to the man skilled in the art and are notably described in Saiki et al. (1988) Science 239:487, EP 200 362 and EP 201 184 (PCR); Fahy et al. (1991) PCR Meth. Appl. 1:25-33 (3SR, Self-Sustained Sequence Replication); EP 329 822 (NASBA, Nucleic Acid Sequence-Based Amplification); U.S. Pat. No. 5,399,491 (TMA, Transcription Mediated Amplification); Walker et al. (1992) Proc. Natl. Acad. Sci. USA 89:392-396 (SDA, Strand Displacement Amplification); EP 0 320 308 (LCR, Ligase Chain Reaction); Bustin & Mueller (2005) Clin. Sci. (London) 109:365-379 (real-time Reverse-Transcription PCR).
In some embodiments, the enzyme-based nucleic acid amplification method is selected from the group consisting of Polymerase Chain Reaction (PCR) and Reverse-Transcriptase-PCR (RT-PCR), multiplex PCR or RT-PCR and real time PCR or RT-PCR. In other embodiments, the enzyme-based nucleic acid amplification method is a real time, optionally multiplex, PCR, quantitative PCR or RT-PCR method.
Exemplary PCR reaction conditions typically comprise either two or three step cycles. Two step cycles have a denaturation step followed by a hybridization/elongation step. Three step cycles comprise a denaturation step followed by a hybridization step followed by a separate elongation step. The polymerase reactions are incubated under conditions in which the primers hybridize to the target sequences and are extended by a polymerase. The amplification reaction cycle conditions are selected so that the primers hybridize specifically to the target sequence and are extended.
Successful PCR amplification requires high yield, high selectivity, and a controlled reaction rate at each step. Yield, selectivity, and reaction rate generally depend on the temperature, and optimal temperatures depend on the composition and length of the polynucleotide, enzymes and other components in the reaction system. In addition, different temperatures may be optimal for different steps. Optimal reaction conditions may vary, depending on the target sequence and the composition of the primer. Thermal cyclers such as, for example, real-time PCR systems provide the necessary control of reaction conditions to optimize the PCR process for a particular assay. For instance, a real-time PCR system may be programmed by selecting temperatures to be maintained, time durations for each cycle, number of cycles, and the like. In some embodiments, temperature gradients may be programmed so that different sample wells may be maintained at different temperatures, and so on.
In certain embodiments, the target nucleic acid sequence can be RNA and DNA. RNA or DNA can be artificially synthesized or isolated from natural sources. In some embodiments, the RNA target nucleic acid sequence can be a ribonucleic acid such as RNA, mRNA, piRNA, tRNA, rRNA, ncRNA, gRNA, shRNA, siRNA, snRNA, miRNA and snoRNA. More preferably, the DNA or RNA is biologically active or encodes a biologically active polypeptide. The DNA or RNA template can also be present in any useful amount.
Reverse transcriptases useful in the present invention can be any polymerase that exhibits reverse transcriptase activity. Several reverse transcriptases are known in the art and are commercially available (e.g., from Bio-Rad Laboratories, Inc., Hercules, CA; Boehringer Mannheim Corp., Indianapolis, Ind.; Life Technologies, Inc., Rockville, Md.; New England Biolabs, Inc., Beverly, Mass.; Perkin Elmer Corp., Norwalk, Conn.; Pharmacia LKB Biotechnology, Inc., Piscataway, N.J.; Qiagen, Inc., Valencia, Calif.; Stratagene, La Jolla, Calif.). In some embodiments, the reverse transcriptase can be Avian Myeloblastosis Virus reverse transcriptase (AMV-RT), Moloney Murine Leukemia Virus reverse transcriptase (M-MLV-RT), Human Immunodeficiency Virus reverse transcriptase (HIV-RT), EIAV-RT, RAV2-RT, C. hydrogenoformans DNA Polymerase, rTth DNA polymerase, SUPERSCRIPT I, SUPERSCRIPT II, SUPERSCRIPT III, and mutants, variants and derivatives thereof. It is to be understood that a variety of reverse transcriptases can be used in the present invention, including reverse transcriptases not specifically disclosed above, without departing from the scope or preferred embodiments disclosed herein.
DNA polymerases useful in the present invention can be any polymerase capable of replicating a DNA molecule. Preferred DNA polymerases are thermostable polymerases and polymerases that have exonuclease activity, which are especially useful in PCR. Thermostable polymerases are isolated from a wide variety of thermophilic bacteria, such as Thermus aquaticus (Taq), Thermus brockianus (Tbr), Thermus flavus (Tfl), Thermus ruber (Tru), Thermus thermophilus (Tth), Thermococcus litoralis (Tli) and other species of the Thermococcus genus, Thermoplasma acidophilum (Tac), Thermotoga neapolitana (Tne), Thermotoga maritima (Tma), and other species of the Thermotoga genus, Pyrococcus furiosus (Pfu), Pyrococcus woesei (Pwo) and other species of the Pyrococcus genus, Bacillus stearothermophilus (Bst), Sulfolobus acidocaldarius (Sac), Sulfolobus solfataricus (Sso), Pyrodictium occultum (Poc), Pyrodictium abyssi (Pab), and Methanobacterium thermoautotrophicum (Mth), and mutants, variants or derivatives thereof. Preferred DNA polymerases have strand displacement activity; however, a polymerase with strand displacement activity is not required and other methods known in the art of displacing nucleotide strands can be used in the subject invention, such as, for example, heating the nucleotide strands. In preferred embodiments, a high fidelity polymerase can be used. In certain embodiments, a single polymerase can be used or two or more distinct polymerases can be used. In certain embodiments, the polymerase is KAPA HiFi (Roche, Basel, Switzerland).
Many DNA polymerases are known in the art and are commercially available (e.g., from Bio-Rad Laboratories, Inc., Hercules, CA; Boehringer Mannheim Corp., Indianapolis, Ind.; Life Technologies, Inc., Rockville, Md.; New England Biolabs, Inc., Beverly, Mass.; Perkin Elmer Corp., Norwalk, Conn.; Pharmacia LKB Biotechnology, Inc., Piscataway, N.J.; Qiagen, Inc., Valencia, Calif.; Stratagene, La Jolla, Calif.). In some embodiments, the DNA polymerase can be Taq, Tbr, Tfl, Tru, Tth, Tli, Tac, Tne, Tma, Tih, Tfi, Pfu, Pwo, Kod, Bst, Sac, Sso, Poc, Pab, Mth, Pho, ES4, VENT™, DEEPVENT™, and active mutants, variants and derivatives thereof. It is to be understood that a variety of DNA polymerases can be used in the present invention, including DNA polymerases not specifically disclosed above, without departing from the scope or preferred embodiments thereof.
In certain embodiments, the proportion of DNA or RNA in the final sequencing library can be adjusted by changing the Tn5 concentration in the initial reaction mixture.
In a preferred embodiment, the reactions according to the invention can also contain further reagents suitable for a PCR step. Such reagents are known to those skilled in the art, and include water, like nuclease-free water, RNase-free water, DNase-free water, PCR-grade water; salts, like magnesium, magnesium chloride, potassium; buffers such as Tris; enzymes; nucleotides like deoxynucleotides, dideoxynucleotides, dNTPs, dATP, dTTP, dCTP, dGTP, dUTP and modified nucleotides such as deaza-, locked nucleic acid, and peptide nucleic acid; other reagents, like DTT and/or RNase inhibitors; and polynucleotides like polyT and polydT.
The methods of the subject invention can be easy to use and simple to adopt, requiring no additional or specialized equipment beyond what is available in standard biology laboratories and using standard wet lab operating procedures. The methods can be automatable and scalable, as they only require standard pipetting steps and thus can be adapted to use liquid handling robots for high-throughput applications. In certain embodiments, the methods of the subject invention can be comparable in accuracy and sensitivity to existing single-cell profiling methods that profile only DNA or RNA from a single cell. In certain embodiments, the methods of the subject invention can be superior in accuracy and sensitivity to existing single-cell profiling methods that profile both DNA and RNA from a single cell. The subject method, scONE-seq, can enable numerous previously intractable single cell multi-omic experiments, and lead to new discoveries in the life sciences.
Table 1 summarizes some of the key advantages of scONE-seq compared to G&T-seq and DR-seq. From the published data, DR-seq can suffer from GC-bias in the DNA amplification, and the overall amplification uniformity is worse (
The time required for each protocol was estimated from published versions. Due to having only one reaction for each cell, scONE-seq can take at least about 8 hours per plate, and can use a single purification step at the very end. Overall, scONE-seq produces better data with less experimental time and lower cost.
In certain embodiments, the RNA and DNA from a single sample need not be physically separated during the reaction, and can still be differentially tagged within a single reaction compartment. In certain embodiments, the subject method achieves simultaneous tagging and amplification of DNA and RNA in a one-container reaction. In certain embodiments, the subject methods do not require any specifically designed device (such as a microfluidic chip) to achieve co-profiling of DNA and RNA from the same single cell. In certain embodiments, the subject methods can be automated using robots or other high-throughput platforms, such as, for example, a microfluidic platform. This allows the experiment to be scaled up easily to orders of magnitude higher throughput, which can enable the DNA-RNA co-profiling of previously unattainable numbers of single cells. The throughput is versatile and easy to control, thereby making the method appropriate for small scale use as well as large scale applications.
In certain embodiments, the methods provided by the subject invention can be used to amplify one or more DNA nucleic acid sequences and one or more cDNA sequences derived from RNA nucleic acid sequences from a single cell or nucleus. In certain embodiments, the methods can be used to amplify nucleic acid sequences of fresh cell samples, such as, for example, peripheral blood mononuclear cells (PBMCs) and cells lines. In certain embodiments, the methods can be used to amplify nucleic acid sequences of nuclei from frozen tissue samples, such as, for example, tumor specimens that have been frozen for years. In certain embodiments, the population of cells can be determined, such as, for example, the cell populations of B-cells, T-cells, and NK cells, based on, for example, the gene expression markers. In certain embodiments, different genome and transcriptome profiles can be determined using the subject methods. In certain embodiments, the RNA sequence can be used to determine gene expression markers. In certain embodiments, the DNA sequences can be used to determine copy number alterations (CNAs).
In certain embodiments, the subject methods can be used to probe virus-host interactions. By co-profiling DNA and RNA from a virus and a host cell, the distribution of the virus can be determined within a host. Furthermore, the virus abundance within the host cell can be correlated with the virus gene expression. Using the virus abundance information, all genes with the virus can be selected and the correlated genes can be analyzed for viral patterns; for example, cells could be separated into virus-rich cells and virus-poor cells. In certain embodiments, the methods can be applicable to subcellular level components such as, for example, single nuclei, which also contain both DNA and RNA. In certain embodiments, the methods can also be applicable to any biomolecule in any context that is tagged with DNA or RNA.
The methods of the invention can be useful for any type of cell. Methods of the invention are applied most straightforwardly to the co-profiling of whole genomes and total transcriptomes from the same single cell. The methods can be used for identifying diseases, such as, for example, cancer, in which the genome and transcriptome reflect different facets of the disease progression. The genome reveals the genomic instability and mutational landscape typically associated with cancer initiation and progression; the transcriptome reflects the cell's functional/molecular identity which could be associated with its stemness, the level of differentiation of the cancer, and inform prognoses for patients. The methods are also particularly useful for studying viral activity within cells, as infected cells harbor viral DNA and RNA in addition to their own endogenous genome and transcriptome, and depending on the type of virus, DNA or RNA, it is useful to interrogate both the DNA and RNA to observe the activity of the virus in the cell and its effects on cellular behavior. In addition to eukaryotic cells and their infecting viruses, this application could also include the interrogation of prokaryotes such as bacteria and their interactions with phage. The methods can also be useful for studying any type of symbiont-host interaction, such as, for example, the interaction of bacteriocytes and their host eukaryotic cell. The method may also be used for de novo genome and transcriptome assembly for an organism. The method may also be generalized to drug screening and discovery.
In certain embodiments, the subject methods can be compatible with frozen tissue samples that have been stored for at least hours, days, months, or years. This feature makes it easier to plan and perform larger-scale clinical multi-omic single-cell studies in two ways: first, by enabling studies on existing biobanked samples, which we have demonstrated herein; second, for studies on new samples, it also removes the burden of having to immediately process tissues from clinical researchers whose priority is patient care.
In certain embodiments, the subject methods, including scONE-seq, can be used on frozen glioblastoma (GBM) tissue. In certain embodiments, the subject methods, including scONE-seq, can be used to observe and characterize the differentiated tumor clones, which supports the idea that tumor clones can produce a differentiation hierarchy7,58,59. The existence of clone 1 was confirmed using both independent 10× Genomics snRNA-seq and immunostaining on tissue sections. This indicates that cancer studies based only on scRNA-seq could underestimate important layers of tumor heterogeneity, and that simultaneous direct DNA measurement could contribute meaningful and informative insight on tumor evolution. Meanwhile, clonal analysis based only on scWGS data ignores the complex interactions within a tumor microenvironment. By deciphering the genetic and phenotypic heterogeneity within the tumor ecosystem with the subject methods, including scONE-seq, we can reveal the interplay of clonal expansion, the tumor cell differentiation hierarchy, and the tumor microenvironment (TME).
Compared to other scDR-seq methods, the subject methods, including scONE-seq, can have much higher throughput. In certain embodiments, the subject methods, including scONE-seq, also possess very high scalability. Alternatively, producing scONE-seq and droplet-based single cell data in parallel and then integrating them is also a useful complementary, multi-omics approach to study cancer with high throughput. Moreover, additional processing can be added to the scONE-seq workflow to enable profiling of more layers of information: to detect chromatin accessibility simultaneously, an additional nuclei tagmentation step60-62 with customized ATAC adaptors could be added before FACS sorting; similarly, quantitative protein estimation63 could be achieved by using DNA-barcoded antibodies before the single-cell sorting steps of scONE-seq (see Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nature Methods 14, 865-868 (2017).); and by jointly performing whole-exome capturing or any hybridized target sequencing panels with the standard scONE-seq library, paired high-depth single-cell somatic mutation information could also be integrated into the scONE-seq dataset.
HCT116, NPC43, HUVEC, and H9 cells were dissociated with trypsin-EDTA (0.25%) solution (Thermo Fisher, Waltham, MA) and stained with propidium iodide (10 mg/ml, Thermo Fisher) to exclude dead cells.
Fresh whole human blood was taken in the clinic center of the HKUST from a healthy donor. Lymphocytes were isolated via Ficoll-Paque PLUS (GE Healthcare, Chicago, IL) density centrifugation. The red blood cells were removed with 1×Red Blood Cell lysis buffer (Thermo Fisher).
The months-old frozen IDH1-mutant glioblastoma tissue (stored at −80° C.) was obtained from Prince of Wales Hospital. The nuclei isolation protocol is based on previous studies64,65. In brief, the homogenization method was used to prepare nuclei. The homogenization douncer should be cleaned with ethanol, bleach, and RNase-out and then rinsed with NF water. 100 mg of frozen tissue was put into a pre-chilled glass douncer containing 1 ml of 1× homogenization buffer (5 mM CaCl2, 3 mM Mg(Ac)2, 10 mM Tris, 16.7 μM PMSF, 167 μM β-mercaptoethanol, 320 mM sucrose, 0.1 mM EDTA, 0.1% NP40, 1 U/ml RNase inhibitor, 1× Proteinase Inhibitor, pH=7.8). The homogenized suspension was then filtered with a 35 μm cell strainer (Corning, Corning, NY), and nuclei were spun down at 1000 g for 10 min at 4° C. Nuclei were resuspended in 3.0 ml low sucrose buffer (320 mM sucrose, 10 mM HEPES, 5 mM CaCl2, 3 mM Mg(Ac)2, 0.1 mM EDTA, 1 mM DTT, 1 U/ml RNase inhibitor, 1× Proteinase Inhibitor, pH=8.0). To remove cell debris, we then layered 12.5 ml of density sucrose buffer (1 M sucrose, 10 mM HEPES, 3 mM Mg(Ac)2, 1 mM DTT, pH=8.0) underneath the low sucrose buffer homogenate and centrifuged at 3200 g for 20 min at 4° C. The nuclei were then resuspended with a flicking motion and could be stained with DAPI (Thermo Fisher).
Cells or nuclei were then loaded onto an Aria III flow cytometer (BD Biosciences, Franklin Lakes, NJ) to sort single cells into PCR tubes (96 or 384 PCR plates) containing lysis buffer. The lysis buffer consisted of 2.5 U/μl RNase Inhibitor (NEB, Ipswich, MA), 0.15% Triton X-100 (Sigma, St. Louis, MO), and 6 μM DTT (Thermo Fisher). The sorted sample can be stored at −80° C. for months.
Generation of scONE-Seq Libraries.
To start the scONE-seq pre-amplification, proteinase K (Sigma) was used to completely lyse cells or nuclei. A tagmentation reaction was performed to fragment the genomic DNA and add the DNA-specific barcode. This reaction includes the following components: 6 mM MgCl2, 0.5 mM dNTP (NEB), 8.5 mM TAPS-NaOH, 1.5 U/μl RNase Inhibitor, 0.05 U KAPA polymerase (Roche), 8% PEG8000, and Tn5 with custom adaptor (GTCTCGTGGGCTCGGTCATG AGATGTGTATAAGAGACAG (SEQ ID NO: 4)) (Novoprotein, Suzhou, Jiangsu, China)33,37. The reaction was incubated at 55° C. for 10 min followed by 72° C. for 10 min. Then, proteinase K or thermolabile proteinase K (NEB) was used to deactivate the enzyme in the buffer. Thereafter, we performed reverse transcription with the following components: 40 U SuperScript™ III Reverse Transcriptase (Thermo Fisher), 70 mM Tris-HCl, 1.5 U/μl RNase Inhibitor, 8 mM MgCl2, 7 μM DTT and 0.15 μM RT primers (GTCTCGTGGGCTCGGATCG TTTTTTTTTTTTTTTTTTTTVN (SEQ ID NO: 1); GTCTCGTGGGCTCGGATCGTNNNNNGGGHN (SEQ ID NO: 5); GTCTCGTGGGCTCGGATCGTTTTVN (SEQ ID NO: 6)). Reverse transcription was carried out at 12° C. for 12 sec, followed by a gradient increase to 50° C. for 50 min and 55° C. for 50 min. Subsequently, the residual primers and RNA were removed with thermolabile EXO I (NEB), RNase If (NEB), and RNase H (NEB). Then, terminal transferase (NEB) was used to add the C-tail to cDNA fragments. This reaction was performed at 37° C. for 5 min and the enzyme was immediately deactivated with thermolabile proteinase K. Second strand synthesis was then performed by adding 0.3 μM 3′ adaptor (GTCTCGTGGGCTCGGATCGTNNNNNGGGHN (SEQ ID NO: 5)), 1 μl KAPA HiFi Fidelity Buffer (5×), 0.7 mM (NH4)2SO4, and 0.1 μl KAPA Polymerase. The reaction was incubated at 72° C. for 5 min; 10 cycles of (1 min at 48° C.; 1 min at 72° C.); and 5 min at 72° C., in a thermal cycler. An additional residual primer removal reaction was performed with Exo I (NEB).
Lastly, 14 μl KAPA HotStart ReadyMix (2×), 1.5 mM (NH4)2SO4, 2% DMSO (Thermo Fisher), and 1.2 μM amplification primer (GATGTGTGGAGGTCTCGTGGGCTCGG (SEQ ID NO: 7)) were added to amplify DNA and RNA simultaneously. The PCR was performed at 98° C. for 4 min; 18-20 cycles of (20 s at 98° C.; 4.25 min at 72° C.); and 10 min at 72° C., in a thermal cycler.
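As a worked illustration of the co-amplification program above, the total programmed hold time can be computed from the stated step durations. The sketch below is illustrative only; the function names are hypothetical, and ramping time between temperatures, which depends on the thermal cycler, is deliberately ignored.

```python
def program_hold_seconds(initial_s, cycle_steps_s, n_cycles, final_s):
    """Sum of programmed hold times (ramp time between steps is ignored)."""
    return initial_s + n_cycles * sum(cycle_steps_s) + final_s


def coamplification_hold_seconds(n_cycles):
    """Hold time for: 98 C 4 min; n cycles of (20 s at 98 C; 4.25 min at 72 C); 72 C 10 min."""
    return program_hold_seconds(
        initial_s=4 * 60,
        cycle_steps_s=[20, 4.25 * 60],
        n_cycles=n_cycles,
        final_s=10 * 60,
    )


for cycles in (18, 20):
    minutes = coamplification_hold_seconds(cycles) / 60
    print(f"{cycles} cycles: {minutes:.1f} min of programmed holds")
```

Under these assumptions, 18 cycles correspond to 5,790 s (96.5 min) and 20 cycles to 6,340 s (about 105.7 min) of programmed holds.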
Pre-amplified samples were purified with Ampure XP beads (Beckman, Brea, CA). Samples were diluted to 0.1 ng/μl and subjected to a tagmentation reaction with the following components: 1× TAPS buffer (50 mM TAPS-NaOH, 25 mM MgCl2, pH=8.0), 8% PEG8000, 0.001 μl Tn5 (TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG) (SEQ ID NO: 9). The reaction was performed at 55° C. for 15 min. Samples were then amplified with Illumina (San Diego, CA) sequencing index primers (Table 2) (Sangon, Shanghai, China) by using the KAPA HiFi HotStart Polymerase Kit (Roche, Basel, Switzerland). The enrichment PCR was incubated at 95° C. for 10 min; 10-11 cycles of (20 s at 98° C.; 15 s at 60° C.; 30 s at 72° C.); and 2 min at 72° C., in a thermal cycler. Samples were then pooled and purified with Ampure XP beads. The scDASH protocol was then used to remove the abundant ribosomal and mitochondrial RNA66,67. Double-size selection can be performed to optimize the library size. The library was then sequenced on an Illumina NextSeq500.
The i7 Index primer is a standard Illumina sequence for sequencing (it is only added when the library is ready to be sequenced). The i5 Index primer is the equivalent of the i7 Index primer on the other side of the sequenced read. Table 2 shows a custom version of the i5 Index primer used by our method to enable sequencing of scONE-seq products using the Illumina platform (SEQ ID NO: 11). A standard Illumina i5 Index primer will not work with the subject scONE-seq libraries. This primer is, like the i7 Index primer, added directly to the flow cell during sequencing. The Read2 sequence is the equivalent of Read1 on the other side of the sequenced read. Here, Read2 is customized to work with scONE-seq co-amplified products. The standard Illumina Read2 would not work.
Sequencing data were first filtered with fastp68. Fastq files were then separated into DNA fastq files, RNA fastq files, and unmatched fastq files with seqkit, seqtk, and bbduk69-71. During this process, the UMI of each read was extracted and appended to the read header in the fastq files with fastp68.
DNA fastq files were mapped to hg38 (see worldwide website: ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) with BWA mem72. To perform UMI-based deduplication, read2 reads in bam files were extracted with samtools73 and deduplicated with umi_tools74. The deduplicated read2 reads were used to extract its paired read1 and these paired fastq were then re-aligned to hg38 with BWA mem72,75.
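The idea behind the UMI-based deduplication step can be sketched as follows. This is a deliberately simplified illustration, not the algorithm implemented by umi_tools (which, for example, can also merge UMIs that differ by sequencing errors): reads are treated as duplicates only when they share the same chromosome, start position, strand, and exact UMI. The record layout and function names are hypothetical.

```python
def dedup_by_umi(reads):
    """Keep one read per (chrom, pos, strand, umi) group.

    `reads` is an iterable of dicts with keys:
    name, chrom, pos, strand, umi, mapq.
    Within a group the read with the highest mapping quality is kept;
    ties keep the first read seen.
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"], read["umi"])
        kept = best.get(key)
        if kept is None or read["mapq"] > kept["mapq"]:
            best[key] = read
    return list(best.values())


def names_to_extract(deduplicated):
    """Read names used to pull the paired read1 records for re-alignment."""
    return {read["name"] for read in deduplicated}
```

The set of surviving read names is what would then be used to extract the paired read1 records before re-alignment.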
If only performing the counts-based copy number variation analysis, Ginkgo was used to generate the normalized counts76. If performing the allele-specific copy number variation analysis, CHISEL was used to generate both allele frequency information77,78. The integer copy number calculation was based on previous studies79-81. In this pipeline, the segmentation was performed with copynumber and aCGH82.
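For orientation, the relationship between normalized bin counts and integer copy number can be illustrated with a toy calculation. This sketch is an assumption-laden simplification and is not the procedure of Ginkgo, CHISEL, or the cited studies: it skips segmentation entirely, assumes that the median bin sits at the baseline ploidy, and uses a hypothetical function name.

```python
from statistics import median


def integer_copy_number(bin_counts, ploidy=2):
    """Scale per-bin read counts so the median bin equals `ploidy`, then round.

    Assumes most of the genome is at the baseline ploidy, so the median
    bin count is a reasonable estimate of the baseline signal.
    """
    baseline = median(bin_counts)
    if baseline <= 0:
        raise ValueError("median bin count must be positive")
    return [round(ploidy * count / baseline) for count in bin_counts]


# Example: six bins, with one gain, one single-copy loss, and one deep deletion.
print(integer_copy_number([100, 100, 150, 50, 100, 0]))
```

In practice, counts would be segmented before rounding so that noise in individual bins does not produce spurious copy number changes.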
UMI-based deduplication was also performed with RNA fastq files. The workflow remained the same except that BWA was replaced with STAR83. Then, the fastq files can be quantified with Kallisto84 (cDNA quantification) or Salmon85 (pre-mature RNA quantification) (ref1-2). 10× snRNA-seq data was quantified with kb-python86. The expression data were analyzed using Seurat with the sctransform pipeline (normalization, dimension reduction, dataset integration, finding clusters, differential gene analysis)87-89. The GBM cellular state scoring was performed following the original paper90. RNA-based CNV inference was performed with copykat91. The ligand-receptor analysis was performed with CellChat92.
Plots were created using the ggplot2 R package93,94. Heatmaps were created with the ComplexHeatmap R package95. Figures were prepared in Inkscape96.
Slides were obtained from Dr. Danny Chan (Prince of Wales Hospital). Xylene and ethanol were used to remove wax. Antigen retrieval was performed with Sodium Citrate Buffer (Thermo Fisher) at 98° C. for 15 min. IDH1(R132H) antibody (Dianova, 1:50) and ADCY8 antibody (Abcam, 1:200, Cambridge, UK) were added to slides and incubated at 4° C. overnight in a humidified box. Secondary antibodies (anti-mouse, anti-rabbit, Thermo Fisher) were used to provide the fluorescent signal. Mounting buffer with DAPI (Abcam) was used to stain the nucleus and retain fluorescence. The images were taken with a Zeiss Axio Scan.Z1 Slide Scanner (Zeiss, Jena, Germany).
All patents, patent applications, provisional applications, and publications referred to or cited herein are incorporated by reference in their entirety, including all figures and tables, to the extent they are not inconsistent with the explicit teachings of this specification.
Following are examples that illustrate procedures for practicing the invention. These examples should not be construed as limiting. All percentages are by weight and all solvent mixture proportions are by volume unless otherwise noted.
To achieve single-cell genome and transcriptome co-profiling, we devised a workflow to amplify RNA and DNA simultaneously (
To benchmark the method, we used HCT116 colon cancer cell line to compare the single cell RNA data generated by this method with data generated by the current standard for single-cell RNA-Seq, called SmartSeq2. SmartSeq2 is used to profile RNA only from a single cell. Due to the difference in chemistry, we expect a dramatic difference between data generated by our method that uses a total RNA capture, compared to SmartSeq2 that employs polyT selection process. Our method is very comparable in performance metrics such as gene detection sensitivity, read coverage across the genome, and gene body coverage for transcripts (
To benchmark the method against other single-cell DNA/RNA co-profiling methods, we took published data from previously developed methods DR-seq23 and G&T-seq (see Macaulay, I., Haerty, W., Kumar, P. et al. G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nat Methods 12, 519-522 (2015).) and compared them against our data. The DNA reads generated by our method for single-cell whole genome sequencing show substantial improvement in sequence coverage uniformity as compared to the other methods (
We next demonstrated our method on primary human cells to show that it can be used on fresh tissue samples. We successfully co-profiled DNA and RNA from a sample of peripheral blood mononuclear cells (PBMCs) isolated from human whole blood using our method. From this data it is possible to identify all the expected cell populations such as B-cells, T-cells and NK cells, based on the gene expression markers (
We also showed the applicability of our method to probe virus-host interactions using nasopharyngeal carcinoma (NPC) as an example. NPC is a type of cancer that harbors Epstein-Barr virus (EBV), and due to the viral interactions this cell type harbors both transcriptomic and genomic heterogeneity. By co-profiling DNA and RNA from an EBV+ NPC cell line using our method, we were able to observe the heterogeneous virus distribution among NPC cancer cells (
To characterize the transcriptome generated by scONE-seq, we benchmarked it against Smart-seq2 (SS2)36,37 using a variety of test samples: extracted RNA-free E. coli genomes (mock DNA), extracted DNA-free human total RNA (mock RNA), as well as a mixture of the two (i.e., E. coli DNA mixed with human total RNA); and cultured HCT116 single cells. We evaluated the sensitivity by assessing the number of genes detected in each of the benchmark mock and HCT116 samples and found that scONE-seq detected more genes per cell than SS2 (
Next, we sought to validate the whole genome sequencing (WGS) capability of scONE-seq. Lorenz curves39 compare the coverage uniformity for each method, showing a good performance by scONE-seq (
Summarily, the analysis of scRNA-seq and scWGS data generated using benchmark samples shows that scONE-seq can profile genome and transcriptome data from the same single cell without compromising data quality as compared to existing standard methods.
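For readers unfamiliar with the coverage-uniformity comparison, a Lorenz curve can be computed from per-bin coverage as sketched below. This is an illustrative calculation with hypothetical function names, not the analysis code used to generate the figures: bins are sorted by coverage, and the cumulative fraction of reads is plotted against the cumulative fraction of the genome. Perfectly uniform coverage falls on the diagonal, and the Gini coefficient summarizes the departure from it.

```python
def lorenz_curve(bin_coverage):
    """Return (fraction of bins, fraction of reads) points, lowest-coverage bins first."""
    cov = sorted(bin_coverage)
    total = sum(cov)
    if total == 0:
        raise ValueError("coverage must contain at least one read")
    n = len(cov)
    points = [(0.0, 0.0)]
    running = 0
    for i, c in enumerate(cov, start=1):
        running += c
        points.append((i / n, running / total))
    return points


def gini(points):
    """Gini coefficient: 1 minus twice the trapezoidal area under the Lorenz curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return 1 - 2 * area
```

A coefficient near 0 indicates even coverage across bins, while values approaching 1 indicate that reads are concentrated in a small fraction of the genome.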
After thoroughly assessing the technical performance of scONE-seq, we next applied it to known biologically heterogeneous samples to evaluate whether it can accurately identify cellular subtypes within a mixed population. To do so, we performed scONE-seq on four different cell lines, as well as on a primary peripheral blood mononuclear cell (PBMC) sample from a healthy donor.
First, we analyzed the cell-line dataset containing 86 HCT116 cells, 143 NPC43 cells, 37 HUVEC cells, and 17 H9 cells to check for accurate cell-type assignment. With unsupervised graph-based clustering, cells from the same cell lines successfully clustered together (
Next, we used lymphocytes from PBMC to test scONE-seq cell-types clustering accuracy in primary samples. We prepared sequencing libraries with scONE-seq and SS2 from the same PBMC sample for comparison. After quality control filtering to remove low-quality cells and potential doublets (see Methods), we collected 200 cells for scONE-seq and 194 cells for Smartseq2. With unsupervised graph-based clustering, we found no difference in the cell-type composition between the two methods (
These results collectively demonstrate that scONE-seq RNA data can accurately capture the biological variation within a heterogeneous sample.
The analysis above shows the feasibility of scONE-seq for the cell-type assignment using RNA data. Next, we evaluated the performance of clone identification with scONE-seq WGS data. Here, we utilized scONE-seq WGS data that was obtained simultaneously from the cell lines used in the previous cell-type assignments analysis, and delineated the CNAs clonal structure of all four cell lines, followed by hierarchical clustering with their copy number profiles (
Glioblastoma (GBM) is one of the most aggressive malignant tumors originating in the brain46,47. When studying GBM or other brain tissues using single-cell technology, it is challenging to obtain intact dissociated whole single cells, especially neurons with their complex morphology, which could lead to biases in cell-type sampling48. As such, for brain single-cell profiling, single nucleus isolation is more widely used. To profile both the genotypic and phenotypic heterogeneity in a biobanked GBM sample, we applied scONE-seq to single nuclei isolated from a months-old snap frozen GBM specimen: a second recurrent GBM sample with IDH1 (R132H), TP53 (P278S), and ATRX (R781*) mutations (
First, we delineated the clonal structure of this GBM sample. Using dimension reduction with normalized counts data (500 kb genome bins) we clustered cells into four distinct genomic states (
Next, we analyzed the RNA data from this dataset. First, we performed unsupervised graph-based clustering on scONE-seq RNA data, obtaining multiple cell clusters that were then annotated based on their RNA markers. We found this tumor contains macrophages, neurons, astrocytes, oligodendrocytes, and tumor cells based on canonical cell type gene signatures (
In addition to the phylogenetic tree obtained from DNA data, which dissects the clonality, we are also able to use the paired RNA data to superimpose cell-type information onto the clonal information and thereby identify clonal subpopulations with unique functional and phenotypic features. To do so, we mapped the clonal information to the RNA UMAP to visualize the clonal distribution among different cell types (
The clone 1 subpopulation appears rare within the second recurrent tumor (2.06% of cells sampled with scONE-seq), and phenotypically resembles normal astrocytes.
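Because DNA and RNA are read out from the same nucleus, clone labels and cell-type labels can be joined on a shared cell identifier. The following minimal sketch cross-tabulates the two and computes the fraction of cells in each clone. The cell identifiers, labels, and function names are hypothetical and serve only as an illustration; they are not data from the GBM sample.

```python
from collections import Counter, defaultdict


def clone_by_cell_type(clone_of, cell_type_of):
    """Cross-tabulate clone labels (from DNA) against cell types (from RNA)."""
    table = defaultdict(Counter)
    for cell_id, clone in clone_of.items():
        if cell_id in cell_type_of:  # keep only cells with both modalities
            table[clone][cell_type_of[cell_id]] += 1
    return table


def clone_fractions(clone_of):
    """Return the fraction of sampled cells assigned to each clone."""
    counts = Counter(clone_of.values())
    total = sum(counts.values())
    return {clone: n / total for clone, n in counts.items()}


# Toy labels keyed by a shared cell identifier (illustration only).
clone_of = {"c1": "clone 1", "c2": "clone 2", "c3": "clone 2",
            "c4": "clone 3", "c5": "clone 2"}
cell_type_of = {"c1": "astrocyte-like", "c2": "tumor",
                "c3": "tumor", "c4": "tumor"}

table = clone_by_cell_type(clone_of, cell_type_of)
print({clone: dict(types) for clone, types in table.items()})
print(clone_fractions(clone_of))
```

A rare clone whose cells fall almost entirely under a non-tumor cell-type label would stand out in such a table, which is the pattern described for clone 1.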
To verify the existence of clone 1 cells, we first identified gene markers unique to clone 1, including XIST, RFX3, ADCY8, and GRIA1, which can distinguish them from other subpopulations (
Then, we performed histological analyses on FFPE sections from the primary tumor and from the second recurrent tumor to verify the presence of clone 1 cells at different stages of tumor progression. IDH1 (R132H) was selected as the tumor marker because the patient carried an IDH1 mutation, and anti-ADCY8 is expected to mark some normal neurons and normal astrocytes in addition to clone 1 cells (
In our staining experiments, we noted that the clone 1 cells appear more abundant near the tumor margins. The presence of these tumor cells with a normal-like phenotype in the infiltrated tumor regions prompted us to examine the potential role of clone 1 cells in signaling and cell-cell communication, as the infiltrated regions are an important part of the tumor microenvironment (TME). Several studies have demonstrated that glioma cells can form synaptic structures with normal neurons as a signaling conduit within the tumor51-55. Specifically, this was found to occur via tumor microtubes displaying AMPA receptors (AMPARs), a glutamate receptor subtype54,55. AMPARs are tetramers assembled from four subunit proteins, GluA1-4, encoded by the genes GRIA1-4, respectively56,57. Interestingly, we found the GRIA genes to be differentially expressed between the different tumor clones in our sample (
The invention may be better understood by reference to certain illustrative examples, including but not limited to the following:
Embodiment 1. A method for the amplification of at least one RNA sequence and at least one DNA sequence from a sample, comprising:
Embodiment 2. The method of embodiment 1, wherein the sample comprises a single cell and/or a nucleus.
Embodiment 3. The method of embodiment 2, wherein the single cell is a bacterial cell, an archaeal cell, or a eukaryotic cell.
Embodiment 4. The method of embodiment 2, wherein step b) further comprises lysing the cell to isolate the RNA sequence and the DNA sequence from the cell.
Embodiment 5. The method of embodiment 1, wherein step c) further comprises providing a plurality of adapters that anneal to the RNA sequence and/or the DNA sequence in the sample.
Embodiment 6. The method of embodiment 5, wherein the plurality of adapters is between about 2 and about 100, about 2 to about 5, or about 4.
Embodiment 7. The method of embodiment 5, wherein step d) further comprises providing at least 2 or at least 3 adapters that anneal to two or more RNA sequences in the sample.
Embodiment 8. The method of embodiment 1, wherein the first, second, or third DNA oligonucleotide adapters further comprise a mosaic sequence and a Seq-1 primer sequence.
Embodiment 9. The method of embodiment 1, wherein the transposase is a Tn5 transposase.
Embodiment 10. The method of embodiment 1, wherein steps a)-i) are carried out in one container.
Embodiment 11. The method of embodiment 1, further comprising:
Embodiment 12. The method of embodiment 11, further comprising:
Embodiment 13. A set of oligonucleotide adapters, wherein each adapter comprises an amplification primer sequence, a DNA-specific or RNA-specific barcode, a unique molecular identifier sequence, and an annealing sequence, wherein one oligonucleotide adapter has a DNA-specific barcode and the other oligonucleotide adapter has an RNA-specific barcode.
Embodiment 14. The set of oligonucleotide adapters of embodiment 13, wherein the adapters further comprise a mosaic and a Seq-1 primer.
Embodiment 15. An oligonucleotide adapter, wherein the adapter comprises an amplification primer sequence, a DNA-specific or RNA-specific barcode, a unique molecular identifier sequence, and an annealing sequence.
Embodiment 16. The oligonucleotide adapter of embodiment 15, wherein the adapter further comprises a mosaic and/or a Seq-1 primer.
Embodiment 17. The oligonucleotide adapter of embodiment 16, wherein the adapter comprises a nucleotide sequence of SEQ ID NO: 1, SEQ ID NO: 4, SEQ ID NO: 5, or SEQ ID NO: 6 or a nucleotide sequence having at least 95% identity to the nucleotide sequence of SEQ ID NO: 1, SEQ ID NO: 4, SEQ ID NO: 5, or SEQ ID NO: 6.
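By way of illustration only, the adapter layout of embodiments 13 and 15 (a DNA-specific or RNA-specific barcode followed by a unique molecular identifier) suggests how sequencing reads could be assigned computationally to their molecule of origin. The following minimal sketch assumes that the amplification primer has already been trimmed, that the barcode precedes the unique molecular identifier, and that matching is exact. The barcode sequences, lengths, and function names are hypothetical and are not the sequences of SEQ ID NO: 1, 4, 5, or 6.

```python
from typing import NamedTuple, Optional

# Hypothetical barcodes and lengths, chosen only for illustration.
BARCODES = {"ACGT": "DNA", "TGCA": "RNA"}
BARCODE_LEN = 4
UMI_LEN = 8


class TaggedRead(NamedTuple):
    molecule: str  # "DNA" or "RNA", decided by the barcode
    umi: str       # unique molecular identifier
    insert: str    # remaining sequence derived from the sample


def classify_read(read: str) -> Optional[TaggedRead]:
    """Split a read into barcode, UMI, and insert; None if it cannot be assigned."""
    if len(read) < BARCODE_LEN + UMI_LEN:
        return None
    molecule = BARCODES.get(read[:BARCODE_LEN])
    if molecule is None:
        return None
    umi = read[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return TaggedRead(molecule, umi, read[BARCODE_LEN + UMI_LEN:])


# Toy reads (illustration only).
for read in ("ACGTAAAACCCCGGATTACA", "TGCAGGGGTTTTCATCAT", "NNNNAAAACCCCGG"):
    print(classify_read(read))
```

In practice, barcode matching would usually tolerate sequencing errors, and reads sharing a unique molecular identifier would be collapsed before counting.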
It should be understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and the scope of the appended claims. In addition, any elements or limitations of any invention or embodiment thereof disclosed herein can be combined with any and/or all other elements or limitations (individually or in any combination) or any other invention or embodiment thereof disclosed herein, and all such combinations are contemplated within the scope of the invention without limitation thereto.
This application claims the benefit of U.S. Provisional Application Ser. No. 63/093,368, filed Oct. 19, 2020, which is hereby incorporated by reference in its entirety including any tables, figures, or drawings.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/IB2021/000713 | 10/19/2021 | WO | |

| Number | Date | Country |
|---|---|---|
| 63093368 | Oct 2020 | US |